lihaocruiser's Collections
LLM-RL
Direct Preference Optimization: Your Language Model is Secretly a Reward Model • arXiv:2305.18290
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training • arXiv:2306.01693
Self-Rewarding Language Models • arXiv:2401.10020
Secrets of RLHF in Large Language Models Part II: Reward Modeling • arXiv:2401.06080
ReFT: Reasoning with Reinforced Fine-Tuning • arXiv:2401.08967
sDPO: Don't Use Your Data All at Once • arXiv:2403.19270
The Lessons of Developing Process Reward Models in Mathematical Reasoning • arXiv:2501.07301
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning • arXiv:2501.06458
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs • arXiv:2412.21187
Reinforcement Learning for Reasoning in Large Language Models with One Training Example • arXiv:2504.20571