Abstract
Group-based reinforcement learning from verifier rewards suffers from biased advantage estimation: the group-relative estimator underestimates advantages on hard prompts and overestimates them on easy prompts. The paper addresses this with a history-aware adaptive difficulty weighting method.
Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
Community
The central result of this paper is that:
"The commonly used group-relative advantage estimator is inherently biased except at p_t = 0.5: it systematically underestimates the true advantage on hard prompts and overestimates it on easy prompts."
This bias is not merely random noise: in extreme difficulty regimes it becomes deterministic, meaning the estimator must underestimate for very hard prompts and must overestimate for very easy prompts. The analysis identifies this as a core limitation of group-relative methods and motivates corrections that better align the estimated and true advantage.
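For concreteness, here is a minimal sketch of the group-relative estimator being discussed, in the common GRPO form (group-mean baseline with standard-deviation normalization). Whether a given variant normalizes by the standard deviation, the group size, and the epsilon are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage as used in GRPO-style methods:
    each sampled response is scored against the mean (and std) of its own group.
    Binary verifier rewards (0/1) are assumed here."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Example: a hard prompt where only 1 of 8 sampled responses is verified correct.
rewards = np.array([0, 0, 0, 0, 0, 0, 0, 1], dtype=float)
print(group_relative_advantages(rewards))
```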
This discovery implies that we should adjust advantage estimation based on prompt difficulty, as sketched in code below the quote:
"For hard prompts, we should increase the estimated advantage to encourage more exploration;
For easy prompts, we should decrease the estimated advantage to prevent over-exploitation."
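A minimal sketch of what such an adjustment could look like: advantages are scaled up when a prompt's empirical success rate falls below a difficulty anchor, and scaled down when it lies above. The exponential weighting function, the `alpha` sensitivity, and the anchor value are hypothetical illustrations, not the paper's exact HA-DW rule.

```python
import numpy as np

def difficulty_weight(success_rate: float, anchor: float, alpha: float = 1.0) -> float:
    """Scale factor for a prompt's advantages: >1 when the prompt is harder than
    the anchor (success rate below it), <1 when it is easier. The exponential form
    and the sensitivity `alpha` are illustrative choices, not the paper's."""
    return float(np.exp(alpha * (anchor - success_rate)))

def reweighted_advantages(rewards: np.ndarray, anchor: float, eps: float = 1e-6) -> np.ndarray:
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    return difficulty_weight(float(rewards.mean()), anchor) * adv

# A hard prompt (1/8 correct) has its advantages scaled up relative to the anchor;
# an easy prompt (7/8 correct) would be scaled down instead.
hard = np.array([0, 0, 0, 0, 0, 0, 0, 1], dtype=float)
print(reweighted_advantages(hard, anchor=0.5))
```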
To determine a prompt's difficulty in practice, we use a short-term historical average reward as an anchor. New prompts are compared against this dynamic anchor to infer whether they are relatively hard or easy, which enables adaptive reweighting of their advantage estimates.
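One plausible way to maintain such an anchor is a fixed-window running mean of recent verifier rewards, as sketched below. The window size and the cold-start default of 0.5 are assumptions; the paper's exact HA-DW anchor update may differ.

```python
from collections import deque
import numpy as np

class HistoryAnchor:
    """Short-horizon running average of observed verifier rewards, used as a
    dynamic difficulty anchor. The fixed-window mean is one reading of
    "short-term historical average reward"; the window size is an assumption."""

    def __init__(self, window: int = 256):
        self.buffer = deque(maxlen=window)

    def update(self, rewards: np.ndarray) -> None:
        # Append the latest batch of per-response rewards, dropping the oldest.
        self.buffer.extend(rewards.tolist())

    @property
    def value(self) -> float:
        return float(np.mean(list(self.buffer))) if self.buffer else 0.5

# Usage: compare a new prompt's group success rate against the historical anchor.
anchor = HistoryAnchor(window=256)
anchor.update(np.array([1, 0, 1, 1, 0, 1, 1, 1], dtype=float))  # rewards from a past batch
new_group = np.array([0, 0, 1, 0, 0, 0, 0, 0], dtype=float)      # current prompt's group
print("relatively hard" if new_group.mean() < anchor.value else "relatively easy")
```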