arxiv:2601.08521

Your Group-Relative Advantage Is Biased

Published on Jan 13

Abstract

Group-based reinforcement learning from verifier rewards suffers from biased advantage estimation that underestimates advantages on hard prompts and overestimates them on easy prompts; the paper addresses this with a history-aware adaptive difficulty weighting method.

AI-generated summary

Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.

Community

This paper fundamentally shows that:

"The commonly used group-relative advantage estimator is inherently biased except at p_t = 0.5: it systematically underestimates true advantage on hard prompts and overestimates true advantag on easy prompts".

This bias is not merely random noise: it becomes deterministic in extreme difficulty regimes, so the estimator must underestimate for very hard prompts and must overestimate for very easy prompts. The analysis identifies this as a core limitation of group-relative methods and motivates corrections that better align the estimated and true advantages.
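
The effect is easy to probe numerically. The sketch below is not the paper's derivation; it assumes binary verifier rewards, a group size of 8, and the usual GRPO-style mean/std standardization within each group, and compares that estimate against an advantage standardized with the true per-prompt success probability p. On hard prompts the rare correct answers receive less credit than the reference advantage would assign, while on easy prompts the rare mistakes are penalized less, i.e. their advantage is overestimated.

```python
import numpy as np

rng = np.random.default_rng(0)
G, n_groups = 8, 100_000  # group size and number of simulated groups per difficulty level

for p in (0.05, 0.5, 0.95):  # hard, medium, easy prompt (true success probability)
    # Binary verifier rewards for n_groups independent groups of G responses.
    r = rng.binomial(1, p, size=(n_groups, G)).astype(float)

    # Group-relative (GRPO-style) advantage: standardize within each sampled group.
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)
    est_adv = (r - mean) / (std + 1e-8)

    # Reference advantage: standardize with the true success probability instead.
    true_adv = (r - p) / np.sqrt(p * (1 - p))

    for label, mask in (("correct", r == 1.0), ("incorrect", r == 0.0)):
        print(f"p={p:.2f} {label:>9}: estimated {est_adv[mask].mean():+.2f}"
              f" vs reference {true_adv[mask].mean():+.2f}")
```

Because the within-group estimate for binary rewards is bounded by roughly sqrt(G-1) in magnitude, while the reference advantage grows without bound as p approaches 0 or 1, a gap is forced in extreme difficulty regimes, which matches the deterministic-bias point above.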

This discovery implies that we should adjust advantage estimation based on prompt difficulty:

"For hard prompts, we should increase the estimated advantage to encourage more exploration;
For easy prompts, we should decrease the estimated advantage to prevent over-exploitation."

To determine a prompt's difficulty level in practice, we use a short-term historical average reward as an anchor: new prompts are compared against this dynamic anchor to infer whether they are relatively hard or easy, enabling adaptive reweighting of advantage estimates.
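
The exact HA-DW update rule is in the paper; the snippet below is only a rough sketch of the idea under our own assumptions: the difficulty anchor is an exponential moving average of recent group mean rewards, and a hypothetical exponential weight (with illustrative hyperparameters `alpha` and `beta`) scales advantages up for groups whose mean reward falls below the anchor and down for groups above it.

```python
import numpy as np

class HistoryAwareWeighter:
    """Sketch of history-aware adaptive difficulty weighting (not the paper's exact rule).

    Maintains a short-term moving average of group mean rewards as a difficulty
    anchor. Groups whose mean reward falls below the anchor are treated as
    relatively hard and have their advantages scaled up; groups above the
    anchor are treated as easy and scaled down. `beta` (anchor smoothing) and
    `alpha` (weight sensitivity) are hypothetical hyperparameters.
    """

    def __init__(self, beta=0.9, alpha=1.0, init_anchor=0.5):
        self.beta = beta
        self.alpha = alpha
        self.anchor = init_anchor  # running estimate of the "typical" reward

    def __call__(self, rewards):
        rewards = np.asarray(rewards, dtype=float)
        mean, std = rewards.mean(), rewards.std()
        adv = (rewards - mean) / (std + 1e-8)  # plain group-relative advantage

        # Difficulty relative to recent history: positive gap => harder than usual.
        gap = self.anchor - mean
        weight = np.exp(self.alpha * gap)      # >1 for hard prompts, <1 for easy ones
        adv = weight * adv

        # Update the short-term anchor only after it has been used.
        self.anchor = self.beta * self.anchor + (1 - self.beta) * mean
        return adv


weighter = HistoryAwareWeighter()
print(weighter([1, 0, 0, 0, 0, 0, 0, 0]))  # mostly failures  -> advantages amplified
print(weighter([1, 1, 1, 1, 1, 1, 1, 0]))  # mostly successes -> advantages damped
```

Keeping the anchor short-term (a small effective window via `beta`) makes it track the policy's current ability rather than a fixed global difficulty, which is one way to read the "history-aware" part of the method's name.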

