F-GRPO Training for Qwen Math Reasoning

Trains a Qwen model with F-GRPO (Focal Loss-enhanced Group Relative Policy Optimization).

Based on Paper

"F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare" (Feb 2026)

Method

F-GRPO improves upon standard GRPO by:

  • Down-weighting easy prompts (high success rate)
  • Up-weighting hard prompts (low success rate)
  • Using Focal Loss-inspired scaling (see the sketch below): advantage_fgrpo = (1 - success_rate)^γ * advantage_grpo
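
To make the scaling rule concrete, here is a minimal sketch in PyTorch. The function name `fgrpo_advantages`, the binary-reward assumption, and the γ = 2.0 default (the common Focal Loss choice) are illustrative, not taken from the paper's code.

```python
import torch

def fgrpo_advantages(rewards: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """F-GRPO advantages for one prompt's group of sampled completions.

    rewards: shape (G,), binary correctness scores (1.0 correct, 0.0 wrong)
    for the G completions sampled from the same prompt.
    """
    # Standard GRPO advantage: normalize rewards within the group.
    advantage_grpo = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # The prompt's success rate is the fraction of correct completions.
    success_rate = rewards.mean()
    # Focal scaling: a mostly-solved ("easy") prompt has its advantages
    # shrunk toward zero; a rarely-solved ("hard") prompt keeps them.
    return (1.0 - success_rate) ** gamma * advantage_grpo

# Example: 7/8 correct -> scale factor (1 - 0.875)^2 ≈ 0.016, so the
# gradient signal from this easy prompt is heavily down-weighted.
print(fgrpo_advantages(torch.tensor([1., 1., 1., 1., 1., 1., 1., 0.])))
```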

Configuration

  • Model: Qwen/Qwen2.5-3B (fits a T4 Small's 16 GB; see the loading sketch after this list)
  • Dataset: DeepMath-103K or GSM8K
  • Steps: 50 (a minimal training run, following Raschka's finding)
  • Hardware: T4 Small GPU (16 GB VRAM)
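
For reference, loading the model within a T4's 16 GB budget might look like the following. This is a sketch, not the Space's actual loading code; fp16 is chosen because the T4 has no native bfloat16 support.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# fp16 keeps the 3B model inside the T4's 16 GB of VRAM; the T4
# (compute capability 7.5) does not support bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
```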

Training Script

  • train_fgrpo_trl.py - main training script built on TRL's GRPOTrainer (see the skeleton below)
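
The script itself is not reproduced here, but a skeleton built on TRL's GRPOTrainer could look like this. The exact-match reward, the dataset mapping, and all hyperparameter values are illustrative assumptions; only the GRPOConfig/GRPOTrainer API comes from TRL.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K rows carry "question"/"answer"; GRPOTrainer expects a "prompt"
# column and forwards the remaining columns to the reward function.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda row: {"prompt": row["question"]})

def correctness_reward(completions, answer, **kwargs):
    # Illustrative binary reward: 1.0 if the gold final answer (the text
    # after "####" in a GSM8K answer) appears in the completion.
    golds = [a.split("####")[-1].strip() for a in answer]
    return [1.0 if g in c else 0.0 for c, g in zip(completions, golds)]

args = GRPOConfig(
    output_dir="fgrpo-qwen-math",
    max_steps=50,                    # the minimal-training setting used here
    num_generations=8,               # completions per prompt (the "group")
    per_device_train_batch_size=8,   # must be divisible by num_generations
    fp16=True,                       # T4-friendly precision
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B",
    reward_funcs=correctness_reward,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```

Note that the stock GRPOTrainer computes plain GRPO advantages; the focal scaling from the Method section would have to be added by subclassing or patching its advantage computation, which this skeleton leaves out.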

Expected Results

From the paper's Qwen2.5-7B experiments:

  • pass@256: 64.1 → 70.3 (+6.2 points)
  • No extra compute cost

Usage

This Space automatically runs training on startup. Results are pushed to silentspuck/fgrpo-qwen-math-test.
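
The upload step could be wired through the standard Hugging Face Hub options; a hedged sketch, assuming the trainer from the script above (whether the Space does it this way is an assumption):

```python
from trl import GRPOConfig

# Point the trainer at the target Hub repo via the standard
# TrainingArguments fields.
args = GRPOConfig(
    output_dir="fgrpo-qwen-math",
    push_to_hub=True,
    hub_model_id="silentspuck/fgrpo-qwen-math-test",
)
# ...build the GRPOTrainer with these args and train, then:
# trainer.push_to_hub()  # uploads the final model and tokenizer
```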
