GRPO Fine-tuned Qwen2.5-0.5B for IIT-JEE Math

Model Description

This model is fine-tuned using Group Relative Policy Optimization (GRPO) on IIT-JEE mathematics datasets.

Training Details

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct
  • Method: GRPO (Reinforcement Learning)
  • Datasets: JEEBench, JEE Main 2025, JEE-NEET Benchmark
  • LoRA Rank: 32
  • Training Epochs: 3
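The core idea of GRPO is to score each sampled completion relative to its sibling completions for the same prompt, rather than against a learned value baseline. A minimal sketch of that group-normalized advantage computation (an illustration only, not the training code used for this model):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled completions.

    Each completion's advantage is (reward - group mean) / group std,
    so a completion is rewarded for being better than its siblings,
    not for its absolute score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for the same prompt, scored by the reward functions
print(group_relative_advantages([1.3, 0.3, 1.4, 0.0]))
```

The advantages sum to zero within each group, so the policy gradient pushes probability mass toward the above-average completions in that group.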

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Yagna1/grpo-qwen2.5-0.5b-jee-math")
tokenizer = AutoTokenizer.from_pretrained("Yagna1/grpo-qwen2.5-0.5b-jee-math")

messages = [
    {"role": "system", "content": "You are a math solver. Solve step-by-step and provide your final answer in \\boxed{} format."},
    {"role": "user", "content": "What is the derivative of x^2?"}
]

# Build the prompt and append the assistant turn marker so the model replies
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping special tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
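Since the model is trained to put its final answer in \boxed{}, a small helper can pull that answer out of the decoded text. This is a hypothetical utility, not part of the model; it matches braces so nested expressions like \boxed{\frac{1}{2}} are captured whole:

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text) and depth > 0:
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out) if depth == 0 else None

print(extract_boxed("The derivative of x^2 is \\boxed{2x}."))  # 2x
```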

Reward Functions

  1. Format Reward (0.3): Ensures answers are in \boxed{} format
  2. Correctness Reward (1.0): Validates mathematical accuracy
  3. Length Reward (0.1): Encourages concise solutions (10-300 words)
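The format and length rewards above can be sketched as simple scorers. These are assumed shapes for illustration (the exact reward code used in training is not published here, and the correctness reward is omitted because it needs a reference answer checker):

```python
import re

def format_reward(completion, weight=0.3):
    """Give 0.3 if the completion contains a \\boxed{...} answer, else 0."""
    return weight if re.search(r"\\boxed\{[^}]*\}", completion) else 0.0

def length_reward(completion, weight=0.1, lo=10, hi=300):
    """Give 0.1 if the solution is between 10 and 300 words, else 0."""
    n = len(completion.split())
    return weight if lo <= n <= hi else 0.0

sol = "The derivative of x^2 is 2x by the power rule, so the answer is \\boxed{2x}."
print(format_reward(sol) + length_reward(sol))
```

In training, each completion's total reward would be the sum of all three components, which then feeds the group-relative normalization described under Training Details.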

Limitations

  • Optimized for IIT-JEE level mathematics
  • Best performance on algebra, calculus, and geometry problems
  • May require multiple generations for complex problems

Citation

@misc{grpo-jee-math-2024,
  author = {Yagna1},
  title = {GRPO Fine-tuned Qwen2.5 for IIT-JEE Math},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Yagna1/grpo-qwen2.5-0.5b-jee-math}
}