# GRPO Fine-tuned Qwen2.5-0.5B for IIT-JEE Math

## Model Description

This model fine-tunes Qwen2.5-0.5B-Instruct with Group Relative Policy Optimization (GRPO) on IIT-JEE mathematics datasets, targeting step-by-step solutions that end with a final answer in \boxed{} format.
## Training Details
- Base Model: Qwen/Qwen2.5-0.5B-Instruct
- Method: GRPO (Reinforcement Learning)
- Datasets: JEEBench, JEE Main 2025, JEE-NEET Benchmark
- LoRA Rank: 32
- Training Epochs: 3
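The card does not publish the training script, so here is a minimal sketch of how a run with these settings might look using TRL's `GRPOTrainer`. The dataset id, column names, reward weight, and hyperparameters below are assumptions for illustration, not the actual configuration used for this model:

```python
# Hypothetical GRPO training sketch (TRL); the real script for this model is unpublished.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Reward completions that wrap the final answer in \boxed{...}
    return [0.3 if "\\boxed{" in c else 0.0 for c in completions]

# Assumed dataset id and column name; GRPOTrainer expects a "prompt" column.
dataset = load_dataset("daman1209arora/jeebench", split="test")
dataset = dataset.rename_column("question", "prompt")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="grpo-jee-math", num_train_epochs=3),
    train_dataset=dataset,
    peft_config=LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM"),
)
trainer.train()
```

In GRPO the trainer samples a group of completions per prompt and normalizes rewards within the group, so the reward functions only need to return raw per-completion scores.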
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Yagna1/grpo-qwen2.5-0.5b-jee-math")
tokenizer = AutoTokenizer.from_pretrained("Yagna1/grpo-qwen2.5-0.5b-jee-math")

messages = [
    {"role": "system", "content": "You are a math solver. Solve step-by-step and provide your final answer in \\boxed{} format."},
    {"role": "user", "content": "What is the derivative of x^2?"},
]

# Build the chat prompt and append the assistant turn marker before generating
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
## Reward Functions

- Format reward (weight 0.3): requires the final answer to appear in \boxed{} format
- Correctness reward (weight 1.0): checks the answer against the reference solution
- Length reward (weight 0.1): encourages concise solutions (10-300 words)
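The reward functions themselves are not included with the card. A plain-Python sketch of how the three components described above could be implemented (the function names, the extraction regex, and exact-string answer matching are assumptions):

```python
import re

# Hypothetical implementation; the model's actual reward code is unpublished.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def format_reward(completion: str) -> float:
    # 0.3 if the completion puts an answer in \boxed{...}
    return 0.3 if BOXED.search(completion) else 0.0

def correctness_reward(completion: str, reference: str) -> float:
    # 1.0 if the boxed answer matches the reference exactly
    m = BOXED.search(completion)
    return 1.0 if m and m.group(1).strip() == reference.strip() else 0.0

def length_reward(completion: str) -> float:
    # 0.1 if the solution is concise: between 10 and 300 words
    return 0.1 if 10 <= len(completion.split()) <= 300 else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Sum of the three weighted components; maximum possible reward is 1.4
    return (format_reward(completion)
            + correctness_reward(completion, reference)
            + length_reward(completion))
```

A real correctness check would normally normalize equivalent expressions (e.g. `2x` vs `2*x`) rather than compare strings exactly.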
## Limitations
- Optimized for IIT-JEE level mathematics
- Best performance on algebra, calculus, and geometry problems
- May require multiple generations for complex problems
## Citation

```bibtex
@misc{grpo-jee-math-2024,
  author    = {Yagna1},
  title     = {GRPO Fine-tuned Qwen2.5 for IIT-JEE Math},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Yagna1/grpo-qwen2.5-0.5b-jee-math}
}
```