# OLMo3-7B LiveCodeBench Standard RL

Standard DR-GRPO training on LiveCodeBench (medium and hard problems) with a binary correctness reward.
## Training Details

- **Base model:** allenai/Olmo-3-7B-Instruct
- **Algorithm:** DR-GRPO (`advantage_estimator=dr_grpo`)
- **Dataset:** LiveCodeBench `med_hard_lightweight` subset
- **W&B runs:** `ldsfw880`, `b80ytl09` (resumed after an error)
## Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 5e-7 |
| KL coefficient | 0.0 |
| Group size (n_samples_per_prompt) | 16 |
| Rollout batch size | 16 |
| Train batch size | 16 |
| Prompt max len | 8096 |
| Generate max len | 16384 |
| GPUs | 4 |
| Episodes | 20 |
| Save steps | 8 |
| Precision | bf16 |
| Zero stage | 2 |
| Gradient checkpointing | enabled |
| Adam offload | enabled |
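The key property of the DR-GRPO advantage estimator (as I understand it; this is a sketch of the algorithm, not OpenRLHF's actual implementation) is that each group of `n_samples_per_prompt` rewards is centered by the group mean only, without GRPO's division by the group standard deviation:

```python
# Minimal DR-GRPO advantage sketch: subtract the group-mean reward,
# with NO std normalization (the "Dr." fix relative to GRPO).
def dr_grpo_advantages(group_rewards):
    mean = sum(group_rewards) / len(group_rewards)
    return [r - mean for r in group_rewards]

# With a binary correctness reward and group size 16, suppose 4 of the
# 16 rollouts for a prompt are correct:
rewards = [1.0] * 4 + [0.0] * 12
advs = dr_grpo_advantages(rewards)
print(advs[0], advs[-1])  # correct samples get +0.75, incorrect -0.25
```

Note that if all 16 rollouts in a group receive the same reward (all pass or all fail), every advantage is zero and that prompt contributes no gradient signal.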
## Training Script

```shell
python3 -m openrlhf.cli.train_ppo_ray \
  --advantage_estimator dr_grpo \
  --pretrain allenai/Olmo-3-7B-Instruct \
  --prompt_data train.jsonl \
  --input_key prompt_full \
  --label_key ground_truth \
  --remote_rm_url reward_func.py \
  --rollout_batch_size 16 \
  --micro_rollout_batch_size 1 \
  --train_batch_size 16 \
  --micro_train_batch_size 1 \
  --n_samples_per_prompt 16 \
  --max_samples 10000 \
  --prompt_max_len 8096 \
  --generate_max_len 16384 \
  --actor_learning_rate 5e-7 \
  --init_kl_coef 0.0 \
  --normalize_reward \
  --param_dtype bf16 \
  --zero_stage 2 \
  --actor_num_gpus_per_node 4 \
  --vllm_num_engines 4 \
  --apply_chat_template \
  --attn_implementation flash_attention_2 \
  --use_liger_kernel \
  --colocate_all_models \
  --vllm_gpu_memory_utilization 0.5 \
  --num_episodes 20 \
  --save_steps 8 \
  --save_hf_ckpt \
  --gradient_checkpointing \
  --adam_offload
```
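The `--remote_rm_url reward_func.py` flag points at a custom binary correctness reward. The actual `reward_func.py` is not shown here; the sketch below illustrates the shape of such a reward, using a hypothetical `result` variable convention and an `exec`-based check as a stand-in for LiveCodeBench's real test-case harness (consult your OpenRLHF version's docs for the exact entry-point signature it expects):

```python
import re

def binary_correctness_reward(completion: str, ground_truth: str) -> float:
    """Illustrative binary reward: 1.0 iff the last fenced code block in
    the completion runs and its `result` matches the ground truth."""
    blocks = re.findall(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    if not blocks:
        return 0.0  # no code block emitted -> incorrect
    try:
        namespace = {}
        exec(blocks[-1], namespace)  # run the candidate solution
        return 1.0 if str(namespace.get("result")) == ground_truth else 0.0
    except Exception:
        return 0.0  # crashing code counts as incorrect

completion = "Here is my solution:\n```python\nresult = 2 + 2\n```"
print(binary_correctness_reward(completion, "4"))  # → 1.0
```

Because the reward is strictly 0/1, `--normalize_reward` and the group-mean centering in DR-GRPO carry all of the shaping; partially correct programs earn nothing.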
## Checkpoints

Each checkpoint is available as a revision tag:

`step-8`, `step-16`, `step-24`, `step-32`, `step-40`, `step-48`, `step-56`, `step-64`, `step-72`, `step-80`
Load a specific checkpoint:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("agurung/olmo3-7b-lcb-standard-rl", revision="step-40")
```
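Since `save_steps=8` produced evenly spaced tags from `step-8` to `step-80`, the full tag list can be generated programmatically, e.g. for an evaluation sweep over all checkpoints (a sketch; the loading call is shown as a comment to avoid downloading weights here):

```python
REPO = "agurung/olmo3-7b-lcb-standard-rl"

# step-8, step-16, ..., step-80 (checkpoints saved every 8 steps)
tags = [f"step-{s}" for s in range(8, 81, 8)]
print(tags)

# For each tag, load and evaluate that checkpoint:
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(REPO, revision=tag)
```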
## Model tree for agurung/olmo3-7b-lcb-standard-rl

Base model: allenai/Olmo-3-1025-7B
→ allenai/Olmo-3-7B-Instruct-SFT (finetuned)
→ allenai/Olmo-3-7B-Instruct-DPO (finetuned)
→ allenai/Olmo-3-7B-Instruct (finetuned)
→ agurung/olmo3-7b-lcb-standard-rl (this model)