# OLMo3-7B LiveCodeBench Standard RL

Standard DR-GRPO training on LiveCodeBench (medium and hard problems) with a binary correctness reward.
## Training Details

- **Base model:** allenai/Olmo-3-7B-Instruct
- **Algorithm:** DR-GRPO (`advantage_estimator=dr_grpo`)
- **Dataset:** LiveCodeBench `med_hard_lightweight` subset
- **W&B runs:** `ldsfw880`, `b80ytl09` (resumed after an error)
## Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 5e-7 |
| KL coefficient | 0.0 |
| Group size (n_samples_per_prompt) | 16 |
| Rollout batch size | 16 |
| Train batch size | 16 |
| Prompt max len | 8096 |
| Generate max len | 16384 |
| GPUs | 4 |
| Episodes | 20 |
| Save steps | 8 |
| Precision | bf16 |
| Zero stage | 2 |
| Gradient checkpointing | enabled |
| Adam offload | enabled |
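The key property of the DR-GRPO advantage estimator (as I understand it; this is a sketch of the algorithm, not OpenRLHF's actual implementation) is that each group of `n_samples_per_prompt` rewards is centered by the group mean only, without GRPO's division by the group standard deviation:

```python
# Minimal DR-GRPO advantage sketch: subtract the group-mean reward,
# with NO std normalization (the "Dr." fix relative to GRPO).
def dr_grpo_advantages(group_rewards):
    mean = sum(group_rewards) / len(group_rewards)
    return [r - mean for r in group_rewards]

# With a binary correctness reward and group size 16, suppose 4 of the
# 16 rollouts for a prompt are correct:
rewards = [1.0] * 4 + [0.0] * 12
advs = dr_grpo_advantages(rewards)
print(advs[0], advs[-1])  # correct samples get +0.75, incorrect -0.25
```

Note that if all 16 rollouts in a group receive the same reward (all pass or all fail), every advantage is zero and that prompt contributes no gradient signal.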
## Training Script

```shell
python3 -m openrlhf.cli.train_ppo_ray \
  --advantage_estimator dr_grpo \
  --pretrain allenai/Olmo-3-7B-Instruct \
  --prompt_data train.jsonl \
  --input_key prompt_full \
  --label_key ground_truth \
  --remote_rm_url reward_func.py \
  --rollout_batch_size 16 \
  --micro_rollout_batch_size 1 \
  --train_batch_size 16 \
  --micro_train_batch_size 1 \
  --n_samples_per_prompt 16 \
  --max_samples 10000 \
  --prompt_max_len 8096 \
  --generate_max_len 16384 \
  --actor_learning_rate 5e-7 \
  --init_kl_coef 0.0 \
  --normalize_reward \
  --param_dtype bf16 \
  --zero_stage 2 \
  --actor_num_gpus_per_node 4 \
  --vllm_num_engines 4 \
  --apply_chat_template \
  --attn_implementation flash_attention_2 \
  --use_liger_kernel \
  --colocate_all_models \
  --vllm_gpu_memory_utilization 0.5 \
  --num_episodes 20 \
  --save_steps 8 \
  --save_hf_ckpt \
  --gradient_checkpointing \
  --adam_offload
```
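The `--remote_rm_url reward_func.py` flag points at a custom binary correctness reward. The actual `reward_func.py` is not shown here; the sketch below illustrates the shape of such a reward, using a hypothetical `result` variable convention and an `exec`-based check as a stand-in for LiveCodeBench's real test-case harness (consult your OpenRLHF version's docs for the exact entry-point signature it expects):

```python
import re

def binary_correctness_reward(completion: str, ground_truth: str) -> float:
    """Illustrative binary reward: 1.0 iff the last fenced code block in
    the completion runs and its `result` matches the ground truth."""
    blocks = re.findall(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    if not blocks:
        return 0.0  # no code block emitted -> incorrect
    try:
        namespace = {}
        exec(blocks[-1], namespace)  # run the candidate solution
        return 1.0 if str(namespace.get("result")) == ground_truth else 0.0
    except Exception:
        return 0.0  # crashing code counts as incorrect

completion = "Here is my solution:\n```python\nresult = 2 + 2\n```"
print(binary_correctness_reward(completion, "4"))  # → 1.0
```

Because the reward is strictly 0/1, `--normalize_reward` and the group-mean centering in DR-GRPO carry all of the shaping; partially correct programs earn nothing.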
## Checkpoints

Each checkpoint is available as a revision tag:

`step-8`, `step-16`, `step-24`, `step-32`, `step-40`, `step-48`, `step-56`, `step-64`, `step-72`, `step-80`
Load a specific checkpoint:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("agurung/olmo3-7b-lcb-standard-rl", revision="step-40")
```
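Since `save_steps=8` produced evenly spaced tags from `step-8` to `step-80`, the full tag list can be generated programmatically, e.g. for an evaluation sweep over all checkpoints (a sketch; the loading call is shown as a comment to avoid downloading weights here):

```python
REPO = "agurung/olmo3-7b-lcb-standard-rl"

# step-8, step-16, ..., step-80 (checkpoints saved every 8 steps)
tags = [f"step-{s}" for s in range(8, 81, 8)]
print(tags)

# For each tag, load and evaluate that checkpoint:
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(REPO, revision=tag)
```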
## Model tree for agurung/olmo3-7b-lcb-standard-rl

Base model: allenai/Olmo-3-1025-7B
→ allenai/Olmo-3-7B-Instruct-SFT (finetuned)
→ allenai/Olmo-3-7B-Instruct-DPO (finetuned)
→ allenai/Olmo-3-7B-Instruct (finetuned)
→ agurung/olmo3-7b-lcb-standard-rl (this model)