OLMo3-7B LiveCodeBench Standard RL

allenai/Olmo-3-7B-Instruct fine-tuned with standard DR-GRPO on LiveCodeBench (medium and hard problems), using a binary correctness reward.
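
As a rough illustration of what a group-relative advantage looks like under a binary reward, the sketch below subtracts the group mean from each sample's reward and, following the usual description of DR-GRPO, does not divide by the group standard deviation. The function name `dr_grpo_advantages` is hypothetical, and the sketch does not attempt to reproduce the exact OpenRLHF implementation or any additional reward normalization.

```python
def dr_grpo_advantages(rewards):
    """Group-mean baseline without std normalization (illustrative sketch).

    rewards: per-sample rewards for the responses sampled from ONE prompt.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical group of 16 samples where 4 solutions pass and 12 fail.
group = [1.0] * 4 + [0.0] * 12
advantages = dr_grpo_advantages(group)

# A group where every sample gets the same reward carries no learning signal.
all_correct = dr_grpo_advantages([1.0] * 16)
```

With a binary reward, a prompt only contributes a non-zero advantage when its group contains both passing and failing samples.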

Training Details

  • Base model: allenai/Olmo-3-7B-Instruct
  • Algorithm: DR-GRPO (advantage_estimator=dr_grpo)
  • Dataset: LiveCodeBench med_hard_lightweight subset
  • W&B runs: ldsfw880, b80ytl09 (resumed after error)

Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 5e-7 |
| KL coefficient | 0.0 |
| Group size (n_samples_per_prompt) | 16 |
| Rollout batch size | 16 |
| Train batch size | 16 |
| Prompt max len | 8096 |
| Generate max len | 16384 |
| GPUs | 4 |
| Episodes | 20 |
| Save steps | 8 |
| Precision | bf16 |
| ZeRO stage | 2 |
| Gradient checkpointing | enabled |
| Adam offload | enabled |
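
A couple of derived quantities may help when sizing hardware for these settings. The arithmetic below assumes that the rollout batch size counts prompts (so each rollout step generates one group of samples per prompt) and that the maximum sequence length is the sum of the prompt and generation limits. Both are assumptions about how the values combine, not statements from the training logs.

```python
# Values copied from the hyperparameter table above.
rollout_batch_size = 16      # assumed to count prompts per rollout step
n_samples_per_prompt = 16    # group size
prompt_max_len = 8096
generate_max_len = 16384

# Responses generated per rollout step under the assumption above.
responses_per_rollout_step = rollout_batch_size * n_samples_per_prompt

# Upper bound on tokens in a single prompt + response sequence.
max_sequence_len = prompt_max_len + generate_max_len

print(responses_per_rollout_step, max_sequence_len)
```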

Training Script

python3 -m openrlhf.cli.train_ppo_ray \
  --advantage_estimator dr_grpo \
  --pretrain allenai/Olmo-3-7B-Instruct \
  --prompt_data train.jsonl \
  --input_key prompt_full \
  --label_key ground_truth \
  --remote_rm_url reward_func.py \
  --rollout_batch_size 16 \
  --micro_rollout_batch_size 1 \
  --train_batch_size 16 \
  --micro_train_batch_size 1 \
  --n_samples_per_prompt 16 \
  --max_samples 10000 \
  --prompt_max_len 8096 \
  --generate_max_len 16384 \
  --actor_learning_rate 5e-7 \
  --init_kl_coef 0.0 \
  --normalize_reward \
  --param_dtype bf16 \
  --zero_stage 2 \
  --actor_num_gpus_per_node 4 \
  --vllm_num_engines 4 \
  --apply_chat_template \
  --attn_implementation flash_attention_2 \
  --use_liger_kernel \
  --colocate_all_models \
  --vllm_gpu_memory_utilization 0.5 \
  --num_episodes 20 \
  --save_steps 8 \
  --save_hf_ckpt \
  --gradient_checkpointing \
  --adam_offload
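
The contents of reward_func.py are not shown here. As a minimal sketch of what a binary correctness reward could look like, the code below extracts the last fenced code block from a response, runs it against stdin/stdout test cases, and returns 1.0 only if every case passes. Everything in it is an assumption for illustration: the helper names `extract_code` and `binary_reward`, the test-case format (a list of dicts with "input" and "output" keys), the timeout, and the exact-match comparison. Wrapping it in the interface that OpenRLHF expects from a reward file is left out.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this file
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response):
    """Return the last fenced code block in the response, or None."""
    blocks = CODE_BLOCK.findall(response)
    return blocks[-1] if blocks else None


def binary_reward(response, tests, timeout_s=5.0):
    """Return 1.0 if the extracted program passes every test, else 0.0.

    tests: hypothetical format, a list of {"input": str, "output": str}.
    """
    code = extract_code(response)
    if code is None:
        return 0.0
    for case in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                input=case["input"],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        if result.returncode != 0:
            return 0.0
        # Compare outputs ignoring leading/trailing whitespace.
        if result.stdout.strip() != case["output"].strip():
            return 0.0
    return 1.0
```

Generated code is untrusted, so a real reward function would run it inside a sandbox with resource limits rather than a bare subprocess.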

Checkpoints

Each checkpoint is available as a revision tag:

  • step-8, step-16, step-24, step-32, step-40, step-48, step-56, step-64, step-72, step-80

Load a specific checkpoint:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("agurung/olmo3-7b-lcb-standard-rl", revision="step-40")
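
To sweep over all checkpoints, the revision tags can be generated from the save interval instead of being typed out. The sketch below assumes the tags follow the step-N pattern listed above up to step 80, and that tokenizer files are stored alongside each revision. The helper names `checkpoint_revisions` and `load_checkpoint` are hypothetical.

```python
REPO_ID = "agurung/olmo3-7b-lcb-standard-rl"


def checkpoint_revisions(save_steps=8, last_step=80):
    """Build the list of revision tags: step-8, step-16, ..., step-80."""
    return [f"step-{step}" for step in range(save_steps, last_step + 1, save_steps)]


def load_checkpoint(revision):
    """Load model and tokenizer for one revision (downloads weights on first use)."""
    # Imported lazily so the tag helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        REPO_ID, revision=revision, torch_dtype="auto"
    )
    # Assumes the tokenizer was saved with each checkpoint revision.
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
    return model, tokenizer


revisions = checkpoint_revisions()
```

Calling `load_checkpoint(revisions[4])` would then fetch the same step-40 checkpoint as the snippet above.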