SmolLM2-135M-Reasoning-SLERP-Champion

🏆 NEW CHAMPION — Zero-cost SLERP merge that outperforms 200+ training experiments!

Method

This model is a SLERP (Spherical Linear Interpolation) merge of two DPO-trained models at t=0.3:

  • 70%: exp146 (SFT→DPO sigmoid, constant scheduler, 10K) — best GSM8K training result
  • 30%: exp138 (SFT→DPO nca_pair, cosine scheduler, 10K) — best ARC-C training result

Both parent models were trained on SmolLM2-135M-Instruct with:

  1. SFT on ~100K reasoning + non-reasoning data
  2. DPO alignment with different loss types and schedulers

Benchmark Results (lm-eval, 0-shot)

Metric Baseline exp146 (prev champion) This Model Delta
ARC-C (acc_norm) 27.73% 29.18% 29.18% TIE
ARC-E (acc) 54.12% 57.28% 57.11% -0.17pp
GSM8K (flex) 0.38% 2.35% 2.43% +0.08pp 🏆
HellaSwag (acc_norm) 42.99% 42.95% 43.01% +0.06pp
PIQA (acc_norm) 66.92% 67.14% 66.92% -0.22pp

Key Insights

  1. Zero training cost: Pure weight interpolation, no GPU training needed
  2. GSM8K record: 2.43% — highest across 200+ experiments at 135M scale
  3. HellaSwag above baseline: Only post-training model to exceed baseline HellaSwag
  4. SLERP is asymmetric: t=0.3 creates constructive interference; t=0.5 destroys capabilities (GSM8K=1.21%)
  5. Complementary specialists merge well: Combining sigmoid+constant and nca_pair+cosine DPO creates synergy

Full Experiment Report

Part of a comprehensive 200+ experiment optimization study. See the full report at: https://github.com/lldois/tiny_resoning_llm/blob/main/plus_copilot/reports/experiment_report.md

Downloads last month
16
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for lldois/SmolLM2-135M-Reasoning-SLERP-Champion

Finetuned
(305)
this model