SmolLM2-135M-Reasoning-SLERP-Champion
🏆 NEW CHAMPION — Zero-cost SLERP merge that outperforms 200+ training experiments!
Method
This model is a SLERP (Spherical Linear Interpolation) merge of two DPO-trained models at t=0.3:
- 70%: exp146 (SFT→DPO sigmoid, constant scheduler, 10K) — best GSM8K training result
- 30%: exp138 (SFT→DPO nca_pair, cosine scheduler, 10K) — best ARC-C training result
Both parent models were trained on SmolLM2-135M-Instruct with:
- SFT on ~100K reasoning + non-reasoning data
- DPO alignment with different loss types and schedulers
Benchmark Results (lm-eval, 0-shot)
| Metric | Baseline | exp146 (prev champion) | This Model | Delta |
|---|---|---|---|---|
| ARC-C (acc_norm) | 27.73% | 29.18% | 29.18% | TIE |
| ARC-E (acc) | 54.12% | 57.28% | 57.11% | -0.17pp |
| GSM8K (flex) | 0.38% | 2.35% | 2.43% | +0.08pp 🏆 |
| HellaSwag (acc_norm) | 42.99% | 42.95% | 43.01% | +0.06pp |
| PIQA (acc_norm) | 66.92% | 67.14% | 66.92% | -0.22pp |
Key Insights
- Zero training cost: Pure weight interpolation, no GPU training needed
- GSM8K record: 2.43% — highest across 200+ experiments at 135M scale
- HellaSwag above baseline: Only post-training model to exceed baseline HellaSwag
- SLERP is asymmetric: t=0.3 creates constructive interference; t=0.5 destroys capabilities (GSM8K=1.21%)
- Complementary specialists merge well: Combining sigmoid+constant and nca_pair+cosine DPO creates synergy
Full Experiment Report
Part of a comprehensive 200+ experiment optimization study. See the full report at: https://github.com/lldois/tiny_resoning_llm/blob/main/plus_copilot/reports/experiment_report.md
- Downloads last month
- 16
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support
Model tree for lldois/SmolLM2-135M-Reasoning-SLERP-Champion
Base model
HuggingFaceTB/SmolLM2-135M Quantized
HuggingFaceTB/SmolLM2-135M-Instruct