SmolLM2-135M-Reasoning-SLERP-Champion

🏆 NEW CHAMPION — Zero-cost SLERP merge that outperforms 200+ training experiments!

Method

This model is a SLERP (Spherical Linear Interpolation) merge of two DPO-trained models at t=0.3:

70%: exp146 (SFT→DPO sigmoid, constant scheduler, 10K) — best GSM8K training result
30%: exp138 (SFT→DPO nca_pair, cosine scheduler, 10K) — best ARC-C training result

Both parent models were trained on SmolLM2-135M-Instruct with:

Metric	Baseline	exp146 (prev champion)	This Model	Delta
ARC-C (acc_norm)	27.73%	29.18%	29.18%	TIE
ARC-E (acc)	54.12%	57.28%	57.11%	-0.17pp
GSM8K (flex)	0.38%	2.35%	2.43%	+0.08pp 🏆
HellaSwag (acc_norm)	42.99%	42.95%	43.01%	+0.06pp
PIQA (acc_norm)	66.92%	67.14%	66.92%	-0.22pp

Zero training cost: Pure weight interpolation, no GPU training needed
GSM8K record: 2.43% — highest across 200+ experiments at 135M scale
HellaSwag above baseline: Only post-training model to exceed baseline HellaSwag
SLERP is asymmetric: t=0.3 creates constructive interference; t=0.5 destroys capabilities (GSM8K=1.21%)
Complementary specialists merge well: Combining sigmoid+constant and nca_pair+cosine DPO creates synergy

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Base model

Quantized

Finetuned

(305)

this model