# Phi-mini-MoE + MoA + Pruning + Specialization

## What's Special
This model adds Mixture of Attention (MoA) routing to Phi-mini-MoE, then:
- ✂️ Pruned 25% of attention heads (keeping only the most important ones)
- 🎯 Forced expert specialization (each expert focuses on specific tasks)
- ⚡ ~3x faster than OLMoE-1B-7B
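
The routing code itself isn't shown on this card, so the sketch below is only a conceptual illustration of a MoA-style head router combined with a pruning mask: a learned gate scores attention heads per token, and pruned heads are masked to zero routing weight. The class name, hidden size (3072), and the choice of which heads are kept are assumptions, not this model's actual implementation.

```python
import torch
import torch.nn as nn


class MoAHeadRouter(nn.Module):
    """Illustrative MoA-style head router with a pruning mask.

    A learned gate scores every attention head per token; heads removed by
    pruning are masked out so their routing weight is exactly zero.
    Conceptual sketch only, not this repository's implementation.
    """

    def __init__(self, hidden_size: int, num_heads: int = 32, kept_heads: int = 24):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_heads)  # per-token head scores
        # Pruning mask: 1 for kept heads, 0 for pruned heads. In practice the
        # surviving heads are chosen by importance scores; keeping the first
        # `kept_heads` here is purely for illustration.
        mask = torch.zeros(num_heads)
        mask[:kept_heads] = 1.0
        self.register_buffer("head_mask", mask)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        scores = self.gate(hidden_states)  # (batch, seq_len, num_heads)
        scores = scores.masked_fill(self.head_mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1)  # pruned heads get weight 0


# Route a dummy batch: pruned heads receive zero routing weight.
router = MoAHeadRouter(hidden_size=3072, num_heads=32, kept_heads=24)
weights = router(torch.randn(2, 16, 3072))
print(weights.shape)                  # torch.Size([2, 16, 32])
print(weights[..., 24:].abs().max())  # tensor(0.) -> pruned heads are inactive
```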
## Stats
- Base: Phi-mini-MoE (7.6B total parameters, 2.4B active)
- Attention heads: 32 → 24 (pruned 25%)
- Training iterations: 10
- Expert specialization: 16.7%
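
As a quick sanity check on these numbers (plain arithmetic, not taken from the repo files): dropping from 32 to 24 heads removes 25% of them, and 2.4B of 7.6B parameters means roughly 32% of the weights are active per token.

```python
total_heads, kept_heads = 32, 24
total_params, active_params = 7.6e9, 2.4e9

print(f"attention heads pruned: {1 - kept_heads / total_heads:.0%}")       # 25%
print(f"parameters active per token: {active_params / total_params:.0%}")  # ~32%
```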
## Files

- `moa_router.pt` - Trained + pruned MoA router
- `training_data.json` - Self-play examples
- `expert_stats.json` - Expert specialization profiles
- `pruning_stats.json` - Which heads were pruned
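
A minimal sketch of loading these artifacts with PyTorch and the standard library. The internal structure of the JSON files isn't documented on this card, so the snippet only loads them without assuming any particular fields:

```python
import json

import torch

# Router weights (assumed to be a tensor state_dict; if the file stores a full
# pickled module instead, pass weights_only=False to torch.load).
router_state = torch.load("moa_router.pt", map_location="cpu")

# JSON artifacts: load as plain Python objects.
with open("expert_stats.json") as f:
    expert_stats = json.load(f)
with open("pruning_stats.json") as f:
    pruning_stats = json.load(f)
with open("training_data.json") as f:
    training_data = json.load(f)

print(type(router_state).__name__)
print(len(training_data) if hasattr(training_data, "__len__") else training_data)
```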