Phi-mini-MoE + MoA + Pruning + Specialization

What's Special

This model adds Mixture of Attention (MoA) routing to Phi-mini-MoE, then:

  • ✂️ Pruned 25% of attention heads, keeping only the highest-importance ones (see the sketch after this list)
  • 🎯 Forced expert specialization, so each expert focuses on specific tasks
  • ~3x faster than OLMoE-1B-7B
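
The combination of head routing and head pruning can be pictured with a short PyTorch sketch. This is a generic illustration only, not the code behind this model: the `MoAHeadRouter` class, the softmax gate, and the weighted-sum mixing of head outputs are all assumptions; the actual trained router is the one shipped in `moa_router.pt`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoAHeadRouter(nn.Module):
    """Per-token soft router over attention heads, with some heads pruned.

    Illustrative only: the class name, softmax gating, and weighted-sum
    head mixing are assumptions, not this model's actual implementation.
    """

    def __init__(self, hidden_size: int, num_heads: int, pruned_heads=()):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_heads, bias=False)
        # True for heads that survive pruning, False for pruned heads.
        keep = torch.ones(num_heads, dtype=torch.bool)
        keep[list(pruned_heads)] = False
        self.register_buffer("keep_mask", keep)

    def forward(self, hidden_states, head_outputs):
        # hidden_states: (batch, seq, hidden)
        # head_outputs:  (batch, seq, num_heads, head_dim)
        logits = self.gate(hidden_states)                        # (B, S, H)
        logits = logits.masked_fill(~self.keep_mask, float("-inf"))
        weights = F.softmax(logits, dim=-1)                      # pruned heads get weight 0
        return (weights.unsqueeze(-1) * head_outputs).sum(dim=2)

# Toy shapes: 32 heads with the last 8 masked out (32 -> 24, a 25% reduction).
router = MoAHeadRouter(hidden_size=64, num_heads=32, pruned_heads=range(24, 32))
x = torch.randn(2, 5, 64)
heads = torch.randn(2, 5, 32, 16)
print(router(x, heads).shape)  # torch.Size([2, 5, 16])
```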

Stats

  • Base: Phi-mini-MoE (7.6B total, 2.4B active)
  • Attention heads: 32 → 24 (pruned 25%)
  • Training iterations: 10
  • Expert specialization: 16.7% (one possible metric is sketched after this list)
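
The card does not say how the specialization percentage is computed. Purely as a hypothetical reading, one could score specialization from a routing log of (task, expert) pairs; the helper below sketches such a metric and is not the definition behind the 16.7% figure above.

```python
from collections import Counter, defaultdict

def specialization_score(routing_log):
    """Hypothetical metric: for each expert, take the share of its routed
    tokens that come from its single most frequent task, then average
    across experts. 1.0 means every expert sees only one task; the floor
    is 1/num_tasks when routing ignores the task entirely."""
    per_expert = defaultdict(Counter)
    for task, expert_id in routing_log:
        per_expert[expert_id][task] += 1
    shares = [
        counts.most_common(1)[0][1] / sum(counts.values())
        for counts in per_expert.values()
    ]
    return sum(shares) / len(shares)

# Toy log of (task, expert_id) pairs, one per routed token.
log = [("math", 0), ("math", 0), ("code", 0), ("chat", 1), ("chat", 1)]
print(f"{specialization_score(log):.1%}")  # 83.3% for this toy log
```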

Files

  • moa_router.pt - Trained + pruned MoA router (a loading sketch follows this list)
  • training_data.json - Self-play examples
  • expert_stats.json - Expert specialization profiles
  • pruning_stats.json - Which heads were pruned
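
A minimal sketch of loading these artifacts with PyTorch and the standard library, assuming the files are in the working directory; since their internal layout isn't documented here, the snippet only loads each file and prints a summary.

```python
import json
import torch

# Depending on how the checkpoint was saved, recent PyTorch versions may
# require torch.load(..., weights_only=False).
router_state = torch.load("moa_router.pt", map_location="cpu")
print(type(router_state))

for name in ("expert_stats.json", "pruning_stats.json", "training_data.json"):
    with open(name) as f:
        data = json.load(f)
    summary = list(data)[:5] if isinstance(data, dict) else f"{len(data)} entries"
    print(name, summary)
```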

By

maxie-12321
