Distilled with the Distily library, using HuggingFaceTB/SmolLM-135M as the teacher model and wikimedia/wikipedia as the dataset.
Student model architecture: LlamaForCausalLM
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-14): 15 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LigerSwiGLUMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
        )
        (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
        (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
      )
    )
    (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)
Architecture diff (teacher -> student): LlamaForCausalLM -> LlamaForCausalLM
--- teacher model modules
+++ student model modules
@@ -2,7 +2,7 @@
   (model): LlamaModel(
     (embed_tokens): Embedding(49152, 576)
     (layers): ModuleList(
-      (0-29): 30 x LlamaDecoderLayer(
+      (0-14): 15 x LlamaDecoderLayer(
         (self_attn): LlamaSdpaAttention(
           (q_proj): Linear(in_features=576, out_features=576, bias=False)
           (k_proj): Linear(in_features=576, out_features=192, bias=False)
@@ -10,17 +10,16 @@
           (o_proj): Linear(in_features=576, out_features=576, bias=False)
           (rotary_emb): LlamaRotaryEmbedding()
         )
-        (mlp): LlamaMLP(
+        (mlp): LigerSwiGLUMLP(
           (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
           (up_proj): Linear(in_features=576, out_features=1536, bias=False)
           (down_proj): Linear(in_features=1536, out_features=576, bias=False)
-          (act_fn): SiLU()
         )
-        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
-        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+        (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+        (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
       )
     )
-    (norm): LlamaRMSNorm((576,), eps=1e-05)
+    (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
     (rotary_emb): LlamaRotaryEmbedding()
   )
   (lm_head): Linear(in_features=576, out_features=49152, bias=False)
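To make the size difference between teacher and student concrete, the parameter counts can be estimated directly from the layer shapes printed in the module listings. This is a rough sketch, not output from Distily: the helper names are hypothetical, and it assumes that `lm_head` shares its weight with `embed_tokens` (the listings do not show whether the weights are tied) and that the rotary embeddings hold no trainable parameters.

```python
# Shapes copied from the module listing: vocabulary size, hidden size,
# key/value projection width, and MLP intermediate size.
VOCAB, HIDDEN, KV, FFN = 49152, 576, 192, 1536


def linear(in_features: int, out_features: int) -> int:
    """Parameter count of a Linear layer with bias=False."""
    return in_features * out_features


def decoder_layer_params() -> int:
    """Parameters in one LlamaDecoderLayer, per the printed shapes."""
    attn = 2 * linear(HIDDEN, HIDDEN) + 2 * linear(HIDDEN, KV)  # q, o + k, v
    mlp = 2 * linear(HIDDEN, FFN) + linear(FFN, HIDDEN)         # gate, up + down
    norms = 2 * HIDDEN                                          # two RMSNorm weights
    return attn + mlp + norms


def model_params(num_layers: int, tied_lm_head: bool = True) -> int:
    """Total parameters; tied_lm_head=True is an assumption, not read from the card."""
    total = VOCAB * HIDDEN + num_layers * decoder_layer_params() + HIDDEN
    if not tied_lm_head:
        total += linear(HIDDEN, VOCAB)
    return total


teacher = model_params(30)  # 134,515,008
student = model_params(15)  # 81,413,568
print(f"teacher={teacher:,} student={student:,} ratio={student / teacher:.3f}")
```

Under these assumptions, halving the decoder depth leaves the student at roughly 60% of the teacher's size, because the 49152 x 576 embedding table is unchanged.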
Trained on 706,573,563 tokens from the wikimedia/wikipedia dataset.
Num samples: 998,000
Subset: 20231101.en
Split: train
DistillationObjective(
    logits_loss_component=LossComponent(
        weight=1,
        loss_fn='kl'
    ),
    hs_loss_component=LossComponent(
        weight=0
    ),
    attn_loss_component=LossComponent(
        weight=0
    )
)
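The objective above puts all of its weight on a KL-divergence loss over logits and none on hidden states or attention maps. As a framework-free illustration of what such a loss computes, here is a minimal sketch for plain Python lists. The function names are hypothetical, and the direction of the divergence (teacher distribution as the reference) and the mean reduction over positions are assumptions; Distily's actual implementation may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position (assumed direction)."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


def distillation_loss(teacher_batch, student_batch,
                      logits_weight=1.0, hs_loss=0.0, hs_weight=0.0,
                      attn_loss=0.0, attn_weight=0.0):
    """Weighted sum of loss components; with the weights listed in the
    objective (1, 0, 0) only the logits term contributes."""
    per_position = [kl_logits_loss(t, s)
                    for t, s in zip(teacher_batch, student_batch)]
    logits_loss = sum(per_position) / len(per_position)  # mean over positions
    return (logits_weight * logits_loss
            + hs_weight * hs_loss
            + attn_weight * attn_loss)


teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
student = [[1.5, 0.7, -0.8], [0.1, 0.2, 0.3]]
print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly, and it does not change if a constant is added to every logit at a position.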
The following hyperparameters were used during training:
0.0002424228
Adam with betas=(0.9,0.999) and epsilon=1e-08
polynomial
1.0
DistillationObjective(logits_loss_component=LossComponent(weight=1, loss_fn='kl'), hs_loss_component=LossComponent(weight=0), attn_loss_component=LossComponent(weight=0))
<torch.optim.lr_scheduler.LambdaLR object at 0x718c02862f80>
None
None
{'num_hidden_layers': 15}
None
[('lm_head', False)]
False
True
HuggingFaceTB/SmolLM-135M
False
False
wikimedia/wikipedia
20231101.en
train
text
1000000
0.002
False
42
False
0.01.00.00
True

Base model: HuggingFaceTB/SmolLM-135M
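The hyperparameters name a polynomial learning-rate schedule, realized as a `LambdaLR` object. For readers unfamiliar with that schedule, the following is a minimal sketch of polynomial decay with optional linear warmup. The function name is hypothetical; the defaults `lr_end=1e-7` and `power=1.0` mirror the defaults of `get_polynomial_decay_schedule_with_warmup` in the transformers library and are assumed rather than read from this card. The initial rate and step count in the usage lines are illustrative values only.

```python
def polynomial_lr(step, total_steps, lr_init,
                  lr_end=1e-7, power=1.0, warmup_steps=0):
    """Learning rate at `step` under polynomial decay from lr_init to lr_end.

    Assumes total_steps > warmup_steps. With power=1.0 this reduces to a
    linear decay that ends at lr_end instead of zero.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to lr_init.
        return lr_init * step / max(1, warmup_steps)
    if step > total_steps:
        # Past the end of training, hold at the final rate.
        return lr_end
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (lr_init - lr_end) * remaining ** power + lr_end


# Illustrative values only: a 1,000-step run starting at 2e-4.
for step in (0, 250, 500, 750, 1000):
    print(step, polynomial_lr(step, total_steps=1000, lr_init=2e-4))
```

A `LambdaLR` scheduler expects a multiplier on the initial rate rather than the rate itself, so the value returned here would be divided by `lr_init` before being handed to it.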