---
language:
- en
license: mit
tags:
- llm
- decoder-only
- transformer
- from-scratch
- research
- educational
- 80m
- pytorch
- pretraining
- custom-architecture
pipeline_tag: text-generation
inference:
  parameters:
    temperature: 0.7
    top_p: 0.95
---

# 🧠 Mini-LLM – 80M Parameter Transformer (Pretrained From Scratch)

[![MIT License](https://img.shields.io/badge/license-MIT-green.svg)]()
[![Model Size](https://img.shields.io/badge/params-80M-blue.svg)]()

**Mini-LLM** is an 80M-parameter decoder-only transformer trained **fully from scratch** with a custom tokenizer, custom architecture, and custom training loop. It is designed as an educational, research-friendly minimal LLM that demonstrates how modern LLM components are built end to end.

---

## ✨ Key Features

- **80M parameters**: compact but fully functional LLM
- **Trained from scratch** (no borrowed checkpoints)
- Custom **Byte-Level BPE tokenizer (32k vocab)**
- Modern architecture components (see the PyTorch sketch further down this card):
  - RoPE (Rotary Position Embeddings)
  - RMSNorm
  - SwiGLU feed-forward layer
  - FlashAttention (via PyTorch SDPA)
  - GQA-ready attention implementation
- **2B tokens** of mixed corpus (FineWeb + WikiText + Wikipedia)
- Training logs, checkpoints, and plots included for transparency
- Released under a permissive license for research and learning

---

## 📐 Model Architecture

| Component | Value |
|-----------|-------|
| Type | Decoder-only transformer |
| Parameters | ~80M |
| Layers | 16 |
| Embedding dim | 384 |
| Attention heads | 6 |
| KV heads | 6 |
| MLP hidden dim | 1536 (SwiGLU) |
| Max sequence length | 2048 |
| Norm | RMSNorm |
| Positional encoding | RoPE |
| Tokenizer | SentencePiece BPE (32k vocab, byte fallback) |

---

## 📦 Files in This Repo

- `checkpoints/` → Pretrained model state_dict + optimizer
- `safetensors/` → Final consolidated .safetensors file
- `logs/` → Training logs in JSONL
- `plots/` → Train/val loss curves
- `tokenizer.json` → HF-compatible tokenizer
- `spm.model` → SentencePiece model

---

## 🧪 Quick Usage (HF Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)
tok = AutoTokenizer.from_pretrained("Ashx098/Mini-LLM")

prompt = "Hello, how are you?"
inputs = tok(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(outputs[0], skip_special_tokens=True))
```

## 🚀 Training Details

### Optimizer
- **AdamW** (β1=0.9, β2=0.95, weight decay=0.1)
- **Learning rate**: 6e-4 (cosine annealing + warmup)

### Batch × Sequence
- **Global batch size** = 32
- **Sequence length** = 2048
- **Gradient accumulation** = 8

### Hardware
- Trained on 1× NVIDIA A100 80GB
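To make the hyperparameters above concrete, here is a minimal sketch of how AdamW, a linear-warmup + cosine schedule, and gradient accumulation could be wired together in PyTorch. It is not the repo's actual training code: `model`, `train_loader`, the warmup/total step counts, and the gradient-clipping value are placeholders you would replace with your own.

```python
import math
import torch

LR = 6e-4
GRAD_ACCUM = 8           # micro-batches per optimizer step (card: global batch 32)
WARMUP_STEPS = 2_000     # assumption: not stated on this card
TOTAL_STEPS = 30_000     # assumption: not stated on this card
MAX_GRAD_NORM = 1.0      # assumption: common default, not stated on this card

# model and train_loader are placeholders for your own model / data pipeline.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=LR, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay toward zero.
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model.train()
for step, (input_ids, labels) in enumerate(train_loader):
    # Assumes an HF-style forward that returns an object with a .loss attribute.
    loss = model(input_ids, labels=labels).loss / GRAD_ACCUM
    loss.backward()
    if (step + 1) % GRAD_ACCUM == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()
        scheduler.step()  # schedule is counted in optimizer steps
        optimizer.zero_grad(set_to_none=True)
```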

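For reference, the sketch below shows what the RMSNorm and SwiGLU components listed in the architecture table typically look like in PyTorch, using the dims from the table (embedding dim 384, MLP hidden dim 1536). It illustrates the general technique only and is not the implementation used in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by the RMS of the hidden vector (no mean subtraction)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: SiLU-gated projection up to hidden_dim, then back down."""
    def __init__(self, dim: int = 384, hidden_dim: int = 1536):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Quick shape check with the dims from the architecture table.
x = torch.randn(2, 16, 384)          # (batch, seq, embedding dim)
out = SwiGLU()(RMSNorm(384)(x))
print(out.shape)                     # torch.Size([2, 16, 384])
```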
## 📊 Training Curve

Final loss reached: ~3.25 (see `plots/` for the full train/val loss curves).

## 💬 Example Outputs

**Prompt**: "Hello, how are you"

**Output**: "Hello, how are you?"

**Prompt**: "Python is a programming language that"

**Output**: "Python is a programming language that allows the history..."

## ⚠️ Limitations

- Small model → limited reasoning; hallucinations are likely
- Not instruction-tuned
- Not suitable for production use
- Best viewed as a learning and research artifact

## 📜 License

MIT License: free for research, modification, and further training.

## 🙌 Credits

Developed by **Avinash Mynampati**, built from scratch using PyTorch and a custom training pipeline.

### Want to fine-tune or extend it?

You can:

- Train further on your own dataset
- Add LoRA adapters (see the sketch at the end of this card)
- Use it to learn attention, RoPE, SwiGLU, etc.
- Build a tiny instruction-tuned version (coming soon!)

## 📬 Contact

For questions or collaborations:

- **GitHub**: [Ashx098](https://github.com/Ashx098)
- **LinkedIn**: [Avinash Mynampati](https://linkedin.com/in/avinash-mynampati)
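As one possible way to add LoRA adapters (mentioned in the list above), the sketch below uses the `peft` library. Treat it as an assumption-heavy example: in particular, `target_modules` must match the attention projection names used by this custom architecture, so check the model code before running it.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Ashx098/Mini-LLM", trust_remote_code=True)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # Assumption: rename these to the actual q/v projection modules in this repo's code.
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```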