A newer version of this model is available: ray0rf1re/Nano-nano-4.6

🧠 Nano-nano v4.5

~255.7 M · LLaMA · Instruction-tuned · From scratch


Successor to Nano-nano v4.
Same architecture family, ~8% larger (~236 M → ~255.7 M parameters), trained from scratch on the 17 weighted datasets listed under Dataset Mix.


📋 Quick Facts

Architecture LLaMA (decoder-only)
Parameters ~255.7 M
Context length 2 048 tokens
Vocabulary 50,264 tokens
Training loss 5.1763
Eval score 16.7%
Trained on 0.08 B tokens
Hardware NVIDIA GTX 1080 8 GB (Pascal)
Trained 2026-05-09 22:50

🏗️ Architecture

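Standard LLaMA decoder-only transformer. Compared with v4, the MLP is ~8.3% wider (intermediate_size 2 688 → 2 912) and the stack is one layer deeper (14 → 15); hidden_size is unchanged at 896.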

Hyperparameter            v4        v4.5
Parameters                ~236 M    ~255.7 M
hidden_size               896       896
intermediate_size         2 688     2 912
num_hidden_layers         14        15
num_attention_heads       14        14
num_key_value_heads       14        14
head_dim                  64        64
vocab_size                50 264    50 264
max_position_embeddings   1 024     2 048
rms_norm_eps              1e-6      1e-6
rope_theta                10 000    10 000
hidden_act                SiLU      SiLU
tie_word_embeddings       False     False
attention_bias            False     False
mlp_bias                  False     False
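
The parameter totals can be cross-checked from the hyperparameters alone. The sketch below assumes the standard LLaMA layout (four bias-free attention projections, a gated three-matrix MLP, two RMSNorm weights per layer, one final norm, and separate input and output embeddings). The function name `llama_param_count` is illustrative and not part of any library.

```python
def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_layers, num_heads, num_kv_heads, head_dim,
                      tie_word_embeddings=False):
    """Count weights in a bias-free LLaMA-style decoder (assumed standard layout)."""
    q_and_o = 2 * hidden_size * num_heads * head_dim       # q_proj + o_proj
    k_and_v = 2 * hidden_size * num_kv_heads * head_dim    # k_proj + v_proj
    mlp = 3 * hidden_size * intermediate_size              # gate, up, down projections
    norms = 2 * hidden_size                                # two RMSNorm weights per layer
    per_layer = q_and_o + k_and_v + mlp + norms
    embeddings = vocab_size * hidden_size
    if not tie_word_embeddings:
        embeddings *= 2                                    # separate lm_head matrix
    return embeddings + num_layers * per_layer + hidden_size  # + final RMSNorm


# Values taken from the v4 and v4.5 columns of the hyperparameter table.
v4 = llama_param_count(50264, 896, 2688, 14, 14, 14, 64)
v45 = llama_param_count(50264, 896, 2912, 15, 14, 14, 64)
print(f"v4:   {v4:,}")
print(f"v4.5: {v45:,}")
print(f"growth: {v45 / v4 - 1:.1%}")
```

Under these assumptions the totals come to 236,211,584 and 255,681,664, which agrees with the ~236 M and ~255.7 M listed in the table.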

📊 Evaluation

Automatically evaluated after training across 5 capability dimensions.

Category Hits Score
Knowledge 0/5 🔴 0%
Reasoning 0/4 🔴 0%
Hallucination 0/4 🔴 0%
Instruction 2/4 🟡 50%
Coherence 1/3 🔴 33%
Overall 🔴 17%

Hallucination resistance — whether the model appropriately declines questions about future events, fictional entities, or impossible premises rather than confabulating.
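
The card does not state how the overall score is aggregated. The reported 16.7% is consistent with an unweighted mean of the five category scores rather than with pooled hits. The sketch below computes both; it is an inference from the table, not a description of the actual evaluation script.

```python
# Hits and prompt counts copied from the evaluation table.
results = {
    "Knowledge": (0, 5),
    "Reasoning": (0, 4),
    "Hallucination": (0, 4),
    "Instruction": (2, 4),
    "Coherence": (1, 3),
}

# Macro average: mean of per-category scores, so each category counts equally.
category_scores = {name: hits / total for name, (hits, total) in results.items()}
macro = sum(category_scores.values()) / len(category_scores)

# Micro average: pooled hits over pooled prompts.
micro = sum(h for h, _ in results.values()) / sum(t for _, t in results.values())

print(f"macro: {macro:.1%}")  # 16.7%
print(f"micro: {micro:.1%}")  # 15.0%
```

With only 3 to 5 prompts per category, a single hit moves a category score by 20 to 33 points, so these percentages are best read as a rough smoke test.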

[Figures: Category Scores · Hallucination · Training Curves]


🍳 Training

Setting Value
Hardware GTX 1080 8 GB · Pascal · CUDA compute capability 6.1
Precision fp32 weights / fp16 AMP (GradScaler)
Optimizer StovetopCooker (HyperNix, pre-Volta)
LR 0.0001 cosine decay
Warmup 6% of steps
Embedding freeze First 15% of steps
Effective batch 8 × 2048 = 16,384 tokens/step
Steps 5092
Total tokens 0.08 B
Grad clipping 1.0
Grad checkpointing
Peak VRAM 5.34 GB
HyperNix freezer · StovetopCooker · old_fridge · new_fridge · smoke_alarm · pans · smoker
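
The token budget and schedule settings above can be checked with a few lines of arithmetic. This is a minimal sketch, not the HyperNix implementation: it assumes linear warmup, cosine decay to zero, and rounding of the fractional step counts. The names `lr_at` and `embeddings_frozen` are hypothetical.

```python
import math

# Values copied from the training settings table.
BATCH_SEQS, SEQ_LEN, TOTAL_STEPS = 8, 2048, 5092
PEAK_LR, WARMUP_FRAC, FREEZE_FRAC = 1e-4, 0.06, 0.15

tokens_per_step = BATCH_SEQS * SEQ_LEN            # 16,384 tokens per step
total_tokens = tokens_per_step * TOTAL_STEPS      # 83,427,328, i.e. ~0.08 B
warmup_steps = round(WARMUP_FRAC * TOTAL_STEPS)   # rounding rule is an assumption
freeze_steps = round(FREEZE_FRAC * TOTAL_STEPS)   # rounding rule is an assumption


def lr_at(step):
    """Assumed schedule: linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


def embeddings_frozen(step):
    """True while the embedding matrix would be excluded from updates."""
    return step < freeze_steps


print(f"{total_tokens:,} tokens, warmup {warmup_steps} steps, freeze {freeze_steps} steps")
```

At roughly 83 M tokens for a ~255.7 M-parameter model, training sees about one token for every three parameters, which is in line with the high final loss reported in Quick Facts.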

Dataset Mix

Dataset Samples Weight Category
Roman1111111/claude-opus-4.6-10000x 10 k 2.5× Claude conversations
WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K 25 k 2.0× Reasoning / thinking
HuggingFaceH4/MATH-500 500 2.0× Competition math
lighteval/MATH-Hard 10 k 2.0× Hard math
garage-bAInd/Open-Platypus 25 k 1.8× Reasoning instruction
iamtarun/python_code_instructions_18k_alpaca 8 k 1.6× Python code
b-mc2/sql-create-context 6 k 1.4× SQL code
nvidia/OpenCodeInstruct 30 k 1.5× Code instruction
teknium/OpenHermes-2.5 30 k 1.5× General instruction
Amod/mental_health_counseling_conversations 5 k 1.2× Chat / counseling
ray0rf1re/FineWeb-Nano 50 k 1.0× Web text
tonytins/chat-dataset 10 k 1.0× Conversation
databricks/databricks-dolly-15k 15 k 1.0× Instruction following
mlabonne/guanaco-llama2-1k 1 k 1.0× General QA
ray0rf1re/hyper-pip 20 k 2.0× HyperNix pip data
HuggingFaceH4/ultrachat_200k 30 k 1.5× Multi-turn chat
fka/awesome-chatgpt-prompts 5 k 0.8× Prompt engineering
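
To see what the weights mean for the final mixture, the sketch below converts the table into effective shares. It assumes a weight multiplies how often a dataset's samples are drawn, which the card does not spell out. Shares are per sample, not per token.

```python
# (dataset, samples, weight) copied from the Dataset Mix table.
MIX = [
    ("Roman1111111/claude-opus-4.6-10000x", 10_000, 2.5),
    ("WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K", 25_000, 2.0),
    ("HuggingFaceH4/MATH-500", 500, 2.0),
    ("lighteval/MATH-Hard", 10_000, 2.0),
    ("garage-bAInd/Open-Platypus", 25_000, 1.8),
    ("iamtarun/python_code_instructions_18k_alpaca", 8_000, 1.6),
    ("b-mc2/sql-create-context", 6_000, 1.4),
    ("nvidia/OpenCodeInstruct", 30_000, 1.5),
    ("teknium/OpenHermes-2.5", 30_000, 1.5),
    ("Amod/mental_health_counseling_conversations", 5_000, 1.2),
    ("ray0rf1re/FineWeb-Nano", 50_000, 1.0),
    ("tonytins/chat-dataset", 10_000, 1.0),
    ("databricks/databricks-dolly-15k", 15_000, 1.0),
    ("mlabonne/guanaco-llama2-1k", 1_000, 1.0),
    ("ray0rf1re/hyper-pip", 20_000, 2.0),
    ("HuggingFaceH4/ultrachat_200k", 30_000, 1.5),
    ("fka/awesome-chatgpt-prompts", 5_000, 0.8),
]


def effective_shares(mix):
    """Share of weighted samples per dataset (assumes weight scales sampling)."""
    weighted = {name: samples * weight for name, samples, weight in mix}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


shares = effective_shares(MIX)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{share:6.1%}  {name}")
```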

🚀 Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ray0rf1re/Nano-nano_v4.5",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/Nano-nano_v4.5")

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    text   = f"### Instruction:\n{prompt}\n\n### Response:\n"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out    = model.generate(
        **inputs,
        max_new_tokens  = max_new_tokens,
        do_sample       = True,
        temperature     = 0.7,
        top_p           = 0.9,
        repetition_penalty = 1.1,
        pad_token_id    = tokenizer.eos_token_id,
    )
    new_ids = out[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_ids, skip_special_tokens=True).strip()

# Examples
print(generate("Write a Python function to reverse a linked list."))
print(generate("What is the capital of France?"))
print(generate("Explain gradient descent in simple terms."))

⚠️ Limitations

  • Context limited to 2 048 tokens; unsuitable for long documents
  • Trained on 0.08 B tokens — far less than production models
  • May hallucinate on obscure or out-of-distribution queries
  • Not RLHF/DPO aligned — outputs may vary in safety and tone
  • Pascal GPU constraint (GTX 1080): fp32/fp16 only, no bf16
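
Because of the context limit, the prompt and the generated tokens must together fit in 2 048 positions (max_position_embeddings in the architecture table). A minimal sketch of one way to budget for that is shown below. The helper `clip_prompt_ids` is hypothetical and works on a plain list of token ids, such as `tokenizer(text)["input_ids"]`.

```python
CONTEXT_LEN = 2048  # max_position_embeddings from the architecture table


def clip_prompt_ids(prompt_ids, max_new_tokens, context_len=CONTEXT_LEN):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for a prompt")
    return prompt_ids[-budget:]


# Example with dummy ids: a 3 000-token prompt and 256 new tokens.
clipped = clip_prompt_ids(list(range(3000)), max_new_tokens=256)
print(len(clipped))  # 1792
```

Clipping from the left drops the start of the prompt, including the "### Instruction:" header, so for instruction prompts it may be preferable to shorten the user text before applying the template.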

📜 Citation

@misc{nano-nano-v45,
  author       = {ray0rf1re},
  title        = {Nano-nano v4.5: Compact LLaMA-Family Causal LM},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {https://huggingface.co/ray0rf1re/Nano-nano_v4.5},
}