Chimera 5.3 – HYPER CPU Training (10,000+ tok/s target)
100% faithful implementation of the Chimera 5.x config. All 15 architectural components implemented in pure PyTorch, with true 1.58-bit ternary computation on CPU.
v5.3 NEW: 7 stacked training paradigms designed to push CPU training from ~50-200 tok/s to 10,000+ tok/s on a single CPU – targeting AGI-class LLM training without GPUs.
Tokenizer: splintr-rs (Rust) – o200k_base vocab (200,073 tokens, OpenAI o1/o3).
Repo Structure
The repo is now organized around the chimera/ package as the source of truth:
- chimera/ – model code, config helpers, package CLI wrappers, shared path helpers
- train.py – standard training entrypoint
- train_fast.py – cached-dataset training entrypoint
- train_hyper.py – hyper training entrypoint
- inference.py – generation entrypoint
- gguf_import.py – GGUF import entrypoint
- tests/ – smoke and config tests
You can still run the root scripts directly, or use packaged commands after install:
chimera-train --help
chimera-train-fast --help
chimera-train-hyper --help
chimera-infer --help
chimera-import-gguf --help
v5.3 – HYPER Training Paradigms
Seven orthogonal paradigms that stack multiplicatively for extreme CPU training speed:
| # | Paradigm | Speedup | Paper | Mechanism |
|---|---|---|---|---|
| P1 | GrowLength Curriculum | 4-8× | arxiv:2310.00576 | Start seq=16, grow to target. Short seqs → huge batches → way more tok/s |
| P2 | Reservoir Freezing | 1.5-2× | arxiv:2512.23145 | Freeze 50% of recurrent gates as random ternary. No grad = fewer FLOPs |
| P3 | Sparse MeZO | 3-5× | arxiv:2406.02913 | Perturb only top-1% most sensitive params. ZO signal quality improves with sparsity |
| P4 | Blockwise Pipeline | 1.3-2× | – | Pin layer-groups to core-groups; overlap forward passes |
| P5 | Fused Ternary Cache | 1.3× | – | Pre-materialise dense weights once; reuse for both MeZO forwards |
| P6 | Aggressive Token Packing | 1.1-1.3× | – | Zero padding waste; documents packed back-to-back with EOS (sketch below) |
| P7 | Progressive Layer Unfreeze | 1.5-2× | – | Train only top 25% of layers first; unfreeze downward |
Combined theoretical multiplier: P1(6×) × P2(1.7×) × P3(4×) × P5(1.3×) × P7(1.7×) ≈ 57-260×
Realistic target: 50-200 tok/s baseline → 3,000-15,000+ tok/s
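For P6, the packing idea is simple enough to sketch directly. The helper below is illustrative only (the function name and call pattern are assumptions, not the repo's dataloader): tokenized documents are joined with EOS and sliced into fixed-length blocks, so no position is spent on padding.

```python
# Illustrative P6-style packing (hypothetical helper, not the actual dataloader):
# concatenate tokenized documents separated by EOS and slice fixed-length blocks,
# so every position in every batch carries a real token.
def pack_documents(docs: list[list[int]], seq_len: int, eos_id: int) -> list[list[int]]:
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)          # document boundary marker
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```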
Quick Start – HYPER Training
# All 7 paradigms ON – maximum speed
python train_hyper.py --scale tiny --max_steps 5000 --all
# Cherry-pick specific paradigms
python train_hyper.py --scale tiny --max_steps 5000 \
--growlength --sparse-mezo --reservoir --fused-cache
# Benchmark: baseline vs hyper (side-by-side comparison)
python train_hyper.py --scale tiny --max_steps 100 --benchmark
# Full training run with all paradigms
OMP_NUM_THREADS=$(nproc) python train_hyper.py \
--scale small --seq_len 256 --max_steps 50000 \
--all --bf16 --compile \
--save_every 5000 --log_every 10
Paradigm Details
P1 – GrowLength Curriculum (arxiv:2310.00576)
Trains with progressively longer sequences. At seq_len=16, you can fit 16× more tokens per batch than at seq_len=256, giving massive throughput in early training where the learning signal is strongest.
Default schedule:
- 20% of training at seq_len = target/8
- 25% at target/4
- 25% at target/2
- 30% at full target
python train_hyper.py --growlength --seq_len 256
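A minimal sketch of the default schedule above (the thresholds mirror the list, but the function itself is illustrative rather than the train_hyper.py internals):

```python
# Illustrative GrowLength schedule matching the default splits above
# (hypothetical helper, not the actual train_hyper.py API).
def growlength_seq_len(step: int, max_steps: int, target_len: int = 256) -> int:
    progress = step / max_steps
    if progress < 0.20:            # first 20% of training
        return max(16, target_len // 8)
    if progress < 0.45:            # next 25%
        return target_len // 4
    if progress < 0.70:            # next 25%
        return target_len // 2
    return target_len              # final 30% at the full target length

# e.g. at step 5,000 of 50,000 with target 256 -> seq_len 32
```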
P2 – Reservoir Freezing (arxiv:2512.23145)
Inspired by GRC (Reservoir Computing for Language Models): freezes gate/forget projections in recurrent layers as random ternary matrices with unit spectral radius. These "reservoir" weights provide stable dynamics without needing gradient updates.
Targets:
- GatedDeltaNet: a_proj, b_proj (alpha/beta gates)
- mLSTM: fgate (forget gate)
- TitansMAC: alpha_proj (forgetting gate)
python train_hyper.py --reservoir --reservoir-ratio 0.5
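A rough sketch of the reservoir idea, assuming a plain nn.Linear gate projection. The helper name is hypothetical (the real logic lives in chimera/hyper.py), and the spectral norm is used here as a stand-in for the spectral radius:

```python
import torch

# Freeze a gate projection as a fixed random ternary "reservoir" matrix
# (hypothetical helper; illustrative of the mechanism, not the repo's code).
@torch.no_grad()
def freeze_as_reservoir(linear: torch.nn.Linear) -> None:
    w = torch.randint(-1, 2, linear.weight.shape).float()   # values in {-1, 0, +1}
    s_max = torch.linalg.svdvals(w)[0].clamp_min(1e-6)      # largest singular value
    linear.weight.copy_(w / s_max)                          # ~unit spectral norm
    linear.weight.requires_grad_(False)                     # no grad -> fewer FLOPs

# --reservoir-ratio 0.5 applies this treatment to roughly half of the targeted gates.
```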
P3 – Sparse MeZO (arxiv:2406.02913)
Standard MeZO perturbs all ~35M parameters, but most contribute near-zero gradient signal. Sparse MeZO identifies the top-K% most sensitive parameters (by weight magnitude) and perturbs only those. This dramatically reduces the variance of the ZO gradient estimate.
At 1% sparsity on a 35M model: only 350K params perturbed per step → ~100× better signal-to-noise per forward pass.
python train_hyper.py --sparse-mezo --mezo-sparsity 0.01
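A self-contained sketch of a Sparse-MeZO step under the assumptions above (top-k selection by weight magnitude, shared noise seed for both forwards; `loss_fn` is assumed to run the model on a fixed batch and return a scalar loss). Names and hyperparameters are illustrative, not the train_hyper.py internals:

```python
import torch

# Illustrative Sparse-MeZO step: perturb only the top-k% largest-magnitude
# weights, estimate the gradient from two forward passes, no backward.
@torch.no_grad()
def sparse_mezo_step(model, loss_fn, eps=1e-3, lr=1e-6, sparsity=0.01, seed=0):
    params = [p for p in model.parameters() if p.requires_grad]
    masks = []
    for p in params:                                   # top-k mask per tensor
        k = max(1, int(p.numel() * sparsity))
        thresh = p.abs().flatten().kthvalue(p.numel() - k + 1).values
        masks.append((p.abs() >= thresh).float())

    def perturb(scale):                                # same z for +eps / -eps / update
        torch.manual_seed(seed)
        for p, m in zip(params, masks):
            p.add_(scale * eps * torch.randn_like(p) * m)

    perturb(+1)
    loss_plus = loss_fn(model)
    perturb(-2)                                        # net -eps from the original weights
    loss_minus = loss_fn(model)
    perturb(+1)                                        # restore the original weights
    g = (loss_plus - loss_minus) / (2 * eps)           # projected ZO gradient estimate

    torch.manual_seed(seed)
    for p, m in zip(params, masks):
        p.add_(-lr * g * torch.randn_like(p) * m)      # SGD step along the sparse noise
    return (loss_plus + loss_minus) / 2
```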
P5 – Fused Ternary Cache
Before each MeZO dual-forward, pre-materialises all BitLinear packed+dense weight caches. Both forward passes then reuse the same buffers, eliminating redundant quantize → pack → unpack cycles.
python train_hyper.py --fused-cache
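A sketch of the caching pattern described above. The attribute and method names (`materialise_dense`, `dense_cache`) are assumptions, not the real BitLinear API, and the MeZO perturbation and weight update are assumed to happen elsewhere in the loop:

```python
import torch

# Materialise every BitLinear's dense ternary weight once per step, run both
# MeZO forwards against the cached buffers, then drop the caches.
@torch.no_grad()
def mezo_dual_forward_with_cache(model, loss_fn):
    bitlinears = [m for m in model.modules() if hasattr(m, "materialise_dense")]
    for m in bitlinears:
        m.dense_cache = m.materialise_dense()   # quantize -> pack -> unpack, once
    loss_plus = loss_fn(model)                  # forward #1 reuses dense_cache
    loss_minus = loss_fn(model)                 # forward #2 reuses dense_cache
    for m in bitlinears:
        m.dense_cache = None                    # invalidate before weights change
    return loss_plus, loss_minus
```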
P7 – Progressive Layer Unfreezing
Starts with only the top ~25% of layers trainable. Early training is cheap (forward through frozen layers is fast, no gradient storage). Gradually unfreezes deeper layers as training progresses.
python train_hyper.py --progressive-unfreeze --unfreeze-stages 4
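The staging logic is easy to sketch. The helper below is illustrative (for 28 layers and --unfreeze-stages 4, with the deepest layers unfreezing last), not the actual scheduler:

```python
# Illustrative 4-stage unfreeze schedule (hypothetical helper): returns the
# index of the first trainable layer; everything below it stays frozen.
def first_trainable_layer(step: int, max_steps: int, n_layers: int = 28, stages: int = 4) -> int:
    stage = min(stages - 1, int(stages * step / max_steps))    # 0 .. stages-1
    frozen_fraction = 1.0 - (stage + 1) / stages                # 0.75 -> 0.50 -> 0.25 -> 0.0
    return int(n_layers * frozen_fraction)

# Stage 0: layers 21-27 trainable (top ~25%); final stage: all 28 layers trainable.
```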
Files
chimera/
  __init__.py – Package exports (v5.3)
  config.py – Config loading / scaling
  hyper.py – NEW: 7 HYPER paradigm engine
  quantization.py – BitLinear (2-bit packed, C++ kernel, STE, N:M 2:4)
  layers.py – GatedDeltaNet, mLSTM, TitansMAC, TSPSpanKnot
  moe.py – MoELayer (sort-based dispatch)
  looping.py – ParcaeLoopController
  inference.py – SpanBank, STree, Grammar, EntropyValve, DebtLedger
  evolution.py – TTT, SemanticMemory, EpisodicCases, MetaGuidelines
  multimodal.py – VisionEncoder, AudioEncoder
  tokenizer.py – ChimeraTokenizer (splintr, o200k_base)
  model.py – Chimera51ForCausalLM
config.json – Full model config
train.py – Standard training (MeZO + AdamW)
train_fast.py – Fast training with pre-tokenized cache
train_hyper.py – NEW: HYPER training (7 paradigms, 10k+ tok/s)
inference.py – Inference / generation
Previous Versions
v5.1.4 – CPU Fast Path Audit
- Fixed package/runtime mismatch
- Added sparse MoELayer with expert-grouped dispatch
- Made C++ ternary extensions lazy-loaded
- Vectorized BitLinear AbsMean scaling
- Cached causal/triangular masks
- Reduced GatedDeltaNet clone churn
v5.1.3 – Fix Illegal Instruction Crash
- Removed -march=native from C++ JIT flags
- Runtime CPUID detection for AVX-512/AVX2
v5.1.2 – True Ternary Compute
- 2-bit packed uint8 weight storage (16× compression; see the sketch below)
- C++ unpack + MKL BLAS forward path
- MeZO sparse perturbation (skip ~33% zeros)
- STE backward with deep-zero masking
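As an illustration of the 2-bit storage format, a round-trip pack/unpack for ternary weights might look like the sketch below (the bit layout and code mapping are assumptions; the repo's C++ kernel may differ):

```python
import torch

# Pack four ternary values per uint8 (2 bits each): {-1, 0, +1} -> {0, 1, 2}.
def pack_ternary(w_ternary: torch.Tensor) -> torch.Tensor:
    codes = (w_ternary.flatten() + 1).to(torch.uint8).reshape(-1, 4)  # assumes numel % 4 == 0
    shifts = torch.tensor([0, 2, 4, 6], dtype=torch.uint8)
    return (codes << shifts).sum(dim=1, dtype=torch.uint8)            # 4 codes per byte

def unpack_ternary(packed: torch.Tensor) -> torch.Tensor:
    codes = torch.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], dim=1)
    return codes.flatten().float() - 1.0                              # back to {-1, 0, +1}
```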
Architecture (28 layers, 4 types)
Layer pattern: GD XM GD TM GD XM GD SK × 3.5
GD = Gated DeltaNet (14 layers) – arxiv:2412.06464
XM = xLSTM mLSTM (7 layers) – arxiv:2405.04517
TM = Titans MAC (4 layers) – arxiv:2501.00663
SK = TSP Span Knot (3 layers)
All linear layers use BitLinear (ternary 1.58-bit) with per-group AbsMean scaling.
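For reference, per-group AbsMean quantization in the style of BitNet b1.58 can be sketched as follows (the group size is illustrative; the repo's grouping and kernel details may differ):

```python
import torch

# Per-group AbsMean ternary quantization (BitNet b1.58-style sketch).
def absmean_ternary(w: torch.Tensor, group_size: int = 128):
    g = w.reshape(-1, group_size)                                # assumes numel % group_size == 0
    scale = g.abs().mean(dim=1, keepdim=True).clamp_min(1e-5)    # per-group AbsMean scale
    w_q = (g / scale).round().clamp_(-1, 1)                      # ternary {-1, 0, +1}
    return w_q.reshape(w.shape), scale

# The forward path uses w_q * scale; training keeps float master weights and a
# straight-through estimator (STE) through the non-differentiable rounding.
```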
Training Modes
HYPER (v5.3 – Recommended)
- 7 stacked paradigms for maximum CPU throughput
- Target: 10,000+ tok/s on 8-core CPU (tiny scale)
- Forward-only training (Sparse MeZO): no backward pass
- Memory = 2× model size (no activations, no gradients, no optimizer states)
- Each paradigm independently toggleable via CLI flags
MeZO (v5.1 – Standard)
- Standard zeroth-order optimization
- 2 forward passes per step, no backward
- Good for fine-tuning; ~50-200 tok/s on CPU
AdamW (v5.1 – Full backprop)
- Standard gradient descent with checkpointing
- Best convergence quality for pretraining from scratch
- ~10-50 tok/s on CPU
References
37 papers are indexed in config.json. Key additions for v5.3:
- GrowLength – Progressive sequence length training
- GRC MatMul-free LM – Reservoir computing for LMs
- Sparse MeZO – Sparse zeroth-order fine-tuning
- GaLore – Gradient low-rank projection
- QuZO – Quantized zeroth-order training
- SparAMX – AMX-accelerated sparse CPU kernels
Plus all previous references:
- Gated DeltaNet – NVIDIA
- xLSTM – NXAI/JKU
- Titans – Google
- Parcae – Stanford/Together
- BitNet b1.58 – Microsoft
- MeZO – Princeton