Chimera 5.3 – HYPER CPU Training (10,000+ tok/s target)
100% faithful implementation of the Chimera 5.x config. All 15 architectural components implemented in pure PyTorch, with true 1.58-bit ternary computation on CPU.
v5.3 NEW: 7 stacked training paradigms designed to push CPU training from ~50-200 tok/s to 10,000+ tok/s on a single CPU – targeting AGI-class LLM training without GPUs.
Tokenizer: splintr-rs (Rust) – o200k_base vocab (200,073 tokens, OpenAI o1/o3).
Repo Structure
The repo is now organized around the chimera/ package as the source of truth:
- chimera/ – model code, config helpers, package CLI wrappers, shared path helpers
- train.py – standard training entrypoint
- train_fast.py – cached-dataset training entrypoint
- train_hyper.py – hyper training entrypoint
- inference.py – generation entrypoint
- gguf_import.py – GGUF import entrypoint
- tests/ – smoke and config tests
You can still run the root scripts directly, or use packaged commands after install:
chimera-train --help
chimera-train-fast --help
chimera-train-hyper --help
chimera-infer --help
chimera-import-gguf --help
v5.3 – HYPER Training Paradigms
Seven orthogonal paradigms that stack multiplicatively for extreme CPU training speed:
| # | Paradigm | Speedup | Paper | Mechanism |
|---|---|---|---|---|
| P1 | GrowLength Curriculum | 4-8× | arxiv:2310.00576 | Start seq=16, grow to target. Short seqs → huge batches → way more tok/s |
| P2 | Reservoir Freezing | 1.5-2× | arxiv:2512.23145 | Freeze 50% of recurrent gates as random ternary. No grad = fewer FLOPs |
| P3 | Sparse MeZO | 3-5× | arxiv:2406.02913 | Perturb only top-1% most sensitive params. ZO signal quality improves with sparsity |
| P4 | Blockwise Pipeline | 1.3-2× | – | Pin layer-groups to core-groups; overlap forward passes |
| P5 | Fused Ternary Cache | 1.3× | – | Pre-materialise dense weights once; reuse for both MeZO forwards |
| P6 | Aggressive Token Packing | 1.1-1.3× | – | Zero padding waste; documents packed back-to-back with EOS (sketch below) |
| P7 | Progressive Layer Unfreeze | 1.5-2× | – | Train only top 25% of layers first; unfreeze downward |
Combined theoretical multiplier: P1(6×) × P2(1.7×) × P3(4×) × P5(1.3×) × P7(1.7×) ≈ 57-260×
Realistic target: 50-200 tok/s baseline → 3,000-15,000+ tok/s
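For P6, the packing idea is simple enough to sketch directly. The helper below is illustrative only (the function name and call pattern are assumptions, not the repo's dataloader): tokenized documents are joined with EOS and sliced into fixed-length blocks, so no position is spent on padding.

```python
# Illustrative P6-style packing (hypothetical helper, not the actual dataloader):
# concatenate tokenized documents separated by EOS and slice fixed-length blocks,
# so every position in every batch carries a real token.
def pack_documents(docs: list[list[int]], seq_len: int, eos_id: int) -> list[list[int]]:
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)          # document boundary marker
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```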
Quick Start – HYPER Training
# All 7 paradigms ON – maximum speed
python train_hyper.py --scale tiny --max_steps 5000 --all
# Cherry-pick specific paradigms
python train_hyper.py --scale tiny --max_steps 5000 \
--growlength --sparse-mezo --reservoir --fused-cache
# Benchmark: baseline vs hyper (side-by-side comparison)
python train_hyper.py --scale tiny --max_steps 100 --benchmark
# Full training run with all paradigms
OMP_NUM_THREADS=$(nproc) python train_hyper.py \
--scale small --seq_len 256 --max_steps 50000 \
--all --bf16 --compile \
--save_every 5000 --log_every 10
Paradigm Details
P1 – GrowLength Curriculum (arxiv:2310.00576)
Trains with progressively longer sequences. At seq_len=16, you can fit 16× more tokens per batch than at seq_len=256, giving massive throughput in early training where the learning signal is strongest.
Default schedule:
- 20% of training at seq_len = target/8
- 25% at target/4
- 25% at target/2
- 30% at full target
python train_hyper.py --growlength --seq_len 256
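A minimal sketch of the default schedule above (the thresholds mirror the list, but the function itself is illustrative rather than the train_hyper.py internals):

```python
# Illustrative GrowLength schedule matching the default splits above
# (hypothetical helper, not the actual train_hyper.py API).
def growlength_seq_len(step: int, max_steps: int, target_len: int = 256) -> int:
    progress = step / max_steps
    if progress < 0.20:            # first 20% of training
        return max(16, target_len // 8)
    if progress < 0.45:            # next 25%
        return target_len // 4
    if progress < 0.70:            # next 25%
        return target_len // 2
    return target_len              # final 30% at the full target length

# e.g. at step 5,000 of 50,000 with target 256 -> seq_len 32
```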
P2 – Reservoir Freezing (arxiv:2512.23145)
Inspired by GRC (Reservoir Computing for Language Models): freezes gate/forget projections in recurrent layers as random ternary matrices with unit spectral radius. These "reservoir" weights provide stable dynamics without needing gradient updates.
Targets:
- GatedDeltaNet: a_proj, b_proj (alpha/beta gates)
- mLSTM: fgate (forget gate)
- TitansMAC: alpha_proj (forgetting gate)
python train_hyper.py --reservoir --reservoir-ratio 0.5
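A rough sketch of the reservoir idea, assuming a plain nn.Linear gate projection. The helper name is hypothetical (the real logic lives in chimera/hyper.py), and the spectral norm is used here as a stand-in for the spectral radius:

```python
import torch

# Freeze a gate projection as a fixed random ternary "reservoir" matrix
# (hypothetical helper; illustrative of the mechanism, not the repo's code).
@torch.no_grad()
def freeze_as_reservoir(linear: torch.nn.Linear) -> None:
    w = torch.randint(-1, 2, linear.weight.shape).float()   # values in {-1, 0, +1}
    s_max = torch.linalg.svdvals(w)[0].clamp_min(1e-6)      # largest singular value
    linear.weight.copy_(w / s_max)                          # ~unit spectral norm
    linear.weight.requires_grad_(False)                     # no grad -> fewer FLOPs

# --reservoir-ratio 0.5 applies this treatment to roughly half of the targeted gates.
```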
P3 – Sparse MeZO (arxiv:2406.02913)
Standard MeZO perturbs all ~35M parameters, but most contribute near-zero gradient signal. Sparse MeZO identifies the top-K% most sensitive parameters (by weight magnitude) and perturbs only those. This dramatically reduces the variance of the ZO gradient estimate.
At 1% sparsity on a 35M model: only 350K params perturbed per step → ~100× better signal-to-noise per forward pass.
python train_hyper.py --sparse-mezo --mezo-sparsity 0.01
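A self-contained sketch of a Sparse-MeZO step under the assumptions above (top-k selection by weight magnitude, shared noise seed for both forwards; `loss_fn` is assumed to run the model on a fixed batch and return a scalar loss). Names and hyperparameters are illustrative, not the train_hyper.py internals:

```python
import torch

# Illustrative Sparse-MeZO step: perturb only the top-k% largest-magnitude
# weights, estimate the gradient from two forward passes, no backward.
@torch.no_grad()
def sparse_mezo_step(model, loss_fn, eps=1e-3, lr=1e-6, sparsity=0.01, seed=0):
    params = [p for p in model.parameters() if p.requires_grad]
    masks = []
    for p in params:                                   # top-k mask per tensor
        k = max(1, int(p.numel() * sparsity))
        thresh = p.abs().flatten().kthvalue(p.numel() - k + 1).values
        masks.append((p.abs() >= thresh).float())

    def perturb(scale):                                # same z for +eps / -eps / update
        torch.manual_seed(seed)
        for p, m in zip(params, masks):
            p.add_(scale * eps * torch.randn_like(p) * m)

    perturb(+1)
    loss_plus = loss_fn(model)
    perturb(-2)                                        # net -eps from the original weights
    loss_minus = loss_fn(model)
    perturb(+1)                                        # restore the original weights
    g = (loss_plus - loss_minus) / (2 * eps)           # projected ZO gradient estimate

    torch.manual_seed(seed)
    for p, m in zip(params, masks):
        p.add_(-lr * g * torch.randn_like(p) * m)      # SGD step along the sparse noise
    return (loss_plus + loss_minus) / 2
```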
P5 – Fused Ternary Cache
Before each MeZO dual-forward, pre-materialises all BitLinear packed+dense weight caches. Both forward passes then reuse the same buffers, eliminating redundant quantize → pack → unpack cycles.
python train_hyper.py --fused-cache
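A sketch of the caching pattern described above. The attribute and method names (`materialise_dense`, `dense_cache`) are assumptions, not the real BitLinear API, and the MeZO perturbation and weight update are assumed to happen elsewhere in the loop:

```python
import torch

# Materialise every BitLinear's dense ternary weight once per step, run both
# MeZO forwards against the cached buffers, then drop the caches.
@torch.no_grad()
def mezo_dual_forward_with_cache(model, loss_fn):
    bitlinears = [m for m in model.modules() if hasattr(m, "materialise_dense")]
    for m in bitlinears:
        m.dense_cache = m.materialise_dense()   # quantize -> pack -> unpack, once
    loss_plus = loss_fn(model)                  # forward #1 reuses dense_cache
    loss_minus = loss_fn(model)                 # forward #2 reuses dense_cache
    for m in bitlinears:
        m.dense_cache = None                    # invalidate before weights change
    return loss_plus, loss_minus
```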
P7 – Progressive Layer Unfreezing
Starts with only the top ~25% of layers trainable. Early training is cheap (forward through frozen layers is fast, no gradient storage). Gradually unfreezes deeper layers as training progresses.
python train_hyper.py --progressive-unfreeze --unfreeze-stages 4
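The staging logic is easy to sketch. The helper below is illustrative (for 28 layers and --unfreeze-stages 4, with the deepest layers unfreezing last), not the actual scheduler:

```python
# Illustrative 4-stage unfreeze schedule (hypothetical helper): returns the
# index of the first trainable layer; everything below it stays frozen.
def first_trainable_layer(step: int, max_steps: int, n_layers: int = 28, stages: int = 4) -> int:
    stage = min(stages - 1, int(stages * step / max_steps))    # 0 .. stages-1
    frozen_fraction = 1.0 - (stage + 1) / stages                # 0.75 -> 0.50 -> 0.25 -> 0.0
    return int(n_layers * frozen_fraction)

# Stage 0: layers 21-27 trainable (top ~25%); final stage: all 28 layers trainable.
```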
Files
chimera/
  __init__.py – Package exports (v5.3)
  config.py – Config loading / scaling
  hyper.py – NEW: 7 HYPER paradigm engine
  quantization.py – BitLinear (2-bit packed, C++ kernel, STE, N:M 2:4)
  layers.py – GatedDeltaNet, mLSTM, TitansMAC, TSPSpanKnot
  moe.py – MoELayer (sort-based dispatch)
  looping.py – ParcaeLoopController
  inference.py – SpanBank, STree, Grammar, EntropyValve, DebtLedger
  evolution.py – TTT, SemanticMemory, EpisodicCases, MetaGuidelines
  multimodal.py – VisionEncoder, AudioEncoder
  tokenizer.py – ChimeraTokenizer (splintr, o200k_base)
  model.py – Chimera51ForCausalLM
config.json – Full model config
train.py – Standard training (MeZO + AdamW)
train_fast.py – Fast training with pre-tokenized cache
train_hyper.py – NEW: HYPER training (7 paradigms, 10k+ tok/s)
inference.py – Inference / generation
Previous Versions
v5.1.4 – CPU Fast Path Audit
- Fixed package/runtime mismatch
- Added sparse MoELayer with expert-grouped dispatch
- Made C++ ternary extensions lazy-loaded
- Vectorized BitLinear AbsMean scaling
- Cached causal/triangular masks
- Reduced GatedDeltaNet clone churn
v5.1.3 – Fix Illegal Instruction Crash
- Removed -march=native from C++ JIT flags
- Runtime CPUID detection for AVX-512/AVX2
v5.1.2 – True Ternary Compute
- 2-bit packed uint8 weight storage (16× compression; see the sketch below)
- C++ unpack + MKL BLAS forward path
- MeZO sparse perturbation (skip ~33% zeros)
- STE backward with deep-zero masking
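As an illustration of the 2-bit storage format, a round-trip pack/unpack for ternary weights might look like the sketch below (the bit layout and code mapping are assumptions; the repo's C++ kernel may differ):

```python
import torch

# Pack four ternary values per uint8 (2 bits each): {-1, 0, +1} -> {0, 1, 2}.
def pack_ternary(w_ternary: torch.Tensor) -> torch.Tensor:
    codes = (w_ternary.flatten() + 1).to(torch.uint8).reshape(-1, 4)  # assumes numel % 4 == 0
    shifts = torch.tensor([0, 2, 4, 6], dtype=torch.uint8)
    return (codes << shifts).sum(dim=1, dtype=torch.uint8)            # 4 codes per byte

def unpack_ternary(packed: torch.Tensor) -> torch.Tensor:
    codes = torch.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], dim=1)
    return codes.flatten().float() - 1.0                              # back to {-1, 0, +1}
```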
Architecture (28 layers, 4 types)
Layer pattern: GD XM GD TM GD XM GD SK × 3.5
GD = Gated DeltaNet (14 layers) – arxiv:2412.06464
XM = xLSTM mLSTM (7 layers) – arxiv:2405.04517
TM = Titans MAC (4 layers) – arxiv:2501.00663
SK = TSP Span Knot (3 layers)
All linear layers use BitLinear (ternary 1.58-bit) with per-group AbsMean scaling.
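For reference, per-group AbsMean quantization in the style of BitNet b1.58 can be sketched as follows (the group size is illustrative; the repo's grouping and kernel details may differ):

```python
import torch

# Per-group AbsMean ternary quantization (BitNet b1.58-style sketch).
def absmean_ternary(w: torch.Tensor, group_size: int = 128):
    g = w.reshape(-1, group_size)                                # assumes numel % group_size == 0
    scale = g.abs().mean(dim=1, keepdim=True).clamp_min(1e-5)    # per-group AbsMean scale
    w_q = (g / scale).round().clamp_(-1, 1)                      # ternary {-1, 0, +1}
    return w_q.reshape(w.shape), scale

# The forward path uses w_q * scale; training keeps float master weights and a
# straight-through estimator (STE) through the non-differentiable rounding.
```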
Training Modes
HYPER (v5.3 – Recommended)
- 7 stacked paradigms for maximum CPU throughput
- Target: 10,000+ tok/s on 8-core CPU (tiny scale)
- Forward-only training (Sparse MeZO): no backward pass
- Memory = 2× model size (no activations, no gradients, no optimizer states)
- Each paradigm independently toggleable via CLI flags
MeZO (v5.1 – Standard)
- Standard zeroth-order optimization
- 2 forward passes per step, no backward
- Good for fine-tuning; ~50-200 tok/s on CPU
AdamW (v5.1 – Full backprop)
- Standard gradient descent with checkpointing
- Best convergence quality for pretraining from scratch
- ~10-50 tok/s on CPU
References
37 papers are indexed in config.json. Key additions for v5.3:
- GrowLength – Progressive sequence length training
- GRC MatMul-free LM – Reservoir computing for LMs
- Sparse MeZO – Sparse zeroth-order fine-tuning
- GaLore – Gradient low-rank projection
- QuZO – Quantized zeroth-order training
- SparAMX – AMX-accelerated sparse CPU kernels
Plus all previous references:
- Gated DeltaNet – NVIDIA
- xLSTM – NXAI/JKU
- Titans – Google
- Parcae – Stanford/Together
- BitNet b1.58 – Microsoft
- MeZO – Princeton