Chimera 5.3 – HYPER CPU Training (10,000+ tok/s target)

100% faithful implementation of the Chimera 5.x config. All 15 architectural components implemented in pure PyTorch, with true 1.58-bit ternary computation on CPU.

v5.3 NEW: 7 stacked training paradigms designed to push CPU training from ~50-200 tok/s to 10,000+ tok/s on a single CPU – targeting AGI-class LLM training without GPUs.

Tokenizer: splintr-rs (Rust) – o200k_base vocab (200,073 tokens, OpenAI o1/o3).

Repo Structure

The repo is now organized around the chimera/ package as the source of truth:

  • chimera/ – model code, config helpers, package CLI wrappers, shared path helpers
  • train.py – standard training entrypoint
  • train_fast.py – cached-dataset training entrypoint
  • train_hyper.py – hyper training entrypoint
  • inference.py – generation entrypoint
  • gguf_import.py – GGUF import entrypoint
  • tests/ – smoke and config tests

You can still run the root scripts directly, or use packaged commands after install:

chimera-train --help
chimera-train-fast --help
chimera-train-hyper --help
chimera-infer --help
chimera-import-gguf --help

v5.3 – HYPER Training Paradigms

Seven orthogonal paradigms that stack multiplicatively for extreme CPU training speed:

#  | Paradigm                   | Speedup  | Paper            | Mechanism
P1 | GrowLength Curriculum      | 4-8×     | arxiv:2310.00576 | Start at seq=16, grow to the target length; short sequences → larger batches → far more tok/s
P2 | Reservoir Freezing         | 1.5-2×   | arxiv:2512.23145 | Freeze 50% of recurrent gates as random ternary; no gradients = fewer FLOPs
P3 | Sparse MeZO                | 3-5×     | arxiv:2406.02913 | Perturb only the top 1% most sensitive params; ZO signal quality ∝ sparsity
P4 | Blockwise Pipeline         | 1.3-2×   | -                | Pin layer groups to core groups; overlap forward passes
P5 | Fused Ternary Cache        | 1.3×     | -                | Pre-materialise dense weights once; reuse for both MeZO forwards
P6 | Aggressive Token Packing   | 1.1-1.3× | -                | Zero padding waste; documents packed back-to-back with EOS
P7 | Progressive Layer Unfreeze | 1.5-2×   | -                | Train only the top 25% of layers first; unfreeze downward

Combined theoretical multiplier: P1(6×) × P2(1.7×) × P3(4×) × P5(1.3×) × P7(1.7×) ≈ 57-260×

Realistic target: 50-200 tok/s baseline → 3,000-15,000+ tok/s

Quick Start – HYPER Training

# All 7 paradigms ON – maximum speed
python train_hyper.py --scale tiny --max_steps 5000 --all

# Cherry-pick specific paradigms
python train_hyper.py --scale tiny --max_steps 5000 \
    --growlength --sparse-mezo --reservoir --fused-cache

# Benchmark: baseline vs hyper (side-by-side comparison)
python train_hyper.py --scale tiny --max_steps 100 --benchmark

# Full training run with all paradigms
OMP_NUM_THREADS=$(nproc) python train_hyper.py \
    --scale small --seq_len 256 --max_steps 50000 \
    --all --bf16 --compile \
    --save_every 5000 --log_every 10

Paradigm Details

P1 – GrowLength Curriculum (arxiv:2310.00576)

Trains with progressively longer sequences. At seq_len=16 you can fit 16× more sequences per batch than at seq_len=256, giving a large throughput boost early in training, where the learning signal is strongest.

Default schedule:

  • 20% of training at seq_len = target/8
  • 25% at target/4
  • 25% at target/2
  • 30% at full target
python train_hyper.py --growlength --seq_len 256
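
A minimal sketch of what such a curriculum could look like in Python (the function name and exact breakpoints are illustrative; train_hyper.py may implement the schedule differently):

# Sketch: map training progress to a sequence length using the default
# 20/25/25/30% breakpoints described above (illustrative, not the exact code).
def growlength_seq_len(step: int, max_steps: int, target_len: int = 256) -> int:
    progress = step / max_steps
    if progress < 0.20:
        return target_len // 8        # first 20% of training
    if progress < 0.45:
        return target_len // 4        # next 25%
    if progress < 0.70:
        return target_len // 2        # next 25%
    return target_len                 # final 30% at the full target

# With a fixed per-step token budget, shorter sequences mean more sequences per batch:
tokens_per_step = 8192
for step in (0, 2500, 5000, 9999):
    seq_len = growlength_seq_len(step, max_steps=10_000)
    print(step, seq_len, tokens_per_step // seq_len, "sequences")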

P2 – Reservoir Freezing (arxiv:2512.23145)

Inspired by GRC (Reservoir Computing for Language Models): freezes gate/forget projections in recurrent layers as random ternary matrices with unit spectral radius. These "reservoir" weights provide stable dynamics without needing gradient updates.

Targets:

  • GatedDeltaNet: a_proj, b_proj (alpha/beta gates)
  • mLSTM: fgate (forget gate)
  • TitansMAC: alpha_proj (forgetting gate)
python train_hyper.py --reservoir --reservoir-ratio 0.5
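
A rough PyTorch sketch of the idea (the helper name is hypothetical, and the spectral norm is used as a simple proxy for unit spectral radius; see chimera/hyper.py for the actual implementation):

import torch

# Sketch: overwrite a gate projection with a frozen random ternary "reservoir"
# matrix rescaled to roughly unit spectral norm. Hypothetical helper, not the
# actual chimera API.
def make_reservoir_(linear: torch.nn.Linear, zero_frac: float = 0.5) -> None:
    w = torch.randint(-1, 2, linear.weight.shape).float()   # values in {-1, 0, +1}
    w[torch.rand_like(w) < zero_frac] = 0.0                  # extra sparsity
    sigma = torch.linalg.matrix_norm(w, ord=2)               # largest singular value
    if sigma > 0:
        w /= sigma                                           # stable recurrent dynamics
    with torch.no_grad():
        linear.weight.copy_(w)
    linear.weight.requires_grad_(False)                      # no grads, no optimizer state

gate = torch.nn.Linear(512, 512, bias=False)
make_reservoir_(gate)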

P3 – Sparse MeZO (arxiv:2406.02913)

Standard MeZO perturbs all ~35M parameters – most of them contribute near-zero gradient signal. Sparse MeZO identifies the top-K% most sensitive parameters (by weight magnitude) and perturbs only those, which dramatically reduces the variance of the zeroth-order gradient estimate.

At 1% sparsity on a 35M-parameter model, only ~350K parameters are perturbed per step (100× fewer), giving a much higher signal-to-noise ratio per forward pass.

python train_hyper.py --sparse-mezo --mezo-sparsity 0.01
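
The core mechanics, sketched for a single weight tensor (eps, lr, loss_fn, and the choice of tensor are placeholders; the real training loop iterates over all trainable parameters):

import torch

# Sketch of one Sparse MeZO step on a single tensor: perturb only the top-k
# entries by magnitude, run two forward passes, and move along the same noise.
def sparse_mezo_step(model, loss_fn, batch, sparsity=0.01, eps=1e-3, lr=1e-6):
    w = model.lm_head.weight                       # illustrative choice of tensor
    k = max(1, int(sparsity * w.numel()))
    idx = w.abs().flatten().topk(k).indices        # top-k entries by magnitude
    mask = torch.zeros(w.numel(), dtype=torch.bool, device=w.device)
    mask[idx] = True
    mask = mask.view_as(w)

    seed = torch.seed()                            # reuse identical noise for +eps / -eps

    def perturb(scale):
        torch.manual_seed(seed)
        z = torch.randn_like(w) * mask
        w.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)
        loss_plus = loss_fn(model, batch)
        perturb(-2.0)
        loss_minus = loss_fn(model, batch)
        perturb(+1.0)                              # restore the original weights
        grad_scale = (loss_plus - loss_minus) / (2 * eps)
        torch.manual_seed(seed)
        z = torch.randn_like(w) * mask
        w.add_(-lr * grad_scale * z)               # SPSA-style update along the noise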

P5 – Fused Ternary Cache

Before each MeZO dual-forward, pre-materialises all BitLinear packed+dense weight caches. Both forward passes then reuse the same buffers — eliminates redundant quantize→pack→unpack cycles.

python train_hyper.py --fused-cache
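
The pattern is essentially "compute the expensive derived buffers once per step, reuse them for both forward passes"; a sketch, where method names like materialize_dense_cache / clear_dense_cache are hypothetical placeholders rather than the actual BitLinear API:

# Sketch of the fused-cache pattern around a MeZO dual forward.
def with_fused_ternary_cache(model, run_dual_forward):
    bitlinears = [m for m in model.modules() if type(m).__name__ == "BitLinear"]
    for m in bitlinears:
        m.materialize_dense_cache()    # quantize/unpack the packed ternary weights once
    try:
        return run_dual_forward()      # both +eps and -eps forwards reuse the buffers
    finally:
        for m in bitlinears:
            m.clear_dense_cache()      # free the dense buffers after the step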

P7 – Progressive Layer Unfreezing

Starts with only the top ~25% of layers trainable. Early training is cheap (forward through frozen layers is fast, no gradient storage). Gradually unfreezes deeper layers as training progresses.

python train_hyper.py --progressive-unfreeze --unfreeze-stages 4
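
A compact sketch of a stage schedule (model.layers and the stage arithmetic are illustrative; only the --unfreeze-stages flag above is from the actual CLI):

# Sketch: with 4 stages over 28 layers, stage 0 trains only the top 7 layers,
# and each later stage unfreezes the next block of deeper layers.
def apply_unfreeze_stage(model, step: int, max_steps: int, stages: int = 4) -> None:
    layers = list(model.layers)                              # e.g. 28 blocks
    stage = min(stages - 1, stages * step // max_steps)
    first_trainable = len(layers) * (stages - 1 - stage) // stages
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad_(i >= first_trainable)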

Files

chimera/
  __init__.py          – Package exports (v5.3)
  config.py            – Config loading / scaling
  hyper.py             – ★ NEW: 7 HYPER paradigm engine
  quantization.py      – BitLinear (2-bit packed, C++ kernel, STE, N:M 2:4)
  layers.py            – GatedDeltaNet, mLSTM, TitansMAC, TSPSpanKnot
  moe.py               – MoELayer (sort-based dispatch)
  looping.py           – ParcaeLoopController
  inference.py         – SpanBank, STree, Grammar, EntropyValve, DebtLedger
  evolution.py         – TTT, SemanticMemory, EpisodicCases, MetaGuidelines
  multimodal.py        – VisionEncoder, AudioEncoder
  tokenizer.py         – ChimeraTokenizer (splintr, o200k_base)
  model.py             – Chimera51ForCausalLM
config.json            – Full model config
train.py               – Standard training (MeZO + AdamW)
train_fast.py          – Fast training with pre-tokenized cache
train_hyper.py         – ★ NEW: HYPER training (7 paradigms, 10k+ tok/s)
inference.py           – Inference / generation

Previous Versions

v5.1.4 – CPU Fast Path Audit

  • Fixed package/runtime mismatch
  • Added sparse MoELayer with expert-grouped dispatch
  • Made C++ ternary extensions lazy-loaded
  • Vectorized BitLinear AbsMean scaling
  • Cached causal/triangular masks
  • Reduced GatedDeltaNet clone churn

v5.1.3 – Fix Illegal Instruction Crash

  • Removed -march=native from C++ JIT flags
  • Runtime CPUID detection for AVX-512/AVX2

v5.1.2 – True Ternary Compute

  • 2-bit packed uint8 weight storage (16× compression)
  • C++ unpack + MKL BLAS forward path
  • MeZO sparse perturbation (skip ~33% zeros)
  • STE backward with deep-zero masking

Architecture (28 layers, 4 types)

Layer pattern: GD XM GD TM GD XM GD SK × 3.5
  GD = Gated DeltaNet (14 layers) – arxiv:2412.06464
  XM = xLSTM mLSTM (7 layers) – arxiv:2405.04517
  TM = Titans MAC (4 layers) – arxiv:2501.00663
  SK = TSP Span Knot (3 layers)
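
Assuming the half-repeat is the first half of the pattern (which matches the counts above), the stack can be reproduced with a quick check:

# The 8-layer pattern repeated 3.5 times (3 full repeats + its first half)
# gives 28 layers with the per-type counts listed above.
pattern = ["GD", "XM", "GD", "TM", "GD", "XM", "GD", "SK"]
layers = pattern * 3 + pattern[:4]
assert len(layers) == 28
print({t: layers.count(t) for t in ("GD", "XM", "TM", "SK")})
# {'GD': 14, 'XM': 7, 'TM': 4, 'SK': 3}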

All linear layers use BitLinear (ternary 1.58-bit) with per-group AbsMean scaling.
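
For reference, AbsMean ternary quantization (BitNet b1.58 style) looks roughly like this; the per-group layout below is illustrative and may differ from the repo's BitLinear:

import torch

# Sketch of per-group AbsMean ternary quantization: scale each group by its
# mean absolute value, then round-and-clip to {-1, 0, +1}.
def absmean_ternary(w: torch.Tensor, group_size: int = 128):
    groups = w.reshape(-1, group_size)
    scale = groups.abs().mean(dim=1, keepdim=True).clamp(min=1e-8)
    q = (groups / scale).round().clamp_(-1, 1)        # ternary values
    return q.reshape(w.shape), scale                   # scale is kept for dequantization

w = torch.randn(256, 256)
q, scale = absmean_ternary(w)
print(sorted(q.unique().tolist()))                     # [-1.0, 0.0, 1.0]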


Training Modes

HYPER (v5.3 – Recommended)

  • 7 stacked paradigms for maximum CPU throughput
  • Target: 10,000+ tok/s on 8-core CPU (tiny scale)
  • Forward-only training (Sparse MeZO): no backward pass
  • Memory = 2× model size (no activations, no gradients, no optimizer states)
  • Each paradigm independently toggleable via CLI flags

MeZO (v5.1 – Standard)

  • Standard zeroth-order optimization
  • 2 forward passes per step, no backward
  • Good for fine-tuning; ~50-200 tok/s on CPU

AdamW (v5.1 – Full backprop)

  • Standard gradient descent with checkpointing
  • Best convergence quality for pretraining from scratch
  • ~10-50 tok/s on CPU

References

37 papers indexed in config.json under §. Key additions for v5.3:

  • GrowLength – Progressive sequence length training
  • GRC MatMul-free LM – Reservoir computing for LMs
  • Sparse MeZO – Sparse zeroth-order fine-tuning
  • GaLore – Gradient low-rank projection
  • QuZO – Quantized zeroth-order training
  • SparAMX – AMX-accelerated sparse CPU kernels

Plus all previous references.
