almaghrabima/SARFTokenizer

sentencepiece-style

6.92 MB

1 contributor

History: 61 commits

Add 'lower/higher is better' captions under the two charts

8446f73 verified 3 days ago

.gitattributes

1.78 kB
Add characters per 1M tokens chart 3 days ago
BENCHMARK.md

1.74 kB
Promote v0.3.1 to main (4-domain SOTA at 100k vocab) 18 days ago
FAIR_BENCHMARK.md

5.67 kB
Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% 21 days ago
README.md

26.7 kB
Add 'lower/higher is better' captions under the two charts 3 days ago
api_cost_comparison_per_1m_tokens.png

139 kB
Add API cost comparison per 1M tokens chart 3 days ago
bench_results.json

12.5 kB
Promote v0.3.1 to main (4-domain SOTA at 100k vocab) 18 days ago
benchmark_results.json

2.97 kB
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab 22 days ago
benchmark_results_2026flagships.json

883 Bytes
README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships 22 days ago
characters_per_1m_tokens.png

128 kB
Add characters per 1M tokens chart 3 days ago
fair_benchmark_results.json

4.8 kB
Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% 21 days ago
special_tokens_map.json

449 Bytes
Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) 18 days ago
tokenizer.json

6.6 MB
Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) 18 days ago
tokenizer_config.json

571 Bytes
Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) 18 days ago