almaghrabima/SARFTokenizer
Tags: Arabic, English, tokenizers, tokenizer, sarf, bilingual, arabic, english, math, code, sentencepiece-style
License: cc-by-nc-4.0
Branch: main
SARFTokenizer: 6.92 MB
1 contributor
History: 61 commits
Latest commit: 8446f73 (verified) by almaghrabima, 3 days ago: Add 'lower/higher is better' captions under the two charts
Files (name, size, badge where shown, last change, last commit message):

.gitattributes (1.78 kB), 3 days ago: Add characters per 1M tokens chart
BENCHMARK.md (1.74 kB, Safe), 18 days ago: Promote v0.3.1 to main (4-domain SOTA at 100k vocab)
FAIR_BENCHMARK.md (5.67 kB, Safe), 21 days ago: Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2%
README.md (26.7 kB), 3 days ago: Add 'lower/higher is better' captions under the two charts
api_cost_comparison_per_1m_tokens.png (139 kB, xet), 3 days ago: Add API cost comparison per 1M tokens chart
bench_results.json (12.5 kB, Safe), 18 days ago: Promote v0.3.1 to main (4-domain SOTA at 100k vocab)
benchmark_results.json (2.97 kB, Safe), 22 days ago: v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab
benchmark_results_2026flagships.json (883 Bytes, Safe), 22 days ago: README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships
characters_per_1m_tokens.png (128 kB, xet), 3 days ago: Add characters per 1M tokens chart
fair_benchmark_results.json (4.8 kB, Safe), 21 days ago: Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2%
special_tokens_map.json (449 Bytes, Safe), 18 days ago: Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged)
tokenizer.json (6.6 MB, Safe), 18 days ago: Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged)
tokenizer_config.json (571 Bytes, Safe), 18 days ago: Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged)