# SARFTokenizer v0.3.1 – 4-domain (AR / EN / Math / Code) at 100k vocab
A 4-domain tokenizer with a 100,000-token vocabulary and an Arabic-focused normalization pipeline. It adds math and code to the bilingual AR/EN coverage of v0.2 without regressing Arabic, and pushes Arabic CpT to 4.004 – the highest we have measured on any tokenizer at any vocab size.
## The headline – what we actually claim
SOTA on every domain at any published vocab tier. v0.3.1 is simultaneously the best Arabic, best English, best math, and best code tokenizer we have measured, beating GPT-5.4-mini / GPT-5.5 o200k_base on every domain at half the vocab size.
## Benchmark – 1,200-document held-out 4-domain eval
300 docs each of Arabic, English, math (FineMath-4plus), and code (Nemotron-Code). 2,000-char cap per doc. `add_special_tokens=False`. No external preprocessing – each tokenizer's own normalizer/pre-tokenizer runs naturally.
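As a concrete reading of those settings, here is a minimal sketch of the CpT computation – an illustrative helper of ours, not the actual benchmark script:

```python
from transformers import AutoTokenizer

def chars_per_token(tokenizer, docs, cap=2000):
    """CpT = total characters / total tokens over capped docs."""
    total_chars = total_tokens = 0
    for doc in docs:
        text = doc[:cap]  # 2,000-char cap per doc
        ids = tokenizer.encode(text, add_special_tokens=False)
        total_chars += len(text)
        total_tokens += len(ids)
    return total_chars / total_tokens

tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print(chars_per_token(tok, ["def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"]))
```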
| Rank | Tokenizer | Vocab | AR | EN | MATH | CODE | Parity AR/EN |
|---|---|---|---|---|---|---|---|
| 1 | SARFTokenizer v0.3.1 | 100,000 | 4.004 | 3.733 | 4.243 | 4.200 | 1.073 |
| 2 | SARFTokenizer v0.2 | 65,000 | 3.683 | 3.522 | 3.922 | 3.913 | 1.046 |
| 3 | Qwen3.6-35B-A3B | 248,077 | 3.129 | 2.985 | 3.233 | 3.432 | 1.048 |
| 4 | tiktoken/o200k_base (GPT-5.4-mini, GPT-5.5) | 200,019 | 3.087 | 3.409 | 3.505 | 3.622 | 0.906 |
| 5 | ALLaM-7B-Instruct-preview | 64,000 | 2.854 | 2.518 | 3.000 | 3.250 | 1.133 |
| 6 | google/gemma-4-31B-it | 262,144 | 2.833 | 3.069 | 3.242 | 3.383 | 0.923 |
| 6t | google/gemma-3-1b-pt | 262,145 | 2.833 | 3.069 | 3.242 | 3.384 | 0.923 |
| 8 | google/gemma-2-2b | 256,000 | 2.779 | 3.117 | 3.269 | 3.383 | 0.892 |
| 9 | QCRI/Fanar-1-9B-Instruct | 128,256 | 2.778 | 3.047 | 3.221 | 3.346 | 0.911 |
| 10 | Qwen2.5-0.5B | 151,665 | 2.583 | 2.923 | 3.299 | 3.512 | 0.884 |
| 11 | Hala-350M | 64,400 | 2.219 | 3.220 | 3.367 | 3.477 | 0.689 |
| 12 | Kimi-K2.6 | 163,840 | 2.074 | 3.239 | 3.520 | 3.630 | 0.640 |
| 13 | tiktoken/cl100k_base (GPT-4) | 100,277 | 1.429 | 3.066 | 3.479 | 3.607 | 0.466 |
| 14 | Falcon-7B | 65,024 | 0.991 | 2.720 | 3.108 | 3.210 | 0.364 |
## Token and cost comparison – AR / EN / Math / Code only
CpT means characters per token. In other words:
- AR CpT 4.004 means about 4.004 Arabic characters = 1 token.
- EN CpT 3.733 means about 3.733 English characters = 1 token.
- Higher CpT means fewer tokens for the same text.
- Estimated tokens for a text block are calculated as characters ÷ CpT (see the sketch below).
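That estimate in code (the helper name is ours, purely illustrative):

```python
def estimated_tokens(n_chars: int, cpt: float) -> float:
    # Estimated tokens = characters ÷ CpT
    return n_chars / cpt

print(round(estimated_tokens(1_000_000, 4.004)))  # ≈ 249,750 AR tokens
print(round(estimated_tokens(1_000_000, 3.733)))  # ≈ 267,881 EN tokens
```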
### Token count comparison per 1M characters
This table shows how many tokens each tokenizer would produce for 1,000,000 characters in each domain.
| Tokenizer / Model | Vocab | AR CpT | AR tokens / 1M chars | EN CpT | EN tokens / 1M chars | Math CpT | Math tokens / 1M chars | Code CpT | Code tokens / 1M chars |
|---|---|---|---|---|---|---|---|---|---|
| SARFTokenizer v0.3.1 | 100,000 | 4.004 | 249,750 | 3.733 | 267,881 | 4.243 | 235,682 | 4.200 | 238,095 |
| SARFTokenizer v0.2 | 65,000 | 3.683 | 271,518 | 3.522 | 283,930 | 3.922 | 254,972 | 3.913 | 255,558 |
| Qwen3.6-35B-A3B | 248,077 | 3.129 | 319,591 | 2.985 | 335,008 | 3.233 | 309,310 | 3.432 | 291,375 |
| tiktoken/o200k_base (GPT-5.4-mini, GPT-5.5) | 200,019 | 3.087 | 323,939 | 3.409 | 293,341 | 3.505 | 285,307 | 3.622 | 276,091 |
### Token change vs o200k_base
Positive means fewer tokens than o200k_base; negative means more tokens than o200k_base.
| Tokenizer / Model | AR token change | EN token change | Math token change | Code token change |
|---|---|---|---|---|
| SARFTokenizer v0.3.1 | +22.9% fewer | +8.7% fewer | +17.4% fewer | +13.8% fewer |
| SARFTokenizer v0.2 | +16.2% fewer | +3.2% fewer | +10.6% fewer | +7.4% fewer |
| Qwen3.6-35B-A3B | +1.3% fewer | −14.2% more | −8.4% more | −5.5% more |
| tiktoken/o200k_base (GPT-5.4-mini, GPT-5.5) | baseline | baseline | baseline | baseline |
### Estimated API cost per 1M characters
The cost formulas are:
tokens per 1M characters = 1,000,000 ÷ CpT
cost per 1M characters = price per 1M tokens ÷ CpT
The per-token price doesn't tell you what you'll actually pay – what matters is how many tokens each tokenizer produces for your text. A tokenizer with higher CpT lets you fit more characters into the same number of tokens, so at the same per-token price the per-character cost is lower. The bar chart and table below use that fact directly.
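In code, both formulas are a single division (hypothetical helper; the rates and CpT values are the ones from the tables in this section):

```python
def cost_per_million_chars(price_per_m_tokens: float, cpt: float) -> float:
    # tokens per 1M chars = 1,000,000 ÷ CpT; each token costs price ÷ 1,000,000
    return price_per_m_tokens / cpt

# Input side at Qwen's DeepInfra rate ($0.15 per 1M tokens):
print(cost_per_million_chars(0.15, 4.004))  # SARF v0.3.1, Arabic → ≈ $0.037
print(cost_per_million_chars(0.15, 3.129))  # Qwen native, Arabic → ≈ $0.048
```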
Pricing references:
- Qwen3.6-35B-A3B DeepInfra pricing: https://deepinfra.com/Qwen/Qwen3.6-35B-A3B
- OpenAI API pricing for GPT models: https://openai.com/api/pricing/
### API pricing assumptions
For the chart and tables below, all SARFTokenizer rows are priced at Qwen3.6-35B's DeepInfra rate ($0.15 input / $0.95 output per 1M tokens). SARFTokenizer is a tokenizer, not a hosted API – applying Qwen's rate is a notional anchor that lets us compare against the cheapest hosted peer in this set. The hosted peers (Qwen, GPT-5.4 mini, GPT-5.5) keep their own real prices.
| Model / Tokenizer | Input price / 1M tokens | Cached input / 1M tokens | Output price / 1M tokens | Pricing status |
|---|---|---|---|---|
| SARFTokenizer v0.3.1 (notional Qwen) | $0.15 | – | $0.95 | Tokenizer only; using Qwen's DeepInfra rate as anchor |
| SARFTokenizer v0.2 (notional Qwen) | $0.15 | – | $0.95 | Tokenizer only; using Qwen's DeepInfra rate as anchor |
| Qwen3.6-35B-A3B on DeepInfra | $0.15 | – | $0.95 | Hosted API price; no cached tier in listing |
| GPT-5.4 mini | $0.75 | $0.075 | $4.50 | Hosted API price; cached input = 10% of uncached |
| GPT-5.5 | $5.00 | $0.50 | $30.00 | Hosted API price; cached input = 10% of uncached |
### API cost per 1M tokens
The chart below shows the raw per-token pricing for each model and the notional Qwen anchor used for the SARFTokenizer rows. OpenAI's cached input rates (10% of uncached) appear as their own bars. This is the pricing-tier view – the per-character analysis that exploits SARFTokenizer's CpT advantage starts in the next section.
Lower is better – shorter bars mean less cost per 1M tokens.
Hatched bars = notional pricing (SARFTokenizer rows). Linear y-axis. Three pricing tiers are visible: Qwen-tier (input + output ≈ $1.10), GPT-5.4 mini tier (≈ $5.25), and GPT-5.5 tier (≈ $35.00). Output tokens dominate every bill – about 6× the uncached input price across both OpenAI models and 6.3× the per-token price on Qwen. The compression and reduction figures in the next sections build on these raw prices.
### Characters delivered per 1M tokens
The per-token chart above shows the billing rate. This chart shows the content rate – how much text each token actually encodes. Same tokens, but each tokenizer extracts a different number of characters from them. SARFTokenizer's compression advantage, invisible on the price chart, shows up directly here as taller bars.
GPT-5.4 mini and GPT-5.5 share the same tokenizer (o200k_base), so they collapse to a single bar group here.
Higher is better – taller bars mean more characters packed into each token.
Reading across the Arabic bars (blue): SARFTokenizer v0.3.1 packs 4.00M characters into 1M tokens, v0.2 packs 3.68M, Qwen packs 3.13M, and o200k_base packs 3.09M. At the same per-token price, those gaps are the cost advantage – every extra character per token is a character you don't pay extra for. The 22% Arabic cost reduction shown two sections down is exactly this gap, re-expressed in dollars.
The pattern holds across all four domains, with v0.3.1 leading every column – strongest on math (4.24M chars / 1M tokens) and weakest (still ahead) on English (3.73M).
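The bar heights are just CpT × 1,000,000; a quick sketch (values from the benchmark table) that also reproduces the 21.9% Arabic reduction quoted below:

```python
cpt = {"SARF v0.3.1": 4.004, "SARF v0.2": 3.683,
       "Qwen3.6": 3.129, "o200k_base": 3.087}
for name, c in cpt.items():
    print(f"{name}: {c * 1_000_000:,.0f} AR chars per 1M tokens")

# At an identical per-token price, the cost ratio is the inverse CpT ratio:
print(1 - cpt["Qwen3.6"] / cpt["SARF v0.3.1"])  # ≈ 0.219 → 21.9% cheaper
```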
### Price reduction from hosted peers to SARFTokenizer v0.3.1
Comparing total (input + output) cost per 1M Arabic characters. SARFTokenizer v0.3.1 is priced at Qwen's DeepInfra rate (notional); hosted peers use their own real prices.
| Peer | Peer total / 1M AR chars | SARF v0.3.1 / 1M AR chars | Reduction |
|---|---|---|---|
| Qwen3.6-35B on DeepInfra | $0.3515 | $0.2747 | −21.9% |
| GPT-5.4 mini | $1.7007 | $0.2747 | −83.8% |
| GPT-5.5 | $11.3379 | $0.2747 | −97.6% |
Formula: reduction = 1 − (SARF_total ÷ peer_total) = 1 − (SARF_rate × peer_CpT) ÷ (peer_rate × SARF_CpT)
### Where the reduction comes from – pricing tier vs compression
| Peer | Rate factor (SARF_rate ÷ peer_rate) | Compression factor (peer_CpT ÷ SARF_CpT) | Combined cost ratio | Total reduction |
|---|---|---|---|---|
| Qwen | $1.10 / $1.10 = 1.000 | 3.129 / 4.004 = 0.781 | 0.781 | 21.9% |
| GPT-5.4 mini | $1.10 / $5.25 = 0.210 | 3.087 / 4.004 = 0.771 | 0.162 | 83.8% |
| GPT-5.5 | $1.10 / $35.00 = 0.031 | 3.087 / 4.004 = 0.771 | 0.024 | 97.6% |
Reading the columns: against Qwen the full 21.9% reduction comes from compression alone, since the per-token rates are identical. Against the GPT models, most of the reduction is the pricing-tier gap (Qwen-tier inference is ~5× cheaper per token than GPT-5.4 mini and ~32× cheaper than GPT-5.5), and SARFTokenizer's compression contributes a multiplicative ~23% on top. The tokenizer's own contribution is a clean ~22–23% wedge across all three comparisons; the rest is pricing.
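The decomposition is easy to check numerically (a sketch using the rates and CpT values above; the helper is ours):

```python
def reduction(sarf_rate, peer_rate, sarf_cpt, peer_cpt):
    rate_factor = sarf_rate / peer_rate          # pricing-tier gap
    compression_factor = peer_cpt / sarf_cpt     # tokenizer gap
    return 1 - rate_factor * compression_factor  # 1 - combined cost ratio

print(reduction(1.10, 1.10, 4.004, 3.129))   # Qwen:         ≈ 0.219
print(reduction(1.10, 5.25, 4.004, 3.087))   # GPT-5.4 mini: ≈ 0.838
print(reduction(1.10, 35.00, 4.004, 3.087))  # GPT-5.5:      ≈ 0.976
```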
### Full cost table – all four domains
| Model / Tokenizer | API pricing used | AR input | AR output | EN input | EN output | Math input | Math output | Code input | Code output |
|---|---|---|---|---|---|---|---|---|---|
| SARFTokenizer v0.3.1 (notional Qwen) | $0.15 input / $0.95 output per 1M tokens | $0.037 | $0.237 | $0.040 | $0.254 | $0.035 | $0.224 | $0.036 | $0.226 |
| SARFTokenizer v0.2 (notional Qwen) | $0.15 input / $0.95 output per 1M tokens | $0.041 | $0.258 | $0.043 | $0.270 | $0.038 | $0.242 | $0.038 | $0.243 |
| Qwen3.6-35B-A3B on DeepInfra | $0.15 input / $0.95 output per 1M tokens | $0.048 | $0.304 | $0.050 | $0.318 | $0.046 | $0.294 | $0.044 | $0.277 |
| GPT-5.4 mini with o200k_base | $0.75 input / $4.50 output per 1M tokens | $0.243 | $1.458 | $0.220 | $1.320 | $0.214 | $1.284 | $0.207 | $1.242 |
| GPT-5.5 with o200k_base | $5.00 input / $30.00 output per 1M tokens | $1.620 | $9.718 | $1.467 | $8.800 | $1.427 | $8.559 | $1.380 | $8.283 |
### SARFTokenizer compression at the same per-token rate – savings vs Qwen on DeepInfra
Same per-token rate as Qwen on DeepInfra; the dollar savings come entirely from SARFTokenizer's higher CpT producing fewer tokens for the same text.
| Tokenizer | AR input | AR output | EN input | EN output | Math input | Math output | Code input | Code output |
|---|---|---|---|---|---|---|---|---|
| Qwen3.6 (native) | $0.048 | $0.304 | $0.050 | $0.318 | $0.046 | $0.294 | $0.044 | $0.277 |
| SARF v0.3.1 | $0.037 | $0.237 | $0.040 | $0.254 | $0.035 | $0.224 | $0.036 | $0.226 |
| Savings v0.3.1 vs Qwen | −22.9% | −22.0% | −20.0% | −20.1% | −23.9% | −23.8% | −18.2% | −18.4% |
| SARF v0.2 | $0.041 | $0.258 | $0.043 | $0.270 | $0.038 | $0.242 | $0.038 | $0.243 |
| Savings v0.2 vs Qwen | −14.6% | −15.1% | −14.0% | −15.1% | −17.4% | −17.7% | −13.6% | −12.3% |
### What SARFTokenizer compression means at GPT-style pricing
This is not an API price for SARFTokenizer. It shows the compression advantage only: if a model had GPT-style pricing but used SARFTokenizer compression instead of o200k_base, the estimated cost per 1M characters would be lower because the same text becomes fewer tokens.
| Pricing scenario | Tokenizer | AR input | AR output | EN input | EN output | Math input | Math output | Code input | Code output |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 mini pricing | o200k_base | $0.243 | $1.458 | $0.220 | $1.320 | $0.214 | $1.284 | $0.207 | $1.242 |
| GPT-5.4 mini pricing | SARFTokenizer v0.3.1 compression | $0.187 | $1.124 | $0.201 | $1.205 | $0.177 | $1.061 | $0.179 | $1.071 |
| GPT-5.4 mini pricing | SARFTokenizer v0.2 compression | $0.204 | $1.222 | $0.213 | $1.278 | $0.191 | $1.147 | $0.192 | $1.150 |
| GPT-5.5 pricing | o200k_base | $1.620 | $9.718 | $1.467 | $8.800 | $1.427 | $8.559 | $1.380 | $8.283 |
| GPT-5.5 pricing | SARFTokenizer v0.3.1 compression | $1.249 | $7.493 | $1.339 | $8.036 | $1.178 | $7.070 | $1.190 | $7.143 |
| GPT-5.5 pricing | SARFTokenizer v0.2 compression | $1.358 | $8.146 | $1.420 | $8.518 | $1.275 | $7.649 | $1.278 | $7.667 |
## v0.3.1 vs the best peer per domain
| Domain | v0.3.1 | Best peer | Δ |
|---|---|---|---|
| Arabic | 4.004 | Qwen3.6-35B (3.129) | +27.9% |
| English | 3.733 | GPT-5.4-mini / GPT-5.5 o200k_base (3.409) | +9.5% |
| Math | 4.243 | GPT-5.4-mini / GPT-5.5 o200k_base (3.505) | +21.0% |
| Code | 4.200 | GPT-5.4-mini / GPT-5.5 o200k_base (3.622) | +16.0% |
## v0.3.1 vs prior SARFTokenizer revisions
| Domain | v0.2 (65k) | v0.3 (80k) | v0.3.1 (100k) | Δ vs v0.2 |
|---|---|---|---|---|
| Arabic | 3.683 | 3.192 | 4.004 | +8.7% |
| English | 3.522 | 3.631 | 3.733 | +6.0% |
| Math | 3.922 | 4.259 | 4.243 | +8.2% |
| Code | 3.913 | 4.224 | 4.200 | +7.3% |
The 100k vocab gives Arabic ~50,000 effective slots (vs v0.2's 32,500 at 65k), and the 250M-char Arabic training share matches v0.2 exactly – so AR strictly gains from the larger vocab while math/code retain v0.3-class compression.
## Why this matters
- Arabic-first deployments: 4.004 AR CpT means ~30% more Arabic context in the same window vs GPT-5.4-mini / GPT-5.5 o200k_base, and ~9% more vs our own v0.2.
- Bilingual + technical domains: math and code are now first-class – strong compression on Python, math word problems, and formal reasoning chains.
- Vocab specialization > vocab size: at 100k we beat models with 200k–262k vocabularies on every domain.
- Same infrastructure: `AutoTokenizer.from_pretrained` without `trust_remote_code`, no Python preprocessing.
## Caveats we want you to know
- Lossy Arabic normalization (inherited from v0.2). Tashkeel, Alef variants, Ya Maksura, and Indic digits are normalized at encode time. Not suitable for Qur'anic text or classical poetry with full diacritics.
- Math is web-style. Trained on FineMath-4plus – natural-language math web text, not LaTeX-heavy formal mathematics.
- Code is Python-leaning. Trained on Nemotron-Code, dominated by Python competitive-programming solutions with `<think>` reasoning. Less common languages may fall back to byte-level pieces more often.
- Larger embedding table. 100k × hidden_dim is ~50% bigger than the v0.2 65k row table. Worth it if you can afford the parameters; if not, see v0.2 (AR/EN only) or v0.3 (4-domain at 80k with AR regression).
- Breaking change vs v0.2/v0.3 special tokens. Old `<s>`/`</s>`/`<unk>`/`<pad>` are no longer present. Pin `revision="v0.2"` if you depend on the old token IDs.
## Special tokens
13 atomic special tokens with reserved IDs 0–12 (single-token, never split):
| ID | Token | Slot | Purpose |
|---|---|---|---|
| 0 | `<\|assistant_end\|>` | additional | end of assistant turn (chat) |
| 1 | `<\|assistant_start\|>` | additional | start of assistant turn (chat) |
| 2 | `<\|bos\|>` | bos_token | beginning-of-sequence |
| 3 | `<\|end_of_text\|>` | eos_token | end-of-sequence |
| 4 | `<\|mask\|>` | mask_token | mask for FIM / denoising / infilling |
| 5 | `<\|output_end\|>` | additional | end of tool / exec output block |
| 6 | `<\|output_start\|>` | additional | start of tool / exec output block |
| 7 | `<\|pad\|>` | pad_token | padding |
| 8 | `<\|python_end\|>` | additional | end of Python code block |
| 9 | `<\|python_start\|>` | additional | start of Python code block |
| 10 | `<\|unk\|>` | unk_token | unknown / byte-fallback signal |
| 11 | `<\|user_end\|>` | additional | end of user turn (chat) |
| 12 | `<\|user_start\|>` | additional | start of user turn (chat) |
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print(tok.bos_token, tok.eos_token, tok.mask_token)
# → <|bos|> <|end_of_text|> <|mask|>
```
The chat / code / output tokens enable a downstream model to emit:
```text
<|user_start|>solve x^2 + 3x = 10<|user_end|>
<|assistant_start|>
<|python_start|>
from sympy import symbols, solve
x = symbols('x')
print(solve(x**2 + 3*x - 10))
<|python_end|>
<|output_start|>
[-5, 2]
<|output_end|>
The roots are x = -5 and x = 2.
<|assistant_end|>
<|end_of_text|>
```
without any markup-tokenization overhead – every boundary is a single token.
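For example, a hypothetical helper (ours, not an official chat template shipped with the tokenizer) can assemble such a turn and confirm each boundary costs one token, reusing `tok` from the snippet above:

```python
def wrap_turn(role: str, text: str) -> str:
    # Assumes the <|{role}_start|> / <|{role}_end|> naming from the table above.
    return f"<|{role}_start|>{text}<|{role}_end|>"

prompt = wrap_turn("user", "solve x^2 + 3x = 10") + "<|assistant_start|>"
ids = tok.encode(prompt, add_special_tokens=False)
print(ids[0], ids[-1])  # should be the reserved single-token IDs (12, 1)
```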
## Overview
| Property | Value |
|---|---|
| Vocabulary size | 100,000 |
| Pre-tokenizer | Metaspace (▁ marker, SentencePiece-style) |
| Normalizer | Arabic-focused: NFKC → Alef/Ya unification → tashkeel/tatweel/zero-width strip → Indic digits → ASCII |
| Special tokens | 13 (see table above) |
| Domains | Arabic + English + Math + Code |
| Training corpus | 500M chars (250M AR / 100M EN / 75M math / 75M code) |
| Training corpus repo | almaghrabima/deeplatent-labeled |
| Public API | `AutoTokenizer.from_pretrained` without `trust_remote_code` |
## Quick start
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print(tok.vocab_size)  # 100000
```
To pin to a specific revision:
```python
# v0.3.1 (latest, 100k, 4-domain, modern specials, this revision)
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer", revision="v0.3.1")

# v0.3 (80k, 4-domain, legacy <s>/</s> specials – accepts AR regression for smaller vocab)
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer", revision="v0.3")

# v0.2 (65k, AR/EN only, legacy specials – original SOTA-Arabic release)
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer", revision="v0.2")
```
## Low-level tokenizers API
```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("almaghrabima/SARFTokenizer")  # main = v0.3.1
print(tok.encode("المعلم يشرح الدرس في الصف اليوم.",
                 add_special_tokens=False).tokens)
print(tok.encode("def fib(n):\n return n if n<2 else fib(n-1)+fib(n-2)",
                 add_special_tokens=False).tokens)
```
## Reproduce the benchmark
The eval set (300 AR + 300 EN + 300 math + 300 code) is built from:
- AR/EN: the `SARFTokenizer-benchmark-eval` dataset.
- Math: held-out tail of `HuggingFaceTB/finemath` (finemath-4plus).
- Code: held-out tail of `saurabh5/nemotron-post-training-dataset-v1-code` with role markers stripped (problem + solution flattened with `\n\n`).
Each doc is capped at 2,000 chars, with no normalization beyond what each tokenizer applies internally.
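A sketch of assembling the math slice under those rules (the `text` column name and the exact tail selection are assumptions; adjust to the actual dataset schema):

```python
from datasets import load_dataset

math_ds = load_dataset("HuggingFaceTB/finemath", "finemath-4plus", split="train")
tail = math_ds.select(range(len(math_ds) - 300, len(math_ds)))  # held-out tail
math_docs = [row["text"][:2000] for row in tail]  # 2,000-char cap per doc
print(len(math_docs))  # 300
```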
## Normalization (lossy on Arabic, by design)
All Arabic text is normalized at encode time:
- NFKC compat normalization
- Tashkeel (`U+064B`–`U+0652`, `U+0670`) removed
- Tatweel `U+0640` removed
- Zero-width + BiDi controls removed
- Alef variants (أ, إ, آ, ٱ) → bare Alef ا
- Alef Maksura ى → Ya ي
- Arabic-Indic digits (٠–٩) → ASCII 0–9
Encoding is lossy on diacritics and Alef-Hamza variants – by design. If your downstream task requires preserving these (classical poetry with full diacritics, Qur'anic text), this tokenizer is not suitable.
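For reference, here is a plain-Python sketch of that pipeline – our approximation for illustration; the tokenizer applies its own equivalent normalizer internally:

```python
import re
import unicodedata

TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")  # harakat + superscript alef
ZW_BIDI = re.compile(r"[\u200B-\u200F\u202A-\u202E\u2066-\u2069\uFEFF]")
ALEF_VARIANTS = str.maketrans("أإآٱ", "اااا")
INDIC_DIGITS = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def normalize_ar(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    text = TASHKEEL.sub("", text)
    text = text.replace("\u0640", "")   # tatweel
    text = ZW_BIDI.sub("", text)
    text = text.translate(ALEF_VARIANTS)
    text = text.replace("ى", "ي")       # Alef Maksura → Ya
    return text.translate(INDIC_DIGITS)

print(normalize_ar("أَهْلاً ١٢٣"))  # → "اهلا 123"
```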
## Files
- `tokenizer.json` – HuggingFace-format tokenizer (6.6 MB)
- `tokenizer_config.json` – `PreTrainedTokenizerFast` config
- `special_tokens_map.json` – special tokens map (5 named slots + 13-item additional list)
- `BENCHMARK.md` – full results across 15 tokenizers (this README's table)
- `bench_results.json` – raw per-tokenizer per-domain metrics
## Related
- Training corpus: `almaghrabima/deeplatent-labeled` – 4-domain labeled pretraining corpus
- Eval corpus (AR/EN portion): `almaghrabima/SARFTokenizer-benchmark-eval` – 300 AR + 300 EN held-out documents
## Version history
- v0.3.1 (latest, this revision) – 100k vocab, 4-domain, 13 modern `<|...|>` specials. SOTA on AR/EN/math/code.
- v0.3 – 80k vocab, 4-domain, legacy `<s>`/`</s>`/`<unk>`/`<pad>` specials. Math/code SOTA but AR regresses vs v0.2.
- v0.2 – 65k vocab, AR/EN only, legacy specials. Original release; SOTA Arabic at the sub-100k tier.
## License
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

