JetonCount

Summary

Task: regression
Total training time: 111 minutes
Params: 7009
Final MAE: 192 (train)
Framework: PyTorch
Authors: Paul Courneya, Jonathon Ly

Description

JetonCount is a 7k-parameter MLP regression model trained to predict how many tokens a piece of text will contain, using six text statistics plus the target tokenizer's vocabulary size (seven raw input features in total).

Model Details

Architecture: MLP
Input Feature Dimension: 19
Raw Input Features: 7
Engineered Features: True
Log1p Engineered Features: True
Hidden Size: 32
Number of Layers: 8
Activation: SiLU
Dropout: 0.005
Total Parameters: 7,009
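
The parameter total can be reproduced with a little arithmetic. The sketch below assumes the eight layers are plain fully connected layers with biases (one 19-to-32 input layer, six 32-to-32 hidden layers, one 32-to-1 output layer) and no normalization layers. That layout is an assumption rather than something the card states, although it does match the reported 7,009 parameters. The helper names are illustrative.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(input_dim: int, hidden: int, n_layers: int, output_dim: int = 1) -> int:
    """Parameter count of an MLP with n_layers linear layers in total (assumed layout)."""
    dims = [input_dim] + [hidden] * (n_layers - 1) + [output_dim]
    return sum(linear_params(a, b) for a, b in zip(dims, dims[1:]))


total = mlp_params(input_dim=19, hidden=32, n_layers=8)
print(total)  # 7009
```

SiLU and dropout add no trainable parameters, so they do not change the count.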

Input Features

chars
words
avg_chars_per_word
longest_word_chars
symbol_ratio
punctuation_ratio
vocab_size

Training

Dataset

22M rows. 28 tokenizers. 9 sources.

Tokenizers: tokenizers_used.txt
Datasets: datasets_used.txt

Training Details

Maximum Learning Rate: 6e-3
Minimum Learning Rate: 3e-6
Number of Epochs: 3
Batch Size: 32000
Eval Split Ratio: 0.005
Gradient Accumulation Steps: 1
Gradient Clipping: 1.0
AdamW Betas: (0.9, 0.95)
DType: float32
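
To give a feel for the scale of the run, the sketch below derives an approximate optimizer step count from the figures above and shows one way a learning rate could move between the stated maximum and minimum. Two things here are assumptions: "22M rows" is a rounded figure (and may or may not include the test split), and the card does not say which schedule shape was used, so the cosine decay is purely illustrative.

```python
import math

ROWS = 22_000_000        # "22M rows" from the card; a rounded figure
EVAL_RATIO = 0.005
BATCH_SIZE = 32_000
EPOCHS = 3
MAX_LR, MIN_LR = 6e-3, 3e-6

eval_rows = round(ROWS * EVAL_RATIO)
train_rows = ROWS - eval_rows
steps_per_epoch = math.ceil(train_rows / BATCH_SIZE)  # gradient accumulation is 1
total_steps = steps_per_epoch * EPOCHS


def cosine_lr(step: int, total: int = total_steps) -> float:
    """Hypothetical cosine decay from MAX_LR to MIN_LR (schedule shape is assumed)."""
    progress = min(max(step / total, 0.0), 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps)  # 685 2055
```

With a batch size this large, the whole run is only about two thousand updates, which is in line with training a 7k-parameter model on a CPU.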

Final Eval and Train Results

Train:
- R²: 0.951257
- MSE: 938621.647477
- RMSE: 968.824880
- MAE: 192.055944
- MRE: 0.137838
- Explained Variance: 0.951305
- Loss: 938621.647477
Eval:
- R²: 0.9738627018499254
- MSE: 480722.18035314884
- RMSE: 693.3413159138498
- MAE: 163.19862499318103
- MRE: 0.10729957442834033
- Explained Variance: 0.9738670065468997
- Loss: 480722.18035314884
Test:
- R²: 0.9717820439277628
- MSE: 388793.8423401673
- RMSE: 623.5333530294649
- MAE: 157.27345487387672
- MRE: 0.10509939610167868
- Explained Variance: 0.9717854117493441
- Loss: 388793.8423401673
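
The reported metrics are related to one another: RMSE is the square root of MSE, and Loss equals MSE in every split, which suggests (the card does not say so explicitly) that the training objective was mean squared error on raw token counts. The short check below recomputes RMSE from the reported MSE values; the dictionary layout is just for illustration.

```python
import math

# Values copied from the results above.
reported = {
    "train": {"mse": 938621.647477, "rmse": 968.824880},
    "eval": {"mse": 480722.18035314884, "rmse": 693.3413159138498},
    "test": {"mse": 388793.8423401673, "rmse": 623.5333530294649},
}

for split, m in reported.items():
    derived = math.sqrt(m["mse"])
    # RMSE is defined as sqrt(MSE), so the two columns should agree.
    print(f"{split}: sqrt(MSE) = {derived:.4f}, reported RMSE = {m['rmse']:.4f}")
```

RMSE is several times larger than MAE in all three splits, which is what one would expect if a small share of texts accounts for most of the squared error.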

Hardware

CPU: Ryzen 5 2600 (data preparation and training)

Predictions

Actual Tokens	Model Prediction
197	239
1333	1395
5973	6609
18569	20423

Note: Rounded to nearest integer.
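
The four samples above can be turned into per-row errors with a few lines of Python. The sketch assumes that MRE means the average of |prediction - actual| / actual; the card does not define it, so treat that as an assumption.

```python
# (actual_tokens, model_prediction) pairs copied from the table above.
samples = [(197, 239), (1333, 1395), (5973, 6609), (18569, 20423)]

abs_errors = [abs(pred - actual) for actual, pred in samples]
rel_errors = [err / actual for err, (actual, _) in zip(abs_errors, samples)]

for (actual, pred), a, r in zip(samples, abs_errors, rel_errors):
    print(f"actual={actual:>6} pred={pred:>6} abs_err={a:>5} rel_err={r:.1%}")

sample_mae = sum(abs_errors) / len(abs_errors)
sample_mre = sum(rel_errors) / len(rel_errors)  # assumed definition of MRE
print(f"MAE over these 4 samples: {sample_mae}")
print(f"MRE over these 4 samples: {sample_mre:.3f}")
```

All four predictions overestimate the true count, and the relative error is largest on the shortest text. Four rows are far too few to compare against the dataset-level metrics.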

Example

Input Text (taken from Wikipedia):

Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in engineering, mathematics and computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.[1]

Input vocab size (matching tokenizer): 2560
Tokenizer (baseline): FromZero/Er-Tiny-1.3M

Out:

{
  "actual_token_count": 139,
  "prediction": "190.5435028076172",
  "model_latency_ms": "0.24457614858252544",
  "tokenizer_latency_ms": 0.3174110000009023
}

Is it faster than a tokenizer, though?

We came across a dilemma: why build this model if a tokenizer is more accurate anyway?

The answer is speed.

In our tests the model was faster than the tokenizer at every length measured: roughly 1.7x faster on the shortest text and between 5x and 8x faster on the longer ones.

Tokens	Model Latency (ms)	Tokenizer Latency (ms)
197	0.2429	0.4134
1333	0.3409	1.8775
5973	0.9827	7.6504
18569	5.2890	28.8244

Note: Model latency changes based on hardware.
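
The speedup at each length follows directly from the table. A minimal sketch, with the rows copied from above:

```python
# (tokens, model_latency_ms, tokenizer_latency_ms) copied from the table above.
rows = [
    (197, 0.2429, 0.4134),
    (1333, 0.3409, 1.8775),
    (5973, 0.9827, 7.6504),
    (18569, 5.2890, 28.8244),
]

# Speedup = tokenizer latency divided by model latency.
speedups = [tok_ms / model_ms for _, model_ms, tok_ms in rows]

for (tokens, _, _), s in zip(rows, speedups):
    print(f"{tokens:>6} tokens: {s:.2f}x faster")
```

The ratio peaks at the 5,973-token row and drops slightly at 18,569 tokens, so the advantage does not grow without bound as texts get longer.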

Use Cases

Educational work and research
API Pricing Estimation
Dataset labeling
Or just for fun.

Limitations

The model is an approximation and can produce errors on out-of-distribution texts.
Prediction accuracy heavily depends on the correctness of the input features.
It does not perform actual tokenization and therefore is much less accurate than an actual tokenizer.
Vocabulary sizes larger than 128k can result in performance degradation.

License

Before using, distributing, selling, or modifying this software, you must read the license here.

Inference

from __future__ import annotations

import json
import re
import time
from dataclasses import dataclass
from typing import Tuple

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "fromziro/JetonCount"
TOKENIZER_ID = "fromziro/Er-Tiny-1.3M"

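# Optional per-feature standardization statistics (e.g. lists of 19 floats).
# When either is None, standardize_features() returns its input unchanged.
FEATURE_MEAN = None
FEATURE_STD = None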
TARGET_OFFSET = 0.0

DEFAULT_VOCAB_SIZE = 2564

TEXT = "Put your text here."
TOKENIZER_ROUNDS = 100
MODEL_ROUNDS = 1000

PUNCTUATION_CHARS = set(r""".,!?;:'"`~@#$%^&*()-_=+[]{}<>/\|""")
SYMBOL_CHARS = set(r"""@#$%^&*()-_=+[]{}<>/\|~`""")


@dataclass
class TextStats:
    chars: float
    words: float
    avg_chars_per_word: float
    punctuation_ratio: float
    symbol_ratio: float
    longest_word_chars: float
    vocab_size: float


def compute_text_stats(text: str, vocab_size: int) -> TextStats:
    chars = len(text)
    words_list = re.findall(r"\b\w+\b", text, flags=re.UNICODE)
    words = len(words_list)

    total_word_chars = sum(len(w) for w in words_list)
    avg_chars_per_word = (total_word_chars / words) if words else 0.0
    longest_word_chars = max((len(w) for w in words_list), default=0)

    if chars:
        punctuation_count = sum(1 for ch in text if ch in PUNCTUATION_CHARS)
        symbol_count = sum(1 for ch in text if ch in SYMBOL_CHARS)
        punctuation_ratio = punctuation_count / chars
        symbol_ratio = symbol_count / chars
    else:
        punctuation_ratio = 0.0
        symbol_ratio = 0.0

    return TextStats(
        chars=float(chars),
        words=float(words),
        avg_chars_per_word=float(avg_chars_per_word),
        punctuation_ratio=float(punctuation_ratio),
        symbol_ratio=float(symbol_ratio),
        longest_word_chars=float(longest_word_chars),
        vocab_size=float(vocab_size),
    )


def build_feature_tensor(stats: TextStats) -> torch.Tensor:
    """Build the 19-dim model input: 7 raw stats followed by 12 engineered features."""
    base = torch.tensor(
        [
            stats.chars,
            stats.words,
            stats.avg_chars_per_word,
            stats.punctuation_ratio,
            stats.symbol_ratio,
            stats.longest_word_chars,
            stats.vocab_size,
        ],
        dtype=torch.float32,
    )

    chars, words, avg_chars_per_word, punctuation_ratio, symbol_ratio, longest_word_chars, vocab_size = base
    eps = 1e-6

    extra = torch.tensor(
        [
            chars / max(words.item(), 1.0),
            words / max(chars.item(), 1.0),
            torch.log1p(torch.clamp(chars, min=0.0)).item(),
            torch.log1p(torch.clamp(words, min=0.0)).item(),
            torch.log1p(torch.clamp(vocab_size, min=0.0)).item(),
            (chars * punctuation_ratio).item(),
            (chars * symbol_ratio).item(),
            (words * avg_chars_per_word).item(),
            (words * punctuation_ratio).item(),
            (longest_word_chars * punctuation_ratio).item(),
            ((avg_chars_per_word + longest_word_chars) * (1.0 + punctuation_ratio + symbol_ratio)).item(),
            ((chars + eps) * (punctuation_ratio + symbol_ratio + eps)).item(),
        ],
        dtype=torch.float32,
    )

    return torch.cat([base, extra], dim=0)


def standardize_features(x: torch.Tensor) -> torch.Tensor:
    if FEATURE_MEAN is None or FEATURE_STD is None:
        return x
    mean = torch.tensor(FEATURE_MEAN, dtype=x.dtype, device=x.device)
    std = torch.tensor(FEATURE_STD, dtype=x.dtype, device=x.device)
    safe_std = torch.where(torch.isfinite(std) & (std != 0), std, torch.ones_like(std))
    safe_mean = torch.where(torch.isfinite(mean), mean, torch.zeros_like(mean))
    return (x - safe_mean) / safe_std


def benchmark_tokenizer(tokenizer, text: str, rounds: int = 100) -> Tuple[int, float]:
    tokenizer(text)
    start = time.perf_counter()
    actual_count = 0
    for _ in range(rounds):
        ids = tokenizer(text, add_special_tokens=False).input_ids
        actual_count = len(ids)
    elapsed_ms = (time.perf_counter() - start) * 1000.0 / rounds
    return actual_count, elapsed_ms


@torch.inference_mode()
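def benchmark_model(model, feature_tensor: torch.Tensor, rounds: int = 1000) -> Tuple[float, float]:
    """Return (prediction, mean latency in ms). Times the forward pass only, not feature extraction."""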
    x = standardize_features(feature_tensor).unsqueeze(0)

    _ = model(input_features=x)

    start = time.perf_counter()
    pred = 0.0
    for _ in range(rounds):
        out = model(input_features=x)
        pred = float(out.logits.squeeze().item())
    elapsed_ms = (time.perf_counter() - start) * 1000.0 / rounds
    return pred, elapsed_ms


def main() -> None:
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID, use_fast=True)
    model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
    model.eval()

    stats = compute_text_stats(TEXT, DEFAULT_VOCAB_SIZE)
    feature_tensor = build_feature_tensor(stats)

    actual_count, tokenizer_latency_ms = benchmark_tokenizer(tokenizer, TEXT, rounds=TOKENIZER_ROUNDS)
    prediction, model_latency_ms = benchmark_model(model, feature_tensor, rounds=MODEL_ROUNDS)

    result = {
        "actual_token_count": actual_count,
        "prediction": prediction,
        "model_latency_ms": model_latency_ms,
        "tokenizer_latency_ms": tokenizer_latency_ms,
        "model_id": MODEL_ID,
        "tokenizer_id": TOKENIZER_ID,
        "vocab_size": DEFAULT_VOCAB_SIZE,
        "features": {
            "chars": stats.chars,
            "words": stats.words,
            "avg_chars_per_word": stats.avg_chars_per_word,
            "punctuation_ratio": stats.punctuation_ratio,
            "symbol_ratio": stats.symbol_ratio,
            "longest_word_chars": stats.longest_word_chars,
            "vocab_size": stats.vocab_size,
        },
    }

    print(json.dumps(result, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()

Copyright

Copyright (c) 2026 FromZero  
Copyright (c) 2026 Paul Courneya
Copyright (c) 2026 Jonathon LY

Citation

@misc{jetoncount,
  title        = {JetonCount},
  organization = {FromZero},
  author       = {Paul Courneya and Jonathon LY},
  year         = {2026},
  url          = {https://huggingface.co/fromziro/JetonCount}
}
