AzText Tokenizer (SentencePiece BPE, 16k)

A SentencePiece BPE tokenizer trained on a 100,000-document sample of the AzText curated Azerbaijani corpus.

Released with the paper AzText: Curating Web-Scale Pretraining Data for a Low-Resource Language (AIDT 2026).

Specifications

  • Algorithm: SentencePiece BPE
  • Vocabulary size: 16,000
  • Character coverage: 1.0
  • Special tokens: <unk> (0), <s> (1), </s> (2)
  • Wrapper class: LlamaTokenizer (compatible with AutoTokenizer; see the sanity check below)
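
These values can be verified after loading (a minimal sketch; it assumes the repository id shown in the Usage section below):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("eljanmahammadli/aztext-tokenizer")

# Check the vocabulary size and the three special-token ids listed above.
assert tok.vocab_size == 16000
assert tok.unk_token_id == 0  # <unk>
assert tok.bos_token_id == 1  # <s>
assert tok.eos_token_id == 2  # </s>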

Compression

On a held-out 5,000-document evaluation set drawn from the curated corpus, this tokenizer achieves approximately 0.24 tokens per character on Azerbaijani text (roughly 4.2 characters per token). By comparison, GPT-2's tokenizer requires roughly 2.7× more tokens for the same input, and XLM-RoBERTa's requires roughly 1.1× more.
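
A comparison along these lines can be reproduced on any text sample (a minimal sketch; the sample sentence is illustrative rather than the paper's evaluation set, and the gpt2 and xlm-roberta-base checkpoints stand in for the baselines):

from transformers import AutoTokenizer

text = "Salam, dünya! Azərbaycan dilində bir nümunə."

for name in ["eljanmahammadli/aztext-tokenizer", "gpt2", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    # Tokens per character: lower means better compression.
    print(f"{name}: {len(ids) / len(text):.2f} tokens/char")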

Usage

from transformers import AutoTokenizer

# Load the tokenizer (resolves to a LlamaTokenizer).
tok = AutoTokenizer.from_pretrained("eljanmahammadli/aztext-tokenizer")
# Encode an Azerbaijani sample and inspect the resulting subword pieces.
ids = tok.encode("Salam, dünya! Azərbaycan dilində bir nümunə.")
print(tok.convert_ids_to_tokens(ids))
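
The sample sentence means "Hello, world! An example in the Azerbaijani language." Ids can be mapped back to text with the standard decode method:

print(tok.decode(ids, skip_special_tokens=True))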

Citation

@inproceedings{mahammadli2026aztext,
  title={AzText: Curating Web-Scale Pretraining Data for a Low-Resource Language},
  author={Mahammadli, Eljan and Rustamov, Samir},
  booktitle={Artificial Intelligence for Digital Transformations (AIDT)},
  year={2026}
}

License

MIT.
