AzText Tokenizer (SentencePiece BPE, 16k)
A SentencePiece BPE tokenizer trained on a 100,000-document sample of the AzText curated Azerbaijani corpus.
Released with the paper AzText: Curating Web-Scale Pretraining Data for a Low-Resource Language (AIDT 2026).
Specifications
- Algorithm: SentencePiece BPE
- Vocabulary size: 16,000
- Character coverage: 1.0
- Special tokens: <unk> (0), <s> (1), </s> (2)
- Wrapper class: LlamaTokenizer (compatible with AutoTokenizer)
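The vocabulary size and special-token IDs are easy to verify after loading; a minimal check, assuming the Hub repository name given in the Usage section below:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("eljanmahammadli/aztext-tokenizer")
print(tok.vocab_size)                    # expected: 16000
print(tok.unk_token, tok.unk_token_id)   # expected: <unk> 0
print(tok.bos_token, tok.bos_token_id)   # expected: <s> 1
print(tok.eos_token, tok.eos_token_id)   # expected: </s> 2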
Compression
On a held-out 5,000-document evaluation set drawn from the curated corpus, this tokenizer achieves approximately 0.24 tokens per character on Azerbaijani text. By comparison, GPT-2's tokenizer requires roughly 2.7× more tokens for the same input, and XLM-RoBERTa requires roughly 1.1× more.
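The tokens-per-character figure can be reproduced on your own Azerbaijani text; a rough sketch, assuming the Hub repo IDs gpt2 and xlm-roberta-base for the baselines (the 5,000-document evaluation set itself is not distributed with this tokenizer):
from transformers import AutoTokenizer
# Substitute your own Azerbaijani evaluation texts here.
texts = ["Azərbaycan dili türk dilləri ailəsinə daxildir."]
for name in ["eljanmahammadli/aztext-tokenizer", "gpt2", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = sum(len(tok.encode(t, add_special_tokens=False)) for t in texts)
    n_chars = sum(len(t) for t in texts)
    print(f"{name}: {n_tokens / n_chars:.3f} tokens per character")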
Usage
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("eljanmahammadli/aztext-tokenizer")

# Encode an Azerbaijani sentence and inspect the resulting subword tokens.
ids = tok.encode("Salam, dünya! Azərbaycan dilində bir nümunə.")
print(tok.convert_ids_to_tokens(ids))
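Since the wrapper is LlamaTokenizer, the usual round-trip also works; a brief illustration (not from the original card): with character coverage 1.0, decoding should recover the input text.
text = "Salam, dünya! Azərbaycan dilində bir nümunə."
ids = tok.encode(text)
print(tok.decode(ids, skip_special_tokens=True))  # should print the original sentence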
Citation
@inproceedings{mahammadli2026aztext,
title={AzText: Curating Web-Scale Pretraining Data for a Low-Resource Language},
author={Mahammadli, Eljan and Rustamov, Samir},
booktitle={Artificial Intelligence for Digital Transformations (AIDT)},
year={2026}
}
License
MIT.