Training dataset: amandyk/kazakh_wiki_articles
How to use Eraly-ml/KazBERT with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")

# Or load the tokenizer and model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Eraly-ml/KazBERT")
model = AutoModelForMaskedLM.from_pretrained("Eraly-ml/KazBERT")

If you find KazBERT useful, please press the like button.
KazBERT is a BERT-based model fine-tuned specifically for Kazakh using Masked Language Modeling (MLM). It is based on bert-base-uncased and uses a custom tokenizer trained on Kazakh text.
Erlanulu, Y. G. (2025). KazBERT: A Custom BERT Model for the Kazakh Language. Zenodo.
Repository files:

- config.json – Model config
- model.safetensors – Model weights
- tokenizer.json – Tokenizer data
- tokenizer_config.json – Tokenizer config
- special_tokens_map.json – Special tokens
- vocab.txt – Vocabulary

Install 🤗 Transformers and load the model:
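A minimal install sketch; the explicit torch package is an assumption (the fill-mask pipeline needs a PyTorch backend), not something the card specifies:

```shell
# Install 🤗 Transformers plus the PyTorch backend used by the examples below
pip install -U transformers torch
```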
from transformers import BertForMaskedLM, BertTokenizerFast
model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
from transformers import pipeline
pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")
output = pipe('KazBERT қазақ тілін [MASK] түсінеді.')
Output:
[
{"score": 0.198, "token_str": "жетік", "sequence": "KazBERT қазақ тілін жетік түсінеді."},
{"score": 0.038, "token_str": "де", "sequence": "KazBERT қазақ тілін де түсінеді."},
{"score": 0.032, "token_str": "терең", "sequence": "KazBERT қазақ тілін терең түсінеді."},
{"score": 0.029, "token_str": "ерте", "sequence": "KazBERT қазақ тілін ерте түсінеді."},
{"score": 0.026, "token_str": "жете", "sequence": "KazBERT қазақ тілін жете түсінеді."}
]
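The pipeline returns candidates ordered by score, so extracting the best filler is straightforward. A small sketch using the sample output above (the list is truncated to three entries for brevity):

```python
# Sample predictions as returned by the fill-mask pipeline (from the output above)
predictions = [
    {"score": 0.198, "token_str": "жетік", "sequence": "KazBERT қазақ тілін жетік түсінеді."},
    {"score": 0.038, "token_str": "де", "sequence": "KazBERT қазақ тілін де түсінеді."},
    {"score": 0.032, "token_str": "терең", "sequence": "KazBERT қазақ тілін терең түсінеді."},
]

# Results are already sorted by score, but max() makes the intent explicit
best = max(predictions, key=lambda p: p["score"])
print(best["token_str"])  # жетік
```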
- Trained only on public Kazakh Wikipedia & Common Crawl
- Might miss informal speech or dialects
- Could underperform on deep-context or rare words
- May reflect cultural or social biases in data
Apache 2.0 License
@misc{eraly_gainulla_2025,
author = { Eraly Gainulla },
title = { KazBERT (Revision 15240d4) },
year = 2025,
url = { https://huggingface.co/Eraly-ml/KazBERT },
doi = { 10.57967/hf/5271 },
publisher = { Hugging Face }
}