Tokenizer Module
This module handles all tokenization tasks for the Mini-LLM project, converting raw text into numerical tokens that the model can process.
Overview
The tokenizer uses SentencePiece with Byte Pair Encoding (BPE) to build a 32,000-token vocabulary. BPE is the same algorithm used by GPT-3, GPT-4, and the LLaMA models.
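For intuition, here is a minimal, self-contained sketch of one BPE merge step on a toy corpus. It is an illustration of the idea only; the real training below is done by SentencePiece on the full corpus over many merge rounds.

```python
from collections import Counter

# Toy corpus: each word starts as a tuple of characters, with its frequency.
corpus = {
    ("t", "h", "e"): 5,
    ("t", "h", "a", "t"): 3,
    ("o", "t", "h", "e", "r"): 2,
}

def best_pair(corpus):
    """Find the most frequent adjacent symbol pair, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        for pair in zip(word, word[1:]):
            counts[pair] += freq
    return counts.most_common(1)[0][0]

def merge(corpus, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if word[i:i + 2] == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

pair = best_pair(corpus)        # ('t', 'h') — the most frequent pair
corpus = merge(corpus, pair)    # "t" + "h" -> "th"
print(pair, corpus)
```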
Directory Structure
```
Tokenizer/
├── BPE/                        # BPE tokenizer artifacts
│   ├── spm.model               # Trained SentencePiece model
│   ├── spm.vocab               # Vocabulary file
│   ├── tokenizer.json          # HuggingFace format
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── Unigram/                    # Unigram tokenizer (baseline)
│   └── ...
├── train_spm_bpe.py            # Train BPE tokenizer
├── train_spm_unigram.py        # Train Unigram tokenizer
└── convert_to_hf.py            # Convert to HuggingFace format
```
How It Works
1. Training the Tokenizer
Script: train_spm_bpe.py
```python
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="data/raw/merged_text/corpus.txt",
    model_prefix="Tokenizer/BPE/spm",
    vocab_size=32000,
    model_type="bpe",
    byte_fallback=True,          # Handles emojis, special chars
    character_coverage=1.0,
    user_defined_symbols=["<user>", "<assistant>", "<system>"]
)
```
What happens:
- Reads raw text corpus
- Learns byte-pair merges (e.g., "th" + "e" → "the")
- Builds a vocabulary of the 32,000 most frequent tokens
- Saves the trained model to spm.model (see the quick check below)
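After training, the model can be loaded back with the sentencepiece library for a quick sanity check; a minimal sketch, assuming the paths used above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

print(sp.get_piece_size())       # 32000 — total vocabulary size
print(sp.piece_to_id("<user>"))  # user-defined symbols get dedicated IDs
# SentencePiece's default control-token layout: <unk>=0, <s>=1, </s>=2
print(sp.id_to_piece(0), sp.id_to_piece(1), sp.id_to_piece(2))
```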
2. Example: Tokenization Process
Input Text:
"Hello world! <user> write code </s>"
Tokenization Steps:
```
┌──────────────────────────────────────┐
│ 1. Text Input                        │
│ "Hello world! <user> write code"     │
└──────────────────────────────────────┘
                   ↓
┌──────────────────────────────────────┐
│ 2. BPE Segmentation                  │
│ ['H', 'ello', '▁world', '!',         │
│  '▁', '<user>', '▁write', '▁code']   │
└──────────────────────────────────────┘
                   ↓
┌──────────────────────────────────────┐
│ 3. Token IDs                         │
│ [334, 3855, 288, 267, 2959,          │
│  354, 267, 12397]                    │
└──────────────────────────────────────┘
```
Key Features:
- ▁ represents a space (SentencePiece convention)
- Special tokens like <user> are preserved
- Byte fallback handles emojis: 🔥 → <0xF0><0x9F><0x94><0xA5>
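Both conventions can be observed directly on the trained model; a short sketch, assuming the paths used above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

# Spaces show up as the ▁ marker on the following piece.
print(sp.encode("write code", out_type=str))   # e.g. ['▁write', '▁code']

# Characters with no learned piece fall back to raw UTF-8 bytes,
# so nothing is ever mapped to <unk>.
print(sp.encode("🔥", out_type=str))           # e.g. ['▁', '<0xF0>', '<0x9F>', '<0x94>', '<0xA5>']

# Decoding the byte pieces reassembles the original character.
print(sp.decode(sp.encode("🔥")))              # 🔥
```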
3. Converting to HuggingFace Format
Script: convert_to_hf.py
```python
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast(vocab_file="Tokenizer/BPE/spm.model")
tokenizer.add_special_tokens({
    'bos_token': '<s>',
    'eos_token': '</s>',
    'unk_token': '<unk>',
    'pad_token': '<pad>'
})
tokenizer.save_pretrained("Tokenizer/BPE")
```
This creates tokenizer.json and config files compatible with HuggingFace Transformers.
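A simple way to verify the conversion is to encode the same text with the raw SentencePiece model and the exported HuggingFace tokenizer and compare the IDs; a sketch, assuming the paths above (the two should agree once the HF special tokens are left out):

```python
import sentencepiece as spm
from transformers import AutoTokenizer

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")
hf = AutoTokenizer.from_pretrained("Tokenizer/BPE")

text = "Hello world! <user> write code"
sp_ids = sp.encode(text)
hf_ids = hf.encode(text, add_special_tokens=False)

# Apart from <s>/</s> handling, both should produce the same token IDs.
print("sentencepiece:", sp_ids)
print("huggingface:  ", hf_ids)
print("match:", sp_ids == hf_ids)
```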
Usage
Load Tokenizer
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
```
Encode Text
text = "Hello world!"
ids = tokenizer.encode(text)
# Output: [1, 334, 3855, 288, 267, 2]
# [<s>, H, ello, βworld, !, </s>]
Decode IDs
```python
decoded = tokenizer.decode(ids)
# Output: "<s> Hello world! </s>"

decoded = tokenizer.decode(ids, skip_special_tokens=True)
# Output: "Hello world!"
```
BPE vs Unigram
| Feature | BPE | Unigram |
|---|---|---|
| Algorithm | Merge frequent pairs | Probabilistic segmentation |
| Emoji Handling | ✅ Byte fallback | ❌ Creates <unk> |
| URL Handling | ✅ Clean splits | ⚠️ Unstable |
| Used By | GPT-3, GPT-4, LLaMA | ALBERT, T5 |
| Recommendation | ✅ Primary | Baseline only |
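The table can be spot-checked by running both tokenizers on awkward inputs. The sketch below assumes the Unigram model was saved as Tokenizer/Unigram/spm.model (the Unigram directory contents are elided above, so adjust the path to your layout):

```python
import sentencepiece as spm

bpe = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")
uni = spm.SentencePieceProcessor(model_file="Tokenizer/Unigram/spm.model")  # assumed path

for text in ["Check https://example.com/docs?page=2", "Nice 🔥"]:
    print(text)
    print("  BPE:    ", bpe.encode(text, out_type=str))
    print("  Unigram:", uni.encode(text, out_type=str))
```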
Vocabulary Statistics
- Total Tokens: 32,000
- Special Tokens: 4 (<s>, </s>, <unk>, <pad>)
- User-Defined: 3 (<user>, <assistant>, <system>)
- Coverage: 100% (byte fallback ensures no <unk>)
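These statistics can be re-derived from the exported tokenizer; a minimal sketch:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Tokenizer/BPE")

print("base vocab size:", tok.vocab_size)          # 32000 (added tokens such as <pad> may sit on top)
print("special tokens: ", tok.all_special_tokens)  # <s>, </s>, <unk>, <pad>
print("user-defined:   ", tok.convert_tokens_to_ids(["<user>", "<assistant>", "<system>"]))
```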
Performance
- Compression Ratio: ~3.5 bytes/token (English text)
- Tokenization Speed: ~1M tokens/second
- Vocab Usage: ~70% of tokens used in typical corpus
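Both figures vary with the corpus; they can be re-measured with a rough benchmark like the sketch below (the sample path is an assumption; any representative text file works):

```python
import time
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

with open("data/raw/merged_text/corpus.txt", encoding="utf-8") as f:
    text = f.read(10_000_000)   # ~10 MB sample is enough for a rough estimate

start = time.perf_counter()
ids = sp.encode(text)
elapsed = time.perf_counter() - start

print(f"compression: {len(text.encode('utf-8')) / len(ids):.2f} bytes/token")
print(f"speed:       {len(ids) / elapsed / 1e6:.2f}M tokens/second")
```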