Tokenizer Module

This module handles all tokenization tasks for the Mini-LLM project, converting raw text into numerical tokens that the model can process.

Overview

The tokenizer uses SentencePiece with Byte Pair Encoding (BPE) to build a 32,000-token vocabulary. BPE is the same algorithm used by GPT-3, GPT-4, and the LLaMA models.
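
For illustration, here is a minimal sketch of that conversion (it assumes the trained model described below already exists at Tokenizer/BPE/spm.model):

import sentencepiece as spm

# Load the trained BPE model (path taken from the directory layout below)
sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

text = "Tokenization converts raw text into numbers."
pieces = sp.encode(text, out_type=str)   # subword strings, e.g. ['▁Token', 'ization', ...] (exact split depends on training)
ids = sp.encode(text, out_type=int)      # the integer IDs the model actually consumes

print(pieces)
print(ids)
print(sp.decode(ids))                    # round-trips back to the original text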

Directory Structure

Tokenizer/
├── BPE/                      # BPE tokenizer artifacts
│   ├── spm.model             # Trained SentencePiece model
│   ├── spm.vocab             # Vocabulary file
│   ├── tokenizer.json        # HuggingFace format
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── Unigram/                  # Unigram tokenizer (baseline)
│   └── ...
├── train_spm_bpe.py          # Train BPE tokenizer
├── train_spm_unigram.py      # Train Unigram tokenizer
└── convert_to_hf.py          # Convert to HuggingFace format

How It Works

1. Training the Tokenizer

Script: train_spm_bpe.py

import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="data/raw/merged_text/corpus.txt",
    model_prefix="Tokenizer/BPE/spm",
    vocab_size=32000,
    model_type="bpe",
    byte_fallback=True,  # Handles emojis, special chars
    character_coverage=1.0,
    user_defined_symbols=["<user>", "<assistant>", "<system>"]
)

What happens:

  1. Reads the raw text corpus
  2. Learns byte-pair merges (e.g., "th" + "e" → "the")
  3. Builds a vocabulary of the 32,000 most frequent tokens
  4. Saves the model to spm.model (see the verification sketch below)
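
A quick sanity check of the trained model (a sketch; piece IDs depend on the training run):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

print(sp.get_piece_size())        # 32000, the configured vocab_size
print(sp.piece_to_id("<user>"))   # user-defined symbols get their own IDs
print(sp.id_to_piece(3))          # inspect any entry of the vocabulary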

2. Example: Tokenization Process

Input Text:

"Hello world! <user> write code </s>"

Tokenization Steps:

┌─────────────────────────────────────────┐
│ 1. Text Input                           │
│    "Hello world! <user> write code"     │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│ 2. BPE Segmentation                     │
│    ['H', 'ello', '▁world', '!',         │
│     '▁', '<user>', '▁write', '▁code']   │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│ 3. Token IDs                            │
│    [334, 3855, 288, 267, 2959,          │
│     354, 267, 12397]                    │
└─────────────────────────────────────────┘

Key Features:

  • ▁ represents a space (SentencePiece convention)
  • Special tokens like <user> are preserved as single pieces
  • Byte fallback handles emojis: 🔥 → <0xF0><0x9F><0x94><0xA5> (demonstrated in the sketch below)
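
These behaviours can be checked directly against the SentencePiece model (a sketch; the exact pieces depend on the trained vocabulary):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="Tokenizer/BPE/spm.model")

# Spaces become the ▁ marker and user-defined tokens survive as single pieces
print(sp.encode("Hello <user>", out_type=str))

# Byte fallback: an emoji missing from the vocabulary is split into byte pieces
print(sp.encode("🔥", out_type=str))   # e.g. ['▁', '<0xF0>', '<0x9F>', '<0x94>', '<0xA5>']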

3. Converting to HuggingFace Format

Script: convert_to_hf.py

from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast(vocab_file="Tokenizer/BPE/spm.model")
tokenizer.add_special_tokens({
    'bos_token': '<s>',
    'eos_token': '</s>',
    'unk_token': '<unk>',
    'pad_token': '<pad>'
})
tokenizer.save_pretrained("Tokenizer/BPE")

This creates tokenizer.json and config files compatible with HuggingFace Transformers.

Usage

Load Tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")

Encode Text

text = "Hello world!"
ids = tokenizer.encode(text)
# Output: [1, 334, 3855, 288, 267, 2]
#         [<s>, H, ello, ▁world, !, </s>]

Decode IDs

decoded = tokenizer.decode(ids)
# Output: "<s> Hello world! </s>"

decoded = tokenizer.decode(ids, skip_special_tokens=True)
# Output: "Hello world!"

BPE vs Unigram

Feature          BPE                    Unigram
Algorithm        Merge frequent pairs   Probabilistic segmentation
Emoji Handling   ✅ Byte fallback       ❌ Creates <unk>
URL Handling     ✅ Clean splits        ⚠️ Unstable
Used By          GPT-3, GPT-4, LLaMA    T5, ALBERT
Recommendation   ✅ Primary             Baseline only
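
For reference, the Unigram baseline is trained with the same SentencePiece API; the sketch below assumes train_spm_unigram.py mirrors the BPE script with only model_type and the output prefix changed (the actual script may differ):

import sentencepiece as spm

# Assumed Unigram training call, mirroring train_spm_bpe.py
spm.SentencePieceTrainer.Train(
    input="data/raw/merged_text/corpus.txt",
    model_prefix="Tokenizer/Unigram/spm",
    vocab_size=32000,
    model_type="unigram",
    character_coverage=1.0,
    user_defined_symbols=["<user>", "<assistant>", "<system>"]
)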

Vocabulary Statistics

  • Total Tokens: 32,000
  • Special Tokens: 4 (<s>, </s>, <unk>, <pad>)
  • User-Defined: 3 (<user>, <assistant>, <system>)
  • Coverage: 100% (byte fallback ensures no <unk>)
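
These figures can be confirmed from the converted HuggingFace tokenizer (a sketch that assumes the conversion step above has been run):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")

print(len(tokenizer))                              # total vocabulary size
print(tokenizer.all_special_tokens)                # <s>, </s>, <unk>, <pad>
print(tokenizer.convert_tokens_to_ids("<user>"))   # user-defined symbols map to single IDs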

Performance

  • Compression Ratio: ~3.5 bytes/token (English text)
  • Tokenization Speed: ~1M tokens/second
  • Vocab Usage: ~70% of tokens used in typical corpus
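
The compression ratio can be estimated on any text sample (an illustrative sketch; the ~3.5 bytes/token figure varies with the text, and the file path is reused from the training section):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")

# Read a sample of the training corpus (any UTF-8 text file works)
with open("data/raw/merged_text/corpus.txt", "r", encoding="utf-8", errors="ignore") as f:
    text = f.read(1_000_000)

n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
print(len(text.encode("utf-8")) / n_tokens, "bytes per token")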

References