Data Module

This module handles all data preprocessing, tokenization, and preparation for training.

Overview

The data pipeline converts raw text into binary token files optimized for training:

  • Raw text collection from multiple sources
  • Tokenization using BPE tokenizer
  • Binary serialization for efficient loading
  • Train/validation splitting

Directory Structure

data/
├── raw/                    # Raw text sources
│   ├── books/              # Book corpus
│   ├── wikipedia/          # Wikipedia dumps
│   ├── fineweb/            # Web crawl data
│   └── merged_text/
│       └── corpus.txt      # Combined corpus
├── bin/                    # Tokenized binary files
│   ├── train.bin           # Training data (uint16)
│   └── val.bin             # Validation data (uint16)
└── prepare_data.py         # Tokenization script

Data Processing Pipeline

┌─────────────────────────────────────────────┐
│ 1. Raw Text Sources                         │
│    - Books: 15 files                        │
│    - Wikipedia: 3 dumps                     │
│    - FineWeb: 1 crawl                       │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│ 2. Merge & Clean                            │
│    → corpus.txt (all text combined)         │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│ 3. Tokenize (prepare_data.py)               │
│    - Load BPE tokenizer                     │
│    - Process line-by-line                   │
│    - Append EOS tokens                      │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│ 4. Convert to NumPy (uint16)                │
│    - Vocab size: 32,000 fits in uint16      │
│    - Memory efficient (2 bytes/token)       │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│ 5. Train/Val Split (90/10)                  │
│    - train.bin: 325M tokens                 │
│    - val.bin: 36M tokens                    │
└─────────────────────────────────────────────┘

Data Preparation Script

File: prepare_data.py

import os

import numpy as np
from transformers import AutoTokenizer
from tqdm import tqdm

# 1. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
eos_id = tokenizer.eos_token_id

# 2. Read corpus
with open("data/raw/merged_text/corpus.txt", encoding="utf-8") as f:
    lines = f.readlines()

# 3. Tokenize
all_tokens = []
for line in tqdm(lines):
    tokens = tokenizer.encode(line.strip())
    tokens.append(eos_id)  # Mark end of line
    all_tokens.extend(tokens)

# 4. Convert to uint16
ids = np.array(all_tokens, dtype=np.uint16)

# 5. Split
val_count = int(len(ids) * 0.1)
train_ids = ids[:-val_count]
val_ids = ids[-val_count:]

# 6. Save (make sure the output directory exists)
os.makedirs("data/bin", exist_ok=True)
train_ids.tofile("data/bin/train.bin")
val_ids.tofile("data/bin/val.bin")
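
After running the script, a quick sanity check (a sketch; paths follow the layout above) is to memory-map the two output files and confirm the token counts and ID range:

import numpy as np

# Map the binaries without loading them into RAM
train = np.memmap("data/bin/train.bin", dtype=np.uint16, mode="r")
val = np.memmap("data/bin/val.bin", dtype=np.uint16, mode="r")
print(f"train: {len(train):,} tokens")
print(f"val:   {len(val):,} tokens")
print("max token id:", int(max(train.max(), val.max())))  # should be < 32,000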

Example: Text → Tokens

Input Text (corpus.txt):

The quick brown fox jumps over the lazy dog.
Machine learning is transforming the world.

Tokenization Process:

Line 1: "The quick brown fox jumps over the lazy dog."
  Tokens: [1, 334, 3855, 288, 267, 2959, 354, 267, 12397, 8885, 2]
          [<s>, The, quick, brown, fox, jumps, over, the, lazy, dog, </s>]

Line 2: "Machine learning is transforming the world."
  Tokens: [1, 5234, 1234, 456, 7890, 267, 9876, 2]
          [<s>, Machine, learning, is, transforming, the, world, </s>]

Combined: [1, 334, 3855, ..., 2, 1, 5234, ..., 2]
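
The exact IDs above are illustrative; they depend on the trained BPE vocabulary. You can reproduce the per-line step with the same tokenizer prepare_data.py loads:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
line = "The quick brown fox jumps over the lazy dog."
ids = tokenizer.encode(line) + [tokenizer.eos_token_id]  # same steps as prepare_data.py
print(ids)                                   # token IDs, ending in the EOS id (2)
print(tokenizer.convert_ids_to_tokens(ids))  # the corresponding token strings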

Binary Format:

train.bin structure:
  Bytes 0-1:   Token 0 (uint16)
  Bytes 2-3:   Token 1 (uint16)
  Bytes 4-5:   Token 2 (uint16)
  ...
  Bytes 2k-(2k+1): Token k (uint16)

Total size: 325,004,796 tokens × 2 bytes = ~650 MB
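
Because every token occupies a fixed two bytes, token k always starts at byte offset 2k, so any token can be read directly without parsing the rest of the file. A small sketch (assuming the files were written on a little-endian machine, NumPy's native order on x86/ARM):

import struct

def read_token(path, k):
    """Read token k from a .bin file by seeking to its 2-byte slot."""
    with open(path, "rb") as f:
        f.seek(2 * k)                                 # token k occupies bytes 2k..2k+1
        (token_id,) = struct.unpack("<H", f.read(2))  # one little-endian uint16
    return token_id

print(read_token("data/bin/train.bin", 0))  # first token of the training split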

Dataset Statistics

Corpus Size

Raw Text:
  - Total files: 19
  - Total size: ~1.4 GB
  - Total lines: ~5.2M

Tokenized:
  - Total tokens: 361,116,440
  - Train tokens: 325,004,796 (90%)
  - Val tokens: 36,111,644 (10%)
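
Since each token is exactly two bytes, these counts can be re-derived from the file sizes alone, which makes for a quick consistency check:

import os

for split in ("train", "val"):
    path = f"data/bin/{split}.bin"
    n_tokens = os.path.getsize(path) // 2  # 2 bytes per uint16 token
    print(f"{split}: {n_tokens:,} tokens")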

Usage

Prepare Data

# Tokenize corpus
python data/prepare_data.py

Output:

Loading tokenizer from Tokenizer/BPE...
Vocab size: 32000
EOS ID: 2
Reading data/raw/merged_text/corpus.txt...
Total lines: 5,234,567
Tokenizing...
100%|████████████| 5.2M/5.2M [02:34<00:00]
Total tokens: 361,116,440
Train tokens: 325,004,796
Val tokens:   36,111,644
✅ Saved binary files to data/bin/

Load in Training

from train.dataloader import DataLoader

loader = DataLoader("data/bin", batch_size=16, block_size=512, split="train")
x, y = loader.get_batch(device="cuda")

# x: [16, 512] input tokens
# y: [16, 512] target tokens (shifted by 1)
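
The DataLoader internals aren't shown here, but conceptually get_batch samples random windows from the memory-mapped token stream and returns the same window shifted by one position as the target. A minimal sketch of that idea (function and variable names are illustrative, not the actual train.dataloader API):

import numpy as np
import torch

def get_batch_sketch(bin_path, batch_size, block_size, device="cuda"):
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")
    # Random starting offsets, leaving room for the shifted target
    starts = torch.randint(len(data) - block_size - 1, (batch_size,)).tolist()
    x = torch.stack([torch.from_numpy(data[i : i + block_size].astype(np.int64)) for i in starts])
    y = torch.stack([torch.from_numpy(data[i + 1 : i + 1 + block_size].astype(np.int64)) for i in starts])
    return x.to(device), y.to(device)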

Memory-Mapped Loading

The binary files are loaded using np.memmap for efficiency:

# Traditional loading (BAD)
data = np.fromfile("train.bin", dtype=np.uint16)  # Loads 650MB into RAM!

# Memory-mapped loading (GOOD)
data = np.memmap("train.bin", dtype=np.uint16, mode='r')  # OS handles paging

Benefits:

  • No RAM overhead: File stays on disk
  • Fast random access: OS caches hot pages
  • Scalable: Works with TB-scale datasets
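
A quick way to see the difference on your own machine (a sketch; timings will vary with disk and page cache) is to compare how long each approach takes to make the data available:

import time
import numpy as np

t0 = time.perf_counter()
mm = np.memmap("data/bin/train.bin", dtype=np.uint16, mode="r")  # no bulk read
t1 = time.perf_counter()
full = np.fromfile("data/bin/train.bin", dtype=np.uint16)        # reads ~650 MB into RAM
t2 = time.perf_counter()
print(f"memmap open: {t1 - t0:.4f}s, fromfile load: {t2 - t1:.2f}s")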
