Ashx098 committed on
Commit f4e346e · verified · 1 Parent(s): a433a25

Upload folder using huggingface_hub

Files changed (30)
  1. .gitattributes +1 -0
  2. data/README.md +213 -0
  3. data/bin/train.bin +3 -0
  4. data/bin/val.bin +3 -0
  5. data/prepare_data.py +72 -0
  6. data/raw/books/.gitattributes +27 -0
  7. data/raw/books/README.md +344 -0
  8. data/raw/books/wikitext-103-raw-v1/test-00000-of-00001.parquet +3 -0
  9. data/raw/books/wikitext-103-raw-v1/train-00000-of-00002.parquet +3 -0
  10. data/raw/books/wikitext-103-raw-v1/train-00001-of-00002.parquet +3 -0
  11. data/raw/books/wikitext-103-raw-v1/validation-00000-of-00001.parquet +3 -0
  12. data/raw/books/wikitext-103-v1/test-00000-of-00001.parquet +3 -0
  13. data/raw/books/wikitext-103-v1/train-00000-of-00002.parquet +3 -0
  14. data/raw/books/wikitext-103-v1/train-00001-of-00002.parquet +3 -0
  15. data/raw/books/wikitext-103-v1/validation-00000-of-00001.parquet +3 -0
  16. data/raw/books/wikitext-2-raw-v1/test-00000-of-00001.parquet +3 -0
  17. data/raw/books/wikitext-2-raw-v1/train-00000-of-00001.parquet +3 -0
  18. data/raw/books/wikitext-2-raw-v1/validation-00000-of-00001.parquet +3 -0
  19. data/raw/books/wikitext-2-v1/test-00000-of-00001.parquet +3 -0
  20. data/raw/books/wikitext-2-v1/train-00000-of-00001.parquet +3 -0
  21. data/raw/books/wikitext-2-v1/validation-00000-of-00001.parquet +3 -0
  22. data/raw/extract_all.py +53 -0
  23. data/raw/fineweb/.gitattributes +59 -0
  24. data/raw/fineweb/train-00000-of-00099.parquet +3 -0
  25. data/raw/merged_text/corpus.txt +3 -0
  26. data/raw/verify_compression_ratio.py +15 -0
  27. data/raw/wikipedia/.gitattributes +59 -0
  28. data/raw/wikipedia/README.md +26 -0
  29. data/raw/wikipedia/data/test-00000-of-00001.parquet +3 -0
  30. data/raw/wikipedia/data/train-00000-of-00001.parquet +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ data/raw/merged_text/corpus.txt filter=lfs diff=lfs merge=lfs -text
data/README.md ADDED
@@ -0,0 +1,213 @@
+ # Data Module
+
+ This module handles all data preprocessing, tokenization, and preparation for training.
+
+ ## Overview
+
+ The data pipeline converts raw text into binary token files optimized for training:
+ - **Raw text collection** from multiple sources
+ - **Tokenization** using BPE tokenizer
+ - **Binary serialization** for efficient loading
+ - **Train/validation splitting**
+
+ ## Directory Structure
+
+ ```
+ data/
+ ├── raw/                    # Raw text sources
+ │   ├── books/              # Book corpus
+ │   ├── wikipedia/          # Wikipedia dumps
+ │   ├── fineweb/            # Web crawl data
+ │   └── merged_text/
+ │       └── corpus.txt      # Combined corpus
+ ├── bin/                    # Tokenized binary files
+ │   ├── train.bin           # Training data (uint16)
+ │   └── val.bin             # Validation data (uint16)
+ └── prepare_data.py         # Tokenization script
+ ```
+
+ ## Data Processing Pipeline
+
+ ```
+ ┌─────────────────────────────────────────────┐
+ │ 1. Raw Text Sources │
+ │    - Books: 15 files │
+ │    - Wikipedia: 3 dumps │
+ │    - FineWeb: 1 crawl │
+ └──────────────────┬──────────────────────────┘
+
+
+ ┌─────────────────────────────────────────────┐
+ │ 2. Merge & Clean │
+ │    → corpus.txt (all text combined) │
+ └──────────────────┬──────────────────────────┘
+
+
+ ┌─────────────────────────────────────────────┐
+ │ 3. Tokenize (prepare_data.py) │
+ │    - Load BPE tokenizer │
+ │    - Process line-by-line │
+ │    - Append EOS tokens │
+ └──────────────────┬──────────────────────────┘
+
+
+ ┌─────────────────────────────────────────────┐
+ │ 4. Convert to NumPy (uint16) │
+ │    - Vocab size: 32,000 fits in uint16 │
+ │    - Memory efficient (2 bytes/token) │
+ └──────────────────┬──────────────────────────┘
+
+
+ ┌─────────────────────────────────────────────┐
+ │ 5. Train/Val Split (90/10) │
+ │    - train.bin: 325M tokens │
+ │    - val.bin: 36M tokens │
+ └─────────────────────────────────────────────┘
+ ```
+
+ ## Data Preparation Script
+
+ **File**: `prepare_data.py`
+
+ ```python
+ import numpy as np
+ from transformers import AutoTokenizer
+ from tqdm import tqdm
+
+ # 1. Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("Tokenizer/BPE")
+ eos_id = tokenizer.eos_token_id
+
+ # 2. Read corpus
+ with open("data/raw/merged_text/corpus.txt") as f:
+     lines = f.readlines()
+
+ # 3. Tokenize
+ all_tokens = []
+ for line in tqdm(lines):
+     tokens = tokenizer.encode(line.strip())
+     tokens.append(eos_id)  # Mark end of line
+     all_tokens.extend(tokens)
+
+ # 4. Convert to uint16
+ ids = np.array(all_tokens, dtype=np.uint16)
+
+ # 5. Split
+ val_count = int(len(ids) * 0.1)
+ train_ids = ids[:-val_count]
+ val_ids = ids[-val_count:]
+
+ # 6. Save
+ train_ids.tofile("data/bin/train.bin")
+ val_ids.tofile("data/bin/val.bin")
+ ```
+
+ ## Example: Text → Tokens
+
+ **Input Text** (`corpus.txt`):
+ ```
+ The quick brown fox jumps over the lazy dog.
+ Machine learning is transforming the world.
+ ```
+
+ **Tokenization Process**:
+
+ ```
+ Line 1: "The quick brown fox jumps over the lazy dog."
+ Tokens: [1, 334, 3855, 288, 267, 2959, 354, 267, 12397, 8885, 2]
+         [<s>, The, quick, brown, fox, jumps, over, the, lazy, dog, </s>]
+
+ Line 2: "Machine learning is transforming the world."
+ Tokens: [1, 5234, 1234, 456, 7890, 267, 9876, 2]
+         [<s>, Machine, learning, is, transforming, the, world, </s>]
+
+ Combined: [1, 334, 3855, ..., 2, 1, 5234, ..., 2]
+ ```
+
+ **Binary Format**:
+
+ ```
+ train.bin structure:
+ Byte 0-1:   Token 0 (uint16)
+ Byte 2-3:   Token 1 (uint16)
+ Byte 4-5:   Token 2 (uint16)
+ ...
+ Byte N-2:N  Token N/2 (uint16)
+
+ Total size: 325,004,796 tokens × 2 bytes = ~650 MB
+ ```
+
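Since every token is stored as a `uint16`, the token count should equal the file size in bytes divided by two (650,009,592 / 2 = 325,004,796 for `train.bin`). A minimal sanity check along those lines, assuming the paths above (illustrative, not part of the repository):

```python
import os
import numpy as np

data = np.memmap("data/bin/train.bin", dtype=np.uint16, mode="r")
assert data.shape[0] == os.path.getsize("data/bin/train.bin") // 2  # 325,004,796 tokens
print(data[:10])  # first few token IDs of the training stream
```
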
+ ## Dataset Statistics
+
+ ### Corpus Size
+
+ ```
+ Raw Text:
+ - Total files: 19
+ - Total size: ~1.4 GB
+ - Total lines: ~5.2M
+
+ Tokenized:
+ - Total tokens: 361,116,440
+ - Train tokens: 325,004,796 (90%)
+ - Val tokens: 36,111,644 (10%)
+ ```
+
+ ## Usage
+
+ ### Prepare Data
+
+ ```bash
+ # Tokenize corpus
+ python data/prepare_data.py
+ ```
+
+ **Output:**
+ ```
+ Loading tokenizer from Tokenizer/BPE...
+ Vocab size: 32000
+ EOS ID: 2
+ Reading data/raw/merged_text/corpus.txt...
+ Total lines: 5,234,567
+ Tokenizing...
+ 100%|████████████| 5.2M/5.2M [02:34<00:00]
+ Total tokens: 361,116,440
+ Train tokens: 325,004,796
+ Val tokens: 36,111,644
+ ✅ Saved binary files to data/bin/
+ ```
+
+ ### Load in Training
+
+ ```python
+ from train.dataloader import DataLoader
+
+ loader = DataLoader("data/bin", batch_size=16, block_size=512, split="train")
+ x, y = loader.get_batch(device="cuda")
+
+ # x: [16, 512] input tokens
+ # y: [16, 512] target tokens (shifted by 1)
+ ```
+
+ ## Memory-Mapped Loading
+
+ The binary files are loaded using `np.memmap` for efficiency:
+
+ ```python
+ # Traditional loading (BAD)
+ data = np.fromfile("train.bin", dtype=np.uint16)  # Loads 650MB into RAM!
+
+ # Memory-mapped loading (GOOD)
+ data = np.memmap("train.bin", dtype=np.uint16, mode='r')  # OS handles paging
+ ```
+
+ **Benefits:**
+ - **No RAM overhead**: File stays on disk
+ - **Fast random access**: OS caches hot pages
+ - **Scalable**: Works with TB-scale datasets
+
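The `DataLoader` in `train.dataloader` is not part of this commit; a minimal sketch of what memmap-backed batch sampling typically looks like, reusing the `batch_size`/`block_size` values and the one-token shift from the usage example above (the function name and details are assumptions):

```python
import numpy as np
import torch

def get_batch(bin_path, batch_size=16, block_size=512, device="cuda"):
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")  # stays on disk, paged by the OS
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)
```
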
+ ## References
+
+ - [The Pile: An 800GB Dataset](https://arxiv.org/abs/2101.00027)
+ - [Data Quality for Language Models](https://arxiv.org/abs/2201.06009)
+ - [Efficient Data Loading](https://pytorch.org/docs/stable/data.html)
data/bin/train.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:de480e746786af5675ce42681e009835772c7688567c16ceb429239dfb8eb38b
+ size 650009592
data/bin/val.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d0715cff0afb66a0f922639521ec5aaa7e75803134b4283df10501f981a20954
+ size 72223288
data/prepare_data.py ADDED
@@ -0,0 +1,72 @@
+ import os
+ import numpy as np
+ from transformers import AutoTokenizer
+ from tqdm import tqdm
+
+ def process_data():
+     # 1. Config
+     input_file_path = "data/raw/merged_text/corpus.txt"  # PATH TO YOUR DATA
+     tokenizer_path = "Tokenizer/BPE"                     # PATH TO YOUR NEW TOKENIZER
+     output_dir = "data/bin"
+     val_split_ratio = 0.1  # 10% for validation
+
+     os.makedirs(output_dir, exist_ok=True)
+
+     # 2. Load Tokenizer
+     print(f"Loading tokenizer from {tokenizer_path}...")
+     tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
+
+     # Ensure eos_token is present (usually ID 2)
+     eos_id = tokenizer.eos_token_id
+     print(f"Vocab size: {tokenizer.vocab_size}")
+     print(f"EOS ID: {eos_id}")
+
+     # 3. Read Data
+     print(f"Reading {input_file_path}...")
+     with open(input_file_path, 'r', encoding='utf-8') as f:
+         # Read all lines
+         lines = f.readlines()
+
+     print(f"Total lines: {len(lines):,}")
+
+     # 4. Tokenize
+     # We use a simple sequential loop for the 80M scale.
+     # For 100B scale, we would use parallel processing (multiprocessing).
+     print("Tokenizing...")
+     all_tokens = []
+
+     # Using tqdm for progress bar
+     for line in tqdm(lines):
+         text = line.strip()
+         if not text:
+             continue
+
+         # Encode text and append EOS token
+         # This tells the model where one sentence ends and the next begins
+         tokens = tokenizer.encode(text)
+         tokens.append(eos_id)
+         all_tokens.extend(tokens)
+
+     token_count = len(all_tokens)
+     print(f"Total tokens: {token_count:,}")
+
+     # 5. Convert to Numpy (uint16 saves 50% RAM)
+     # 32,000 fits easily in uint16 (max 65,535)
+     ids = np.array(all_tokens, dtype=np.uint16)
+
+     # 6. Split Train/Val
+     val_count = int(token_count * val_split_ratio)
+     train_ids = ids[:-val_count]
+     val_ids = ids[-val_count:]
+
+     print(f"Train tokens: {len(train_ids):,}")
+     print(f"Val tokens: {len(val_ids):,}")
+
+     # 7. Save to disk (Memory Mapped friendly)
+     train_ids.tofile(os.path.join(output_dir, "train.bin"))
+     val_ids.tofile(os.path.join(output_dir, "val.bin"))
+
+     print(f"✅ Saved binary files to {output_dir}/")
+
+ if __name__ == "__main__":
+     process_data()
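The comment in step 4 notes that a much larger corpus would call for parallel tokenization instead of a single loop. A rough sketch of that variant (worker count, chunk size, and helper names are illustrative assumptions, not part of this repository):

```python
import numpy as np
from multiprocessing import Pool
from transformers import AutoTokenizer

TOKENIZER_PATH = "Tokenizer/BPE"

def encode_chunk(lines):
    # Each worker loads its own tokenizer copy; encoding dominates the runtime.
    tok = AutoTokenizer.from_pretrained(TOKENIZER_PATH)
    eos_id = tok.eos_token_id
    out = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        ids = tok.encode(text)
        ids.append(eos_id)
        out.extend(ids)
    return out

def parallel_tokenize(lines, workers=8, chunk_size=50_000):
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with Pool(workers) as pool:
        results = pool.map(encode_chunk, chunks)  # chunk order is preserved
    return np.array([t for chunk in results for t in chunk], dtype=np.uint16)
```

Because `Pool.map` returns results in submission order, the concatenated token stream matches what the serial loop would produce.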
data/raw/books/.gitattributes ADDED
@@ -0,0 +1,27 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
data/raw/books/README.md ADDED
@@ -0,0 +1,344 @@
+ ---
+ annotations_creators:
+ - no-annotation
+ language_creators:
+ - crowdsourced
+ language:
+ - en
+ license:
+ - cc-by-sa-3.0
+ - gfdl
+ multilinguality:
+ - monolingual
+ size_categories:
+ - 1M<n<10M
+ source_datasets:
+ - original
+ task_categories:
+ - text-generation
+ - fill-mask
+ task_ids:
+ - language-modeling
+ - masked-language-modeling
+ paperswithcode_id: wikitext-2
+ pretty_name: WikiText
+ dataset_info:
+ - config_name: wikitext-103-raw-v1
+   features:
+   - name: text
+     dtype: string
+   splits:
+   - name: test
+     num_bytes: 1305088
+     num_examples: 4358
+   - name: train
+     num_bytes: 546500949
+     num_examples: 1801350
+   - name: validation
+     num_bytes: 1159288
+     num_examples: 3760
+   download_size: 315466397
+   dataset_size: 548965325
+ - config_name: wikitext-103-v1
+   features:
+   - name: text
+     dtype: string
+   splits:
+   - name: test
+     num_bytes: 1295575
+     num_examples: 4358
+   - name: train
+     num_bytes: 545141915
+     num_examples: 1801350
+   - name: validation
+     num_bytes: 1154751
+     num_examples: 3760
+   download_size: 313093838
+   dataset_size: 547592241
+ - config_name: wikitext-2-raw-v1
+   features:
+   - name: text
+     dtype: string
+   splits:
+   - name: test
+     num_bytes: 1305088
+     num_examples: 4358
+   - name: train
+     num_bytes: 11061717
+     num_examples: 36718
+   - name: validation
+     num_bytes: 1159288
+     num_examples: 3760
+   download_size: 7747362
+   dataset_size: 13526093
+ - config_name: wikitext-2-v1
+   features:
+   - name: text
+     dtype: string
+   splits:
+   - name: test
+     num_bytes: 1270947
+     num_examples: 4358
+   - name: train
+     num_bytes: 10918118
+     num_examples: 36718
+   - name: validation
+     num_bytes: 1134123
+     num_examples: 3760
+   download_size: 7371282
+   dataset_size: 13323188
+ configs:
+ - config_name: wikitext-103-raw-v1
+   data_files:
+   - split: test
+     path: wikitext-103-raw-v1/test-*
+   - split: train
+     path: wikitext-103-raw-v1/train-*
+   - split: validation
+     path: wikitext-103-raw-v1/validation-*
+ - config_name: wikitext-103-v1
+   data_files:
+   - split: test
+     path: wikitext-103-v1/test-*
+   - split: train
+     path: wikitext-103-v1/train-*
+   - split: validation
+     path: wikitext-103-v1/validation-*
+ - config_name: wikitext-2-raw-v1
+   data_files:
+   - split: test
+     path: wikitext-2-raw-v1/test-*
+   - split: train
+     path: wikitext-2-raw-v1/train-*
+   - split: validation
+     path: wikitext-2-raw-v1/validation-*
+ - config_name: wikitext-2-v1
+   data_files:
+   - split: test
+     path: wikitext-2-v1/test-*
+   - split: train
+     path: wikitext-2-v1/train-*
+   - split: validation
+     path: wikitext-2-v1/validation-*
+ ---
+
+ # Dataset Card for "wikitext"
+
+ ## Table of Contents
+ - [Dataset Description](#dataset-description)
+   - [Dataset Summary](#dataset-summary)
+   - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
+   - [Languages](#languages)
+ - [Dataset Structure](#dataset-structure)
+   - [Data Instances](#data-instances)
+   - [Data Fields](#data-fields)
+   - [Data Splits](#data-splits)
+ - [Dataset Creation](#dataset-creation)
+   - [Curation Rationale](#curation-rationale)
+   - [Source Data](#source-data)
+   - [Annotations](#annotations)
+   - [Personal and Sensitive Information](#personal-and-sensitive-information)
+ - [Considerations for Using the Data](#considerations-for-using-the-data)
+   - [Social Impact of Dataset](#social-impact-of-dataset)
+   - [Discussion of Biases](#discussion-of-biases)
+   - [Other Known Limitations](#other-known-limitations)
+ - [Additional Information](#additional-information)
+   - [Dataset Curators](#dataset-curators)
+   - [Licensing Information](#licensing-information)
+   - [Citation Information](#citation-information)
+   - [Contributions](#contributions)
+
+ ## Dataset Description
+
+ - **Homepage:** [https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
+ - **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+ - **Paper:** [Pointer Sentinel Mixture Models](https://arxiv.org/abs/1609.07843)
+ - **Point of Contact:** [Stephen Merity](mailto:[email protected])
+ - **Size of downloaded dataset files:** 391.41 MB
+ - **Size of the generated dataset:** 1.12 GB
+ - **Total amount of disk used:** 1.52 GB
+
+ ### Dataset Summary
+
+ The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified
+ Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
+
+ Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over
+ 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation
+ and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models
+ that can take advantage of long term dependencies.
+
+ Each subset comes in two different variants:
+ - Raw (for character level work) contains the raw tokens, before the addition of the <unk> (unknown) tokens.
+ - Non-raw (for word level work) contains only the tokens in their vocabulary (wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens).
+   The out-of-vocabulary tokens have been replaced with the <unk> token.
+
+ ### Supported Tasks and Leaderboards
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ ### Languages
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ ## Dataset Structure
+
+ ### Data Instances
+
+ #### wikitext-103-raw-v1
+
+ - **Size of downloaded dataset files:** 191.98 MB
+ - **Size of the generated dataset:** 549.42 MB
+ - **Total amount of disk used:** 741.41 MB
+
+ An example of 'validation' looks as follows.
+ ```
+ This example was too long and was cropped:
+
+ {
+     "text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..."
+ }
+ ```
+
+ #### wikitext-103-v1
+
+ - **Size of downloaded dataset files:** 190.23 MB
+ - **Size of the generated dataset:** 548.05 MB
+ - **Total amount of disk used:** 738.27 MB
+
+ An example of 'train' looks as follows.
+ ```
+ This example was too long and was cropped:
+
+ {
+     "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
+ }
+ ```
+
+ #### wikitext-2-raw-v1
+
+ - **Size of downloaded dataset files:** 4.72 MB
+ - **Size of the generated dataset:** 13.54 MB
+ - **Total amount of disk used:** 18.26 MB
+
+ An example of 'train' looks as follows.
+ ```
+ This example was too long and was cropped:
+
+ {
+     "text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..."
+ }
+ ```
+
+ #### wikitext-2-v1
+
+ - **Size of downloaded dataset files:** 4.48 MB
+ - **Size of the generated dataset:** 13.34 MB
+ - **Total amount of disk used:** 17.82 MB
+
+ An example of 'train' looks as follows.
+ ```
+ This example was too long and was cropped:
+
+ {
+     "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
+ }
+ ```
+
+ ### Data Fields
+
+ The data fields are the same among all splits.
+
+ #### wikitext-103-raw-v1
+ - `text`: a `string` feature.
+
+ #### wikitext-103-v1
+ - `text`: a `string` feature.
+
+ #### wikitext-2-raw-v1
+ - `text`: a `string` feature.
+
+ #### wikitext-2-v1
+ - `text`: a `string` feature.
+
+ ### Data Splits
+
+ | name              |  train|validation|test|
+ |-------------------|------:|---------:|---:|
+ |wikitext-103-raw-v1|1801350|      3760|4358|
+ |wikitext-103-v1    |1801350|      3760|4358|
+ |wikitext-2-raw-v1  |  36718|      3760|4358|
+ |wikitext-2-v1      |  36718|      3760|4358|
+
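Since the Parquet shards above are vendored under `data/raw/books/`, one way to load a config from the local files (a sketch using the `datasets` library; the glob paths are assumptions based on the layout in this commit):

```python
from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files={
        "train": "data/raw/books/wikitext-2-raw-v1/train-*.parquet",
        "validation": "data/raw/books/wikitext-2-raw-v1/validation-*.parquet",
        "test": "data/raw/books/wikitext-2-raw-v1/test-*.parquet",
    },
)
print(ds["train"][0]["text"])
```
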
+ ## Dataset Creation
+
+ ### Curation Rationale
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ ### Source Data
+
+ #### Initial Data Collection and Normalization
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ #### Who are the source language producers?
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ ### Annotations
+
+ #### Annotation process
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ #### Who are the annotators?
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ ### Personal and Sensitive Information
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ ## Considerations for Using the Data
+
+ ### Social Impact of Dataset
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ ### Discussion of Biases
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ ### Other Known Limitations
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ ## Additional Information
+
+ ### Dataset Curators
+
+ [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+ ### Licensing Information
+
+ The dataset is available under the [Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
+
+ ### Citation Information
+
+ ```
+ @misc{merity2016pointer,
+       title={Pointer Sentinel Mixture Models},
+       author={Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher},
+       year={2016},
+       eprint={1609.07843},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```
+
+ ### Contributions
+
+ Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham) for adding this dataset.
data/raw/books/wikitext-103-raw-v1/test-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5f1bea067869d04849c0f975a2b29c4ff47d867f484f5010ea5e861eab246d91
+ size 732610
data/raw/books/wikitext-103-raw-v1/train-00000-of-00002.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74da360f23826045b3e6ac6375411fdb15f003030aa74f2596ed08b857cb9212
+ size 156987808
data/raw/books/wikitext-103-raw-v1/train-00001-of-00002.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ba090ac30dbf5461e8dcbdd1a1b8e6f3cf9c2c756d64f0c1220450acd514f720
+ size 157088770
data/raw/books/wikitext-103-raw-v1/validation-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:204929b7ff9d6184953f867dedb860e40aa69c078fc1e54b3baaa8fb28511c4c
+ size 657209
data/raw/books/wikitext-103-v1/test-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:abdfc9f83b1103b502924072460d4c92f277c9b49c313cef3e48cfcf7428e125
+ size 721735
data/raw/books/wikitext-103-v1/train-00000-of-00002.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c2ecca8c3250e79518e45d125f3a9a757d8014f6b2d8435c602be87c1f79ec3b
+ size 155788327
data/raw/books/wikitext-103-v1/train-00001-of-00002.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:720f2503551f33c25bb822aad74d699fee4d5331a7373d0c262f1bfb01354fcf
+ size 155928670
data/raw/books/wikitext-103-v1/validation-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a586125adab06f115018c43507ac267ea70850ce6218cbb96e08bb3b4db0899b
+ size 655106
data/raw/books/wikitext-2-raw-v1/test-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5f1bea067869d04849c0f975a2b29c4ff47d867f484f5010ea5e861eab246d91
+ size 732610
data/raw/books/wikitext-2-raw-v1/train-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e83889baabc497075506f91975be5fac0d45c5290b6b20582c8cd1e853d0c9f7
+ size 6357543
data/raw/books/wikitext-2-raw-v1/validation-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:204929b7ff9d6184953f867dedb860e40aa69c078fc1e54b3baaa8fb28511c4c
+ size 657209
data/raw/books/wikitext-2-v1/test-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e6b3913da714b63a60a571698b20ff15441fb015783ea1b5285f707d4f2f00a9
+ size 685430
data/raw/books/wikitext-2-v1/train-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dfc27e4360c639dc1fba1e403bfffd53af4a5c75d5363b5724d49bf12d07cce6
+ size 6068114
data/raw/books/wikitext-2-v1/validation-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:717de9a0c1c0b0b1dfdd8f1e6ad8a30ece618bbde81f5da8207277547d324215
+ size 617738
data/raw/extract_all.py ADDED
@@ -0,0 +1,53 @@
+ import os
+ import pyarrow.parquet as pq
+ from glob import glob
+ from tqdm import tqdm
+
+ INPUT_DIRS = [
+     "books",
+     "fineweb",
+     "wikipedia",
+ ]
+
+ OUTPUT_DIR = "merged_text"
+ os.makedirs(OUTPUT_DIR, exist_ok=True)
+ OUT_FILE = os.path.join(OUTPUT_DIR, "corpus.txt")
+
+ def extract_text_from_parquet(path):
+     try:
+         table = pq.read_table(path)
+         df = table.to_pandas()
+
+         # Look for likely text column
+         for col in ["text", "content", "document", "article", "source"]:
+             if col in df.columns:
+                 return df[col].astype(str).tolist()
+
+         # Fallback: take the first string-like column
+         for col in df.columns:
+             if df[col].dtype == object:
+                 return df[col].astype(str).tolist()
+
+         return []
+     except Exception as e:
+         print(f"Error reading {path}: {e}")
+         return []
+
+ all_parquet_files = []
+ for d in INPUT_DIRS:
+     all_parquet_files.extend(glob(f"{d}/**/*.parquet", recursive=True))
+
+ print("Total parquet files found:", len(all_parquet_files))
+
+ with open(OUT_FILE, "w", encoding="utf-8") as fout:
+     for file in tqdm(all_parquet_files, desc="Extracting text"):
+         texts = extract_text_from_parquet(file)
+         for t in texts:
+             t = t.strip()
+             if len(t) < 50:
+                 continue
+             if not any(c.isalpha() for c in t):
+                 continue
+             fout.write(t + "\n\n")
+
+ print("DONE! Saved merged corpus →", OUT_FILE)
data/raw/fineweb/.gitattributes ADDED
@@ -0,0 +1,59 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.lz4 filter=lfs diff=lfs merge=lfs -text
+ *.mds filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ # Audio files - uncompressed
+ *.pcm filter=lfs diff=lfs merge=lfs -text
+ *.sam filter=lfs diff=lfs merge=lfs -text
+ *.raw filter=lfs diff=lfs merge=lfs -text
+ # Audio files - compressed
+ *.aac filter=lfs diff=lfs merge=lfs -text
+ *.flac filter=lfs diff=lfs merge=lfs -text
+ *.mp3 filter=lfs diff=lfs merge=lfs -text
+ *.ogg filter=lfs diff=lfs merge=lfs -text
+ *.wav filter=lfs diff=lfs merge=lfs -text
+ # Image files - uncompressed
+ *.bmp filter=lfs diff=lfs merge=lfs -text
+ *.gif filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.tiff filter=lfs diff=lfs merge=lfs -text
+ # Image files - compressed
+ *.jpg filter=lfs diff=lfs merge=lfs -text
+ *.jpeg filter=lfs diff=lfs merge=lfs -text
+ *.webp filter=lfs diff=lfs merge=lfs -text
+ # Video files - compressed
+ *.mp4 filter=lfs diff=lfs merge=lfs -text
+ *.webm filter=lfs diff=lfs merge=lfs -text
data/raw/fineweb/train-00000-of-00099.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f7c386575467e252ff81316a193bb1e07ebe067aec34cbbd5076ee7dd2ffe42f
+ size 289110403
data/raw/merged_text/corpus.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0ad42d10157bf9f296b7752bbabc47b936de7af220927c9be54ceeb2ecada01d
+ size 1599143862
data/raw/verify_compression_ratio.py ADDED
@@ -0,0 +1,15 @@
+ from transformers import PreTrainedTokenizerFast
+
+ tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer/hf/tokenizer.json")
+
+ with open("tokenizer/corpus.txt", "r") as f:
+     text = f.read()
+
+ num_bytes = len(text.encode("utf-8"))
+ num_tokens = len(tok.encode(text))
+
+ ratio = num_bytes / num_tokens
+ print("Compression ratio:", ratio)
+ # Expected ratio is around 3.5 to 4.5 for a good tokenizer
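As a rough cross-check against figures elsewhere in this commit: `corpus.txt` is 1,599,143,862 bytes (per its LFS pointer) and `prepare_data.py` reports 361,116,440 tokens, i.e. roughly 4.4 bytes per token, inside the expected 3.5–4.5 range (a back-of-the-envelope estimate, not this script's output):

```python
corpus_bytes = 1_599_143_862   # size of data/raw/merged_text/corpus.txt from its LFS pointer
total_tokens = 361_116_440     # total token count reported by prepare_data.py
print(corpus_bytes / total_tokens)  # ≈ 4.43 bytes per token
```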
data/raw/wikipedia/.gitattributes ADDED
@@ -0,0 +1,59 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.lz4 filter=lfs diff=lfs merge=lfs -text
+ *.mds filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ # Audio files - uncompressed
+ *.pcm filter=lfs diff=lfs merge=lfs -text
+ *.sam filter=lfs diff=lfs merge=lfs -text
+ *.raw filter=lfs diff=lfs merge=lfs -text
+ # Audio files - compressed
+ *.aac filter=lfs diff=lfs merge=lfs -text
+ *.flac filter=lfs diff=lfs merge=lfs -text
+ *.mp3 filter=lfs diff=lfs merge=lfs -text
+ *.ogg filter=lfs diff=lfs merge=lfs -text
+ *.wav filter=lfs diff=lfs merge=lfs -text
+ # Image files - uncompressed
+ *.bmp filter=lfs diff=lfs merge=lfs -text
+ *.gif filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.tiff filter=lfs diff=lfs merge=lfs -text
+ # Image files - compressed
+ *.jpg filter=lfs diff=lfs merge=lfs -text
+ *.jpeg filter=lfs diff=lfs merge=lfs -text
+ *.webp filter=lfs diff=lfs merge=lfs -text
+ # Video files - compressed
+ *.mp4 filter=lfs diff=lfs merge=lfs -text
+ *.webm filter=lfs diff=lfs merge=lfs -text
data/raw/wikipedia/README.md ADDED
@@ -0,0 +1,26 @@
+ ---
+ dataset_info:
+   features:
+   - name: text
+     dtype: string
+   - name: tokens
+     sequence: int64
+   - name: token_count
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 167968257.38066393
+     num_examples: 69445
+   - name: test
+     num_bytes: 1726968.6193360796
+     num_examples: 714
+   download_size: 49543706
+   dataset_size: 169695226.0
+ configs:
+ - config_name: default
+   data_files:
+   - split: train
+     path: data/train-*
+   - split: test
+     path: data/test-*
+ ---
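The frontmatter above declares three features (`text`, `tokens`, `token_count`). A quick way to inspect the local shards and confirm that schema (a sketch using `pyarrow`, the same reader `extract_all.py` relies on):

```python
import pyarrow.parquet as pq

table = pq.read_table("data/raw/wikipedia/data/train-00000-of-00001.parquet")
print(table.schema)    # expect: text (string), tokens (list<int64>), token_count (int64)
print(table.num_rows)  # expect 69,445 rows in the train split, per the metadata above
```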
data/raw/wikipedia/data/test-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:062222ddf69aa636b56a2c48299ece565eb85fbee8d9efbce0a1f47b436617ac
+ size 511192
data/raw/wikipedia/data/train-00000-of-00001.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4785fbc0ae815936f34a7923af854ab7752a456d64f5fc497ed7f234330afd94
+ size 49032514