KJML committed · Commit 9fc46e5 · verified · Parent(s): f715494

Update README.md

Files changed (1): README.md (+169, -0)

README.md CHANGED
@@ -148,3 +148,172 @@ outputs = model.generate(
  )

  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ````
+
+ Make sure you are using a recent version of **Transformers** and, where applicable, a PyTorch build and GPU that support FP8.
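+
+ As a quick sanity check, you can verify the environment before loading the model. This is a minimal sketch (this repository does not pin exact version requirements): it reports library versions, checks that PyTorch exposes an FP8 dtype, and checks whether the GPU's compute capability is at least 8.9 (Ada Lovelace / Hopper), which is where native FP8 matmul support starts on NVIDIA hardware.
+
+ ```python
+ import torch
+ import transformers
+
+ # Report library versions (PyTorch 2.1+ ships the FP8 dtypes used by FP8 checkpoints).
+ print("transformers:", transformers.__version__)
+ print("torch:", torch.__version__)
+ print("FP8 dtype available:", hasattr(torch, "float8_e4m3fn"))
+
+ if torch.cuda.is_available():
+     major, minor = torch.cuda.get_device_capability()
+     # Native FP8 tensor-core matmul requires compute capability >= 8.9.
+     print("GPU compute capability:", (major, minor))
+     print("Native FP8 matmul support:", (major, minor) >= (8, 9))
+ else:
+     print("No CUDA GPU detected; FP8 kernels will not be available.")
+ ```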
+
+ ---
+
+ ## Training Details
+
+ ### Training Data
+
+ No new training data is introduced in this repository.
+
+ * **This model is not trained from scratch.**
+ * It directly reuses the weights and training data of `openai/gpt-oss-20b`.
+ * For full details on the original training data and methodology, see the official gpt-oss model card and paper.
+
+ ### Training Procedure
+
+ No additional gradient-based training was performed. The steps were as follows (a sketch of the quantization step is shown after the list):
+
+ 1. Start from the base `openai/gpt-oss-20b` weights.
+ 2. Apply FP8-dynamic post-training quantization (FP8 weights, with activations quantized dynamically at inference time).
+ 3. Export the quantized weights to `safetensors` format for deployment.
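+
+ The exact tooling used for this conversion is not documented in this repository. As an illustration only, the sketch below shows how FP8-dynamic checkpoints of this kind are commonly produced with the `llm-compressor` library; the recipe, the `ignore` list, the output path, and the handling of the MoE expert layers are assumptions, not a record of the actual commands used.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from llmcompressor.transformers import oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ MODEL_ID = "openai/gpt-oss-20b"
+ SAVE_DIR = "gpt-oss-20b-FP8-Dynamic"  # hypothetical output directory
+
+ model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+ # FP8_DYNAMIC: static FP8 weight scales, activations scaled dynamically per token.
+ # The scheme is data-free, so no calibration set is required.
+ recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+
+ oneshot(model=model, recipe=recipe)
+
+ # Export the compressed weights in safetensors format.
+ model.save_pretrained(SAVE_DIR, save_compressed=True)
+ tokenizer.save_pretrained(SAVE_DIR)
+ ```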
+
+ #### Preprocessing
+
+ No extra data preprocessing was done beyond what OpenAI used for the base model.
+
+ #### Training Hyperparameters
+
+ * **Training regime for this repo:** *None* (no fine-tuning; quantization only)
+ * **Original base model:** Trained by OpenAI using high-precision training and post-training MXFP4 quantization of MoE weights (see upstream model card / paper for specifics).
+
+ #### Speeds, Sizes, Times
+
+ Exact performance depends on your hardware and FP8 support, but in general:
+
+ * **VRAM usage:** Lower than the BF16 / MXFP4 original, enabling more concurrent contexts or larger batch sizes.
+ * **Throughput:** Higher tokens/sec on FP8-capable hardware compared to running BF16 weights, especially at batch size >1.
+
+ You should benchmark on your own GPU(s) for precise numbers.
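+
+ For a rough throughput number on your own hardware, a minimal timing sketch like the one below can help. It reuses the `model` and `tokenizer` objects from the usage example above, assumes a CUDA GPU, and uses an arbitrary prompt and token budget.
+
+ ```python
+ import time
+ import torch
+
+ prompt = "Explain the difference between FP8 and BF16 in two sentences."
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Warm-up run so CUDA kernels and caches are initialized before timing.
+ model.generate(**inputs, max_new_tokens=16)
+
+ torch.cuda.synchronize()
+ start = time.perf_counter()
+ outputs = model.generate(**inputs, max_new_tokens=256)
+ torch.cuda.synchronize()
+ elapsed = time.perf_counter() - start
+
+ new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
+ print(f"{new_tokens} new tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")
+ ```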
+
+ ---
+
+ ## Evaluation
+
+ No separate benchmark suite has been run specifically for the FP8-dynamic variant at this time.
+
+ ### Testing Data, Factors & Metrics
+
+ * **Testing data:** Not re-evaluated independently here.
+ * It is reasonable to expect **similar qualitative behavior** to `openai/gpt-oss-20b`, with minor differences due to quantization.
+
+ ### Results
+
+ If you run your own evals (e.g. on reasoning or coding benchmarks), please share them via an issue, PR, or discussion thread so others can reference the results.
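+
+ As a starting point, the sketch below uses EleutherAI's `lm-evaluation-harness` (`pip install lm-eval`) through its Python API. The task choice, batch size, and `dtype` argument are placeholders; no such run has been performed for this repository.
+
+ ```python
+ import lm_eval
+
+ # Hypothetical example: score the FP8-dynamic checkpoint on GSM8K.
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args="pretrained=KJML/gpt-oss-20b-FP8-Dynamic,dtype=auto",
+     tasks=["gsm8k"],
+     batch_size=8,
+ )
+
+ print(results["results"]["gsm8k"])
+ ```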
+
+ #### Summary
+
+ * Use this model when you want **gpt-oss-20b-level reasoning** with **lower memory usage and better throughput**.
+ * Expect small quality differences vs. the original due to FP8 quantization.
+
+ ---
+
+ ## Model Examination
+
+ No additional interpretability or probing analysis has been carried out on this quantized variant.
+
+ For deeper analysis and interpretability work, refer to:
+
+ * The official gpt-oss paper / model card.
+ * Independent community evaluations of `gpt-oss-20b`.
+
+ ---
+
+ ## Environmental Impact
+
+ This repository does **not** involve training a new model.
+
+ * The main compute cost is a **one-time quantization pass** over the base weights.
+ * Carbon footprint is therefore negligible compared to the original model training.
+
+ For estimates of training-time emissions, please consult the original gpt-oss model card and related publications.
+
+ ---
+
+ ## Technical Specifications
+
+ ### Model Architecture and Objective
+
+ * **Architecture:** Mixture-of-Experts Transformer language model (same as `gpt-oss-20b`)
+ * **Objective:** Next-token prediction / causal language modeling
+ * **Quantization:**
+   * FP8 dynamic for weights and activations at inference time
+   * Intended for GPUs / accelerators that support efficient FP8 matmul
+
+ The quantization is applied in a way that preserves the original architecture and I/O behavior.
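+
+ For serving, FP8 checkpoints of this kind are typically run with an FP8-aware inference engine such as vLLM. The sketch below is an assumed deployment, not a tested configuration for this exact checkpoint; verify that your vLLM build supports both gpt-oss and this FP8 format.
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ # Hypothetical serving setup; requires a recent vLLM build with FP8 support.
+ llm = LLM(model="KJML/gpt-oss-20b-FP8-Dynamic", max_model_len=8192)
+
+ params = SamplingParams(temperature=0.7, max_tokens=256)
+ outputs = llm.generate(["Briefly explain mixture-of-experts routing."], params)
+
+ print(outputs[0].outputs[0].text)
+ ```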
+
+ ### Compute Infrastructure
+
+ Quantization was performed on a single modern GPU; exact hardware details are not recorded here (see the repository description or commit history if you need them).
+
+ #### Hardware
+
+ * Single GPU with FP8 support (for quantization and testing)
+ * Standard CPU + RAM sufficient to host the original and quantized weights
+
+ #### Software
+
+ * PyTorch (FP8-capable build)
+ * Hugging Face Transformers
+ * Supporting libraries for FP8 quantization and safetensors export
+
+ ---
+
+ ## Citation
+
+ If you use this model in academic or commercial work, please cite at least the original gpt-oss paper/model card from OpenAI:
+
+ **BibTeX:**
+
+ ```bibtex
+ @misc{openai2025gptoss120bgptoss20bmodel,
+   title={gpt-oss-120b & gpt-oss-20b Model Card},
+   author={OpenAI},
+   year={2025},
+   eprint={2508.10925},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2508.10925}
+ }
+ ```
+
+ You may also optionally reference this quantized variant as:
+
+ ```bibtex
+ @misc{kjml2025gptoss20bfp8dynamic,
+   title={KJML/gpt-oss-20b-FP8-Dynamic: FP8-dynamic Quantized Variant of gpt-oss-20b},
+   author={KJML},
+   year={2025},
+   howpublished={Hugging Face model repository},
+   url={https://huggingface.co/KJML/gpt-oss-20b-FP8-Dynamic}
+ }
+ ```
+
+ ---
+
+ ## Glossary
+
+ * **MoE (Mixture-of-Experts):** Architecture where only a subset of “experts” (parameter blocks) is active per token, reducing compute vs. dense models.
+ * **FP8 dynamic:** 8-bit floating-point representation with dynamic scaling, used to reduce memory and bandwidth while largely preserving model quality (see the sketch after this list).
+ * **Harmony format:** OpenAI’s chat / response format used to train the gpt-oss models; it must be followed at inference time for best performance.
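+
+ To make the “dynamic scaling” idea concrete, here is a small PyTorch sketch of dynamic FP8 quantization with a single per-tensor scale (real kernels typically use per-token or per-channel scales, so treat this purely as an illustration):
+
+ ```python
+ import torch
+
+ x = torch.randn(4, 8)                        # stand-in for an activation tensor
+
+ fp8 = torch.float8_e4m3fn                    # 8-bit float: 4 exponent bits, 3 mantissa bits
+ fp8_max = torch.finfo(fp8).max               # 448.0 for E4M3
+
+ # "Dynamic" scaling: the scale is derived from the live tensor, not precomputed.
+ scale = x.abs().amax() / fp8_max
+ x_fp8 = (x / scale).to(fp8)                  # quantize to FP8
+ x_back = x_fp8.to(torch.float32) * scale     # dequantize for comparison
+
+ print("max abs error:", (x - x_back).abs().max().item())
+ ```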
+
+ ---
+
+ ## More Information
+
+ * Base model details, prompts, and advanced usage examples: see `openai/gpt-oss-20b` on Hugging Face and the official gpt-oss GitHub repository.
+ * For questions, issues, or suggestions about this FP8-dynamic variant, please open an issue or discussion in this repository.
+
+ ---
+
+ ## Model Card Authors
+
+ * **Author:** KJML
+ * **Contact:** [email protected]