KJML committed · Commit 9fc46e5 · verified · Parent(s): f715494

Update README.md

Files changed (1): README.md (+169, -0)

README.md CHANGED
@@ -148,3 +148,172 @@ outputs = model.generate(
  )

  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ````
+
+ Make sure you are using a recent version of **Transformers** and, where applicable, a PyTorch build and GPU that support FP8.
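+
+ As a quick sanity check, you can verify the environment before loading the model. This is a minimal sketch (this repository does not pin exact version requirements): it reports library versions, checks that PyTorch exposes an FP8 dtype, and checks whether the GPU's compute capability is at least 8.9 (Ada Lovelace / Hopper), which is where native FP8 matmul support starts on NVIDIA hardware.
+
+ ```python
+ import torch
+ import transformers
+
+ # Report library versions (PyTorch 2.1+ ships the FP8 dtypes used by FP8 checkpoints).
+ print("transformers:", transformers.__version__)
+ print("torch:", torch.__version__)
+ print("FP8 dtype available:", hasattr(torch, "float8_e4m3fn"))
+
+ if torch.cuda.is_available():
+     major, minor = torch.cuda.get_device_capability()
+     # Native FP8 tensor-core matmul requires compute capability >= 8.9.
+     print("GPU compute capability:", (major, minor))
+     print("Native FP8 matmul support:", (major, minor) >= (8, 9))
+ else:
+     print("No CUDA GPU detected; FP8 kernels will not be available.")
+ ```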
+
+ ---
+
+ ## Training Details
+
+ ### Training Data
+
+ No new training data is introduced in this repository.
+
+ * **This model is not trained from scratch.**
+ * It directly reuses the weights and training data of `openai/gpt-oss-20b`.
+ * For full details on the original training data and methodology, see the official gpt-oss model card and paper.
+
+ ### Training Procedure
+
+ No additional gradient-based training was performed. The steps were as follows (a sketch of the quantization step is shown after the list):
+
+ 1. Start from the base `openai/gpt-oss-20b` weights.
+ 2. Apply FP8-dynamic post-training quantization (FP8 weights, with activations quantized dynamically at inference time).
+ 3. Export the quantized weights to `safetensors` format for deployment.
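+
+ The exact tooling used for this conversion is not documented in this repository. As an illustration only, the sketch below shows how FP8-dynamic checkpoints of this kind are commonly produced with the `llm-compressor` library; the recipe, the `ignore` list, the output path, and the handling of the MoE expert layers are assumptions, not a record of the actual commands used.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from llmcompressor.transformers import oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ MODEL_ID = "openai/gpt-oss-20b"
+ SAVE_DIR = "gpt-oss-20b-FP8-Dynamic"  # hypothetical output directory
+
+ model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+ # FP8_DYNAMIC: static FP8 weight scales, activations scaled dynamically per token.
+ # The scheme is data-free, so no calibration set is required.
+ recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+
+ oneshot(model=model, recipe=recipe)
+
+ # Export the compressed weights in safetensors format.
+ model.save_pretrained(SAVE_DIR, save_compressed=True)
+ tokenizer.save_pretrained(SAVE_DIR)
+ ```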
+
+ #### Preprocessing
+
+ No extra data preprocessing was done beyond what OpenAI used for the base model.
+
+ #### Training Hyperparameters
+
+ * **Training regime for this repo:** *None* (no fine-tuning; quantization only)
+ * **Original base model:** Trained by OpenAI using high-precision training and post-training MXFP4 quantization of MoE weights (see upstream model card / paper for specifics).
+
+ #### Speeds, Sizes, Times
+
+ Exact performance depends on your hardware and FP8 support, but in general:
+
+ * **VRAM usage:** Lower than the BF16 / MXFP4 original, enabling more concurrent contexts or larger batch sizes.
+ * **Throughput:** Higher tokens/sec on FP8-capable hardware compared to running BF16 weights, especially at batch size >1.
+
+ You should benchmark on your own GPU(s) for precise numbers.
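+
+ For a rough throughput number on your own hardware, a minimal timing sketch like the one below can help. It reuses the `model` and `tokenizer` objects from the usage example above, assumes a CUDA GPU, and uses an arbitrary prompt and token budget.
+
+ ```python
+ import time
+ import torch
+
+ prompt = "Explain the difference between FP8 and BF16 in two sentences."
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Warm-up run so CUDA kernels and caches are initialized before timing.
+ model.generate(**inputs, max_new_tokens=16)
+
+ torch.cuda.synchronize()
+ start = time.perf_counter()
+ outputs = model.generate(**inputs, max_new_tokens=256)
+ torch.cuda.synchronize()
+ elapsed = time.perf_counter() - start
+
+ new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
+ print(f"{new_tokens} new tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")
+ ```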
+
+ ---
+
+ ## Evaluation
+
+ No separate benchmark suite has been run specifically for the FP8-dynamic variant at this time.
+
+ ### Testing Data, Factors & Metrics
+
+ * **Testing data:** Not re-evaluated independently here.
+ * It is reasonable to expect **similar qualitative behavior** to `openai/gpt-oss-20b`, with minor differences due to quantization.
+
+ ### Results
+
+ If you run your own evals (e.g. on reasoning or coding benchmarks), please share them via an issue, PR, or discussion thread so others can reference the results.
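+
+ As a starting point, the sketch below uses EleutherAI's `lm-evaluation-harness` (`pip install lm-eval`) through its Python API. The task choice, batch size, and `dtype` argument are placeholders; no such run has been performed for this repository.
+
+ ```python
+ import lm_eval
+
+ # Hypothetical example: score the FP8-dynamic checkpoint on GSM8K.
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args="pretrained=KJML/gpt-oss-20b-FP8-Dynamic,dtype=auto",
+     tasks=["gsm8k"],
+     batch_size=8,
+ )
+
+ print(results["results"]["gsm8k"])
+ ```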
+
+ #### Summary
+
+ * Use this model when you want **gpt-oss-20b-level reasoning** with **lower memory usage and better throughput**.
+ * Expect small quality differences vs. the original due to FP8 quantization.
+
+ ---
+
+ ## Model Examination
+
+ No additional interpretability or probing analysis has been carried out on this quantized variant.
+
+ For deeper analysis and interpretability work, refer to:
+
+ * The official gpt-oss paper / model card.
+ * Independent community evaluations of `gpt-oss-20b`.
+
+ ---
+
+ ## Environmental Impact
+
+ This repository does **not** involve training a new model.
+
+ * The main compute cost is a **one-time quantization pass** over the base weights.
+ * Carbon footprint is therefore negligible compared to the original model training.
+
+ For estimates of training-time emissions, please consult the original gpt-oss model card and related publications.
+
+ ---
+
+ ## Technical Specifications
+
+ ### Model Architecture and Objective
+
+ * **Architecture:** Mixture-of-Experts Transformer language model (same as `gpt-oss-20b`)
+ * **Objective:** Next-token prediction / causal language modeling
+ * **Quantization:**
+   * FP8 dynamic for weights and activations at inference time
+   * Intended for GPUs / accelerators that support efficient FP8 matmul
+
+ The quantization is applied in a way that preserves the original architecture and I/O behavior.
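+
+ For serving, FP8 checkpoints of this kind are typically run with an FP8-aware inference engine such as vLLM. The sketch below is an assumed deployment, not a tested configuration for this exact checkpoint; verify that your vLLM build supports both gpt-oss and this FP8 format.
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ # Hypothetical serving setup; requires a recent vLLM build with FP8 support.
+ llm = LLM(model="KJML/gpt-oss-20b-FP8-Dynamic", max_model_len=8192)
+
+ params = SamplingParams(temperature=0.7, max_tokens=256)
+ outputs = llm.generate(["Briefly explain mixture-of-experts routing."], params)
+
+ print(outputs[0].outputs[0].text)
+ ```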
+
+ ### Compute Infrastructure
+
+ Quantization was performed on a single modern GPU; exact hardware details are not recorded here (see the repository description or commit history if you need them).
+
+ #### Hardware
+
+ * Single GPU with FP8 support (for quantization and testing)
+ * Standard CPU + RAM sufficient to host the original and quantized weights
+
+ #### Software
+
+ * PyTorch (FP8-capable build)
+ * Hugging Face Transformers
+ * Supporting libraries for FP8 quantization and safetensors export
+
+ ---
+
+ ## Citation
+
+ If you use this model in academic or commercial work, please cite at least the original gpt-oss paper/model card from OpenAI:
+
+ **BibTeX:**
+
+ ```bibtex
+ @misc{openai2025gptoss120bgptoss20bmodel,
+   title={gpt-oss-120b & gpt-oss-20b Model Card},
+   author={OpenAI},
+   year={2025},
+   eprint={2508.10925},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2508.10925}
+ }
+ ```
+
+ You may also optionally reference this quantized variant as:
+
+ ```bibtex
+ @misc{kjml2025gptoss20bfp8dynamic,
+   title={KJML/gpt-oss-20b-FP8-Dynamic: FP8-dynamic Quantized Variant of gpt-oss-20b},
+   author={KJML},
+   year={2025},
+   howpublished={Hugging Face model repository},
+   url={https://huggingface.co/KJML/gpt-oss-20b-FP8-Dynamic}
+ }
+ ```
+
+ ---
+
+ ## Glossary
+
+ * **MoE (Mixture-of-Experts):** Architecture where only a subset of “experts” (parameter blocks) is active per token, reducing compute vs. dense models.
+ * **FP8 dynamic:** 8-bit floating-point representation with dynamic scaling, used to reduce memory and bandwidth while largely preserving model quality (see the sketch after this list).
+ * **Harmony format:** OpenAI’s chat / response format used to train the gpt-oss models; it must be followed at inference time for best performance.
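+
+ To make the “dynamic scaling” idea concrete, here is a small PyTorch sketch of dynamic FP8 quantization with a single per-tensor scale (real kernels typically use per-token or per-channel scales, so treat this purely as an illustration):
+
+ ```python
+ import torch
+
+ x = torch.randn(4, 8)                        # stand-in for an activation tensor
+
+ fp8 = torch.float8_e4m3fn                    # 8-bit float: 4 exponent bits, 3 mantissa bits
+ fp8_max = torch.finfo(fp8).max               # 448.0 for E4M3
+
+ # "Dynamic" scaling: the scale is derived from the live tensor, not precomputed.
+ scale = x.abs().amax() / fp8_max
+ x_fp8 = (x / scale).to(fp8)                  # quantize to FP8
+ x_back = x_fp8.to(torch.float32) * scale     # dequantize for comparison
+
+ print("max abs error:", (x - x_back).abs().max().item())
+ ```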
+
+ ---
+
+ ## More Information
+
+ * Base model details, prompts, and advanced usage examples: see `openai/gpt-oss-20b` on Hugging Face and the official gpt-oss GitHub repository.
+ * For questions, issues, or suggestions about this FP8-dynamic variant, please open an issue or discussion in this repository.
+
+ ---
+
+ ## Model Card Authors
+
+ * **Author:** KJML
+ * **Contact:** [email protected]