This is a preliminary version (subject to change) of google/gemma-4-31B-it quantized to FP8. Both weights and activations are quantized to FP8 using vllm-project/llm-compressor.
This model requires a nightly vLLM wheel; see the install instructions at https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#installing-vllm
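FP8 halves the weight footprint relative to BF16. A rough back-of-the-envelope sketch (assumption: parameter count read off the model name; the real footprint also includes the KV cache and activations):

```python
# Rough weight-memory estimate: ~31B parameters at 2 bytes (BF16)
# versus 1 byte (FP8). Illustrative arithmetic only.
params = 31e9
bf16_gib = params * 2 / 1024**3  # 2 bytes per BF16 weight
fp8_gib = params * 1 / 1024**3   # 1 byte per FP8 weight
print(f"BF16 weights: ~{bf16_gib:.0f} GiB, FP8 weights: ~{fp8_gib:.0f} GiB")
```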
Evaluation, on a single B200:

```shell
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-Dynamic,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
```
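The lm_eval command above expects an OpenAI-compatible server on port 8000. A minimal launch sketch (the exact flags and tensor-parallel settings are assumptions; check the recipe page linked above):

```shell
# Serve the quantized checkpoint with vLLM before running lm_eval.
vllm serve RedHatAI/gemma-4-31B-it-FP8-Dynamic \
  --max-model-len 96000 \
  --port 8000
```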
Original:
| Tasks |Version| Filter |n-shot| Metric | |Value| |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k_platinum_cot_llama| 3|flexible-extract| 5|exact_match|↑ |0.976|± |0.0044|
| | |strict-match | 5|exact_match|↑ |0.976|± |0.0044|
FP8:
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum_cot_llama| 3|flexible-extract| 5|exact_match|↑ |0.9768|± |0.0043|
| | |strict-match | 5|exact_match|↑ |0.9777|± |0.0043|
## Creation
This model was created by applying data-free FP8 Dynamic quantization with LLM Compressor, as shown in the code snippet below.
```python
from llmcompressor import model_free_ptq

MODEL_ID = "google/gemma-4-31B-it"
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_DYNAMIC",
    # Keep the vision tower, embeddings, and lm_head unquantized
    ignore=["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"],
    max_workers=8,
    device="cuda:0",
)
```
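With the FP8_DYNAMIC scheme, activation scales are computed at runtime from each tensor's maximum magnitude rather than calibrated offline. A minimal pure-Python sketch of the idea (assumption: illustrative only, not llm-compressor's implementation; exponent range and subnormals of e4m3 are ignored for simplicity):

```python
import math

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8 e4m3

def round_to_e4m3(x):
    """Round to the nearest value with 4 significant bits (e4m3 mantissa grid)."""
    if x == 0:
        return 0.0
    m, e = math.frexp(x)             # x = m * 2**e, with 0.5 <= |m| < 1
    return round(m * 16) / 16 * 2**e  # keep 1 implicit + 3 mantissa bits

def fp8_dynamic(values):
    """Dynamically scale a tensor to the e4m3 range, quantize, dequantize."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) * scale for v in values]
```

Because the scale tracks each tensor's live range, the relative rounding error stays bounded (at most one part in sixteen here), which is why the GSM8K scores above are essentially unchanged from the BF16 original.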