OLMo-3-7B-Instruct-NVFP4-1M
NVFP4-quantized version of allenai/Olmo-3-7B-Instruct with an extended 1M-token context window via linear RoPE scaling.
Model Description
This model is an NVFP4 (4-bit floating point) quantization of OLMo-3-7B-Instruct, targeted at NVIDIA DGX Spark systems (Blackwell GB10 GPU), with Ada Lovelace GPUs also listed as supported. It was quantized with NVIDIA's ModelOpt library using two-level scaling: an FP8 (E4M3) scale per 16-value block plus an FP32 global scale.
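To make the two-level scaling concrete, here is a minimal pure-Python sketch of NVFP4-style quantization. It is an illustration under stated assumptions, not ModelOpt's implementation: the function names are hypothetical, the E2M1 value grid and the E4M3 maximum of 448 come from the standard FP4/FP8 format definitions, and the E4M3 rounding is simplified.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0
E4M3_MAX = 448.0   # largest finite FP8 E4M3 value
BLOCK = 16         # group_size from this model card


def round_to_e4m3(x):
    """Round a non-negative float to a nearby E4M3 value (simplified sketch)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    exp = max(math.floor(math.log2(x)), -6)   # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                   # 3 mantissa bits
    return min(round(x / step) * step, E4M3_MAX)


def quantize_nvfp4(values):
    """Return (E2M1 codes, per-block E4M3 scales, FP32 global scale)."""
    assert len(values) % BLOCK == 0
    amax = max(abs(v) for v in values)
    # The global scale maps the largest block scale into the E4M3 range.
    global_scale = amax / (E2M1_MAX * E4M3_MAX) if amax > 0 else 1.0
    codes, block_scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        scale = round_to_e4m3(block_amax / E2M1_MAX / global_scale)
        block_scales.append(scale)
        denom = scale * global_scale
        for v in block:
            mag = abs(v) / denom if denom > 0 else 0.0
            nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
            codes.append(math.copysign(nearest, v))
    return codes, block_scales, global_scale


def dequantize_nvfp4(codes, block_scales, global_scale):
    """Reconstruct approximate values: code * block scale * global scale."""
    return [c * block_scales[i // BLOCK] * global_scale for i, c in enumerate(codes)]


weights = [i / 10 - 1.5 for i in range(32)]   # two blocks of toy weights
codes, scales, gscale = quantize_nvfp4(weights)
restored = dequantize_nvfp4(codes, scales, gscale)
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Each stored weight costs 4 bits plus a shared 8-bit scale per 16 values, which is where most of the size reduction comes from.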
Key Features
- Base Model: allenai/Olmo-3-7B-Instruct (7.3B parameters)
- Quantization Format: NVFP4 with group_size=16
- Context Length: 1,048,576 tokens (1M) via linear RoPE scaling
- Model Size: 5.30 GB (64% reduction from 14.60 GB)
- GPU Memory: ~5.23 GiB (64% reduction)
Performance
| Metric | Original | Quantized | Improvement |
|---|---|---|---|
| Model Size | 14.60 GB | 5.30 GB | 64% reduction |
| GPU Memory | 14.6 GB | 5.23 GiB | 64% reduction |
| Context Length | 65,536 | 1,048,576 | 16x increase |
| Inference Speed | - | 31-35 tok/s | - |
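The size figures in the table can be sanity-checked with a few lines of arithmetic. This sketch assumes GB means 10^9 bytes and uses the 7.3B parameter count from Key Features; the helper names are illustrative only.

```python
PARAMS = 7.3e9          # parameter count from Key Features
ORIGINAL_GB = 14.60     # original checkpoint size
QUANTIZED_GB = 5.30     # NVFP4 checkpoint size


def reduction_percent(before, after):
    """Relative size reduction in percent."""
    return (1 - after / before) * 100


def bits_per_param(size_gb, params=PARAMS):
    """Average storage cost per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / params


def gib_to_gb(gib):
    """Convert binary gibibytes to decimal gigabytes."""
    return gib * 2**30 / 1e9


print(round(reduction_percent(ORIGINAL_GB, QUANTIZED_GB), 1))  # 63.7
print(round(bits_per_param(ORIGINAL_GB), 2))                   # 16.0
print(round(bits_per_param(QUANTIZED_GB), 2))                  # 5.81
print(round(gib_to_gb(5.23), 2))                               # 5.62
```

The original works out to 16 bits per parameter, consistent with a 16-bit checkpoint. The quantized average of about 5.8 bits sits above the nominal 4.5 bits of NVFP4 (4-bit value plus one 8-bit scale per 16 values), which is expected when some layers, such as the excluded lm_head, stay in higher precision. Note also that 5.23 GiB is about 5.62 GB, so the GPU Memory row mixes units.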
Usage
Important: This model requires vLLM with ModelOpt quantization support. It cannot be loaded with standard transformers.
vLLM Server Deployment
python3 -m vllm.entrypoints.openai.api_server \
    --model Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M \
    --quantization modelopt \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --max-model-len 200000 \
    --served-model-name 'OLMo-3-7B-NVFP4' \
    --host 0.0.0.0 \
    --port 8000
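Once the server is running, it can be queried through its OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch, assuming the server is reachable at localhost:8000 and was started with the served model name shown above; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # host and port from the server command
MODEL_NAME = "OLMo-3-7B-NVFP4"          # must match --served-model-name


def build_chat_request(prompt, max_tokens=256, temperature=0.6):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=60):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What is artificial intelligence?"))` prints the model's reply. The request uses the served model name, not the repository ID.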
Python Usage with vLLM
from vllm import LLM, SamplingParams

# Load the NVFP4 checkpoint through vLLM's ModelOpt quantization path.
llm = LLM(
    model="Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    max_model_len=200000,
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
prompts = ["What is artificial intelligence?"]

# Generate one completion per prompt and print the text.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Requirements
- GPU: NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Blackwell)
- vLLM: A recent release that supports the modelopt quantization method
- Dependencies:
pip install vllm transformers torchao
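The compute capability requirement can be checked before launching vLLM. This is a small sketch, assuming PyTorch is installed (it is pulled in as a vLLM dependency); the helper names are hypothetical, and the check only compares the reported capability against the 8.9 minimum stated above.

```python
MIN_CAPABILITY = (8, 9)  # minimum compute capability listed in Requirements


def meets_requirement(capability):
    """Return True if a (major, minor) capability is at least 8.9."""
    return tuple(capability) >= MIN_CAPABILITY


def detect_capability(device_index=0):
    """Return the GPU's (major, minor) capability, or None if unavailable."""
    try:
        import torch  # imported lazily so the check degrades gracefully
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


capability = detect_capability()
if capability is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
elif meets_requirement(capability):
    print(f"Compute capability {capability} meets the 8.9+ requirement.")
else:
    print(f"Compute capability {capability} is below 8.9.")
```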
Quantization Details
- Algorithm: NVFP4 (4-bit floating point)
- Calibration Dataset: allenai/c4 (2048 samples)
- Calibration Length: 2048 tokens per sample
- Tool: NVIDIA ModelOpt 0.39.0
- Group Size: 16
- Excluded Layers: lm_head
Context Extension
The context was extended from 65,536 to 1,048,576 tokens (65,536 × 16) using linear RoPE scaling:
- Scaling Factor: 16x
- rope_theta: 50,000,000
- rope_scaling:
{"type": "linear", "factor": 16.0}
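Linear RoPE scaling (position interpolation) divides each position index by the scaling factor before the rotary angles are computed, so positions in the extended window are squeezed back into the range the model was trained on. The sketch below illustrates this under assumptions: a head dimension of 128 is assumed rather than taken from this card, and the function names are illustrative.

```python
ROPE_THETA = 50_000_000.0   # rope_theta from this card
FACTOR = 16.0               # linear scaling factor from this card
HEAD_DIM = 128              # assumption: not stated in this card


def scaled_position(position, factor=FACTOR):
    """Linear scaling: interpolate the position index by 1/factor."""
    return position / factor


def rope_angles(position, head_dim=HEAD_DIM, theta=ROPE_THETA, factor=FACTOR):
    """Rotary angles for one position: scaled_pos * theta^(-2i/d) per pair i."""
    pos = scaled_position(position, factor)
    return [pos * theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# Position 16 with factor 16 rotates exactly like position 1 without scaling.
print(rope_angles(16) == rope_angles(1, factor=1.0))
# The last position of the 1M window maps just below 65,536.
print(scaled_position(1_048_575))
```

The trade-off is resolution: adjacent tokens are only 1/16 of a position apart after scaling, which is why interpolated models can lose precision on fine-grained positional tasks.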
Note: The usable context depends on available GPU memory. On a 120 GB GPU at 95% memory utilization, the KV cache holds approximately 200,000 tokens, which matches the max-model-len value of 200000 used in the examples above.
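The relationship between GPU memory and usable context can be estimated with simple arithmetic. The sketch below rests on assumptions that are not stated in this card and should be checked against the model's config.json: 32 layers, 32 KV heads, a head dimension of 128, and a 16-bit KV cache. It also ignores activation memory and any layers that use a smaller attention window.

```python
# Assumed architecture values (not from this card; verify in config.json).
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
KV_DTYPE_BYTES = 2  # assumption: 16-bit KV cache


def kv_bytes_per_token(layers=NUM_LAYERS, kv_heads=NUM_KV_HEADS,
                       head_dim=HEAD_DIM, dtype_bytes=KV_DTYPE_BYTES):
    """Bytes of KV cache per token: keys and values for every layer and head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_bytes, utilization, weight_bytes):
    """Tokens that fit in the memory left after loading the weights."""
    budget = gpu_bytes * utilization - weight_bytes
    return int(budget // kv_bytes_per_token())


gpu_bytes = 120e9              # 120 GB GPU from the note above
weight_bytes = 5.23 * 2**30    # ~5.23 GiB of weights from Key Features
print(kv_bytes_per_token())
print(max_cached_tokens(gpu_bytes, 0.95, weight_bytes))
```

Under these assumptions each token costs 0.5 MiB of cache and the estimate lands slightly above 200,000 tokens, in line with the figure quoted in the note.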
Architecture Compatibility
For vLLM compatibility, the model uses:
- Architecture: Olmo2ForCausalLM
- Model Type: olmo2
This mapping allows vLLM to properly load the OLMo-3 architecture.
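A quick way to confirm the mapping in a downloaded copy of the model is to inspect its config.json. The sketch below uses the standard Hugging Face config keys `architectures` and `model_type`; the helper names and the file path are hypothetical.

```python
import json

# Values this card says the checkpoint declares for vLLM compatibility.
EXPECTED = {
    "architectures": ["Olmo2ForCausalLM"],
    "model_type": "olmo2",
}


def config_mismatches(config):
    """Return {key: actual_value} for every key that differs from EXPECTED."""
    return {
        key: config.get(key)
        for key, expected in EXPECTED.items()
        if config.get(key) != expected
    }


def load_config(path):
    """Read a config.json file from a local model directory."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Example with an in-memory config instead of a real file.
sample = {"architectures": ["Olmo2ForCausalLM"], "model_type": "olmo2"}
print(config_mismatches(sample))
```

An empty result means the config declares the architecture and model type described above; for a real checkout, pass `load_config("<model_dir>/config.json")` instead of the sample dictionary.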
Limitations
- Requires vLLM with the --quantization modelopt flag
- Cannot be loaded with standard transformers
- Requires NVIDIA GPU with FP4 support (Ada Lovelace or newer)
- Maximum usable context limited by GPU memory for KV cache
Intended Use
- Long-context instruction following and chat
- Document analysis and summarization
- Code generation and review
- Research and educational purposes
License
Apache 2.0 (inherited from base model)
Citation
@misc{olmo3-nvfp4-1m,
  author = {Ex0bit},
  title = {OLMo-3-7B-Instruct-NVFP4-1M: NVFP4 Quantized OLMo-3 with 1M Context},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M}}
}
Acknowledgments
- Base model by Allen Institute for AI (Ai2)
- Quantization using NVIDIA ModelOpt
- Inference powered by vLLM
Model tree for Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M
Base model
allenai/Olmo-3-1025-7B