OLMo-3-7B-Instruct-NVFP4-1M

NVFP4-quantized version of allenai/Olmo-3-7B-Instruct, with the context window extended to 1M tokens via linear RoPE scaling.

Model Description

This model is an NVFP4 (4-bit floating-point) quantization of OLMo-3-7B-Instruct. It targets NVIDIA DGX Spark systems with the Blackwell GB10 GPU and also supports Ada Lovelace GPUs. Quantization was done with NVIDIA's ModelOpt library using two-level scaling: an FP8 (E4M3) scale for each 16-element block plus an FP32 global scale.
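To make the two-level scaling concrete, here is a minimal pure-Python sketch. It is not ModelOpt's implementation. It assumes the commonly described NVFP4 layout: E2M1 values with magnitudes up to 6, a per-block scale bounded by the E4M3 maximum of 448, and a global scale chosen as amax / (6 * 448). For simplicity the block scale is kept as a Python float rather than rounded to E4M3, and the function names are hypothetical.

```python
# Magnitudes representable by E2M1 (FP4); a sign bit provides the negatives.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
BLOCK = 16        # group_size from the model card


def nearest_fp4(x):
    """Round x to the closest representable FP4 value."""
    mag = min(FP4_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return -mag if x < 0 else mag


def quantize(values):
    """Return (fp4_codes, block_scales, global_scale) for a flat list."""
    amax = max(abs(v) for v in values) or 1.0
    # Assumption: the global scale is chosen so block scales fit in E4M3.
    global_scale = amax / (FP4_MAX * E4M3_MAX)
    codes, block_scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block) or 1.0
        # Simplification: a real kernel would round this scale to E4M3.
        block_scale = block_amax / FP4_MAX / global_scale
        block_scales.append(block_scale)
        step = block_scale * global_scale
        codes.extend(nearest_fp4(v / step) for v in block)
    return codes, block_scales, global_scale


def dequantize(codes, block_scales, global_scale):
    """Reverse the mapping: value = code * block_scale * global_scale."""
    return [
        code * block_scales[i // BLOCK] * global_scale
        for i, code in enumerate(codes)
    ]
```

Each block's largest magnitude maps onto the FP4 maximum of 6, so the per-block scale adapts to the local range while the global scale keeps every block scale within E4M3 range.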

Key Features

Base Model: allenai/Olmo-3-7B-Instruct (7.3B parameters)
Quantization Format: NVFP4 with group_size=16
Context Length: 1,048,576 tokens (1M) via linear RoPE scaling
Model Size: 5.30 GB (64% reduction from 14.60 GB)
GPU Memory: ~5.23 GiB (64% reduction)

Performance

Metric	Original	Quantized	Change
Model Size	14.60 GB	5.30 GB	64% reduction
GPU Memory	14.6 GB	5.23 GiB	64% reduction
Context Length	4,096	1,048,576	256x increase
Inference Speed	not reported	31-35 tok/s	n/a

Usage

Important: This model requires vLLM with ModelOpt quantization support. It cannot be loaded with standard transformers.

vLLM Server Deployment

python3 -m vllm.entrypoints.openai.api_server \
    --model Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M \
    --quantization modelopt \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --max-model-len 200000 \
    --served-model-name 'OLMo-3-7B-NVFP4' \
    --host 0.0.0.0 \
    --port 8000
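Once the server is running, it can be queried through its OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch. It assumes the server is reachable on localhost at port 8000 and that the model is registered under the served name from the command above. The helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

# Assumed to match the server command: --port 8000 and --served-model-name.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "OLMo-3-7B-NVFP4"


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("What is artificial intelligence?"))` prints the model's reply. The `model` field must match the served model name, not the repository ID.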

Python Usage with vLLM

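from vllm import LLM, SamplingParams

# Load the NVFP4 checkpoint through vLLM's ModelOpt quantization path.
llm = LLM(
    model="Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,  # fraction of GPU memory vLLM may claim
    max_model_len=200000,  # cap the context so the KV cache fits in memory
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

# generate() takes raw prompt strings and does not apply the chat template.
prompts = ["What is artificial intelligence?"]
outputs = llm.generate(prompts, sampling_params)

# Each result holds one or more completions; print the first for each prompt.
for output in outputs:
    print(output.outputs[0].text)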

Requirements

GPU: NVIDIA GPU with compute capability 8.9+ (Ada Lovelace, Blackwell)
vLLM: Latest version with ModelOpt support
Dependencies: pip install vllm transformers torchao
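The GPU requirement can be checked before launching vLLM. This is a small sketch with a hypothetical helper name. With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed in directly.

```python
MIN_CAPABILITY = (8, 9)  # Ada Lovelace, per the Requirements list


def meets_requirement(capability):
    """Return True if a (major, minor) compute capability is 8.9 or higher."""
    return tuple(capability) >= MIN_CAPABILITY
```

Tuple comparison orders by major version first and then by minor version, so `(9, 0)` passes and `(8, 6)` does not.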

Quantization Details

Algorithm: NVFP4 (4-bit floating point)
Calibration Dataset: allenai/c4 (2048 samples)
Calibration Length: 2048 tokens per sample
Tool: NVIDIA ModelOpt 0.39.0
Group Size: 16
Excluded Layers: lm_head
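A short calculation shows the storage cost per quantized weight. This sketch assumes each 16-weight group stores sixteen 4-bit values plus one 8-bit block scale, and it ignores the single FP32 global scale as negligible. The overall ratio uses the sizes reported in this card.

```python
def effective_bits_per_weight(value_bits=4, scale_bits=8, group_size=16):
    """Bits stored per weight, amortizing the block scale over the group."""
    return value_bits + scale_bits / group_size


bits = effective_bits_per_weight()   # 4 + 8/16
ratio_vs_16bit = 16 / bits           # compression of a quantized layer
overall_ratio = 14.60 / 5.30         # whole-checkpoint sizes from this card

print(f"{bits} bits/weight, {ratio_vs_16bit:.2f}x per quantized layer")
print(f"{overall_ratio:.2f}x for the whole checkpoint")
```

The whole-checkpoint ratio is lower than the per-layer ratio because some tensors, such as the excluded lm_head, stay at higher precision.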

Context Extension

The context was extended from 4,096 to 1,048,576 tokens using linear RoPE scaling:

Scaling Factor: 16x
rope_theta: 50,000,000
rope_scaling: {"type": "linear", "factor": 16.0}
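Linear RoPE scaling (position interpolation) divides each position index by the scaling factor before computing the rotary angles. The sketch below illustrates only that mechanism. The head dimension of 128 is an assumption, not a value taken from this card, and `rope_angles` is a hypothetical helper.

```python
def rope_angles(position, head_dim=128, base=50_000_000.0, factor=16.0):
    """Rotary angles for one position under linear scaling.

    head_dim is an assumed value; base and factor follow the settings above.
    """
    scaled = position / factor  # linear scaling compresses the position index
    return [scaled * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# Position 16,000 at factor 16 sees the same angles as position 1,000 unscaled.
assert rope_angles(16_000) == rope_angles(1_000, factor=1.0)

# Range of unscaled positions that 1,048,576 scaled positions map onto.
unscaled_range = 1_048_576 // 16
```

With a factor of 16, the 1,048,576 scaled positions map onto a 65,536-position unscaled range. Going from 4,096 to 1,048,576 corresponds to a 256x ratio instead.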

Note: The usable context depends on available GPU memory. On a 120 GB GPU at 95% memory utilization, the KV cache holds approximately 200,000 tokens, which matches the max-model-len of 200000 used in the examples above.
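The KV cache figure can be sanity-checked with a rough estimate. Everything about the model shape here is an assumption rather than a value from this card: 32 layers, 32 KV heads, head dimension 128, and a 16-bit KV cache. The estimate also ignores activation and runtime overhead.

```python
def kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128,
                       bytes_per_value=2):
    """Bytes of KV cache per token; all shape defaults are assumptions."""
    # The factor of 2 covers one key and one value vector per head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(gpu_bytes=120e9, utilization=0.95,
                      weight_bytes=5.23 * 2**30):
    """Tokens that fit after subtracting the weights from the memory budget."""
    budget = gpu_bytes * utilization - weight_bytes
    return int(budget // kv_bytes_per_token())


print(kv_bytes_per_token(), "bytes per token")
print(max_cached_tokens(), "tokens fit in the cache budget")
```

Under these assumptions each token costs 0.5 MiB of cache and the estimate comes out a little over 200,000 tokens, in line with the note above. A different layer count, grouped-query attention, or an FP8 KV cache would change the result.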

Architecture Compatibility

For vLLM compatibility, the model uses:

Architecture: Olmo2ForCausalLM
Model Type: olmo2

This mapping allows vLLM to properly load the OLMo-3 architecture.
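A downloaded config.json can be checked against the fields this card describes. The sketch below is a hypothetical helper: `architectures`, `model_type`, and `rope_scaling` are standard Hugging Face config keys, and the expected values are taken from this section and the Context Extension section.

```python
EXPECTED = {
    "architectures": ["Olmo2ForCausalLM"],
    "model_type": "olmo2",
    "rope_scaling": {"type": "linear", "factor": 16.0},
}


def config_mismatches(config):
    """Return {key: actual_value} for every field that differs from EXPECTED."""
    return {
        key: config.get(key)
        for key, expected in EXPECTED.items()
        if config.get(key) != expected
    }
```

For example, `config_mismatches(json.load(open("config.json")))` returns an empty dict when the file matches the values described here.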

Limitations

Requires vLLM with the --quantization modelopt flag
Cannot be loaded with standard transformers
Requires an NVIDIA GPU with compute capability 8.9+ (Ada Lovelace or newer); native FP4 tensor cores were introduced with Blackwell
Maximum usable context is limited by the GPU memory available for the KV cache

Intended Use

Long-context instruction following and chat
Document analysis and summarization
Code generation and review
Research and educational purposes

License

Apache 2.0 (inherited from base model)

Citation

@misc{olmo3-nvfp4-1m,
  author = {Ex0bit},
  title = {OLMo-3-7B-Instruct-NVFP4-1M: NVFP4 Quantized OLMo-3 with 1M Context},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/OLMo-3-7B-Instruct-NVFP4-1M}}
}