# ZwZ-4B-VL-MLX-4bit
This is a 4-bit quantized MLX conversion of inclusionAI/ZwZ-4B, optimized for Apple Silicon inference using the MLX framework.
ZwZ-4B is a fine-grained multimodal perception model built on Qwen3-VL-4B, trained using Region-to-Image Distillation (R2I) combined with reinforcement learning. It achieves state-of-the-art fine-grained visual understanding among models of comparable size in a single forward pass — no inference-time zooming or tool calling required.
The 4B 4-bit variant is the most compact option in this family, ideal for fast iteration and resource-constrained environments while still benefiting from ZwZ's fine-grained perception training.
## Conversion Details
| Setting | Value |
|---|---|
| Source model | inclusionAI/ZwZ-4B |
| Conversion tool | mlx_vlm.convert (via mlx-vlm) |
| Quantization bits | 4-bit |
| Group size | 64 |
| Quantization method | Affine post-training quantization (PTQ) |
| Quant predicate | None (uniform quantization across all text/LLM layers) |
| DWQ / AWQ | Not used |
### Quantized Layers
Only the language model / text decoder layers are quantized. The following module paths are excluded from quantization and remain at their original precision:
`vision_model`, `vision_tower`, `vl_connector`, `sam_model`, `audio_model`, `audio_tower`, `code_predictor`
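To reproduce a conversion like this one, the invocation would look roughly as follows. The exact command used here is not recorded; the flag names below are assumptions based on `mlx_vlm.convert` mirroring `mlx_lm.convert`, so check `python -m mlx_vlm.convert --help` for the authoritative options:

```
pip install -U mlx-vlm
python -m mlx_vlm.convert \
  --hf-path inclusionAI/ZwZ-4B \
  -q --q-bits 4 --q-group-size 64 \
  --mlx-path ZwZ-4B-VL-MLX-4bit
```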
## Performance
Benchmarked on Apple M2 Max, 96 GB unified memory.
### Text Generation (`mlx_vlm.generate`)
| Metric | Value |
|---|---|
| Prompt tok/s | 75.6 |
| Generation tok/s | 98.2 |
| Peak memory | 4.30 GB |
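As a rough sanity check on the 4.30 GB figure (the parameter count and per-group metadata layout below are assumptions for illustration, not values read from the conversion output), the quantized decoder accounts for roughly half of the peak; the remainder is the unquantized vision tower plus runtime buffers and cache:

```python
# Back-of-envelope weight-memory estimate (assumptions: ~4e9 quantized
# text-decoder params; one fp16 scale and one fp16 bias per group of 64).
text_params = 4.0e9
bits_per_weight = 4 + (16 + 16) / 64          # 4-bit weight + amortized group metadata
quantized_gb = text_params * bits_per_weight / 8 / 1e9
print(f"quantized decoder ~= {quantized_gb:.2f} GB")
```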
### Vision Inference by Resolution (`vllm-mlx-bench`)
| Resolution | Tok/s | Memory (GB) |
|---|---|---|
| 224×224 | 76.3 | 4.25 |
| 448×448 | 65.5 | 4.70 |
| 768×768 | 41.3 | 5.70 |
| 1024×1024 | 27.5 | 6.13 |
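To translate the table into wall-clock expectations, here is a quick sketch. It assumes the listed tok/s applies to decoding a fixed-length response and ignores prompt/image processing time, so treat the results as lower bounds:

```python
# Generation tok/s by input resolution, taken from the benchmark table above.
BENCH = {
    "224x224": 76.3,
    "448x448": 65.5,
    "768x768": 41.3,
    "1024x1024": 27.5,
}

def decode_seconds(tokens: int, tok_per_s: float) -> float:
    """Estimated decode-only time for a response of `tokens` tokens."""
    return tokens / tok_per_s

for res, tps in BENCH.items():
    print(f"{res}: ~{decode_seconds(256, tps):.1f}s for a 256-token description")
```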
## Validation
| Test | Status |
|---|---|
| Text generation | ✅ |
| Image + text generation | ✅ |
| vllm-mlx serving | ✅ |
## Usage

### Installation

```bash
pip install -U mlx-vlm
```
### MLX-VLM CLI

```bash
python -m mlx_vlm.generate \
  --model swaylenhayes/ZwZ-4B-VL-MLX-4bit \
  --max-tokens 512 \
  --temperature 0.0 \
  --prompt "Describe this image in detail." \
  --image path/to/image.png
```
### Python API

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "swaylenhayes/ZwZ-4B-VL-MLX-4bit"
model, processor = load(model_path)
config = load_config(model_path)

prompt = apply_chat_template(
    processor,
    config,
    "List every interactive UI element visible in this screenshot.",
    num_images=1,
)

output = generate(
    model,
    processor,
    prompt,
    image="path/to/screenshot.png",
    max_tokens=512,
    temperature=0.0,
)
print(output)
```
### vLLM-MLX (OpenAI-compatible server)

```bash
vllm-mlx serve swaylenhayes/ZwZ-4B-VL-MLX-4bit --host 127.0.0.1 --port 8108
```

```bash
curl http://127.0.0.1:8108/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Reply with OK"}],
    "max_tokens": 16
  }'
```
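The same request can be issued from Python. This sketch only builds the JSON body the curl example sends (the endpoint URL and `"model": "default"` come from the server example above); actually sending it requires any HTTP client pointed at `http://127.0.0.1:8108/v1/chat/completions` with the server running:

```python
import json

def chat_body(prompt: str, max_tokens: int = 16) -> str:
    """Build the JSON body for the OpenAI-compatible /v1/chat/completions endpoint."""
    return json.dumps({
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

print(chat_body("Reply with OK"))
```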
## About ZwZ (Zooming without Zooming)
ZwZ transforms zooming from an inference-time tool into a training-time primitive:
- Zoom in to micro-cropped regions and let strong teacher models (Qwen3-VL-235B, GLM-4.5V) generate high-quality VQA data
- Distill this region-grounded supervision back to the full image with explicit bounding-box overlays
- Reinforce via RL training to enable single-glance fine-grained perception
This makes ZwZ particularly well-suited for tasks requiring fine visual detail recognition, such as UI screenshot parsing, document analysis, and dense image understanding.
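The "explicit bounding-box overlay" step can be pictured with a minimal sketch. The box coordinates and rendering choices here are hypothetical stand-ins; the actual R2I data format is described in the paper:

```python
from PIL import Image, ImageDraw

# Stand-in for a full training image (the real pipeline uses natural images).
img = Image.new("RGB", (640, 480), "white")
draw = ImageDraw.Draw(img)

# Hypothetical region the teacher's zoomed-in VQA pair refers to.
box = (120, 80, 300, 220)

# Drawing the box on the full image grounds the region-level Q/A visually,
# so the student learns to answer without cropping at inference time.
draw.rectangle(box, outline="red", width=3)
img.save("overlaid.png")
```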
## Links
- Original model: inclusionAI/ZwZ-4B
- Base architecture: Qwen/Qwen3-VL-4B-Instruct
- Paper: Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
- Project: github.com/inclusionAI/Zooming-without-Zooming
- Training data: inclusionAI/ZwZ-RL-VQA
- MLX framework: github.com/ml-explore/mlx
- mlx-vlm: github.com/Blaizzy/mlx-vlm
## Other Quantizations
| Variant | Link |
|---|---|
| ZwZ-8B MLX 4-bit | swaylenhayes/ZwZ-8B-VL-MLX-4bit |
| ZwZ-8B MLX 8-bit | swaylenhayes/ZwZ-8B-VL-MLX-8bit |
| ZwZ-4B MLX 4-bit | this model |
| ZwZ-4B MLX 8-bit | swaylenhayes/ZwZ-4B-VL-MLX-8bit |
## Notes and Limitations
- Quantization changes numerical behavior relative to full-precision weights. Performance may differ from the original model on edge cases.
- Throughput and memory depend on prompt length, image resolution, and runtime settings.
- Benchmark numbers reflect a quiet system with no other models loaded.
## Citation

```bibtex
@article{wei2026zooming,
  title={Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception},
  author={Wei, Lai and He, Liangbo and Lan, Jun and Dong, Lingzhong and Cai, Yutong and Li, Siyuan and Zhu, Huijia and Wang, Weiqiang and Kong, Linghe and Wang, Yue and Zhang, Zhuosheng and Huang, Weiran},
  journal={arXiv preprint arXiv:2602.11858},
  year={2026}
}
```
## License
Apache 2.0 — follows the license of the original ZwZ and Qwen3-VL models.