ZwZ-4B-VL-MLX-4bit

This is a 4-bit quantized MLX conversion of inclusionAI/ZwZ-4B, optimized for Apple Silicon inference using the MLX framework.

ZwZ-4B is a fine-grained multimodal perception model built on Qwen3-VL-4B, trained using Region-to-Image Distillation (R2I) combined with reinforcement learning. It achieves state-of-the-art fine-grained visual understanding among models of comparable size in a single forward pass — no inference-time zooming or tool calling required.

The 4B 4-bit variant is the most compact option in this family, ideal for fast iteration and resource-constrained environments while still benefiting from ZwZ's fine-grained perception training.

Conversion Details

Setting Value
Source model inclusionAI/ZwZ-4B
Conversion tool mlx_vlm.convert (via mlx-vlm)
Quantization bits 4-bit
Group size 64
Quantization method Affine post-training quantization (PTQ)
Quant predicate None (uniform quantization across all text/LLM layers)
DWQ / AWQ Not used
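For reproducibility, a conversion along these lines can be run with mlx-vlm. The flag names below mirror the mlx_vlm.convert CLI as commonly documented, but should be checked against your installed version (`python -m mlx_vlm.convert --help`):

```shell
# Reconvert the source model to 4-bit MLX with group size 64.
# Flag names are assumptions based on the mlx_vlm.convert CLI -- verify locally.
python -m mlx_vlm.convert \
  --hf-path inclusionAI/ZwZ-4B \
  --mlx-path ZwZ-4B-VL-MLX-4bit \
  -q --q-bits 4 --q-group-size 64
```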

Quantized Layers

Only the language model / text decoder layers are quantized. The following module paths are excluded from quantization and remain at their original precision:

vision_model, vision_tower, vl_connector, sam_model, audio_model, audio_tower, code_predictor
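One way to picture this exclusion is as a predicate over module paths. The sketch below is illustrative only: `EXCLUDED_PREFIXES` mirrors the list above, and `should_quantize` is a hypothetical helper, not part of the mlx-vlm API:

```python
# Illustrative predicate deciding which module paths receive 4-bit quantization.
# EXCLUDED_PREFIXES mirrors the exclusion list above; should_quantize is a
# hypothetical helper, not an mlx-vlm API.
EXCLUDED_PREFIXES = (
    "vision_model", "vision_tower", "vl_connector",
    "sam_model", "audio_model", "audio_tower", "code_predictor",
)

def should_quantize(module_path: str) -> bool:
    """Return True if a module belongs to the text decoder and may be quantized."""
    return not module_path.startswith(EXCLUDED_PREFIXES)

# Decoder layers are quantized; vision layers keep their original precision.
print(should_quantize("language_model.layers.0.self_attn.q_proj"))  # True
print(should_quantize("vision_tower.blocks.3.mlp.fc1"))             # False
```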

Performance

Benchmarked on Apple M2 Max, 96 GB unified memory.

Text Generation (mlx_vlm.generate)

Metric Value
Prompt tok/s 75.6
Generation tok/s 98.2
Peak memory 4.30 GB
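As a back-of-envelope check, assuming these rates hold steady, they imply the rough end-to-end latency of a request (the 500-token prompt length below is an arbitrary example; 512 matches the `--max-tokens` used in the usage examples):

```python
# Rough latency estimate from the benchmark rates above.
prompt_tok_s = 75.6
gen_tok_s = 98.2

prompt_tokens = 500   # example prompt length (assumption)
new_tokens = 512      # matches --max-tokens in the usage examples

latency_s = prompt_tokens / prompt_tok_s + new_tokens / gen_tok_s
print(f"~{latency_s:.1f} s")  # ~11.8 s for a 500-in / 512-out request
```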

Vision Inference by Resolution (vllm-mlx-bench)

Resolution Tok/s Memory (GB)
224×224 76.3 4.25
448×448 65.5 4.70
768×768 41.3 5.70
1024×1024 27.5 6.13

Validation

The following checks were run against this conversion:

  • Text generation
  • Image + text generation
  • vllm-mlx serving

Usage

Installation

pip install -U mlx-vlm

MLX-VLM CLI

python -m mlx_vlm.generate \
  --model swaylenhayes/ZwZ-4B-VL-MLX-4bit \
  --max-tokens 512 \
  --temperature 0.0 \
  --prompt "Describe this image in detail." \
  --image path/to/image.png

Python API

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "swaylenhayes/ZwZ-4B-VL-MLX-4bit"
model, processor = load(model_path)
config = load_config(model_path)

prompt = apply_chat_template(
    processor,
    config,
    "List every interactive UI element visible in this screenshot.",
    num_images=1,
)

output = generate(
    model,
    processor,
    prompt,
    image="path/to/screenshot.png",
    max_tokens=512,
    temperature=0.0,
)
print(output)

vLLM-MLX (OpenAI-compatible server)

vllm-mlx serve swaylenhayes/ZwZ-4B-VL-MLX-4bit --host 127.0.0.1 --port 8108

Then, from another shell:

curl http://127.0.0.1:8108/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Reply with OK"}],
    "max_tokens": 16
  }'
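For vision requests against the same endpoint, images are typically sent as OpenAI-style `image_url` content parts with a base64 data URL. The sketch below only builds the request payload; `build_vision_payload` is a hypothetical helper, and data-URL image support should be confirmed for your vllm-mlx version:

```python
import base64

def build_vision_payload(image_path: str, question: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat-completions payload with one inline image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "default",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": max_tokens,
    }
```

The resulting dict can be POSTed as JSON to http://127.0.0.1:8108/v1/chat/completions with any HTTP client.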

About ZwZ (Zooming without Zooming)

ZwZ transforms zooming from an inference-time tool into a training-time primitive:

  1. Zoom in to micro-cropped regions and let strong teacher models (Qwen3-VL-235B, GLM-4.5V) generate high-quality VQA data
  2. Distill this region-grounded supervision back to the full image with explicit bounding-box overlays
  3. Reinforce via RL training to enable single-glance fine-grained perception

This makes ZwZ particularly well-suited for tasks requiring fine visual detail recognition, such as UI screenshot parsing, document analysis, and dense image understanding.

Links

Other Quantizations

Variant Link
ZwZ-8B MLX 4-bit swaylenhayes/ZwZ-8B-VL-MLX-4bit
ZwZ-8B MLX 8-bit swaylenhayes/ZwZ-8B-VL-MLX-8bit
ZwZ-4B MLX 4-bit this model
ZwZ-4B MLX 8-bit swaylenhayes/ZwZ-4B-VL-MLX-8bit

Notes and Limitations

  • Quantization changes numerical behavior relative to full-precision weights. Performance may differ from the original model on edge cases.
  • Throughput and memory depend on prompt length, image resolution, and runtime settings.
  • Benchmark numbers reflect a quiet system with no other models loaded.

Citation

@article{wei2026zooming,
  title={Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception},
  author={Wei, Lai and He, Liangbo and Lan, Jun and Dong, Lingzhong and Cai, Yutong and Li, Siyuan and Zhu, Huijia and Wang, Weiqiang and Kong, Linghe and Wang, Yue and Zhang, Zhuosheng and Huang, Weiran},
  journal={arXiv preprint arXiv:2602.11858},
  year={2026}
}

License

Apache 2.0 — follows the license of the original ZwZ and Qwen3-VL models.
