# ZwZ-4B-VL-MLX-4bit
This is a 4-bit quantized MLX conversion of inclusionAI/ZwZ-4B, optimized for Apple Silicon inference using the MLX framework.
ZwZ-4B is a fine-grained multimodal perception model built on Qwen3-VL-4B, trained using Region-to-Image Distillation (R2I) combined with reinforcement learning. It achieves state-of-the-art fine-grained visual understanding among models of comparable size in a single forward pass — no inference-time zooming or tool calling required.
The 4B 4-bit variant is the most compact option in this family, ideal for fast iteration and resource-constrained environments while still benefiting from ZwZ's fine-grained perception training.
## Conversion Details
| Setting | Value |
|---|---|
| Source model | inclusionAI/ZwZ-4B |
| Conversion tool | mlx_vlm.convert (via mlx-vlm) |
| Quantization bits | 4-bit |
| Group size | 64 |
| Quantization method | Affine post-training quantization (PTQ) |
| Quant predicate | None (uniform quantization across all text/LLM layers) |
| DWQ / AWQ | Not used |
### Quantized Layers
Only the language model / text decoder layers are quantized. The following module paths are excluded from quantization and remain at their original precision:
`vision_model`, `vision_tower`, `vl_connector`, `sam_model`, `audio_model`, `audio_tower`, `code_predictor`
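To reproduce a conversion like this one, the invocation would look roughly as follows. The exact command used here is not recorded; the flag names below are assumptions based on `mlx_vlm.convert` mirroring `mlx_lm.convert`, so check `python -m mlx_vlm.convert --help` for the authoritative options:

```
pip install -U mlx-vlm
python -m mlx_vlm.convert \
  --hf-path inclusionAI/ZwZ-4B \
  -q --q-bits 4 --q-group-size 64 \
  --mlx-path ZwZ-4B-VL-MLX-4bit
```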
## Performance
Benchmarked on Apple M2 Max, 96 GB unified memory.
### Text Generation (`mlx_vlm.generate`)
| Metric | Value |
|---|---|
| Prompt tok/s | 75.6 |
| Generation tok/s | 98.2 |
| Peak memory | 4.30 GB |
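As a rough sanity check on the 4.30 GB figure (the parameter count and per-group metadata layout below are assumptions for illustration, not values read from the conversion output), the quantized decoder accounts for roughly half of the peak; the remainder is the unquantized vision tower plus runtime buffers and cache:

```python
# Back-of-envelope weight-memory estimate (assumptions: ~4e9 quantized
# text-decoder params; one fp16 scale and one fp16 bias per group of 64).
text_params = 4.0e9
bits_per_weight = 4 + (16 + 16) / 64          # 4-bit weight + amortized group metadata
quantized_gb = text_params * bits_per_weight / 8 / 1e9
print(f"quantized decoder ~= {quantized_gb:.2f} GB")
```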
### Vision Inference by Resolution (`vllm-mlx-bench`)
| Resolution | Tok/s | Memory (GB) |
|---|---|---|
| 224×224 | 76.3 | 4.25 |
| 448×448 | 65.5 | 4.70 |
| 768×768 | 41.3 | 5.70 |
| 1024×1024 | 27.5 | 6.13 |
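To translate the table into wall-clock expectations, here is a quick sketch. It assumes the listed tok/s applies to decoding a fixed-length response and ignores prompt/image processing time, so treat the results as lower bounds:

```python
# Generation tok/s by input resolution, taken from the benchmark table above.
BENCH = {
    "224x224": 76.3,
    "448x448": 65.5,
    "768x768": 41.3,
    "1024x1024": 27.5,
}

def decode_seconds(tokens: int, tok_per_s: float) -> float:
    """Estimated decode-only time for a response of `tokens` tokens."""
    return tokens / tok_per_s

for res, tps in BENCH.items():
    print(f"{res}: ~{decode_seconds(256, tps):.1f}s for a 256-token description")
```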
## Validation
| Test | Status |
|---|---|
| Text generation | ✅ |
| Image + text generation | ✅ |
| vllm-mlx serving | ✅ |
## Usage

### Installation

```bash
pip install -U mlx-vlm
```
### MLX-VLM CLI

```bash
python -m mlx_vlm.generate \
  --model swaylenhayes/ZwZ-4B-VL-MLX-4bit \
  --max-tokens 512 \
  --temperature 0.0 \
  --prompt "Describe this image in detail." \
  --image path/to/image.png
```
### Python API

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "swaylenhayes/ZwZ-4B-VL-MLX-4bit"
model, processor = load(model_path)
config = load_config(model_path)

prompt = apply_chat_template(
    processor,
    config,
    "List every interactive UI element visible in this screenshot.",
    num_images=1,
)

output = generate(
    model,
    processor,
    prompt,
    image="path/to/screenshot.png",
    max_tokens=512,
    temperature=0.0,
)
print(output)
```
### vLLM-MLX (OpenAI-compatible server)

```bash
vllm-mlx serve swaylenhayes/ZwZ-4B-VL-MLX-4bit --host 127.0.0.1 --port 8108
```

```bash
curl http://127.0.0.1:8108/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Reply with OK"}],
    "max_tokens": 16
  }'
```
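The same request can be issued from Python. This sketch only builds the JSON body the curl example sends (the endpoint URL and `"model": "default"` come from the server example above); actually sending it requires any HTTP client pointed at `http://127.0.0.1:8108/v1/chat/completions` with the server running:

```python
import json

def chat_body(prompt: str, max_tokens: int = 16) -> str:
    """Build the JSON body for the OpenAI-compatible /v1/chat/completions endpoint."""
    return json.dumps({
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

print(chat_body("Reply with OK"))
```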
## About ZwZ (Zooming without Zooming)
ZwZ transforms zooming from an inference-time tool into a training-time primitive:
- Zoom in to micro-cropped regions and let strong teacher models (Qwen3-VL-235B, GLM-4.5V) generate high-quality VQA data
- Distill this region-grounded supervision back to the full image with explicit bounding-box overlays
- Reinforce via RL training to enable single-glance fine-grained perception
This makes ZwZ particularly well-suited for tasks requiring fine visual detail recognition, such as UI screenshot parsing, document analysis, and dense image understanding.
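The "explicit bounding-box overlay" step can be pictured with a minimal sketch. The box coordinates and rendering choices here are hypothetical stand-ins; the actual R2I data format is described in the paper:

```python
from PIL import Image, ImageDraw

# Stand-in for a full training image (the real pipeline uses natural images).
img = Image.new("RGB", (640, 480), "white")
draw = ImageDraw.Draw(img)

# Hypothetical region the teacher's zoomed-in VQA pair refers to.
box = (120, 80, 300, 220)

# Drawing the box on the full image grounds the region-level Q/A visually,
# so the student learns to answer without cropping at inference time.
draw.rectangle(box, outline="red", width=3)
img.save("overlaid.png")
```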
## Links
- Original model: inclusionAI/ZwZ-4B
- Base architecture: Qwen/Qwen3-VL-4B-Instruct
- Paper: Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
- Project: github.com/inclusionAI/Zooming-without-Zooming
- Training data: inclusionAI/ZwZ-RL-VQA
- MLX framework: github.com/ml-explore/mlx
- mlx-vlm: github.com/Blaizzy/mlx-vlm
## Other Quantizations
| Variant | Link |
|---|---|
| ZwZ-8B MLX 4-bit | swaylenhayes/ZwZ-8B-VL-MLX-4bit |
| ZwZ-8B MLX 8-bit | swaylenhayes/ZwZ-8B-VL-MLX-8bit |
| ZwZ-4B MLX 4-bit | this model |
| ZwZ-4B MLX 8-bit | swaylenhayes/ZwZ-4B-VL-MLX-8bit |
## Notes and Limitations
- Quantization changes numerical behavior relative to full-precision weights. Performance may differ from the original model on edge cases.
- Throughput and memory depend on prompt length, image resolution, and runtime settings.
- Benchmark numbers reflect a quiet system with no other models loaded.
## Citation

```bibtex
@article{wei2026zooming,
  title={Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception},
  author={Wei, Lai and He, Liangbo and Lan, Jun and Dong, Lingzhong and Cai, Yutong and Li, Siyuan and Zhu, Huijia and Wang, Weiqiang and Kong, Linghe and Wang, Yue and Zhang, Zhuosheng and Huang, Weiran},
  journal={arXiv preprint arXiv:2602.11858},
  year={2026}
}
```
## License
Apache 2.0 — follows the license of the original ZwZ and Qwen3-VL models.