---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- visual-document-retrieval
- cross-modal-distillation
- multilingual
- nanovdr
base_model: google-bert/bert-base-uncased
language:
- en
- de
- fr
- es
- it
- pt
license: apache-2.0
---

> **Paper**: [NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval](https://arxiv.org/abs/2603.12824) | [Blog](https://huggingface.co/blog/Ryenhails/nanovdr)

# NanoVDR-M-Multi: Multilingual Query Encoder for Visual Document Retrieval

**NanoVDR-M-Multi** is a 116M-parameter multilingual text-only query encoder for visual document retrieval. It retrieves document page images as effectively as Vision-Language Models 30-100x its size, with strong cross-lingual transfer across 6 languages.

Built on [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) and further trained with multilingual query augmentation (English + German, French, Spanish, Italian, Portuguese), it is the recommended model for production use with multilingual or mixed-language queries.

## Results

| Model | Params | ViDoRe v1 (en) | ViDoRe v2 (multi) | ViDoRe v3 (multi) |
|-------|--------|----------------|--------------------|--------------------|
| Qwen3-VL-Emb (Teacher) | 2.0B | 84.3 | 65.3 | 50.0 |
| **NanoVDR-M-Multi** | **112M** | **82.5** | **62.8** | **47.5** |
| NanoVDR-S-Multi | 69M | 82.2 | 61.9 | 46.5 |
| ColPali | ~3B | 84.2 | 54.7 | 42.0 |

### Per-Language Teacher Retention

| Language | NDCG@5 | Teacher Retention |
|----------|--------|-------------------|
| English | 50.7 | 93.0% |
| French | 47.8 | 93.6% |
| Spanish | 47.8 | 93.1% |
| Italian | 45.7 | 93.3% |
| German | 45.4 | 92.0% |
| Portuguese | 46.1 | 94.6% |

All 6 languages achieve >92% of the 2B teacher's performance.

## How It Works

NanoVDR decouples query encoding from document encoding in visual document retrieval:
- **Offline indexing**: The VLM teacher (Qwen3-VL-Embedding-2B) encodes document page images into single-vector embeddings. This is a one-time cost.
- **Online querying**: NanoVDR-M-Multi encodes text queries in any supported language into the same embedding space via a lightweight text encoder + MLP projector. No vision model needed at query time.

Retrieval uses standard cosine similarity between query and document embeddings.

## Usage

```python
from sentence_transformers import SentenceTransformer

# Load the multilingual query encoder
model = SentenceTransformer("nanovdr/NanoVDR-M-Multi")

# Encode queries in any supported language
queries = [
    "What was the revenue growth in Q3 2024?",           # English
    "Quel est le chiffre d'affaires du trimestre?",       # French
    "Wie hoch war das Umsatzwachstum im dritten Quartal?", # German
    "Cual fue el crecimiento de ingresos en el Q3?",      # Spanish
]
query_embeddings = model.encode(queries)
print(query_embeddings.shape)  # (4, 2048)

# Retrieve against pre-indexed document embeddings from the VLM teacher
# scores = query_embeddings @ doc_embeddings.T
```

### Full Retrieval Pipeline

```python
from sentence_transformers import SentenceTransformer

# Step 1: Index documents with the VLM teacher (one-time, offline)
from transformers import AutoModel
teacher = AutoModel.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")
# doc_embeddings = teacher.encode(document_images)  # See Qwen3-VL-Embedding docs

# Step 2: Query with NanoVDR-S-Multi (online, fast, CPU-only)
student = SentenceTransformer("nanovdr/NanoVDR-M-Multi")
query_emb = student.encode("Quel est le chiffre d'affaires?")

# Step 3: Retrieve
scores = query_emb @ doc_embeddings.T
top_k = scores.argsort()[-5:][::-1]
```

## Training Details

- **Architecture**: google-bert/bert-base-uncased + 2-layer MLP projector (768 → 768 → 2048)
- **Training objective**: Pointwise cosine alignment with teacher query embeddings
- **Training data**: 1.49M query-document pairs — 711K original (4 public sources) + 778K machine-translated queries in 5 languages (DE, FR, ES, IT, PT) via Helsinki-NLP Opus-MT models
- **Training cost**: ~15 GPU-hours on a single H200
- **Epochs**: 10, lr=3e-4, batch size 1024 (effective)

### Multilingual Augmentation Pipeline

1. Extract 489K English queries from training data
2. Translate to 5 target languages using [Helsinki-NLP Opus-MT](https://huggingface.co/Helsinki-NLP) models (~200K per language)
3. Re-encode translated queries with the frozen teacher in text mode to produce target embeddings
4. Combine with original 711K pairs → 1.49M total training samples

## Key Properties

- **Output dimension**: 2048 (aligned with Qwen3-VL-Embedding-2B)
- **Max sequence length**: 512 tokens
- **Supported languages**: English, German, French, Spanish, Italian, Portuguese
- **Similarity function**: Cosine similarity
- **Pooling**: Mean pooling
- **Normalization**: L2-normalized output

## Efficiency

| Metric | NanoVDR-M-Multi | ColPali (3B) | Teacher (2B) |
|--------|------------|--------------|--------------|
| Query latency (CPU, B=1) | 51 ms | 7,300 ms | GPU only |
| Model size | 116M | ~3B | 2B |
| Index type | Single-vector | Multi-vector | Single-vector |
| Scoring | Cosine | MaxSim | Cosine |

## Related Models

- [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) — English-focused, same architecture
- [NanoVDR-M](https://huggingface.co/nanovdr/NanoVDR-M) — BERT-base backbone (116M)
- [NanoVDR-L](https://huggingface.co/nanovdr/NanoVDR-L) — ModernBERT backbone (155M)

## Citation

```bibtex
@article{nanovdr2026,
  title={NanoVDR: Asymmetric Cross-Modal Distillation for Efficient Visual Document Retrieval},
  author={...},
  journal={arXiv preprint},
  year={2026}
}
```

## License

Apache 2.0