ColQwen3 8B - VetCoders MLX Edition

Visual document retrieval model with ColBERT-style late interaction (MaxSim scoring), optimized for Apple Silicon via MLX.

Created by M&K (c)2025 The LibraxisAI Team

Model Description

ColQwen3-VetCoders-MLX is a visual document retrieval model converted to Apple MLX format. It produces multi-vector embeddings for both document images and text queries, enabling precise visual document search using late interaction (MaxSim) scoring.

Key Features

  • Visual Document Retrieval - Find relevant pages in PDF documents using image understanding
  • Late Interaction Ranking - ColBERT-style MaxSim scoring for precision
  • Multi-modal Embeddings - Embed both images and text queries into shared 320-dim space
  • Apple Silicon Native - Optimized for M1/M2/M3/M4 via MLX framework

Architecture

Input (Image or Text)
         ↓
┌─────────────────────────────┐
│   Qwen3-VL Vision Encoder   │  ← For images: extract visual features
│   (frozen ViT patches)      │
└─────────────────────────────┘
         ↓
┌─────────────────────────────┐
│   Qwen3 Language Model      │  ← Multimodal token processing
│   (7B parameters)           │
└─────────────────────────────┘
         ↓
┌─────────────────────────────┐
│   Projection Layer          │  ← Project to 320-dim embedding space
│   (4096 → 320)              │
└─────────────────────────────┘
         ↓
Multi-vector embeddings [N, 320]
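
The projection step is simply a linear map from the language model's hidden size down to 320 dimensions, applied to every token. Below is a minimal MLX sketch of that final stage; the shapes, the random placeholder activations, and the normalization step are illustrative, not the shipped module.

import mlx.core as mx
import mlx.nn as nn

# Placeholder hidden states from the language model:
# one row per image patch or text token (shapes are illustrative).
hidden = mx.random.normal((128, 4096))

# 4096 -> 320 projection, as in the diagram above.
proj = nn.Linear(4096, 320)
embeddings = proj(hidden)                       # [128, 320]

# ColBERT-style scoring typically L2-normalizes each vector first.
embeddings = embeddings / mx.linalg.norm(embeddings, axis=-1, keepdims=True)
print(embeddings.shape)                         # (128, 320)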

Usage

Installation

pip install mlx mlx-vlm safetensors pillow
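
Before loading an 8B model, a quick sanity check that MLX sees the GPU can save time. A minimal sketch:

import mlx.core as mx

print(mx.default_device())        # should report a gpu device on Apple Silicon
x = mx.random.normal((4, 320))
print((x @ x.T).shape)            # (4, 4): a tiny matmul confirms compute works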

Loading the Model

from colqwen3_embedder import ColQwen3Embedder

# Initialize embedder (uses env vars or default paths)
embedder = ColQwen3Embedder()
embedder.load()

# Or specify paths directly
embedder = ColQwen3Embedder(
    model_path="LibraxisAI/colqwen3-8b-vetcoders-mlx",
    projection_path="path/to/projection.safetensors"
)

Embedding Documents

# Embed a document image
doc_embedding = embedder.embed_image("document_page.png")
# Returns: EmbeddingResult with shape [num_patches, 320]

# Embed a text query
query_embedding = embedder.embed_text("What is the treatment protocol?")
# Returns: EmbeddingResult with shape [num_tokens, 320]

Scoring Relevance

# MaxSim scoring for retrieval
score = embedder.maxsim_score(query_embedding, doc_embedding)
# Higher score = more relevant document
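
Under the hood, maxsim_score performs ColBERT-style late interaction: each query vector picks its best-matching document vector, and those maxima are summed over the query. A minimal reference sketch of the standard MaxSim computation is shown below; the .embeddings attribute used to pull the raw [N, 320] arrays out of EmbeddingResult is an assumption.

import mlx.core as mx

def maxsim(query_vecs: mx.array, doc_vecs: mx.array) -> float:
    # [num_query_tokens, num_doc_patches] similarity matrix; dot products
    # equal cosine similarities when both sides are L2-normalized.
    sim = query_vecs @ doc_vecs.T
    # Best document match per query token, summed over the query.
    return mx.sum(mx.max(sim, axis=-1)).item()

# e.g. maxsim(query_embedding.embeddings, doc_embedding.embeddings)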

Batch Processing

# Process multiple documents
documents = ["page1.png", "page2.png", "page3.png"]
doc_embeddings = [embedder.embed_image(doc) for doc in documents]

# Score all documents against query
scores = [embedder.maxsim_score(query_embedding, doc) for doc in doc_embeddings]

# Get top matches
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
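
Document embeddings only need to be computed once per page, so for repeated queries it can help to cache them on disk. A minimal sketch using numpy follows; the .embeddings attribute on EmbeddingResult and the file layout are assumptions.

import numpy as np

# Embed each page once and cache the [num_patches, 320] array next to it.
for doc in documents:
    emb = embedder.embed_image(doc)
    np.save(doc + ".npy", np.array(emb.embeddings))

# Later queries reload the cached arrays instead of re-embedding every page.
cached = {doc: np.load(doc + ".npy") for doc in documents}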

Technical Details

Base Model

Converted from tomoro-ai/Colqwen3-8B-base, which was trained on:

  • vidore/colpali_train_set
  • Additional document understanding datasets

Weight Mapping

The original Tomoro weights are mapped to an MLX-compatible structure (a renaming sketch follows the list):

  • vlm.model.language_model.* → language_model.model.*
  • vlm.model.visual.* → vision_tower.*
  • embedding_proj_layer.* → saved separately as projection weights
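
A minimal sketch of the renaming rule (illustrative; the actual conversion script may handle additional cases):

def remap_key(key: str) -> str:
    # Apply the prefix mapping listed above.
    if key.startswith("vlm.model.language_model."):
        return "language_model.model." + key[len("vlm.model.language_model."):]
    if key.startswith("vlm.model.visual."):
        return "vision_tower." + key[len("vlm.model.visual."):]
    # embedding_proj_layer.* keys are extracted into a separate projection file.
    return key

# e.g. remap_key("vlm.model.visual.blocks.0.attn.qkv.weight")
#      -> "vision_tower.blocks.0.attn.qkv.weight"   (key name is illustrative)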

Embedding Details

  • Dimension: 320 (projected from 4096)
  • Image tokens: Variable based on image resolution (patches)
  • Text tokens: Variable based on query length
  • Scoring: MaxSim (maximum similarity) late interaction

Performance

Tested on Apple Silicon:

| Device         | Image Embedding | Text Embedding | Memory |
|----------------|-----------------|----------------|--------|
| M3 Max 128GB   | ~1.2s           | ~0.3s          | ~17GB  |
| M3 Ultra 512GB | ~0.8s           | ~0.2s          | ~17GB  |
| M2 Ultra 192GB | ~1.5s           | ~0.4s          | ~17GB  |

Files

colqwen3-8b-vetcoders-mlx/
├── config.json                    # Model configuration
├── model-00001-of-00007.safetensors  # Model weights (sharded)
├── model-00002-of-00007.safetensors
├── ...
├── model.safetensors.index.json   # Weight index
├── tokenizer.json                 # Tokenizer
├── tokenizer_config.json
├── preprocessor_config.json       # Image preprocessor
└── video_preprocessor_config.json

Projection weights (separate file):

colqwen3_projection.safetensors    # 4096β†’320 projection layer
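
mx.load reads safetensors files directly, so the projection weights can be inspected before wiring them into the embedder. A minimal sketch; the exact tensor names inside the file are not assumed here:

import mlx.core as mx

# Load the standalone projection weights and list what the file contains.
weights = mx.load("colqwen3_projection.safetensors")   # dict of name -> array
for name, tensor in weights.items():
    print(name, tensor.shape)   # expect a 4096 -> 320 weight (plus bias, if any)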

Limitations

  • Requires Apple Silicon (M1/M2/M3/M4) for MLX acceleration
  • Large memory footprint (~17GB for inference)
  • Optimized for document images, not general photos

Citation

@misc{colqwen3-vetcoders-mlx,
  author = {LibraxisAI Team},
  title = {ColQwen3 8B - VetCoders MLX Edition},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/LibraxisAI/colqwen3-8b-vetcoders-mlx}}
}

License

Apache 2.0


Created by M&K (c)2025 The LibraxisAI Team
Co-Authored-By: Maciej & Klaudiusz
