Qianfan-VL: Domain-Enhanced Universal Vision-Language Models
Domain Capability Enhancement through Continuous Pre-training | 3B to 70B Parameter Scale | Document Understanding & OCR Enhancement | Chain-of-Thought Reasoning Support
This repository contains models presented in the paper Qianfan-OCR: A Unified End-to-End Model for Document Intelligence.
🔗 Quick Links
- Repository: 💻 GitHub
- Models: 🤗 Hugging Face | 🤖 ModelScope
- Documentation: 📚 Cookbook | 📝 Technical Report
- Blogs: 🇨🇳 中文博客 | 🇬🇧 English Blog
Model Description
Qianfan-VL is a series of general-purpose multimodal large language models enhanced for enterprise-level multimodal applications. The models offer deep optimization for high-frequency scenarios in industrial deployment while maintaining strong general capabilities.
Model Variants
| Model | Parameters | Context Length | CoT Support | Best For |
|---|---|---|---|---|
| Qianfan-VL-3B | 3B | 32k | ❌ | Edge deployment, real-time OCR |
| Qianfan-VL-8B | 8B | 32k | ✅ | Server-side general scenarios, fine-tuning |
| Qianfan-VL-70B | 70B | 32k | ✅ | Complex reasoning, data synthesis |
Architecture
- Language Model:
  - Qianfan-VL-3B: Based on Qwen2.5-3B
  - Qianfan-VL-8B/70B: Based on Llama 3.1 architecture
  - Enhanced with 3T multilingual corpus
- Vision Encoder: InternViT-based, supports dynamic patching up to 4K resolution
- Cross-modal Fusion: MLP adapter for efficient vision-language bridging
Key Capabilities
🔍 OCR & Document Understanding
- Full-Scenario OCR: Handwriting, formulas, natural scenes, cards/documents
- Document Intelligence: Layout analysis, table parsing, chart understanding, document Q&A
- High Precision: Industry-leading performance on OCR benchmarks
🧮 Chain-of-Thought Reasoning (8B & 70B)
- Complex chart analysis and reasoning
- Mathematical problem-solving with step-by-step derivation
- Visual reasoning and logical inference
- Statistical computation and trend prediction
📊 Benchmark Performance
General Vision-Language Benchmarks
| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|
| A-Bench_VAL | 75.65 | 75.72 | 78.1 | 75.86 | 75.86 | 76.49 | 79.22 |
| CCBench | 66.86 | 70.39 | 80.98 | 77.84 | 70.78 | 57.65 | 73.73 |
| SEEDBench_IMG | 76.55 | 78.02 | 79.13 | 77.0 | 77.52 | 76.98 | 78.34 |
| SEEDBench2_Plus | 67.59 | 70.97 | 73.17 | 69.52 | 68.47 | 70.93 | 73.25 |
| MMVet | 48.17 | 53.21 | 67.34 | 80.28 | 78.9 | 70.64 | 75.69 |
| MMMU_VAL | 46.44 | 47.11 | 58.33 | 56.11 | 60.78 | 51.0 | 65.78 |
| ScienceQA_TEST | 95.19 | 97.62 | 98.76 | 97.97 | 97.17 | 85.47 | 92.51 |
| ScienceQA_VAL | 93.85 | 97.62 | 98.81 | 97.81 | 95.14 | 83.59 | 91.32 |
| MMT-Bench_VAL | 62.23 | 63.22 | 71.06 | 65.17 | 63.67 | 61.4 | 69.49 |
| MTVQA_TEST | 26.5 | 30.14 | 32.18 | 30.3 | 27.62 | 29.08 | 31.48 |
| BLINK | 49.97 | 56.81 | 59.44 | 55.87 | 51.87 | 54.55 | 63.02 |
| MMStar | 57.93 | 64.07 | 69.47 | 68.4 | 66.07 | 61.53 | 66.0 |
| RealWorldQA | 65.75 | 70.59 | 71.63 | 71.11 | 74.25 | 69.28 | 73.86 |
| Q-Bench1_VAL | 73.51 | 75.25 | 77.46 | 75.99 | 77.99 | 78.1 | 79.93 |
| POPE | 85.08 | 86.06 | 88.97 | 90.59 | 88.87 | 85.97 | 83.35 |
| RefCOCO (Avg) | 85.94 | 89.37 | 91.01 | 89.65 | 91.40 | 86.56 | 90.25 |
OCR & Document Understanding
| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|---|
| OCRBench | 831 | 854 | 873 | 881 | 847 | 810 | 883 | 874 |
| AI2D_TEST | 81.38 | 85.07 | 87.23 | 85.07 | 83.55 | 77.07 | 80.472 | 83.84 |
| OCRVQA_TEST | 66.15 | 68.98 | 74.06 | 39.03 | 35.58 | 69.24 | 71.02 | 66.8 |
| TextVQA_VAL | 80.11 | 82.13 | 84.48 | 82.15 | 83.52 | 79.09 | 84.962 | 83.26 |
| DocVQA_VAL | 90.85 | 93.54 | 94.75 | 92.04 | 83.82 | 92.71 | 94.91 | 95.75 |
| ChartQA_TEST | 81.79 | 87.72 | 89.6 | 85.76 | 82.04 | 83.4 | 86.68 | 87.16 |
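The OCR table above can be read programmatically. The following is a minimal sketch, assuming the table is available as a markdown string; the helper names `parse_markdown_table` and `best_model` are hypothetical, and the scores are copied verbatim from the table.

```python
# Scores copied from the "OCR & Document Understanding" table above.
OCR_TABLE = """\
| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|---|
| OCRBench | 831 | 854 | 873 | 881 | 847 | 810 | 883 | 874 |
| AI2D_TEST | 81.38 | 85.07 | 87.23 | 85.07 | 83.55 | 77.07 | 80.472 | 83.84 |
| OCRVQA_TEST | 66.15 | 68.98 | 74.06 | 39.03 | 35.58 | 69.24 | 71.02 | 66.8 |
| TextVQA_VAL | 80.11 | 82.13 | 84.48 | 82.15 | 83.52 | 79.09 | 84.962 | 83.26 |
| DocVQA_VAL | 90.85 | 93.54 | 94.75 | 92.04 | 83.82 | 92.71 | 94.91 | 95.75 |
| ChartQA_TEST | 81.79 | 87.72 | 89.6 | 85.76 | 82.04 | 83.4 | 86.68 | 87.16 |
"""


def parse_markdown_table(text):
    """Parse a markdown table into {benchmark: {model: score}}."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    return {row[0]: dict(zip(header[1:], map(float, row[1:]))) for row in body}


def best_model(scores):
    """Return the name of the model with the highest score."""
    return max(scores, key=scores.get)


table = parse_markdown_table(OCR_TABLE)
for benchmark, scores in table.items():
    leader = best_model(scores)
    print(f"{benchmark}: {leader} ({scores[leader]:g})")
```

The same helpers work for the general vision-language table if its rows are pasted in the same format.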
Quick Start
Installation
pip install transformers accelerate torch torchvision pillow einops
Using Transformers
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Convert to RGB, resize to a square tile, and normalize with ImageNet statistics.
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

# Pick the (columns, rows) grid whose aspect ratio is closest to the image's.
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # On a tie, prefer the larger grid if the image has enough pixels to fill it
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

# Split an image into up to max_num square tiles (plus an optional thumbnail).
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate every (columns, rows) grid with min_num <= columns * rows <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the grid whose aspect ratio is closest to the image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    # append a downscaled copy of the whole image when more than one tile was produced
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

# Load an image file and return a stacked tensor of normalized tiles.
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load model
MODEL_PATH = "baidu/Qianfan-VL-8B" # or Qianfan-VL-3B, Qianfan-VL-70B
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
# Load and process image
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16)
# Inference
# The prompt means "Please recognize all the text in the image"
prompt = "<image>请识别图中所有文字"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 512},
        verbose=False
    )
print(response)
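
To check how many 448×448 tiles an image will produce without loading the model, the grid selection can be reproduced in plain Python. The following is a minimal sketch that mirrors the logic of `find_closest_aspect_ratio` and the candidate enumeration in `dynamic_preprocess` above. The function name `tile_grid` is hypothetical and not part of the model's code, and candidates with the same tile count are ordered deterministically here, which is an assumption added for reproducibility.

```python
def tile_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Return the (columns, rows) tile grid chosen for an image of the given size."""
    aspect_ratio = width / height
    # Enumerate every (columns, rows) grid whose tile count lies in [min_num, max_num],
    # ordered by tile count (ties ordered by the grid tuple itself).
    candidates = sorted(
        {
            (cols, rows)
            for cols in range(1, max_num + 1)
            for rows in range(1, max_num + 1)
            if min_num <= cols * rows <= max_num
        },
        key=lambda grid: (grid[0] * grid[1], grid),
    )
    best_diff, best = float("inf"), (1, 1)
    area = width * height
    for cols, rows in candidates:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # On a tie, prefer the larger grid when the image has enough pixels to fill it.
            best = (cols, rows)
    return best


print(tile_grid(1792, 896))  # a 2:1 image large enough for a 4x2 grid
print(tile_grid(896, 448))   # a 2:1 image that only fills a 2x1 grid
```

For example, `tile_grid(1792, 896)` returns `(4, 2)`, which is eight tiles. Because `load_image` calls `dynamic_preprocess` with `use_thumbnail=True`, one thumbnail tile is added whenever the grid has more than one tile.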
Training Details
Four-Stage Progressive Training
- Cross-modal Alignment (100B tokens): Establishes vision-language connections
- General Knowledge Injection (3.5T tokens): Builds strong foundational capabilities
- Domain Enhancement (300B tokens): Specialized OCR and reasoning capabilities
- Post-training (1B tokens): Instruction following and preference alignment
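
The relative weight of each stage follows from the token counts listed above. A minimal sketch of the arithmetic, with budgets expressed in billions of tokens (the names `STAGE_TOKENS_B` and `stage_shares` are hypothetical):

```python
# Token budgets per stage, in billions of tokens, as listed above.
STAGE_TOKENS_B = {
    "Cross-modal Alignment": 100,
    "General Knowledge Injection": 3500,
    "Domain Enhancement": 300,
    "Post-training": 1,
}


def stage_shares(stage_tokens):
    """Return the total token count and each stage's fraction of that total."""
    total = sum(stage_tokens.values())
    return total, {name: tokens / total for name, tokens in stage_tokens.items()}


total_b, shares = stage_shares(STAGE_TOKENS_B)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Total: {total_b}B tokens")
```

The four stages sum to 3,901B tokens (about 3.9T), of which General Knowledge Injection accounts for roughly 90%.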
Infrastructure
- Trained on 5000+ Baidu Kunlun chips
- A single training task parallelized across 5,000 chips
- 90%+ scaling efficiency for large-scale distributed training
- Innovative communication-computation fusion technology
Citation
If you use Qianfan-VL or Qianfan-OCR in your research, please cite:
@misc{dong2026qianfanocr,
  title={Qianfan-OCR: A Unified End-to-End Model for Document Intelligence},
  author={Daxiang Dong and Mingming Zheng and Dong Xu and Chunhua Luo and Bairong Zhuang and Yuxuan Li and Ruoyun He and Haoran Wang and Wenyu Zhang and Wenbo Wang and Yicheng Wang and Xue Xiong and Ayong Zheng and Xiaoying Zuo and Ziwei Ou and Jingnan Gu and Quanhao Guo and Jianmin Wu and Dawei Yin and Dou Shen},
  year={2026},
  eprint={2603.13398},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.13398}
}

@misc{qianfan-vl-2025,
  title={Qianfan-VL: Domain-Enhanced Universal Vision-Language Models},
  author={Qianfan Team},
  year={2025},
  publisher={Baidu}
}
Contact
For more information and API access, visit: Baidu AI Cloud Qianfan Platform