Improve model card: Add abstract, 'qwen' tag, and enhance top links

#7
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +93 -47
README.md CHANGED
@@ -1,30 +1,41 @@
1
  ---
2
- license: mit
3
- pipeline_tag: image-text-to-text
4
- library_name: transformers
5
  base_model:
6
- - OpenGVLab/InternViT-300M-448px-V2_5
7
- - Qwen/Qwen2.5-0.5B-Instruct
8
- base_model_relation: merge
 
9
  language:
10
- - multilingual
 
 
 
11
  tags:
12
- - internvl
13
- - custom_code
14
- datasets:
15
- - HuggingFaceFV/finevideo
16
  ---
17
 
18
  # InternVL2_5-1B
19
 
20
- [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 Mini-InternVL\]](https://arxiv.org/abs/2410.16261) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271)
 
 
21
 
22
- [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
 
 
 
 
23
 
24
  <div align="center">
25
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
26
  </div>
27
 
 
 
 
 
28
  ## Introduction
29
 
30
  We are excited to introduce **InternVL 2.5**, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality.
@@ -73,11 +84,11 @@ The training pipeline for a single model in InternVL 2.5 is structured across th
73
 
74
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/5NduZeCPLgPJTFr0RGTq3.png)
75
 
76
- - **Stage 1: MLP Warmup.** In this stage, only the MLP projector is trained while the vision encoder and language model are frozen. A dynamic high-resolution training strategy is applied for better performance, despite increased cost. This phase ensures robust cross-modal alignment and prepares the model for stable multimodal training.
77
 
78
- - **Stage 1.5: ViT Incremental Learning (Optional).** This stage allows incremental training of the vision encoder and MLP projector using the same data as Stage 1. It enhances the encoder’s ability to handle rare domains like multilingual OCR and mathematical charts. Once trained, the encoder can be reused across LLMs without retraining, making this stage optional unless new domains are introduced.
79
 
80
- - **Stage 2: Full Model Instruction Tuning.** The entire model is trained on high-quality multimodal instruction datasets. Strict data quality controls are enforced to prevent degradation of the LLM, as noisy data can cause issues like repetitive or incorrect outputs. After this stage, the training process is complete.
81
 
82
  ### Progressive Scaling Strategy
83
 
@@ -91,9 +102,9 @@ Compared to Qwen2-VL's 1.4 trillion tokens, InternVL2.5-78B uses only 120 billio
91
 
92
  To improve real-world adaptability and performance, we introduce two key techniques:
93
 
94
- - **Random JPEG Compression**: Random JPEG compression with quality levels between 75 and 100 is applied as a data augmentation technique. This simulates image degradation from internet sources, enhancing the model's robustness to noisy images.
95
 
96
- - **Loss Reweighting**: To balance the NTP loss across responses of different lengths, we use a reweighting strategy called **square averaging**. This method balances contributions from responses of varying lengths, mitigating biases toward longer or shorter responses.
97
 
98
  ### Data Organization
99
 
@@ -103,11 +114,11 @@ In InternVL 2.0 and 2.5, the organization of the training data is controlled by
103
 
104
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/2LJe24b1ua3gjI9gDitVl.png)
105
 
106
- - **Data Augmentation:** JPEG compression is applied conditionally: enabled for image datasets to enhance robustness and disabled for video datasets to maintain consistent frame quality.
107
 
108
- - **Maximum Tile Number:** The parameter `n_max` controls the maximum tiles per dataset. For example, higher values (24–36) are used for multi-image or high-resolution data, lower values (6–12) for standard images, and 1 for videos.
109
 
110
- - **Repeat Factor:** The repeat factor `r` adjusts dataset sampling frequency. Values below 1 reduce a dataset's weight, while values above 1 increase it. This ensures balanced training across tasks and prevents overfitting or underfitting.
111
 
112
  #### Data Filtering Pipeline
113
 
@@ -121,14 +132,14 @@ To address this challenge and support future research, we designed an efficient
121
 
122
  The pipeline includes two modules. For **pure-text data**, three key strategies are used:
123
 
124
- 1. **LLM-Based Quality Scoring**: Each sample is scored (0–10) using a pre-trained LLM with domain-specific prompts. Samples scoring below a threshold (e.g., 7) are removed to ensure high-quality data.
125
- 2. **Repetition Detection**: Repetitive samples are flagged using LLM-based prompts and manually reviewed. Samples scoring below a stricter threshold (e.g., 3) are excluded to avoid repetitive patterns.
126
- 3. **Heuristic Rule-Based Filtering**: Anomalies like abnormal sentence lengths or duplicate lines are detected using rules. Flagged samples undergo manual verification to ensure accuracy before removal.
127
 
128
  For **multimodal data**, two strategies are used:
129
 
130
- 1. **Repetition Detection**: Repetitive samples in non-academic datasets are flagged and manually reviewed to prevent pattern loops. High-quality datasets are exempt from this process.
131
- 2. **Heuristic Rule-Based Filtering**: Similar rules are applied to detect visual anomalies, with flagged data verified manually to maintain integrity.
132
 
133
  #### Training Data
134
 
@@ -360,40 +371,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
360
  # pure-text conversation (纯文本对话)
361
  question = 'Hello, who are you?'
362
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
363
- print(f'User: {question}\nAssistant: {response}')
 
364
 
365
  question = 'Can you tell me a story?'
366
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
367
- print(f'User: {question}\nAssistant: {response}')
 
368
 
369
  # single-image single-round conversation (单图单轮对话)
370
- question = '<image>\nPlease describe the image shortly.'
 
371
  response = model.chat(tokenizer, pixel_values, question, generation_config)
372
- print(f'User: {question}\nAssistant: {response}')
 
373
 
374
  # single-image multi-round conversation (单图多轮对话)
375
- question = '<image>\nPlease describe the image in detail.'
 
376
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
377
- print(f'User: {question}\nAssistant: {response}')
 
378
 
379
  question = 'Please write a poem according to the image.'
380
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
381
- print(f'User: {question}\nAssistant: {response}')
 
382
 
383
  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
384
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
385
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
386
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
387
 
388
- question = '<image>\nDescribe the two images in detail.'
 
389
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
390
  history=None, return_history=True)
391
- print(f'User: {question}\nAssistant: {response}')
 
392
 
393
  question = 'What are the similarities and differences between these two images.'
394
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
395
  history=history, return_history=True)
396
- print(f'User: {question}\nAssistant: {response}')
 
397
 
398
  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
399
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -401,17 +422,21 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
401
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
402
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
403
 
404
- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
 
 
405
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
406
  num_patches_list=num_patches_list,
407
  history=None, return_history=True)
408
- print(f'User: {question}\nAssistant: {response}')
 
409
 
410
  question = 'What are the similarities and differences between these two images.'
411
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
412
  num_patches_list=num_patches_list,
413
  history=history, return_history=True)
414
- print(f'User: {question}\nAssistant: {response}')
 
415
 
416
  # batch inference, single image per sample (单图批处理)
417
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -419,13 +444,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
419
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
420
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
421
 
422
- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
 
423
  responses = model.batch_chat(tokenizer, pixel_values,
424
  num_patches_list=num_patches_list,
425
  questions=questions,
426
  generation_config=generation_config)
427
  for question, response in zip(questions, responses):
428
- print(f'User: {question}\nAssistant: {response}')
 
429
 
430
  # video multi-round conversation (视频多轮对话)
431
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -463,17 +490,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
463
  video_path = './examples/red-panda.mp4'
464
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
465
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
466
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
 
467
  question = video_prefix + 'What is the red panda doing?'
468
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
 
 
 
 
469
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
470
  num_patches_list=num_patches_list, history=None, return_history=True)
471
- print(f'User: {question}\nAssistant: {response}')
 
472
 
473
  question = 'Describe this video in detail.'
474
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
475
  num_patches_list=num_patches_list, history=history, return_history=True)
476
- print(f'User: {question}\nAssistant: {response}')
 
477
  ```
478
 
479
  #### Streaming Output
@@ -555,7 +589,9 @@ image_urls=[
555
 
556
  images = [load_image(img_url) for img_url in image_urls]
557
  # Numbering images improves multi-image conversations
558
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
 
 
559
  print(response.text)
560
  ```
561
 
@@ -656,7 +692,7 @@ If you find this project useful in your research, please consider citing:
656
  year={2024}
657
  }
658
  @article{gao2024mini,
659
- title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
660
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
661
  journal={arXiv preprint arXiv:2410.16261},
662
  year={2024}
@@ -675,3 +711,13 @@ If you find this project useful in your research, please consider citing:
675
  year={2024}
676
  }
677
  ```
 
 
 
 
 
 
 
 
 
 
 
1
  ---
 
 
 
2
  base_model:
3
+ - OpenGVLab/InternViT-300M-448px-V2_5
4
+ - Qwen/Qwen2.5-0.5B-Instruct
5
+ datasets:
6
+ - HuggingFaceFV/finevideo
7
  language:
8
+ - multilingual
9
+ library_name: transformers
10
+ license: mit
11
+ pipeline_tag: image-text-to-text
12
  tags:
13
+ - internvl
14
+ - custom_code
15
+ - qwen
16
+ base_model_relation: merge
17
  ---
18
 
19
  # InternVL2_5-1B
20
 
21
+ **Paper:** [Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling](https://huggingface.co/papers/2412.05271)
22
+
23
+ **GitHub Repository:** [OpenGVLab/InternVL](https://github.com/OpenGVLab/InternVL) | **Hugging Face Demo / Project Page:** [OpenGVLab/InternVL Space](https://huggingface.co/spaces/OpenGVLab/InternVL)
24
 
25
+ **Related Papers:**
26
+ - [InternVL 1.0](https://huggingface.co/papers/2312.14238)
27
+ - [InternVL 1.5](https://huggingface.co/papers/2404.16821)
28
+ - [Mini-InternVL](https://arxiv.org/abs/2410.16261)
29
+ - [InternVL 2.5](https://huggingface.co/papers/2412.05271)
30
 
31
  <div align="center">
32
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
33
  </div>
34
 
35
+ ## Abstract
36
+
37
+ We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. A Hugging Face demo is available via the project links above.
38
+
39
  ## Introduction
40
 
41
  We are excited to introduce **InternVL 2.5**, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality.
 
84
 
85
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/5NduZeCPLgPJTFr0RGTq3.png)
86
 
87
+ - **Stage 1: MLP Warmup.** In this stage, only the MLP projector is trained while the vision encoder and language model are frozen. A dynamic high-resolution training strategy is applied for better performance, despite increased cost. This phase ensures robust cross-modal alignment and prepares the model for stable multimodal training.
88
 
89
+ - **Stage 1.5: ViT Incremental Learning (Optional).** This stage allows incremental training of the vision encoder and MLP projector using the same data as Stage 1. It enhances the encoder’s ability to handle rare domains like multilingual OCR and mathematical charts. Once trained, the encoder can be reused across LLMs without retraining, making this stage optional unless new domains are introduced.
90
 
91
+ - **Stage 2: Full Model Instruction Tuning.** The entire model is trained on high-quality multimodal instruction datasets. Strict data quality controls are enforced to prevent degradation of the LLM, as noisy data can cause issues like repetitive or incorrect outputs. After this stage, the training process is complete.
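
The three stages above differ mainly in which components receive gradient updates. A minimal sketch of that schedule follows; the component labels (`vit`, `mlp`, `llm`) and the helper `trainable_flags` are hypothetical names chosen for illustration, not identifiers from the InternVL codebase.

```python
from typing import Dict

# Hypothetical component labels: vision encoder, MLP projector, language model.
COMPONENTS = ("vit", "mlp", "llm")

# Which components are trained in each stage, as described in the list above.
STAGE_TRAINABLE = {
    "stage1": {"mlp"},                 # MLP warmup: ViT and LLM frozen
    "stage1.5": {"vit", "mlp"},        # optional ViT incremental learning
    "stage2": {"vit", "mlp", "llm"},   # full model instruction tuning
}


def trainable_flags(stage: str) -> Dict[str, bool]:
    """Return a {component: is_trainable} mapping for the given stage."""
    trainable = STAGE_TRAINABLE[stage]
    return {name: name in trainable for name in COMPONENTS}


if __name__ == "__main__":
    for stage in STAGE_TRAINABLE:
        print(stage, trainable_flags(stage))
```

In a PyTorch training script, flags like these would typically be applied by setting `requires_grad` on each submodule's parameters.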
92
 
93
  ### Progressive Scaling Strategy
94
 
 
102
 
103
  To improve real-world adaptability and performance, we introduce two key techniques:
104
 
105
+ - **Random JPEG Compression**: Random JPEG compression with quality levels between 75 and 100 is applied as a data augmentation technique. This simulates image degradation from internet sources, enhancing the model's robustness to noisy images.
106
 
107
+ - **Loss Reweighting**: To balance the NTP loss across responses of different lengths, we use a reweighting strategy called **square averaging**. This method balances contributions from responses of varying lengths, mitigating biases toward longer or shorter responses.
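
To make the loss reweighting idea concrete, here is a small sketch in plain Python. It assumes that square averaging weights each token of a response with `x` tokens by `1 / x**0.5`, sitting between token averaging (`1 / x**0`) and sample averaging (`1 / x**1`); the normalization by total weight and the function names are assumptions made for this example.

```python
from typing import List

# Assumed exponents: token averaging, sample averaging, square averaging.
EXPONENTS = {"token": 0.0, "sample": 1.0, "square": 0.5}


def token_weight(num_tokens: int, mode: str = "square") -> float:
    """Weight applied to every token of a response with `num_tokens` tokens."""
    return 1.0 / (num_tokens ** EXPONENTS[mode])


def reweighted_loss(per_token_losses: List[List[float]], mode: str = "square") -> float:
    """Weighted mean of token losses over a batch of responses.

    `per_token_losses[i]` holds the next-token-prediction loss of each token
    in response i. Empty responses are skipped.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for losses in per_token_losses:
        if not losses:
            continue
        w = token_weight(len(losses), mode)
        weighted_sum += w * sum(losses)
        weight_total += w * len(losses)
    return weighted_sum / weight_total


if __name__ == "__main__":
    batch = [[1.0] * 4, [2.0] * 16]  # a short and a long response
    for mode in ("token", "sample", "square"):
        print(mode, round(reweighted_loss(batch, mode), 4))
```

With a short low-loss response and a long high-loss response, the square-averaged value falls between the token-averaged and sample-averaged values, which is the balancing effect the bullet describes.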
108
 
109
  ### Data Organization
110
 
 
114
 
115
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/2LJe24b1ua3gjI9gDitVl.png)
116
 
117
+ - **Data Augmentation:** JPEG compression is applied conditionally: enabled for image datasets to enhance robustness and disabled for video datasets to maintain consistent frame quality.
118
 
119
+ - **Maximum Tile Number:** The parameter `n_max` controls the maximum tiles per dataset. For example, higher values (24–36) are used for multi-image or high-resolution data, lower values (6–12) for standard images, and 1 for videos.
120
 
121
+ - **Repeat Factor:** The repeat factor `r` adjusts dataset sampling frequency. Values below 1 reduce a dataset's weight, while values above 1 increase it. This ensures balanced training across tasks and prevents overfitting or underfitting.
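
The three per-dataset controls above can be pictured as a small configuration record. The sketch below is illustrative only: the class `DatasetConfig`, its field names, and the rule that the repeat factor scales the number of samples drawn per epoch are assumptions, and the dataset names are made up.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetConfig:
    """Hypothetical per-dataset settings mirroring the list above."""
    name: str
    num_samples: int
    is_video: bool = False
    n_max: int = 12              # maximum tile number
    repeat_factor: float = 1.0   # r < 1 down-weights, r > 1 up-weights

    @property
    def use_jpeg_augmentation(self) -> bool:
        # Enabled for image datasets, disabled for video datasets.
        return not self.is_video

    def effective_samples(self) -> int:
        # Assumed interpretation: r scales how many samples are drawn per epoch.
        return round(self.num_samples * self.repeat_factor)


def sample_jpeg_quality(cfg: DatasetConfig, rng: random.Random) -> Optional[int]:
    """Pick a JPEG quality in [75, 100], or None when augmentation is off."""
    if not cfg.use_jpeg_augmentation:
        return None
    return rng.randint(75, 100)


if __name__ == "__main__":
    rng = random.Random(0)
    configs = [
        DatasetConfig("example_ocr", 1000, n_max=24, repeat_factor=0.5),
        DatasetConfig("example_video", 200, is_video=True, n_max=1, repeat_factor=2.0),
    ]
    for cfg in configs:
        print(cfg.name, cfg.effective_samples(), sample_jpeg_quality(cfg, rng))
```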
122
 
123
  #### Data Filtering Pipeline
124
 
 
132
 
133
  The pipeline includes two modules. For **pure-text data**, three key strategies are used:
134
 
135
+ 1. **LLM-Based Quality Scoring**: Each sample is scored (0–10) using a pre-trained LLM with domain-specific prompts. Samples scoring below a threshold (e.g., 7) are removed to ensure high-quality data.
136
+ 2. **Repetition Detection**: Repetitive samples are flagged using LLM-based prompts and manually reviewed. Samples scoring below a stricter threshold (e.g., 3) are excluded to avoid repetitive patterns.
137
+ 3. **Heuristic Rule-Based Filtering**: Anomalies like abnormal sentence lengths or duplicate lines are detected using rules. Flagged samples undergo manual verification to ensure accuracy before removal.
138
 
139
  For **multimodal data**, two strategies are used:
140
 
141
+ 1. **Repetition Detection**: Repetitive samples in non-academic datasets are flagged and manually reviewed to prevent pattern loops. High-quality datasets are exempt from this process.
142
+ 2. **Heuristic Rule-Based Filtering**: Similar rules are applied to detect visual anomalies, with flagged data verified manually to maintain integrity.
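
As a rough illustration of the pure-text strategies, the sketch below applies a score threshold and a duplicate-line rule. It does not call any LLM: the `score` field is assumed to have been produced upstream, and the function names, sample layout, and the `max_repeats` default are hypothetical.

```python
from typing import Dict, List


def filter_by_score(samples: List[Dict], threshold: float = 7.0) -> List[Dict]:
    """Keep samples whose precomputed quality score (0-10) meets the threshold."""
    return [s for s in samples if s["score"] >= threshold]


def has_duplicate_lines(text: str, max_repeats: int = 2) -> bool:
    """Heuristic rule: flag text in which any non-empty line repeats too often."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        counts[line] = counts.get(line, 0) + 1
        if counts[line] > max_repeats:
            return True
    return False


def flag_for_review(samples: List[Dict]) -> List[Dict]:
    """Return samples that a human should verify before removal."""
    return [s for s in samples if has_duplicate_lines(s["text"])]


if __name__ == "__main__":
    data = [
        {"text": "A clear explanation.", "score": 9},
        {"text": "ok\nok\nok\nok", "score": 8},
        {"text": "Low quality.", "score": 4},
    ]
    kept = filter_by_score(data)
    print(len(kept), [s["text"] for s in flag_for_review(kept)])
```

Flagged samples are returned rather than dropped, matching the manual verification step described above.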
143
 
144
  #### Training Data
145
 
 
371
  # pure-text conversation (纯文本对话)
372
  question = 'Hello, who are you?'
373
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
374
+ print(f'User: {question}\nAssistant: {response}')
376
 
377
  question = 'Can you tell me a story?'
378
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
379
+ print(f'User: {question}\nAssistant: {response}')
381
 
382
  # single-image single-round conversation (单图单轮对话)
383
+ question = '<image>\nPlease describe the image shortly.'
385
  response = model.chat(tokenizer, pixel_values, question, generation_config)
386
+ print(f'User: {question}\nAssistant: {response}')
388
 
389
  # single-image multi-round conversation (单图多轮对话)
390
+ question = '<image>\nPlease describe the image in detail.'
392
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
393
+ print(f'User: {question}\nAssistant: {response}')
395
 
396
  question = 'Please write a poem according to the image.'
397
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
398
+ print(f'User: {question}\nAssistant: {response}')
400
 
401
  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
402
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
403
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
404
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
405
 
406
+ question = '<image>\nDescribe the two images in detail.'
408
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
409
  history=None, return_history=True)
410
+ print(f'User: {question}\nAssistant: {response}')
412
 
413
  question = 'What are the similarities and differences between these two images.'
414
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
415
  history=history, return_history=True)
416
+ print(f'User: {question}\nAssistant: {response}')
418
 
419
  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
420
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
 
422
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
423
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
424
 
425
+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
428
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
429
  num_patches_list=num_patches_list,
430
  history=None, return_history=True)
431
+ print(f'User: {question}\nAssistant: {response}')
433
 
434
  question = 'What are the similarities and differences between these two images.'
435
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
436
  num_patches_list=num_patches_list,
437
  history=history, return_history=True)
438
+ print(f'User: {question}\nAssistant: {response}')
440
 
441
  # batch inference, single image per sample (单图批处理)
442
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
 
444
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
445
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
446
 
447
+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
449
  responses = model.batch_chat(tokenizer, pixel_values,
450
  num_patches_list=num_patches_list,
451
  questions=questions,
452
  generation_config=generation_config)
453
  for question, response in zip(questions, responses):
454
+     print(f'User: {question}\nAssistant: {response}')
456
 
457
  # video multi-round conversation (视频多轮对话)
458
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
 
490
  video_path = './examples/red-panda.mp4'
491
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
492
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
493
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
495
  question = video_prefix + 'What is the red panda doing?'
496
+ # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
501
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
502
  num_patches_list=num_patches_list, history=None, return_history=True)
503
+ print(f'User: {question}\nAssistant: {response}')
505
 
506
  question = 'Describe this video in detail.'
507
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
508
  num_patches_list=num_patches_list, history=history, return_history=True)
509
+ print(f'User: {question}\nAssistant: {response}')
511
  ```
512
 
513
  #### Streaming Output
 
589
 
590
  images = [load_image(img_url) for img_url in image_urls]
591
  # Numbering images improves multi-image conversations
592
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
595
  print(response.text)
596
  ```
597
 
 
692
  year={2024}
693
  }
694
  @article{gao2024mini,
695
+ title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
696
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
697
  journal={arXiv preprint arXiv:2410.16261},
698
  year={2024}
 
711
  year={2024}
712
  }
713
  ```
714
+
715
+ ## Acknowledgement
716
+
717
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
718
+
719
+ ______________________________________________________________________
720
+
721
+ Scan the following QR code to join our WeChat group.
722
+
723
+ <p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>