Open VLM Leaderboard
VLMEvalKit Evaluation Results Collection
RealWorldQA is a benchmark contributed by xAI that evaluates the real-world spatial understanding of multimodal AI models, i.e. how well they comprehend physical environments. The benchmark consists of 700+ images, each accompanied by a question and a verifiable answer. The images are drawn from real-world scenarios, including ones captured from vehicles. The goal is to advance AI models' understanding of our physical world.
| Name | Type | #Questions | Data Quality* (10% of samples manually verified) | Fine-grained Classes |
|---|---|---|---|---|
| RealWorldQA | MCQ | 765 | > 97% | No |
TL;DR: **RealWorldQA** is a benchmark that requires VLMs to have the capability of:
*Data Quality: We manually verify 10% of the samples and check whether each one is correct and unambiguous. Most samples (>97%) in RealWorldQA are good and clear.
Some cases I found ambiguous:
Questions in RealWorldQA have 2 to 4 candidate choices (the majority have 3), so the expected Top-1 accuracy of random guessing is 37.7%.
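The random-guess baseline is the mean of 1/k over all questions, where k is the number of choices for that question. A minimal sketch of that calculation follows; the function name `random_guess_accuracy` and the toy distribution are illustrative assumptions, not the actual per-question counts in RealWorldQA.

```python
from collections import Counter
from fractions import Fraction


def random_guess_accuracy(num_choices):
    """Expected top-1 accuracy of uniform random guessing.

    num_choices: iterable giving the number of candidate choices per question.
    """
    counts = Counter(num_choices)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one question")
    # A question with k choices is guessed correctly with probability 1/k,
    # so n such questions contribute n/k expected correct answers.
    expected_correct = sum(Fraction(n, k) for k, n in counts.items())
    return float(expected_correct / total)


# Toy distribution (hypothetical, NOT the real RealWorldQA counts):
# 2 two-choice, 6 three-choice, and 2 four-choice questions.
toy = [2] * 2 + [3] * 6 + [4] * 2
print(f"{random_guess_accuracy(toy):.3f}")
```

Feeding in the real number of choices for each of the 765 questions should reproduce the 37.7% figure quoted above.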
We perform the evaluation using VLMEvalKit and list the performance of representative VLMs (proprietary or open-source) below:
| Proprietary Models | Acc | Proprietary Models | Acc |
|---|---|---|---|
| GPT-4v (0409, low-res) | 61.4 | GPT-4v (0409, high-res) | 68.0 |
| GeminiPro-V (1.0) | 60.4 | QwenVLMax | 61.3 |

| Open-Source Models | Acc | Open-Source Models | Acc |
|---|---|---|---|
| InternLM-XComposer2 | 63.8 | InternVL-Chat-V1.5 | 65.6 |
| IDEFICS2-8B | 60.8 | LLaVA-NeXT (Yi-34B) | 66.0 |
| LLaVA-v1.5 (7B) | 54.8 | LLaVA-v1.5 (13B) | 55.3 |
Grok-v1.5 is not included since it's not publicly available.
Among the evaluated VLMs, GPT-4v (0409, high-res) achieves the best performance (68.0) and outperforms its low-res version (61.4) by 6.6 points (recall that RealWorldQA requires fine-grained recognition in high-res images). Meanwhile, top open-source VLMs are also competitive: LLaVA-NeXT (Yi-34B) reaches 66.0 and InternVL-Chat-V1.5 reaches 65.6.
We select the subset of questions that none of the Top-3 VLMs (GPT-4v (0409, high-res), InternVL-Chat-V1.5, LLaVA-NeXT (Yi-34B)) answers correctly. The subset includes 101 samples, about 13% of the benchmark. We visualize several random samples from the subset below.
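The selection rule can be sketched in a few lines. This is only an illustration under assumed data structures: `answers` maps a question index to its ground-truth option letter, and `predictions` maps a model name to its per-question predicted letters. The function name `hard_subset` and the toy data are hypothetical and do not reflect VLMEvalKit's actual output format.

```python
def hard_subset(answers, predictions, models):
    """Return indices of questions that every model in `models` gets wrong."""
    return [
        idx
        for idx, gt in answers.items()
        # A missing prediction counts as wrong (.get returns None).
        if all(predictions[m].get(idx) != gt for m in models)
    ]


# Toy data (hypothetical): 4 questions, 3 models.
answers = {0: "A", 1: "B", 2: "C", 3: "A"}
predictions = {
    "model_a": {0: "A", 1: "C", 2: "A", 3: "B"},
    "model_b": {0: "B", 1: "C", 2: "C", 3: "B"},
    "model_c": {0: "A", 1: "A", 2: "C", 3: "C"},
}
hard = hard_subset(answers, predictions, ["model_a", "model_b", "model_c"])
print(hard)  # questions 1 and 3 are missed by all three models
```

Requiring all three models to fail keeps only the questions that are hard for every top model, which is why the subset is much smaller than any single model's error set.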