Improve model card: Add pipeline tag, library name, abstract, results, and GitHub link (#1) (commit 2cf6f9c1573c235ff6fa2ee96b8eecb01ef2a996)
Co-authored-by: Niels Rogge <[email protected]>
README.md
---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

<p align="center">
    <img src="./assets/raven_logo.png" width="100" style="margin-bottom: 0.2;"/>
<p>
<a href="https://bashlab.github.io/raven_project/" style="color:#825987">
https://bashlab.github.io/raven_project/
</a>
• Code:
<a href="https://github.com/BASHLab/RAVEN" style="color:#825987">
https://github.com/BASHLab/RAVEN
</a>
</h5>
<p align="center">
    <img src="./assets/raven_architecture.png" width="800" />
<p>

---

## Abstract
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning; each stage targets a distinct challenge in multimodal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multimodal QA benchmarks, including egocentric and exocentric tasks, show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multimodal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
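QuART's exact architecture is not specified on this card. Purely as an illustrative sketch of the idea, query-conditioned per-token scalar relevance gating can be written as below (the sigmoid scorer, dot-product similarity, and toy embeddings are assumptions, not the released implementation):

```python
import math

def query_gated_fusion(query, token_streams):
    """Toy sketch of QuART-style gating: score each modality token against
    the question embedding, then re-weight it so distractors shrink
    before fusion."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    fused = []
    for modality, tokens in token_streams.items():
        for tok in tokens:
            # Scalar relevance in (0, 1): sigmoid of query-token similarity.
            score = 1.0 / (1.0 + math.exp(-dot(tok, query)))
            fused.append([score * x for x in tok])  # re-weighted token
    return fused

# Toy 2-d embeddings: the first video token aligns with the query,
# the second opposes it, the audio token is orthogonal (score 0.5).
query = [1.0, 0.0]
streams = {
    "video":  [[1.0, 0.0], [-1.0, 0.0]],
    "audio":  [[0.0, 1.0]],
    "sensor": [[2.0, 0.0]],
}
fused = query_gated_fusion(query, streams)
print(len(fused))  # 4 gated tokens across all modalities
```

The key property this illustrates: gating happens before fusion, so a token's contribution depends on its relevance to the question rather than on equal per-stream weighting.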

---
## 🚀 Main Results
##### Comparison of **RAVEN** and prior MLLMs on *exocentric* open-ended video QA (MSVD-QA, MSRVTT-QA, ActivityNet-QA) and audio-visual QA (AVSD, MUSIC-QA) benchmarks. Best and second-best scores are in $\textbf{Bold}$ and $\underline{\text{underline}}$. $^*$ indicates scores reproduced by us.
<p><img src="./assets/main_result_exo.png" width="800"></p>

##### Comparison of **RAVEN** with MLLMs on the EgoThink (Reasoning) and AVS-QA benchmarks. **RAVEN** outperforms across metrics and excels in reasoning. $\textbf{Bold}$ and $\underline{\text{underline}}$ indicate the best and second-best scores.
<p><img src="./assets/main_result_ego.png" width="800"></p>

---
## 📂 **AVS-QA** Dataset
Train and test splits of **AVS-QA** are provided [here](./avs-qa-dataset/).<br>
More details [here](./avs-qa-dataset/README.md).
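The AVS-QA schema is documented in the dataset README linked above. Purely as an illustration of how question-answer records over synchronized streams might be consumed, assuming a JSON list of records (every field name below is hypothetical, not the released format):

```python
import json

# Hypothetical AVS-QA-style record; the real field names are defined in
# ./avs-qa-dataset/README.md and may differ.
split = json.loads("""[
  {"video": "clip_0001.mp4",
   "audio": "clip_0001.wav",
   "sensor": "clip_0001_imu.csv",
   "question": "What activity is the person performing?",
   "answer": "Chopping vegetables."}
]""")

for example in split:
    # Each record pairs synchronized audio/video/sensor streams with a QA pair.
    print(example["question"], "->", example["answer"])
```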
## 🛠️ Requirements and Installation
Basic Dependencies:
* Python >= 3.8
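Only the Python floor survives in this excerpt of the dependency list; a minimal runtime guard for it (the error message wording is ours):

```python
import sys

# The card requires Python >= 3.8; fail fast on older interpreters.
if sys.version_info < (3, 8):
    raise RuntimeError("RAVEN requires Python 3.8 or newer")
print("Python", ".".join(map(str, sys.version_info[:3])), "OK")
```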

```bash
apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
```
---

## 🐳 Model Zoo
| Model Name | Modal Type |
|:----------------|:------------:|
| [RAVEN-7B-AV](https://huggingface.co/BASH-Lab/RAVEN-AV-7B) | AV |
| RAVEN-7B-AVS | AVS |

## 🤗 Sample Usage
- **STEP 1:** Download $\texttt{siglip-so400m-patch14-384}$ from [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- **STEP 2:** Download the **RAVEN** checkpoint
```bash