Subrata132 and nielsr (HF Staff) committed
Commit 64f7ec6 · verified · 1 Parent(s): 4bc248b

Improve model card: Add pipeline tag, library name, abstract, results, and GitHub link (#1)


- Improve model card: Add pipeline tag, library name, abstract, results, and GitHub link (2cf6f9c1573c235ff6fa2ee96b8eecb01ef2a996)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +31 -2
README.md CHANGED
@@ -1,6 +1,9 @@
 ---
 license: apache-2.0
+pipeline_tag: video-text-to-text
+library_name: transformers
 ---
+
 <p align="center">
     <img src="./assets/raven_logo.png" width="100" style="margin-bottom: 0.2;"/>
 <p>
@@ -16,13 +19,33 @@ license: apache-2.0
     <a href="https://bashlab.github.io/raven_project/" style="color:#825987">
     https://bashlab.github.io/raven_project/
   </a>
+  &bull; Code:
+  <a href="https://github.com/BASHLab/RAVEN" style="color:#825987">
+  https://github.com/BASHLab/RAVEN
+  </a>
 </h5>
 <p align="center">
     <img src="./assets/raven_architecture.png" width="800" />
 <p>
 
 ---
-
+
+## Abstract
+Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks -- including egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
+
+---
+## 🚀 Main Results
+##### Comparison of **RAVEN** and prior MLLMs on *exocentric* open-ended video QA (MSVD-QA, MSRVTT-QA, ActivityNet-QA) and audio-visual QA (AVSD, MUSIC-QA) benchmarks. Best and second-best scores are in $\textbf{Bold}$ and $\underline{\text{underline}}$. $^*$ indicates scores reproduced by us.
+<p><img src="./assets/main_result_exo.png" width="800"></p>
+
+##### Comparison of **RAVEN** with MLLMs on the EgoThink (Reasoning) and AVS-QA benchmarks. **RAVEN** outperforms prior MLLMs across metrics and excels in reasoning. $\textbf{Bold}$ and $\underline{\text{underline}}$ indicate the best and second-best scores.
+<p><img src="./assets/main_result_ego.png" width="800"></p>
+
+---
+## 📁 **AVS-QA** Dataset
+Train and test splits of **AVS-QA** are provided [here](./avs-qa-dataset/).<br>
+More details [here](./avs-qa-dataset/README.md).
+
 ## 🛠️ Requirements and Installation
 Basic Dependencies:
 * Python >= 3.8
@@ -40,7 +63,13 @@ apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
 ```
 ---
 
-## 🤖 Inference
+## 🍀 Model Zoo
+| Model Name | Modal Type |
+|:----------------|:------------:|
+| [RAVEN-7B-AV](https://huggingface.co/BASH-Lab/RAVEN-AV-7B) | AV |
+| RAVEN-7B-AVS | AVS |
+
+## 🤖 Sample Usage
 - **STEP 1:** Download $\texttt{siglip-so400m-patch14-384}$ from [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
 - **STEP 2:** Download the **RAVEN** checkpoint
 ```bash
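
The diff ends at the opening fence of the new Sample Usage section, so the card's actual download commands are not shown above. A minimal sketch of STEP 1 and STEP 2 using the `huggingface_hub` Python API (repo ids are taken from the card; the local paths are placeholders, not paths from the commit):

```python
# Sketch of STEP 1 and STEP 2 from the added "Sample Usage" section;
# local_dir values are placeholders.
from huggingface_hub import snapshot_download

# STEP 1: SigLIP vision encoder
snapshot_download("google/siglip-so400m-patch14-384",
                  local_dir="./siglip-so400m-patch14-384")

# STEP 2: RAVEN checkpoint (the AV variant linked in the Model Zoo table)
snapshot_download("BASH-Lab/RAVEN-AV-7B", local_dir="./RAVEN-AV-7B")
```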
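The Abstract added in the second hunk describes QuART as a query-conditioned gating module that assigns a scalar relevance score to every token before fusion. A toy PyTorch sketch of that idea, purely illustrative and not the released implementation (all names and shapes are assumptions):

```python
# Illustrative sketch of query-conditioned token gating in the spirit of QuART,
# as described in the added Abstract. Not the released implementation.
import torch
import torch.nn as nn

class QueryConditionedGate(nn.Module):
    """Scores each modality token against a query vector and rescales it."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # one scalar relevance per token

    def forward(self, tokens: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim) from any modality; query: (batch, dim)
        q = query.unsqueeze(1).expand(-1, tokens.size(1), -1)
        gate = torch.sigmoid(self.score(torch.cat([tokens, q], dim=-1)))
        return tokens * gate  # amplify relevant tokens, suppress distractors

# Each modality's tokens would pass through such a gate before fusion, e.g.:
# fused = torch.cat([g(video_toks, q), g(audio_toks, q), g(imu_toks, q)], dim=1)
```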
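Finally, the `pipeline_tag: video-text-to-text` front matter added in the first hunk is what surfaces the checkpoint under that task filter on the Hub. An illustrative query, not part of the commit:

```python
# List Hub models tagged video-text-to-text whose name matches "RAVEN".
from huggingface_hub import list_models

for model in list_models(pipeline_tag="video-text-to-text", search="RAVEN"):
    print(model.id)
```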