A Simple Baseline for Streaming Video Understanding
Abstract
A simple sliding-window approach using recent video frames outperforms complex memory-based streaming video understanding methods, revealing trade-offs between real-time perception and long-term memory capabilities.
Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
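The baseline described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' released implementation: the class name, the `vlm` callable, and the string "frames" are all hypothetical stand-ins.

```python
from collections import deque

class SlidingWindowStream:
    """Minimal sliding-window baseline: keep only the most recent N frames.

    Hypothetical sketch of the idea in the abstract; names and the `vlm`
    interface are illustrative assumptions, not the paper's actual code.
    """

    def __init__(self, n_frames=4):
        # A deque with maxlen automatically evicts the oldest frame,
        # so the buffer always holds at most the N most recent frames.
        self.buffer = deque(maxlen=n_frames)

    def add_frame(self, frame):
        self.buffer.append(frame)

    def answer(self, question, vlm):
        # Query an off-the-shelf VLM with only the buffered frames;
        # `vlm` is assumed to be any callable taking (frames, question).
        return vlm(list(self.buffer), question)

# Stream 10 frames; only the last 4 survive in the window.
stream = SlidingWindowStream(n_frames=4)
for t in range(10):
    stream.add_frame(f"frame_{t}")

print(list(stream.buffer))  # ['frame_6', 'frame_7', 'frame_8', 'frame_9']
```

The key point is that there is no memory, retrieval, or compression module at all: the only state is the fixed-size frame buffer.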
Community
🚀 SimpleStream: Rethinking Memory in Streaming Video Understanding
Recent streaming video understanding methods increasingly rely on complex memory mechanisms.
We revisit a simple question: do we really need them?
🔑 Key finding
A simple sliding-window baseline using only the most recent N frames can match or outperform existing memory-based approaches.
📊 Highlights
- Uses only recent frames (e.g., N=4), without explicit memory modules
- Achieves 67.7% on OVO-Bench and 80.6% on StreamingBench under real-time evaluation
- Reveals a consistent perception–memory trade-off: adding more historical context can improve recall, but often weakens real-time perception
💡 Takeaway
Performance gains from memory are not guaranteed.
We argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so improvements from added complexity can be evaluated more clearly.
📄 Paper: https://arxiv.org/abs/2604.02317
💻 Code: https://github.com/EvolvingLMMs-Lab/SimpleStream
🌐 Project: https://simple-stream.github.io/