Leon1207/Video-RAG-master

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

😮 Highlights

  • We integrate RAG into open-source LVLMs: Video-RAG incorporates three types of visually-aligned auxiliary texts (OCR, ASR, and object detection) processed by external tools and retrieved via RAG, enhancing the LVLM. It’s implemented using completely open-source tools, without the need for any commercial APIs.
  • We design a versatile plug-and-play RAG-based pipeline for any LVLM: Video-RAG offers a training-free solution for a wide range of LVLMs, delivering performance improvements with minimal additional resource requirements.
  • We achieve proprietary-level performance with open-source models: Applying Video-RAG to a 72B open-source model yields state-of-the-art performance on Video-MME, surpassing models such as Gemini-1.5-Pro.

🔨 Usage

This repo is built upon LLaVA-NeXT:

  • Step 1: Clone and build LLaVA-NeXT conda environment:
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"

Then install the following packages in the llava environment:

pip install spacy faiss-cpu easyocr ffmpeg-python
pip install torch==2.1.2 torchaudio numpy
python -m spacy download en_core_web_sm
# Optional: pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
  • Step 2: Clone APE and set it up in a separate conda environment:
git clone https://github.com/shenyunhang/APE
cd APE
pip3 install -r requirements.txt
python3 -m pip install -e .
  • Step 3: Copy all the files in vidrag_pipeline into the root directory of LLaVA-NeXT.

  • Step 4: Copy all the files in ape_tools into the demo directory of APE.

  • Step 5: Start the APE service by running the script under APE/demo:

python demo/ape_service.py
  • Step 6: You can now run our pipeline, built upon LLaVA-Video-7B, with:
python vidrag_pipeline.py

Note

You can also use our pipeline with any LVLM by modifying the following parts of vidrag_pipeline.py:

1. The video-language model that is loaded (line #161).
2. The llava_inference() function: make sure your model supports inputs both with and without video (line #175).
3. The process_video() function, which may need adapting to your model (line #34).
4. The final prompt, which may need adapting to your model (line #366).
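
As a rough illustration of point 2, the sketch below wraps a model call so that it accepts a prompt with or without video frames. It is not the code in vidrag_pipeline.py: `make_inference`, `model_fn`, and `echo_model` are hypothetical names, and the real llava_inference() signature may differ.

```python
from typing import Callable, Optional, Sequence

# Hypothetical model interface: a prompt plus optional video frames -> answer text.
ModelFn = Callable[[str, Optional[Sequence[object]]], str]


def make_inference(model_fn: ModelFn) -> Callable[..., str]:
    """Wrap a model so it can be called with or without video frames."""

    def inference(prompt: str, frames: Optional[Sequence[object]] = None) -> str:
        if frames is None or len(frames) == 0:
            # Text-only call: no visual input is passed to the model.
            return model_fn(prompt, None)
        # Video call: the frames are forwarded together with the prompt.
        return model_fn(prompt, frames)

    return inference


def echo_model(prompt: str, frames: Optional[Sequence[object]]) -> str:
    """Stand-in model used only for this example; it reports what it received."""
    n = 0 if frames is None else len(frames)
    return f"[{n} frames] {prompt}"


infer = make_inference(echo_model)
print(infer("What is shown?"))                    # text-only path
print(infer("What is shown?", ["f0", "f1"]))      # video path
```

When porting to another LVLM, the part to replace is the stand-in `echo_model`; the two call paths stay the same.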

☁️ Optional: TwelveLabs (Marengo + Pegasus)

The pipeline ships an opt-in TwelveLabs backend in addition to the default fully open-source path. It is disabled by default and changes nothing unless you set the environment variables below, so the original behavior is preserved.

  • Marengo replaces the local Contriever retriever with hosted multimodal (512-dim) embeddings for the OCR/ASR RAG step — same retrieve_documents_with_dynamic signature, same FAISS range search, no local embedding model to load.
  • Pegasus replaces the local LLaVA-Video model for the final answer step — it reads the source video server-side, so you don't need the LVLM weights or a GPU to produce an answer.
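
To make the retrieval step concrete, here is a minimal pure-Python sketch of a range search: it keeps every auxiliary text whose embedding is similar enough to the query, rather than a fixed top-k. This stands in for the FAISS range search mentioned above and is not the pipeline's code; `range_search`, the cosine metric, and the threshold values are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)
    return [x / norm for x in v]


def range_search(
    query: Sequence[float],
    docs: Sequence[Sequence[float]],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for every doc scoring >= threshold."""
    q = normalize(query)
    hits = []
    for i, d in enumerate(docs):
        score = sum(a * b for a, b in zip(q, normalize(d)))
        if score >= threshold:
            hits.append((i, score))
    # Most similar first.
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits


# Toy 2-dimensional "embeddings" for three text chunks.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = range_search([1.0, 0.0], docs, threshold=0.5)
print([i for i, _ in hits])
```

Unlike top-k retrieval, the number of returned chunks varies with the query: a stricter threshold returns fewer chunks, and a query that matches nothing returns an empty list.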

Install the SDK and set a key (free tier at https://twelvelabs.io):

pip install twelvelabs
export TWELVELABS_API_KEY=<your-key>

Marengo retrieval (drop-in for Contriever):

export USE_TWELVELABS_RETRIEVER=1
python vidrag_pipeline.py

Pegasus answering (point it at the source video):

export USE_PEGASUS=1
export TWELVELABS_VIDEO_URL=https://.../video.mp4   # or TWELVELABS_VIDEO_ID / TWELVELABS_ASSET_ID
python vidrag_pipeline.py

Both flags are independent and can be combined. Optional overrides: TWELVELABS_EMBED_MODEL (default marengo3.0), TWELVELABS_ANALYZE_MODEL (default pegasus1.5), TWELVELABS_MAX_TOKENS (default 2048).
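
The sketch below shows one way these variables and defaults could be read into a single settings object. The variable names and default values come from the text above; `TwelveLabsConfig`, `load_config`, and the rule that only the value "1" enables a flag are assumptions, and the pipeline's own parsing may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TwelveLabsConfig:
    api_key: Optional[str]
    use_retriever: bool
    use_pegasus: bool
    embed_model: str
    analyze_model: str
    max_tokens: int


def _flag(env: Mapping[str, str], name: str) -> bool:
    # Assumption: a flag is on only when it is set to exactly "1".
    return env.get(name, "") == "1"


def load_config(env: Mapping[str, str] = os.environ) -> TwelveLabsConfig:
    """Read the TwelveLabs settings, falling back to the documented defaults."""
    return TwelveLabsConfig(
        api_key=env.get("TWELVELABS_API_KEY") or None,
        use_retriever=_flag(env, "USE_TWELVELABS_RETRIEVER"),
        use_pegasus=_flag(env, "USE_PEGASUS"),
        embed_model=env.get("TWELVELABS_EMBED_MODEL", "marengo3.0"),
        analyze_model=env.get("TWELVELABS_ANALYZE_MODEL", "pegasus1.5"),
        max_tokens=int(env.get("TWELVELABS_MAX_TOKENS", "2048")),
    )


# With nothing set, both backends stay off and the defaults apply.
default_cfg = load_config({})
print(default_cfg.use_retriever, default_cfg.use_pegasus, default_cfg.max_tokens)

# The two flags are independent: here only Pegasus is enabled.
pegasus_cfg = load_config({"USE_PEGASUS": "1", "TWELVELABS_MAX_TOKENS": "512"})
print(pegasus_cfg.use_retriever, pegasus_cfg.use_pegasus, pegasus_cfg.max_tokens)
```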

A focused test (the network part is skipped without a key) lives at vidrag_pipeline/tools/test_twelvelabs.py:

python tools/test_twelvelabs.py
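
For readers who want the same skip-without-a-key behavior in their own tests, the following is a minimal sketch using the standard unittest module. It is not the content of test_twelvelabs.py: the class, the `api_key` helper, and the placeholder network test are hypothetical.

```python
import os
import unittest
from typing import Mapping, Optional


def api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the TwelveLabs key if one is set, else None."""
    return env.get("TWELVELABS_API_KEY") or None


class TwelveLabsSmokeTest(unittest.TestCase):
    def test_key_lookup_offline(self):
        # Runs everywhere: no network access is needed.
        self.assertIsNone(api_key({}))
        self.assertEqual(api_key({"TWELVELABS_API_KEY": "k"}), "k")

    @unittest.skipUnless(api_key(), "TWELVELABS_API_KEY not set; skipping network test")
    def test_network(self):
        # Placeholder: a real test would call the hosted API here.
        self.assertTrue(api_key())
```

Saved as a module, this can be run with `python -m unittest <module>`; without a key, the network test is reported as skipped rather than failed.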

✏️ Citation

If you find our paper and code useful in your research, please consider giving the repo a star ⭐ and citing our paper 📝:

@misc{luo2024videoragvisuallyalignedretrievalaugmentedlong,
      title={Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension}, 
      author={Yongdong Luo and Xiawu Zheng and Xiao Yang and Guilin Li and Haojia Lin and Jinfa Huang and Jiayi Ji and Fei Chao and Jiebo Luo and Rongrong Ji},
      year={2024},
      eprint={2411.13093},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.13093}, 
}

About

✨✨[NeurIPS 2025] This is the official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension"
