A unified framework for video question answering with spatio-temporal evidence generation and evaluation.
EVQA/
├── train/ # Training code and data
│ ├── unipixel/ # UniPixel model implementation
│ │ ├── model/ # Model architectures
│ │ ├── dataset/ # Dataset loaders
│ │ ├── train/ # Training scripts
│ │ └── eval/ # Evaluation scripts
│ ├── sam2/ # SAM2 integration
│ ├── scripts/ # Training scripts
│ │ ├── sft.sh # Main training script
│ │ ├── auto_eval.sh # Auto evaluation script
│ │ └── zero*.json # DeepSpeed configurations
│ ├── data/ # Training datasets
│ ├── model_zoo/ # Model checkpoints
│ ├── requirements.txt # Python dependencies
│ └── setup.py # Package setup
│
└── benchmark/ # Benchmark evaluation
  ├── st_evidence_gen/ # Generative ST-Evidence task
  │ ├── ours_st_evidence.py # UniPixel inference
  │ ├── gpt_st_evidence.py # GPT-4/5 baseline
  │ ├── gemini_st_evidence.py # Gemini baseline
  │ ├── internvl_3_5.py # InternVL baseline
  │ ├── qwen2_5vl_st_evidence.py # Qwen2.5-VL baseline
  │ └── eval_st_evidence.py # Evaluation script
  │
  └── st_evidence_mcq/ # Multiple-choice ST-Evidence task
    ├── ours_st_evidence_mcq.py # UniPixel inference
    ├── gpt_st_evidence_mcq.py # GPT baseline
    ├── gemini_st_evidence_mcq.py # Gemini baseline
    └── eval_st_evidence_mcq.py # Evaluation script
cd train
pip install -r requirements.txt
# Install PyTorch (if not already installed)
pip install torch==2.7.1+cu128 torchvision==0.22.1+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
# Install Flash Attention (optional, for faster training)
pip install flash-attn==2.8.2
The training data is organized in train/data/ with the following structure:
train/data/
├── st-evidence-instruct/ # ST-Evidence instruction tuning data
│ ├── gen_mask/ # Mask generation task
│ │ ├── st_evidence.csv # Metadata and annotations
│ │ ├── masks/ # Object masks (JSON format)
│ │ ├── video_frames_6fps/ # Extracted video frames at 6fps
│ │ └── st_evidence_meta.pkl # Additional metadata
│ └── gen_qa/vicas/ # QA generation task (141k samples)
│ └── st_evidence_vicas.csv
├── revos/ # Referring expression video segmentation
├── mevis/ # Multi-expression video segmentation
├── lvvis/ # Long video instance segmentation
├── ref_youtube_vos/ # Referring YouTube-VOS
├── ref_davis17/ # Referring DAVIS-17
├── ref_sav/ # Referring SAV
├── groundmore/ # Grounding dataset
├── vicas/ # Video caption segmentation
│ ├── annotations/
│ ├── splits/
│ ├── masks/ # Segmentation masks
│ └── video_frames/ # Video frames
├── llava_instruct/ # LLaVA instruction data
└── videogpt_plus/ # VideoGPT+ data
ST-Evidence CSV Format (st_evidence_vicas.csv):
entry_id,video_id,video_path,question,answer,candidates,mask_evidence,source,split,qa_source,temporal_evidence
10375_0,10375,010375_video.mp4,What object is the man holding?,A pole,"['A desk', 'A book', 'A chair', 'A pole']","[1, 2]",vicas,train,gemini,"[[0.0, 14.5]]"
Mask Annotation JSON Format (masks/YNIWQ_868759.json):
{
"entry_id": "YNIWQ_868759",
"video_path": "videos/star/YNIWQ.mp4",
"fps": 6.0,
"width": 480,
"height": 270,
"evidence_objects": [
{
"ref_expression": "gray backpack on the bed",
"prompts": [
{
"timestamp": 0.0,
"bbox_norm": [385.0, 420.0, 740.0, 997.0],
"frame": 0,
"confidence": 0.95
}
]
}
]
}
ST-Evidence-Instruct Dataset (~46GB):
# Download from HuggingFace
huggingface-cli download Salesforce/ST-Evidence-Instruct --repo-type dataset --local-dir train/data/st-evidence-instruct
# Extract compressed files
cd train/data/st-evidence-instruct/gen_mask
tar -xzf masks.tar.gz
tar -xzf video_frames_6fps.tar.gz
Download link: https://huggingface.co/datasets/Salesforce/ST-Evidence-Instruct
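Once the data is extracted, a row in the ST-Evidence CSV format can be read with the standard library. The sketch below is not the repository's loader; it only assumes that the candidates, mask_evidence, and temporal_evidence columns hold Python-style list literals, as in the example row in this README. The names SAMPLE, LIST_COLUMNS, and parse_row are illustrative.

```python
import ast
import csv
import io

# Header and one row copied from the CSV example in this README.
# In practice, pass an opened st_evidence_vicas.csv instead of io.StringIO(SAMPLE).
SAMPLE = """entry_id,video_id,video_path,question,answer,candidates,mask_evidence,source,split,qa_source,temporal_evidence
10375_0,10375,010375_video.mp4,What object is the man holding?,A pole,"['A desk', 'A book', 'A chair', 'A pole']","[1, 2]",vicas,train,gemini,"[[0.0, 14.5]]"
"""

# Columns assumed to store list literals as strings.
LIST_COLUMNS = ("candidates", "mask_evidence", "temporal_evidence")


def parse_row(row):
    """Return a copy of a CSV row with the list-valued columns decoded."""
    parsed = dict(row)
    for column in LIST_COLUMNS:
        # literal_eval only evaluates literals, so it is safer than eval().
        parsed[column] = ast.literal_eval(row[column])
    return parsed


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
first = rows[0]
print(first["question"], "->", first["answer"])
print("temporal evidence (seconds):", first["temporal_evidence"])
```

All other columns, including video_id, are left as strings.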
Other Training Datasets (ReVOS, MEVIS, LVVIS, etc.):
# Download UniPixel-SFT-1M dataset bundle
huggingface-cli download PolyU-ChenLab/UniPixel-SFT-1M --repo-type dataset --local-dir train/data/
Download link: https://huggingface.co/datasets/PolyU-ChenLab/UniPixel-SFT-1M
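The mask annotation JSON stores box prompts in a bbox_norm field. The sketch below converts such a box to pixel coordinates. Two points are assumptions rather than documented facts: that the box order is [x1, y1, x2, y2], and that coordinates are normalized to a 0-1000 range (inferred only from the example values exceeding the 480x270 frame size). The function name bbox_to_pixels is illustrative.

```python
import json

# Annotation copied from the mask JSON example in this README.
SAMPLE = json.loads("""
{
  "entry_id": "YNIWQ_868759",
  "video_path": "videos/star/YNIWQ.mp4",
  "fps": 6.0,
  "width": 480,
  "height": 270,
  "evidence_objects": [
    {
      "ref_expression": "gray backpack on the bed",
      "prompts": [
        {"timestamp": 0.0, "bbox_norm": [385.0, 420.0, 740.0, 997.0],
         "frame": 0, "confidence": 0.95}
      ]
    }
  ]
}
""")


def bbox_to_pixels(bbox_norm, width, height, scale=1000.0):
    """Map a normalized box to pixels.

    ASSUMPTION: bbox_norm is [x1, y1, x2, y2] on a 0..scale grid.
    """
    x1, y1, x2, y2 = bbox_norm
    return (x1 / scale * width, y1 / scale * height,
            x2 / scale * width, y2 / scale * height)


for obj in SAMPLE["evidence_objects"]:
    for prompt in obj["prompts"]:
        box = bbox_to_pixels(prompt["bbox_norm"], SAMPLE["width"], SAMPLE["height"])
        print(obj["ref_expression"], "frame", prompt["frame"],
              [round(v, 1) for v in box])
```

If the dataset uses a different normalization, only the scale argument needs to change.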
cd train
bash scripts/sft.sh [model_size] [sam2_type] [resume_option]
Parameters:
- model_size: 3b or 7b (default: 3b)
- sam2_type: base or large (default: base)
- resume_option:
  - --from_scratch: Train from scratch
  - /path/to/checkpoint: Resume from a specific checkpoint
  - (empty): Auto-resume if a checkpoint exists
# Train UniPixel-3B with SAM2-Base from scratch
bash scripts/sft.sh 3b base --from_scratch
# Train UniPixel-7B with SAM2-Large (auto-resume)
bash scripts/sft.sh 7b large
# Resume from specific checkpoint
bash scripts/sft.sh 3b base /path/to/checkpoint/dir
The training script supports various datasets and configurations:
- Datasets: ST-Evidence, ReVOS, MEVIS, LVVIS, Ref-YouTube-VOS, Ref-DAVIS, etc.
- LoRA: Enabled with r=128, alpha=256
- Learning Rate: 2e-5 (SAM2: 5e-6)
- Batch Size: 4 per device with 8 GPUs
- Max Frames: 8 randomly sampled frames per video
Training logs and checkpoints will be saved to train/work_dirs/{model_size}/finetune_1e_sam2_{type}_videoseg_eccv_merged_v2/
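The three resume_option cases of scripts/sft.sh can be mirrored in Python for readers who want to script around the training run. This is a sketch of the documented behavior, not the script's actual implementation. It assumes that checkpoints are subdirectories named checkpoint-<step> (the Hugging Face Trainer convention) and that {type} in the output path is the sam2_type argument; default_work_dir, latest_checkpoint, and resolve_resume are hypothetical names.

```python
from pathlib import Path


def default_work_dir(model_size="3b", sam2_type="base"):
    """Build the output directory named in this README.

    ASSUMPTION: {type} in the path template is the sam2_type argument.
    """
    return (f"train/work_dirs/{model_size}/"
            f"finetune_1e_sam2_{sam2_type}_videoseg_eccv_merged_v2")


def latest_checkpoint(work_dir):
    """Return the checkpoint-<step> directory with the largest step, or None.

    ASSUMPTION: checkpoints follow the checkpoint-<step> naming scheme.
    """
    candidates = []
    for path in Path(work_dir).glob("checkpoint-*"):
        step = path.name.rsplit("-", 1)[-1]
        if path.is_dir() and step.isdigit():
            candidates.append((int(step), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def resolve_resume(resume_option, work_dir):
    """Map the third CLI argument to a (mode, checkpoint) pair."""
    if resume_option == "--from_scratch":
        return ("scratch", None)
    if resume_option:
        # Any other non-empty value is treated as an explicit checkpoint path.
        return ("resume", Path(resume_option))
    # Empty: auto-resume only if a checkpoint already exists.
    checkpoint = latest_checkpoint(work_dir)
    return ("resume", checkpoint) if checkpoint else ("scratch", None)


print(default_work_dir("7b", "large"))
print(resolve_resume("--from_scratch", default_work_dir()))
```

Comparing steps numerically avoids picking checkpoint-20 over checkpoint-100, which a plain string sort would do.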
Generate answers and spatio-temporal evidence for video questions.
These models can directly generate segmentation masks for spatial evidence.
UniPixel:
cd benchmark/st_evidence_gen
# Single GPU
CUDA_VISIBLE_DEVICES=0 python unipixel_st_evidence.py \
--fps 1.0 \
--model PolyU-ChenLab/UniPixel-3B
# Multi-GPU (8 GPUs)
bash ours_st_evidence_multigpu.sh /path/to/trained/model 8 0
SA2VA:
CUDA_VISIBLE_DEVICES=0 python sa2va_st_evidence.py --device cuda
These models generate referring expressions first, then use UniPixel to generate masks.
Step 1: Generate Answers and Referring Expressions
cd benchmark/st_evidence_gen
# GPT
python gpt_st_evidence.py --model o3 --mode single-turn --fps 1
# Gemini
python gemini_st_evidence.py --model gemini-2.5-flash --mode single-turn
python gemini_st_evidence.py --model gemini-2.5-pro --mode single-turn
# InternVL 3.5
CUDA_VISIBLE_DEVICES=0 python internvl_3_5.py --mode single-turn
# Qwen2.5-VL
CUDA_VISIBLE_DEVICES=0 python qwen2_5vl_st_evidence.py \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--fps 1 \
--mode single-turn
# Qwen2.5-VL-72B (multi-GPU)
CUDA_VISIBLE_DEVICES=0,1,2,3 python qwen2_5vl_st_evidence.py \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--fps 1 \
--gpu-num 4
# Qwen3-VL
CUDA_VISIBLE_DEVICES=0 python qwen3vl_st_evidence.py \
--model Qwen/Qwen3-VL-4B-Instruct \
--fps 1 \
--batch-size 8 \
--gpu-num 1 \
--mode single-turn
# VideoLLaMA3
CUDA_VISIBLE_DEVICES=0 python videollama3_st_evidence.py \
--model DAMO-NLP-SG/VideoLLaMA3-3B \
--mode single-turn \
--fps 1 \
--max-frames 128
# LLaVA-OV-1.5
CUDA_VISIBLE_DEVICES=0 python llava_ov1_5_st_evidence.py \
--mode single-turn \
--max-size 512
Step 2: Generate Masks from Referring Expressions
python unipixel_video_seg.py \
results/qwen3vl/qwen3_vl_235b_a22b_instruct_st_evidence_ref_exp_1fps.json \
--save_masks \
--skip_viz \
--mode seperate \
--batch_size 5 \
--every_n_frames 6
# Evaluate QA and temporal evidence only
python eval_st_evidence.py --pred_file results/ours/predictions.json
# Evaluate with mask quality metrics (J, F, J&F scores)
python eval_st_evidence.py \
--pred_file results/gemini/gemini_2_5_flash_st_evidence_single_1fps.json \
--eval_masks \
--pred_mask_dir results/gemini/gemini_2_5_flash_st_evidence_single_1fps/concat \
--num_workers 32
Metrics:
- QA Accuracy: Percentage of correct answers
- Temporal IoU/IoP: mIoU, TIoU@0.3, TIoU@0.5, mIoP, TIoP@0.3, TIoP@0.5
- Mask Quality (optional): J score (region IoU), F score (contour accuracy), J&F score
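To make the temporal metrics concrete, the sketch below computes IoU and IoP between predicted and ground-truth segments. It is an illustration under stated assumptions, not the logic of eval_st_evidence.py: multiple segments are treated as a union of intervals, IoP is taken as intersection over total predicted duration, and the @0.3 and @0.5 variants are taken as the fraction of samples whose score reaches the threshold (inclusive). All function names are hypothetical.

```python
def merge(segments):
    """Merge overlapping [start, end] segments into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(segments):
    """Total duration of a list of disjoint segments."""
    return sum(end - start for start, end in segments)


def intersection_length(a, b):
    """Total overlap between two lists of disjoint segments."""
    return sum(max(0.0, min(a_end, b_end) - max(a_start, b_start))
               for a_start, a_end in a
               for b_start, b_end in b)


def temporal_iou_iop(pred, gt):
    """Return (IoU, IoP) for predicted vs. ground-truth segments."""
    pred, gt = merge(pred), merge(gt)
    inter = intersection_length(pred, gt)
    pred_len = total_length(pred)
    union = pred_len + total_length(gt) - inter
    iou = inter / union if union > 0 else 0.0
    iop = inter / pred_len if pred_len > 0 else 0.0
    return iou, iop


def rate_at(scores, threshold):
    """Fraction of samples with score >= threshold (ASSUMED inclusive)."""
    return sum(s >= threshold for s in scores) / len(scores) if scores else 0.0


# Segments taken from the prediction format example in this README.
iou, iop = temporal_iou_iop([[0.5, 2.3], [5.1, 7.8]], [[0.6, 2.5], [5.0, 8.0]])
print(f"IoU={iou:.3f} IoP={iop:.3f}")
```

Under these assumptions, mIoU and mIoP are the means of the per-sample values, and TIoU@0.5 would be rate_at(ious, 0.5).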
Multiple-choice evaluation for QA, temporal evidence, and spatial evidence selection.
cd benchmark/st_evidence_mcq
# Run single task
python ours_st_evidence_mcq.py --task qa --model /path/to/model
# Run all tasks (QA + temporal + spatial)
python ours_st_evidence_mcq.py --task all --model /path/to/model
Tasks:
- qa: Video question answering (5 options: A/B/C/D/E)
- time_evidence: Select best temporal segment (4 options: A/B/C/D)
- spatial_evidence: Select best masked region (4 options: A/B/C/D)
- all: Run all three tasks sequentially
# GPT
python gpt_st_evidence_mcq.py --task all --model o3
# Gemini
python gemini_st_evidence_mcq.py --fps 1 --task all --model gemini-2.5-pro
# Qwen2.5-VL
CUDA_VISIBLE_DEVICES=0,1,2,3 python qwen2_5vl_st_evidence_mcq.py --task all --model Qwen/Qwen2.5-VL-72B-Instruct --batch_size 8 --gpu_num 4
# Qwen3-VL
CUDA_VISIBLE_DEVICES=0 python qwen3vl_st_evidence_mcq.py --task all --model Qwen/Qwen3-VL-4B-Instruct --batch_size 8 --gpu_num 1
python eval_st_evidence_mcq.py --pred_file results/ours/predictions_all.json
Output:
- QA Accuracy
- Temporal Evidence Accuracy
- Spatial Evidence Accuracy
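As a rough illustration of how these three accuracies relate to the multiple-choice prediction file, the sketch below scores a dictionary that uses the keys answer/gt_answer, evidence_t/gt_evidence_t, and evidence_s/gt_evidence_s from this README's prediction format. It is not the code of eval_st_evidence_mcq.py; the case-insensitive comparison and the choice to skip entries missing a key pair are assumptions, and mcq_accuracy is a hypothetical name.

```python
# Maps each task name to its (prediction key, ground-truth key) pair.
KEY_PAIRS = {
    "qa": ("answer", "gt_answer"),
    "time_evidence": ("evidence_t", "gt_evidence_t"),
    "spatial_evidence": ("evidence_s", "gt_evidence_s"),
}


def mcq_accuracy(predictions):
    """Return per-task accuracy over entries that contain both keys."""
    scores = {}
    for task, (pred_key, gt_key) in KEY_PAIRS.items():
        scored = [entry for entry in predictions.values()
                  if pred_key in entry and gt_key in entry]
        if not scored:
            continue  # task was not run for this prediction file
        # ASSUMPTION: option letters are compared case-insensitively.
        correct = sum(
            str(entry[pred_key]).strip().upper() == str(entry[gt_key]).strip().upper()
            for entry in scored
        )
        scores[task] = correct / len(scored)
    return scores


# Single entry taken from the multiple-choice format example in this README.
example = {
    "entry_id": {
        "answer": "A", "gt_answer": "B",
        "evidence_t": "C", "gt_evidence_t": "A",
        "evidence_s": "D", "gt_evidence_s": "D",
    }
}
print(mcq_accuracy(example))
```

In practice the dictionary would come from json.load on the file passed as --pred_file.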
{
"entry_id": {
"answer": "answer text",
"gt_answer": "ground truth",
"time_segments": [[0.5, 2.3], [5.1, 7.8]],
"gt_time_segments": [[0.6, 2.5], [5.0, 8.0]],
"referring_expressions": ["person on the left", "red car"],
"mask_dir": "path/to/masks"
}
}
{
"entry_id": {
"answer": "A",
"gt_answer": "B",
"evidence_t": "C",
"gt_evidence_t": "A",
"evidence_s": "D",
"gt_evidence_s": "D"
}
}
CC-BY-NC 4.0
This repository is released for research purposes only, in support of the academic paper Evidence-Backed Video Question Answering.