NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR2021)

Last update: Nov 24, 2022

Overview

NExT-QA

We reproduce several SOTA VideoQA methods to provide benchmark results for our NExT-QA dataset, which was accepted to CVPR2021 (with 1 'Strong Accept' and 2 'Weak Accept's).

NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about causal and temporal actions and to understand the rich object interactions in daily activities. We set up both multi-choice and open-ended QA tasks on the dataset. This repository provides resources for multi-choice QA; resources for open-ended QA are in NExT-OE. For more details, please refer to our dataset page.

Environment

Anaconda 4.8.4, python 3.6.8, pytorch 1.6 and cuda 10.2. For other libs, please refer to the file requirements.txt.

Install

Please create an environment for this project using Anaconda (install Anaconda first if you do not have it):

>conda create -n videoqa python=3.6.8
>conda activate videoqa
>git clone https://github.com/doc-doc/NExT-QA.git
>cd NExT-QA
>pip install -r requirements.txt #may take some time to install

Data Preparation

Please download the pre-computed features and QA annotations from here. There are 4 zip files:

['vid_feat.zip']: Appearance and motion features for video representation (extracted with code provided by HCRN).
['qas_bert.zip']: Fine-tuned BERT features for QA-pair representation (based on pytorch-pretrained-BERT).
['nextqa.zip']: QA annotations and GloVe embeddings.
['models.zip']: Learned HGA model.

After downloading the data, please create a folder ['data/feats'] in the same directory as ['NExT-QA'], then unzip the video and QA features into it. You will have directories like ['data/feats/vid_feat/', 'data/feats/qas_bert/' and 'NExT-QA/'] in your workspace. Please unzip the files in ['nextqa.zip'] into ['NExT-QA/dataset/nextqa'] and the files in ['models.zip'] into ['NExT-QA/models/'].
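Before running anything, it can help to confirm that the folders ended up where the instructions above expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and it assumes you run it from the workspace folder that contains both 'data/' and 'NExT-QA/'.

```python
from pathlib import Path

# Directories named in the instructions above, relative to the workspace
# folder that holds both 'data/' and 'NExT-QA/'.
EXPECTED_DIRS = [
    "data/feats/vid_feat",
    "data/feats/qas_bert",
    "NExT-QA/dataset/nextqa",
    "NExT-QA/models",
]


def missing_dirs(workspace):
    """Return the expected directories that do not exist under `workspace`.

    Hypothetical helper for illustration; it only checks that the
    directories exist, not what they contain.
    """
    root = Path(workspace)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:")
        for d in missing:
            print("  " + d)
    else:
        print("Data layout looks complete.")
```

An empty result only means the folders are present; it does not verify that the zip files were fully extracted.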

(You are also encouraged to design your own pre-computed video features. In that case, please download the raw videos from VidOR. As NExT-QA's videos are sourced from VidOR, you can link the QA annotations with the corresponding videos according to the key 'video' in the ['nextqa/.csv'] files; for this you may need the map file ['nextqa/map_vid_vidorID.json'].)
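The linking step described above could look roughly like the sketch below. It rests on two assumptions that the text does not confirm: that each annotation CSV has a header row containing a 'video' column, and that the map file is a flat JSON object from video key (as a string) to VidOR ID. The function name `link_videos` is hypothetical; please check the actual files before relying on it.

```python
import csv
import json


def link_videos(csv_path, map_path):
    """Return {video key: VidOR ID} for every video referenced in the CSV.

    Assumptions (not confirmed by the README): the CSV has a 'video'
    column, and the JSON file is a flat object that maps each video key
    (as a string) to its VidOR ID.
    """
    with open(map_path, encoding="utf-8") as f:
        vid_to_vidor = json.load(f)

    linked = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = row["video"]
            # Keys with no entry in the map file are skipped, not guessed.
            if key in vid_to_vidor:
                linked[key] = vid_to_vidor[key]
    return linked
```

To use it, pass the path of one annotation CSV file and the path of 'map_vid_vidorID.json'. Several questions usually refer to the same video, so the result has one entry per distinct video rather than one per question.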

Usage

Once the data is ready, you can easily run the code. First, to test the environment and code, we provide the prediction and model of the SOTA approach (i.e., HGA) on NExT-QA. You can get the results reported in the paper by running:

>python eval_mc.py

The command above will load the prediction file under ['results/'] and evaluate it. You can also obtain the prediction by running:

>./main.sh 0 val #Test the model with GPU id 0

The command above will load the model under ['models/'] and generate the prediction file. If you want to train the model, please run:

>./main.sh 0 train # Train the model with GPU id 0

This will train the model and save it to ['models/']. (The results may differ slightly depending on the environment.)
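For readers who want to see what a multi-choice evaluation computes, here is a minimal sketch of per-type and overall accuracy. It is not the code in eval_mc.py: the record layout (question type, predicted choice, correct choice) and the function name `accuracy_by_type` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Compute accuracy (in %) per question type and overall.

    `records` is an iterable of (question_type, predicted, answer)
    tuples. This layout is hypothetical and is NOT the repository's
    prediction-file format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, answer in records:
        total[qtype] += 1
        correct[qtype] += int(predicted == answer)

    result = {t: 100.0 * correct[t] / total[t] for t in total}
    n = sum(total.values())
    # Overall accuracy is taken over all questions, so it is weighted by
    # how many questions each type has, not a plain mean of the types.
    result["all"] = 100.0 * sum(correct.values()) / n if n else 0.0
    return result


if __name__ == "__main__":
    demo = [("C", 0, 0), ("C", 1, 2), ("T", 3, 3), ("D", 4, 4)]
    print(accuracy_by_type(demo))
```

Because the overall figure is computed over all questions, a model can score well on one question type and still have a low overall accuracy if that type is rare.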

Results

Methods	Text Rep.	Acc_C	Acc_T	Acc_D	Acc	Text Rep.	Acc_C	Acc_T	Acc_D	Acc
BlindQA	GloVe	26.89	30.83	32.60	30.60	BERT-FT	42.62	45.53	43.89	43.76
EVQA	GloVe	28.69	31.27	41.44	31.51	BERT-FT	42.64	46.34	45.82	44.24
STVQA [CVPR17]	GloVe	36.25	36.29	55.21	39.21	BERT-FT	44.76	49.26	55.86	47.94
CoMem [CVPR18]	GloVe	35.10	37.28	50.45	38.19	BERT-FT	45.22	49.07	55.34	48.04
HME [CVPR19]	GloVe	37.97	36.91	51.87	39.79	BERT-FT	46.18	48.20	58.30	48.72
HCRN [CVPR20]	GloVe	39.09	40.01	49.16	40.95	BERT-FT	45.91	49.26	53.67	48.20
HGA [AAAI20]	GloVe	35.71	38.40	55.60	39.67	BERT-FT	46.26	50.74	59.33	49.74
Human	-	87.61	88.56	90.40	88.38	-	87.61	88.56	90.40	88.38
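As a small worked example of reading the table, the sketch below computes how much overall accuracy (Acc) each method gains when GloVe is replaced by BERT-FT. The numbers are copied from the table above; the names `OVERALL_ACC` and `bert_ft_gain` are introduced here for illustration only.

```python
# Overall accuracy (Acc) copied from the table above: (GloVe, BERT-FT).
OVERALL_ACC = {
    "BlindQA": (30.60, 43.76),
    "EVQA": (31.51, 44.24),
    "STVQA": (39.21, 47.94),
    "CoMem": (38.19, 48.04),
    "HME": (39.79, 48.72),
    "HCRN": (40.95, 48.20),
    "HGA": (39.67, 49.74),
}


def bert_ft_gain(acc):
    """Return {method: BERT-FT Acc minus GloVe Acc}, rounded to 2 decimals."""
    return {m: round(bert - glove, 2) for m, (glove, bert) in acc.items()}


if __name__ == "__main__":
    gains = bert_ft_gain(OVERALL_ACC)
    # Print methods from largest to smallest gain.
    for method, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
        print(f"{method:8s} +{gain:.2f}")
```

With these values the gain ranges from 7.25 points (HCRN) to 13.16 points (BlindQA), and every method remains far below the human accuracy of 88.38.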

Multi-choice QA vs. Open-ended QA

Citation

@article{xiao2021next,
  title={NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions},
  author={Xiao, Junbin and Shang, Xindi and Yao, Angela and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2105.08276},
  year={2021}
}

Todo

Open evaluation server and release test data.
Release spatial feature.
Release RoI feature.

Acknowledgement

Our reproduction of the methods is based on the respective official repositories, and we thank the authors for releasing their code. If you use the related part, please cite the corresponding paper noted in the code comments.

Owner

Junbin Xiao
