MUSIC-AVQA, CVPR2022 (ORAL)

Related tags

AudioMUSIC-AVQA
Overview

Audio-Visual Question Answering (AVQA)

PyTorch code accompanies our CVPR 2022 paper:

Learning to Answer Questions in Dynamic Audio-Visual Scenarios (Oral Presentation)

Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen and Di Hu

Resources: [Paper], [Supplementary], [Poster], [Video]

Project Homepage: https://gewu-lab.github.io/MUSIC-AVQA/


What's Audio-Visual Question Answering Task?

We focus on audio-visual question answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes.

MUSIC-AVQA Dataset

The large-scale MUSIC-AVQA dataset of musical performance, which contains 45,867 question-answer pairs, distributed in 9,288 videos for over 150 hours. All QA pairs types are divided into 3 modal scenarios, which contain 9 question types and 33 question templates. Finally, as an open-ended problem of our AVQA tasks, all 42 kinds of answers constitute a set for selection.

  • QA examples

Model Overview

To solve the AVQA problem, we propose a spatio-temporal grounding model to achieve scene understanding and reasoning over audio and visual modalities. An overview of the proposed framework is illustrated in below figure.

Requirements

python3.6 +
pytorch1.6.0
tensorboardX
ffmpeg
numpy

Usage

  1. Clone this repo

    https://github.com/GeWu-Lab/MUSIC-AVQA_CVPR2022.git
  2. Download data

    Annotations (QA pairs, etc.)

    • Available for download at here
    • The annotation files are stored in JSON format. Each annotation file contains seven different keyword. And more detail see in Project Homepage

    Features

    • We use VGGish, ResNet18, and ResNet (2+1)D to extract audio, 2D frame-level, and 3D snippet-level features, respectively.

    • The audio and visual features of videos in the MUSIC-AVQA dataset can be download from Baidu Drive (password: cvpr):

      • VGGish feature shape: [T, 128]  Download (112.7M)
      • ResNet18 feature shape: [T, 512]  Download (972.6M)
      • R(2+1)D feature shape: [T, 512]  Download (973.9M)
    • The features are in the ./data/feats folder.

    • 14x14 features, too large to share ... but we can extract from raw video frames.

    Download videos frames

    • Raw videos: Availabel at Baidu Drive (password: cvpr):.

      Note: Please move all downloaded videos to a folder, for example, create a new folder named MUSIC-AVQA-Videos, which contains 9,288 real videos and synthetic videos.

    • Raw video frames (1fps): Available at Baidu Drive (14.84GB) (password: cvpr).

    • Download raw videos in the MUSIC-AVQA dataset. The downloaded videos will be in the /data/video folder.

    • Pandas and ffmpeg libraries are required.

  3. Data pre-processing

    Extract audio waveforms from videos. The extracted audios will be in the ./data/audio folder. moviepy library is used to read videos and extract audios.

    python feat_script/extract_audio_cues/extract_audio.py	

    Extract video frames from videos. The extracted frames will be in the data/frames folder.

    python feat_script/extract_visual_frames/extract_frames_adaptive_script.py
  4. Feature extraction

    Audio feature. TensorFlow1.4 and VGGish pretrained on AudioSet is required. Feature file also can be found from here (password: cvpr).

    python feat_script/extract_audio_feat/audio_feature_extractor.py

    2D visual feature. Pretrained models library is required.

    python feat_script/eatract_visual_feat/extract_rgb_feat.py

    3D visual feature.

    python feat_script/eatract_visual_feat/extract_3d_feat.py

    14x14 visual feature.

    python feat_script/extract_visual_feat_14x14/extract_14x14_feat.py
  5. Baseline Model

    Training

    python net_grd_baseline/main_qa_grd_baseline.py --mode train

    Testing

    python net_grd_baseline/main_qa_grd_baseline.py --mode test
  6. Our Audio-Visual Spatial-Temporal Model

    We provide trained models and you can quickly test the results. Test results may vary slightly on different machines.

    python net_grd_avst/main_avst.py --mode train \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

    Audio-Visual grounding generation

    python grounding_gen/main_grd_gen.py

    Training

    python net_grd_avst/main_avst.py --mode train \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

    Testing

    python net_grd_avst/main_avst.py --mode test \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

Results

  1. Audio-visual video question answering results of different methods on the test set of MUSIC-AVQA. The top-2 results are highlighted. Please see the citations in the [Paper] for comparison methods.

  2. Visualized spatio-temporal grounding results

    We provide several visualized spatial grounding results. The heatmap indicates the location of sounding source. Through the spatial grounding results, the sounding objects are visually captured, which can facilitate the spatial reasoning.

    Firstly, ./grounding_gen/models_grd_vis/ should be created.

    python grounding_gen/main_grd_gen_vis.py

Citation

If you find this work useful, please consider citing it.


@ARTICLE{Li2022Learning,
  title	= {Learning to Answer Questions in Dynamic Audio-Visual Scenarios},
  author	= {Guangyao li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu},
  journal	= {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year	= {2022},
}

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.

License

This project is released under the GNU General Public License v3.0.

PianoPlayer - Automatic fingering generator for piano scores

PianoPlayer - Automatic fingering generator for piano scores

Marco Musy 571 Jan 02, 2023
pedalboard is a Python library for adding effects to audio.

pedalboard is a Python library for adding effects to audio. It supports a number of common audio effects out of the box, and also allows the use of VST3® and Audio Unit plugin formats for third-party

Spotify 3.9k Jan 02, 2023
A simple music player, powered by Python, utilising various libraries such as Tkinter and Pygame

A simple music player, powered by Python, utilising various libraries such as Tkinter and Pygame

PotentialCoding 2 May 12, 2022
digital audio workstation, instrument and effect plugins, wave editor

digital audio workstation, instrument and effect plugins, wave editor

306 Jan 05, 2023
A2DP agent for promiscuous/permissive audio sinc.

Promiscuous Bluetooth audio sinc A2DP agent for promiscuous/permissive audio sinc for Linux. Once installed, a Bluetooth client, such as a smart phone

Jasper Aorangi 4 May 27, 2022
AudioDVP:Photorealistic Audio-driven Video Portraits

AudioDVP This is the official implementation of Photorealistic Audio-driven Video Portraits. Major Requirements Ubuntu = 18.04 PyTorch = 1.2 GCC =

232 Jan 03, 2023
praudio provides audio preprocessing framework for Deep Learning audio applications

praudio provides objects and a script for performing complex preprocessing operations on entire audio datasets with one command.

Valerio Velardo 105 Dec 26, 2022
extract unpack asset file (form unreal engine 4 pak) with extenstion *.uexp which contain awb/acb (cri/cpk like) sound or music resource

Uexp2Awb extract unpack asset file (form unreal engine 4 pak) with extenstion .uexp which contain awb/acb (cri/cpk like) sound or music resource. i ju

max 6 Jun 22, 2022
This is a realtime voice translator program which gets input from user at any language and converts it to the desired language that the user asks

This is a realtime voice translator program which gets input from user at any language and converts it to the desired language that the user asks ...

Mohan Ram S 1 Dec 30, 2021
Reading list for research topics in sound event detection

Sound event detection aims at processing the continuous acoustic signal and converting it into symbolic descriptions of the corresponding sound events present at the auditory scene.

Soham 64 Jan 05, 2023
A useful tool to generate chord progressions according to melody MIDIs

Auto chord generator, pure python package that generate chord progressions according to given melodies

Billy Yi 53 Dec 30, 2022
Supysonic is a Python implementation of the Subsonic server API.

Supysonic Supysonic is a Python implementation of the Subsonic server API. Current supported features are: browsing (by folders or tags) streaming of

Alban 228 Nov 19, 2022
Open-Source Tools & Data for Music Source Separation: A Pragmatic Guide for the MIR Practitioner

Open-Source Tools & Data for Music Source Separation: A Pragmatic Guide for the MIR Practitioner

IELab@ Korea University 0 Nov 12, 2021
Basically Play Pauses the song when it is safe to do so. when you die in a round

Basically Play Pauses the song when it is safe to do so. when you die in a round

AG_1436 1 Feb 13, 2022
Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features

MediumVC MediumVC is an utterance-level method towards any-to-any VC. Before that, we propose SingleVC to perform A2O tasks(Xi → Ŷi) , Xi means utter

谷下雨 47 Dec 25, 2022
Oliva music bot help to play vc music

OLIVA V2 🎵 Requirements 📝 FFmpeg NodeJS nodesource.com Python 3.7+ PyTgCalls Commands 🛠 For all in group /play - reply to youtube url or song file

SOUL々H҉A҉C҉K҉E҉R҉ 2 Oct 22, 2021
🎵 A repository for manually annotating files to create labeled acoustic datasets for machine learning.

🎵 A repository for manually annotating files to create labeled acoustic datasets for machine learning.

Jim Schwoebel 28 Dec 22, 2022
Expressive Digital Signal Processing (DSP) package for Python

AudioLazy Development Last release PyPI status Real-Time Expressive Digital Signal Processing (DSP) Package for Python! Laziness and object representa

Danilo de Jesus da Silva Bellini 642 Dec 26, 2022
This is a python package that turns any images into MIDI files that views the same as them

image_to_midi This is a python package that turns any images into MIDI files that views the same as them. This package firstly convert the image to AS

Rainbow Dreamer 4 Mar 10, 2022
Analyze, visualize and process sound field data recorded by spherical microphone arrays.

Sound Field Analysis toolbox for Python The sound_field_analysis toolbox (short: sfa) is a Python port of the Sound Field Analysis Toolbox (SOFiA) too

Division of Applied Acoustics at Chalmers University of Technology 69 Nov 23, 2022