Localizing-Visual-Sounds-the-Hard-Way

Code and Dataset for "Localizing Visual Sounds the Hard Way".

The repo contains code and our pre-trained model.

Environment

Python 3.6.8
PyTorch 1.3.0

Flickr-SoundNet

We provide the pretrained model here.

To test the model, download the test data and ground truth from Learning to Localize Sound Source.

Then run:

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --gt_path "path to ground truth" --testset "flickr"

VGG-Sound Source

We provide the pretrained model here.

To test the model, run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --testset "vggss"
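
When running both test sets from a script, the two commands above can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_test_command` is a hypothetical helper, and it uses only the flags shown in the commands above. Whether `--gt_path` is also accepted for VGG-SS is not stated here, so the sketch leaves it optional.

```python
import sys


def build_test_command(data_path, summaries_dir, testset, gt_path=None):
    """Build the argument list for test.py (hypothetical helper).

    Only the flags that appear in the README commands are used.
    """
    if testset not in ("flickr", "vggss"):
        raise ValueError("testset must be 'flickr' or 'vggss'")
    cmd = [
        sys.executable, "test.py",
        "--data_path", data_path,
        "--summaries_dir", summaries_dir,
        "--testset", testset,
    ]
    # The Flickr command in the README passes a ground-truth path explicitly.
    if gt_path is not None:
        cmd += ["--gt_path", gt_path]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.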

(Note: some ground-truth bounding boxes were updated recently, so all results on VGG-SS differ by 2~3% in IoU.)
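
For reference, IoU (intersection over union) is the area of overlap between a prediction and the ground truth divided by the area of their union. The sketch below illustrates the metric for two axis-aligned boxes. The function name and the `(x_min, y_min, x_max, y_max)` box format are assumptions made for this example, and the repository's own evaluation code may compute IoU differently.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Because the union is the denominator, even a small shift in a ground-truth box changes the score, which is why updated annotations move reported numbers.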

Both test sets should be placed in the following structure.

data path
│
└───frames
│   │   image001.jpg
│   │   image002.jpg
│   │
└───audio
    │   audio011.wav
    │   audio012.wav
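
Before running the tests, it can help to confirm that a data path matches the layout above. This is a minimal sketch under stated assumptions: `summarize_data_dir` is a hypothetical helper, not part of the repository, and it only checks that the `frames` and `audio` folders exist and lists their `.jpg` and `.wav` files. It does not assume any naming rule linking a frame to its audio clip.

```python
from pathlib import Path


def summarize_data_dir(data_path):
    """List the .jpg files in frames/ and the .wav files in audio/ (hypothetical helper)."""
    root = Path(data_path)
    summary = {}
    for sub, ext in (("frames", ".jpg"), ("audio", ".wav")):
        folder = root / sub
        if not folder.is_dir():
            raise FileNotFoundError(f"missing expected folder: {folder}")
        # Keep only files with the extension shown in the README layout.
        summary[sub] = sorted(
            p.name for p in folder.iterdir() if p.suffix.lower() == ext
        )
    return summary
```

For example, `summarize_data_dir("path/to/data")` returns a dictionary with one sorted file list per folder, or raises `FileNotFoundError` if either folder is missing.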

Citation

@InProceedings{Chen21,
              title        = "Localizing Visual Sounds the Hard Way",
              author       = "Honglie Chen and Weidi Xie and Triantafyllos Afouras and Arsha Nagrani and Andrea Vedaldi and Andrew Zisserman",
              booktitle    = "CVPR",
              year         = "2021"}
