Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

Last update: Jan 02, 2023


AudioCLIP

Extending CLIP to Image, Text and Audio

This repository contains an implementation of the models described in the paper arXiv:2106.13043. This work is based on our previous works:

ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio (2021).
ESResNet: Environmental Sound Classification Based on Visual Domain Models (2020).

Abstract

In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models.

In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion.

AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Furthermore, it sets new baselines in the zero-shot ESC task on the same datasets (68.78% and 69.40%, respectively).

Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
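The zero-shot inference mentioned in the abstract can be pictured with a small sketch. It assumes the usual CLIP-style procedure: an audio embedding is compared against the text embeddings of candidate labels, and the label with the highest cosine similarity wins. The three-dimensional vectors and the helper names `cosine` and `zero_shot_label` are made up for illustration; they do not come from this repository or from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(audio_embedding, text_embeddings):
    """Return the label whose text embedding is closest to the audio embedding."""
    return max(
        text_embeddings,
        key=lambda label: cosine(audio_embedding, text_embeddings[label]),
    )


# Toy embeddings, invented for this example only (real embeddings are much larger).
text_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.0, 0.8, 0.6],
    "rain": [0.1, 0.0, 1.0],
}
audio_embedding = [0.1, 0.7, 0.7]

predicted = zero_shot_label(audio_embedding, text_embeddings)
print(predicted)
```

No label-specific training is involved in this step, which is why the set of candidate labels can be swapped for those of an unseen dataset.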

Downloading Pre-Trained Weights

The pre-trained model can be downloaded from the releases.

# AudioCLIP trained on AudioSet (text-, image- and audio-head simultaneously)
wget https://github.com/AndreyGuzhov/AudioCLIP/releases/download/v0.1/AudioCLIP-Full-Training.pt
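If `wget` is not available, the same download could be done from Python. The following is a minimal sketch using only the standard library; the helper names `release_url` and `download` are hypothetical, and it assumes that the other files of the v0.1 release follow the same URL pattern as the one shown above.

```python
import urllib.request
from pathlib import Path

# Base URL taken from the wget command above.
RELEASE_BASE = "https://github.com/AndreyGuzhov/AudioCLIP/releases/download/v0.1"


def release_url(filename):
    """Build the download URL for a file of the v0.1 release (assumed pattern)."""
    return f"{RELEASE_BASE}/{filename}"


def download(filename, target_dir="."):
    """Download a release file unless it is already present; return its path."""
    target = Path(target_dir) / filename
    if not target.exists():
        urllib.request.urlretrieve(release_url(filename), str(target))
    return target


# Example call (commented out because the file is large):
# download("AudioCLIP-Full-Training.pt")
```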

How to Run the Model

The required Python version is >= 3.7.

AudioCLIP

On the ESC-50 dataset

python main.py --config protocols/audioclip-esc50.json --Dataset.args.root /path/to/ESC50

On the UrbanSound8K dataset

python main.py --config protocols/audioclip-us8k.json --Dataset.args.root /path/to/UrbanSound8K
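The two commands above differ only in the configuration file and the dataset root, so they can be wrapped in a small launcher. This is a sketch under stated assumptions: the helper `build_command` and the short dataset keys `esc50` and `us8k` are hypothetical, and only the `--config` and `--Dataset.args.root` options shown above are used.

```python
import sys

# Configuration files as named in the commands above.
CONFIGS = {
    "esc50": "protocols/audioclip-esc50.json",
    "us8k": "protocols/audioclip-us8k.json",
}


def build_command(dataset, root):
    """Assemble the main.py invocation for one of the two datasets."""
    if sys.version_info < (3, 7):
        raise RuntimeError("Python >= 3.7 is required")
    return [
        sys.executable,
        "main.py",
        "--config",
        CONFIGS[dataset],
        "--Dataset.args.root",
        root,
    ]


command = build_command("esc50", "/path/to/ESC50")
print(" ".join(command))

# To actually start the run from the repository root:
# import subprocess
# subprocess.run(command, check=True)
```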

Cite Us

@misc{guzhov2021audioclip,
      title={AudioCLIP: Extending CLIP to Image, Text and Audio}, 
      author={Andrey Guzhov and Federico Raue and Jörn Hees and Andreas Dengel},
      year={2021},
      eprint={2106.13043},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}


Comments

Make project usable by other Python projects: remove Git LFS and move files into an audioclip folder

Git LFS was causing problems, so I removed all asset files from it. The files can be found in the "Release" anyway.

It was also somewhat problematic to use this project from other projects because of the missing folder structure. I moved all files into an "audioclip" folder to fix Python import paths for external projects.

I renamed master to main, but I doubt that this change is going to stay once this pull request is merged.

opened by NotNANtoN

Releases (v0.1)

v0.1 (Jun 29, 2021)
Text embeddings' vocabulary and PyTorch state_dicts containing the weights of the AudioCLIP model trained on AudioSet:

bpe_simple_vocab_16e6.txt.gz – CLIP's vocabulary (origin)

CLIP.pt – vanilla CLIP (text Transformer & ResNet-50 image-head, origin)

ESRNXFBSP.pt – ESResNeXt trained on AudioSet (standalone)

AudioCLIP trained on AudioSet (+ video frames)

AudioCLIP-Full-Training.pt – training of all three heads (text, image and audio)

AudioCLIP-Partial-Training.pt – training of the audio-head only

AudioCLIP-Full-Training.pt (512.41 MB)
AudioCLIP-Partial-Training.pt (512.41 MB)
bpe_simple_vocab_16e6.txt.gz (1.29 MB)
CLIP.pt (389.49 MB)
ESRNXFBSP.pt (119.01 MB)
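Since the `.pt` files are described as PyTorch state_dicts, a downloaded checkpoint can be inspected before it is loaded into a model. The sketch below assumes PyTorch is installed and that the file is a plain mapping from parameter names to tensors; the helper names and the example key names are hypothetical, as the actual parameter names are not listed here.

```python
def load_state_dict(path):
    """Load a checkpoint onto the CPU (requires PyTorch)."""
    import torch  # imported lazily so the helper below also works without PyTorch

    return torch.load(path, map_location="cpu")


def top_level_prefixes(state_dict):
    """Count entries per top-level prefix of the parameter names."""
    counts = {}
    for key in state_dict:
        prefix = key.split(".", 1)[0]
        counts[prefix] = counts.get(prefix, 0) + 1
    return counts


# Stand-in mapping with invented key names, used instead of a real checkpoint.
example = {"audio.conv1.weight": None, "audio.conv1.bias": None, "visual.fc.weight": None}
summary = top_level_prefixes(example)
print(summary)

# With a downloaded file:
# summary = top_level_prefixes(load_state_dict("AudioCLIP-Full-Training.pt"))
```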
