A Joint Video and Image Encoder for End-to-End Retrieval

Last update: Dec 25, 2022


Frozen in Time ❄️ ⏳

A Joint Video and Image Encoder for End-to-End Retrieval

project page | arXiv | webvid-data

Repository containing the code, models, and data for end-to-end retrieval. WebVid data can be found at https://github.com/m-bain/webvid

📝 Preparation

1. Create the conda env: conda env create -f requirements/frozen.yml
2. Create the data and experiment folders: mkdir data; mkdir exps. Note that these can just be symlinks to wherever you want to store big data.

🔧 Finetuning (benchmarks: MSR-VTT)

1. Download and extract the data: wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip -P data; unzip data/MSRVTT.zip -d data
2. Change num_gpus in the config file accordingly.
3. Train: python train.py --config configs/msrvtt_4f_i21k.json
4. Test: python test.py --resume exps/models/{EXP_NAME}/{EXP_TIMESTAMP}/model_best.pth
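The --resume path above contains two placeholders, {EXP_NAME} and {EXP_TIMESTAMP}. As a minimal sketch (not part of the repository), the helper below looks up the newest run of an experiment that saved a model_best.pth. It assumes the timestamp directory names sort chronologically as plain strings; the function name latest_best_checkpoint is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_best_checkpoint(exp_name: str, root: str = "exps/models") -> Optional[Path]:
    """Return model_best.pth from the newest run of exp_name, or None if there is none."""
    exp_dir = Path(root) / exp_name
    if not exp_dir.is_dir():
        return None
    # Assumption: run directories are named by timestamp and sort chronologically as strings.
    run_dirs = sorted(p for p in exp_dir.iterdir() if p.is_dir())
    # Walk from newest to oldest and return the first run that saved a best model.
    for run_dir in reversed(run_dirs):
        candidate = run_dir / "model_best.pth"
        if candidate.is_file():
            return candidate
    return None
```

The returned path can then be passed to python test.py --resume.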

For finetuning a pretrained model, set "load_checkpoint": "PATH_TO_MODEL" in the config file.
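If you prefer not to edit the JSON by hand, a small script can write a finetuning copy of a config. This is only a sketch: the README names the "load_checkpoint" key but not where it sits inside the file, so the key path is passed in by the caller, and the path used in the usage note below is an assumption you should check against your config.

```python
import json
from pathlib import Path
from typing import Any, List


def set_config_value(config: dict, key_path: List[str], value: Any) -> dict:
    """Set config[k1][k2]...[kn] = value in place, creating missing intermediate dicts."""
    node = config
    for key in key_path[:-1]:
        node = node.setdefault(key, {})
    node[key_path[-1]] = value
    return config


def write_finetune_config(src: str, dst: str, checkpoint: str, key_path: List[str]) -> None:
    """Copy the JSON config at src to dst with the checkpoint path set at key_path."""
    config = json.loads(Path(src).read_text())
    set_config_value(config, key_path, checkpoint)
    Path(dst).write_text(json.dumps(config, indent=4))
```

For example, write_finetune_config("configs/msrvtt_4f_i21k.json", "configs/msrvtt_finetune.json", "PATH_TO_MODEL", ["arch", "args", "load_checkpoint"]) would write a new config, where both the output filename and the ["arch", "args", ...] nesting are hypothetical.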

🏋️ Pretraining

1. Download WebVid-2M (see https://github.com/m-bain/webvid)
2. Download CC-3M (see https://ai.google.com/research/ConceptualCaptions/download)
3. Train: python train.py --config CONFIG_PATH. Here are the different options:

a. Dataset combinations
```
 i. CC-3M + WebVid2M: configs/cc-webvid2m-pt-i2k.json
 ii. WebVid2M : configs/webvid2m-pt-i2k.json
```
You can add an arbitrary number of image/video datasets for pre-training by adding as many dataloaders to the config file's dataloader list as your heart desires. Adding more datasets will likely lead to higher downstream performance.

b. Number of frames

For image datasets, this should always be set to "video_params": {"num_frames": 1, ...}.

For video datasets, set this to whatever you want. N.B. More frames require more GPU memory.
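The rule above (one frame for image datasets, a free choice for video datasets) can be sketched as a helper that builds dataloader entries. Only "video_params" and "num_frames" come from this README; the "dataset_name" field, the dataset names, and the function names are hypothetical, and a real config entry needs more fields than shown here.

```python
from typing import Dict, List


def make_loader_entry(dataset_name: str, is_image: bool, video_frames: int) -> dict:
    """Build a partial dataloader entry; image datasets are always pinned to 1 frame."""
    num_frames = 1 if is_image else video_frames
    return {
        "dataset_name": dataset_name,  # hypothetical field name
        "video_params": {"num_frames": num_frames},
    }


def build_loader_list(datasets: Dict[str, bool], video_frames: int) -> List[dict]:
    """datasets maps a dataset name to True if it is an image dataset."""
    return [
        make_loader_entry(name, is_image, video_frames)
        for name, is_image in datasets.items()
    ]


# Example: one image dataset and one video dataset trained jointly with 4 video frames.
loaders = build_loader_list({"image_set": True, "video_set": False}, video_frames=4)
```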

If, like us, you are not a big company and have limited compute, then you will benefit from training via a curriculum on the number of frames. A lot of the knowledge can be learned in the 1-frame setting, as we show in the paper. You can then finetune with more frames. See the Curriculum Learning section.

c. Finetuning

Set "load_checkpoint": "FULL_MODEL_PATH" in the config file. You can now use different experiment params, such as num_frames, to do curriculum learning for example.

🗄 Pretrained Weights

CC-3M+WebVid-2M, 4-frames, base_patch_16_224

📚 Curriculum Learning on #frames

Curriculum learning on the number of frames in pretraining achieves similar performance with a significant reduction in compute (both memory and training time). This is because the model has higher throughput with fewer frames, and fewer frames also allow a bigger batch size for the same GPU memory.
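To make the cost difference concrete, the sketch below counts the patch tokens the visual encoder has to process per clip. It assumes non-overlapping 16x16 patches on 224x224 frames, as the name base_patch_16_224 suggests, and it ignores any extra tokens such as a class token. It only counts tokens; how memory and time grow with the token count depends on the attention scheme, which is not modelled here.

```python
def patch_tokens(num_frames: int, image_size: int = 224, patch_size: int = 16) -> int:
    """Number of patch tokens per clip, assuming non-overlapping square patches."""
    patches_per_side = image_size // patch_size
    return num_frames * patches_per_side ** 2


one_frame = patch_tokens(1)    # 14 * 14 = 196 tokens
four_frames = patch_tokens(4)  # 4 * 196 = 784 tokens
```

Under these assumptions a 4-frame clip carries four times as many tokens as a 1-frame clip, which is why the 1-frame setting leaves room for a larger batch.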

Our best model was trained on 1-frame then finetuned on 4-frames on CC+WebVid2M.

Train on 1-frame until the training loss converges, then finetune on 4-frames with the same config, starting from the 1-frame checkpoint by setting load_checkpoint in the config file. 4-frame finetuning needs far fewer iterations (~10% of the 1-frame setting is sufficient) since most of the knowledge is learned in the 1-frame setting.
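The two-stage recipe can be written down as a small plan. This is a sketch for bookkeeping only, not code from the repository: the Stage class and two_stage_curriculum function are hypothetical, and since stage 1 runs until the loss converges, its iteration count is something you fill in after the fact.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    num_frames: int
    iterations: int
    load_checkpoint: Optional[str]  # None means the stage does not resume from a checkpoint


def two_stage_curriculum(
    stage1_iterations: int,
    stage1_checkpoint: str,
    finetune_fraction: float = 0.1,  # "~10% of 1-frame setting", per the text above
    finetune_frames: int = 4,
) -> List[Stage]:
    """Plan a 1-frame stage followed by a shorter multi-frame finetuning stage."""
    stage2_iterations = max(1, round(stage1_iterations * finetune_fraction))
    return [
        Stage(num_frames=1, iterations=stage1_iterations, load_checkpoint=None),
        Stage(num_frames=finetune_frames, iterations=stage2_iterations,
              load_checkpoint=stage1_checkpoint),
    ]


# Example with a made-up iteration count and checkpoint path.
plan = two_stage_curriculum(100_000, "PATH_TO_1FRAME_MODEL")
```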

📈 Experiment Logging and Visualising

This repository uses a sacred backbone for logging and tracking experiments, with a neptune front end. It makes life a lot easier. If you want to activate this:

1. Create a neptune.ai account.
2. Create a project, copy your credentials into train.py, and remove the ValueError.
3. Set neptune: true in your config files.

🎓 Cite

If you use this code in your research, please cite:

@misc{bain2021frozen,
      title={Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval}, 
      author={Max Bain and Arsha Nagrani and Gül Varol and Andrew Zisserman},
      year={2021},
      eprint={2104.00650},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🙏 Acknowledgements

This code is based on the pytorch-template https://github.com/victoresque/pytorch-template

It also adopts many good practices from Samuel Albanie's https://github.com/albanie/collaborative-experts
