[AAAI2021] The source code for our paper "Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion".

Last update: Oct 16, 2022

Overview

DSM

Project Website;

The dataset lists, some visualizations, and the pretrained weights are being prepared.

1. Introduction (scene-dominated to motion-dominated)

Video datasets are usually scene-dominated. We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information.

The generated triplet is as below:
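This README does not spell out how the triplet is consumed during training. As a rough illustration only, the sketch below shows a generic margin-based triplet objective in NumPy. It assumes the positive clip keeps the anchor's motion and the negative clip does not. The function name, the margin value, and the use of squared Euclidean distance on normalized features are all assumptions, not the repository's actual loss.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Generic triplet hinge loss on (batch, dim) feature arrays.

    Illustrative only: the margin and the distance function are assumptions.
    """
    def normalize(x):
        # L2-normalize each row so distances depend only on direction.
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negative)
    d_pos = np.sum((a - p) ** 2, axis=1)  # squared distance anchor-positive
    d_neg = np.sum((a - n) ** 2, axis=1)  # squared distance anchor-negative
    # Penalize triplets where the positive is not closer than the negative by `margin`.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))
```

The loss is zero once every positive is closer to its anchor than the negative by at least the margin, so only "hard" triplets contribute.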

What does DSM learn?

With DSM pretraining, the model learns to focus on the motion region (not necessarily the actor), without a single label available.

2. Installation

Dataset

Please refer to dataset.md for details.

Requirements

Python 3
PyTorch 1.1+
PIL
Intel (on the fly decode)

3. Structure

datasets
- list
  - hmdb51: the train/val lists of HMDB51
  - ucf101: the train/val lists of UCF101
  - kinetics-400: the train/val lists of kinetics-400
  - diving48: the train/val lists of diving48
experiments
- logs: detailed experiment records
- gradientes: grad check
- visualization:
src
- data: load data
- loss: the losses evaluated in this paper
- model: network architectures
- scripts: train/eval scripts
- augment: detailed implementation of the spatio-temporal augmentation
- utils
- feature_extract.py: feature extractor for a given pretrained model
- main.py: the main entry point for finetuning
- trainer.py
- option.py
- pt.py: self-supervised pretrain
- ft.py: supervised finetune

DSM(Triplet)/DSM/Random

Self-supervised Pretrain

Kinetics

bash scripts/kinetics/pt.sh

UCF101

bash scripts/ucf101/pt.sh

Supervised Finetune (Clip-level)

HMDB51

bash scripts/hmdb51/ft.sh

UCF101

bash scripts/ucf101/ft.sh

Kinetics

bash scripts/kinetics/ft.sh

Video-level Evaluation

Following the common practice of TSN and Non-local, the final video-level result is the average over 10 temporal windows with corner crops, which leads to better results than clip-level evaluation. Refer to test.py for details.
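The averaging step can be sketched as follows. This is a minimal illustration, not the code in test.py: the function names are hypothetical, the windows are assumed to be evenly spaced over the video, and each corner crop is assumed to contribute one extra row of class scores.

```python
import numpy as np


def temporal_window_starts(num_frames, clip_len, num_windows=10):
    """Evenly spaced clip start indices that cover the whole video."""
    last_start = max(num_frames - clip_len, 0)
    return np.linspace(0, last_start, num_windows).round().astype(int).tolist()


def video_level_scores(clip_scores):
    """Average class scores over all clips (windows x crops) of one video.

    clip_scores has shape (num_clips, num_classes).
    """
    return np.asarray(clip_scores, dtype=float).mean(axis=0)


# Usage: two stand-in clips with scores for two classes.
starts = temporal_window_starts(num_frames=100, clip_len=10)
scores = video_level_scores([[0.2, 0.8], [0.6, 0.4]])
prediction = int(np.argmax(scores))
```

Averaging before taking the argmax lets confident clips outweigh ambiguous ones, which is one reason video-level numbers tend to exceed clip-level ones.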

Pretrain and Eval in One Step

bash scripts/hmdb51/pt_and_ft_hmdb51.sh

Note: more training options and ablation studies can be found in scripts.

Video Retrieval and Other Visualization

(1). Feature Extractor

As STCR can easily be extended to other video representation tasks, we offer a script to perform feature extraction.

python feature_extractor.py

The features will be saved as a single numpy file with shape [video_nums, features_dim] for further visualization.
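For readers who want to consume that file, the sketch below shows one way to write and read a 2-D feature array with NumPy. The helper names and the file name `features.npy` are hypothetical, and random values stand in for real features. The extractor script may store the array differently.

```python
import os
import tempfile

import numpy as np


def save_features(features, path):
    """Store per-video features as one 2-D array of shape (video_nums, features_dim)."""
    features = np.asarray(features, dtype=np.float32)
    if features.ndim != 2:
        raise ValueError("expected a 2-D array, got shape %s" % (features.shape,))
    np.save(path, features)


def load_features(path):
    """Load the array back; row i is the feature vector of video i."""
    return np.load(path)


# Usage with random stand-in features (5 videos, 8 dimensions).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.npy")  # hypothetical file name
    save_features(np.random.rand(5, 8), path)
    loaded = load_features(path)
    print(loaded.shape)  # (5, 8)
```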

(2). Retrieval Evaluation

Modify lines 60-62 in reterival.py.

python reterival.py
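The @k columns reported in the results are commonly computed as top-k retrieval accuracy: a query counts as a hit if any of its k nearest gallery videos shares its class. The sketch below illustrates that metric under stated assumptions (cosine similarity, a separate query and gallery set). The function name and arguments are hypothetical and are not taken from reterival.py.

```python
import numpy as np


def topk_retrieval_accuracy(query_feats, query_labels,
                            gallery_feats, gallery_labels,
                            ks=(1, 5, 10, 20, 50)):
    """Fraction of queries with a same-class video among the top-k neighbors."""
    q = np.asarray(query_feats, dtype=float)
    g = np.asarray(gallery_feats, dtype=float)
    q_labels = np.asarray(query_labels)
    g_labels = np.asarray(gallery_labels)

    # Cosine similarity between every query and every gallery feature.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    sim = q @ g.T

    # Gallery labels sorted from most to least similar, one row per query.
    order = np.argsort(-sim, axis=1)
    hits = g_labels[order] == q_labels[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```

The features loaded from the extractor output can be passed in directly, with labels taken from the corresponding dataset lists.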

Results

Action Recognition

UCF101 Pretrained (I3D)

Method	UCF101	HMDB51
Random Initialization	47.9	29.6
MoCo Baseline	62.3	36.5
DSM(Triplet)	70.7	48.5
DSM	74.8	52.5

Kinetics Pretrained

Video Retrieval (UCF101-C3D)

Method	@1	@5	@10	@20	@50
DSM	16.8	33.4	43.4	54.6	70.7

Video Retrieval (HMDB51-C3D)

Method	@1	@5	@10	@20	@50
DSM	8.2	25.9	38.1	52.0	75.0

More Visualization

Acknowledgement

This work is partly based on STN, UEL and MoCo.

License

Citation

If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry.

@inproceedings{wang2020enhancing,
  author    = {Wang, Jinpeng and Gao, Yuting and Li, Ke and Hu, Jianguo and Jiang, Xinyang and Guo, Xiaowei and Ji, Rongrong and Sun, Xing},
  title     = {Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion},
  booktitle = {AAAI},
  year      = {2021},
}

Owner

Jinpeng Wang
