This is the official implementation of the paper "Token Shift Transformer for Video Classification".

Last update: Dec 30, 2022

TokShift-Transformer

This is the official implementation of the paper "Token Shift Transformer for Video Classification". We achieve state-of-the-art top-1 accuracy of 80.40% on the Kinetics-400 validation set. Paper link

Updates
Model Zoo and Baselines
Installation
Quick Start
Contributors
Citing
Acknowledgement

Updates

July 11, 2021

Released the V1 version (the version used in the paper) to the public.
We are preparing a V2 version, to be released within one week, which includes the following modifications:

Directly decode MP4 video files during training/evaluation.
Adopt the standardized timm code base.
Performance is further improved over the version reported in the paper (+0.5 on average).

April 22, 2021

Added train/test guidelines and data preparation instructions.

April 16, 2021

Published TokShift Transformer for video content understanding.

Model Zoo and Baselines

Architecture	Backbone	Pretrain	Resolution & Frames	GFLOPs x Views	Top-1 (%)	Config
ViT (Video)	Base16	ImageNet-21k	224 & 8	134.7 x 30	76.02 `link`	k400_vit_8x32_224.yml
TokShift	Base16	ImageNet-21k	224 & 8	134.7 x 30	77.28 `link`	k400_tokshift_div4_8x32_base_224.yml
TokShift (MR)	Base16	ImageNet-21k	256 & 8	175.8 x 30	77.68 `link`	k400_tokshift_div4_8x32_base_256.yml
TokShift (HR)	Base16	ImageNet-21k	384 & 8	394.7 x 30	78.14 `link`	k400_tokshift_div4_8x32_base_384.yml
TokShift	Base16	ImageNet-21k	224 & 16	268.5 x 30	78.18 `link`	k400_tokshift_div4_16x32_base_224.yml
TokShift-Large (HR)	Large16	ImageNet-21k	384 & 8	1397.6 x 30	79.83 `link`	k400_tokshift_div4_8x32_large_384.yml
TokShift-Large (HR)	Large16	ImageNet-21k	384 & 12	2096.4 x 30	80.40 `link`	k400_tokshift_div4_12x32_large_384.yml
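
The "GFLOPs x views" column pairs the cost of one view with the number of views evaluated per video, so the total inference cost per video is their product. A small sketch of that arithmetic, using the numbers from the table above (the dictionary and function names are illustrative only and do not come from this repository):

```python
# Per-view GFLOPs and number of views, copied from the table above.
MODEL_COST = {
    "ViT (Video) Base16, 224 & 8": (134.7, 30),
    "TokShift Base16, 224 & 8": (134.7, 30),
    "TokShift (MR) Base16, 256 & 8": (175.8, 30),
    "TokShift (HR) Base16, 384 & 8": (394.7, 30),
    "TokShift Base16, 224 & 16": (268.5, 30),
    "TokShift-Large (HR) Large16, 384 & 8": (1397.6, 30),
    "TokShift-Large (HR) Large16, 384 & 12": (2096.4, 30),
}


def total_gflops(per_view_gflops, views):
    """Total inference cost for one video: per-view cost times number of views."""
    return per_view_gflops * views


if __name__ == "__main__":
    for name, (gflops, views) in MODEL_COST.items():
        print(f"{name}: {total_gflops(gflops, views):.1f} GFLOPs per video")
```

Note that the table lists the same per-view cost for ViT (Video) and TokShift at 224 & 8, so the accuracy gap between those two rows is not explained by extra compute.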

Below is the training log. During validation we use 3-view evaluation (instead of 30 views) to save time.
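
Multi-view evaluation scores several views of the same video and combines them into one prediction. The sketch below assumes the per-view class scores are simply averaged before taking the arg-max; the README does not state the exact aggregation rule, and `average_views` and `predict` are hypothetical helpers, not functions from this repository.

```python
def average_views(view_scores):
    """Average per-view class scores into one video-level score vector.

    view_scores: list of views, each a list of per-class scores (same length).
    """
    if not view_scores:
        raise ValueError("need at least one view")
    num_classes = len(view_scores[0])
    if any(len(v) != num_classes for v in view_scores):
        raise ValueError("all views must score the same number of classes")
    num_views = len(view_scores)
    return [sum(v[c] for v in view_scores) / num_views for c in range(num_classes)]


def predict(view_scores):
    """Return the index of the class with the highest averaged score."""
    avg = average_views(view_scores)
    return max(range(len(avg)), key=avg.__getitem__)


if __name__ == "__main__":
    # Three views of one video, three classes (toy numbers, assumed).
    scores = [
        [0.2, 0.5, 0.3],
        [0.1, 0.3, 0.6],
        [0.2, 0.6, 0.2],
    ]
    print(predict(scores))
```

Under this scheme, 3-view evaluation needs a tenth of the forward passes of 30-view evaluation.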

Installation

PyTorch >= 1.7, torchvision
tensorboardX
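
To confirm that an environment satisfies the requirements above, a short check along these lines may help. This is a sketch only: the helper names are hypothetical, the version parsing is deliberately simple, and the package names are assumed to be the ones published on PyPI (`torch`, `torchvision`, `tensorboardX`).

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.13.1+cu117' into (1, 13, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version string is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_requirements():
    """Report which of the listed packages are installed and new enough."""
    # Only torch has a stated minimum version in the list above.
    required = {"torch": "1.7", "torchvision": None, "tensorboardX": None}
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = minimum is None or meets_minimum(installed, minimum)
        report[name] = f"{installed} ({'ok' if ok else 'too old, need >= ' + minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```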

Quick Start

Train

Download the ImageNet-21k (also called ImageNet-22k) pretrained weights for Base16 and Large16.
Prepare the Kinetics-400 dataset in the following structure; the trainValTest folder holds the train.txt and val.txt lists.

k400
|_ frames331_train
|  |_ [category name 0]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |_ ...
|  |
|  |_ [category name 1]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |_ ...
|  |_ ...
|
|_ frames331_val
|  |_ [category name 0]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |_ ...
|  |
|  |_ [category name 1]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |_ ...
|  |_ ...
|
|_ trainValTest
   |_ train.txt
   |_ val.txt
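
Before training, it can be worth checking that a local copy matches the layout above. The following is a minimal sketch, assuming frames are named `img_` plus a five-digit index as shown in the tree; `list_frames`, `summarize_split`, and `check_dataset` are hypothetical helpers, not part of this repository, and the format of train.txt and val.txt is not checked because the README does not describe it.

```python
import os
import re

# Frame names as shown in the tree above: img_00001.jpg, img_00002.jpg, ...
FRAME_PATTERN = re.compile(r"^img_(\d{5})\.jpg$")


def list_frames(video_dir):
    """Return the frame file names in a video folder, sorted by frame index."""
    frames = [f for f in os.listdir(video_dir) if FRAME_PATTERN.match(f)]
    return sorted(frames)  # zero-padded names sort in frame order


def summarize_split(split_dir):
    """Map 'category/video' to its frame count for one split folder."""
    summary = {}
    for category in sorted(os.listdir(split_dir)):
        category_dir = os.path.join(split_dir, category)
        if not os.path.isdir(category_dir):
            continue
        for video in sorted(os.listdir(category_dir)):
            video_dir = os.path.join(category_dir, video)
            if os.path.isdir(video_dir):
                summary[f"{category}/{video}"] = len(list_frames(video_dir))
    return summary


def check_dataset(root):
    """Check the top-level folders and list videos that contain no frames."""
    problems = []
    for split in ("frames331_train", "frames331_val"):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            problems.append(f"missing folder: {split}")
            continue
        for video, count in summarize_split(split_dir).items():
            if count == 0:
                problems.append(f"no frames in {split}/{video}")
    for name in ("train.txt", "val.txt"):
        if not os.path.isfile(os.path.join(root, "trainValTest", name)):
            problems.append(f"missing file: trainValTest/{name}")
    return problems


if __name__ == "__main__":
    for problem in check_dataset("k400"):
        print(problem)
```

An empty result from `check_dataset` means the folders, list files, and at least one frame per video were found; it does not validate the contents of the list files.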

Use the training script (train.sh) to train on Kinetics-400:

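#!/usr/bin/env python
# train.sh: despite the .sh name, this is a Python launcher script.
import os

# Fine-tune TokShift-Large (HR, 384 & 12) on Kinetics-400, starting from the
# Large16 pretrained weights passed via --tune_from.
cmd = "python -u main_ddp_shift_v3.py \
        --multiprocessing-distributed --world-size 1 --rank 0 \
        --dist-ur tcp://127.0.0.1:23677 \
        --tune_from pretrain/ViT-L_16_Img21.npz \
        --cfg config/custom/kinetics400/k400_tokshift_div4_12x32_large_384.yml"
os.system(cmd)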

Test

Use the test script (test.sh) to evaluate on Kinetics-400:

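#!/usr/bin/env python
# test.sh: despite the .sh name, this is a Python launcher script.
import os

# Evaluate the ViT (Video) Base16 baseline (224 & 8) on Kinetics-400, loading
# the trained checkpoint passed via --resume.
cmd = "python -u main_ddp_shift_v3.py \
        --multiprocessing-distributed --world-size 1 --rank 0 \
        --dist-ur tcp://127.0.0.1:23677 \
        --evaluate \
        --resume model_zoo/ViT-B_16_k400_dense_cls400_segs8x32_e18_lr0.1_B21_VAL224/best_vit_B8x32x224_k400.pth \
        --cfg config/custom/kinetics400/k400_vit_8x32_224.yml"
os.system(cmd)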
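
The train and test scripts differ only in a few arguments. A minimal sketch of a launcher that builds either command as an argument list, using only the flags that appear in the two scripts above; `build_command` and its parameters are hypothetical and not part of this repository.

```python
import subprocess


def build_command(cfg, tune_from=None, resume=None, evaluate=False,
                  dist_url="tcp://127.0.0.1:23677"):
    """Build the main_ddp_shift_v3.py command line as a list of arguments."""
    cmd = [
        "python", "-u", "main_ddp_shift_v3.py",
        "--multiprocessing-distributed", "--world-size", "1", "--rank", "0",
        "--dist-ur", dist_url,  # flag spelled as in the scripts above
    ]
    if evaluate:
        cmd.append("--evaluate")
    if tune_from:
        cmd += ["--tune_from", tune_from]  # pretrained weights for fine-tuning
    if resume:
        cmd += ["--resume", resume]  # checkpoint to load
    cmd += ["--cfg", cfg]
    return cmd


if __name__ == "__main__":
    train_cmd = build_command(
        "config/custom/kinetics400/k400_tokshift_div4_12x32_large_384.yml",
        tune_from="pretrain/ViT-L_16_Img21.npz",
    )
    print(" ".join(train_cmd))
    # Uncomment to launch; check=True raises if the process exits with an error.
    # subprocess.run(train_cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with long checkpoint paths, and `check=True` surfaces a failed run instead of silently continuing.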

Contributors

VideoNet is written and maintained by Dr. Hao Zhang and Dr. Yanbin Hao.

Citing

If you find TokShift-xfmr useful in your research, please cite it with the following BibTeX entry.

@inproceedings{tokshift2021,
  title={Token Shift Transformer for Video Classification},
  author={Hao Zhang and Yanbin Hao and Chong-Wah Ngo},
  booktitle={ACM Multimedia 2021},
  year={2021}
}

Acknowledgement

Thanks to the following GitHub projects:
