EsViT: Efficient self-supervised Vision Transformers

Overview

Efficient Self-Supervised Vision Transformers (EsViT)

PWC

PyTorch implementation for EsViT, built with two techniques:

  • A multi-stage Transformer architecture. Three multi-stage Transformer variants are implemented under the folder models.
  • A region-level matching pre-train task. The region-level matching task is implemented in function DDINOLoss(nn.Module) (Line 648) in main_esvit.py. Please use --use_dense_prediction True, otherwise only the view-level task is used.
Efficiency vs accuracy comparison under the linear classification protocol on ImageNet with EsViT
Figure: Efficiency vs accuracy comparison under the linear classification protocol on ImageNet. Left: Throughput of all SoTA SSL vision systems, circle sizes indicates model parameter counts; Right: performance over varied parameter counts for models with moderate (throughout/#parameters) ratio. Please refer Section 4.1 for details.

Pretrained models

You can download the full checkpoint (trained with both view-level and region-level tasks, batch size=512 and ImageNet-1K.), which contains backbone and projection head weights for both student and teacher networks.

arch params linear k-nn download logs
EsViT (Swin-T, W=7) 28M 78.0% 75.7% full ckpt train linear knn
EsViT (Swin-S, W=7) 49M 79.5% 77.7% full ckpt train linear knn
EsViT (Swin-B, W=7) 87M 80.4% 78.9% full ckpt train linear knn
EsViT (Swin-T, W=14) 28M 78.7% 77.0% full ckpt train linear knn
EsViT (Swin-S, W=14) 49M 80.8% 79.1% full ckpt train linear knn
EsViT (Swin-B, W=14) 87M 81.3% 79.3% full ckpt train linear knn

EsViT (Swin-T, W=7) with different pre-train datasets (view-level task only)

arch params batch size pre-train dataset linear k-nn download logs
EsViT 28M 512 ImageNet-1K 77.0% 74.2% full ckpt train linear knn
EsViT 28M 1024 ImageNet-1K 77.1% 73.7% full ckpt train linear knn
EsViT 28M 1024 WebVision-v1 75.4% 69.4% full ckpt train linear knn
EsViT 28M 1024 OpenImages-v4 69.6% 60.3% full ckpt train linear knn
EsViT 28M 1024 ImageNet-22K 73.5% 66.1% full ckpt train linear knn

Pre-training

One-node training

To train on 1 node with 16 GPUs for Swin-T model size:

PROJ_PATH=your_esvit_project_path
DATA_PATH=$PROJ_PATH/project/data/imagenet

OUT_PATH=$PROJ_PATH/output/esvit_exp/ssl/swin_tiny_imagenet/
python -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch swin_tiny --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml 

The main training script is main_esvit.py and conducts the training loop, taking the following options (among others) as arguments:

  • --use_dense_prediction: whether or not to use the region matching task in pre-training
  • --arch: switch between different sparse self-attention in the multi-stage Transformer architecture. Example architecture choices for EsViT training include [swin_tiny, swin_small, swin_base, swin_large,cvt_tiny, vil_2262]. The configuration files should be adjusted accrodingly, we provide example below. One may specify the network configuration by editing the YAML file under experiments/imagenet/*/*.yaml. The default window size=7; To consider a multi-stage architecture with window size=14, please choose yaml files with window14 in filenames.

To train on 1 node with 16 GPUs for Convolutional vision Transformer (CvT) models:

python -m torch.distributed.launch --nproc_per_node=16 main_evsit.py --arch cvt_tiny --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --aug-opt dino_aug --cfg experiments/imagenet/cvt_v4/s1.yaml

To train on 1 node with 16 GPUs for Vision Longformer (ViL) models:

python -m torch.distributed.launch --nproc_per_node=16 main_evsit.py --arch vil_2262 --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --aug-opt dino_aug --cfg experiments/imagenet/vil/vil_small/base.yaml MODEL.SPEC.MSVIT.ARCH 'l1,h3,d96,n2,s1,g1,p4,f7,a0_l2,h6,d192,n2,s1,g1,p2,f7,a0_l3,h12,d384,n6,s0,g1,p2,f7,a0_l4,h24,d768,n2,s0,g0,p2,f7,a0' MODEL.SPEC.MSVIT.MODE 1 MODEL.SPEC.MSVIT.VIL_MODE_SWITCH 0.75

Multi-node training

To train on 2 nodes with 16 GPUs each (total 32 GPUs) for Swin-Small model size:

OUT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_small/bl_lr0.0005_gpu16_bs16_multicrop_epoch300_dino_aug_window14
python main_evsit_mnodes.py --num_nodes 2 --num_gpus_per_node 16 --data_path $DATA_PATH/train --output_dir $OUT_PATH/continued_from0200_dense --batch_size_per_gpu 16 --arch swin_small --zip_mode True --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --cfg experiments/imagenet/swin/swin_small_patch4_window14_224.yaml --use_dense_prediction True --pretrained_weights_ckpt $OUT_PATH/checkpoint0200.pth

Evaluation:

k-NN and Linear classification on ImageNet

To train a supervised linear classifier on frozen weights on a single node with 4 gpus, run eval_linear.py. To train a k-NN classifier on frozen weights on a single node with 4 gpus, run eval_knn.py. Please specify --arch, --cfg and --pretrained_weights to choose a pre-trained checkpoint. If you want to evaluate the last checkpoint of EsViT with Swin-T, you can run for example:

PROJ_PATH=your_esvit_project_path
DATA_PATH=$PROJ_PATH/project/data/imagenet

OUT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300
CKPT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/checkpoint.pth

python -m torch.distributed.launch --nproc_per_node=4 eval_linear.py --data_path $DATA_PATH --output_dir $OUT_PATH/lincls/epoch0300 --pretrained_weights $CKPT_PATH --checkpoint_key teacher --batch_size_per_gpu 256 --arch swin_tiny --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --n_last_blocks 4 --num_labels 1000 MODEL.NUM_CLASSES 0

python -m torch.distributed.launch --nproc_per_node=4 eval_knn.py --data_path $DATA_PATH --dump_features $OUT_PATH/features/epoch0300 --pretrained_weights $CKPT_PATH --checkpoint_key teacher --batch_size_per_gpu 256 --arch swin_tiny --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml MODEL.NUM_CLASSES 0

Analysis/Visualization of correspondence and attention maps

You can analyze the learned models by running python run_analysis.py. One example to analyze EsViT (Swin-T) is shown.

For an invidiual image (with path --image_path $IMG_PATH), we visualize the attention maps and correspondence of the last layer:

python run_analysis.py --arch swin_tiny --image_path $IMG_PATH --output_dir $OUT_PATH --pretrained_weights $CKPT_PATH --learning ssl --seed $SEED --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --vis_attention True --vis_correspondence True MODEL.NUM_CLASSES 0 

For an image dataset (with path --data_path $DATA_PATH), we quantatively measure the correspondence:

python run_analysis.py --arch swin_tiny --data_path $DATA_PATH --output_dir $OUT_PATH --pretrained_weights $CKPT_PATH --learning ssl --seed $SEED --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml  --measure_correspondence True MODEL.NUM_CLASSES 0 

For more examples, please see scripts/scripts_local/run_analysis.sh.

Citation

If you find this repository useful, please consider giving a star ⭐ and citation đŸș :

@article{li2021esvit,
  title={Efficient Self-supervised Vision Transformers for Representation Learning},
  author={Li, Chunyuan and Yang, Jianwei and Zhang, Pengchuan and Gao, Mei and Xiao, Bin and Dai, Xiyang and Yuan, Lu and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2106.09785},
  year={2021}
}

Related Projects/Codebase

[Swin Transformers] [Vision Longformer] [Convolutional vision Transformers (CvT)] [Focal Transformers]

Acknowledgement

Our implementation is built partly upon packages: [Dino] [Timm]

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Patch2Pix: Epipolar-Guided Pixel-Level Correspondences [CVPR2021]

Patch2Pix for Accurate Image Correspondence Estimation This repository contains the Pytorch implementation of our paper accepted at CVPR2021: Patch2Pi

Qunjie Zhou 199 Nov 29, 2022
A 2D Visual Localization Framework based on Essential Matrices [ICRA2020]

A 2D Visual Localization Framework based on Essential Matrices This repository provides implementation of our paper accepted at ICRA: To Learn or Not

Qunjie Zhou 27 Nov 07, 2022
Pytorch port of Google Research's LEAF Audio paper

leaf-audio-pytorch Pytorch port of Google Research's LEAF Audio paper published at ICLR 2021. This port is not completely finished, but the Leaf() fro

Dennis Fedorishin 80 Oct 31, 2022
A curated list of awesome papers for Semantic Retrieval (TOIS Accepted: Semantic Models for the First-stage Retrieval: A Comprehensive Review).

A curated list of awesome papers for Semantic Retrieval (TOIS Accepted: Semantic Models for the First-stage Retrieval: A Comprehensive Review).

Yinqiong Cai 189 Dec 28, 2022
[ICLR 2021] Is Attention Better Than Matrix Decomposition?

Enjoy-Hamburger 🍔 Official implementation of Hamburger, Is Attention Better Than Matrix Decomposition? (ICLR 2021) Under construction. Introduction T

Gsunshine 271 Dec 29, 2022
Pip-package for trajectory benchmarking from "Be your own Benchmark: No-Reference Trajectory Metric on Registered Point Clouds", ECMR'21

Map Metrics for Trajectory Quality Map metrics toolkit provides a set of metrics to quantitatively evaluate trajectory quality via estimating consiste

Mobile Robotics Lab. at Skoltech 31 Oct 28, 2022
PyTorch trainer and model for Sequence Classification

PyTorch-trainer-and-model-for-Sequence-Classification After cloning the repository, modify your training data so that the training data is a .csv file

NhanTieu 2 Dec 09, 2022
git《Commonsense Knowledge Base Completion with Structural and Semantic Context》(AAAI 2020) GitHub: [fig1]

Commonsense Knowledge Base Completion with Structural and Semantic Context Code for the paper Commonsense Knowledge Base Completion with Structural an

AI2 96 Nov 05, 2022
SimulLR - PyTorch Implementation of SimulLR

PyTorch Implementation of SimulLR There is an interesting work[1] about simultan

11 Dec 22, 2022
Pytorch implementation of the popular Improv RNN model originally proposed by the Magenta team.

Pytorch Implementation of Improv RNN Overview This code is a pytorch implementation of the popular Improv RNN model originally implemented by the Mage

Sebastian Murgul 3 Nov 11, 2022
Genetic Programming in Python, with a scikit-learn inspired API

Welcome to gplearn! gplearn implements Genetic Programming in Python, with a scikit-learn inspired and compatible API. While Genetic Programming (GP)

Trevor Stephens 1.3k Jan 03, 2023
Rethinking of Pedestrian Attribute Recognition: A Reliable Evaluation under Zero-Shot Pedestrian Identity Setting

Pytorch Pedestrian Attribute Recognition: A strong PyTorch baseline of pedestrian attribute recognition and multi-label classification.

Jian 79 Dec 18, 2022
This is a five-step framework for the development of intrusion detection systems (IDS) using machine learning (ML) considering model realization, and performance evaluation.

AB-TRAP: building invisibility shields to protect network devices The AB-TRAP framework is applicable to the development of Network Intrusion Detectio

Lab-C2DC - Laboratory of Command and Control and Cyber-security 17 Jan 04, 2023
[ICLR2021] Unlearnable Examples: Making Personal Data Unexploitable

Unlearnable Examples Code for ICLR2021 Spotlight Paper "Unlearnable Examples: Making Personal Data Unexploitable " by Hanxun Huang, Xingjun Ma, Sarah

Hanxun Huang 98 Dec 07, 2022
BanditPAM: Almost Linear-Time k-Medoids Clustering

BanditPAM: Almost Linear-Time k-Medoids Clustering This repo contains a high-performance implementation of BanditPAM from BanditPAM: Almost Linear-Tim

254 Dec 12, 2022
PyStan, a Python interface to Stan, a platform for statistical modeling. Documentation: https://pystan.readthedocs.io

PyStan NOTE: This documentation describes a BETA release of PyStan 3. PyStan is a Python interface to Stan, a package for Bayesian inference. StanÂź is

Stan 229 Dec 29, 2022
Python library containing BART query generation and BERT-based Siamese models for neural retrieval.

Neural Retrieval Embedding-based Zero-shot Retrieval through Query Generation leverages query synthesis over large corpuses of unlabeled text (such as

Amazon Web Services - Labs 35 Apr 14, 2022
Code and dataset for ACL2018 paper "Exploiting Document Knowledge for Aspect-level Sentiment Classification"

Aspect-level Sentiment Classification Code and dataset for ACL2018 [paper] ‘‘Exploiting Document Knowledge for Aspect-level Sentiment Classification’’

Ruidan He 146 Nov 29, 2022
Source code release of the paper: Knowledge-Guided Deep Fractal Neural Networks for Human Pose Estimation.

GNet-pose Project Page: http://guanghan.info/projects/guided-fractal/ UPDATE 9/27/2018: Prototxts and model that achieved 93.9Pck on LSP dataset. http

Guanghan Ning 83 Nov 21, 2022
Download and preprocess popular sequential recommendation datasets

Sequential Recommendation Datasets This repository collects some commonly used sequential recommendation datasets in recent research papers and provid

125 Dec 06, 2022