A curated list and survey of awesome Vision Transformers.

Overview
awesome-vit

English | 简体中文

A curated list and survey of awesome Vision Transformers.

You can use mind mapping software to open the mind mapping source file. You can also download the mind mapping HD pictures if you just want to browse them.

Contents

Survey

Only typical algorithms are listed in each category.

Image Classification

Chinese Blogs

Attention-based

image

Training Strategy

image

  • [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
  • [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]
Model Improvements
Tokenization Module

image

Image to Token:

  • Non-overlapping Patch Embedding

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Overlapping Patch Embedding

    • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

    • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

    • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]

Token to Token:

  • Fixed sampling window tokenization
    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Dynamic sampling tokenization
    • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
    • [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]
Position Encoding Module

image

Explicit position encoding:

  • Absolute position encoding
    • [Transformer] Attention is All You Need] (NIPS 2017-2017.06) [Paper]
    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
  • Relative position encoding
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

Implicit position encoding:

  • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
  • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
  • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
  • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
Attention Module

image

Include only global attention:

  • Multi-Head attention module

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
  • Reduce global attention computation

    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]

    • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]

    • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]

    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

    • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • Generalized linear attention

    • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

Introduce extra local attention:

  • Local window mode

    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
    • [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]
    • [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]
    • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]
    • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
  • Introduce convolutional local inductive bias

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
    • [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]
  • Sparse attention

    • [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]
FFN Module

image

Improve performance with Conv's local information extraction capability:

  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]
  • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
Normalization Module Location

image

  • Pre Normalization

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Post Normalization

    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
Classification Prediction Head Module

image

  • Class Tokens

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
  • Avgerage Pooling

    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
Others

image

(1) How to output multi-scale feature map

  • Patch merging

    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
    • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
    • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
  • Pooling attention

    • [MViT] Multiscale Vision Transformers (2021.4) [Paper][Imporved MViT]

    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • Dilation convolution

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

(2) How to train a deeper Transformer

  • [Cait] Going deeper with Image Transformers (2021.3) [Paper]
  • [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

MLP-based

image

  • [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]

  • [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]

  • [gMLP] Pay Attention to MLPs (2021.5) [Paper]

  • [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

ConvMixer-based

  • [ConvMixer] Patches Are All You Need [Paper]

General Architecture Analysis

image

  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]
  • A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]
  • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
  • [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Others

Object Detection

Semantic Segmentation

back to top

Papers

Transformer Original Paper

  • [Transformer] Attention is All You Need] (NIPS 2017-2017.06) [Paper]

ViT Original Paper

  • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]

Image Classification

2020

  • [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
  • [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]

2021

  • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

  • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]

  • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]

  • [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]

  • [Cait] Going deeper with Image Transformers (2021.3) [Paper]

  • [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

  • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]

  • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]

  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]

  • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

  • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]

  • [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]

  • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

  • [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]

  • [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]

  • [gMLP] Pay Attention to MLPs (2021.5) [Paper]

  • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]

  • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

  • [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]

  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]

  • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]

  • [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]

  • [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]

  • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

  • [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

  • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]

  • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]

  • A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]

  • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]

  • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]

  • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]

  • [ConvMixer] Patches Are All You Need [Paper]

2022

  • [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Object Detection

Semantic Segmentation

back to top

Stay tuned and PRs are welcomed!

Owner
OpenMMLab
OpenMMLab
A simple baseline for the 2022 IEEE GRSS Data Fusion Contest (DFC2022)

DFC2022 Baseline A simple baseline for the 2022 IEEE GRSS Data Fusion Contest (DFC2022) This repository uses TorchGeo, PyTorch Lightning, and Segmenta

isaac 24 Nov 28, 2022
Code that accompanies the paper Semi-supervised Deep Kernel Learning: Regression with Unlabeled Data by Minimizing Predictive Variance

Semi-supervised Deep Kernel Learning This is the code that accompanies the paper Semi-supervised Deep Kernel Learning: Regression with Unlabeled Data

58 Oct 26, 2022
AutoDeeplab / auto-deeplab / AutoML for semantic segmentation, implemented in Pytorch

AutoML for Image Semantic Segmentation Currently this repo contains the only working open-source implementation of Auto-Deeplab which, by the way out-

AI Necromancer 299 Dec 17, 2022
Semi-supervised Learning for Sentiment Analysis

Neural-Semi-supervised-Learning-for-Text-Classification-Under-Large-Scale-Pretraining Code, models and Datasets for《Neural Semi-supervised Learning fo

47 Jan 01, 2023
Trading Gym is an open source project for the development of reinforcement learning algorithms in the context of trading.

Trading Gym Trading Gym is an open-source project for the development of reinforcement learning algorithms in the context of trading. It is currently

Dimitry Foures 535 Nov 15, 2022
BMW TechOffice MUNICH 148 Dec 21, 2022
TensorFlow implementation of "Learning from Simulated and Unsupervised Images through Adversarial Training"

Simulated+Unsupervised (S+U) Learning in TensorFlow TensorFlow implementation of Learning from Simulated and Unsupervised Images through Adversarial T

Taehoon Kim 569 Dec 29, 2022
The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"

The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"

Sun Yi 201 Nov 21, 2022
WPPNets: Unsupervised CNN Training with Wasserstein Patch Priors for Image Superresolution

WPPNets: Unsupervised CNN Training with Wasserstein Patch Priors for Image Superresolution This code belongs to the paper [1] available at https://arx

Fabian Altekrueger 5 Jun 02, 2022
Dirty Pixels: Towards End-to-End Image Processing and Perception

Dirty Pixels: Towards End-to-End Image Processing and Perception This repository contains the code for the paper Dirty Pixels: Towards End-to-End Imag

50 Nov 18, 2022
Stereo Radiance Fields (SRF): Learning View Synthesis for Sparse Views of Novel Scenes

Stereo Radiance Fields (SRF): Learning View Synthesis for Sparse Views of Novel Scenes

111 Dec 29, 2022
This porject is intented to build the most accurate model for predicting the porbability of loan default

Estimating-Loan-Default-Probability IBA ML2 Mid-project / Kaggle Competition This porject is intented to build the most accurate model for predicting

Adil Gahramanov 1 Jan 24, 2022
Snscrape-jsonl-urls-extractor - Extracts urls from jsonl produced by snscrape

snscrape-jsonl-urls-extractor extracts urls from jsonl produced by snscrape Usag

1 Feb 26, 2022
Re-implementation of the Noise Contrastive Estimation algorithm for pyTorch, following "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models." (Gutmann and Hyvarinen, AISTATS 2010)

Noise Contrastive Estimation for pyTorch Overview This repository contains a re-implementation of the Noise Contrastive Estimation algorithm, implemen

Denis Emelin 42 Nov 24, 2022
Tiny-NewsRec: Efficient and Effective PLM-based News Recommendation

Tiny-NewsRec The source codes for our paper "Tiny-NewsRec: Efficient and Effective PLM-based News Recommendation". Requirements PyTorch == 1.6.0 Tensor

Yang Yu 3 Dec 07, 2022
Dashboard for the COVID19 spread

COVID-19 Data Explorer App A streamlit Dashboard for the COVID-19 spread. The app is live at: [https://covid19.cwerner.ai]. New data is queried from G

Christian Werner 22 Sep 29, 2022
Code to reproduce the results in "Visually Grounded Reasoning across Languages and Cultures", EMNLP 2021.

marvl-code [WIP] This is the implementation of the approaches described in the paper: Fangyu Liu*, Emanuele Bugliarello*, Edoardo M. Ponti, Siva Reddy

25 Nov 15, 2022
Official repository for the paper "Instance-Conditioned GAN"

Official repository for the paper "Instance-Conditioned GAN" by Arantxa Casanova, Marlene Careil, Jakob Verbeek, Michał Drożdżal, Adriana Romero-Soriano.

Facebook Research 510 Dec 30, 2022
[ICCV21] Code for RetrievalFuse: Neural 3D Scene Reconstruction with a Database

RetrievalFuse Paper | Project Page | Video RetrievalFuse: Neural 3D Scene Reconstruction with a Database Yawar Siddiqui, Justus Thies, Fangchang Ma, Q

Yawar Nihal Siddiqui 75 Dec 22, 2022
JugLab 33 Dec 30, 2022