Overview

Self-Supervised Learning with Vision Transformers

By Zhenda Xie*, Yutong Lin*, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao and Han Hu

This repo is the official implementation of "Self-Supervised Learning with Swin Transformers".

An important feature of this codebase is that it includes Swin Transformer as one of the backbones, so that we can evaluate the transfer performance of the learned representations on the downstream tasks of object detection and semantic segmentation. Previous works usually omit this evaluation because they use ViT/DeiT, which has not been well adapted to downstream tasks.

It currently includes code and models for the following tasks:

  • Self-Supervised Learning and Linear Evaluation: included in this repo. See get_started.md for a quick start.
  • Transferring Performance on Object Detection/Instance Segmentation: see Swin Transformer for Object Detection.
  • Transferring Performance on Semantic Segmentation: see Swin Transformer for Semantic Segmentation.

Highlights

  • Includes downstream evaluation: the first work to evaluate transfer performance on downstream tasks for SSL with Transformers
  • Few tricks: significantly fewer tricks than previous works such as MoCo v3 and DINO
  • High accuracy on ImageNet-1K linear evaluation: 72.8% vs 72.5% (MoCo v3) vs 72.5% (DINO) using DeiT-S/16 and 300-epoch pre-training

Updates

05/13/2021

  1. Self-Supervised models with DeiT-Small on ImageNet-1K (MoBY-DeiT-Small-300Ep-Pretrained, MoBY-DeiT-Small-300Ep-Linear) are provided.
  2. The supporting code and config for self-supervised learning with DeiT-Small are provided.

05/11/2021

Initial Commits:

  1. Self-supervised pre-trained models on ImageNet-1K (MoBY-Swin-T-300Ep-Pretrained, MoBY-Swin-T-300Ep-Linear) are provided.
  2. The supporting code and models for self-supervised pre-training, ImageNet-1K linear evaluation, COCO object detection and ADE20K semantic segmentation are provided.

Introduction

MoBY: a self-supervised learning approach by combining MoCo v2 and BYOL

MoBY (the name stands for MoCo v2 with BYOL) was first described on arXiv. It combines two popular self-supervised learning approaches, MoCo v2 and BYOL: it inherits the momentum design, the key queue, and the contrastive loss of MoCo v2, and the asymmetric encoders, asymmetric data augmentations, and momentum scheduler of BYOL.
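The borrowed ingredients can be sketched in a few lines. This is a minimal, framework-free illustration, not the repository's implementation: weights and embeddings are plain Python lists, and the function names, the base momentum of 0.99 and the temperature of 0.2 are illustrative assumptions. In a MoBY-style setup we assume the query comes from the online encoder (which carries an extra predictor head, one of the asymmetries taken from BYOL), while the positive key and the queued negatives come from the momentum encoder.

```python
import math


def momentum_at(step, total_steps, base=0.99):
    """BYOL-style cosine schedule: momentum rises from `base` at step 0 to 1.0 at the end."""
    return 1.0 - (1.0 - base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0


def ema_update(online, target, m):
    """Momentum (EMA) update: move the target/key encoder weights toward the online weights."""
    return [m * t + (1.0 - m) * o for o, t in zip(online, target)]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid dividing by zero
    return [x / norm for x in v]


def contrastive_loss(q, k_pos, queue, tau=0.2):
    """InfoNCE: the query should score its positive key above every queued negative key."""
    q = l2_normalize(q)
    keys = [k_pos] + queue  # the positive key sits at index 0
    logits = [sum(a * b for a, b in zip(q, l2_normalize(k))) / tau for k in keys]
    # Cross-entropy with target index 0, computed with a stable log-sum-exp.
    mx = max(logits)
    lse = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return lse - logits[0]


def enqueue(queue, keys, max_size):
    """FIFO key queue: append the newest keys and drop the oldest beyond `max_size`."""
    return (queue + keys)[-max_size:]
```

Swapping the two augmented views and averaging the two losses would give a symmetrized objective; that step is left out of this sketch.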

MoBY achieves reasonably high accuracy on ImageNet-1K linear evaluation: 72.8% and 75.3% top-1 accuracy with DeiT-S and Swin-T, respectively, after 300 epochs of training. This is on par with the recent MoCo v3 and DINO, which adopt DeiT as the backbone, while using far fewer tricks.


Swin Transformer as a backbone

Swin Transformer (the name Swin stands for Shifted window) was first described on arXiv and serves as a general-purpose backbone for computer vision. It achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.
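To make "shifted window" concrete, here is a toy sketch on a 2-D grid of token ids: self-attention is computed only inside non-overlapping windows, and alternate layers cyclically shift the grid by half a window so that the new windows straddle the previous window boundaries. It rests on simplifying assumptions: a single channel, no attention computation, and no attention mask for the wrapped-around regions (which the real Swin Transformer does need). The function names are hypothetical.

```python
def window_partition(grid, ws):
    """Split an H x W grid (a list of rows) into non-overlapping ws x ws windows."""
    h, w = len(grid), len(grid[0])
    assert h % ws == 0 and w % ws == 0, "grid size must be divisible by the window size"
    windows = []
    for r in range(0, h, ws):
        for c in range(0, w, ws):
            windows.append([row[c:c + ws] for row in grid[r:r + ws]])
    return windows


def cyclic_shift(grid, s):
    """Roll the grid up and left by s cells (same as torch.roll(x, (-s, -s), dims=(0, 1)))."""
    h, w = len(grid), len(grid[0])
    return [[grid[(i + s) % h][(j + s) % w] for j in range(w)] for i in range(h)]


# 4 x 4 grid of token ids, window size 2, shift of half a window.
grid = [[4 * i + j for j in range(4)] for i in range(4)]
regular = window_partition(grid, 2)                   # first window: [[0, 1], [4, 5]]
shifted = window_partition(cyclic_shift(grid, 1), 2)  # first window: [[5, 6], [9, 10]]
```

The first shifted window contains one token from each of the four original top-left windows, which is how information flows across window boundaries between consecutive layers.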

We include Swin Transformer as one of the backbones so that we can evaluate transfer performance on downstream tasks such as object detection. This differentiates this codebase from other approaches that study SSL on Transformer architectures.

ImageNet-1K linear evaluation

| Method | Architecture | Epochs | Params | FLOPs | img/s | Top-1 Accuracy (%) | Pre-trained Checkpoint | Linear Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Supervised | Swin-T | 300 | 28M | 4.5G | 755.2 | 81.2 | Here | |
| MoBY | Swin-T | 100 | 28M | 4.5G | 755.2 | 70.9 | TBA | |
| MoBY¹ | Swin-T | 100 | 28M | 4.5G | 755.2 | 72.0 | TBA | |
| MoBY | DeiT-S | 300 | 22M | 4.6G | 940.4 | 72.8 | GoogleDrive/GitHub/Baidu | GoogleDrive/GitHub/Baidu |
| MoBY | Swin-T | 300 | 28M | 4.5G | 755.2 | 75.3 | GoogleDrive/GitHub/Baidu | GoogleDrive/GitHub/Baidu |

  • ¹ denotes the result of MoBY with a trick adopted from MoCo v3 that replaces the LayerNorm layers before the MLP blocks with BatchNorm.

  • The access code for Baidu is moby.
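The LayerNorm-to-BatchNorm trick changes which axis the normalization statistics are computed over. The sketch below illustrates only that difference on a (tokens x channels) matrix; it is not the repository's code, and it omits the learnable scale/shift parameters and BatchNorm's running statistics for inference. The function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize each token (row) over its own channels."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def batch_norm(x, eps=1e-5):
    """Normalize each channel (column) over all tokens in the batch, training-mode statistics."""
    n, c = len(x), len(x[0])
    means = [sum(x[i][j] for i in range(n)) / n for j in range(c)]
    variances = [sum((x[i][j] - means[j]) ** 2 for i in range(n)) / n for j in range(c)]
    return [[(x[i][j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(c)]
            for i in range(n)]


x = [[1.0, 3.0], [3.0, 7.0]]  # 2 tokens, 2 channels
ln = layer_norm(x)            # approximately [[-1, 1], [-1, 1]]
bn = batch_norm(x)            # approximately [[-1, -1], [1, 1]]
```

LayerNorm makes each token independent of the rest of the batch, while BatchNorm couples tokens through shared per-channel statistics.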

Transferring to Downstream Tasks

COCO Object Detection (2017 val)

| Backbone | Method | Pre-training | Schedule | box mAP | mask mAP | Params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | Mask R-CNN | Sup. | 1x | 43.7 | 39.8 | 48M | 267G |
| Swin-T | Mask R-CNN | MoBY | 1x | 43.6 | 39.6 | 48M | 267G |
| Swin-T | Mask R-CNN | Sup. | 3x | 46.0 | 41.6 | 48M | 267G |
| Swin-T | Mask R-CNN | MoBY | 3x | 46.0 | 41.7 | 48M | 267G |
| Swin-T | Cascade Mask R-CNN | Sup. | 1x | 48.1 | 41.7 | 86M | 745G |
| Swin-T | Cascade Mask R-CNN | MoBY | 1x | 48.1 | 41.5 | 86M | 745G |
| Swin-T | Cascade Mask R-CNN | Sup. | 3x | 50.4 | 43.7 | 86M | 745G |
| Swin-T | Cascade Mask R-CNN | MoBY | 3x | 50.2 | 43.5 | 86M | 745G |

ADE20K Semantic Segmentation (val)

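| Backbone | Method | Pre-training | Crop Size | Schedule | mIoU | mIoU (ms+flip) | Params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | UPerNet | Sup. | 512x512 | 160K | 44.51 | 45.81 | 60M | 945G |
| Swin-T | UPerNet | MoBY | 512x512 | 160K | 44.06 | 45.58 | 60M | 945G |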
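For reference, mIoU is the per-class intersection over union, TP / (TP + FP + FN), averaged over classes; the "ms+flip" column reports the same metric with multi-scale and horizontal-flip test-time augmentation. Below is a minimal single-scale sketch computed from a pixel confusion matrix. It is an illustration only: the function name is hypothetical, and the evaluation toolkit used for these numbers may handle ignored labels and absent classes differently.

```python
def mean_iou(conf):
    """conf[t][p] = number of pixels with ground-truth class t predicted as class p."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                         # class-c pixels predicted as something else
        fp = sum(conf[r][c] for r in range(n)) - tp    # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both ground truth and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, the two-class matrix `[[3, 1], [1, 5]]` gives IoUs of 3/5 and 5/7, so an mIoU of about 0.657.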

Citing MoBY and Swin

MoBY

@article{xie2021moby,
  title={Self-Supervised Learning with Swin Transformers}, 
  author={Zhenda Xie and Yutong Lin and Zhuliang Yao and Zheng Zhang and Qi Dai and Yue Cao and Han Hu},
  journal={arXiv preprint arXiv:2105.04553},
  year={2021}
}

Swin Transformer

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

