MlTr: Multi-label Classification with Transformer

This is the official implementation of "MlTr: Multi-label Classification with Transformer".

Abstract

The task of multi-label image classification is to recognize all the object labels present in an image. Despite years of progress, small objects, similar objects, and objects with high conditional probability remain the main bottlenecks of previous convolutional neural network (CNN) based models, which are limited by the representational capacity of convolutional kernels. Recent vision transformer networks use the self-attention mechanism to extract features at pixel granularity, which expresses richer local semantic information but is insufficient for mining global spatial dependence. In this paper, we point out three crucial problems that CNN-based methods encounter and explore the possibility of designing specific transformer modules to settle them. We put forward a Multi-label Transformer architecture (MlTr) constructed with window partitioning, in-window pixel attention, and cross-window attention, which particularly improves the performance of multi-label image classification tasks. The proposed MlTr shows state-of-the-art results on various prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE with 88.5%, 95.8%, and 65.5% respectively.
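The abstract names window partitioning as the first building block of the architecture. The following is a minimal sketch of generic non-overlapping window partitioning on a 2D grid, written in plain Python for illustration. It is not taken from the MlTr code: the function name `window_partition`, the nested-list representation, and the window size are all assumptions.

```python
def window_partition(feature_map, window_size):
    """Split an H x W grid (a list of rows) into non-overlapping windows.

    Hypothetical helper for illustration only. Returns the windows in
    row-major order; each window is a flat list of
    window_size * window_size entries.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    if height % window_size or width % window_size:
        raise ValueError("height and width must be divisible by window_size")

    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            # Collect the entries of one window, row by row.
            window = [
                feature_map[row][col]
                for row in range(top, top + window_size)
                for col in range(left, left + window_size)
            ]
            windows.append(window)
    return windows


# A 4 x 4 grid whose entries are their own row-major indices.
grid = [[row * 4 + col for col in range(4)] for row in range(4)]
windows = window_partition(grid, 2)
print(windows)
```

In a windowed transformer, attention of the in-window kind would then be computed among the entries of each window separately, while a cross-window step would exchange information between windows. How MlTr implements these steps is not described in this README.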

Pretrained model (Results on MS-COCO2014)

name	resolution	mAP (%)	params (M)	model	log
mltr-s	224x224	81.9	33	coming soon	coming soon
mltr-m	384x384	86.8	62	coming soon	coming soon
mltr-l	384x384	88.5	108	coming soon	coming soon
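The table reports mean average precision. As a rough illustration of what that metric measures in the multi-label setting, here is a minimal sketch of one common definition: per-class average precision over a ranked list of images, averaged over classes. The function names and the toy data are assumptions, and the evaluation used for the numbers above may differ in its details (for example, in interpolation or tie handling).

```python
def average_precision(scores, labels):
    """Average precision for one class.

    scores: predicted confidence per image; labels: 1 if the class is
    present in the image, else 0. Illustrative helper, not the
    repository's evaluation code.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision among the top `rank` images, taken at each positive.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """Mean of per-class average precision.

    Both arguments are lists of rows, one row per image and one column
    per class.
    """
    num_classes = len(score_matrix[0])
    per_class = []
    for class_index in range(num_classes):
        class_scores = [row[class_index] for row in score_matrix]
        class_labels = [row[class_index] for row in label_matrix]
        per_class.append(average_precision(class_scores, class_labels))
    return sum(per_class) / num_classes


# Toy data: three images, two classes.
scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
labels = [[1, 0], [0, 1], [1, 1]]
result = mean_average_precision(scores, labels)
print(round(result, 4))
```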

Citing this article

Please cite this article as:

@misc{cheng2021mltr,
      title={MlTr: Multi-label Classification with Transformer}, 
      author={Xing Cheng and Hezheng Lin and Xiangyu Wu and Fan Yang and Dong Shen and Zhongyuan Wang and Nian Shi and Honglin Liu},
      year={2021},
      eprint={2106.06195},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Getting started

Please refer to get_started.

Owner

程星
