SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation

Last update: Dec 22, 2022

Junfeng Wu, Yi Jiang, Wenqing Zhang, Xiang Bai, Song Bai

arXiv 2112.08275

Abstract

In this work, we present SeqFormer, a frustratingly simple model for video instance segmentation. SeqFormer follows the principle of vision transformer that models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms should be done with each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On the YouTube-VIS dataset, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. Such achievement significantly exceeds the previous state-of-the-art performance by 4.6 and 4.4, respectively. In addition, integrated with the recently-proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer could be a strong baseline that fosters future research in video instance segmentation, and in the meantime, advances this field with a more robust, accurate, neat model.

Visualization results on YouTube-VIS 2019 valid set

Installation

First, clone the repository locally:

git clone https://github.com/wjf5203/SeqFormer.git

Then install PyTorch 1.7 and torchvision 0.8 (the command below pins 1.7.1 and 0.8.2, plus torchaudio 0.7.2):

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 -c pytorch

Install dependencies and pycocotools for VIS:

pip install -r requirements.txt
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"

Compile the CUDA operators:

cd ./models/ops
sh ./make.sh
# unit test (every check should print True)
python test.py

Data Preparation

Download and extract the 2019 version of the YouTube-VIS train and val images with annotations from CodaLab or YouTube-VIS, and download the COCO 2017 datasets. We expect the directory structure to be the following:

SeqFormer
├── datasets
│   ├── coco_keepfor_ytvis19.json
...
ytvis
├── train
├── val
├── annotations
│   ├── instances_train_sub.json
│   ├── instances_val_sub.json
coco
├── train2017
├── val2017
├── annotations
│   ├── instances_train2017.json
│   ├── instances_val2017.json

The modified COCO annotation file 'coco_keepfor_ytvis19.json' for joint training can be downloaded from [google].
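
Before launching training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` and the root directories passed to it are assumptions, and the relative paths are copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
REPO_ENTRIES = ["datasets/coco_keepfor_ytvis19.json"]
YTVIS_ENTRIES = [
    "train",
    "val",
    "annotations/instances_train_sub.json",
    "annotations/instances_val_sub.json",
]
COCO_ENTRIES = [
    "train2017",
    "val2017",
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
]


def missing_entries(root, entries):
    """Return the entries (files or directories) that do not exist under root."""
    root = Path(root)
    return [entry for entry in entries if not (root / entry).exists()]


if __name__ == "__main__":
    # Hypothetical root locations: adjust to where the repo and datasets live.
    checks = [
        ("SeqFormer", "SeqFormer", REPO_ENTRIES),
        ("ytvis", "ytvis", YTVIS_ENTRIES),
        ("coco", "coco", COCO_ENTRIES),
    ]
    for name, root, entries in checks:
        missing = missing_entries(root, entries)
        print(name, "OK" if not missing else f"missing: {missing}")
```

An empty list from `missing_entries` means every expected file and folder was found under that root.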

Model zoo

Ablation model

Train on YouTube-VIS 2019, evaluate on YouTube-VIS 2019.

Model	AP	AP50	AP75	AR1	AR10
SeqFormer_ablation [google]	45.1	66.9	50.5	45.6	54.6

YouTube-VIS model

Train on YouTube-VIS 2019 and COCO, evaluate on YouTube-VIS 2019 val set.

Model	AP	AP50	AP75	AR1	AR10	Pretrain
SeqFormer_r50 [google]	47.4	69.8	51.8	45.5	54.8	weight
SeqFormer_r101 [google]	49.0	71.1	55.7	46.8	56.9	weight
SeqFormer_x101 [google]	51.2	75.3	58.0	46.5	57.3	weight
SeqFormer_swin_L [google]	59.3	82.1	66.4	51.7	64.4	weight

Training

We ran the experiments on NVIDIA Tesla V100 GPUs. All SeqFormer models are trained with a total batch size of 32.
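
As a rough illustration of what a total batch size of 32 implies per device, the sketch below divides it evenly across GPUs. This is only arithmetic under the assumption of an even split; the helper `per_gpu_batch_size` is hypothetical, and the actual per-GPU setting lives in the config scripts, which are not shown here.

```python
def per_gpu_batch_size(total_batch_size, num_gpus):
    """Split a total batch size evenly across GPUs (assumes an even split)."""
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size is not divisible by the number of GPUs")
    return total_batch_size // num_gpus


print(per_gpu_batch_size(32, 8))   # 4 samples per GPU on a single 8-GPU node
print(per_gpu_batch_size(32, 16))  # 2 samples per GPU across two 8-GPU nodes
```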

To train SeqFormer on YouTube-VIS 2019 with 8 GPUs, run:

GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 ./configs/r50_seqformer_ablation.sh

To train SeqFormer on YouTube-VIS 2019 and COCO 2017 jointly, run:

GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 ./configs/r50_seqformer.sh

To train SeqFormer_swin_L on multiple nodes, run:

On node 1:

MASTER_ADDR=<IP address of node 1> NODE_RANK=0 GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 16 ./configs/swin_seqformer.sh

On node 2:

MASTER_ADDR=<IP address of node 1> NODE_RANK=1 GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 16 ./configs/swin_seqformer.sh

Inference & Evaluation

Evaluating on YouTube-VIS 2019:

python3 inference.py  --masks --backbone [backbone] --model_path /path/to/model_weights --save_path results.json

To get quantitative results, zip the JSON file and upload it to the CodaLab server.
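
The zipping step can be scripted with the Python standard library. This is a minimal sketch: the helper name `zip_results` is hypothetical, and placing the JSON at the root of the archive is an assumption about what the evaluation server expects.

```python
import zipfile
from pathlib import Path


def zip_results(json_path, zip_path=None):
    """Compress a results JSON file into a zip archive and return the archive path."""
    json_path = Path(json_path)
    zip_path = Path(zip_path) if zip_path else json_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store the file at the archive root (assumed layout), without parent folders.
        zf.write(json_path, arcname=json_path.name)
    return zip_path


if __name__ == "__main__":
    results = Path("results.json")  # the --save_path used in the inference command
    if results.exists():
        print("wrote", zip_results(results))
    else:
        print("results.json not found; run inference.py first")
```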

Citation

@article{wu2021seqformer,
      title={SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation}, 
      author={Junfeng Wu and Yi Jiang and Wenqing Zhang and Xiang Bai and Song Bai},
      journal={arXiv preprint arXiv:2112.08275},
      year={2021},
}

Acknowledgement

This repo is based on Deformable DETR and VisTR. Thanks to the authors for their wonderful work.
