(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Last update: Jan 08, 2023

Overview

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Mingchen Zhuge*, Dehong Gao*, Deng-Ping Fan#, Linbo Jin, Ben Chen, Haoming Zhou, Minghui Qiu, Ling Shao.

[Paper][中文版][Video][Poster][MSRA_Slide][News1][New2][MSRA_Talking][机器之心_Talking]

Introduction

We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. In contrast to the random masking strategy of recent VL models, we design alignment-guided masking to jointly focus more on image-text semantic relations. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color, for self-supervised VL pre-training on patches of different scales. Kaleido-BERT is conceptually simple and easy to extend to the existing BERT framework. It attains state-of-the-art results by large margins on four downstream tasks: text retrieval (R@1: 4.03% absolute improvement), image retrieval (R@1: 7.13% absolute improvement), category recognition (ACC: 3.28% absolute improvement), and fashion captioning (BLEU-4: 1.2 absolute improvement). We validate the efficiency of Kaleido-BERT on a wide range of e-commerce websites, demonstrating its broader potential in real-world applications.
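
To make the idea of "patches of different scales" concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: the function names (`split_into_patches`, `rotate90`, `kaleido_patches`), the toy list-of-lists image, and the grid sizes are assumptions chosen for illustration only. It splits an image into grids of increasing granularity and applies a 90-degree rotation to one patch, the kind of transformation a rotation-recognition task would ask the model to identify.

```python
from typing import Dict, List, Sequence

Image = List[List[int]]  # toy grayscale image stored as rows of pixel values


def split_into_patches(image: Image, grid: int) -> List[Image]:
    """Split an image into grid x grid equally sized patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by the grid size")
    patch_h, patch_w = height // grid, width // grid
    patches = []
    for gy in range(grid):
        rows = image[gy * patch_h:(gy + 1) * patch_h]
        for gx in range(grid):
            patches.append([row[gx * patch_w:(gx + 1) * patch_w] for row in rows])
    return patches


def rotate90(patch: Image, times: int) -> Image:
    """Rotate a patch clockwise by 90 degrees, `times` times."""
    for _ in range(times % 4):
        # Reversing the rows and transposing gives a clockwise quarter turn.
        patch = [list(row) for row in zip(*patch[::-1])]
    return patch


def kaleido_patches(image: Image, grids: Sequence[int] = (1, 2, 3)) -> Dict[int, List[Image]]:
    """Return the patches for each grid size (the grid sizes are illustrative)."""
    return {grid: split_into_patches(image, grid) for grid in grids}


if __name__ == "__main__":
    toy = [[r * 6 + c for c in range(6)] for r in range(6)]
    for grid, patches in kaleido_patches(toy).items():
        print(grid, "x", grid, "->", len(patches), "patches")
    print(rotate90([[1, 2], [3, 4]], 1))
```

In a self-supervised setup of this kind, the label is derived from the transformation itself (for example, the rotation angle), so no manual annotation is required.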

Notes

Code will be released on 2021/4/16.
This is the TensorFlow implementation built on Alibaba/EasyTransfer. We will also release a PyTorch version built on Huggingface/Transformers in the future.
If you have trouble downloading these datasets, modify /dataset/get_pretrain_data.sh, /dataset/get_finetune_data.sh, and /dataset/get_retrieve_data.sh, and comment out any of the wget file links you do not need. This will not affect the steps that follow.
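
If editing the download scripts by hand is tedious, the commenting-out step can be automated. The helper below is a hypothetical sketch, not part of the repository: the function name `comment_out_wget`, the keyword-matching rule, and the example URLs are all assumptions. It prefixes `#` to every `wget` line that contains one of the given keywords and leaves all other lines untouched.

```python
from typing import Iterable


def comment_out_wget(script_text: str, skip_keywords: Iterable[str]) -> str:
    """Comment out wget lines that mention any of the given keywords."""
    keywords = list(skip_keywords)
    result = []
    for line in script_text.splitlines():
        is_wget = line.lstrip().startswith("wget")
        if is_wget and any(keyword in line for keyword in keywords):
            result.append("#" + line)  # disable this download
        else:
            result.append(line)        # keep everything else as-is
    trailing_newline = "\n" if script_text.endswith("\n") else ""
    return "\n".join(result) + trailing_newline


if __name__ == "__main__":
    # Hypothetical script content; the real scripts list the actual file links.
    example = (
        "wget https://example.com/train_images.zip\n"
        "wget https://example.com/labels.json\n"
    )
    print(comment_out_wget(example, ["train_images"]))
```

To apply it to a real script, read the file's text, pass it through the function, and write the result back; keeping a backup copy of the original script is advisable.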

Get started

Clone this code

git clone git@github.com:mczhuge/Kaleido-BERT.git
cd Kaleido-BERT

Environment setup (details can be found in conda_env.info)

conda create --name kaleidobert --file conda_env.info
conda activate kaleidobert
conda install tensorflow-gpu=1.15.0
pip install boto3 tqdm tensorflow_datasets --index-url=https://mirrors.aliyun.com/pypi/simple/
pip install sentencepiece==0.1.92 scikit-learn --index-url=https://mirrors.aliyun.com/pypi/simple/
pip install joblib==0.14.1
python setup.py develop

Download pretrained dependency

cd Kaleido-BERT/scripts/checkpoint
sh get_checkpoint.sh

Finetune

#Download finetune datasets

cd Kaleido-BERT/scripts/dataset
sh get_finetune_data.sh
sh get_retrieve_data.sh

#Testing CAT/SUB

cd Kaleido-BERT/scripts
sh run_cat.sh
sh run_subcat.sh

#Testing TIR/ITR

cd Kaleido-BERT/scripts
sh run_i2t.sh
sh run_t2i.sh
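
Retrieval results for ITR/TIR are commonly summarized with Recall@K: the fraction of queries whose ground-truth match appears among the top K ranked candidates. The sketch below shows how such a score can be computed from ranked candidate lists. It is an illustration only; the function name `recall_at_k` and the input format are assumptions and do not describe the output format of the scripts above.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[str]], gold_ids: Sequence[str], k: int) -> float:
    """Fraction of queries whose ground-truth item is within the top-k candidates."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("need exactly one ground-truth id per query")
    if not gold_ids:
        return 0.0
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


if __name__ == "__main__":
    # Three toy queries, each with candidates sorted from best to worst score.
    rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
    gold = ["a", "a", "a"]
    for k in (1, 2, 3):
        print(f"R@{k} = {recall_at_k(rankings, gold, k):.3f}")
```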

Pre-training

#Download pre-training datasets

cd Kaleido-BERT/scripts/dataset
sh get_pretrain_data.sh

#Remove existing checkpoint
rm -rf Kaleido-BERT/checkpoint/pretrained

#Run pre-training
cd Kaleido-BERT/scripts/
sh run_pretrain.sh

Acknowledgement

Thanks to the Alibaba ICBU Search Team and the Alibaba PAI Team for technical support.

Citing Kaleido-BERT

@InProceedings{Zhuge_2021_CVPR,
    author    = {Zhuge, Mingchen and Gao, Dehong and Fan, Deng-Ping and Jin, Linbo and Chen, Ben and Zhou, Haoming and Qiu, Minghui and Shao, Ling},
    title     = {Kaleido-BERT: Vision-Language Pre-Training on Fashion Domain},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {12647-12657}
}

Contact

Mingchen Zhuge (email: [email protected] | wechat: tjpxiaoming)
Deng-Ping Fan (email: [email protected])
Dehong Gao (email: [email protected])

Feel free to contact us if you have additional questions.
