Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Last update: Dec 17, 2022

Training COMET using seq2seq setting

Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The code is adapted from run_summarization.py in the official Transformers example scripts (version 4.16.0.dev0).

The ./deepspeed/ folder is copied from https://github.com/huggingface/transformers/tree/master/tests/deepspeed .

The ATOMIC2020 training data can be downloaded from https://allenai.org/data/atomic-2020. Convert the .tsv files to .csv so that the data loader in Transformers can read them.
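
The conversion could look roughly like the sketch below. It rests on assumptions not stated in this README: that each line of the ATOMIC2020 .tsv holds three tab-separated fields (head, relation, tail) with no header row, and that the source text is the head followed by the relation. The function name tsv_to_csv and that way of joining head and relation are hypothetical; only the column names head_event and tail_event come from the training commands in the Usage section.

```python
import csv


def tsv_to_csv(tsv_path, csv_path):
    """Convert (head, relation, tail) TSV rows into a two-column CSV.

    Assumption: the input has exactly three tab-separated fields per line
    and no header row. Returns the number of rows written.
    """
    written = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE keeps quote characters inside events as literal text.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)
        # Header names match --text_column / --summary_column.
        writer.writerow(["head_event", "tail_event"])
        for row in reader:
            if len(row) != 3:
                continue  # skip malformed lines
            head, relation, tail = (field.strip() for field in row)
            # Hypothetical source format: "<head> <relation>".
            writer.writerow([f"{head} {relation}", tail])
            written += 1
    return written
```

Calling tsv_to_csv("train.tsv", "train.csv") would produce a file that can be passed to --train_file. If the model was trained with a different source format, adjust the line that builds head_event.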

Dependencies

python

torch==1.7.1
cudatoolkit=11.0
transformers==4.15.0
deepspeed==0.5.10

others

GCC/G++ 5.2.0 (to compile the DeepSpeed ops)

Usage

1. Normal training without memory optimization:

CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5

2. Train with gradient_checkpointing=True. This lowers memory usage at the cost of slower training.

CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --gradient_checkpointing
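
To illustrate why gradient checkpointing trades time for memory, here is a toy sketch on a chain of scalar functions. It is not how Transformers implements the flag (that goes through PyTorch's checkpoint utility); the layers and function names below are made up for illustration. Standard backpropagation stores the input of every layer, while the checkpointed version stores only one input per segment and recomputes the rest during the backward pass.

```python
import math

# Each toy "layer" is a pair: (forward function, derivative function).
LAYERS = [
    (lambda x: 2.0 * x, lambda x: 2.0),
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: x * x, lambda x: 2.0 * x),
    (math.sin, math.cos),
]


def grad_full(x):
    """Standard backprop: keep the input of every layer in memory."""
    inputs = []
    for f, _ in LAYERS:
        inputs.append(x)
        x = f(x)
    grad = 1.0
    for (_, df), inp in zip(reversed(LAYERS), reversed(inputs)):
        grad *= df(inp)
    return grad, len(inputs)  # gradient, number of stored activations


def grad_checkpointed(x, segment=2):
    """Keep only each segment's input; recompute the rest in backward."""
    starts = list(range(0, len(LAYERS), segment))
    checkpoints = []
    for start in starts:
        checkpoints.append(x)
        for f, _ in LAYERS[start:start + segment]:
            x = f(x)
    grad = 1.0
    for start, ckpt in zip(reversed(starts), reversed(checkpoints)):
        seg = LAYERS[start:start + segment]
        # Extra forward work: rebuild the inputs inside this segment.
        inputs, h = [], ckpt
        for f, _ in seg:
            inputs.append(h)
            h = f(h)
        for (_, df), inp in zip(reversed(seg), reversed(inputs)):
            grad *= df(inp)
    return grad, len(checkpoints)


full, stored_full = grad_full(0.3)
ckpt, stored_ckpt = grad_checkpointed(0.3)
print(f"same gradient: {full == ckpt}; stored {stored_full} vs {stored_ckpt}")
```

Both functions return the same gradient, but the checkpointed one holds half as many activations between the forward and backward passes and pays for it with a second forward pass through each segment.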

3. Train with DeepSpeed (either ZeRO stage 2 or ZeRO stage 3)

# google/t5-3B training, on 2080Ti (11GB)
deepspeed --include localhost:0,1 --master_port 30000 models/comet_seq2seq.py \
    --deepspeed deepspeed/ds_config_zero2.json \
    --model_name_or_path google/t5-xl-lm-adapt \
    --do_train \
    --train_file data/kg/atomic2020_data-feb2021/train.csv \
    --source_prefix "" \
    --output_dir data/models/comet/t5_xl_s2_bs32_fp16 \
    --overwrite_output_dir \
    --gradient_accumulation_steps=1 \
    --per_device_train_batch_size=16 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --fp16
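
For readers without the ./deepspeed/ folder at hand, a ZeRO stage 2 config for the Trainer integration might look roughly like the sketch below, written as a small Python script that emits the JSON. This is an assumption-laden sketch, not the contents of deepspeed/ds_config_zero2.json in this repository: the choice to offload the optimizer state to CPU is illustrative, and build_zero2_config is a hypothetical helper name. The "auto" values rely on the Transformers DeepSpeed integration filling them in from the command-line arguments (for example --fp16 and --per_device_train_batch_size).

```python
import json


def build_zero2_config():
    """Return a minimal ZeRO stage 2 config for the HF Trainer integration."""
    return {
        # "auto" lets the Trainer copy the value from its own arguments.
        "fp16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 2,
            # Assumption: optimizer state is offloaded to CPU memory.
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


# Print the JSON; redirect it to a file to use it with --deepspeed.
print(json.dumps(build_zero2_config(), indent=4))
```

A stage 3 config would set "stage" to 3 and usually adds further stage-3-specific options, which are omitted here.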

4. Comparison of memory usage of different memory optimization methods

Memory usage is compared on an NVIDIA RTX A6000 (48685 MB memory) and an NVIDIA GeForce RTX 3090 (24268 MB memory).
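
The tables in this section describe each run as Batch Size x Grad-Accum x Num-GPU, for example 8x4x1. The product of the three numbers is the number of examples consumed per optimizer step, which is why timings are reported per 32 examples. A minimal sketch of that arithmetic (the helper name effective_batch_size is hypothetical):

```python
def effective_batch_size(config):
    """Parse 'BxAxG' and return the examples consumed per optimizer step."""
    per_device, grad_accum, num_gpus = (int(p) for p in config.lower().split("x"))
    return per_device * grad_accum * num_gpus


for config in ("8x4x1", "1x32x1", "16x1x2"):
    print(config, "->", effective_batch_size(config))
```

All three settings above give 32 examples per step, so their per-batch timings are directly comparable.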

1. fp16

T5-3B: effects of fp16. On the A6000, fp16 reduces memory usage from 47.5k MB to 31k MB, roughly a one-third reduction.

Method	Device	fp16	Batch Size x Grad-Accum x Num-GPU	Memory Usage	Time to Train a Batch
vanilla	A6000	False	8x4x1	47.5k MB	1.5s/32ex
vanilla	A6000	True	8x4x1	31k MB	1.0s/32ex
vanilla	3090	False	1x32x1	❌	-
vanilla	3090	True	1x32x1	❌	-

2. gradient_checkpointing

T5-3B: Effects of gradient_checkpointing.

Method	Device	fp16	Batch Size x Grad-Accum x Num-GPU	Memory Usage	Time to Train a Batch
vanilla	A6000	False	8x4x1	47k MB	1.5s/32ex
vanilla	A6000	True	8x4x1	31k MB	1.0s/32ex
grad-ckpt	A6000	False	8x4x1	46.4k MB	1.3s/32ex
grad-ckpt	A6000	True	8x4x1	23.9k MB	1.1s/32ex
vanilla	3090	True	1x32x1	❌	-
grad-ckpt	3090	True	1x32x1	23.8k MB	15s/32ex

3. DeepSpeed stage 2

T5-3B: Effects of DeepSpeed.

Method	Device	fp16	Batch Size x Grad-Accum x Num-GPU	Memory Usage	Time to Train a Batch
vanilla	3090	True	1x32x1	❌	-
grad-ckpt	3090	True	1x32x1	23k MB	13.5s/32ex
stage2	3090	True	32x1x1	20.3k MB	7.5s/32ex
stage2	3090	True	16x1x2	20.3k MB	6.36s/32ex
stage2	3090	True	32x1x2	20.3k MB	3.75s/32ex

4. DeepSpeed stage 3

Stage 3 reduces memory usage further, but training is much slower.

5. Automatic Evaluation Results on ATOMIC2020 data

Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
T5-3B (no deepspeed), lr1e-5, epoch 3	0.346	0.184	0.12	0.084	0.19	0.422	0.646
T5-3B (no deepspeed), lr1e-5, epoch 2	0.348	0.185	0.121	0.085	0.19	0.424	0.651
T5-3B (no deepspeed), lr1e-5, epoch 1	0.343	0.177	0.113	0.079	0.186	0.416	0.629
T5-3B (ds_stage2, fp16) epoch 3	0.340	0.182	0.118	0.083	0.189	0.418	0.637
T5-3B (ds_stage2, fp16) epoch 2	0.337	0.177	0.114	0.078	0.189	0.419	0.633
T5-3B (ds_stage2, fp16) epoch 1	0.335	0.174	0.112	0.076	0.186	0.415	0.632

Useful discussions regarding environment setups

Errors building DeepSpeed Ops: https://github.com/microsoft/DeepSpeed/issues/885

TODO

DeepSpeed without Trainer(): https://huggingface.co/docs/transformers/main_classes/deepspeed#deepspeed-non-trainer-integration

Owner

tqfang
