EMNLP 2021 - Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Last update: Nov 18, 2022

Overview

This is the official implementation for "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling" (EMNLP 2021).

Requirements

torch
transformers
datasets
scikit-learn
tensorflow
spacy

How to pre-train

1. Clone this repository

git clone https://github.com/gucci-j/light-transformer-emnlp2021.git

2. Install required packages

cd ./light-transformer-emnlp2021
pip install -r requirements.txt

requirements.txt is located in the root of light-transformer-emnlp2021.

We also need spaCy's en_core_web_sm for preprocessing. If you have not installed this model, please run python -m spacy download en_core_web_sm.

3. Preprocess datasets

cd ./src/utils
python preprocess_roberta.py --path=/path/to/save/data/

You need to specify the following argument:

path: (str) Where to save the processed data?

4. Pre-training

You need to specify configs as command line arguments. A sample configuration for pre-training with MLM is shown below. Run python pretrainer.py --help to list all available options.

cd ../
python pretrainer.py \
--data_dir=/path/to/dataset/ \
--do_train \
--learning_rate=1e-4 \
--weight_decay=0.01 \
--adam_epsilon=1e-8 \
--max_grad_norm=1.0 \
--num_train_epochs=1 \
--warmup_steps=12774 \
--save_steps=12774 \
--seed=42 \
--per_device_train_batch_size=16 \
--logging_steps=100 \
--output_dir=/path/to/save/weights/ \
--overwrite_output_dir \
--logging_dir=/path/to/save/log/files/ \
--disable_tqdm=True \
--prediction_loss_only \
--fp16 \
--mlm_prob=0.15 \
--pretrain_model=RobertaForMaskedLM

pretrain_model should be selected from:
- RobertaForMaskedLM (MLM)
- RobertaForShuffledWordClassification (Shuffle)
- RobertaForRandomWordClassification (Random)
- RobertaForShuffleRandomThreeWayClassification (Shuffle+Random)
- RobertaForFourWayTokenTypeClassification (Token Type)
- RobertaForFirstCharPrediction (First Char)

Check the pre-training process

You can monitor the progress of pre-training with TensorBoard. Simply run the following:

tensorboard --logdir=/path/to/log/dir/

Distributed training

pretrainer.py is compatible with distributed training. Sample configs for pre-training MLM are as follows.

python -m torch.distributed.launch \
--nproc_per_node=8 \
pretrainer.py \
--data_dir=/path/to/dataset/ \
--model_path=None \
--do_train \
--learning_rate=5e-5 \
--weight_decay=0.01 \
--adam_epsilon=1e-8 \
--max_grad_norm=1.0 \
--num_train_epochs=1 \
--warmup_steps=24000 \
--save_steps=1000 \
--seed=42 \
--per_device_train_batch_size=8 \
--logging_steps=100 \
--output_dir=/path/to/save/weights/ \
--overwrite_output_dir \
--logging_dir=/path/to/save/log/files/ \
--disable_tqdm \
--prediction_loss_only \
--fp16 \
--mlm_prob=0.15 \
--pretrain_model=RobertaForMaskedLM

For more details about launch.py, please refer to https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py.
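
When moving between the single-process and distributed commands, it helps to keep track of the effective batch size, since it changes with the number of processes. The sketch below does that arithmetic for the two sample commands. It assumes one process per GPU and no gradient accumulation (the sample commands do not set any accumulation flag); `effective_batch_size` is a helper written for this example, not part of the repository.

```python
def effective_batch_size(per_device_batch_size, num_processes=1, grad_accum_steps=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * num_processes * grad_accum_steps


# Single-process sample: --per_device_train_batch_size=16
single = effective_batch_size(16)

# Distributed sample: --nproc_per_node=8, --per_device_train_batch_size=8
distributed = effective_batch_size(8, num_processes=8)

print(single, distributed)  # 16 64
```

Under these assumptions the distributed sample sees four times as many examples per step as the single-process one, which is worth keeping in mind when comparing learning rates and warmup steps between the two commands.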

Mixed precision training

Installation

For PyTorch version >= 1.6, there is a native functionality to enable mixed precision training.
For older versions, NVIDIA apex must be installed.
- You might encounter some errors when installing apex due to permission problems. To fix these, specify export TMPDIR='/path/to/your/favourite/dir/' and change permissions of all files under apex/.git/ to 777.
- You also need to specify an optimisation level (opt_level) from https://nvidia.github.io/apex/amp.html.

Usage
To use mixed precision during pre-training, just specify --fp16 as an input argument. For older PyTorch versions, also specify --fp16_opt_level from O0, O1, O2, and O3.

How to fine-tune

GLUE

Download GLUE data

git clone https://github.com/huggingface/transformers
python transformers/utils/download_glue_data.py

Create a json config file
You need to create a .json file for configuration or use command line arguments.

{
    "model_name_or_path": "/path/to/pretrained/weights/",
    "tokenizer_name": "roberta-base",
    "task_name": "MNLI",
    "do_train": true,
    "do_eval": true,
    "data_dir": "/path/to/MNLI/dataset/",
    "max_seq_length": 128,
    "learning_rate": 2e-5,
    "num_train_epochs": 3, 
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 128,
    "logging_steps": 500,
    "logging_first_step": true,
    "save_steps": 1000,
    "save_total_limit": 2,
    "evaluate_during_training": true,
    "output_dir": "/path/to/save/models/",
    "overwrite_output_dir": true,
    "logging_dir": "/path/to/save/log/files/",
    "disable_tqdm": true
}

For task_name, choose one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI, and point data_dir to the directory of the corresponding dataset.
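
If you want to fine-tune on several GLUE tasks, writing one config file per task by hand gets repetitive. The following is a minimal sketch that generates the files with the same keys and values as the sample above. The helper names (`build_config`, `write_configs`) and the directory layout (`<glue_root>/<task>/`, `<output_root>/<task>/`, and a `logs/` subdirectory) are assumptions for this example; adjust them to wherever your data and weights actually live.

```python
import json
from pathlib import Path

GLUE_TASKS = ["CoLA", "SST-2", "MRPC", "STS-B", "QQP", "MNLI", "QNLI", "RTE", "WNLI"]


def build_config(task, weights_dir, glue_root, output_root):
    """Return a config dict for one GLUE task, mirroring the sample config."""
    if task not in GLUE_TASKS:
        raise ValueError(f"unknown GLUE task: {task}")
    return {
        "model_name_or_path": weights_dir,
        "tokenizer_name": "roberta-base",
        "task_name": task,
        "do_train": True,
        "do_eval": True,
        "data_dir": f"{glue_root}/{task}/",  # assumed layout
        "max_seq_length": 128,
        "learning_rate": 2e-5,
        "num_train_epochs": 3,
        "per_device_train_batch_size": 32,
        "per_device_eval_batch_size": 128,
        "logging_steps": 500,
        "logging_first_step": True,
        "save_steps": 1000,
        "save_total_limit": 2,
        "evaluate_during_training": True,
        "output_dir": f"{output_root}/{task}/",  # assumed layout
        "overwrite_output_dir": True,
        "logging_dir": f"{output_root}/{task}/logs/",  # assumed layout
        "disable_tqdm": True,
    }


def write_configs(config_dir, weights_dir, glue_root, output_root):
    """Write one <task>.json file per GLUE task and return the paths."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for task in GLUE_TASKS:
        path = config_dir / f"{task}.json"
        config = build_config(task, weights_dir, glue_root, output_root)
        path.write_text(json.dumps(config, indent=4))
        paths.append(path)
    return paths


if __name__ == "__main__":
    example = build_config("MNLI", "/path/to/pretrained/weights/",
                           "/path/to/glue", "/path/to/save/models")
    print(json.dumps(example, indent=4))
```

Calling `write_configs("configs", ...)` would then produce files such as configs/MNLI.json, each of which can be passed to run_glue.py.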

Fine-tune
```
python run_glue.py /path/to/json/
```
Instead of specifying a JSON path, you can pass the configs directly as command line arguments.
You can also monitor training with TensorBoard.
The --help option displays a help message.

SQuAD

Download SQuAD data

cd ./utils
python download_squad_data.py --save_dir=/path/to/squad/

Fine-tune

cd ..
export SQUAD_DIR=/path/to/squad/
python run_squad.py \
--model_type roberta \
--model_name_or_path=/path/to/pretrained/weights/ \
--tokenizer_name roberta-base \
--do_train \
--do_eval \
--do_lower_case \
--data_dir=$SQUAD_DIR \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 16 \
--per_gpu_eval_batch_size 32 \
--learning_rate 3e-5 \
--weight_decay=0.01 \
--warmup_steps=3327 \
--num_train_epochs 10.0 \
--max_seq_length 384 \
--doc_stride 128 \
--logging_steps=278 \
--save_steps=50000 \
--patience=5 \
--objective_type=maximize \
--metric_name=f1 \
--overwrite_output_dir \
--evaluate_during_training \
--output_dir=/path/to/save/weights/ \
--logging_dir=/path/to/save/logs/ \
--seed=42

As with pre-training, you can monitor fine-tuning with TensorBoard.
The --help option displays a help message.
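
The --patience, --objective_type, and --metric_name flags in the command above suggest patience-based early stopping on the evaluation metric. Their exact semantics in run_squad.py are not documented here, so the sketch below only shows the conventional behaviour: stop once the monitored metric has failed to improve for `patience` consecutive evaluations. The `EarlyStopper` class is hypothetical and written for this example.

```python
class EarlyStopper:
    """Conventional patience-based early stopping (illustrative sketch)."""

    def __init__(self, patience=5, objective_type="maximize"):
        if objective_type not in ("maximize", "minimize"):
            raise ValueError("objective_type must be 'maximize' or 'minimize'")
        self.patience = patience
        # Flip the sign when minimizing so that "larger is better" always holds.
        self.sign = 1.0 if objective_type == "maximize" else -1.0
        self.best = None
        self.bad_evals = 0

    def update(self, metric):
        """Record one evaluation result; return True if training should stop."""
        score = self.sign * metric
        if self.best is None or score > self.best:
            self.best = score
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopper(patience=2, objective_type="maximize")
    for step, f1 in enumerate([80.0, 81.0, 80.5, 80.9]):
        if stopper.update(f1):
            print(f"stopping at evaluation {step}")
            break
```

With patience=2, the loop above stops at the fourth evaluation, because the two evaluations after the best F1 of 81.0 do not improve on it.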

Citation

@inproceedings{yamaguchi-etal-2021-frustratingly,
    title = "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling",
    author = "Yamaguchi, Atsuki  and
      Chrysostomou, George  and
      Margatina, Katerina  and
      Aletras, Nikolaos",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

License

MIT License
