Train 🤗transformers with DeepSpeed: ZeRO-2, ZeRO-3

Overview

Fork from https://github.com/huggingface/transformers/tree/86d5fb0b360e68de46d40265e7c707fe68c8015b/examples/pytorch/language-modeling at 2021.05.17.

Language model training

Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, ALBERT, BERT, DistilBERT, RoBERTa, XLNet... GPT and GPT-2 are trained or fine-tuned using a causal language modeling (CLM) loss while ALBERT, BERT, DistilBERT and RoBERTa are trained or fine-tuned using a masked language modeling (MLM) loss. XLNet uses permutation language modeling (PLM), you can find more information about the differences between those objectives in our model summary.

There are two sets of scripts provided. The first set leverages the Trainer API. The second set with no_trainer in the suffix uses a custom training loop and leverages the 🤗 Accelerate library . Both sets use the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.

Note: The old script run_language_modeling.py is still available here.

The following examples, will run on datasets hosted on our hub or with your own text files for training and validation. We give examples of both below.

GPT-2/GPT and causal language modeling

The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before the tokenization). The loss here is that of causal language modeling.

python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm

This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches a score of ~20 perplexity once fine-tuned on the dataset.

To run on your own training and validation files, use the following command:

python run_clm.py \
    --model_name_or_path gpt2 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm

This uses the built in HuggingFace Trainer for training. If you want to use a custom training loop, you can utilize or adapt the run_clm_no_trainer.py script. Take a look at the script for a list of supported arguments. An example is shown below:

python run_clm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path gpt2 \
    --output_dir /tmp/test-clm

RoBERTa/BERT/DistilBERT and masked language modeling

The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their pre-training: masked language modeling.

In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge slightly slower (over-fitting takes more epochs).

python run_mlm.py \
    --model_name_or_path roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm

To run on your own training and validation files, use the following command:

python run_mlm.py \
    --model_name_or_path roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm

If your dataset is organized with one sample per line, you can use the --line_by_line flag (otherwise the script concatenates all texts and then splits them in blocks of the same length).

This uses the built in HuggingFace Trainer for training. If you want to use a custom training loop, you can utilize or adapt the run_mlm_no_trainer.py script. Take a look at the script for a list of supported arguments. An example is shown below:

python run_mlm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path roberta-base \
    --output_dir /tmp/test-mlm

Note: On TPU, you should use the flag --pad_to_max_length in conjunction with the --line_by_line flag to make sure all your batches have the same length.

Whole word masking

This part was moved to examples/research_projects/mlm_wwm.

XLNet and permutation language modeling

XLNet uses a different training objective, which is permutation language modeling. It is an autoregressive method to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order.

We use the --plm_probability flag to define the ratio of length of a span of masked tokens to surrounding context length for permutation language modeling.

The --max_span_length flag may also be used to limit the length of a span of masked tokens used for permutation language modeling.

Here is how to fine-tune XLNet on wikitext-2:

python run_plm.py \
    --model_name_or_path=xlnet-base-cased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm

To fine-tune it on your own training and validation file, run:

python run_plm.py \
    --model_name_or_path=xlnet-base-cased \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm

If your dataset is organized with one sample per line, you can use the --line_by_line flag (otherwise the script concatenates all texts and then splits them in blocks of the same length).

Note: On TPU, you should use the flag --pad_to_max_length in conjunction with the --line_by_line flag to make sure all your batches have the same length.

Owner
Junbum Lee
AI/NLP Researcher who loves python, web, and data. And who believes technology would make the world a better place.
Junbum Lee
Machine learning classifiers to predict American Sign Language .

ASL-Classifiers American Sign Language (ASL) is a natural language that serves as the predominant sign language of Deaf communities in the United Stat

Tarek idrees 0 Feb 08, 2022
ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

ThinkTwice ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A

Walle 4 Aug 06, 2021
An open source library for deep learning end-to-end dialog systems and chatbots.

DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. DeepPavlov is designed for development of production re

Neural Networks and Deep Learning lab, MIPT 6k Dec 30, 2022
Outreachy TFX custom component project

Schema Curation Custom Component Outreachy TFX custom component project This repo contains the code for Schema Curation Custom Component made as a par

Robert Crowe 5 Jul 16, 2021
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specia

Zihan Liu 89 Nov 10, 2022
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

TextBlob: Simplified Text Processing Homepage: https://textblob.readthedocs.io/ TextBlob is a Python (2 and 3) library for processing textual data. It

Steven Loria 8.4k Dec 26, 2022
Code for hyperboloid embeddings for knowledge graph entities

Implementation for the papers: Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs, Nurendra Choudhary, Nikhil Rao,

30 Dec 10, 2022
Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models

Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Prithivida 681 Jan 01, 2023
A combination of autoregressors and autoencoders using XLNet for sentiment analysis

A combination of autoregressors and autoencoders using XLNet for sentiment analysis Abstract In this paper sentiment analysis has been performed in or

James Zaridis 2 Nov 20, 2021
Sentello is python script that simulates the anti-evasion and anti-analysis techniques used by malware.

sentello Sentello is a python script that simulates the anti-evasion and anti-analysis techniques used by malware. For techniques that are difficult t

Malwation 62 Oct 02, 2022
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 05, 2022
NLTK Source

Natural Language Toolkit (NLTK) NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting

Natural Language Toolkit 11.4k Jan 04, 2023
Natural Language Processing

NLP Natural Language Processing apps Multilingual_NLP.py start #This script is demonstartion of Mul

Ritesh Sharma 1 Oct 31, 2021
ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost LOVE is accpeted by ACL22 main conference as a long pape

Lihu Chen 32 Jan 03, 2023
This repository contains (not all) code from my project on Named Entity Recognition in philosophical text

NERphilosophy 👋 Welcome to the github repository of my BsC thesis. This repository contains (not all) code from my project on Named Entity Recognitio

Ruben 1 Jan 27, 2022
Repositório do trabalho de introdução a NLP

Trabalho da disciplina de BI NLP Repositório do trabalho da disciplina Introdução a Processamento de Linguagem Natural da pós BI-Master da PUC-RIO. Eq

Leonardo Lins 1 Jan 18, 2022
Mkdocs + material + cool stuff

Modern-Python-Doc-Example mkdocs + material + cool stuff Doc is live here Features out of the box amazing good looking website thanks to mkdocs.org an

Francesco Saverio Zuppichini 61 Oct 26, 2022
pyupbit 라이브러리를 활용하여 upbit에서 비트코인을 자동매매하는 코드입니다. 조코딩 유튜브 채널에서 자세한 강의 영상을 보실 수 있습니다.

파이썬 비트코인 투자 자동화 강의 코드 by 유튜브 조코딩 채널 pyupbit 라이브러리를 활용하여 upbit 거래소에서 비트코인 자동매매를 하는 코드입니다. 파일 구성 test.py : 잔고 조회 (1강) backtest.py : 백테스팅 코드 (2강) bestK.p

조코딩 JoCoding 186 Dec 29, 2022
EasyTransfer is designed to make the development of transfer learning in NLP applications easier.

EasyTransfer is designed to make the development of transfer learning in NLP applications easier. The literature has witnessed the success of applying

Alibaba 819 Jan 03, 2023
Code for the paper "Language Models are Unsupervised Multitask Learners"

Status: Archive (code is provided as-is, no updates expected) gpt-2 Code and models from the paper "Language Models are Unsupervised Multitask Learner

OpenAI 16.1k Jan 08, 2023