This repository contains the code for running the character-level Sandwich Transformers from our ACL 2020 paper on Improving Transformer Models by Reordering their Sublayers.

Last update: Sep 26, 2022

Improving Transformer Models by Reordering their Sublayers

This repository contains the code for running the character-level Sandwich Transformers from our ACL 2020 paper, Improving Transformer Models by Reordering their Sublayers. A video presentation and a summary of the paper are also available.

Our character-level model (and this repo) is based on the Adaptive Attention Span for Transformers model. In our paper we showed that by simply reordering that model's self-attention and feedforward sublayers, we could improve performance on the enwik8 benchmark (where we achieve 0.968 BPC on the test set).

The code here simply adds a way to reorder the sublayers of the Adaptive Span model, using the --architecture parameter.

If you use this code or results from our paper, please cite:

@inproceedings{press-etal-2020-improving,
    title = "Improving Transformer Models by Reordering their Sublayers",
    author = "Press, Ofir and Smith, Noah A. and Levy, Omer",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.270",
    doi = "10.18653/v1/2020.acl-main.270",
    pages = "2996--3005",
}

Requirements

You need CUDA 10 and PyTorch 1.2.0 to run this code. See this page for installation instructions. To replicate our experimental conditions, you need eight V100 GPUs.

Running experiments in the paper

The scripts for training the character-level models from the paper are located in the ./experiments/ directory. For example, to train the enwik8 model, run:

bash experiments/enwik8_large.sh

We used eight V100 GPUs. If you'd like to run this model on GPUs with less memory, you can increase the --batch-split parameter, which splits each batch into smaller pieces without changing the final result.
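To illustrate why splitting a batch need not change the result, here is a minimal pure-Python sketch of gradient accumulation on a toy one-parameter model. It is not this repository's implementation: the functions grad_mse and grad_mse_split are hypothetical, and the sketch assumes the loss is a mean over examples and that the batch divides evenly into the requested number of pieces.

```python
from typing import Sequence


def grad_mse(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def grad_mse_split(w: float, xs: Sequence[float], ys: Sequence[float],
                   batch_split: int) -> float:
    """Same gradient, accumulated over `batch_split` equally sized pieces.

    Each piece contributes its own mean gradient scaled by 1 / batch_split,
    so the accumulated value matches the full-batch gradient up to
    floating-point rounding.
    """
    n = len(xs)
    if n % batch_split != 0:
        raise ValueError("batch size must be divisible by batch_split")
    chunk = n // batch_split
    total = 0.0
    for start in range(0, n, chunk):
        piece_grad = grad_mse(w, xs[start:start + chunk], ys[start:start + chunk])
        total += piece_grad / batch_split
    return total


# Toy data (made up for illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
full = grad_mse(0.5, xs, ys)
split = grad_mse_split(0.5, xs, ys, batch_split=2)
print(full, split)
```

Only one piece has to fit in memory at a time, which is why a larger split value helps on smaller GPUs at the cost of more sequential work per step.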

We obtained the following results in our experiments:

Experiment	#params	valid (bpc)	test (bpc)
enwik8 Sandwich Transformer	209M	0.992	0.968
text8 Sandwich Transformer	209M	1.012	1.076

The `--architecture` parameter

A standard transformer with 3 layers (so 6 self-attention and feedforward sublayers) would be trained using --architecture sfsfsf. That 6-sublayer model with a sandwiching coefficient of 1 would be --architecture s.sfsf.f, and with a sandwiching coefficient of 2 it would be --architecture s.s.sf.f.f. Make sure to also set the --nlayers parameter to the length of the architecture string divided by 2.
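The examples above follow a simple pattern: k self-attention sublayers, then n - k interleaved sf pairs, then k feedforward sublayers, where n is the number of layers and k is the sandwiching coefficient. The sketch below builds such strings. The helpers sandwich_architecture and nlayers_for are hypothetical and not part of this repository. Whether the training script accepts the dots shown in the examples is not verified here, so the sketch omits them by default and counts only the s and f characters when computing the value for --nlayers.

```python
def sandwich_architecture(n_layers: int, k: int, sep: str = "") -> str:
    """Build an architecture string of k 's', then (n_layers - k) 'sf' pairs, then k 'f'.

    `sep` is placed between the leading 's' sublayers, the interleaved middle
    block, and the trailing 'f' sublayers (use "." to mirror the notation in
    the examples above).
    """
    if not 0 <= k <= n_layers:
        raise ValueError("sandwiching coefficient must be between 0 and n_layers")
    middle = "sf" * (n_layers - k)
    parts = ["s"] * k + ([middle] if middle else []) + ["f"] * k
    return sep.join(parts)


def nlayers_for(architecture: str) -> int:
    """Number of layers: count of 's' and 'f' sublayers divided by 2."""
    sublayers = sum(1 for ch in architecture if ch in "sf")
    if sublayers % 2 != 0:
        raise ValueError("architecture must contain an even number of sublayers")
    return sublayers // 2


# Example: the three 3-layer configurations described above.
for k in range(3):
    arch = sandwich_architecture(3, k, sep=".")
    print(arch, nlayers_for(arch))
```

Keeping the separator optional makes it easy to print the dotted form for readability while passing a plain string of s and f characters to a script if that is what it expects.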

License

The code is licensed under the CC-BY-NC license. See the LICENSE file for more details.

Acknowledgements + More Information

This code is based on the code of the Adaptive Span model. We recommend reading the Adaptive Span README for further information on this codebase.

Owner

Ofir Press
