Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Last update: Jan 03, 2023

Overview

Text-AutoAugment (TAA)

This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP 2021 main conference).

Overview

We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.
In low-resource and class-imbalanced regimes of six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks like BERT and effectively boosts text classification performance.

Getting Started

Prepare environment

conda create -n taa python=3.6
conda activate taa
conda install pytorch torchvision cudatoolkit=10.0 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
pip install -r requirements.txt 
python -c "import nltk; nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')"

Modify dataroot parameter in confs/*yaml and abspath parameter in script/*.sh:
- e.g., change dataroot: /home/renshuhuai/TextAutoAugment/data/aclImdb in confs/bert_imdb.yaml to dataroot: path-to-your-TextAutoAugment/data/aclImdb
- change --abspath '/home/renshuhuai/TextAutoAugment' in script/imdb_lowresource.sh to --abspath 'path-to-your-TextAutoAugment'
Search for the best augmentation policy, e.g., low-resource regime for IMDB:
```
sh script/imdb_lowresource.sh
```
scripts for policy search in the low-resource and class-imbalanced regime for all datasets are provided in the script/ fold.
Train a model with pre-searched policy in archive.py, e.g., train model in low-resource regime for IMDB:
```
python train.py -c confs/bert_imdb.yaml 
```
train model on full dataset of IMDB:
```
python train.py -c confs/bert_imdb.yaml --train-npc -1 --valid-npc -1 --test-npc -1  
```

Contact

If you have any questions related to the code or the paper, feel free to email Shuhuai (renshuhuai007 [AT] gmail [DOT] com).

Acknowledgments

Code refers to: fast-autoaugment.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{ren2021taa,
  title={Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification},
  author={Shuhuai Ren, Jinchao Zhang, Lei Li, Xu Sun, Jie Zhou},
  booktitle={EMNLP},
  year={2021}
}

License

MIT

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Related tags

Overview

Text-AutoAugment (TAA)

Overview

Getting Started

Contact

Acknowledgments

Citation

License

Owner

LancoPKU

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Library for Russian imprecise rhymes generation

The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.

Bu Chatbot, Konya Bilim Merkezi Yen için tasarlanmış olan bir projedir.

Translate - a PyTorch Language Library

🦆 Contextually-keyed word vectors

A simple word search made in python

Weakly-supervised Text Classification Based on Keyword Graph

Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Multilingual text (NLP) processing toolkit

Unsupervised Abstract Reasoning for Raven’s Problem Matrices

Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

The code for two papers: Feedback Transformer and Expire-Span.

pyMorfologik MorfologikpyMorfologik - Python binding for Morfologik.

Python wrapper for Stanford CoreNLP tools v3.4.1

This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL 2021.

Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"

A fast Text-to-Speech (TTS) model. Work well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far). 快速语音合成模型，适用于英语、普通话/中文、日语、韩语、俄语和藏语（当前已测试）。