EMNLP 2021 paper "Pre-train or Annotate? Domain Adaptation with a Constrained Budget".

Overview

Pre-train or Annotate? Domain Adaptation with a Constrained Budget

This repo contains code and data associated with EMNLP 2021 paper "Pre-train or Annotate? Domain Adaptation with a Constrained Budget".

@inproceedings{bai-etal-2021-pre,
    title = "Pre-train or Annotate? Domain Adaptation with a Constrained Budget",
    author = "Bai, Fan  and
              Ritter, Alan  and
              Xu, Wei",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
}

Installment

git clone https://github.com/bflashcp3f/ProcBERT.git
cd ProcBERT
conda env create -f environment.yml
conda activate procbert

Data & Model Checkpoints

Three procedural-text datasets (WLP, PubMed and ChemSyn) can be downloaded here, and model checkpoints (ProcBERT and Proc-RoBERTa) are accessible through HuggingFace.

Experiment

Setup

# After downloading the data, update the DATA_PATH variable in code/utils.py
DATA_PATH=<DATA_PATH>

Budget-aware Domain Adaptation Experiments (with EasyAdapt)

# Named Entity Recognition (NER) 
python code/ner_da_budget.py     \
  --lm_model procbert     \
  --src_data pubmed     \
  --tgt_data chemsyn     \
  --gpu_ids 0,1   \
  --output_dir ./output/da/pubmed_chemsyn     \
  --learning_rate 1e-5     \
  --task_name fa_ner     \
  --batch_size 16     \
  --max_len 512    \
  --epochs 25 \
  --budget 700 \
  --alpha 1   \
  --save_model

# Relation Extraction (RE)
python code/rel_da_budget.py \
  --lm_model procbert \
  --src_data pubmed     \
  --tgt_data chemsyn     \
  --gpu_ids 0,1  \
  --output_dir ./output/da/pubmed_chemsyn \
  --learning_rate 1e-5 \
  --task_name fa_rel \
  --batch_size 48 \
  --max_len 256 \
  --epochs 5 \
  --budget 700 \
  --alpha 1 \
  --down_sample \
  --down_sample_rate 0.4 \
  --save_model

To obtain ProcBERT results with different budgets under six domain adaptation settings:

# NER
sh script/ner/run_ner_da_budget_all.sh

# RE
sh script/rel/run_rel_da_budget_all.sh

Budget-aware Target-domain-only Experiments

# Named Entity Recognition (NER) 
python code/ner_budget.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name ner \
  --batch_size 16 \
  --max_len 512 \
  --epochs 25 \
  --budget 700 \
  --save_model

# Relation Extraction (RE)
python code/rel_budget.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name rel \
  --batch_size 48 \
  --max_len 256 \
  --epochs 5 \
  --budget 700 \
  --down_sample \
  --down_sample_rate 0.4 \
  --save_model

To obtain ProcBERT results with different budgets on three datasets:

# NER
sh script/ner/run_ner_budget_all.sh

# RE
sh script/rel/run_rel_budget_all.sh

Auxiliary Experiments

# Named Entity Recognition (NER) 
python code/ner.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name ner \
  --batch_size 16 \
  --max_len 512 \
  --epochs 20 \
  --save_model

# Relation Extraction (RE)
python code/rel.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name rel \
  --batch_size 48 \
  --max_len 256 \
  --epochs 5 \
  --down_sample \
  --down_sample_rate 0.4 \
  --save_model

To obtain ProcBERT results on all three datasets:

# NER
sh script/ner/run_ner_all.sh

# RE
sh script/rel/run_rel_all.sh
Owner
Fan Bai
Fan Bai
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022
This project deals with a simplified version of a more general problem of Aspect Based Sentiment Analysis.

Aspect_Based_Sentiment_Extraction Created on: 5th Jan, 2022. This project deals with an important field of Natural Lnaguage Processing - Aspect Based

Naman Rastogi 4 Jan 01, 2023
code for "AttentiveNAS Improving Neural Architecture Search via Attentive Sampling"

AttentiveNAS: Improving Neural Architecture Search via Attentive Sampling This repository contains PyTorch evaluation code, training code and pretrain

Facebook Research 94 Oct 26, 2022
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

ALBERT ***************New March 28, 2020 *************** Add a colab tutorial to run fine-tuning for GLUE datasets. ***************New January 7, 2020

Google Research 3k Dec 26, 2022
Ongoing research training transformer language models at scale, including: BERT & GPT-2

Megatron (1 and 2) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.

NVIDIA Corporation 3.5k Dec 30, 2022
Simple Python library, distributed via binary wheels with few direct dependencies, for easily using wav2vec 2.0 models for speech recognition

Wav2Vec2 STT Python Beta Software Simple Python library, distributed via binary wheels with few direct dependencies, for easily using wav2vec 2.0 mode

David Zurow 22 Dec 29, 2022
A telegram bot to translate 100+ Languages

🔥 GOOGLE TRANSLATER 🔥 The owner would not be responsible for any kind of bans due to the bot. • ⚡ INSTALLING ⚡ • • 🔰 Deploy To Railway 🔰 • • ✅ OFF

Aɴᴋɪᴛ Kᴜᴍᴀʀ 5 Dec 20, 2021
German Text-To-Speech Engine using Tacotron and Griffin-Lim

jotts JoTTS is a German text-to-speech engine using tacotron and griffin-lim. The synthesizer model has been trained on my voice using Tacotron1. Due

padmalcom 6 Aug 28, 2022
A sample project that exists for PyPUG's "Tutorial on Packaging and Distributing Projects"

A sample Python project A sample project that exists as an aid to the Python Packaging User Guide's Tutorial on Packaging and Distributing Projects. T

Python Packaging Authority 4.5k Dec 30, 2022
Weakly-supervised Text Classification Based on Keyword Graph

Weakly-supervised Text Classification Based on Keyword Graph How to run? Download data Our dataset follows previous works. For long texts, we follow C

Hello_World 20 Dec 29, 2022
A framework for cleaning Chinese dialog data

A framework for cleaning Chinese dialog data

Yida 136 Dec 20, 2022
DeLighT: Very Deep and Light-Weight Transformers

DeLighT: Very Deep and Light-weight Transformers This repository contains the source code of our work on building efficient sequence models: DeFINE (I

Sachin Mehta 440 Dec 18, 2022
A large-scale (194k), Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

MedMCQA MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering A large-scale, Multiple-Choice Question Answe

MedMCQA 24 Nov 30, 2022
Abhijith Neil Abraham 2 Nov 05, 2021
A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis

WaveGlow A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis Quick Start: Install requirements: pip install

Yuchao Zhang 204 Jul 14, 2022
New Modeling The Background CodeBase

Modeling the Background for Incremental Learning in Semantic Segmentation This is the updated official PyTorch implementation of our work: "Modeling t

Fabio Cermelli 9 Dec 28, 2022
LightSeq: A High-Performance Inference Library for Sequence Processing and Generation

LightSeq is a high performance inference library for sequence processing and generation implemented in CUDA. It enables highly efficient computation of modern NLP models such as BERT, GPT2, Transform

Bytedance Inc. 2.5k Jan 03, 2023
Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE)

VK.com 847 Dec 19, 2022
Subtitle Workshop (subshop): tools to download and synchronize subtitles

SUBSHOP Tools to download, remove ads, and synchronize subtitles. SUBSHOP Purpose Limitations Required Web Credentials Installation, Configuration, an

Joe D 4 Feb 13, 2022
Code repository of the paper Neural circuit policies enabling auditable autonomy published in Nature Machine Intelligence

Code repository of the paper Neural circuit policies enabling auditable autonomy published in Nature Machine Intelligence

9 Jan 08, 2023