Code for EMNLP2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training"

Last update: Jul 28, 2022

VoCapXLM

Environment

Docker image: dancingsoul/pytorch:VoCapXLM

Manually build sentencepiece with the following commands:

cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

Data Preparation

Create a folder with mkdir -p monolingual_text in the root of this project.
Sample a monolingual corpus for each language individually and move each file into the monolingual_text directory, named after its language code (e.g., en.txt).
Sample the multilingual corpus from monolingual corpora with the following command:

python sample_multilingual_corpus.py \
    --lang_prob_path ./lang_prob_wiki.json \
    --input_dir ./monolingual_text/ \
    --output_path ./multilingual_corpus.text \
    --n_sample <n_sample> --beta <beta> --rescale

where the options are described as follows:

--lang_prob_path: path to a JSON file giving the probability of sampling training instances from each language during pre-training. lang_prob_wiki.json is computed from the Wikipedia corpus, with the probabilities rescaled using alpha=0.7 as in Equation (3) of our paper.
--n_sample: number of sentences in the multilingual corpus on which the final multilingual sentencepiece model is trained; the default value is 20000000.
--rescale: further rescale the probabilities with another factor, beta, from Equation (2) of our paper.
--beta: the rescaling factor in Equation (2); the default value is 0.7.
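
The effect of a rescaling factor such as beta can be illustrated with a minimal sketch. It assumes that rescaling means raising each probability to the given power and renormalizing; this is a common smoothing scheme and may differ in detail from what sample_multilingual_corpus.py does. The function names and the toy distribution are hypothetical and are not taken from lang_prob_wiki.json.

```python
def rescale(probs, beta):
    """Raise each probability to the power beta, then renormalize to sum to 1.

    Assumption: this exponent-and-renormalize form is only an illustration of
    rescaling; the repository script may implement it differently.
    """
    powered = {lang: p ** beta for lang, p in probs.items()}
    total = sum(powered.values())
    return {lang: value / total for lang, value in powered.items()}


def sentences_per_language(probs, n_sample):
    """Split a sentence budget across languages in proportion to probs."""
    return {lang: int(round(p * n_sample)) for lang, p in probs.items()}


# Toy distribution (made-up values standing in for lang_prob_wiki.json).
lang_prob = {"en": 0.6, "fr": 0.3, "sw": 0.1}
rescaled = rescale(lang_prob, beta=0.7)

print(rescaled)
print(sentences_per_language(rescaled, n_sample=1000))
```

With a factor below 1, the distribution flattens: low-resource languages receive a larger share of the sampled sentences than their raw probability would give them.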

Training Monolingual SentencePiece Models

Train monolingual sentencepiece models of different sizes to obtain vocabularies with different ALP (average log probability) values, i.e., different language-specific vocabulary capacities.

python train_mono_spm.py \
    --input_dir ./monolingual_text/ \
    --output_dir ./monolingual_spm/ \
    --languages <all_languages> \
    --min_vocab_size <min_vocab_size> \
    --max_vocab_size <max_vocab_size> \
    --delta_vocab_size <delta_vocab_size> \
    --n_sample <n_sample>

where the options are described as follows:

--languages: all languages under the monolingual_text directory, separated by commas, e.g. en,fr,zh.
--min_vocab_size: minimum vocabulary size allocated to each language; the default value is 1000.
--max_vocab_size: maximum vocabulary size allocated to each language; the default value is 50000.
--delta_vocab_size: the step between successive vocabulary sizes to learn; the default value is 1000.
--n_sample: the number of sentences used to calculate ALP for each language; the default value is 1000000.

Alternatively, you can download our pre-trained monolingual sentencepiece models and vocabularies from [here][2].
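
As a rough illustration of this step, the sketch below enumerates the vocabulary sizes implied by the default options and computes ALP on toy data. It assumes that the size range is inclusive of --max_vocab_size and that ALP is the average, over sentences, of the summed unigram log-probabilities of each sentence's subwords. Both are assumptions made for illustration, and the helper names and toy values are hypothetical rather than taken from train_mono_spm.py.

```python
import math


def vocab_size_grid(min_size, max_size, delta):
    """Vocabulary sizes from min_size to max_size (inclusive) in steps of delta."""
    return list(range(min_size, max_size + 1, delta))


def average_log_probability(tokenized_sentences, piece_log_prob):
    """Mean over sentences of the summed unigram log-probabilities of their pieces.

    Assumption: this is an illustrative reading of ALP, not the script's code.
    """
    total = 0.0
    for pieces in tokenized_sentences:
        total += sum(piece_log_prob[piece] for piece in pieces)
    return total / len(tokenized_sentences)


# Default options from the command above.
sizes = vocab_size_grid(1000, 50000, 1000)
print(len(sizes), sizes[0], sizes[-1])

# Toy unigram model and toy tokenized corpus (made-up values).
toy_log_prob = {
    "▁hello": math.log(0.5),
    "▁wor": math.log(0.25),
    "ld": math.log(0.25),
}
toy_corpus = [["▁hello"], ["▁hello", "▁wor", "ld"]]
alp = average_log_probability(toy_corpus, toy_log_prob)
print(alp)
```

Under these assumptions, the default options yield 50 monolingual models per language, one for each vocabulary size.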

Allocating Multilingual Vocabulary

Allocate the multilingual vocabulary from monolingual vocabularies:

python train_vocap.py \
    --lang_prob_path ./lang_prob_wiki.json \
    --input_dir ./monolingual_spm/ \
    --output_path ./multilingual.vocab \
    --beta <beta> --rescale --target_vocab_size <target_vocab_size>

where the options are described as follows:

--lang_prob_path: same as above.
--rescale: same as above.
--beta: same as above.
--target_vocab_size: the desired size of the multilingual vocabulary; the default value is 500000.
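
One way to picture the allocation is as a greedy search over per-language ALP curves. The sketch below is a simplified illustration and is not the logic of train_vocap.py: it assumes each language has an ALP value for every candidate vocabulary size, weights ALP gains by the (rescaled) language probability, and approximates the multilingual vocabulary size by the sum of the per-language sizes, ignoring subwords shared across languages. All names and numbers are hypothetical.

```python
def greedy_allocate(alp_curves, weights, target_size):
    """Greedily grow per-language vocabulary sizes until the budget is used.

    alp_curves: {lang: {vocab_size: ALP}}, one entry per candidate size.
    weights:    {lang: weight}, e.g. rescaled sampling probabilities.
    Assumption: the total size is the plain sum of per-language sizes.
    """
    # Start every language at its smallest candidate size.
    sizes = {lang: min(curve) for lang, curve in alp_curves.items()}
    while True:
        best_lang, best_gain, best_next = None, float("-inf"), None
        total = sum(sizes.values())
        for lang, curve in alp_curves.items():
            larger = [s for s in curve if s > sizes[lang]]
            if not larger:
                continue  # this language is already at its maximum size
            next_size = min(larger)
            if total - sizes[lang] + next_size > target_size:
                continue  # growing this language would exceed the budget
            gain = weights[lang] * (curve[next_size] - curve[sizes[lang]])
            if gain > best_gain:
                best_lang, best_gain, best_next = lang, gain, next_size
        if best_lang is None:
            return sizes  # no language can grow any further
        sizes[best_lang] = best_next


# Toy ALP curves (made-up values): ALP improves with size, with diminishing returns.
alp_curves = {
    "en": {1000: -200.0, 2000: -180.0, 3000: -170.0},
    "sw": {1000: -260.0, 2000: -250.0, 3000: -245.0},
}
weights = {"en": 0.7, "sw": 0.3}
allocation = greedy_allocate(alp_curves, weights, target_size=4000)
print(allocation)
```

In this toy run the higher-weighted language with the steeper ALP curve absorbs the whole budget, which is why flattening the weights with beta matters for low-resource languages.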

Then use sentencepiece to train the tokenizer given the multilingual vocabulary:

spm_train --input=./multilingual_corpus.text --model_prefix=<model_name> --vocab_size=<target_vocab_size> \
--character_coverage=0.9995 --model_type=unigram --shuffle_input_sentence=true \
--input_sentence_size=<input_sentence_size> --vocab_path=./multilingual.vocab

where the options are described as follows:

--model_prefix: output model name prefix; <model_name>.model and <model_name>.vocab are generated.
--character_coverage: fraction of characters covered by the model.
--vocab_size: same as --target_vocab_size.
--vocab_path: path to the subwords that are required in the final learned tokenizer.
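
After training, it can be useful to confirm that every required subword ended up in the learned vocabulary. The sketch below assumes that both multilingual.vocab and the generated <model_name>.vocab hold one entry per line with the subword in the first tab-separated column. That matches the usual sentencepiece .vocab layout, but the format of multilingual.vocab is an assumption here. The helper names and sample lines are hypothetical.

```python
def read_pieces(lines):
    """Return the first tab-separated field of each non-empty line."""
    return [line.rstrip("\n").split("\t")[0] for line in lines if line.strip()]


def missing_pieces(required_lines, trained_lines):
    """List required subwords that do not appear in the trained vocabulary."""
    trained = set(read_pieces(trained_lines))
    return [piece for piece in read_pieces(required_lines) if piece not in trained]


# In-memory stand-ins for the two files (made-up entries).
required = ["▁the\t-3.1\n", "▁la\t-3.4\n"]
trained = ["<unk>\t0\n", "<s>\t0\n", "</s>\t0\n", "▁the\t-3.0\n"]

print(missing_pieces(required, trained))
```

To check real files, pass open file handles instead of the lists, for example open("multilingual.vocab", encoding="utf-8"); an empty result means all required subwords are present.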

Paper

Please cite our paper \cite{bo2021vocapxlm} if you find the resources in this repository useful.

@inproceedings{bo2021vocapxlm,
author = {Bo Zheng and Li Dong and Shaohan Huang and Saksham Singhal and Wanxiang Che and Ting Liu and Xia Song and Furu Wei},
booktitle = {Proceedings of EMNLP 2021},
title = {{Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training}},
year = {2021}
}

Owner

Bo Zheng
