This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".

Last update: Dec 25, 2022

Models

We are releasing a slightly better checkpoint than the one reported in the paper, pretrained on 27.5 GB of data with more code-switched and code-mixed text, and pretrained further for 2.5M steps. The pretrained model checkpoint is available here. To use this model for the downstream tasks supported in this repository, see Training & Evaluation.

Note: This model was pretrained using a specific normalization pipeline available here. All finetuning scripts in this repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized with this pipeline before tokenizing to get the best results. A basic example is available on the model page.
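The ordering described in the note (normalize first, tokenize second) can be sketched as follows. This is a minimal illustration only: `normalize_text` is a stand-in built on Python's standard `unicodedata` module and `whitespace_tokenize` is a toy tokenizer, both hypothetical names. Neither is the repository's normalization pipeline or the model's tokenizer, which should be used for real work.

```python
import re
import unicodedata
from typing import List


def normalize_text(text: str) -> str:
    """Stand-in normalizer (assumption, NOT the repository's pipeline).

    Applies NFKC Unicode normalization, then collapses runs of whitespace.
    """
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def whitespace_tokenize(text: str) -> List[str]:
    """Toy tokenizer (assumption): split on whitespace."""
    return text.split()


def preprocess(text: str) -> List[str]:
    # Normalize BEFORE tokenizing, so that visually identical strings
    # with different code point sequences yield the same tokens.
    return whitespace_tokenize(normalize_text(text))


if __name__ == "__main__":
    # Two encodings of the same Bangla letter: precomposed vs. base + nukta.
    precomposed = "\u09DF"
    decomposed = "\u09AF\u09BC"
    print(preprocess(precomposed) == preprocess(decomposed))
```

Without the normalization step the two inputs would reach the tokenizer as different strings, which is the kind of mismatch the note warns about.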

Datasets

We are also releasing the Bangla Natural Language Inference (NLI) dataset introduced in the paper. The dataset can be found here.

Setup

To install the necessary requirements, run the following commands:

$ git clone https://github.com/csebuetnlp/banglabert
$ cd banglabert/
$ conda create python==3.7.9 pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ bash setup.sh

Use the newly created environment for running the scripts in this repository.

Training & Evaluation

To use the pretrained model for finetuning or inference on different downstream tasks, see the following sections:

Sequence Classification.
- For single sequence classification such as
  - Document classification
  - Sentiment classification
  - Emotion classification etc.
- For double sequence classification such as
  - Natural Language Inference (NLI)
  - Paraphrase detection etc.
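The difference between single and double sequence classification lies mainly in how the input is assembled. The sketch below shows the usual BERT-style layout, assuming `[CLS]` and `[SEP]` special tokens and 0/1 segment ids. `build_inputs` is a hypothetical helper written for illustration; in practice the model's tokenizer produces this layout, and the finetuning scripts in this repository may differ in detail.

```python
from typing import List, Optional, Tuple

CLS, SEP = "[CLS]", "[SEP]"  # assumed BERT-style special tokens


def build_inputs(
    first: List[str], second: Optional[List[str]] = None
) -> Tuple[List[str], List[int]]:
    """Assemble tokens and segment ids for one or two token sequences."""
    # Single sequence: [CLS] first [SEP], all in segment 0.
    tokens = [CLS] + first + [SEP]
    segment_ids = [0] * len(tokens)
    if second is not None:
        # Sequence pair: append second [SEP], marked as segment 1.
        tokens += second + [SEP]
        segment_ids += [1] * (len(second) + 1)
    return tokens, segment_ids


if __name__ == "__main__":
    # Single sequence, e.g. sentiment classification.
    print(build_inputs(["good", "movie"]))
    # Sequence pair, e.g. NLI premise and hypothesis.
    print(build_inputs(["it", "rains"], ["it", "is", "wet"]))
```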

Token Classification.
- For token tagging / classification tasks such as
  - Named Entity Recognition (NER)
  - Parts of Speech Tagging (PoS) etc.
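Token tagging datasets carry one label per word, while subword tokenizers may split a word into several pieces. A common way to reconcile the two, sketched below, is to keep the label on the first piece and mask the rest. This is an assumption about general practice, not a description of this repository's scripts. `toy_subword_split` is a hypothetical splitter standing in for a real tokenizer; -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import Callable, List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_subword_split(word: str, max_len: int = 3) -> List[str]:
    """Hypothetical splitter: fixed-size chunks, continuations prefixed with '##'."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return pieces[:1] + ["##" + p for p in pieces[1:]]


def align_labels(
    words: List[str],
    labels: List[int],
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[int]]:
    """Expand word-level labels to subword level, labeling only first pieces."""
    tokens: List[str] = []
    aligned: List[int] = []
    for word, label in zip(words, labels):
        pieces = split(word)
        if not pieces:  # skip empty words so tokens and labels stay aligned
            continue
        tokens.extend(pieces)
        # First piece keeps the word label; later pieces are ignored by the loss.
        aligned.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, aligned


if __name__ == "__main__":
    print(align_labels(["Dhaka", "is"], [1, 0]))
```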

Benchmarks

	SC	EC	DC	NER	NLI
`Metrics`	`Accuracy`	`F1*`	`Accuracy`	`F1 (Entity)*`	`Accuracy`
mBERT	83.39	56.02	98.64	67.40	75.40
XLM-R	89.49	66.70	98.71	70.63	76.87
sagorsarker/bangla-bert-base	87.30	61.51	98.79	70.97	70.48
monsoon-nlp/bangla-electra	73.54	34.55	97.64	52.57	63.48
BanglaBERT	92.18	74.27	99.07	72.18	82.94

* Weighted average
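For reference, a weighted-average F1 is usually computed by taking the per-class F1 and weighting each class by its share of the true labels. The sketch below assumes that definition for a plain classification setting; `weighted_f1` is a hypothetical helper, and the exact evaluation code behind the numbers above is not shown here. The entity-level F1 reported for NER is computed over entity spans and is not covered by this sketch.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # denominator >= n >= 1
        score += (n / total) * f1  # weight by the class's share of true labels
    return score


if __name__ == "__main__":
    y_true = [0, 0, 0, 1]
    y_pred = [0, 0, 1, 1]
    print(round(weighted_f1(y_true, y_pred), 4))
```

Classes that appear only in the predictions have zero support and therefore contribute nothing to the average, although their false positives still lower the F1 of other classes.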

The benchmarking datasets are as follows:

Acknowledgements

We would like to thank Intelligent Machines and Google TFRC Program for providing cloud support for pretraining the models.

License

Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Citation

If you use any of the datasets, models or code modules, please cite the following paper:

@article{bhattacharjee2021banglabert,
  author    = {Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
  title     = {BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
  journal   = {CoRR},
  volume    = {abs/2101.00204},
  year      = {2021},
  url       = {https://arxiv.org/abs/2101.00204},
  eprinttype = {arXiv},
  eprint    = {2101.00204}
}
