AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems

Overview

LITMUS Predictor

LITMUS Predictor provides support for simulating performance in ~100 languages given training observations of the desired task-model. Each training observation specifies the finetuning-datasize + test-performance in different languages.

Further, the tool provides support for constructing a data-collection strategy to maximize performance in desired targets subject to different constraints.

Installation

pip install -U pip
pip install -r requirements.txt

Usage

litmus/litmus_mixing.py contains the implementation of the LITMUS Predictor which can be trained on observations of different task-model trainings.

usage: LITMUS Tool [-h] [--scores_file SCORES_FILE]
                   [--train_format {json,csv}] [--save_state SAVE_STATE]
                   [--load_state LOAD_STATE]
                   [--precomputed_features PRECOMPUTED_FEATURES]
                   [--pivot_features {none,all,data_only}] [--use_all_langs]
                   [--common_scaling] [--training_algorithm {xgboost,mlp}]
                   [--error_method {LOO,LOTO,split,kfold,manual_split}]
                   [--data_sizes DATA_SIZES] [--mode MODE [MODE ...]]
                   [--output_dir OUTPUT_DIR]
                   [--heatmap_targets HEATMAP_TARGETS]
                   [--suggestions_budget SUGGESTIONS_BUDGET]
                   [--suggestions_langbudget SUGGESTIONS_LANGBUDGET]
                   [--suggestions_targets SUGGESTIONS_TARGETS]
                   [--suggestions_weights SUGGESTIONS_WEIGHTS]
                   [--suggestions_pivots SUGGESTIONS_PIVOTS]
                   [--suggestions_augmentable SUGGESTIONS_AUGMENTABLE]
                   [--suggestions_grid {exponential,linear}]
                   [--suggestions_objective {avg,min}]
                   [--suggestions_minperf SUGGESTIONS_MINPERF]
                   [--suggestions_minlangperf SUGGESTIONS_MINLANGPERF]
                   [--suggestions_verbose]
                   {mbert,xlmr}

positional arguments:
  {mbert,xlmr}          name of model to use

optional arguments:
  -h, --help            show this help message and exit
  --scores_file SCORES_FILE
                        path of json file containing scores to train on
  --train_format {json,csv}
                        Format of the training data
  --save_state SAVE_STATE
                        Save state of training of model to pickle file
  --load_state LOAD_STATE
                        Load trained model from pickle file
  --precomputed_features PRECOMPUTED_FEATURES
                        Path to precomputed-features file.
  --pivot_features {none,all,data_only}
                        What features based on pivot langs to use
  --use_all_langs       Add features based on all langs the tool supports
                        (Needed for transfer)
  --common_scaling      Common min max scaling params that are pvt
                        dependent(data size, type overlap, distance)
  --training_algorithm {xgboost,mlp}
                        which regressor to use
  --error_method {LOO,LOTO,split,kfold,manual_split}

  --data_sizes DATA_SIZES
                        Pivot data-size configs (semi-colon separated configs,
                        each config itself being comma-separated key-value
                        pairs)

  --mode MODE [MODE ...]
                        Output modes (comma-separated). Choose from following:
                        {heatmap, suggestions}.
  --output_dir OUTPUT_DIR
                        Overrride output directory
  --heatmap_targets HEATMAP_TARGETS
                        Targets for heatmap. Overrides suggestions_targets
                        (which is used by deafult)

  --suggestions_budget SUGGESTIONS_BUDGET
                        Budget for finding suggestions of which languages to
                        add data for (0 to disable)
  --suggestions_langbudget SUGGESTIONS_LANGBUDGET
                        Language-specific budget for finding suggestions
                        (overrrides suggestions_budget for these langs, comma-
                        separated list of key:value pairs)
  --suggestions_targets SUGGESTIONS_TARGETS
                        Targets being considered (comma-separated)
  --suggestions_weights SUGGESTIONS_WEIGHTS
                        Target weights for avg perf objective (comma-separated
                        list of key:value pairs, default wt=1)
  --suggestions_pivots SUGGESTIONS_PIVOTS
                        Index of desired row in data_sizes
  --suggestions_augmentable SUGGESTIONS_AUGMENTABLE
                        Set of augmentable languages (comma-separated)
  --suggestions_grid {exponential,linear}
                        Search space grid to use for suggestions
  --suggestions_objective {avg,min}
                        Objective function to be used for finding suggestions
  --suggestions_minperf SUGGESTIONS_MINPERF
                        Minimum acceptable average performance across tgts
  --suggestions_minlangperf SUGGESTIONS_MINLANGPERF
                        Minimum acceptable performance for given tgts (comma-
                        separated list of key:value pairs)
  --suggestions_verbose
                        Verbose logging of search

Examples

From shell

python3 litmus_mixing.py xlmr --scores_file training_observations.json --common_scaling --error_method split --mode heatmap --data_sizes "en:1000,hi:1000;en:1000,ar:1000" --use_all_langs --heatmap_targets en,fr,de,hi,ar,ru

From external scripts

from litmus import litmus_mixing

data_file = "" # Location of train data file
args = litmus_mixing.parse_args([
    "xlmr", data_file,
    "--common_scaling",
    "--error_method", "kfold",
    "--training_algorithm", "xgboost"
])
res = litmus_mixing.litmus_main(args)

WebApp

frontend/ contains the code for hosting the tool as a webapp using Azure Functions. frontend/WebUx implements the client-side as a static website which interacts with a Azure Functions backend which internally runs the litmus/litmus_mixing.py script.

Instructions to self-host

  1. Create an Azure Functions resource on Azure.
  2. Install Azure CLI and Functions Core Tools
  3. cd into the frontend/ directory and deploy to azure functions using func azure functionapp publish .

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Code to reproduce the results of the paper 'Towards Realistic Few-Shot Relation Extraction' (EMNLP 2021)

Realistic Few-Shot Relation Extraction This repository contains code to reproduce the results in the paper "Towards Realistic Few-Shot Relation Extrac

Bloomberg 8 Nov 09, 2022
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions

BERTopic BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable

Maarten Grootendorst 3.6k Jan 07, 2023
JaQuAD: Japanese Question Answering Dataset

JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)

SkelterLabs 84 Dec 27, 2022
A Paper List for Speech Translation

Keyword: Speech Translation, Spoken Language Processing, Natural Language Processing

138 Dec 24, 2022
Jupyter Notebook tutorials on solving real-world problems with Machine Learning & Deep Learning using PyTorch

Jupyter Notebook tutorials on solving real-world problems with Machine Learning & Deep Learning using PyTorch. Topics: Face detection with Detectron 2, Time Series anomaly detection with LSTM Autoenc

Venelin Valkov 1.8k Dec 31, 2022
Rhasspy 673 Dec 28, 2022
This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini!

About CappuccinoJs This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini! Este conversor criar

Arthur Ottoni Ribeiro 48 Nov 15, 2022
This repository describes our reproducible framework for assessing self-supervised representation learning from speech

LeBenchmark: a reproducible framework for assessing SSL from speech Self-Supervised Learning (SSL) using huge unlabeled data has been successfully exp

49 Aug 24, 2022
Input english text, then translate it between languages n times using the Deep Translator Python Library.

mass-translator About Input english text, then translate it between languages n times using the Deep Translator Python Library. How to Use Install dep

2 Mar 04, 2022
EdiTTS: Score-based Editing for Controllable Text-to-Speech

Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Neosapience 99 Jan 02, 2023
Code for paper "Which Training Methods for GANs do actually Converge? (ICML 2018)"

GAN stability This repository contains the experiments in the supplementary material for the paper Which Training Methods for GANs do actually Converg

Lars Mescheder 884 Nov 11, 2022
NL. The natural language programming language.

NL A Natural-Language programming language. Built using Codex. A few examples are inside the nl_projects directory. How it works Write any code in pur

2 Jan 17, 2022
Code for Text Prior Guided Scene Text Image Super-Resolution

Code for Text Prior Guided Scene Text Image Super-Resolution

82 Dec 26, 2022
TextFlint is a multilingual robustness evaluation platform for natural language processing tasks,

TextFlint is a multilingual robustness evaluation platform for natural language processing tasks, which unifies general text transformation, task-specific transformation, adversarial attack, sub-popu

TextFlint 587 Dec 20, 2022
Repositório do trabalho de introdução a NLP

Trabalho da disciplina de BI NLP Repositório do trabalho da disciplina Introdução a Processamento de Linguagem Natural da pós BI-Master da PUC-RIO. Eq

Leonardo Lins 1 Jan 18, 2022
A Japanese tokenizer based on recurrent neural networks

Nagisa is a python module for Japanese word segmentation/POS-tagging. It is designed to be a simple and easy-to-use tool. This tool has the following

325 Jan 05, 2023
Data preprocessing rosetta parser for python

datapreprocessing_rosetta_parser I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity,

ASReview hackathon for Follow the Money 2 Nov 28, 2021
Türkçe küfürlü içerikleri bulan bir yapay zeka kütüphanesi / An ML library for profanity detection in Turkish sentences

"Kötü söz sahibine aittir." -Anonim Nedir? sinkaf uygunsuz yorumların bulunmasını sağlayan bir python kütüphanesidir. Farkı nedir? Diğer algoritmalard

KaraGoz 4 Feb 18, 2022
Sapiens is a human antibody language model based on BERT.

Sapiens: Human antibody language model ____ _ / ___| __ _ _ __ (_) ___ _ __ ___ \___ \ / _` | '_ \| |/ _ \ '

Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc. 13 Nov 20, 2022
[ICCV 2021] Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification

Counterfactual Attention Learning Created by Yongming Rao*, Guangyi Chen*, Jiwen Lu, Jie Zhou This repository contains PyTorch implementation for ICCV

Yongming Rao 89 Dec 18, 2022