GULAG: GUessing LAnGuages with neural networks

Related tags

Deep Learninggulag
Overview

GULAG: GUessing LAnGuages with neural networks

Main Code style: black Checked with mypy GitHub license GitHub stars

cannon on sparrows

Classify languages in text via neural networks.

> Привет! My name is Egor. Was für ein herrliches Frühlingswetter, хутка расцвітуць дрэвы.
ru -- Привет
en -- My name is Egor
de -- Was für ein herrliches Frühlingswetter
be -- хутка расцвітуць дрэвы

Usage

Use requirements.txt to install necessary dependencies:

pip install -r requirements.txt

After that you can either train model:

python -m src.main train --gin-file config/train.gin

Or run inference:

python -m src.main infer

Training

All training details are covered by PyTorch-Lightning. There are:

Both modules have explicit documentation, see source files for usage details.

Dataset

Since extracting languages from a text is a kind of synthetic task, then there is no exact dataset of that. A possible approach to handle this is to use general multilingual corpses to create a synthetic dataset with multiple languages per one text. Although there is a popular mC4 dataset with large texts in over 100 languages. It is too large for this pet project. Therefore, I used wikiann dataset that also supports over 100 languages including Russian, Ukrainian, Belarusian, Kazakh, Azerbaijani, Armenian, Georgian, Hebrew, English, and German. But this dataset consists of only small sentences for NER classification that make it more unnatural.

Synthetic data

To create a dataset with multiple languages per example, I use the following sampling strategy:

  1. Select number of languages in next example
  2. Select number of sentences for each language
  3. Sample sentences, shuffle them and concatenate into single text

For exact details about sampling algorithm see generate_example method.

This strategy allows training on a large non-repeating corpus. But for proper evaluation during training, we need a deterministic subset of data. For that, we can pre-generate a bunch of texts and then reuse them on each validation.

Model

As a training objective, I selected per-token classification. This automatically allows not only classifying languages in the text, but also specifying their ranges.

The model consists of two parts:

  1. The backbone model that embeds tokens into vectors
  2. Head classifier that predicts classes by embedding vector

As backbone model I selected vanilla BERT. This model already pretrained on large multilingual corpora including non-popular languages. During training on a target task, weights of BERT were frozen to enhance speed.

Head classifier is a simple MLP, see TokenClassifier for details.

Configuration

To handle big various of parameters, I used gin-config. config folder contains all configurations split by modules that used them.

Use --gin-file CLI argument to specify config file and --gin-param to manually overwrite some values. For example, to run debug mode on a small subset with a tiny model for 10 steps use

python -m src.main train --gin-file config/debug.gin --gin-param="train.n_steps = 10"

You can also use jupyter notebook to run training, this is a convenient way to train with Google Colab. See train.ipynb.

Artifacts

All training logs and artifacts are stored on W&B. See voudy/gulag for information about current runs, their losses and metrics. Any of the presented models may be used on inference.

Inference

In inference mode, you may play with the model to see whether it is good or not. This script requires a W&B run path where checkpoint is stored and checkpoint name. After that, you can interact with a model in a loop.

The final model is stored in voudy/gulag/a55dbee8 run. It was trained for 20 000 steps for ~9 hours on Tesla T4.

$ python -m src.main infer --wandb "voudy/gulag/a55dbee8" --ckpt "step_20000.ckpt"
...
Enter text to classify languages (Ctrl-C to exit):
> İrəli! Вперёд! Nach vorne!
az -- İrəli
ru -- Вперёд
de -- Nach vorne
Enter text to classify languages (Ctrl-C to exit):
> Давайте жити дружно
uk -- Давайте жити дружно
> ...

For now, text preprocessing removes all punctuation and digits. It makes the data more robust. But restoring them back is a straightforward technical work that I was lazy to do.

Of course, you can use model from the Jupyter Notebooks, see infer.ipynb

Further work

Next steps may include:

  • Improved dataset with more natural examples, e.g. adopt mC4.
  • Better tokenization to handle rare languages, this should help with problems on the bounds of similar texts.
  • Experiments with another embedders, e.g. mGPT-3 from Sber covers all interesting languages, but requires technical work to adopt for classification task.
Owner
Egor Spirin
DL guy
Egor Spirin
Malware Env for OpenAI Gym

Malware Env for OpenAI Gym Citing If you use this code in a publication please cite the following paper: Hyrum S. Anderson, Anant Kharkar, Bobby Fila

ENDGAME 563 Dec 29, 2022
Differentiable Surface Triangulation

Differentiable Surface Triangulation This is our implementation of the paper Differentiable Surface Triangulation that enables optimization for any pe

61 Dec 07, 2022
Implementation of momentum^2 teacher

Momentum^2 Teacher: Momentum Teacher with Momentum Statistics for Self-Supervised Learning Requirements All experiments are done with python3.6, torch

jemmy li 121 Sep 26, 2022
Uncertainty Estimation via Response Scaling for Pseudo-mask Noise Mitigation in Weakly-supervised Semantic Segmentation

Uncertainty Estimation via Response Scaling for Pseudo-mask Noise Mitigation in Weakly-supervised Semantic Segmentation Introduction This is a PyTorch

XMed-Lab 30 Sep 23, 2022
torchlm is aims to build a high level pipeline for face landmarks detection, it supports training, evaluating, exporting, inference(Python/C++) and 100+ data augmentations

💎A high level pipeline for face landmarks detection, supports training, evaluating, exporting, inference and 100+ data augmentations, compatible with torchvision and albumentations, can easily instal

DefTruth 142 Dec 25, 2022
Repository for MeshTalk supplemental material and code once the (already approved) 16 GHS captures our lab will make publicly available are released.

meshtalk This repository contains code to run MeshTalk for face animation from audio. If you use MeshTalk, please cite @inproceedings{richard2021mesht

Meta Research 221 Jan 06, 2023
Machine Learning with JAX Tutorials

The purpose of this repo is to make it easy to get started with JAX. It contains my "Machine Learning with JAX" series of tutorials (YouTube videos and Jupyter Notebooks) as well as the content I fou

Aleksa Gordić 372 Dec 28, 2022
ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees

ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees This repository is the official implementation of the empirica

Kuan-Lin (Jason) Chen 2 Oct 02, 2022
WHENet - ONNX, OpenVINO, TFLite, TensorRT, EdgeTPU, CoreML, TFJS, YOLOv4/YOLOv4-tiny-3L

HeadPoseEstimation-WHENet-yolov4-onnx-openvino ONNX, OpenVINO, TFLite, TensorRT, EdgeTPU, CoreML, TFJS, YOLOv4/YOLOv4-tiny-3L 1. Usage $ git clone htt

Katsuya Hyodo 49 Sep 21, 2022
Segmentation and Identification of Vertebrae in CT Scans using CNN, k-means Clustering and k-NN

Segmentation and Identification of Vertebrae in CT Scans using CNN, k-means Clustering and k-NN If you use this code for your research, please cite ou

41 Dec 08, 2022
Multiview Dataset Toolkit

Multiview Dataset Toolkit Using multi-view cameras is a natural way to obtain a complete point cloud. However, there is to date only one multi-view 3D

11 Dec 22, 2022
SoK: Vehicle Orientation Representations for Deep Rotation Estimation

SoK: Vehicle Orientation Representations for Deep Rotation Estimation Raymond H. Tu, Siyuan Peng, Valdimir Leung, Richard Gao, Jerry Lan This is the o

FIRE Capital One Machine Learning of the University of Maryland 12 Oct 07, 2022
Easy and Efficient Object Detector

EOD Easy and Efficient Object Detector EOD (Easy and Efficient Object Detection) is a general object detection model production framework. It aim on p

381 Jan 01, 2023
The self-supervised goal reaching benchmark introduced in Discovering and Achieving Goals via World Models

Lexa-Benchmark Codebase for the self-supervised goal reaching benchmark introduced in 'Discovering and Achieving Goals via World Models'. Setup Create

1 Oct 14, 2021
Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth

Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth This codebase implements the loss function described in: Insta

209 Dec 07, 2022
A Free and Open Source Python Library for Multiobjective Optimization

Platypus What is Platypus? Platypus is a framework for evolutionary computing in Python with a focus on multiobjective evolutionary algorithms (MOEAs)

Project Platypus 424 Dec 18, 2022
Implementation of Google Brain's WaveGrad high-fidelity vocoder

WaveGrad Implementation (PyTorch) of Google Brain's high-fidelity WaveGrad vocoder (paper). First implementation on GitHub with high-quality generatio

Ivan Vovk 363 Dec 27, 2022
AOT (Associating Objects with Transformers) in PyTorch

An efficient modular implementation of Associating Objects with Transformers for Video Object Segmentation in PyTorch

162 Dec 14, 2022
Facilitating Database Tuning with Hyper-ParameterOptimization: A Comprehensive Experimental Evaluation

A Comprehensive Experimental Evaluation for Database Configuration Tuning This is the source code to the paper "Facilitating Database Tuning with Hype

DAIR Lab 9 Oct 29, 2022
A dead simple python wrapper for darknet that works with OpenCV 4.1, CUDA 10.1

What Dead simple python wrapper for Yolo V3 using AlexyAB's darknet fork. Works with CUDA 10.1 and OpenCV 4.1 or later (I use OpenCV master as of Jun

Pliable Pixels 6 Jan 12, 2022