Translate - a PyTorch Language Library

Overview

NOTE

PyTorch Translate is now deprecated, please use fairseq instead.


Translate - a PyTorch Language Library

Translate is a library for machine translation written in PyTorch. It provides training for sequence-to-sequence models. Translate relies on fairseq, a general sequence-to-sequence library, which means that models implemented in both Translate and Fairseq can be trained. Translate also provides the ability to export some models to Caffe2 graphs via ONNX and to load and run these models from C++ for production purposes. Currently, we export components (encoder, decoder) to Caffe2 separately and beam search is implemented in C++. In the near future, we will be able to export the beam search as well. We also plan to add export support to more models.

Quickstart

If you are just interested in training/evaluating MT models, and not in exporting the models to Caffe2 via ONNX, you can install Translate for Python 3 by following these few steps:

  1. Install pytorch
  2. Install fairseq
  3. Clone this repository git clone https://github.com/pytorch/translate.git pytorch-translate && cd pytorch-translate
  4. Run python setup.py install

Provided you have CUDA installed you should be good to go.

Requirements and Full Installation

Translate Requires:

  • A Linux operating system with a CUDA compatible card
  • GNU C++ compiler version 4.9.2 and above
  • A CUDA installation. We recommend CUDA 8.0 or CUDA 9.0

Use Our Docker Image:

Install Docker and nvidia-docker, then run

sudo docker pull pytorch/translate
sudo nvidia-docker run -i -t --rm pytorch/translate /bin/bash
. ~/miniconda/bin/activate
cd ~/translate

You should now be able to run the sample commands in the Usage Examples section below. You can also see the available image versions under https://hub.docker.com/r/pytorch/translate/tags/.

Install Translate from Source:

These instructions were mainly tested on Ubuntu 16.04.5 LTS (Xenial Xerus) with a Tesla M60 card and a CUDA 9 installation. We highly encourage you to report an issue if you are unable to install this project for your specific configuration.

  • If you don't already have an existing Anaconda environment with Python 3.6, you can install one via Miniconda3:

    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    chmod +x miniconda.sh
    ./miniconda.sh -b -p ~/miniconda
    rm miniconda.sh
    . ~/miniconda/bin/activate
    
  • Clone the Translate repo:

    git clone https://github.com/pytorch/translate.git
    pushd translate
    
  • Install the PyTorch conda package:

    # Set to 8 or 9 depending on your CUDA version.
    TMP_CUDA_VERSION="9"
    
    # Uninstall previous versions of PyTorch. Doing this twice is intentional.
    # Error messages about torch not being installed are benign.
    pip uninstall -y torch
    pip uninstall -y torch
    
    # This may not be necessary if you already have the latest cuDNN library.
    conda install -y cudnn
    
    # Add LAPACK support for the GPU.
    conda install -y -c pytorch "magma-cuda${TMP_CUDA_VERSION}0"
    
    # Install the combined PyTorch nightly conda package.
    conda install pytorch-nightly cudatoolkit=${TMP_CUDA_VERSION}.0 -c pytorch
    
    # Install NCCL2.
    wget "https://s3.amazonaws.com/pytorch/nccl_2.1.15-1%2Bcuda${TMP_CUDA_VERSION}.0_x86_64.txz"
    TMP_NCCL_VERSION="nccl_2.1.15-1+cuda${TMP_CUDA_VERSION}.0_x86_64"
    tar -xvf "${TMP_NCCL_VERSION}.txz"
    rm "${TMP_NCCL_VERSION}.txz"
    
    # Set some environmental variables needed to link libraries correctly.
    export CONDA_PATH="$(dirname $(which conda))/.."
    export NCCL_ROOT_DIR="$(pwd)/${TMP_NCCL_VERSION}"
    export LD_LIBRARY_PATH="${CONDA_PATH}/lib:${NCCL_ROOT_DIR}/lib:${LD_LIBRARY_PATH}"
    
  • Install ONNX:

    git clone --recursive https://github.com/onnx/onnx.git
    yes | pip install ./onnx 2>&1 | tee ONNX_OUT
    

If you get a Protobuf compiler not found error, you need to install it:

conda install -c anaconda protobuf

Then, try to install ONNX again:

yes | pip install ./onnx 2>&1 | tee ONNX_OUT
  • Build Translate:

    pip uninstall -y pytorch-translate
    python3 setup.py build develop
    

Now you should be able to run the example scripts below!

Usage Examples

Note: the example commands given assume that you are the root of the cloned GitHub repository or that you're in the translate directory of the Docker or Amazon image. You may also need to make sure you have the Anaconda environment activated.

Training

We provide an example script to train a model for the IWSLT 2014 German-English task. We used this command to obtain a pretrained model:

bash pytorch_translate/examples/train_iwslt14.sh

The pretrained model actually contains two checkpoints that correspond to training twice with random initialization of the parameters. This is useful to obtain ensembles. This dataset is relatively small (~160K sentence pairs), so training will complete in a few hours on a single GPU.

Training with tensorboard visualization

We provide support for visualizing training stats with tensorboard. As a dependency, you will need tensorboard_logger installed.

pip install tensorboard_logger

Please also make sure that tensorboard is installed. It also comes with tensorflow installation.

You can use the above example script to train with tensorboard, but need to change line 10 from :

CUDA_VISIBLE_DEVICES=0 python3 pytorch_translate/train.py

to

CUDA_VISIBLE_DEVICES=0 python3 pytorch_translate/train_with_tensorboard.py

The event log directory for tensorboard can be specified by option --tensorboard_dir with a default value: run-1234. This directory is appended to your --save_dir argument.

For example in the above script, you can visualize with:

tensorboard --logdir checkpoints/runs/run-1234

Multiple runs can be compared by specifying different --tensorboard_dir. i.e. run-1234 and run-2345. Then

tensorboard --logdir checkpoints/runs

can visualize stats from both runs.

Pretrained Model

A pretrained model for IWSLT 2014 can be evaluated by running the example script:

bash pytorch_translate/examples/generate_iwslt14.sh

Note the improvement in performance when using an ensemble of size 2 instead of a single model.

Exporting a Model with ONNX

We provide an example script to export a PyTorch model to a Caffe2 graph via ONNX:

bash pytorch_translate/examples/export_iwslt14.sh

This will output two files, encoder.pb and decoder.pb, that correspond to the computation of the encoder and one step of the decoder. The example exports a single checkpoint (--checkpoint model/averaged_checkpoint_best_0.pt but is also possible to export an ensemble (--checkpoint model/averaged_checkpoint_best_0.pt --checkpoint model/averaged_checkpoint_best_1.pt). Note that during export, you can also control a few hyperparameters such as beam search size, word and UNK rewards.

Using the Model

To use the sample exported Caffe2 model to translate sentences, run:

echo "hallo welt" | bash pytorch_translate/examples/translate_iwslt14.sh

Note that the model takes in BPE inputs, so some input words need to be split into multiple tokens. For instance, "hineinstopfen" is represented as "hinein@@ stop@@ fen".

PyTorch Translate Research

We welcome you to explore the models we have in the pytorch_translate/research folder. If you use them and encounter any errors, please paste logs and a command that we can use to reproduce the error. Feel free to contribute any bugfixes or report your experience, but keep in mind that these models are a work in progress and thus are currently unsupported.

Join the Translate Community

We welcome contributions! See the CONTRIBUTING.md file for how to help out.

License

Translate is BSD-licensed, as found in the LICENSE file.

Minimal GUI for accessing the Watson Text to Speech service.

Description Minimal graphical application for accessing the Watson Text to Speech service. Requirements Python 3 plus all dependencies listed in requi

Moritz Maxeiner 1 Oct 22, 2021
Pipeline for training LSA models using Scikit-Learn.

Latent Semantic Analysis Pipeline for training LSA models using Scikit-Learn. Usage Instead of writing custom code for latent semantic analysis, you j

Dani El-Ayyass 23 Sep 05, 2022
Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

Wasi Ahmad 138 Dec 30, 2022
A paper list of pre-trained language models (PLMs).

Large-scale pre-trained language models (PLMs) such as BERT and GPT have achieved great success and become a milestone in NLP.

RUCAIBox 124 Jan 02, 2023
Top2Vec is an algorithm for topic modeling and semantic search.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.

Dimo Angelov 2.4k Jan 06, 2023
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Dec 31, 2022
this repository has datasets containing information of Uber pickups in NYC from April 2014 to September 2014 and January to June 2015. data Analysis , virtualization and some insights are gathered here

uber-pickups-analysis Data Source: https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city Information about data set The dataset contain

1 Nov 02, 2021
Optimal Transport Tools (OTT), A toolbox for all things Wasserstein.

Optimal Transport Tools (OTT), A toolbox for all things Wasserstein. See full documentation for detailed info on the toolbox. The goal of OTT is to pr

OTT-JAX 255 Dec 26, 2022
Crowd sourced training data for Rasa NLU models

NLU Training Data Crowd-sourced training data for the development and testing of Rasa NLU models. If you're interested in grabbing some data feel free

Rasa 169 Dec 26, 2022
This repository contains (not all) code from my project on Named Entity Recognition in philosophical text

NERphilosophy 👋 Welcome to the github repository of my BsC thesis. This repository contains (not all) code from my project on Named Entity Recognitio

Ruben 1 Jan 27, 2022
A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset 台達閱讀理解資料集 Delta Reading Comprehension Dataset (DRCD) 屬於通用領域繁體中文機器閱讀理解資料集。 本資料集期望成為適用於遷移學習之標準中文閱讀理解資料集。 本資料集從2,108篇

272 Dec 15, 2022
Score-Based Point Cloud Denoising (ICCV'21)

Score-Based Point Cloud Denoising (ICCV'21) [Paper] https://arxiv.org/abs/2107.10981 Installation Recommended Environment The code has been tested in

Shitong Luo 79 Dec 26, 2022
Machine translation models released by the Gourmet project

Gourmet Models Overview The Gourmet project has released several machine translation models to translate low-resource languages. This repository conta

Edinburgh NLP 5 Dec 08, 2021
Deep learning for NLP crash course at ABBYY.

Deep NLP Course at ABBYY Deep learning for NLP crash course at ABBYY. Suggested textbook: Neural Network Methods in Natural Language Processing by Yoa

Dan Anastasyev 597 Dec 18, 2022
An A-SOUL Text Generator Based on CPM-Distill.

ASOUL-Generator-Backend 本项目为 https://asoul.infedg.xyz/ 的后端。 模型为基于 CPM-Distill 的 transformers 转化版本 CPM-Generate-distill 训练而成。

infinityedge 46 Dec 11, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
AutoGluon: AutoML for Text, Image, and Tabular Data

AutoML for Text, Image, and Tabular Data AutoGluon automates machine learning tasks enabling you to easily achieve strong predictive performance in yo

Amazon Web Services - Labs 5.2k Dec 29, 2022
NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Project 3: Web APIs & NLP Problem Statement How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration? The goal of the project is to see

Adam Muhammad Klesc 2 Mar 29, 2022
A Practitioner's Guide to Natural Language Processing

Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, Text

Dipanjan (DJ) Sarkar 1.5k Jan 03, 2023
Use the state-of-the-art m2m100 to translate large data on CPU/GPU/TPU. Super Easy!

Easy-Translate is a script for translating large text files in your machine using the M2M100 models from Facebook/Meta AI. We also privide a script fo

Iker García-Ferrero 41 Dec 15, 2022