A sentence aligner for comparable corpora

Last update: Aug 24, 2022

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical Machine Translation relies on parallel corpora (e.g. Europarl) for training translation models. However, these corpora are limited in size and take time to create. Yalign is designed to automate this process by finding sentences in comparable corpora that are close translations of each other. This opens up avenues for harvesting parallel corpora from sources like translated documents and the web.
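
To make the idea of "close translation matches" concrete, here is a toy sketch in Python. It is not Yalign's algorithm or API: the tiny dictionary, the word-overlap score, the 0.5 threshold, and all function names are hypothetical and chosen only for illustration. It pairs each English sentence with its best-scoring Spanish candidate and drops sentences that have no good match.

```python
# Toy illustration only: this is NOT Yalign's algorithm or API.
# The dictionary, threshold, and function names are hypothetical.

TOY_DICTIONARY = {
    "the": "el",
    "cat": "gato",
    "sleeps": "duerme",
    "dog": "perro",
    "barks": "ladra",
}


def translate_words(sentence):
    """Map each known English word to its Spanish entry; drop unknown words."""
    words = sentence.lower().split()
    return {TOY_DICTIONARY[w] for w in words if w in TOY_DICTIONARY}


def overlap_score(english, spanish):
    """Jaccard overlap between the translated English words and the Spanish words."""
    translated = translate_words(english)
    target = set(spanish.lower().split())
    if not translated or not target:
        return 0.0
    return len(translated & target) / len(translated | target)


def extract_pairs(english_sentences, spanish_sentences, threshold=0.5):
    """For each English sentence keep its best-scoring Spanish match, if good enough."""
    if not spanish_sentences:
        return []
    pairs = []
    for en in english_sentences:
        best = max(spanish_sentences, key=lambda es: overlap_score(en, es))
        if overlap_score(en, best) >= threshold:
            pairs.append((en, best))
    return pairs


english = ["The cat sleeps", "Stock prices fell today", "The dog barks"]
spanish = ["El perro ladra", "Hoy llueve mucho", "El gato duerme"]
pairs = extract_pairs(english, spanish)
for en, es in pairs:
    print(en, "<->", es)
```

The two documents here are comparable rather than parallel: the sentences appear in a different order and some have no counterpart, so only the sentences with a confident match are kept.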

Installation

Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI via pip:

sudo pip install yalign

Usage

First, download and unpack the English-to-Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz

Now we can use the yalign-align script, together with the English-to-Spanish model, to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
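
If you prefer to drive the same command from Python, a minimal sketch could look like the following. It only wraps the command line shown above. The helper names are hypothetical, and it assumes that yalign-align is on your PATH and writes its result to standard output.

```python
import shutil
import subprocess


def build_align_command(model_dir, source_a, source_b):
    """Return the argument list for the yalign-align call shown above."""
    return ["yalign-align", model_dir, source_a, source_b]


def run_alignment(model_dir, source_a, source_b):
    """Run yalign-align and return its standard output, or None if it is not installed.

    Assumption: the script prints its result to standard output.
    """
    if shutil.which("yalign-align") is None:
        return None
    result = subprocess.run(
        build_align_command(model_dir, source_a, source_b),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


command = build_align_command(
    "en-es",
    "http://en.wikipedia.org/wiki/Antiparticle",
    "http://es.wikipedia.org/wiki/Antipart%C3%ADcula",
)
print(" ".join(command))
```

Calling run_alignment with the same three arguments executes the command. It returns None when the script is not installed, so it is safe to call before checking the installation.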

Yalign is not limited to any one language pair. By creating your own models, you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine

Gonzalo García Berrotarán

Rafael Carrascosa

Elías Andrawos

Laura Alonso Alemany
