Pytorch NLP library based on FastAI

Overview

Quick NLP

Quick NLP is a deep learning nlp library inspired by the fast.ai library

It follows the same api as fastai and extends it allowing for quick and easy running of nlp models

Features

Installation

Installation of fast.ai library is required. Please install using the instructions here . It is important that the latest version of fast.ai is used and not the pip version which is not up to date.

After setting up an environment using the fasta.ai instructions please clone the quick-nlp repo and use pip install to install the package as follows:

git clone https://github.com/outcastofmusic/quick-nlp
cd quick-nlp
pip install .

Docker Image

A docker image with the latest master is available to use it please run:

docker run --runtime nvidia -it -p 8888:8888 --mount type=bind,source="$(pwd)",target=/workspace agispof/quicknlp:latest

this will mount your current directory to /workspace and start a jupyter lab session in that directory

Usage Example

The main goal of quick-nlp is to provided the easy interface of the fast.ai library for seq2seq models.

For example Lets assume that we have a dataset_path with folders for training, validation files. Each file is a tsv file where each row is two sentences separated by a tab. For example a file inside the train folder can be a eng_to_fr.tsv file with the following first few lines:

Go. Va !
Run!        Cours !
Run!        Courez !
Wow!        Ça alors !
Fire!       Au feu !
Help!       À l'aide !
Jump.       Saute.
Stop!       Ça suffit !
Stop!       Stop !
Stop!       Arrête-toi !
Wait!       Attends !
Wait!       Attendez !
I see.      Je comprends.

loading the data from the directory is as simple as:

from fastai.plots import *
from torchtext.data import Field
from fastai.core import SGD_Momentum
from fastai.lm_rnn import seq2seq_reg
from quicknlp import SpacyTokenizer, print_batch, S2SModelData
INIT_TOKEN = "<sos>"
EOS_TOKEN = "<eos>"
DATAPATH = "dataset_path"
fields = [
    ("english", Field(init_token=INIT_TOKEN, eos_token=EOS_TOKEN, tokenize=SpacyTokenizer('en'), lower=True)),
    ("french", Field(init_token=INIT_TOKEN, eos_token=EOS_TOKEN, tokenize=SpacyTokenizer('fr'), lower=True))

]
batch_size = 64
data = S2SModelData.from_text_files(path=DATAPATH, fields=fields,
                                    train="train",
                                    validation="validation",
                                    source_names=["english", "french"],
                                    target_names=["french"],
                                    bs= batch_size
                                   )

Finally, to train a seq2seq model with the data we only need to do:

emb_size = 300
nh = 1024
nl = 3
learner = data.get_model(opt_fn=SGD_Momentum(0.7), emb_sz=emb_size,
                         nhid=nh,
                         nlayers=nl,
                         bidir=True,
                        )
clip = 0.3
learner.reg_fn = reg_fn
learner.clip = clip
learner.fit(2.0, wds=1e-6)
Owner
Agis pof
Agis pof
Need: Image Search With Python

Need: Image Search The problem is that a user needs to search for a specific ima

Surya Komandooru 1 Dec 30, 2021
BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Table of contents Introduction Using BARTpho with fairseq Using BARTpho with transformers Notes BARTpho: Pre-trained Sequence-to-Sequence Models for V

VinAI Research 58 Dec 23, 2022
Plugin repository for Macast

Macast-plugins Plugin repository for Macast. How to use third-party player plugin Download Macast from GitHub Release. Download the plugin you want fr

109 Jan 04, 2023
To classify the News into Real/Fake using Features from the Text Content of the article

Hoax-Detector Authenticity of news has now become a major problem. The Idea is to classify the News into Real/Fake using Features from the Text Conten

Aravindhan 1 Feb 09, 2022
Finding Label and Model Errors in Perception Data With Learned Observation Assertions

Finding Label and Model Errors in Perception Data With Learned Observation Assertions This is the project page for Finding Label and Model Errors in P

Stanford Future Data Systems 17 Oct 14, 2022
Automatically search Stack Overflow for the command you want to run

stackshell Automatically search Stack Overflow (and other Stack Exchange sites) for the command you want to ru Use the up and down arrows to change be

circuit10 22 Oct 27, 2021
Topic Inference with Zeroshot models

zeroshot_topics Table of Contents Installation Usage License Installation zeroshot_topics is distributed on PyPI as a universal wheel and is available

Rita Anjana 55 Nov 28, 2022
Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE)

VK.com 847 Dec 19, 2022
customer care chatbot made with Rasa Open Source.

Customer Care Bot Customer care bot for ecomm company which can solve faq and chitchat with users, can contact directly to team. 🛠 Features Basic E-c

Dishant Gandhi 23 Oct 27, 2022
Pre-training BERT masked language models with custom vocabulary

Pre-training BERT Masked Language Models (MLM) This repository contains the method to pre-train a BERT model using custom vocabulary. It was used to p

Stella Douka 14 Nov 02, 2022
AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiab-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

Shunsuke KITADA 12 Feb 16, 2022
Huggingface Transformers + Adapters = ❤️

adapter-transformers A friendly fork of HuggingFace's Transformers, adding Adapters to PyTorch language models adapter-transformers is an extension of

AdapterHub 1.2k Jan 09, 2023
A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

Jeong Ukjae 20 Jul 11, 2022
Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets

Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets What is LASSL • How to Use What is LASSL LASSL은 LAnguage Semi-Super

LASSL: LAnguage Self-Supervised Learning 116 Dec 27, 2022
Transformer training code for sequential tasks

Sequential Transformer This is a code for training Transformers on sequential tasks such as language modeling. Unlike the original Transformer archite

Meta Research 578 Dec 13, 2022
(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"

Towards Abstractive Grounded Summarization of Podcast Transcripts We provide the source code for the paper "Towards Abstractive Grounded Summarization

10 Jul 01, 2022
This code is the implementation of Text Emotion Recognition (TER) with linguistic features

APSIPA-TER This code is the implementation of Text Emotion Recognition (TER) with linguistic features. The network model is BERT with a pretrained mod

kenro515 1 Feb 08, 2022
SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE: Simple Contrastive Learning of Sentence Embeddings This repository contains the code and pre-trained models for our paper SimCSE: Simple Contr

Princeton Natural Language Processing 2.5k Jan 07, 2023
An easier way to build neural search on the cloud

An easier way to build neural search on the cloud Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g

Jina AI 17.1k Jan 09, 2023
A minimal code for fairseq vq-wav2vec model inference.

vq-wav2vec inference A minimal code for fairseq vq-wav2vec model inference. Runs without installing the fairseq toolkit and its dependencies. Usage ex

Vladimir Larin 7 Nov 15, 2022