PyTorch NLP library based on fast.ai

Overview

Quick NLP

Quick NLP is a deep learning NLP library inspired by the fast.ai library.

It follows the same API as fast.ai and extends it, making it quick and easy to run NLP models.

Features

Installation

The fast.ai library must be installed first; please follow the fast.ai installation instructions. It is important to use the latest version of fast.ai rather than the pip version, which is not up to date.

After setting up an environment using the fast.ai instructions, clone the quick-nlp repo and install the package with pip as follows:

git clone https://github.com/outcastofmusic/quick-nlp
cd quick-nlp
pip install .

Docker Image

A Docker image with the latest master is available. To use it, please run:

docker run --runtime nvidia -it -p 8888:8888 --mount type=bind,source="$(pwd)",target=/workspace agispof/quicknlp:latest

This will mount your current directory to /workspace and start a Jupyter Lab session in that directory.

Usage Example

The main goal of quick-nlp is to provide the easy interface of the fast.ai library for seq2seq models.

For example, let's assume that we have a dataset_path with folders for training and validation files. Each file is a TSV file where each row is two sentences separated by a tab. For example, a file inside the train folder can be an eng_to_fr.tsv file with the following first few lines:

Go.         Va !
Run!        Cours !
Run!        Courez !
Wow!        Ça alors !
Fire!       Au feu !
Help!       À l'aide !
Jump.       Saute.
Stop!       Ça suffit !
Stop!       Stop !
Stop!       Arrête-toi !
Wait!       Attends !
Wait!       Attendez !
I see.      Je comprends.
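
To sanity-check a file in this format before handing it to quick-nlp, a minimal standard-library sketch along the following lines could be used. The helper name read_sentence_pairs is hypothetical and is not part of quick-nlp or fast.ai; it only assumes the two-column, tab-separated layout described above.

```python
import csv
from pathlib import Path
from typing import List, Tuple


def read_sentence_pairs(tsv_path: Path) -> List[Tuple[str, str]]:
    """Read (source, target) sentence pairs from a tab-separated file.

    Hypothetical helper for inspecting data; not a quick-nlp function.
    """
    pairs = []
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE leaves quotes and apostrophes inside sentences untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_number, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 2:
                raise ValueError(
                    f"{tsv_path}:{line_number}: expected 2 columns, got {len(row)}"
                )
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs
```

Calling read_sentence_pairs(Path("dataset_path/train/eng_to_fr.tsv")) would return a list of (english, french) tuples, and raises a ValueError on any row that does not have exactly two columns.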

Loading the data from the directory is as simple as:

from fastai.plots import *
from torchtext.data import Field
from fastai.core import SGD_Momentum
from fastai.lm_rnn import seq2seq_reg
from quicknlp import SpacyTokenizer, print_batch, S2SModelData
INIT_TOKEN = "<sos>"
EOS_TOKEN = "<eos>"
DATAPATH = "dataset_path"
fields = [
    ("english", Field(init_token=INIT_TOKEN, eos_token=EOS_TOKEN, tokenize=SpacyTokenizer('en'), lower=True)),
    ("french", Field(init_token=INIT_TOKEN, eos_token=EOS_TOKEN, tokenize=SpacyTokenizer('fr'), lower=True)),
]
batch_size = 64
data = S2SModelData.from_text_files(path=DATAPATH, fields=fields,
                                    train="train",
                                    validation="validation",
                                    source_names=["english", "french"],
                                    target_names=["french"],
                                    bs=batch_size
                                   )
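
As a rough illustration of what the Field arguments above request (lowercasing, tokenizing, and wrapping each sentence in start and end markers), the following plain-Python sketch mimics the effect on a single sentence. It uses whitespace splitting as a stand-in for SpacyTokenizer, and preprocess is a hypothetical helper, not a torchtext or quick-nlp function.

```python
from typing import List

INIT_TOKEN = "<sos>"
EOS_TOKEN = "<eos>"


def preprocess(sentence: str, lower: bool = True) -> List[str]:
    """Lowercase, split on whitespace, and add start/end markers.

    Illustrative only: a real pipeline would use a proper tokenizer.
    """
    if lower:
        sentence = sentence.lower()
    tokens = sentence.split()  # stand-in for a real tokenizer such as spaCy
    return [INIT_TOKEN] + tokens + [EOS_TOKEN]


print(preprocess("Ça alors !"))
# → ['<sos>', 'ça', 'alors', '!', '<eos>']
```

The start and end markers give the decoder of a seq2seq model an explicit signal for where to begin generating and when to stop.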

Finally, to train a seq2seq model with the data we only need to do:

emb_size = 300
nh = 1024
nl = 3
learner = data.get_model(opt_fn=SGD_Momentum(0.7), emb_sz=emb_size,
                         nhid=nh,
                         nlayers=nl,
                         bidir=True,
                        )
clip = 0.3
learner.reg_fn = seq2seq_reg  # regularizer imported above from fastai.lm_rnn
learner.clip = clip
learner.fit(2.0, wds=1e-6)
Owner
Agis pof