A fast and easy implementation of Transformer with PyTorch.

Overview

FasySeq

FasySeq is a shorthand as a Fast and easy sequential modeling toolkit. It aims to provide a seq2seq model to researchers and developers, which can be trained efficiently and modified easily. This toolkit is based on Transformer(Vaswani et al.), and will add more seq2seq models in the future.

Dependency

PyTorch >= 1.4
NLTK

Result

...

Structure

...

To Be Updated

  • top-k and top-p sampling
  • multi-GPU inference
  • length penalty in beam search
  • ...

Preprocess

Build Vocabulary

createVocab.py

NamedArguments Description
-f/--file The files used to build the vocabulary.
Type: List
--vocab_num The maximum size of vocabulary, the excess word will be discard according to the frequency.
Type: Int Default: -1
--min_freq The minimum frequency of token in vocabulary. The word with frequency less than min_freq will be discard.
Type: Int Default: 0
--lower Whether to convert all words to lowercase
--save_path The path to save voacbulary.
Type: str

Process Data

preprocess.py

NamedArguments Description
--source The path of source file.
Type: str
[--target] The path of target file.
Type: str
--src_vocab The path of source vocabulary.
Type: str
[--tgt_vocab] The path of target vocabulary.
Type: str
--save_path The path to save the processed data.
Type: str

Train

train.py

NamedArguments Description
Model -
--share_embed Source and target share the same vocabulary and word embedding. The max position of embedding is max(max_src_position, max_tgt_position) if the model employ share embedding.
--max_src_position The maximum source position, all src-tgt pairs which source sentences' lenght are greater than max_src_position will be cut or discard. If max_src_position > max source length, it wil be set to max source length.
Type: Int Default: inf
--max_tgt_position The maximum target position, all src_tgt pairs which target sentences' length are greater than max_tgt_position will be cut or discard. If max_tgt_position > max target length, it wil be set to max target length.
Type: Int Default: inf
--position_method The method to introduce positional information.
Option: encoding/embedding
--normalize_before Leveraging before layer normalization. See Xiong et al.
Checkpoint -
--checkpoint_path The path to save checkpoint file.
Type: str Default: None
--restore_file The checkpoint file to be loaded.
Type: str Default: None
--checkpoint_num Save the nearest checkpoint_num breakpoint.
Type: Int Default: inf
Data -
--vocab Vocabulary path. If you use share embedding, the vocabulary will be loaded from this path.
Type: str Default: None
--src_vocab Source vocabulary path.
Type: str Default: None
--tgt_vocab Target vocabulary path.
Type: str Default: None
--file The training data file.
Type: str
--max_tokens The maximum tokens in each batch.
Type: Int Default: 1000
--discard_invalid_data The data which length of source or data is more than maximum position will be discard if use this option, otherwise the long sentences will be cut into max position.
Train -
--cuda_num The device's ID of GPU.
Type: List
--grad_accumulate The num of gradient accumulate.
Type: Int Default: 1
--epoch The total epoch to train.
Type: Int Default: inf
--batch_print_info The number of batch to print training information.
Type: Int Default: 1000

Inference

generator.py

NamedArguments Description
--cuda_num The device's ID of GPU.
Type: List
--file The inference data file which has been processed.
Type: str
--raw_file The raw inference data file, and will be preprocessed before generated.
Type: str
--ref_file The reference file.
Type: str
--max_length
--max_alpha
--max_add_token
Maximum generated length = min(max_length, max_alpha * max_src_len, max_add_token + max_src_token)
Type: Int Default: inf
--max_tokens The maximum tokens in each batch.
Type: Int Default: 1000
--src_vocab Source vocabulary path.
Type: str Default: None
--tgt_vocab Target vocabulary path.
Type: str Default: None
--vocab Vocabulary path. If you use share embedding, the vocabulary will be loaded from this path.
Type: str Default: None
--model_path The path of pre-trained model.
Type: str
--output_path The path of output. the result will be saved into output_path/result.txt.
Type: str
--decode_method The decode method.
Option:greedy/beam
--beam Beam size.
Type: Int Default: 5

Postpreposs

avg_param.py

The average parameter code we employed is the same as fairseq.

License

FasySeq(-py) is Apache-2.0 License. The license applies to the pre-trained models as well.

You might also like...
Fast, general, and tested differentiable structured prediction in PyTorch
Fast, general, and tested differentiable structured prediction in PyTorch

Torch-Struct: Structured Prediction Library A library of tested, GPU implementations of core structured prediction algorithms for deep learning applic

A Word Level Transformer layer based on PyTorch and 🤗 Transformers.

Transformer Embedder A Word Level Transformer layer based on PyTorch and 🤗 Transformers. How to use Install the library from PyPI: pip install transf

Reformer, the efficient Transformer, in Pytorch
Reformer, the efficient Transformer, in Pytorch

Reformer, the Efficient Transformer, in Pytorch This is a Pytorch implementation of Reformer https://openreview.net/pdf?id=rkgNKkHtvB It includes LSH

An implementation of WaveNet with fast generation

pytorch-wavenet This is an implementation of the WaveNet architecture, as described in the original paper. Features Automatic creation of a dataset (t

Google's Meena transformer chatbot implementation
Google's Meena transformer chatbot implementation

Here's my attempt at recreating Meena, a state of the art chatbot developed by Google Research and described in the paper Towards a Human-like Open-Domain Chatbot.

Free and Open Source Machine Translation API. 100% self-hosted, offline capable and easy to setup.
Free and Open Source Machine Translation API. 100% self-hosted, offline capable and easy to setup.

LibreTranslate Try it online! | API Docs | Community Forum Free and Open Source Machine Translation API, entirely self-hosted. Unlike other APIs, it d

An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text This repo aims at providing an easy to use and efficient code for extracting image &

xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building blocks.
xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building blocks.

Description xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building bl

Owner
宁羽
宁羽
NLP command-line assistant powered by OpenAI

NLP command-line assistant powered by OpenAI

Axel 16 Dec 09, 2022
Entity Disambiguation as text extraction (ACL 2022)

ExtEnD: Extractive Entity Disambiguation This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Dis

Sapienza NLP group 121 Jan 03, 2023
Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation.

Covid-19-BOT Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation. This bot uses torc

Neeraj Majhi 2 Nov 05, 2021
Understand Text Summarization and create your own summarizer in python

Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent

Sreekanth M 1 Oct 18, 2022
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 06, 2022
End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

Image captioning End-to-end image captioning with EfficientNet-b3 + LSTM with Attention Model is seq2seq model. In the encoder pretrained EfficientNet

2 Feb 10, 2022
End-to-end MLOps pipeline of a BERT model for emotion classification.

image source EmoBERT-MLOps The goal of this repository is to build an end-to-end MLOps pipeline based on the MLOps course from Made with ML, but this

Dimitre Oliveira 4 Nov 06, 2022
It analyze the sentiment of the user, whether it is postive or negative.

Sentiment-Analyzer-Tool It analyze the sentiment of the user, whether it is postive or negative. It uses streamlit library for creating this sentiment

Paras Patidar 18 Dec 17, 2022
What are the best Systems? New Perspectives on NLP Benchmarking

What are the best Systems? New Perspectives on NLP Benchmarking In Machine Learning, a benchmark refers to an ensemble of datasets associated with one

Pierre Colombo 12 Nov 03, 2022
TPlinker for NER 中文/英文命名实体识别

本项目是参考 TPLinker 中HandshakingTagging思想,将TPLinker由原来的关系抽取(RE)模型修改为命名实体识别(NER)模型。

GodK 113 Dec 28, 2022
Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models

Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Prithivida 681 Jan 01, 2023
KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

Kakao Brain 797 Dec 26, 2022
Just Another Telegram Ai Chat Bot Written In Python With Pyrogram.

OkaeriChatBot Just another Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher.

Wahyusaputra 2 Dec 23, 2021
Automated question generation and question answering from Turkish texts using text-to-text transformers

Turkish Question Generation Offical source code for "Automated question generation & question answering from Turkish texts using text-to-text transfor

Open Business Software Solutions 29 Dec 14, 2022
Shared, streaming Python dict

UltraDict Sychronized, streaming Python dictionary that uses shared memory as a backend Warning: This is an early hack. There are only few unit tests

Ronny Rentner 192 Dec 23, 2022
Unsupervised Abstract Reasoning for Raven’s Problem Matrices

Unsupervised Abstract Reasoning for Raven’s Problem Matrices This code is the implementation of our TIP paper. This is the first unsupervised abstract

Tao Zhuo 9 Dec 17, 2022
Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Chenyang Huang 37 Jan 04, 2023
Generate text line images for training deep learning OCR model (e.g. CRNN)

Generate text line images for training deep learning OCR model (e.g. CRNN)

532 Jan 06, 2023
Phomber is infomation grathering tool that reverse search phone numbers and get their details, written in python3.

A Infomation Grathering tool that reverse search phone numbers and get their details ! What is phomber? Phomber is one of the best tools available fo

S41R4J 121 Dec 27, 2022
Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

KR-BERT-SimCSE Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT. Training Unsupervised python train_unsupervised.py --mi

Jeong Ukjae 27 Dec 12, 2022