A simple implementation of an N-gram language model.

Last update: Nov 24, 2021


Overview

About

A simple implementation of an N-gram language model.

Requirements

numpy

Data preparation

Corpus

The training data for the N-gram model is a plain-text file, for example:

曼联加油
懂球直播
有也免费高清的额
直播挺全的
曼联这局肯定胜利

During training, each text line is split into tokens by a delimiter. If no delimiter is given (the default), each line is split into individual characters.
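The splitting rule can be sketched as follows. This is a minimal illustration, not the code in train_n_gram.py: the function name split_line and its delimiter parameter are hypothetical, and dropping empty pieces is an assumption.

```python
def split_line(line, delimiter=None):
    """Split one corpus line into tokens.

    With no delimiter, every character becomes a token. Otherwise the
    line is split on the delimiter and empty pieces are dropped
    (dropping empties is an assumption, not confirmed by the project).
    """
    line = line.strip()
    if not delimiter:
        return list(line)
    return [piece for piece in line.split(delimiter) if piece]


chars = split_line("曼联加油")                    # character-level tokens
words = split_line("曼联 加油", delimiter=" ")    # delimiter-based tokens
print(chars)
print(words)
```

Character-level splitting suits Chinese text like the sample corpus, where no whitespace separates words.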

Tokens

The dictionary (vocabulary) for the model is a text file with one token per line. Each token appears only once in the file.

光
衰
戒
颅
阖
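A token file in this format could be loaded into a token-to-index mapping roughly like this. The helper build_vocab is hypothetical, and raising an error on duplicates is an assumption about how the uniqueness requirement might be enforced.

```python
def build_vocab(lines):
    """Map each token (one per line) to an integer index.

    Blank lines are skipped. A repeated token raises ValueError, since
    the token file is expected to list every token exactly once.
    """
    vocab = {}
    for line in lines:
        token = line.rstrip("\n")
        if not token:
            continue
        if token in vocab:
            raise ValueError(f"duplicate token: {token!r}")
        vocab[token] = len(vocab)
    return vocab


# The sample lines from the token file above, as they would be read from disk.
vocab = build_vocab(["光\n", "衰\n", "戒\n", "颅\n", "阖\n"])
print(vocab)
```

With a real file, the same function accepts an open file handle, for example open("data/charset.txt", encoding="utf-8").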

Training

Run the script train_n_gram.py to train an N-gram model.

python train_n_gram.py --corpus_path data/tieba.dialogues --token_path data/charset.txt --model_path data/2-gram.model --n 2
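Conceptually, training an N-gram model comes down to counting which token follows each (n-1)-token context. The sketch below shows that idea with the standard library only. It is not the project's implementation: count_ngrams is a hypothetical helper, the name head2tail is borrowed from the test log shown in the Testing section, and its structure here (context tuple mapped to a counter of next tokens) is an assumption.

```python
from collections import Counter, defaultdict


def count_ngrams(lines, n=2):
    """Count next-token occurrences for every (n-1)-token context.

    Lines are split into characters, matching the default behaviour
    when no delimiter is given. Requires n >= 2.
    """
    head2tail = defaultdict(Counter)
    for line in lines:
        tokens = list(line.strip())
        for i in range(len(tokens) - n + 1):
            head = tuple(tokens[i:i + n - 1])   # the n-1 context tokens
            tail = tokens[i + n - 1]            # the token that follows
            head2tail[head][tail] += 1
    return head2tail


corpus = ["曼联加油", "曼联这局肯定胜利"]
counts = count_ngrams(corpus, n=2)
print(counts[("曼",)])   # tokens seen after "曼", with their counts
```

Dividing each count by the total for its context gives the conditional probability estimate of the next token.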

Testing

Run the script test_n_gram.py to test the trained N-gram model.

python test_n_gram.py --token_path data/charset.txt --model_path data/2-gram.model --text 哈哈

The test output will look like this:

INFO - Loaded model from data/2-gram.model
INFO - Model info:
	n: 2
	head2tail length: 5947
	tokens: 5952
The most probable next token of the '哈哈' is '哈'.
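The prediction step can be illustrated as looking up the last n-1 tokens of the input and taking the most frequent follower. The function most_probable_next and the toy counts below are hypothetical and invented for illustration. They do not reproduce the trained model or its file format.

```python
from collections import Counter


def most_probable_next(head2tail, text, n=2):
    """Return (token, probability) for the likeliest next token, or None.

    The context is the last n-1 characters of the text. Requires n >= 2.
    """
    head = tuple(text[-(n - 1):])
    tails = head2tail.get(head)
    if not tails:
        return None   # context never seen in training
    token, count = tails.most_common(1)[0]
    return token, count / sum(tails.values())


# Toy counts, made up for this example: after "哈", "哈" was seen 3 times
# and "了" once.
toy_counts = {("哈",): Counter({"哈": 3, "了": 1})}
prediction = most_probable_next(toy_counts, "哈哈", n=2)
print(prediction)
```

For a 2-gram model only the final character of the input matters, so '哈哈' and '哈' yield the same prediction.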
