Persian Lexicon

This repo uses the Uppsala Persian Corpus (UPC) to construct a lexicon of 70,664 unique words. With all the excitement around the game Wordle, we also extracted words of different lengths (2, 3, 4, ..., 10) and stored them in separate files for easier access. Please note that these files might contain offensive words; I have not checked them manually.

GetWords.py can read these files and return words as a list of strings.
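As a rough illustration of what reading one of these files involves, here is a minimal sketch. It assumes one word per line in UTF-8; the function names `parse_lines` and `read_words` are hypothetical and are not taken from GetWords.py.

```python
from typing import Iterable, List


def parse_lines(lines: Iterable[str]) -> List[str]:
    """Strip surrounding whitespace and skip empty lines."""
    words = []
    for line in lines:
        word = line.strip()
        if word:
            words.append(word)
    return words


def read_words(path: str) -> List[str]:
    """Read a word file (assumed: one word per line, UTF-8 encoded)."""
    with open(path, encoding="utf-8") as handle:
        return parse_lines(handle)
```

Calling `read_words("data/persian-words.txt")` would then return the main lexicon as a list of strings, provided the file follows the assumed layout.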

Cleanup details

Main Lexicon

The main lexicon (data/persian-words.txt) is built very liberally; we only filter out words that contain ASCII characters or Arabic numerals.
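The liberal filter described above could look roughly like the sketch below. It is not the repo's actual code: the name `is_clean` is hypothetical, and "Arabic numerals" is interpreted here as any Unicode decimal digit (which covers 0-9 as well as Arabic-Indic and Persian digits), which is an assumption.

```python
import unicodedata


def is_clean(word: str) -> bool:
    """Return True if the word has no ASCII characters and no decimal digits."""
    if not word:
        return False
    for ch in word:
        # Any ASCII character (Latin letters, 0-9, punctuation) disqualifies the word.
        if ch.isascii():
            return False
        # Assumption: treat every Unicode decimal digit (category Nd) as a numeral.
        if unicodedata.category(ch) == "Nd":
            return False
    return True
```

Applying it is then a one-liner, for example `[w for w in words if is_clean(w)]`.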

Fixed length Lexicons

More conservative filtering has been applied to files with fixed word length. We drop all words that contain any of the following characters:

After applying these filters, we ended up with the following number of words per file:

2 letter words: 310 unique words
3 letter words: 2378 unique words
4 letter words: 7059 unique words
5 letter words: 10043 unique words
6 letter words: 9541 unique words
7 letter words: 7350 unique words
8 letter words: 4681 unique words
9 letter words: 2529 unique words
10 letter words: 1250 unique words
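For reference, splitting a word list into fixed-length groups can be sketched as follows. The function name `group_by_length` is hypothetical, and the sketch assumes that "letters" means Unicode code points as counted by `len()`; the repo may count length differently (for example, with respect to zero-width non-joiners or diacritics).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_length(words: Iterable[str], min_len: int = 2, max_len: int = 10) -> Dict[int, List[str]]:
    """Group unique words by length, keeping only lengths in [min_len, max_len]."""
    groups = defaultdict(set)
    for word in words:
        n = len(word)  # assumption: length is the number of code points
        if min_len <= n <= max_len:
            groups[n].add(word)  # a set removes duplicates
    return {n: sorted(ws) for n, ws in sorted(groups.items())}
```

The per-file counts listed above would correspond to `len(groups[n])` for each length `n`.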

Persian-lexicon - A lexicon of 70K unique Persian (Farsi) words

Owner

Saman Vaisipour
