A multi-modal Chinese spell checker released at ACL 2021.

Last update: Dec 29, 2022

ReaLiSe

ReaLiSe is a multi-modal Chinese spell checking model.

This is the official code for the paper Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking.

The paper was accepted to Findings of ACL 2021.

Environment

Python: 3.6
CUDA: 10.0
Packages: pip install -r requirements.txt

Data

Raw Data

SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html
SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html
SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
Wang271K: https://github.com/wdimmy/Automatic-Corpus-Generation

Data Processing

The code and cleaned data are in the data_process directory.

You can also download the processed data directly from this link and put it in the data directory. The data directory should then look like this:

data
|- trainall.times2.pkl
|- test.sighan15.pkl
|- test.sighan15.lbl.tsv
|- test.sighan14.pkl
|- test.sighan14.lbl.tsv
|- test.sighan13.pkl
|- test.sighan13.lbl.tsv
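
Before training, it can help to confirm that the directory matches the listing above. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the file names are copied from the listing.

```python
from pathlib import Path
from typing import Iterable, List

# File names copied from the directory listing above.
EXPECTED_DATA_FILES = [
    "trainall.times2.pkl",
    "test.sighan15.pkl",
    "test.sighan15.lbl.tsv",
    "test.sighan14.pkl",
    "test.sighan14.lbl.tsv",
    "test.sighan13.pkl",
    "test.sighan13.lbl.tsv",
]


def missing_files(directory, expected: Iterable[str] = EXPECTED_DATA_FILES) -> List[str]:
    """Return the expected file names that are absent from `directory`.

    Hypothetical helper for illustration; it only checks that each file
    exists and does not inspect file contents.
    """
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("data")
    print("data directory is complete" if not missing else f"missing: {missing}")
```

An empty result means every file in the listing is present; anything else names the files that still need to be downloaded or generated.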

Pretrain

BERT: chinese-roberta-wwm-ext
  Huggingface hfl/chinese-roberta-wwm-ext: https://huggingface.co/hfl/chinese-roberta-wwm-ext
  Local: /data/dobby_ceph_ir/neutrali/pretrained_models/roberta-base-ch-for-csc/
Phonetic Encoder: pretrain_pho.sh
Graphic Encoder: pretrain_res.sh
Merge: merge.py

You can also download the pretrained and merged BERT, Phonetic Encoder, and Graphic Encoder directly from this link and put them in the pretrained directory:

pretrained
|- pytorch_model.bin
|- vocab.txt
|- config.json
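
A quick sanity check of the pretrained directory might look like the sketch below. It is an illustration, not the repository's code: `check_pretrained` is a hypothetical helper, and it assumes that config.json is ordinary JSON and that vocab.txt holds one token per line, as is usual for BERT-style vocabularies. It makes no assumption about which keys the config contains.

```python
import json
from pathlib import Path

# File names copied from the directory listing above.
PRETRAINED_FILES = ("pytorch_model.bin", "vocab.txt", "config.json")


def check_pretrained(directory):
    """Return (config dict, vocabulary line count) for a pretrained directory.

    Hypothetical helper: raises FileNotFoundError if a listed file is absent.
    Assumes config.json is plain JSON and vocab.txt has one token per line.
    """
    root = Path(directory)
    missing = [name for name in PRETRAINED_FILES if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing from {root}: {missing}")
    with open(root / "config.json", encoding="utf-8") as f:
        config = json.load(f)
    with open(root / "vocab.txt", encoding="utf-8") as f:
        vocab_lines = sum(1 for _ in f)
    return config, vocab_lines


if __name__ == "__main__":
    config, vocab_lines = check_pretrained("pretrained")
    print(sorted(config))  # list the config keys without assuming any of them
    print(vocab_lines)
```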

Train

After preparing the data and pretrained model, you can train ReaLiSe by executing the train.sh script. Note that you should set up the PRETRAINED_DIR, DATE_DIR, and OUTPUT_DIR in it.

sh train.sh

Test

Test ReaLiSe using the test.sh script. You should set up the DATE_DIR, CKPT_DIR, and OUTPUT_DIR in it. CKPT_DIR is the OUTPUT_DIR of the training process.

sh test.sh

Well-trained Model

You can also download the well-trained model from this link and use it directly. The performance of ReaLiSe and several baseline models on the SIGHAN13, SIGHAN14, and SIGHAN15 test sets is shown below:

Methods

FASpell: FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm
Soft-Masked BERT: Spelling Error Correction with Soft-Masked BERT
SpellGCN: SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check
BERT: Our implementation

Metrics

"D" means "Detection Level", "C" means "Correction Level".
"A", "P", "R", "F" mean "Accuracy", "Precision", "Recall", and "F1" respectively.
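
To make the eight column names concrete, here is a minimal sketch of one common sentence-level formulation of these metrics. It is an assumption, not the repository's evaluation script, which may differ in details. Here a sentence counts as a detection hit only when the set of changed positions exactly matches the set of erroneous positions, and as a correction hit only when the whole prediction also equals the target. The function names and the toy sentences are hypothetical, and source, target, and prediction are assumed to have equal length.

```python
from typing import Dict, List, Tuple


def prf(tp: int, n_pred: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from counts; 0.0 when a denominator is empty."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


def sentence_level_scores(triples: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Score (source, target, prediction) triples of equal-length strings.

    Assumed formulation (not taken from the repository): all counts are
    per sentence, not per character.
    """
    n = len(triples)
    det_hit = cor_hit = det_tp = cor_tp = n_pred = n_gold = 0
    for src, tgt, pred in triples:
        gold_pos = {i for i, (s, t) in enumerate(zip(src, tgt)) if s != t}
        pred_pos = {i for i, (s, o) in enumerate(zip(src, pred)) if s != o}
        n_gold += bool(gold_pos)         # sentences that contain an error
        n_pred += bool(pred_pos)         # sentences the model changed
        det_hit += gold_pos == pred_pos  # counts untouched clean sentences too
        cor_hit += pred == tgt
        if gold_pos and gold_pos == pred_pos:
            det_tp += 1
            cor_tp += pred == tgt
    d_p, d_r, d_f = prf(det_tp, n_pred, n_gold)
    c_p, c_r, c_f = prf(cor_tp, n_pred, n_gold)
    return {
        "D-A": det_hit / n if n else 0.0, "D-P": d_p, "D-R": d_r, "D-F": d_f,
        "C-A": cor_hit / n if n else 0.0, "C-P": c_p, "C-R": c_r, "C-F": c_f,
    }


# Toy examples (made up for illustration): corrected, clean, wrongly
# corrected, and missed.
EXAMPLES = [
    ("我喜欢吃平果", "我喜欢吃苹果", "我喜欢吃苹果"),
    ("今天天气很好", "今天天气很好", "今天天气很好"),
    ("他在学校读输", "他在学校读书", "他在学校读树"),
    ("我门去公园", "我们去公园", "我门去公园"),
]

if __name__ == "__main__":
    for name, value in sentence_level_scores(EXAMPLES).items():
        print(f"{name}\t{100 * value:.1f}")
```

The third toy sentence shows why the correction scores can never exceed the detection scores under this formulation: the error position is found, but the replacement character is wrong, so it counts for "D" and not for "C".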

SIGHAN15

Method	D-A	D-P	D-R	D-F	C-A	C-P	C-R	C-F
FASpell	74.2	67.6	60.0	63.5	73.7	66.6	59.1	62.6
Soft-Masked BERT	80.9	73.7	73.2	73.5	77.4	66.7	66.2	66.4
SpellGCN	-	74.8	80.7	77.7	-	72.1	77.7	75.9
BERT	82.4	74.2	78.0	76.1	81.0	71.6	75.3	73.4
ReaLiSe	84.7	77.3	81.3	79.3	84.0	75.9	79.9	77.8

SIGHAN14

Method	D-A	D-P	D-R	D-F	C-A	C-P	C-R	C-F
Pointer Network	-	63.2	82.5	71.6	-	79.3	68.9	73.7
SpellGCN	-	65.1	69.5	67.2	-	63.1	67.2	65.3
BERT	75.7	64.5	68.6	66.5	74.6	62.4	66.3	64.3
ReaLiSe	78.4	67.8	71.5	69.6	77.7	66.3	70.0	68.1

SIGHAN13

Method	D-A	D-P	D-R	D-F	C-A	C-P	C-R	C-F
FASpell	63.1	76.2	63.2	69.1	60.5	73.1	60.5	66.2
SpellGCN	78.8	85.7	78.8	82.1	77.8	84.6	77.8	81.0
BERT	77.0	85.0	77.0	80.8	77.4	83.0	75.2	78.9
ReaLiSe	82.7	88.6	82.5	85.4	81.4	87.2	81.2	84.1
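
As a worked check of how the F columns relate to the P and R columns, the sketch below recomputes F1 = 2PR / (P + R) for the ReaLiSe rows of the three tables. The numbers are copied from the tables; the names `REALISE_ROWS`, `f1`, and `max_f1_gap` are hypothetical.

```python
# ReaLiSe rows copied from the tables above:
# dataset -> (D-P, D-R, D-F, C-P, C-R, C-F)
REALISE_ROWS = {
    "SIGHAN15": (77.3, 81.3, 79.3, 75.9, 79.9, 77.8),
    "SIGHAN14": (67.8, 71.5, 69.6, 66.3, 70.0, 68.1),
    "SIGHAN13": (88.6, 82.5, 85.4, 87.2, 81.2, 84.1),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def max_f1_gap(rows) -> float:
    """Largest absolute gap between a reported F and the F recomputed from P and R."""
    gaps = []
    for d_p, d_r, d_f, c_p, c_r, c_f in rows.values():
        gaps.append(abs(f1(d_p, d_r) - d_f))
        gaps.append(abs(f1(c_p, c_r) - c_f))
    return max(gaps)


if __name__ == "__main__":
    for name, (d_p, d_r, d_f, c_p, c_r, c_f) in REALISE_ROWS.items():
        print(f"{name}: D-F {f1(d_p, d_r):.2f} (reported {d_f}), "
              f"C-F {f1(c_p, c_r):.2f} (reported {c_f})")
```

For these rows the recomputed values agree with the reported ones to within 0.1 point, which is the slack expected when P and R are themselves rounded to one decimal.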

Citation

@misc{xu2021read,
      title={Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking}, 
      author={Heng-Da Xu and Zhongli Li and Qingyu Zhou and Chao Li and Zizhen Wang and Yunbo Cao and Heyan Huang and Xian-Ling Mao},
      year={2021},
      eprint={2105.12306},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
