OCR Post Correction for Endangered Language Texts

Overview

πŸ“Œ Coming soon: an update to the software including features from our paper on semi-supervised OCR post-correction, to be published in the Transactions of the Association for Computational Linguistics (TACL)!

Check out the paper here.

OCR Post Correction for Endangered Language Texts

This repository contains code for models and experiments from the paper "OCR Post Correction for Endangered Language Texts".

Textual data in endangered languages is often found in formats that are not machine-readable, including scanned images of paper books. Extracting the text is challenging because there is typically no annotated data to train an OCR system for each endangered language. Instead, we focus on post-correcting the OCR output from a general-purpose OCR system.

πŸ“Œ In the paper, we present a dataset containing annotations for documents in three critically endangered languages: Ainu, Griko, Yakkha.

πŸ“Œ Our model reduces the recognition error rate by 34% on average, over a state-of-the-art OCR system.

Learn more about the paper here!

OCR Post-Correction

The goal of OCR post-correction is to automatically correct errors in the text output from an existing OCR system.

The existing OCR system is used to obtain a first pass transcription of the input image (example below in the endangered language Griko):

First pass OCR transcription

The incorrectly recognized characters in the first pass are then corrected by the post-correction model.

Corrected transcription

Model

As seen in the example above, OCR post-correction is a text-based sequence-to-sequence task.

πŸ“Œ We use a character-level encoder-decoder architecture with attention and add several adaptations for the low-resource setting. The paper has all the details!

πŸ“Œ The model is trained in a supervised manner. The training data consists of first pass OCR outputs as the source with corresponding manually corrected transcriptions as the target.

πŸ“Œ Some books that contain texts in endangered languages also contain translations of the text in another (usually high-resource) language. We incorporate an additional encoder in the model, with a multisource framework, to use the information from these translations if they are available.

We provide instructions for both single-source and multisource models:

  • The single-source model can be used for almost any document and is significantly easier to set up.

  • The multisource model can only be used if translations are available.

Dataset

This repository contains a sample from our dataset in sample_dataset, which you can use to train the post-correction model. Get the full dataset here!

However, this repository can be used to train OCR post-correction models for documents in any language!

πŸš€ If you want to use our model with a new set of documents, construct a dataset by following the steps here.

πŸš€ We'd love to hear about the new datasets and models you build: send us an email at [email protected]!

Running Experiments

Once you have a suitable dataset (e.g., sample_dataset or your own dataset), you can train a model and run experiments on OCR post-correction.

If you have your own dataset, you can use the utils/prepare_data.py script to create train, development, and test splits (see the last step here).

The steps are described below, illustrated with sample_dataset/postcorrection. If using another dataset, simply change the experiment settings to point to your dataset and run the same scripts.

Requirements

Python 3+ is required. Pip can be used to install the packages:

pip install -r postcorr_requirements.txt

Training

The process of training the post-correction model has two main steps:

  • Pretraining with first pass OCR outputs.
  • Training with manually corrected transcriptions in a supervised manner.

For a single-source model, modify the experimental settings in train_single-source.sh to point to the appropriate dataset and desired output folder. It is currently set up to use sample_dataset.

Then run

bash train_single-source.sh

For multisource, use train_multi-source.sh.

Log files and saved models are written to the user-specified experiment folder for both the pretraining and training steps. For a list of all available hyperparameters and options, look at postcorrection/constants.py and postcorrection/opts.py.

Testing

For testing with a single-source model, modify the experimental settings in test_single-source.sh. It is currently set up to use sample_dataset.

Then run

bash test_single-source.sh

For multisource, use test_multi-source.sh.

Citation

Please cite our paper if this repository was useful.

@inproceedings{rijhwani-etal-2020-ocr,
    title = "{OCR} {P}ost {C}orrection for {E}ndangered {L}anguage {T}exts",
    author = "Rijhwani, Shruti  and
      Anastasopoulos, Antonios  and
      Neubig, Graham",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.478",
    doi = "10.18653/v1/2020.emnlp-main.478",
    pages = "5931--5942",
}

License

Owner
Shruti Rijhwani
Ph.D. student at CMU, working on natural language processing.
Shruti Rijhwani
This repository is a series of notebooks that show solutions for the projects at Dataquest.io.

Dataquest Project Solutions This repository is a series of notebooks that show solutions for the projects at Dataquest.io. Of course, there are always

Dataquest 1.1k Dec 30, 2022
PyTorch implementation of a Real-ESRGAN model trained on custom dataset

Real-ESRGAN PyTorch implementation of a Real-ESRGAN model trained on custom dataset. This model shows better results on faces compared to the original

Sber AI 160 Jan 04, 2023
A rule learning algorithm for the deduction of syndrome definitions from time series data.

README This project provides a rule learning algorithm for the deduction of syndrome definitions from time series data. Large parts of the algorithm a

0 Sep 24, 2021
PAWS 🐾 Predicting View-Assignments with Support Samples

This repo provides a PyTorch implementation of PAWS (predicting view assignments with support samples), as described in the paper Semi-Supervised Learning of Visual Features by Non-Parametrically Pre

Facebook Research 437 Dec 23, 2022
Locationinfo - A script helps the user to show network information such as ip address

Description This script helps the user to show network information such as ip ad

Roxcoder 1 Dec 30, 2021
A parallel framework for population-based multi-agent reinforcement learning.

MALib: A parallel framework for population-based multi-agent reinforcement learning MALib is a parallel framework of population-based learning nested

MARL @ SJTU 348 Jan 08, 2023
A curated list of resources for Image and Video Deblurring

A curated list of resources for Image and Video Deblurring

Subeesh Vasu 1.7k Jan 01, 2023
Federated_learning codes used for the the paper "Evaluation of Federated Learning Aggregation Algorithms" and "A Federated Learning Aggregation Algorithm for Pervasive Computing: Evaluation and Comparison"

Federated Distance (FedDist) This is the code accompanying the Percom2021 paper "A Federated Learning Aggregation Algorithm for Pervasive Computing: E

GETALP 8 Jan 03, 2023
ChainerRL is a deep reinforcement learning library built on top of Chainer.

ChainerRL and PFRL ChainerRL (this repository) is a deep reinforcement learning library that implements various state-of-the-art deep reinforcement al

Chainer 1.1k Jan 01, 2023
CVPR 2021 Challenge on Super-Resolution Space

Learning the Super-Resolution Space Challenge NTIRE 2021 at CVPR Learning the Super-Resolution Space challenge is held as a part of the 6th edition of

andreas 104 Oct 26, 2022
Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency[ECCV 2020]

Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency(ECCV 2020) This is an official python implementati

304 Jan 03, 2023
AdaFocus (ICCV 2021) Adaptive Focus for Efficient Video Recognition

AdaFocus (ICCV 2021) This repo contains the official code and pre-trained models for AdaFocus. Adaptive Focus for Efficient Video Recognition Referenc

Rainforest Wang 115 Dec 21, 2022
Quantile Regression DQN a Minimal Working Example, Distributional Reinforcement Learning with Quantile Regression

Quantile Regression DQN Quantile Regression DQN a Minimal Working Example, Distributional Reinforcement Learning with Quantile Regression (https://arx

Arsenii Senya Ashukha 80 Sep 17, 2022
[CVPR-2021] UnrealPerson: An adaptive pipeline for costless person re-identification

UnrealPerson: An Adaptive Pipeline for Costless Person Re-identification In our paper (arxiv), we propose a novel pipeline, UnrealPerson, that decreas

ZhangTianyu 70 Oct 10, 2022
Facebook AI Image Similarity Challenge: Descriptor Track

Facebook AI Image Similarity Challenge: Descriptor Track This repository contains the code for our solution to the Facebook AI Image Similarity Challe

Sergio MP 17 Dec 14, 2022
ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees

ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees This repository is the official implementation of the empirica

Kuan-Lin (Jason) Chen 2 Oct 02, 2022
Official implementation of "One-Shot Voice Conversion with Weight Adaptive Instance Normalization".

One-Shot Voice Conversion with Weight Adaptive Instance Normalization By Shengjie Huang, Yanyan Xu*, Dengfeng Ke*, Mingjie Chen, Thomas Hain. This rep

31 Dec 07, 2022
Code for Overinterpretation paper Overinterpretation reveals image classification model pathologies

Overinterpretation This repository contains the code for the paper: Overinterpretation reveals image classification model pathologies Authors: Brandon

Gifford Lab, MIT CSAIL 17 Dec 10, 2022
N-RPG - Novel role playing game da turfu

N-RPG Ce README sera la page de garde du projet. Contenu Il contiendra la prΓ©sen

4 Mar 15, 2022
Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features

MediumVC MediumVC is an utterance-level method towards any-to-any VC. Before that, we propose SingleVC to perform A2O tasks(Xi β†’ YΜ‚i) , Xi means utter

谷下雨 47 Dec 25, 2022