Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Last update: Oct 28, 2021

Overview

Quick and Dirty OCR of Facebook Papers

Gizmodo has been working through the Facebook Papers and releasing the documents as they process and review them.

As luck would have it, I had some ugly but functional code lying around that does a first OCR pass on these docs. That code is in the pdf_to_image.py script. I'd welcome improvements to the code, especially to the image cleanup prior to OCR (approximately lines 92-97). I experimented with cleaning up the images via PIL and cv2, but the results were less accurate, almost certainly due to my lack of familiarity with both libraries.
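For anyone who wants to experiment with the cleanup step, here is a minimal sketch of one common pre-OCR technique: binarizing a grayscale image with Otsu's threshold. This is not the code in pdf_to_image.py, and I have not verified that it improves results on these particular docs. It is written in plain Python over a flat list of 8-bit grayscale values so the logic is easy to follow; the function names `otsu_threshold` and `binarize` are made up for this illustration.

```python
def otsu_threshold(pixels):
    """Return Otsu's threshold for a sequence of 8-bit grayscale values (0-255)."""
    # Build a 256-bin histogram of the pixel intensities.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    if total == 0:
        raise ValueError("no pixels given")

    sum_all = sum(i * count for i, count in enumerate(hist))
    weight_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0         # sum of intensities at or below the candidate threshold
    best_t = 0
    best_var = -1.0

    for t in range(256):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue           # nothing in the dark class yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break              # everything is in the dark class; stop
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor); Otsu maximizes this.
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var = var_between
            best_t = t
    return best_t


def binarize(pixels, threshold):
    """Map values at or below the threshold to black (0), the rest to white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Toy "image": dark text pixels (10) on a light background (200).
    sample = [10] * 50 + [200] * 50
    t = otsu_threshold(sample)
    print("threshold:", t)
    print("black pixels:", binarize(sample, t).count(0))
```

In practice the grayscale values would come from an image library rather than a hand-built list. Photos of screens often have uneven lighting, so a single global threshold like this may well do worse than no cleanup at all.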

These Facebook Papers are especially challenging from an OCR perspective because many of them are pictures taken of a screen, so the base image quality isn't especially good. Because of this, not every document can be processed cleanly, and the documents that do get processed have some cruft in them.

With that said, the text pulled from these files simplifies the process of parsing through a large amount of data for keywords.
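As a rough illustration of that kind of keyword pass, the sketch below scans a directory of OCR'd text files and reports the lines where each keyword appears. It assumes the output is a folder of UTF-8 `.txt` files; the directory name `ocr_output`, the sample keywords, and the function name `find_keywords` are hypothetical. Matching is a simple case-insensitive substring check, so OCR errors inside a word will cause misses.

```python
from pathlib import Path


def find_keywords(text_dir, keywords):
    """Return {keyword: [(filename, line_number, line_text), ...]} for all .txt files."""
    lowered = [k.lower() for k in keywords]
    hits = {}
    for path in sorted(Path(text_dir).glob("*.txt")):
        # errors="replace" keeps going even if the OCR output has odd bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            low = line.lower()
            for kw in lowered:
                if kw in low:
                    hits.setdefault(kw, []).append((path.name, line_no, line.strip()))
    return hits


if __name__ == "__main__":
    # "ocr_output" is a placeholder; point this at wherever the text files live.
    results = find_keywords("ocr_output", ["misinformation", "ranking"])
    for keyword, matches in results.items():
        print(f"{keyword}: {len(matches)} matching line(s)")
        for filename, line_no, line in matches:
            print(f"  {filename}:{line_no}: {line}")
```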

Other (Better) Options

This OCR should be seen as a first step. Text files are generally a decent starting point because they allow for a wide range of follow-on analysis.

That said, other, better options exist. For a comprehensive, self-contained analysis, those options will almost certainly be a better choice.

Want to help?

If you want to collaborate on this project, let me know!

Owner

Bill Fitzgerald
