Table Extraction Tool

Overview

Tree Structure - Table Extraction

Fonduer has been successfully extended to perform information extraction from richly formatted data such as tables. A crucial step in this process is the construction of the hierarchical tree of context objects such as text blocks, figures, tables, etc. The system currently uses PDF to HTML conversion provided by Adobe Acrobat converter. Adobe Acrobat converter is not an open source tool and this can be very inconvenient for Fonduer users. We therefore need to build our own module as replacement to Adobe Acrobat. Several open source tools are available for pdf to html conversion but these tools do not preserve the cell structure in a table. Our goal in this project is to develop a tool that extracts text, figures and tables in a pdf document and maintains the structure of the document using a tree data structure.

This project is using the table-extraction tool (https://github.com/xiao-cheng/table-extraction).

Dependencies

pip install -r requirements.txt

Environment variables

First, set environment variables. The DATAPATH folder should contain the pdf files that need to be processed.

source set_env.sh

Tutorial

The table-extraction/tutorials/ folder contains a notebook table-extraction-demo.ipynb. In this demo we detail the different steps of the table extraction tool and display some examples of table detection results for paleo papers. However, to extract tables for new documents, the user should directly use the command line tool detailed in the next section.

Command Line Usage

To use the tool via command line, run:

source set_env.sh

python table-extraction/ml/extract_tables.py [-h]

usage: extract_tables.py [-h] [--mode MODE] [--train-pdf TRAIN_PDF]
                         [--test-pdf TEST_PDF] [--gt-train GT_TRAIN]
                         [--gt-test GT_TEST] [--model-path MODEL_PATH]
                         [--iou-thresh IOU_THRESH]

Script to extract tables bounding boxes from PDF files using a machine
learning approach. if model.pkl is saved in the model-path, the pickled model
will be used for prediction. Otherwise the model will be retrained. If --mode
is test (by default), the script will create a .bbox file containing the
tables for the pdf documents listed in the file --test-pdf. If --mode is dev,
the script will also extract ground truth labels fot the test data and compute
some statistics. To run the script on new documents, specify the path to the
list of pdf to analyze using the argument --test-pdf. Those files must be
saved in the DATAPATH folder.

optional arguments:
  -h, --help            show this help message and exit
  --mode MODE           usage mode dev or test, default is test
  --train-pdf TRAIN_PDF
                        list of pdf file names used for training. Those files
                        must be saved in the DATAPATH folder (cf set_env.sh)
                        must be saved in the DATAPATH folder (cf set_env.sh)
  --test-pdf TEST_PDF   list of pdf file names used for testing. Those files
                        must be saved in the DATAPATH folder (cf set_env.sh)
  --gt-train GT_TRAIN   ground truth train tables
  --gt-test GT_TEST     ground truth test tables
  --model-path MODEL_PATH
                        pretrained model
  --iou-thresh IOU_THRESH
                        intersection over union threshold to remove duplicate
                        tables

Each document must be saved in the DATAPATH folder.

The script will create a .bbox file where each row contains tables coordinates of the corresponding row document in the --test_pdf file.

The bounding boxes are stored in the format (page_num, page_width, page_height, top, left, bottom, right) and are separated with ";".

Evaluation

We provide an evaluation code to compute recall, precision and F1 score at the character level.

python table-extraction/evaluation/char_level_evaluation.py [-h] pdf_files extracted_bbox gt_bbox

usage: char_level_evaluation.py [-h] pdf_files extracted_bbox gt_bbox

Computes scores for the table localization task. Returns Recall and Precision
for the sub-objects level (characters in text). If DISPLAY=TRUE, display GT in
Red and extracted bboxes in Blue

positional arguments:
  pdf_files       list of paths of PDF file to process
  extracted_bbox  extracting bounding boxes (one line per pdf file)
  gt_bbox         ground truth bounding boxes (one line per pdf file)

optional arguments:
  -h, --help      show this help message and exit
Owner
HazyResearch
We are a CS research group led by Prof. Chris Ré.
HazyResearch
EQFace: An implementation of EQFace: A Simple Explicit Quality Network for Face Recognition

EQFace: A Simple Explicit Quality Network for Face Recognition The first face recognition network that generates explicit face quality online.

DeepCam Shenzhen 141 Dec 31, 2022
Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

doc2text doc2text extracts higher quality text by fixing common scan errors Developing text corpora can be a massive pain in the butt. Much of the tex

Joe Sutherland 1.3k Jan 04, 2023
A python screen recorder for low-end computers, provides high quality video output.

RecorderX - v1.0 A screen recorder made in Python with the help of OpenCv, it has ability to record your screen in high quality. No matter what your P

Priyanshu Jindal 4 Nov 10, 2021
A webcam-based 3x3x3 rubik's cube solver written in Python 3 and OpenCV.

Qbr Qbr, pronounced as Cuber, is a webcam-based 3x3x3 rubik's cube solver written in Python 3 and OpenCV. 🌈 Accurate color detection 🔍 Accurate 3x3x

Kim 金可明 502 Dec 29, 2022
Document Layout Analysis Projects

Layout_Analysis Introduction This is an implementation of RLSA and X-Y Cut with OpenCV Dependencies OpenCV 3.0+ How to use Compile with g++ : g++ -std

22 Dec 08, 2022
Demo for the paper "Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation"

Streaming speaker diarization Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation by Juan Manuel Coria, Hervé

Juanma Coria 185 Jan 01, 2023
Course material for the Multi-agents and computer graphics course

TC2008B Course material for the Multi-agents and computer graphics course. Setup instructions Strongly recommend using a custom conda environment. Ins

16 Dec 13, 2022
Drowsiness Detection and Alert System

A countless number of people drive on the highway day and night. Taxi drivers, bus drivers, truck drivers, and people traveling long-distance suffer from lack of sleep.

Astitva Veer Garg 4 Aug 01, 2022
TextField: Learning A Deep Direction Field for Irregular Scene Text Detection (TIP 2019)

TextField: Learning A Deep Direction Field for Irregular Scene Text Detection Introduction The code and trained models of: TextField: Learning A Deep

Yukang Wang 101 Dec 12, 2022
This project modify tensorflow object detection api code to predict oriented bounding boxes. It can be used for scene text detection.

This is an oriented object detector based on tensorflow object detection API. Most of the code is not changed except for those related to the need of

Dafang He 30 Oct 22, 2022
Connect Aseprite to Blender for painting pixelart textures in real time

Pribambase Pribambase is a small tool that connects Aseprite and Blender, to allow painting with instant viewport feedback and all functionality of ex

117 Jan 03, 2023
This project proposes a camera vision based cursor control system, using hand moment captured from a webcam through a landmarks of hand by using Mideapipe module

This project proposes a camera vision based cursor control system, using hand moment captured from a webcam through a landmarks of hand by using Mideapipe module

Chandru 2 Feb 20, 2022
An unofficial implementation of the paper "AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss".

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss This is an unofficial implementation of AutoVC based on the official one. The reposi

Chien-yu Huang 27 Jun 16, 2022
Open Source Differentiable Computer Vision Library for PyTorch

Kornia is a differentiable computer vision library for PyTorch. It consists of a set of routines and differentiable modules to solve generic computer

kornia 7.6k Jan 04, 2023
Learn computer graphics by writing GPU shaders!

This repo contains a selection of projects designed to help you learn the basics of computer graphics. We'll be writing shaders to render interactive two-dimensional and three-dimensional scenes.

Eric Zhang 1.9k Jan 02, 2023
Forked from argman/EAST for the ICPR MTWI 2018 CHALLENGE

EAST_ICPR: EAST for ICPR MTWI 2018 CHALLENGE Introduction This is a repository forked from argman/EAST for the ICPR MTWI 2018 CHALLENGE. Origin Reposi

Haozheng Li 157 Aug 23, 2022
Text recognition (optical character recognition) with deep learning methods.

What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis | paper | training and evaluation data | failure cases and cle

Clova AI Research 3.2k Jan 04, 2023
Detect and fix skew in images containing text

Alyn Skew detection and correction in images containing text Image with skew Image after deskew Install and use via pip! Recommended way(using virtual

Kakul 230 Dec 21, 2022
A simple Digits Recogniser made in Python

⭐ Python Digit Recogniser A simple digit Recogniser made in Python Demo Run Locally Clone the project git clone https://github.com/yashraj-n/python-

Yashraj narke 4 Nov 29, 2021
Image augmentation library in Python for machine learning.

Augmentor is an image augmentation library in Python for machine learning. It aims to be a standalone library that is platform and framework independe

Marcus D. Bloice 4.8k Jan 04, 2023