Convert PDF/Image to TXT using EasyOCR - the best OCR engine available!

Overview


PDFImage2TXT - DOWNLOAD INSTALLER HERE

What can you do with it?

  • Convert scanned PDFs to TXT.
  • Convert scanned Documents to TXT.
  • No coding required!!
  • Installer for Windows
  • Source code included!
  • MIT license

How to install/run?

Tutorial and things you have to know:

,.-''-.,_,.' Step 1 '.,_,.-''-., Screenshot
Defining the problem:
As a German teacher in Brazil, I often have to read texts about
philosophy, politics, medicine, etc. with my students. Nowadays, PDF is
the most common file format for sharing text because of its simplicity.

I distinguish between 3 types of PDFs:

1) The “text-is-text-format”:
The text we see is real text (for the computer). We can copy and
paste it and search through it using CTRL + F. If all PDFs were like
this, my tool wouldn’t be necessary.

2) The “text-is-picture-with-a-text-overlay-format”:
The text we are reading is actually a picture, but there is a text layer
on top of it. Sometimes that works so well that we can’t notice any
difference from the “text-is-text-format”, but often copying with
CTRL + C produces results like the ones in the picture on the right!

3) The “text-is-picture-format”:
For us humans it is text, but for the computer it is a picture without
any additional text layer. We can neither copy the text nor search
through it with CTRL + F.
,.-''-.,_,.' Step 2 '.,_,.-''-., Screenshot
Why PDFImage2TXT?
There are some tools around that solve this problem, but I haven’t found
any that use EasyOCR. EasyOCR is developed by Jaided AI and produces the
best results I have seen so far, even better than Google’s Tesseract!
The only problem is that it is really slow, but I would rather wait
longer for great results than get poor results right away.
The only thing I needed to make EasyOCR work the way I wanted was a way
to convert the PDF file to images. After searching for about 5 minutes
on GitHub, I found a nice tool, pdf2jpg, that converts each page of a PDF
to a JPG file of good quality (300 DPI). Since pdf2jpg uses Java, please
ensure that Java is installed on your system! If it is not installed,
PDFImage2TXT won’t work on your PC!
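Because the PDF-to-JPG step depends on Java, a script can check for it before starting any conversion. The following is only a minimal sketch and is not taken from the PDFImage2TXT source: the function name `java_available` is hypothetical, and the check merely looks for a `java` executable on the PATH.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_available():
        print("Java found, PDF conversion should be possible.")
    else:
        print("Java not found. Please install Java before converting PDFs.")
```

A check like this cannot tell whether the installed Java version is suitable, only that one is present.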
,.-''-.,_,.' Step 3 '.,_,.-''-., Screenshot
How to use it?
PDFImage2TXT is very simple to use:
1) Install it
2) Start it
3) Select the PDF or image you want to convert
4) If you convert a PDF file, you can decide which pages you want to convert to text:

If you enter 11,12,13,14,15 after selecting the PDF, the app will convert only pages 11, 12, 13, 14 and 15!
If you enter "ALL", the app will convert the whole PDF document to text!

5) If you convert a picture to text, there is nothing else to configure.
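The page selection from step 4 can be sketched as a small parser. This is an illustration, not the app's actual code: the function name `parse_page_selection`, the 1-based page numbering, the case-insensitive handling of "ALL" and the range check are all assumptions.

```python
def parse_page_selection(selection: str, page_count: int) -> list:
    """Turn 'ALL' or a comma-separated list such as '11,12,13' into page numbers."""
    cleaned = selection.strip()
    if cleaned.upper() == "ALL":
        return list(range(1, page_count + 1))
    pages = []
    for part in cleaned.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing comma or doubled commas
        page = int(part)  # raises ValueError for non-numeric input
        if not 1 <= page <= page_count:
            raise ValueError(f"page {page} is outside 1..{page_count}")
        pages.append(page)
    return pages


print(parse_page_selection("11,12,13,14,15", page_count=300))  # [11, 12, 13, 14, 15]
print(parse_page_selection("ALL", page_count=3))               # [1, 2, 3]
```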
,.-''-.,_,.' Step 4 '.,_,.-''-., Screenshot
Behind the scenes:
You don't have to choose a name for the output folder or file! During the
process, a folder with the same name as your PDF plus the suffix "_dir"
will be created, and special characters in the name of the folder will be
replaced by underscores.

(Example: Jürgen Habermas - Strukturwandel der Öffentlichkeit -Suhrkamp (2001).pdf becomes J_rgen_Habermas_Strukturwandel_der_ffentlichkeit_Suhrkamp_2001_pdf_dir)

Please make sure that there isn’t already a folder with that name!
Inside the folder, there will be 2 files for each page: one JPG and one TXT.
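The folder naming rule can be reproduced with a single regular expression. This is a sketch that happens to match the Habermas example above, not necessarily the rule the app implements: the function name `output_folder_name` is hypothetical, and treating every character outside A-Z, a-z and 0-9 as "special" is an assumption derived from that one example.

```python
import re


def output_folder_name(pdf_filename: str) -> str:
    """Replace each run of characters outside A-Z, a-z, 0-9 with '_' and append '_dir'."""
    safe = re.sub(r"[^A-Za-z0-9]+", "_", pdf_filename)
    return safe + "_dir"


name = "Jürgen Habermas - Strukturwandel der Öffentlichkeit -Suhrkamp (2001).pdf"
print(output_folder_name(name))
# J_rgen_Habermas_Strukturwandel_der_ffentlichkeit_Suhrkamp_2001_pdf_dir
```

Note that a run of several special characters (for example " - ") collapses into one underscore, which is what the example suggests.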
,.-''-.,_,.' Step 5 '.,_,.-''-., Screenshot
Once the entire process is done, PDFImage2TXT will create a single TXT file by joining all TXT files.
The final TXT file will be in the same folder, and its name will contain the prefix "complete"!
Everything that you want is in that file! You only need that file!
If you want, you can delete everything else!
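Joining the per-page files can be sketched as follows. The file names used here (`page_2.txt`, `complete_<folder>.txt`) and the function names are hypothetical, since the exact names the app writes are not given above. The sketch sorts by the number in each file name so that page 10 does not end up before page 2.

```python
import re
import tempfile
from pathlib import Path


def page_number(path: Path) -> int:
    """Extract the last number in a file name, e.g. 'page_12.txt' -> 12."""
    numbers = re.findall(r"\d+", path.stem)
    return int(numbers[-1]) if numbers else 0


def merge_page_texts(folder: Path) -> Path:
    """Join all per-page TXT files in `folder` into one 'complete_' file."""
    target = folder / f"complete_{folder.name}.txt"
    pages = sorted(
        (p for p in folder.glob("*.txt") if not p.name.startswith("complete")),
        key=page_number,
    )
    merged = "\n".join(p.read_text(encoding="utf-8") for p in pages)
    target.write_text(merged, encoding="utf-8")
    return target


# Small demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "page_10.txt").write_text("last page", encoding="utf-8")
    (demo / "page_2.txt").write_text("first page", encoding="utf-8")
    print(merge_page_texts(demo).read_text(encoding="utf-8"))
```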
,.-''-.,_,.' Step 6 '.,_,.-''-., Screenshot
As you can see, thanks to EasyOCR the results are almost perfect!
The script I wrote takes care of the hyphens at the end of each line: it
makes sure that words which were split across two lines are joined again
in the converted document!
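One way to rejoin words split at a line end is a regular expression. This is a sketch of the idea, not the author's script: the function name `join_hyphenated_words` is hypothetical, and joining only when the next line starts with a lowercase letter is an assumed heuristic that leaves compounds such as "Nord-" / "Süd" untouched.

```python
import re


def join_hyphenated_words(text: str) -> str:
    """Rejoin words that were split with a hyphen at the end of a line."""
    # A word character, a hyphen, a line break (with optional spaces),
    # then a lowercase letter that continues the word.
    return re.sub(r"(\w)-[ \t]*\r?\n[ \t]*([a-zäöüß])", r"\1\2", text)


sample = "Der Struktur-\nwandel der Öffent-\nlichkeit"
print(join_hyphenated_words(sample))
# Der Strukturwandel der Öffentlichkeit
```

A heuristic like this will still misjoin a real hyphenated word whose second part is lowercase, so the output is worth a quick read.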
PDFImage2TXT took about 1h15m for a PDF with around
1,000,000 characters (roughly 220 characters per second). I used an
Intel i5-9600KF @ 3.7 GHz with 6 cores. It should be a lot faster with
CUDA enabled (for everybody with an NVIDIA GPU), but I couldn't get it
to run on my PC. I am able to use CuPy, OpenCV and spaCy with CUDA, but
not EasyOCR. If you have a solution, please tell me/us! It would be nice
to implement that feature!
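EasyOCR runs on PyTorch, so whether it can use the GPU depends on the installed PyTorch build rather than on CuPy, OpenCV or spaCy. One possible cause of the CUDA problem (an assumption, not verified on the author's machine) is a CPU-only PyTorch installation. The sketch below checks `torch.cuda.is_available()` and passes the result to the `gpu` argument of `easyocr.Reader`. The helper names `cuda_usable`, `text_from_result` and `ocr_image` are hypothetical and not part of PDFImage2TXT.

```python
def cuda_usable() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def text_from_result(result) -> str:
    """Join the text field of EasyOCR's (bounding box, text, confidence) entries."""
    return "\n".join(text for _bbox, text, _confidence in result)


def ocr_image(image_path: str, languages=("de",)) -> str:
    """Run EasyOCR on one image, using the GPU only when CUDA is usable."""
    import easyocr  # imported lazily; requires `pip install easyocr`

    reader = easyocr.Reader(list(languages), gpu=cuda_usable())
    return text_from_result(reader.readtext(image_path))


if __name__ == "__main__":
    print("CUDA usable by PyTorch:", cuda_usable())
```

If `cuda_usable()` prints False on a machine with an NVIDIA GPU, installing a CUDA-enabled PyTorch build is a reasonable first thing to try.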

License

PDFImage2TXT - Copyright (C) 2021 Johannes Fischer www.queroestudaralemao.com.br

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Donations

If this project has helped you somehow, consider donating a small amount. After being absent from computer programming for more than 20 years, I started again this year. At the beginning of 2021, I suffered from a bone infection and had to spend more than 3 months in hospital (just lying in bed!). To kill time, I started learning Python, which rapidly became something bigger for me than just a "time killer". PayPal

Owner
Hans Alemão
A German teacher living in Brazil