Experiments with Tesseract screenshot recognition. A training automation script is provided.

Zloy/tesseract-training

Numbers from screenshots with Tesseract OCR sandbox

DESCRIPTION

This repo contains everything needed to OCR images of numbers grabbed from the screen, like this one:

image to OCR example

If you have a bunch of number images and want to convert them to plain text, this is what you need.

ON WINDOWS

HOW TO USE

  1. Clone this repository.
  2. Install tesseract-3.01. If it is no longer available, install tesseract from the distros subfolder.

After that you have the following subfolders:

Samples

This folder is full of sample number images. It is convenient to OCR them all together, which is why I created the total.png file:

total.png

exp1 - as is

cd exp1 - as is

That folder contains run.cmd, which OCRs total.png. The resulting text is in total.txt. You can see the errors:

02a.gif

Tesseract recognizes 6 and 8 as 5 and misses the decimal dot (.).
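For readers on Linux, the same step might be sketched in shell as follows. The contents of run.cmd are not reproduced in this README, so this is only an assumption about the simplest possible call: tesseract is on the PATH and receives an image and an output base name. The names `TESSERACT` and `ocr_image` are hypothetical and exist only for this illustration.

```shell
# Hypothetical override: set TESSERACT=echo to print the command instead of running it.
TESSERACT="${TESSERACT:-tesseract}"

# OCR one image. tesseract appends ".txt" to the output base name,
# so total.png produces total.txt.
ocr_image() {
    image="$1"
    "$TESSERACT" "$image" "${image%.*}"
}

# Usage: ocr_image total.png
```

Passing the image name without its extension as the output base keeps the result next to the image, which matches the total.png / total.txt pairing described above.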

exp2 - trained

cd exp2 - trained

That folder contains train.cmd, which automatically trains tesseract for such images. Read it and the userguide to learn how to train tesseract.

To train tesseract automatically, just launch train.cmd.

Launch run.cmd to OCR total.png with the trained tesseract. The resulting text is in total.txt. You can see the errors:

03a.gif

You can see that tesseract learned to distinguish 6 and 8 from 5, but it still misses decimal dots (.).
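Running tesseract with a trained language could look roughly like the sketch below. It assumes the training step produced a traineddata file that tesseract can find, and it uses the standard `-l` option to select it. The language name `num` is a placeholder, not the name train.cmd actually uses, and `ocr_with_lang` is a hypothetical helper.

```shell
# Hypothetical override: set TESSERACT=echo to print the command instead of running it.
TESSERACT="${TESSERACT:-tesseract}"

# OCR an image with a specific trained language.
# $1 = image, $2 = language name (traineddata file name without its extension)
ocr_with_lang() {
    image="$1"
    lang="$2"
    "$TESSERACT" "$image" "${image%.*}" -l "$lang"
}

# Usage ("num" is a placeholder language name):
# ocr_with_lang total.png num
```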

exp3 - scaled

Since errors remain, try scaling total.png:

cd exp3 - scaled

It contains total-scaled.png, a fragment of which you can see below:

scaled-part.png

To OCR total-scaled.png, launch run.cmd. The resulting text is in total.txt. You can see the errors:

04a.gif

It confuses 7 with 2 and adds 3 redundant spaces between digits.

exp4 - resized

You can also scale total.png a different way:

cd exp4 - resized

It contains total-resized.png, a fragment of which you can see below:

resized-part.png

To OCR total-resized.png, launch run.cmd. The resulting text is in total.txt. You can see the errors:

05a.gif
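The README does not say which tool or factor produced total-scaled.png and total-resized.png. As one possible way to make similar images, the sketch below uses ImageMagick's `convert` with its `-scale` and `-resize` operators. The tool choice, the 200% factor in the usage lines, and the helper names are all assumptions.

```shell
# Hypothetical override: set CONVERT=echo to print the command instead of running it.
CONVERT="${CONVERT:-convert}"

# $1 = input image, $2 = output image, $3 = size in percent (e.g. 200)
# -scale enlarges by pixel replication; -resize resamples through a filter.
scale_image()  { "$CONVERT" "$1" -scale  "$3%" "$2"; }
resize_image() { "$CONVERT" "$1" -resize "$3%" "$2"; }

# Usage (200 is an arbitrary example factor):
# scale_image  total.png total-scaled.png  200
# resize_image total.png total-resized.png 200
```

The two operators give visibly different edges on small digits, which is one plausible reason the two experiments produce different recognition errors.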

exp5 - one by one

What happens if you OCR the number images one by one?

cd exp5 - one by one

It contains 10 sample images and the corresponding txt files, which hold the recognition results.

To OCR them, launch run.cmd. Check the text files for errors. Some 2- and 3-digit numbers are not recognized at all!
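A one-by-one run can be sketched in shell as a loop over the image files. This is not a copy of run.cmd. It assumes the samples are PNG files and that tesseract is on the PATH, and `ocr_each` is a hypothetical helper name.

```shell
# Hypothetical override: set TESSERACT=echo to print the commands instead of running them.
TESSERACT="${TESSERACT:-tesseract}"

# OCR every PNG in a directory, writing NAME.txt next to NAME.png.
ocr_each() {
    dir="$1"
    for image in "$dir"/*.png; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$image" ] || continue
        "$TESSERACT" "$image" "${image%.png}"
    done
}

# Usage: ocr_each .
```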

exp6 - ten in line

What happens if you OCR all 10 images together?

cd exp6 - ten in line

It contains teninline.png and the corresponding txt file with the recognition result.

teninline.png

To OCR it, launch run.cmd. See the text file: it contains no errors!
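How teninline.png was assembled is not stated here. One way to build a similar image is ImageMagick's `+append` operator, which joins images left to right. The sketch below assumes ImageMagick is installed; `join_in_line` and the input file names in the usage line are hypothetical.

```shell
# Hypothetical override: set CONVERT=echo to print the command instead of running it.
CONVERT="${CONVERT:-convert}"

# Join images left to right into a single image.
# $1 = output image, remaining arguments = input images in order
join_in_line() {
    output="$1"
    shift
    "$CONVERT" "$@" +append "$output"
}

# Usage (input names are placeholders):
# join_in_line teninline.png 1.png 2.png 3.png
```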

ON LINUX

It takes little effort to port all those cmd one-liners to bash. Write them, test them, and submit a pull request if you wish to contribute.
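One detail to watch when porting: the experiment folder names contain spaces, so in bash they have to be quoted, otherwise a name like exp1 - as is is split into several words. A minimal sketch, with `in_experiment` as a hypothetical helper:

```shell
# Run a command inside an experiment folder.
# The subshell keeps the caller's working directory unchanged.
# $1 = folder name (may contain spaces), remaining arguments = command to run
in_experiment() {
    folder="$1"
    shift
    ( cd "$folder" && "$@" )
}

# Usage:
# in_experiment "exp1 - as is" tesseract total.png total
```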
