Channel: PyImageSearch

Using spellchecking to improve Tesseract OCR accuracy

In a previous tutorial, you learned how to use the textblob library and Tesseract to automatically OCR text and then translate it to a different language. This tutorial will also use textblob, but this time to improve OCR accuracy by automatically spellchecking OCR’d text.

To learn how to improve OCR results using spellchecking, just keep reading.

It’s unrealistic to expect any OCR system, even state-of-the-art OCR engines, to be 100% accurate. That doesn’t happen in practice. Inevitably, noise in an input image, non-standard fonts that Tesseract wasn’t trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text.

When that happens, you need rules and heuristics that improve the quality of the OCR output. One of the first heuristics you should look at is automatic spellchecking. For example, if you’re OCR’ing a book, you could run a spellchecker over the text after the OCR process, thereby creating a better, more accurate version of the digitized text.

Learning Objectives

In this tutorial, you will:

  1. Learn how the textblob package can be used for spellchecking
  2. OCR a piece of text that contains incorrect spelling
  3. Automatically correct the spelling of the OCR’d text

OCR and Spellchecking

We’ll start this tutorial by reviewing our project directory structure. I’ll then show you how to implement a Python script that can automatically OCR a piece of text and then spellcheck it using the textblob library. Once our script is implemented, we’ll apply it to our example image. We’ll wrap up this tutorial with a discussion on the accuracy of our spellchecking, including some of the limitations and drawbacks associated with automatic spellchecking.

Configuring your development environment

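To follow this guide, you need the OpenCV, pytesseract, and textblob Python packages installed on your system, along with the Tesseract OCR engine itself (pytesseract is only a wrapper around it).

Luckily, all three packages are pip-installable:

$ pip install opencv-contrib-python pytesseract textblob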

If you need help configuring your development environment for OpenCV, I highly recommend that you read my pip install OpenCV guide — it will have you up and running in a matter of minutes.

Having problems configuring your development environment?

Figure 1: Having trouble configuring your dev environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join PyImageSearch University — you’ll be up and running with this tutorial in a matter of minutes.

Project Structure

The project directory structure for our OCR spellchecker is quite simple:

|-- comic_spelling.png
|-- ocr_and_spellcheck.py

We only have a single Python script here, ocr_and_spellcheck.py. This script does the following:

  1. Load comic_spelling.png from disk
  2. OCR the text in the image
  3. Apply spellchecking to it

By applying the spellcheck, we will ideally be able to improve the OCR accuracy of our script, regardless of whether:

  1. The input image has incorrect spellings in it
  2. Tesseract incorrectly OCR’d characters

Implementing Our OCR Spellchecking Script

Let’s start implementing our OCR and spellchecking script.

Open a new file, name it ocr_and_spellcheck.py, and insert the following code:

# import the necessary packages
from textblob import TextBlob
import pytesseract
import argparse
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image to be OCR'd")
args = vars(ap.parse_args())

Lines 2-5 import our required Python packages. You should note the use of the textblob package which we utilized in a previous lesson on translating OCR’d text from one language to another. We’ll be using textblob in this tutorial, but this time for its automatic spellchecking implementation.

Lines 8-11 then parse our command line arguments. We only need a single argument, --image, which is the path to our input image.

Next, we can load the image from disk and OCR it:

# load the input image and convert it from BGR to RGB channel
# ordering
image = cv2.imread(args["image"])
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# use Tesseract to OCR the image
text = pytesseract.image_to_string(rgb)

# show the text *before* ocr-spellchecking has been applied
print("BEFORE SPELLCHECK")
print("=================")
print(text)
print("\n")

Line 15 loads our input image from the disk using the supplied path. We then swap the color channel ordering from BGR (OpenCV’s default ordering) to RGB (which is what Tesseract and pytesseract expect).

Once the image is loaded, we make a call to image_to_string to OCR the image (Line 19). We then display the OCR’d text before spellchecking on our screen (Lines 22-25).

However, there may be misspellings, such as text misspelled by the user when creating the image or “typos” caused by Tesseract incorrectly OCR’ing one or more characters — to fix that, we need to utilize textblob:

# apply spell checking to the OCR'd text
tb = TextBlob(text)
corrected = tb.correct()

# show the text after ocr-spellchecking has been applied
print("AFTER SPELLCHECK")
print("================")
print(corrected)

Line 28 constructs a TextBlob from the OCR’d text. We then apply automatic spellcheck correction via the correct() method (Line 29). The corrected text (i.e., after spellchecking) is then displayed on the terminal (Lines 32-34).

OCR Spellchecking Results

We are now ready to apply OCR spellchecking to an example image.

Open a terminal and execute the following command:

$ python ocr_and_spellcheck.py --image comic_spelling.png
BEFORE SPELLCHECK
=================
Why can't yu
spel corrctly?

AFTER SPELLCHECK
================
Why can't you
spell correctly?

Figure 2 shows our example image (created via the Explosm comic generator), which includes words with misspellings. Using Tesseract, we can OCR the text with the original misspellings.

Figure 2. Explosm comic showing text we will OCR.

It’s important to note that these misspellings were purposely introduced — in your OCR applications, these misspellings may naturally exist in your input images or Tesseract may incorrectly OCR certain characters.

As our output shows, we are able to correct these misspellings using textblob, correcting the words “yu ⇒ you,” “spel ⇒ spell,” and “corrctly ⇒ correctly.”
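If you want to log exactly which words the spellchecker touched, a small comparison of the two strings is enough. The following is a minimal sketch using only the standard library. The helper name changed_words is hypothetical, and the sketch assumes the spellchecker replaces words one-for-one so that the two word lists line up:

```python
import re


def changed_words(before, after):
    """Pair up words from the two texts and return those that differ."""
    # keep contractions such as "can't" together as a single token
    pattern = r"\w+(?:'\w+)*"
    before_words = re.findall(pattern, before)
    after_words = re.findall(pattern, after)
    # assumption: words were replaced one-for-one, so the two lists
    # line up position by position
    return [(b, a) for b, a in zip(before_words, after_words) if b != a]


# the BEFORE and AFTER text from the terminal output above
before = "Why can't yu\nspel corrctly?"
after = "Why can't you\nspell correctly?"

for old, new in changed_words(before, after):
    print(f"{old} => {new}")
```

On the example output this prints the same three corrections listed above. A log like this makes it much easier to spot a "correction" that actually damaged a valid word.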

Limitations and Drawbacks

One of the biggest problems with spellchecking algorithms is that most spellcheckers require some human intervention to be accurate. When we make a spelling mistake, our word processor automatically detects the error and proposes candidate fixes, often two or three words that the spellchecker thinks we meant to spell. Unless we atrociously misspelled a word, nine times out of ten we can find the word we meant to use among the candidates proposed by the spellchecker.

We may choose to remove that human intervention piece and instead allow the spellchecker to use the word it deems most probable based on its internal spellchecking algorithm. In doing so, we risk replacing words that have only minor misspellings with words that do not make sense in the original context of the sentence or paragraph. In other words, an incorrect word may end up in the output OCR’d text where the correct word, with a minor spelling mistake, used to be. Therefore, you should be cautious when relying on totally automatic spellcheckers.
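One way to be cautious is to accept a correction only when the spellchecker is confident about it, and to leave the original word alone otherwise. Here is a minimal sketch of that idea. The names correct_cautiously, suggest, min_confidence, and toy_suggest are all hypothetical, and the confidence values in the toy table are made up for illustration:

```python
import re


def correct_cautiously(text, suggest, min_confidence=0.9):
    """Replace a word only when the top suggestion is confident enough.

    `suggest` is any callable that maps a word to a list of
    (candidate, confidence) pairs, best candidate first.
    """
    def replace(match):
        word = match.group(0)
        candidates = suggest(word)
        if not candidates:
            return word
        best, confidence = candidates[0]
        # keep the original word unless the spellchecker is confident
        return best if confidence >= min_confidence else word

    # only alphabetic runs are checked; digits and punctuation pass through
    return re.sub(r"[A-Za-z]+", replace, text)


# hypothetical stand-in for a real spellchecker, for illustration only
def toy_suggest(word):
    table = {
        "yu": [("you", 0.95), ("your", 0.05)],
        "teh": [("the", 0.55), ("ten", 0.45)],
    }
    return table.get(word, [(word, 1.0)])


print(correct_cautiously("yu typed teh word", toy_suggest))
# prints: you typed teh word
```

With textblob, the suggest argument could be something like lambda w: Word(w).spellcheck(), since Word.spellcheck() returns a list of (word, confidence) tuples. The right threshold depends on your data, so treat 0.9 as a starting point rather than a recommendation.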

If you find that spellchecking is hurting your OCR accuracy, you may want to:

  1. Look into alternative spellchecking algorithms other than the generic one included in the textblob library
  2. Replace spellchecking with heuristic-based methods (e.g., regular expression matching)
  3. Allow misspellings to exist, keeping in mind that no OCR system is 100% accurate anyway
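To make the heuristic-based option more concrete, here is a minimal sketch of a regular expression rule that repairs letters OCR'd in place of look-alike digits, but only inside tokens that already contain a digit. The function name fix_numeric_tokens and the confusion table are illustrative assumptions, not something taken from Tesseract or textblob:

```python
import re

# common letter-for-digit confusions; this mapping is an illustrative
# assumption and should be tuned to the errors you actually observe
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def fix_numeric_tokens(text):
    """Swap look-alike letters for digits in tokens that contain a digit."""
    def repair(match):
        return match.group(0).translate(CONFUSIONS)

    # the lookahead requires at least one real digit in the token, so
    # ordinary words such as "also" or "ISO" are never rewritten
    return re.sub(r"\b(?=\w*\d)[\dOolIS]+\b", repair, text)


print(fix_numeric_tokens("Invoice 2O21-l1 total: 1O5 dollars. See also ISO."))
# prints: Invoice 2021-11 total: 105 dollars. See also ISO.
```

A targeted rule like this only fires where you expect a specific kind of error, which makes it far less likely than a general spellchecker to rewrite text that was already correct.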

Summary

In this tutorial, you learned how to improve OCR results by applying automatic spellchecking. While our method worked well in our particular example, it may not work well in other situations! Keep in mind that spellchecking algorithms typically require a small amount of human intervention. Most spellcheckers automatically check a document for spelling mistakes and then propose a list of candidate corrections to the human user. It’s up to the human to make the final spellcheck decision.

When we remove the human intervention component and instead allow the spellchecking algorithm to choose the correction it deems the best fit, words with only minor misspellings can be replaced with words that don’t make sense within the sentence’s original context. Use spellchecking, especially automatic spellchecking, cautiously in your own OCR applications. In some cases it will help your OCR accuracy, but in others it can hurt it.
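To find out which of those two cases you are in, you can compare the OCR output against a hand-typed reference before and after spellchecking. The sketch below uses the standard library's difflib for a rough character-level score. The similarity helper is hypothetical, and the reference string is an assumption about what the example image was meant to say:

```python
from difflib import SequenceMatcher


def similarity(candidate, reference):
    """Character-level similarity between 0.0 and 1.0 (1.0 means identical)."""
    return SequenceMatcher(None, candidate, reference).ratio()


# assumption: a hand-typed reference transcription of the intended text
reference = "Why can't you\nspell correctly?"
raw_ocr = "Why can't yu\nspel corrctly?"
corrected = "Why can't you\nspell correctly?"

before = similarity(raw_ocr, reference)
after = similarity(corrected, reference)
print(f"before: {before:.3f}, after: {after:.3f}")

if after < before:
    print("spellchecking made this sample worse; consider skipping it")
```

Running a check like this over a handful of labeled samples from your own data is a quick way to decide whether automatic spellchecking belongs in your pipeline.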

Citation Information

Rosebrock, A. “Using spellchecking to improve Tesseract OCR accuracy,” PyImageSearch, 2021, https://pyimagesearch.com/2021/11/29/using-spellchecking-to-improve-tesseract-ocr-accuracy/

@article{Rosebrock_2021_Spellchecking,
  author = {Adrian Rosebrock},
  title = {Using spellchecking to improve {T}esseract {OCR} accuracy},
  journal = {PyImageSearch},
  year = {2021},
  note = {https://pyimagesearch.com/2021/11/29/using-spellchecking-to-improve-tesseract-ocr-accuracy/},
}
