Engineering & IT Projects and Resources: Optical Character Recognition (OCR)

Optical Character Recognition (OCR)

1. Introduction

Optical Character Recognition (OCR) is a technology used to extract text from images or scanned documents. This project utilizes Tesseract, an open-source OCR engine, to read and interpret text from images using Python.

2. Prerequisites

• Python: Install Python 3.x from the official Python website.
• Required Libraries:
- pytesseract: Install using `pip install pytesseract`
- pillow: Install using `pip install pillow`
- OpenCV (optional): Install using `pip install opencv-python`
• Tesseract OCR Engine:
- Install Tesseract from https://github.com/tesseract-ocr/tesseract.
- Add Tesseract to your system's PATH.
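A quick way to confirm the last step is to check whether the `tesseract` executable can be found on PATH. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are illustrative and not part of pytesseract.

```python
import shutil
import subprocess


def find_tesseract(names=("tesseract",)):
    """Return the full path of the first executable found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`, or "" if none."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case the version banner is printed there.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print(f"Found {path}: {tesseract_version(path)}")
```

If the script reports that nothing was found even though Tesseract is installed, the install directory is probably missing from PATH.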

3. Project Setup

1. Create a Project Directory:

- Name your project folder, e.g., `OCR_Project`.
- Inside this folder, create the Python script file (`ocr_script.py`).

2. Install Required Libraries:

Ensure pytesseract, pillow, and OpenCV (optional) are installed using `pip`.

4. Writing the Code

Below is the Python code for the OCR system:

from PIL import Image
import pytesseract

# Set the path to Tesseract executable if not in PATH
# Example for Windows: pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Load the image
image_path = "sample_image.png" # Replace with your image path
image = Image.open(image_path)

# Perform OCR
extracted_text = pytesseract.image_to_string(image)

# Display the extracted text
print("Extracted Text:")
print(extracted_text)

# Optional: Save the extracted text to a file
with open("output.txt", "w", encoding="utf-8") as text_file:
    text_file.write(extracted_text)

5. Key Components

• Tesseract OCR: An open-source OCR engine for text recognition.
• Pillow: A maintained fork of the Python Imaging Library (PIL), used here to load images.
• pytesseract: A Python wrapper that calls the Tesseract executable and returns the recognized text.

6. Testing

1. Prepare an Image:

- Use a sample image with clear text content, e.g., `sample_image.png`.

2. Run the script:

python ocr_script.py

3. Verify the extracted text in the console or output file.
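For step 3, a small check can stand in for reading the output by eye. This sketch assumes you know a phrase that appears in the sample image; `normalize` and `contains_expected` are hypothetical helper names, and the comparison ignores case and whitespace differences, which are common in OCR output.

```python
from pathlib import Path


def normalize(text):
    """Lowercase the text and collapse every run of whitespace to one space."""
    return " ".join(text.split()).lower()


def contains_expected(extracted, expected):
    """Return True if the expected phrase occurs in the extracted text."""
    return normalize(expected) in normalize(extracted)


if __name__ == "__main__":
    output_file = Path("output.txt")  # file written by ocr_script.py
    expected_phrase = "hello world"   # assumption: replace with text from your image
    if output_file.exists():
        text = output_file.read_text(encoding="utf-8", errors="replace")
        print("match" if contains_expected(text, expected_phrase) else "no match")
    else:
        print("output.txt not found; run ocr_script.py first")
```

A substring check like this is deliberately forgiving; it will not catch extra or misread characters elsewhere in the output.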

7. Enhancements

• Preprocessing: Use OpenCV to preprocess the image (e.g., grayscale conversion, thresholding).
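• Multi-Language Support: Install the language data for each language you need and pass the codes through the `lang` parameter of `image_to_string` (e.g., `lang="eng+fra"`).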
• GUI Integration: Build a graphical interface for uploading and processing images.
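The first two enhancements can be sketched together. This is one possible approach, not the only one: it assumes `opencv-python` is installed, uses Otsu thresholding as the binarization step (other methods may suit your images better), and assumes the language data for each requested code (for example `eng` and `fra`) is installed alongside Tesseract. The function names are illustrative.

```python
def lang_argument(languages):
    """Join Tesseract language codes with '+', e.g. ["eng", "fra"] -> "eng+fra"."""
    if isinstance(languages, str):
        languages = [languages]
    codes = [code.strip() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def preprocess(image_path):
    """Load an image as grayscale and binarize it with Otsu's threshold."""
    import cv2  # imported here so lang_argument works without OpenCV

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(f"could not read image: {image_path}")
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def ocr_preprocessed(image_path, languages=("eng",)):
    """Run OCR on the binarized image using the requested languages."""
    import pytesseract

    binary = preprocess(image_path)
    # pytesseract accepts NumPy arrays such as the one OpenCV returns.
    return pytesseract.image_to_string(binary, lang=lang_argument(languages))
```

A call such as `ocr_preprocessed("sample_image.png", languages=("eng", "fra"))` would then replace the plain `image_to_string` call in the script. Whether thresholding helps depends on the image, so it is worth comparing results with and without it.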

8. Troubleshooting

• Inaccurate Recognition: Preprocess the image to enhance text clarity.
• Tesseract Not Found: Ensure Tesseract is installed and that its executable is either on your PATH or set explicitly through `pytesseract.pytesseract.tesseract_cmd`.
• Unsupported File Format: Use Pillow to convert the image to a common format such as PNG before running OCR.
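The format conversion in the last item can be sketched as follows. The set of "standard" extensions below is a conservative assumption for illustration, not an authoritative list of what Tesseract accepts, and the helper names are hypothetical. The conversion only works for files that Pillow itself can open.

```python
from pathlib import Path

# Assumption: extensions treated as "standard" in this sketch; the formats
# your Tesseract build actually accepts may differ.
STANDARD_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def needs_conversion(path):
    """Return True if the file extension is outside the assumed standard set."""
    return Path(path).suffix.lower() not in STANDARD_EXTENSIONS


def png_target(path):
    """Return the same path with its extension replaced by .png."""
    return str(Path(path).with_suffix(".png"))


def convert_to_png(path):
    """Save a PNG copy of the image next to the original and return its path."""
    from PIL import Image  # imported here so the helpers above work without Pillow

    target = png_target(path)
    with Image.open(path) as img:
        # Converting to RGB discards any alpha channel or palette.
        img.convert("RGB").save(target, format="PNG")
    return target
```

In the script, `image_path` could be passed through `convert_to_png` whenever `needs_conversion(image_path)` is true, before the image is opened for OCR.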

9. Conclusion

This project demonstrates how to implement OCR using Tesseract and Python. With additional features and optimizations, it can be expanded into a powerful tool for digitizing text from images.
