Optical Character Recognition (OCR)
1. Introduction
Optical Character Recognition (OCR) is a technology used to extract text from images or scanned documents. This project utilizes Tesseract, an open-source OCR engine, to read and interpret text from images using Python.
2. Prerequisites
• Python: Install Python 3.x from the official Python website.
• Required Libraries:
- pytesseract: Install using `pip install pytesseract`
- pillow: Install using `pip install pillow`
- OpenCV (optional): Install using `pip install opencv-python`
• Tesseract OCR Engine:
- Install Tesseract from https://github.com/tesseract-ocr/tesseract.
- Add Tesseract to your system's PATH.
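The prerequisites above can be checked before running any OCR code. The following is a minimal sketch that uses only the standard library; the names `REQUIRED`, `OPTIONAL`, `missing_packages`, and `report_environment` are hypothetical helpers introduced for illustration and are not part of pytesseract or Tesseract.

```python
import importlib.util
import shutil

# Map each pip package name to the module name it is imported as.
# pillow is imported as PIL, and opencv-python as cv2.
REQUIRED = {"pytesseract": "pytesseract", "pillow": "PIL"}
OPTIONAL = {"opencv-python": "cv2"}


def missing_packages(modules):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


def report_environment():
    """Print a short report on the Python packages and the Tesseract binary."""
    for package in missing_packages(REQUIRED):
        print(f"Missing required package: pip install {package}")
    for package in missing_packages(OPTIONAL):
        print(f"Missing optional package: pip install {package}")

    # shutil.which searches the directories listed in PATH.
    tesseract_path = shutil.which("tesseract")
    if tesseract_path is None:
        print("The tesseract executable was not found on PATH.")
    else:
        print(f"tesseract found at {tesseract_path}")
```

Calling `report_environment()` from a Python shell prints one line per missing package and reports whether the `tesseract` executable is reachable through PATH.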
3. Project Setup
1. Create a Project Directory:
- Name your project folder, e.g., `OCR_Project`.
- Inside this folder, create the Python script file (`ocr_script.py`).
2. Install Required Libraries:
Ensure pytesseract, pillow, and OpenCV (optional) are installed using `pip`.
4. Writing the Code
Below is the Python code for the OCR system:
from PIL import Image
import pytesseract
# Set the path to the Tesseract executable if it is not in PATH
# Example for Windows:
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Load the image
image_path = "sample_image.png"  # Replace with your image path
image = Image.open(image_path)
# Perform OCR
extracted_text = pytesseract.image_to_string(image)
# Display the extracted text
print("Extracted Text:")
print(extracted_text)
# Optional: Save the extracted text to a file
with open("output.txt", "w", encoding="utf-8") as text_file:
    text_file.write(extracted_text)
5. Key Components
• Tesseract OCR: An open-source OCR engine for text recognition.
• Pillow: A fork of the Python Imaging Library (PIL) for image loading and manipulation.
• pytesseract: A wrapper for Tesseract to use OCR in Python.
6. Testing
1. Prepare an Image:
- Use a sample image with clear text content, e.g., `sample_image.png`.
2. Run the script:
python ocr_script.py
3. Verify the extracted text in the console or output file.
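Verification can also be done programmatically when the expected text of the sample image is known. The sketch below compares the OCR output with a reference string using the standard library's `difflib`; the helper names `normalize` and `similarity`, and any pass threshold you choose, are assumptions for illustration rather than part of the original script.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and collapse every whitespace run to one space."""
    return " ".join(text.lower().split())


def similarity(extracted, expected):
    """Return a ratio between 0.0 and 1.0 for two normalized strings."""
    return SequenceMatcher(None, normalize(extracted), normalize(expected)).ratio()
```

For example, after running the script you could read `output.txt`, call `similarity(contents, "the text you expect")`, and treat a ratio below a threshold of your choosing (say 0.9) as a failed test. Normalizing first keeps differences in line breaks and letter case from counting as recognition errors.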
7. Enhancements
• Preprocessing: Use OpenCV to preprocess the image (e.g., grayscale conversion, thresholding).
• Multi-Language Support: Configure Tesseract for multiple languages.
• GUI Integration: Build a graphical interface for uploading and processing images.
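The first two enhancements could be sketched as follows. This is one possible approach, not a definitive implementation: Otsu's method is chosen here as an example thresholding technique, and the function names `build_lang_argument`, `preprocess_for_ocr`, and `ocr_with_preprocessing` are hypothetical. Each language code passed to Tesseract (for example `eng` or `fra`) requires the matching trained data file to be installed.

```python
def build_lang_argument(languages):
    """Join Tesseract language codes with '+', e.g. ['eng', 'fra'] -> 'eng+fra'."""
    codes = [code.strip() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def preprocess_for_ocr(image_path):
    """Load an image, convert it to grayscale, and binarize it with Otsu's threshold."""
    import cv2  # imported here because OpenCV is an optional dependency

    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None instead of raising on failure
        raise FileNotFoundError(f"could not read image: {image_path}")

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # With THRESH_OTSU the threshold value 0 is ignored and computed automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def ocr_with_preprocessing(image_path, languages=("eng",)):
    """Run Tesseract on the binarized image using the given language codes."""
    import pytesseract  # imported lazily so the helpers above work without it

    binary = preprocess_for_ocr(image_path)
    return pytesseract.image_to_string(binary, lang=build_lang_argument(languages))
```

A call such as `ocr_with_preprocessing("sample_image.png", languages=["eng", "fra"])` would binarize the image and ask Tesseract to recognize English and French text together.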
8. Troubleshooting
• Inaccurate Recognition: Preprocess the image to enhance text clarity.
• Tesseract Not Found: Ensure Tesseract is installed and its path is configured correctly.
• Unsupported File Format: Use Pillow to convert unsupported formats to standard ones.
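Two of the issues above can be handled in code. The sketch below converts an image to PNG with Pillow when its extension is not in an assumed list of standard formats, and catches `pytesseract.TesseractNotFoundError` when the engine cannot be located. The set `STANDARD_EXTENSIONS` and the helpers `needs_conversion`, `convert_to_png`, and `safe_ocr` are hypothetical names for illustration; conversion only works for formats that your Pillow installation can open.

```python
from pathlib import Path

# Extensions assumed safe to use directly; adjust this set to your setup.
STANDARD_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def needs_conversion(image_path):
    """Return True when the file extension is not in STANDARD_EXTENSIONS."""
    return Path(image_path).suffix.lower() not in STANDARD_EXTENSIONS


def convert_to_png(image_path):
    """Save a PNG copy next to the original file and return the new path."""
    from PIL import Image  # imported lazily so the path helpers need no extras

    target = Path(image_path).with_suffix(".png")
    with Image.open(image_path) as image:
        image.convert("RGB").save(target, format="PNG")
    return target


def safe_ocr(image_path):
    """Run OCR, converting the file first if needed; return None if Tesseract is missing."""
    import pytesseract
    from PIL import Image

    path = convert_to_png(image_path) if needs_conversion(image_path) else Path(image_path)
    try:
        with Image.open(path) as image:
            return pytesseract.image_to_string(image)
    except pytesseract.TesseractNotFoundError:
        print("Tesseract was not found. Install it, add it to PATH, "
              "or set pytesseract.pytesseract.tesseract_cmd.")
        return None
```

Returning `None` instead of re-raising keeps a batch job running when Tesseract is missing; re-raise the exception if you would rather fail fast.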
9. Conclusion
This project demonstrates how to implement OCR using Tesseract and Python. With additional features and optimizations, it can be expanded into a powerful tool for digitizing text from images.