Day 3 of Learning OCR: Building a Modular Python OCR System

On Day 3 of my journey into Optical Character Recognition (OCR), I took a significant step forward by organizing a Python-based OCR project into a modular, scalable structure. Using powerful libraries like OpenCV and Tesseract, I built a system capable of extracting text from images with improved preprocessing techniques. Below, I’ll share the project structure, the complete code for each file, and the key lessons I learned along the way.

Why Modularize?

As my OCR project grew, I realized the importance of keeping code organized and reusable. By splitting the functionality into separate files—each handling a specific task like image loading, preprocessing, or text extraction—I made the codebase easier to maintain, debug, and extend. This approach mirrors real-world software engineering practices, making it a valuable lesson for building production-ready applications.

The Project Structure

I designed a clean folder structure to keep everything tidy:

universal_ocr/
├── images/               # Folder for input images
├── main.py               # Entry point of the application
├── ocr/
│   ├── __init__.py       # Makes ocr a Python package
│   ├── loader.py         # Handles image loading
│   ├── processor.py      # Manages image preprocessing
│   └── reader.py         # Performs text extraction

Below is the complete code for each file, along with explanations of what I learned while building them.

1. ocr/loader.py

This module handles loading images from a specified folder, filtering for common image formats like PNG, JPG, and more.

import os

def load_images_from_folder(folder):
    images = []
    for filename in os.listdir(folder):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tiff')):
            images.append(os.path.join(folder, filename))
    return images

Key Learning: Using os.listdir() and os.path.join() makes file handling platform-independent. The case-insensitive check with filename.lower() ensures robustness across different image formats.
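As a side note, the same loader could be written with pathlib, which I find a bit cleaner for extension checks and gives deterministic ordering via sorting. This is just a sketch of an alternative (load_image_paths is a hypothetical name, not part of the project):

```python
from pathlib import Path

# Extensions the project currently accepts.
IMAGE_EXTENSIONS = {'.png', '.jpg', '.jpeg', '.bmp', '.tiff'}

def load_image_paths(folder):
    """Return sorted paths of image files in `folder` (non-recursive)."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Sorting the results means images are always processed in the same order, which makes runs easier to compare.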

2. ocr/processor.py

This module preprocesses images to improve OCR accuracy. It includes steps like converting to grayscale, resizing, applying Gaussian blur, sharpening, adaptive thresholding, and skew correction.

import cv2
import numpy as np

def preprocess_image(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    scale_percent = 150
    width = int(gray.shape[1] * scale_percent / 100)
    height = int(gray.shape[0] * scale_percent / 100)
    gray = cv2.resize(gray, (width, height), interpolation=cv2.INTER_LINEAR)

    blur = cv2.GaussianBlur(gray, (5,5), 0)

    kernel_sharpen = np.array([[0,-1,0], [-1,5,-1], [0,-1,0]])
    sharpened = cv2.filter2D(blur, -1, kernel_sharpen)

    thresh = cv2.adaptiveThreshold(
        sharpened, 255, 
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
        cv2.THRESH_BINARY, 31, 10)

    # Text pixels are black (0) after THRESH_BINARY, so select them directly
    # rather than the white background; minAreaRect needs float32 points.
    coords = np.column_stack(np.where(thresh == 0))
    angle = cv2.minAreaRect(coords.astype(np.float32))[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle

    (h, w) = thresh.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(thresh, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    return rotated

Key Learning: Preprocessing is the backbone of effective OCR. Each step—grayscale conversion, resizing, blurring, sharpening, thresholding, and skew correction—addresses specific challenges like noise, low resolution, or text rotation. Tuning parameters like the thresholding block size (31) and constant (10) was critical for handling diverse image qualities.
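One caveat I ran into while reading about deskewing: cv2.minAreaRect's angle convention changed across OpenCV versions (roughly [-90, 0) before 4.5, (0, 90] after), so the correction logic can behave differently depending on your install. A small version-agnostic helper might look like this (a sketch, not yet part of the project):

```python
def normalize_deskew_angle(raw_angle):
    """Map cv2.minAreaRect's reported angle to a corrective rotation.

    Older OpenCV builds report angles in [-90, 0); 4.5+ reports (0, 90].
    This folds either convention into a correction in roughly (-45, 45].
    """
    angle = raw_angle
    if angle > 45:        # OpenCV >= 4.5 convention
        angle -= 90
    elif angle < -45:     # pre-4.5 convention
        angle += 90
    return -angle         # rotate the opposite way to straighten the text
```

With this helper, the `if angle < -45:` branch in preprocess_image could be replaced by a single call, and the same code would run on both old and new OpenCV versions.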

3. ocr/reader.py

This module uses Tesseract to extract text from preprocessed images, leveraging the preprocessing function from processor.py.

import cv2
import pytesseract
from .processor import preprocess_image

# Optional: specify path if not in PATH
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def extract_text_from_image(image_path):
    image = cv2.imread(image_path)
    if image is None:  # imread returns None instead of raising on a bad path
        raise ValueError(f"Could not read image: {image_path}")
    preprocessed = preprocess_image(image)
    text = pytesseract.image_to_string(preprocessed)
    return text

Key Learning: Tesseract’s performance heavily depends on image quality, making preprocessing essential. I also learned that specifying the Tesseract executable path is necessary in some environments, like Windows, if it’s not in the system PATH.
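Tesseract also accepts configuration flags such as --psm (page segmentation mode) and --oem (OCR engine mode) that can noticeably change results. A tiny helper to assemble the config string might look like this (hypothetical, not part of the project; PSM 6 assumes a single uniform block of text):

```python
def build_tesseract_config(psm=6, oem=3):
    """Assemble a Tesseract CLI config string.

    psm: page segmentation mode (6 = assume a single uniform text block).
    oem: OCR engine mode (3 = default, lets Tesseract choose).
    """
    return f'--oem {oem} --psm {psm}'
```

It would then be passed through pytesseract, e.g. `pytesseract.image_to_string(preprocessed, config=build_tesseract_config())` — something I want to experiment with on multi-column documents.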

4. main.py

The main script ties everything together, loading images and extracting text while incorporating basic error handling.

from ocr.loader import load_images_from_folder
from ocr.reader import extract_text_from_image

def main():
    image_folder = 'images'
    image_paths = load_images_from_folder(image_folder)
    
    for path in image_paths:
        print(f"\nExtracting from: {path}")
        try:
            text = extract_text_from_image(path)
            print("Text:\n", text.strip())
        except Exception as e:
            print("Failed to process image:", e)

if __name__ == "__main__":
    main()

Key Learning: A clean entry point simplifies execution and testing. Using try-except blocks ensures the program doesn’t crash on problematic images, and the if __name__ == "__main__": construct allows the script to be imported as a module without running the main logic.

Running the Project

To run the project, I placed images in the images/ folder and executed:

python main.py

The script processes each image, applies preprocessing, and prints the extracted text. This setup is simple yet flexible, allowing for future enhancements like logging or image previews.

Challenges and Takeaways

  • Challenge: Finding the right preprocessing parameters was tricky. For example, adjusting the adaptive thresholding parameters (31 and 10) required experimentation to handle different image qualities effectively.
  • Takeaway: Modular design not only improves code readability but also simplifies debugging and testing. By isolating preprocessing, I could refine it independently without affecting other components.
  • Next Steps: I plan to add logging to track errors and successes, implement image previews for visual debugging, and explore advanced preprocessing techniques for handling noisy or multilingual documents.
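To sketch what the logging step might look like, here is a hedged draft (process_all and the log file name are my own placeholders, not code from the project) that wraps the existing per-image loop:

```python
import logging

# Write successes and failures to a log file alongside the console output.
logging.basicConfig(
    filename='ocr_run.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)
log = logging.getLogger('universal_ocr')

def process_all(image_paths, extract):
    """Run `extract` (e.g. extract_text_from_image) on each path.

    Returns {path: text}; failed images are logged with a traceback
    and omitted from the result instead of crashing the run.
    """
    results = {}
    for path in image_paths:
        try:
            results[path] = extract(path)
            log.info('extracted %d chars from %s', len(results[path]), path)
        except Exception:
            log.exception('failed to process %s', path)
    return results
```

Passing the extraction function in as an argument keeps this loop testable with a stub, which is the same modularity lesson applied one level up.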

Why This Matters

This project is more than a learning exercise—it’s a step toward building real-world applications like document digitization, automated form processing, or assistive technologies for visually impaired users. Mastering OCR equips me with skills that have practical impact across industries, from finance to healthcare.

Final Thoughts

Day 3 taught me the power of modular design and the critical role of preprocessing in OCR. I’m excited to continue this journey, building on this foundation to tackle more complex challenges like multilingual text extraction or optimizing for low-quality images. If you’re working on OCR or computer vision projects, I’d love to hear your experiences and tips in the comments!

#Python #OCR #ComputerVision #MachineLearning #Day3
