Wrote a small script to do OCR of PDF files

January 13, 2025 466 words 3 minutes

Contents

I've just created a script using Tersercat for OCR on PDF or image files. Since Tesseract cannot directly handle PDF files, I employ ImageMagick to convert them into PNGs before proceeding with OCR. For multi-page PDFs, the script converts each page into a separate PNG and processes them individually, consolidating the results into a single file with “==== New Page [page name] =” marking the separation between pages.

The script offers four options:

`-h`, `–help` : Displays the help message
`-r <resolution>` : Sets the resolution for PDF conversion, defaulting to 300 DPI
`-l <language>` : Specifies the OCR language, defaulting to English; refer to Tesseract for available languages
`-p`, `–prompt` : Prompts the user before overwriting an existing output file; without this option, the script automatically overwrites the file

#!/bin/bash

# Set default values
DEFAULT_RESOLUTION=300
DEFAULT_LANGUAGE=eng

# Function to check and install required software
check_and_install() {
    if ! command -v "$1" &> /dev/null; then
        echo "Error: $1 is not installed."
        echo "Install using: sudo apt-get install $1"
        exit 1
    fi
}

check_and_install convert
check_and_install tesseract


# Function to handle potential errors and cleanup
handle_error() {
    rm -f "$OUTPUT_IMAGE_PREFIX"*.png 2>/dev/null
    exit 1
}

# Help function
help_message() {
    echo "Usage: $0 [OPTIONS] <input_pdf> [output_txt]"
    echo "  Options:"
    echo "    -h, --help           Display this help message"
    echo "    -r <resolution>      Specify resolution for conversion (default: $DEFAULT_RESOLUTION DPI)"
    echo "    -l <language>        Specify OCR language (default: $DEFAULT_LANGUAGE)"
    echo "    -p, --prompt         Ask user if overwrite existed output file"
    exit 0
}

LANGUAGE="$DEFAULT_LANGUAGE"
RESOLUTION="$DEFAULT_RESOLUTION"
OVERWRITE=1
# Parse options
while getopts ":r:l:h:p" opt; do
    case $opt in
        r) RESOLUTION="$OPTARG";;
        l) LANGUAGE="$OPTARG";;
        h) help_message; exit 0;;
        p) OVERWRITE=0;;
        \?) echo "Invalid option: -$OPTARG"; handle_error;;
    esac
done

shift $((OPTIND - 1))


if [[ $# -lt 1 ]]; then
    # echo "Error: Please specify an input PDF file."
    echo "Error: Please specify an input image file."
    help_message
fi

INPUT_FILE="$1"
shift

OUTPUT_TEXT="output.txt"
if [[ -n "$1" ]]; then
    OUTPUT_TEXT="$1"
fi

if [ "$OVERWRITE" -eq 0 -a -f "$OUTPUT_TEXT" ]; then
    echo "$OUTPUT_TEXT already exists, overwrite it? (yes/no): "
    read answer
    case "$answer" in
        yes|yes)
            rm "$OUTPUT_TEXT" 2>/dev/null
            ;;
        no|no)
            exit 1
            ;;
    esac
fi

OUTPUT_IMAGE_PREFIX="${INPUT_FILE%.*}"

if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: Input file not found: $INPUT_FILE"
    exit 1
fi

# Use the file command to determine the file type
file_type=$(file -b --mime-type "$INPUT_FILE")

if [ "$file_type" == "application/pdf" ]; then
    convert -density "$RESOLUTION" "$INPUT_FILE" "$OUTPUT_IMAGE_PREFIX"-%04d.png 2>/dev/null || handle_error
    for img in "$OUTPUT_IMAGE_PREFIX"*.png; do
        OUTPUT_PAGE_TEXT="$img"
        echo "converting new page.."
        tesseract "$img" "$OUTPUT_PAGE_TEXT" -l "$LANGUAGE" 2>/dev/null || handle_error
        echo "================" >> "$OUTPUT_TEXT"
        echo "== Page $page ==" >> "$OUTPUT_TEXT"
        echo "================" >> "$OUTPUT_TEXT"
        cat "$OUTPUT_PAGE_TEXT".txt >> "$OUTPUT_TEXT"
        rm -f "$OUTPUT_PAGE_TEXT".txt
    done
    rm -f "$OUTPUT_IMAGE_PREFIX"*.png 2>/dev/null
else
    tesseract "$INPUT_FILE" "${OUTPUT_TEXT%.*}" -l "$LANGUAGE" 2>/dev/null || handle_error
fi

echo "OCR completed successfully. Output in $OUTPUT_TEXT"

I uploaded the script to my github repository