Wrote a small script to do OCR of PDF files
I've just written a script that uses Tesseract to run OCR on PDF or image files. Tesseract cannot read PDF files directly, so the script first uses ImageMagick to convert them to PNG and then runs OCR on the images. For a multi-page PDF, each page becomes a separate PNG that is processed individually, and the results are consolidated into a single file, with a “== Page <number> ==” banner framed by rows of “=” marking the start of each page.
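The consolidation step on its own can be sketched as follows. This is a minimal illustration, not the script's actual code: `merge_pages` is a hypothetical helper, and it assumes the per-page text files already exist and are passed in page order.

```bash
#!/bin/bash
# Sketch of the consolidation step only. merge_pages is a hypothetical
# helper: it takes the output file first, then the per-page text files
# in page order, and writes a banner before each page's text.
merge_pages() {
    local output="$1"
    shift
    local page=0
    local page_text
    : > "$output"    # start from an empty output file
    for page_text in "$@"; do
        page=$((page + 1))
        {
            echo "================"
            echo "== Page $page =="
            echo "================"
            cat "$page_text"
        } >> "$output"
    done
}
```

A call such as `merge_pages combined.txt page-0000.txt page-0001.txt` would then produce one file with two banners, one per page.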
The script offers four options:
- `-h`, `--help` : Displays the help message
- `-r <resolution>` : Sets the resolution for PDF conversion, defaulting to 300 DPI
- `-l <language>` : Specifies the OCR language, defaulting to English (`eng`); refer to the Tesseract documentation for available languages
- `-p`, `--prompt` : Prompts the user before overwriting an existing output file; without this option, the script automatically overwrites the file
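The value given to `-r` is handed to ImageMagick as-is, so a typo only surfaces when the conversion fails. A small guard along these lines could catch it earlier. This is a sketch under assumptions: `is_valid_resolution` is a hypothetical helper, and it accepts only a plain positive integer, deliberately rejecting other forms that ImageMagick's `-density` understands, such as `300x300`.

```bash
#!/bin/bash
# Hypothetical helper: succeeds only if $1 is a positive integer
# with no leading zero (e.g. 150, 300, 600).
is_valid_resolution() {
    [[ "$1" =~ ^[1-9][0-9]*$ ]]
}
```

It could be used right after option parsing, for example `is_valid_resolution "$RESOLUTION" || { echo "Invalid resolution: $RESOLUTION"; exit 1; }`.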
#!/bin/bash
# Set default values
DEFAULT_RESOLUTION=300
DEFAULT_LANGUAGE=eng
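# Function to check that required software is installed
# $1: command to look for, $2: package that provides it (defaults to $1)
check_and_install() {
    if ! command -v "$1" &> /dev/null; then
        echo "Error: $1 is not installed."
        echo "Install using: sudo apt-get install ${2:-$1}"
        exit 1
    fi
}
check_and_install convert imagemagick
check_and_install tesseract tesseract-ocr
check_and_install file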
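# Function to handle potential errors and cleanup
handle_error() {
    # Only clean up once the prefix is set; an empty prefix would turn the
    # glob into *.png and delete unrelated images in the current directory
    if [[ -n "$OUTPUT_IMAGE_PREFIX" ]]; then
        rm -f "$OUTPUT_IMAGE_PREFIX"-*.png 2>/dev/null
    fi
    exit 1
}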
# Help function
help_message() {
    echo "Usage: $0 [OPTIONS] <input_file> [output_txt]"
    echo "  Options:"
    echo "    -h, --help         Display this help message"
    echo "    -r <resolution>    Specify resolution for PDF conversion (default: $DEFAULT_RESOLUTION DPI)"
    echo "    -l <language>      Specify OCR language (default: $DEFAULT_LANGUAGE)"
    echo "    -p, --prompt       Ask before overwriting an existing output file"
    exit 0
}
LANGUAGE="$DEFAULT_LANGUAGE"
RESOLUTION="$DEFAULT_RESOLUTION"
OVERWRITE=1
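# Parse options (getopts only handles short options, so map the long forms first)
for arg in "$@"; do
    shift
    case "$arg" in
        --help) set -- "$@" "-h";;
        --prompt) set -- "$@" "-p";;
        *) set -- "$@" "$arg";;
    esac
done
while getopts ":r:l:hp" opt; do
    case $opt in
        r) RESOLUTION="$OPTARG";;
        l) LANGUAGE="$OPTARG";;
        h) help_message;;
        p) OVERWRITE=0;;
        :) echo "Option -$OPTARG requires an argument"; exit 1;;
        \?) echo "Invalid option: -$OPTARG"; exit 1;;
    esac
done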
shift $((OPTIND - 1))
if [[ $# -lt 1 ]]; then
    echo "Error: Please specify an input PDF or image file."
    help_message
fi
INPUT_FILE="$1"
shift
# Output file defaults to output.txt when no second argument is given
OUTPUT_TEXT="${1:-output.txt}"
if [[ "$OVERWRITE" -eq 0 && -f "$OUTPUT_TEXT" ]]; then
    echo "$OUTPUT_TEXT already exists, overwrite it? (yes/no): "
    read -r answer
    case "$answer" in
        yes|Yes|y|Y)
            rm "$OUTPUT_TEXT" 2>/dev/null
            ;;
        *)
            # Treat anything other than an explicit yes as a refusal
            exit 1
            ;;
    esac
fi
OUTPUT_IMAGE_PREFIX="${INPUT_FILE%.*}"
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: Input file not found: $INPUT_FILE"
    exit 1
fi
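# Use the file command to determine the file type
file_type=$(file -b --mime-type "$INPUT_FILE")
if [ "$file_type" == "application/pdf" ]; then
    # Render each PDF page to its own PNG: <prefix>-0000.png, <prefix>-0001.png, ...
    convert -density "$RESOLUTION" "$INPUT_FILE" "$OUTPUT_IMAGE_PREFIX"-%04d.png 2>/dev/null || handle_error
    # Start from an empty output file so an existing one is overwritten, not appended to
    : > "$OUTPUT_TEXT"
    page=0
    for img in "$OUTPUT_IMAGE_PREFIX"-*.png; do
        page=$((page + 1))
        OUTPUT_PAGE_TEXT="$img"
        echo "Converting page $page..."
        # tesseract appends .txt to the output base name
        tesseract "$img" "$OUTPUT_PAGE_TEXT" -l "$LANGUAGE" 2>/dev/null || handle_error
        echo "================" >> "$OUTPUT_TEXT"
        echo "== Page $page ==" >> "$OUTPUT_TEXT"
        echo "================" >> "$OUTPUT_TEXT"
        cat "$OUTPUT_PAGE_TEXT".txt >> "$OUTPUT_TEXT"
        rm -f "$OUTPUT_PAGE_TEXT".txt
    done
    rm -f "$OUTPUT_IMAGE_PREFIX"-*.png 2>/dev/null
else
    # Send the text to stdout so it lands in $OUTPUT_TEXT whatever its extension;
    # there are no temporary files to clean up in this branch
    tesseract "$INPUT_FILE" stdout -l "$LANGUAGE" 2>/dev/null > "$OUTPUT_TEXT" || exit 1
fi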
echo "OCR completed successfully. Output in $OUTPUT_TEXT"
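The script writes its intermediate PNGs next to the input file and removes them by glob afterwards. One alternative, sketched here rather than taken from the script, is to keep intermediates in a private temporary directory that is removed whether or not the work succeeds. `with_workdir` is a hypothetical helper; it assumes the command it is given accepts the directory path as its last argument.

```bash
#!/bin/bash
# Hypothetical helper: create a temporary directory, run the given command
# with that directory appended as its last argument, then delete the
# directory and return the command's exit status.
with_workdir() {
    local workdir
    local status=0
    workdir=$(mktemp -d) || return 1
    "$@" "$workdir" || status=$?
    rm -rf "$workdir"
    return "$status"
}
```

For instance, a function that renders pages into `"$1"` and runs OCR on them there could be invoked as `with_workdir that_function`, and no stray PNGs would be left behind even if OCR fails halfway through.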
I uploaded the script to my GitHub repository.