OCR: Converting a scanned image to text on Linux

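I found a nice script in this Ubuntu Forums thread: http://ubuntuforums.org/showthread.php?t=880471. It works really well. It renders each page of a scanned PDF to an image with pdftoppm, converts that image to TIFF with ImageMagick's convert, runs tesseract on it, and collects the recognized text in a single output file.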
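The script below calls three external programs: pdftoppm, convert and tesseract. As a rough sketch, a check like the following could be run first to see whether any of them is missing. The function name missing_tools is my own and is not part of the forum script. On Ubuntu the packages are usually poppler-utils, imagemagick and tesseract-ocr.

```shell
#!/bin/sh
# missing_tools NAME...
# Hypothetical helper (not from the original script): prints the name of
# every listed command that cannot be found on the PATH, one per line.
missing_tools() {
  for tool in "$@"
  do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example call for the tools used by the OCR script:
#   missing_tools pdftoppm convert tesseract
```

If the call prints nothing, all three tools were found.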

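#!/bin/sh
# OCR scanned PDFs into one text file.
# Usage: pass one or more PDF files as arguments.
# Needs pdftoppm (poppler-utils), convert (ImageMagick) and tesseract.
SCRIPT_NAME=$(basename "$0" .sh)
TMP_DIR="${SCRIPT_NAME}-tmp"
OUTPUT_FILE="${SCRIPT_NAME}-output.txt"

# Work on copies in a scratch directory; stop if any step fails so the
# cleanup at the end never runs in the wrong directory.
mkdir "$TMP_DIR" || exit 1
cp "$@" "$TMP_DIR" || exit 1
cd "$TMP_DIR" || exit 1

# Render every page at 600 dpi. pdftoppm takes one PDF per call,
# so loop over the copied files.
for pdf in *
do
  pdftoppm -r 600 "./$pdf" "ocrbook-$pdf"
done

for i in *.ppm
do
  BASE=$(basename "$i" .ppm)
  convert "$i" "${BASE}.tif"
  tesseract "${BASE}.tif" "${BASE}" -l eng
  # Show the recognized text and append it to the output file.
  tee -a "$OUTPUT_FILE" < "${BASE}.txt"
  echo "[pagebreak]" | tee -a "$OUTPUT_FILE"
  rm "${BASE}".*
done

# Move the result next to the scratch directory, then clean up.
mv "$OUTPUT_FILE" ..
rm ./*
cd ..
rmdir "$TMP_DIR"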
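
The script writes a line containing [pagebreak] after each page, so the combined output can be taken apart again later. As a minimal sketch of one way to do that, the function below splits the output file into one file per page with awk. The name split_pages and the page-001.txt naming scheme are assumptions of mine, not something from the forum thread.

```shell
#!/bin/sh
# split_pages FILE [PREFIX]
# Hypothetical helper: writes PREFIX-001.txt, PREFIX-002.txt, ... from FILE,
# starting a new file after every line that is exactly "[pagebreak]".
# PREFIX defaults to "page". A page with no text produces no file.
split_pages() {
  awk -v prefix="${2:-page}" '
    BEGIN { n = 1; out = sprintf("%s-%03d.txt", prefix, n) }
    $0 == "[pagebreak]" {
      close(out)
      n++
      out = sprintf("%s-%03d.txt", prefix, n)
      next
    }
    { print > out }
  ' "$1"
}
```

Called as split_pages followed by the name of the output file, it leaves the original file untouched and creates the per-page files in the current directory.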

Posted on January 27, 2012, in Linux, Oneiric Ocelot, Technology, Ubuntu.
