Optical character recognition software is used to produce a text or word-processing file from a scanned image, which in turn may have been produced from a fax or other form of hard copy. The use of OCR software is growing amongst translators; one reason is no doubt the increasing use of translation memory applications, which require the source text to be in a machine-readable form.

ABBYY OCR CLI

It was a long wait, but worth it: Abbyy's Finereader engine is now available for Linux. So what's the catch?

Firstly, it's command-line only. Having said that, command-line operation can be an advantage, since it lends itself to scripting and batch processing. Also, although the application has a large number of functions and a correspondingly bewildering range of command-line arguments, the basic syntax delivers adequate functionality. For example, quite usable results from a scanned document in German can be obtained with:

 abbyyocr -rl German -if original_filename.pdf -of output_filename.txt
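
Because the tool is driven from the command line, it should be straightforward to script. The following is a minimal Python sketch that applies the command above to every PDF file in a folder. It assumes that the abbyyocr binary is on the PATH, and it uses only the three options shown above (-rl, -if, -of). The function names and the "scans" folder name are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, language="German"):
    """Return the abbyyocr argument list for one PDF (same options as above)."""
    pdf_path = Path(pdf_path)
    # Write the text output next to the input, with a .txt extension.
    txt_path = pdf_path.with_suffix(".txt")
    return ["abbyyocr", "-rl", language, "-if", str(pdf_path), "-of", str(txt_path)]


def ocr_folder(folder, language="German", dry_run=True):
    """Build (and optionally run) one abbyyocr command per PDF in the folder."""
    commands = [build_command(p, language) for p in sorted(Path(folder).glob("*.pdf"))]
    if not dry_run:
        for cmd in commands:
            # check=True stops the batch if abbyyocr reports an error.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical folder name; with dry_run=True the commands are only printed.
    for cmd in ocr_folder("scans"):
        print(" ".join(cmd))
```

With dry_run left at True, the script only prints the commands it would run, which is a cheap way of checking the file names before any pages are actually processed.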

Secondly, the pricing/licensing conditions are strange (or perhaps not, considering the huge volumes that could be processed by running the command line on a server). The price is around the same as for the Windows version, but Windows users presumably have a GUI and no page-count restriction.

Details of pricing and licensing can be found here. The page count is stored on the user's system and is linked to the system date. According to the information I have from Abbyy, licences can be transferred to new hardware (up to three times) by sending a request to Abbyy to deactivate the existing licence and issue a new one.

A free trial version with no time limit and a 100-page limit is available. 100 pages may not sound like much, but once you've reached the limit, you have a basis for calculating what the paid-for version is worth to you. The paid-for version also has a page limit, but at 12,000 pages per year for the cheapest version, it is one that few translators are likely to reach. (The page counter is reset each year.)
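
To put the yearly limit into perspective, here is a small Python sketch of the arithmetic. The 12,000-page figure comes from the text above; the function names and the monthly workload in the example are hypothetical.

```python
ANNUAL_PAGE_LIMIT = 12_000  # yearly limit of the cheapest paid-for version (from the text)


def monthly_allowance(annual_limit=ANNUAL_PAGE_LIMIT):
    """Average number of pages available per month under a yearly limit."""
    return annual_limit / 12


def within_limit(pages_per_month, annual_limit=ANNUAL_PAGE_LIMIT):
    """True if a steady monthly volume stays within the yearly limit."""
    return pages_per_month * 12 <= annual_limit


if __name__ == "__main__":
    print(monthly_allowance())  # 1000.0
    print(within_limit(150))    # True; 150 pages a month is a hypothetical workload
```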

There is also a free online version that might be worth considering for larger volumes of reference material (in other words, non-sensitive texts), and a user group at groups.google.com/group/abbyy-ocr-for-linux.

Tesseract

Tesseract was originally developed as proprietary software by Hewlett-Packard. Development lapsed for many years, but the code was released as open source in 2005. Tesseract is currently being developed by Google. At the time of writing, Tesseract is by a wide margin the most promising open-source optical character reader for Linux.

According to Wikipedia, Tesseract can currently read English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch, and can be trained to read other languages.

An early review of Tesseract by Nathan Willis can be found here, and helpful instructions on using it can also be found here.
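
As a rough illustration of basic usage, the Python sketch below builds a Tesseract command line of the form "tesseract image outputbase -l lang". It assumes that the tesseract binary and the German language data ("deu") are installed, and that the input is an image file such as a TIFF rather than a PDF. The file name and function name are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, language="deu"):
    """Return the argument list for: tesseract <image> <output base> -l <lang>."""
    image_path = Path(image_path)
    # Tesseract appends ".txt" to the output base itself, so the suffix is dropped.
    output_base = image_path.with_suffix("")
    return ["tesseract", str(image_path), str(output_base), "-l", language]


if __name__ == "__main__":
    cmd = tesseract_command("scan_page1.tif")  # hypothetical file name
    print(" ".join(cmd))
    # Uncomment to run the OCR for real (requires tesseract to be installed):
    # subprocess.run(cmd, check=True)
```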

Kooka, Ocrad, ClaraOCR, GOCR

A variety of open-source OCR applications were already available before the release of Tesseract's source code, and they may still be worth experimenting with.

Vividata

Vividata is a commercial vendor offering a range of OCR utilities and plug-ins for Linux. I tried the demonstration version of "OCR Shop XTR Lite", a command-line product, and used it to convert the text in a non-extractable PDF file into a plain text file. Despite the command-line interface, it was easy to use, and the results were excellent.

And now the bad news: OCR Shop XTR Lite costs US$995, plus US$200 for the facility for reading non-extractable text in PDF files, plus US$100 for each additional language. In other words, for the price of this particular OCR software, you could buy a mid-range PC with Windows and Abbyy Finereader installed for use solely as an optical character reader, and still have enough cash left over for a weekend for two in Venice. Clearly, the vendor is not targeting the desktop market, but rather customers such as libraries seeking to digitize their entire stock.