Linux for Translators

Optical character recognition (OCR)

Optical character recognition (OCR) software is used to produce a text or word-processing file from a scanned image, which in turn may have been produced from a fax or other form of hard copy.

To my knowledge, the only "industrial-strength" yet affordable optical character recognition application available at the present time is ABBYY's Finereader OCR CLI. This product has two drawbacks: it is available only as a command-line application, and it comes with a hefty price tag of €199 in Europe. (This pricing option also limits the number of scanned pages to 12,000 per year, but most translators are unlikely to exceed this limit unless they are digitizing all their paper dictionaries at once.)

Alternatives to ABBYY Finereader exist, notably the free and open-source Tesseract. In my own tests, these have fallen far short of the results delivered by ABBYY Finereader, most importantly in their text recognition accuracy: if you have to spend a lot of time correcting the output, it somewhat defeats the object. There is however a lot that you can do at the preparatory stage (such as adjusting the contrast, correcting skew, etc.) to improve the results of the recognition stage itself.
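To illustrate what "adjusting the contrast" means in practice, here is a minimal sketch in plain Python. It is not taken from any of the products mentioned here; the function names `stretch_contrast` and `binarize` and the sample pixel values are purely illustrative, and the page is assumed to be available as a grid of greyscale values from 0 (black) to 255 (white). In real use you would normally let your scanning or image-editing software do this for you.

```python
def stretch_contrast(pixels):
    """Linearly rescale grey values so the darkest becomes 0 and the lightest 255."""
    lo = min(min(row) for row in pixels)
    hi = max(max(row) for row in pixels)
    if hi == lo:
        # A completely uniform image has no contrast to stretch.
        return [row[:] for row in pixels]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in pixels]


def binarize(pixels, threshold=128):
    """Turn every pixel into pure black (0) or pure white (255)."""
    return [[0 if p < threshold else 255 for p in row] for row in pixels]


# Hypothetical faded fax: mid-grey "ink" (about 100) on a light-grey page (about 180).
faded = [
    [180, 100, 180],
    [180, 100, 180],
    [175, 105, 180],
]

stretched = stretch_contrast(faded)  # ink moves towards 0, paper towards 255
clean = binarize(stretched)          # crisp black-on-white input for the OCR engine
```

The idea is simply that a washed-out scan uses only a narrow band of grey values; spreading that band over the full range, and then forcing each pixel to black or white, tends to give the recognition engine cleaner character shapes to work with.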

Besides its text recognition accuracy, ABBYY Finereader also does a good job of exporting to a Microsoft Word file and reproducing complex page layouts. It does this so well, in fact, that it is worth using to process "proper" PDF files, i.e. those which already include machine-readable text and as such require no OCR.

For the record, mention should also be made of Vividata. Some years ago, I tried the demonstration version of Vividata's "OCR Shop XTR Lite" (also a command-line application). Despite the command-line interface, it was easy to use, and the results were outstanding.

Even in its cheapest version, however, the price of Vividata's OCR software is well into four figures. Clearly, the vendor is not targeting the single-user desktop market, but rather customers such as libraries seeking to digitize their entire stock.

An alternative to using optical character recognition on the desktop is to take advantage of online services, such as ABBYY Finereader online.
