Large-Scale Optical Character Recognition of Ancient Greek

Article

Publication date:

2017

Abstract:

This paper documents our campaign to undertake the large-scale optical character recognition of ancient, or polytonic, Greek. Building upon the Gamera OCR engine and developing a suite of post-processing tools, including automatic spellcheck, we processed 1,200 volumes comprising 329,002,271 Greek words. A sample of 10 pages is studied in detail; it demonstrates the degree to which each step of post-processing improved the results, and for which source documents. These pages attain an average character accuracy of about 96%. These results will provide a basis for further improvements, including the training of other open-source OCR engines.

CRIS type:

01.01 Journal article

Keywords:

OCR; Ancient Greek

Authors: