Preprocessing of recto-verso printed documents based on neural networks for text analysis
Conference proceedings contribution
Publication date:
2023
Abstract:
Among the many kinds of damage affecting
ancient documents, the penetration of ink from one side of the
page to the other is one of the most frequent and invasive. In this
work, we address the binarization of such degraded documents
as a preprocessing step for OCR and other automatic text analysis
tools, which can help philologists and palaeographers in text
transcription. We previously proposed a data model that roughly
describes this damage in recto-verso documents, and used it
to generate an artificial training set that teaches a shallow
neural network to classify pixels on both sides as clean or
corrupt. We show that this joint processing of the two sides of the
document can significantly improve binarization, and therefore
OCR and other text analysis tasks, compared with processing
each side separately using the same information.
CRIS type:
04.01 Conference proceedings contribution
Keywords:
Ancient document text analysis; Degraded document binarization; Optical character recognition; Recto-verso documents; Shallow multilayer neural networks
Authors:
Tonazzini, Anna; Savino, Pasquale
Link to the full record: