End-to-End Transcript Alignment of 17th Century Manuscripts: The Case of Moccia Code

The growth of digital libraries has yielded a large number of handwritten historical documents in the form of images, often accompanied by a digital transcription of the content. The ability to track the position of the words of the digital transcription in the images can be important both for the s...

Bibliographic Details
Main Authors: Giuseppe De Gregorio, Giuliana Capriolo, Angelo Marcelli
Format: Article
Language: English
Published: MDPI AG 2023-01-01
Series: Journal of Imaging
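Subjects: historical handwritten document processing; text-line segmentation; word segmentation; transcript alignment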
Online Access: https://www.mdpi.com/2313-433X/9/1/17
_version_ 1797440480706822144
author Giuseppe De Gregorio
Giuliana Capriolo
Angelo Marcelli
author_facet Giuseppe De Gregorio
Giuliana Capriolo
Angelo Marcelli
author_sort Giuseppe De Gregorio
collection DOAJ
description The growth of digital libraries has yielded a large number of handwritten historical documents in the form of images, often accompanied by a digital transcription of the content. The ability to track the position of the words of the digital transcription in the images can be important both for the study of the document by humanities scholars and for further automatic processing. We propose a learning-free method for automatically aligning the transcription to the document image. The method receives as input the digital image of the document and the transcription of its content, and aims to link each word of the transcription to the corresponding word image within the page. The method comprises two main original contributions: a line-level segmentation algorithm capable of detecting text lines with curved baselines, and a text-to-image alignment algorithm capable of dealing with under- and over-segmentation errors at the word level. Experiments on pages from a 17th-century Italian manuscript show that the line segmentation method segments 92% of the text lines correctly and that the method achieves an alignment accuracy greater than 68%. Moreover, the performance achieved on widely used data sets compares favourably with the state of the art.
first_indexed 2024-03-09T12:07:49Z
format Article
id doaj.art-8e129c16140045e5a2a5901de3550db9
institution Directory Open Access Journal
issn 2313-433X
language English
last_indexed 2024-03-09T12:07:49Z
publishDate 2023-01-01
publisher MDPI AG
record_format Article
series Journal of Imaging
spelling doaj.art-8e129c16140045e5a2a5901de3550db9
2023-11-30T22:55:33Z
eng
MDPI AG
Journal of Imaging
2313-433X
2023-01-01
9 1 17
10.3390/jimaging9010017
End-to-End Transcript Alignment of 17th Century Manuscripts: The Case of <i>Moccia Code</i>
Giuseppe De Gregorio 0
Giuliana Capriolo 1
Angelo Marcelli 2
Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, Via Giovanni Paolo II, 132, 84084 Fisciano, Italy
Department of Cultural Heritage, University of Salerno, Via Giovanni Paolo II, 132, 84084 Fisciano, Italy
Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, Via Giovanni Paolo II, 132, 84084 Fisciano, Italy
The growth of digital libraries has yielded a large number of handwritten historical documents in the form of images, often accompanied by a digital transcription of the content. The ability to track the position of the words of the digital transcription in the images can be important both for the study of the document by humanities scholars and for further automatic processing. We propose a learning-free method for automatically aligning the transcription to the document image. The method receives as input the digital image of the document and the transcription of its content, and aims to link each word of the transcription to the corresponding word image within the page. The method comprises two main original contributions: a line-level segmentation algorithm capable of detecting text lines with curved baselines, and a text-to-image alignment algorithm capable of dealing with under- and over-segmentation errors at the word level. Experiments on pages from a 17th-century Italian manuscript show that the line segmentation method segments 92% of the text lines correctly and that the method achieves an alignment accuracy greater than 68%. Moreover, the performance achieved on widely used data sets compares favourably with the state of the art.
https://www.mdpi.com/2313-433X/9/1/17
historical handwritten document processing
text-line segmentation
word segmentation
transcript alignment
spellingShingle Giuseppe De Gregorio
Giuliana Capriolo
Angelo Marcelli
End-to-End Transcript Alignment of 17th Century Manuscripts: The Case of <i>Moccia Code</i>
Journal of Imaging
historical handwritten document processing
text-line segmentation
word segmentation
transcript alignment
title End-to-End Transcript Alignment of 17th Century Manuscripts: The Case of <i>Moccia Code</i>
title_full End-to-End Transcript Alignment of 17th Century Manuscripts: The Case of <i>Moccia Code</i>
title_fullStr End-to-End Transcript Alignment of 17th Century Manuscripts: The Case of <i>Moccia Code</i>
title_full_unstemmed End-to-End Transcript Alignment of 17th Century Manuscripts: The Case of <i>Moccia Code</i>
title_short End-to-End Transcript Alignment of 17th Century Manuscripts: The Case of <i>Moccia Code</i>
title_sort end to end transcript alignment of 17th century manuscripts the case of i moccia code i
topic historical handwritten document processing
text-line segmentation
word segmentation
transcript alignment
url https://www.mdpi.com/2313-433X/9/1/17
work_keys_str_mv AT giuseppedegregorio endtoendtranscriptalignmentof17thcenturymanuscriptsthecaseofimocciacodei
AT giulianacapriolo endtoendtranscriptalignmentof17thcenturymanuscriptsthecaseofimocciacodei
AT angelomarcelli endtoendtranscriptalignmentof17thcenturymanuscriptsthecaseofimocciacodei