Nautilus

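When a digital collection has been processed by OCR, the usability expectations of patrons and researchers are high. While the former expect full text search to return all instances of terms in historical collections correctly, the latter are more familiar with the impacts of OCR errors but would still like to apply big data analysis or machine-learning methods. All of these use cases depend on high quality textual transcriptions of the scans. This is why the National Library of Luxembourg (BnL) has developed a pipeline to improve OCR for existing digitised documents. Enhancing OCR in a digital library not only demands improved machine learning models, but also requires a coherent reprocessing strategy in order to apply them efficiently in production systems. The newly developed software tool, Nautilus, fulfils these requirements using METS/ALTO as a pivot format. The BnL has open-sourced it so that other libraries can re-use it on their own collections. This paper covers the creation of the ground truth, the details of the reprocessing pipeline, its production use on the entirety of the BnL collection, along with the estimated results. Based on a quality prediction measure, developed during the project, approximately 28 million additional text lines now exceed the quality threshold.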

Bibliographic Details
Main Authors: Yves Maurer, Pit Schneider, Ralph Marschall
Format: Article
Language: English
Published: openjournals.nl 2023-04-01
Series: Liber Quarterly: The Journal of European Research Libraries
Subjects: OCR quality; OCR correction; Luxembourg historical newspapers; ground truth; METS/ALTO
Online Access: https://liberquarterly.eu/article/view/13330
author Yves Maurer
Pit Schneider
Ralph Marschall
collection DOAJ
description When a digital collection has been processed by OCR, the usability expectations of patrons and researchers are high. While the former expect full text search to return all instances of terms in historical collections correctly, the latter are more familiar with the impacts of OCR errors but would still like to apply big data analysis or machine-learning methods. All of these use cases depend on high quality textual transcriptions of the scans. This is why the National Library of Luxembourg (BnL) has developed a pipeline to improve OCR for existing digitised documents. Enhancing OCR in a digital library not only demands improved machine learning models, but also requires a coherent reprocessing strategy in order to apply them efficiently in production systems. The newly developed software tool, Nautilus, fulfils these requirements using METS/ALTO as a pivot format. The BnL has open-sourced it so that other libraries can re-use it on their own collections. This paper covers the creation of the ground truth, the details of the reprocessing pipeline, its production use on the entirety of the BnL collection, along with the estimated results. Based on a quality prediction measure, developed during the project, approximately 28 million additional text lines now exceed the quality threshold.
first_indexed 2024-04-09T14:52:40Z
format Article
id doaj.art-fbeabbf6fa8c46fe94517f5db5a3b4f6
institution Directory Open Access Journal
issn 2213-056X
language English
last_indexed 2024-04-09T14:52:40Z
publishDate 2023-04-01
publisher openjournals.nl
record_format Article
series Liber Quarterly: The Journal of European Research Libraries
spelling doaj.art-fbeabbf6fa8c46fe94517f5db5a3b4f6; 2023-05-02T09:15:19Z; eng; openjournals.nl; Liber Quarterly: The Journal of European Research Libraries; 2213-056X; 2023-04-01; vol. 33, no. 1; 10.53377/lq.13330; Nautilus; Yves Maurer, Pit Schneider, Ralph Marschall (all National Library of Luxembourg); https://liberquarterly.eu/article/view/13330; OCR quality; OCR correction; Luxembourg historical newspapers; ground truth; METS/ALTO
title Nautilus
topic OCR quality
OCR correction
Luxembourg historical newspapers
ground truth
METS/ALTO
url https://liberquarterly.eu/article/view/13330