Nautilus
When a digital collection has been processed by OCR, the usability expectations of patrons and researchers are high. While the former expect full text search to return all instances of terms in historical collections correctly, the latter are more familiar with the impacts of OCR errors but would s...
Main Authors: | Yves Maurer, Pit Schneider, Ralph Marschall |
---|---|
Format: | Article |
Language: | English |
Published: | openjournals.nl, 2023-04-01 |
Series: | Liber Quarterly: The Journal of European Research Libraries |
Subjects: | OCR quality; OCR correction; Luxembourg historical newspapers; ground truth; METS/ALTO |
Online Access: | https://liberquarterly.eu/article/view/13330 |
_version_ | 1827955856819355648 |
---|---|
author | Yves Maurer; Pit Schneider; Ralph Marschall |
collection | DOAJ |
description | When a digital collection has been processed by OCR, the usability expectations of patrons and researchers are high. While the former expect full text search to return all instances of terms in historical collections correctly, the latter are more familiar with the impacts of OCR errors but would still like to apply big data analysis or machine-learning methods. All of these use cases depend on high quality textual transcriptions of the scans. This is why the National Library of Luxembourg (BnL) has developed a pipeline to improve OCR for existing digitised documents. Enhancing OCR in a digital library not only demands improved machine learning models, but also requires a coherent reprocessing strategy in order to apply them efficiently in production systems. The newly developed software tool, Nautilus, fulfils these requirements using METS/ALTO as a pivot format. The BnL has open-sourced it so that other libraries can re-use it on their own collections. This paper covers the creation of the ground truth, the details of the reprocessing pipeline, its production use on the entirety of the BnL collection, along with the estimated results. Based on a quality prediction measure, developed during the project, approximately 28 million additional text lines now exceed the quality threshold. |
first_indexed | 2024-04-09T14:52:40Z |
format | Article |
id | doaj.art-fbeabbf6fa8c46fe94517f5db5a3b4f6 |
institution | Directory of Open Access Journals |
issn | 2213-056X |
language | English |
last_indexed | 2024-04-09T14:52:40Z |
publishDate | 2023-04-01 |
publisher | openjournals.nl |
record_format | Article |
series | Liber Quarterly: The Journal of European Research Libraries |
spelling | doaj.art-fbeabbf6fa8c46fe94517f5db5a3b4f6; 2023-05-02T09:15:19Z; eng; openjournals.nl; Liber Quarterly: The Journal of European Research Libraries; ISSN 2213-056X; 2023-04-01; vol. 33, no. 1; DOI 10.53377/lq.13330; Nautilus; Yves Maurer, Pit Schneider, Ralph Marschall (all National Library of Luxembourg); https://liberquarterly.eu/article/view/13330; OCR quality; OCR correction; Luxembourg historical newspapers; ground truth; METS/ALTO |
title | Nautilus |
topic | OCR quality; OCR correction; Luxembourg historical newspapers; ground truth; METS/ALTO |
url | https://liberquarterly.eu/article/view/13330 |