Are Searches in OCR-generated Archives Trustworthy?

Digitised archives are revolutionary tools for research that, in a few seconds, generate results that earlier often took years to obtain. But do they provide all results for the terms searched for? The accuracy of searches was tested by performing sample searches of leading newspaper databases.

Bibliographic Details
Main Author: Burchardt Jørgen
Format: Article
Language: deu
Published: De Gruyter 2023-05-01
Series: Jahrbuch für Wirtschaftsgeschichte
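Subjects: optical character recognition; historical archive; source criticism; research methodology; historische archive; quellenkritik; forschungsmethodik; ocr; c 82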
Online Access: https://doi.org/10.1515/jbwg-2023-0003
_version_ 1797246646216556544
author Burchardt Jørgen
author_facet Burchardt Jørgen
author_sort Burchardt Jørgen
collection DOAJ
description Digitised archives are revolutionary tools for research that, in a few seconds, generate results that earlier often took years to obtain. But do they provide all results for the terms searched for? The accuracy of searches was tested by performing sample searches of leading newspaper databases. The test revealed several weaknesses in the search process, including an average 18 percent error rate for single words in body text, and far higher error rates for advertisements. Such high error rates encourage a critical look at the 20-year-old sector. Although these errors can be reduced by re-digitisation and by new, improved OCR engines and search algorithms, searches will nevertheless return manipulated results. In response, and to identify infringed bias and skewed representation, database owners need to provide thorough metadata to ensure source criticism.
first_indexed 2024-04-24T19:46:06Z
format Article
id doaj.art-415f8a44bba244c0b214ede229d29c75
institution Directory Open Access Journal
issn 0075-2800
2196-6842
language deu
last_indexed 2024-04-24T19:46:06Z
publishDate 2023-05-01
publisher De Gruyter
record_format Article
series Jahrbuch für Wirtschaftsgeschichte
spelling doaj.art-415f8a44bba244c0b214ede229d29c75 | 2024-03-25T07:28:55Z | deu | De Gruyter | Jahrbuch für Wirtschaftsgeschichte | 0075-2800 | 2196-6842 | 2023-05-01 | 64 | 1 | 31 | 54 | 10.1515/jbwg-2023-0003 | Are Searches in OCR-generated Archives Trustworthy? | Burchardt Jørgen | 0 | Nyborgvej 13, DK-5750 Ringe, Denmark | Digitised archives are revolutionary tools for research that, in a few seconds, generate results that earlier often took years to obtain. But do they provide all results for the terms searched for? The accuracy of searches was tested by performing sample searches of leading newspaper databases. The test revealed several weaknesses in the search process, including an average 18 percent error rate for single words in body text, and far higher error rates for advertisements. Such high error rates encourage a critical look at the 20-year-old sector. Although these errors can be reduced by re-digitisation and by new, improved OCR engines and search algorithms, searches will nevertheless return manipulated results. In response, and to identify infringed bias and skewed representation, database owners need to provide thorough metadata to ensure source criticism. | https://doi.org/10.1515/jbwg-2023-0003 | optical character recognition | historical archive | source criticism | research methodology | historische archive | quellenkritik | forschungsmethodik | ocr | c 82
spellingShingle Burchardt Jørgen
Are Searches in OCR-generated Archives Trustworthy?
Jahrbuch für Wirtschaftsgeschichte
optical character recognition
historical archive
source criticism
research methodology
historische archive
quellenkritik
forschungsmethodik
ocr
c 82
title Are Searches in OCR-generated Archives Trustworthy?
title_full Are Searches in OCR-generated Archives Trustworthy?
title_fullStr Are Searches in OCR-generated Archives Trustworthy?
title_full_unstemmed Are Searches in OCR-generated Archives Trustworthy?
title_short Are Searches in OCR-generated Archives Trustworthy?
title_sort are searches in ocr generated archives trustworthy
topic optical character recognition
historical archive
source criticism
research methodology
historische archive
quellenkritik
forschungsmethodik
ocr
c 82
url https://doi.org/10.1515/jbwg-2023-0003
work_keys_str_mv AT burchardtjørgen aresearchesinocrgeneratedarchivestrustworthy