Are Searches in OCR-generated Archives Trustworthy?
Digitised archives are revolutionary tools for research that, in a few seconds, generate results that earlier often took years to obtain. But do they provide all results for the terms searched for? The accuracy of searches was tested by performing sample searches of leading newspaper databases. The...
Main Author: | Burchardt Jørgen |
---|---|
Format: | Article |
Language: | deu |
Published: | De Gruyter, 2023-05-01 |
Series: | Jahrbuch für Wirtschaftsgeschichte |
Subjects: | |
Online Access: | https://doi.org/10.1515/jbwg-2023-0003 |
_version_ | 1797246646216556544 |
---|---|
author | Burchardt Jørgen |
author_facet | Burchardt Jørgen |
author_sort | Burchardt Jørgen |
collection | DOAJ |
description | Digitised archives are revolutionary tools for research that, in a few seconds, generate results that earlier often took years to obtain. But do they provide all results for the terms searched for? The accuracy of searches was tested by performing sample searches of leading newspaper databases. The test revealed several weaknesses in the search process, including an average 18 percent error rate for single words in body text and far higher error rates for advertisements. Such high error rates encourage a critical look at the 20-year-old sector. Although these errors can be reduced by re-digitisation and by new, improved OCR engines and search algorithms, searches will nevertheless return manipulated results. In response, and to identify infringed bias and skewed representation, database owners need to provide thorough metadata to ensure source criticism. |
first_indexed | 2024-04-24T19:46:06Z |
format | Article |
id | doaj.art-415f8a44bba244c0b214ede229d29c75 |
institution | Directory Open Access Journal |
issn | 0075-2800 2196-6842 |
language | deu |
last_indexed | 2024-04-24T19:46:06Z |
publishDate | 2023-05-01 |
publisher | De Gruyter |
record_format | Article |
series | Jahrbuch für Wirtschaftsgeschichte |
spelling | doaj.art-415f8a44bba244c0b214ede229d29c75; 2024-03-25T07:28:55Z; deu; De Gruyter; Jahrbuch für Wirtschaftsgeschichte; 0075-2800; 2196-6842; 2023-05-01; 64; 1; 31; 54; 10.1515/jbwg-2023-0003; Are Searches in OCR-generated Archives Trustworthy?; Burchardt Jørgen; Nyborgvej 13, DK-5750 Ringe, Denmark; Digitised archives are revolutionary tools for research that, in a few seconds, generate results that earlier often took years to obtain. But do they provide all results for the terms searched for? The accuracy of searches was tested by performing sample searches of leading newspaper databases. The test revealed several weaknesses in the search process, including an average 18 percent error rate for single words in body text and far higher error rates for advertisements. Such high error rates encourage a critical look at the 20-year-old sector. Although these errors can be reduced by re-digitisation and by new, improved OCR engines and search algorithms, searches will nevertheless return manipulated results. In response, and to identify infringed bias and skewed representation, database owners need to provide thorough metadata to ensure source criticism.; https://doi.org/10.1515/jbwg-2023-0003; optical character recognition; historical archive; source criticism; research methodology; historische archive; quellenkritik; forschungsmethodik; ocr; c 82 |
spellingShingle | Burchardt Jørgen Are Searches in OCR-generated Archives Trustworthy? Jahrbuch für Wirtschaftsgeschichte optical character recognition historical archive source criticism research methodology historische archive quellenkritik forschungsmethodik ocr c 82 |
title | Are Searches in OCR-generated Archives Trustworthy? |
title_full | Are Searches in OCR-generated Archives Trustworthy? |
title_fullStr | Are Searches in OCR-generated Archives Trustworthy? |
title_full_unstemmed | Are Searches in OCR-generated Archives Trustworthy? |
title_short | Are Searches in OCR-generated Archives Trustworthy? |
title_sort | are searches in ocr generated archives trustworthy |
topic | optical character recognition historical archive source criticism research methodology historische archive quellenkritik forschungsmethodik ocr c 82 |
url | https://doi.org/10.1515/jbwg-2023-0003 |
work_keys_str_mv | AT burchardtjørgen aresearchesinocrgeneratedarchivestrustworthy |