Reichsanzeiger-GT: An OCR ground truth dataset based on the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette) (1819–1945)

Reichsanzeiger-GT is a ground truth dataset for OCR training and evaluation based on the historical German newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette), which was published from 1819 to 1945 and printed mostly in the type...

Bibliographic Details
Main Authors: Thomas Schmidt, Jan Kamlah, Stefan Weil
Format: Article
Language: English
Published: Elsevier 2024-06-01
Series: Data in Brief
Subjects: OCR, Text recognition, Ground truth, Historical newspapers
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340924002439
_version_ 1827226527360614400
author Thomas Schmidt
Jan Kamlah
Stefan Weil
collection DOAJ
description Reichsanzeiger-GT is a ground truth dataset for OCR training and evaluation based on the historical German newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette), which was published from 1819 to 1945 and printed mostly in the typeface Fraktur (Black Letter). The dataset consists of 101 newspaper pages from the years 1820–1939, which cover a wide variety of topics, page layouts (lists, tables, and advertisements), and typefaces. Using the transcription software Transkribus and the open-source OCR engine Tesseract, we automatically created and manually corrected layout segmentations and transcriptions for each page, resulting in 65,563 text regions, 412 table regions, 119,429 text lines, and 490,679 words. Because the transcription guidelines preserve the printing conditions, the dataset contains language- and printing-specific phenomena such as the historical glyphs long s (ſ) and rotunda r (ꝛ) and historical currency symbols (M, ₰), among others. The dataset is provided in two variants, both in PAGE XML format. The first contains ground truth data with table regions transformed to text regions for easier processing. The second preserves all table regions. Researchers can reuse this dataset to train new text recognition or layout segmentation models or to fine-tune existing ones. The dataset can also be used to evaluate the accuracy of existing OCR models. Because it follows specific, community-driven transcription guidelines, the dataset is interoperable and reusable with other datasets based on the same transcription level.
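As the description notes that the data is distributed as PAGE XML, the following is a minimal sketch of how region, line, and word counts like those quoted above could be tallied from a single file. It is not the authors' tooling: the helper names `local_name`, `count_page_elements`, and `line_texts` are hypothetical, and `SAMPLE` is an invented fragment rather than a page from the dataset. Tags are matched by local name so the sketch does not depend on a particular PAGE schema version.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented miniature PAGE XML fragment (assumption: NOT taken from the dataset).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Deutſcher</Unicode></TextEquiv></Word>
        <Word id="w2"><TextEquiv><Unicode>Reichsanzeiger</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Deutſcher Reichsanzeiger</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TableRegion id="t1"/>
  </Page>
</PcGts>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(xml_text: str) -> Counter:
    """Count text regions, table regions, text lines, and words in one page."""
    wanted = {"TextRegion", "TableRegion", "TextLine", "Word"}
    root = ET.fromstring(xml_text)
    names = (local_name(el.tag) for el in root.iter())
    return Counter(name for name in names if name in wanted)


def line_texts(xml_text: str) -> list[str]:
    """Return the line-level transcription of every TextLine, in document order."""
    root = ET.fromstring(xml_text)
    texts = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        # Only look at direct children so word-level TextEquiv is skipped.
        for child in line:
            if local_name(child.tag) != "TextEquiv":
                continue
            for uni in child:
                if local_name(uni.tag) == "Unicode":
                    texts.append(uni.text or "")
    return texts


counts = count_page_elements(SAMPLE)
print(dict(counts))
print(line_texts(SAMPLE))
```

To process real files, the same functions could be applied to the text of each PAGE XML file in a directory and the resulting counters summed.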
first_indexed 2024-04-24T20:13:02Z
format Article
id doaj.art-5685cb3e97f842cfa744758d3b8b0db4
institution Directory Open Access Journal
issn 2352-3409
language English
last_indexed 2025-03-21T17:43:29Z
publishDate 2024-06-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj.art-5685cb3e97f842cfa744758d3b8b0db4; 2024-06-12T04:45:53Z; eng; Elsevier; Data in Brief; 2352-3409; 2024-06-01; volume 54; article 110274
Thomas Schmidt (corresponding author), Jan Kamlah, Stefan Weil; all: University of Mannheim, University Library, Schloss Schneckenhof, 68161 Mannheim
http://www.sciencedirect.com/science/article/pii/S2352340924002439
OCR; Text recognition; Ground truth; Historical newspapers
title Reichsanzeiger-GT: An OCR ground truth dataset based on the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette) (1819–1945)
topic OCR
Text recognition
Ground truth
Historical newspapers
url http://www.sciencedirect.com/science/article/pii/S2352340924002439