Reichsanzeiger-GT: An OCR ground truth dataset based on the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette) (1819–1945)
Reichsanzeiger-GT is a ground truth dataset for OCR training and evaluation based on the historical German newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette), which was published from 1819 to 1945 and printed mostly in the type...
Main Authors: | Thomas Schmidt, Jan Kamlah, Stefan Weil |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2024-06-01 |
Series: | Data in Brief |
Subjects: | OCR; Text recognition; Ground truth; Historical newspapers |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340924002439 |
_version_ | 1827226527360614400 |
---|---|
author | Thomas Schmidt, Jan Kamlah, Stefan Weil |
author_sort | Thomas Schmidt |
collection | DOAJ |
description | Reichsanzeiger-GT is a ground truth dataset for OCR training and evaluation based on the historical German newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette), which was published from 1819 to 1945 and printed mostly in the typeface Fraktur (Black Letter). The dataset consists of 101 newspaper pages from the years 1820–1939 that cover a wide variety of topics, page layouts (lists, tables, and advertisements), and typefaces. Using the transcription software Transkribus and the open-source OCR engine Tesseract, we automatically created and manually corrected layout segmentations and transcriptions for each page, resulting in 65,563 text regions, 412 table regions, 119,429 text lines, and 490,679 words. Because the transcription guidelines preserve the printed form, the dataset retains language- and printing-specific phenomena such as the historical glyphs long s (ſ) and rotunda r (ꝛ) and historical currency symbols (M, ₰), among others. The dataset is provided in two variants in PAGE XML format. The first contains ground truth data with table regions transformed to text regions for easier processing. The second preserves all table regions. Researchers can reuse this dataset to train new or fine-tune existing text recognition or layout segmentation models, or to evaluate the accuracy of existing OCR models. Because it follows specific, community-driven transcription guidelines, the dataset is interoperable and reusable with other datasets based on the same transcription level. |
first_indexed | 2024-04-24T20:13:02Z |
format | Article |
id | doaj.art-5685cb3e97f842cfa744758d3b8b0db4 |
institution | Directory Open Access Journal |
issn | 2352-3409 |
language | English |
last_indexed | 2025-03-21T17:43:29Z |
publishDate | 2024-06-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj.art-5685cb3e97f842cfa744758d3b8b0db4; 2024-06-12T04:45:53Z; eng; Elsevier; Data in Brief; 2352-3409; 2024-06-01; 54; 110274; Thomas Schmidt (corresponding author; University of Mannheim, University Library, Schloss Schneckenhof, 68161 Mannheim); Jan Kamlah (University of Mannheim, University Library, Schloss Schneckenhof, 68161 Mannheim); Stefan Weil (University of Mannheim, University Library, Schloss Schneckenhof, 68161 Mannheim); http://www.sciencedirect.com/science/article/pii/S2352340924002439; OCR; Text recognition; Ground truth; Historical newspapers |
title | Reichsanzeiger-GT: An OCR ground truth dataset based on the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette) (1819–1945) |
topic | OCR; Text recognition; Ground truth; Historical newspapers |
url | http://www.sciencedirect.com/science/article/pii/S2352340924002439 |
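
The description reports counts of text regions, table regions, text lines, and words in PAGE XML files. As a minimal sketch of how such counts could be reproduced for a single file, the Python snippet below tallies the corresponding PAGE elements using only the standard library. The sample document, its namespace version, and the function names are illustrative assumptions; they are not taken from the dataset or from the authors' tooling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# PAGE element names matching the units counted in the description.
COUNTED = {"TextRegion", "TableRegion", "TextLine", "Word"}


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(xml_text: str) -> Counter:
    """Count regions, lines, and words in one PAGE XML document.

    The namespace is ignored on purpose, so the function does not depend
    on which PAGE schema version a file declares.
    """
    root = ET.fromstring(xml_text)
    names = (local_name(el.tag) for el in root.iter())
    return Counter(name for name in names if name in COUNTED)


# Hypothetical miniature page, written for this example only.
SAMPLE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Deutſcher</Unicode></TextEquiv></Word>
        <Word id="w2"><TextEquiv><Unicode>Reichsanzeiger</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Deutſcher Reichsanzeiger</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TableRegion id="t1"/>
  </Page>
</PcGts>
"""

if __name__ == "__main__":
    counts = count_page_elements(SAMPLE)
    for name in sorted(COUNTED):
        print(f"{name}: {counts[name]}")
```

Summing the returned counters over all files of one variant would give dataset-level totals. Whether those totals match the published figures depends on which variant is used, since the first variant converts table regions to text regions.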