Reichsanzeiger-GT: An OCR ground truth dataset based on the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette) (1819–1945)

Reichsanzeiger-GT is a ground truth dataset for OCR training and evaluation based on the historical German newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette), which was published from 1819 to 1945 and printed mostly in the type...

Bibliographic Details
Main Authors:	Thomas Schmidt, Jan Kamlah, Stefan Weil
Format:	Article
Language:	English
Published:	Elsevier 2024-06-01
Series:	Data in Brief
Subjects:	OCR; Text recognition; Ground truth; Historical newspapers
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340924002439

author	Thomas Schmidt; Jan Kamlah; Stefan Weil
collection	DOAJ
description	Reichsanzeiger-GT is a ground truth dataset for OCR training and evaluation based on the historical German newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette), which was published from 1819 to 1945 and printed mostly in the typeface Fraktur (Black Letter). The dataset consists of 101 newspaper pages from the years 1820–1939 that cover a wide variety of topics, page layouts (lists, tables, and advertisements), and typefaces. Using the transcription software Transkribus and the open-source OCR engine Tesseract, we automatically created and manually corrected layout segmentations and transcriptions for each page, resulting in 65,563 text regions, 412 table regions, 119,429 text lines, and 490,679 words. Because the transcription guidelines preserve the printing conditions, the dataset contains language- and printing-specific phenomena such as the historical glyphs long s (ſ) and rotunda r (ꝛ) and historical currency symbols (M, ₰), among others. The dataset is provided in two variants in PAGE XML format. The first contains ground truth data with table regions transformed to text regions for easier processing. The second preserves all table regions. Researchers can reuse this dataset to train new or fine-tune existing text recognition or layout segmentation models. The dataset can also be used to evaluate the accuracy of existing OCR models. Because it follows specific, community-driven transcription guidelines, the dataset is easily interoperable and reusable with other datasets based on the same transcription level.
first_indexed	2024-04-24T20:13:02Z
format	Article
id	doaj.art-5685cb3e97f842cfa744758d3b8b0db4
institution	Directory of Open Access Journals
issn	2352-3409
language	English
last_indexed	2025-03-21T17:43:29Z
publishDate	2024-06-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj.art-5685cb3e97f842cfa744758d3b8b0db4; 2024-06-12T04:45:53Z; eng; Elsevier; Data in Brief; 2352-3409; 2024-06-01; vol. 54; article 110274; Reichsanzeiger-GT: An OCR ground truth dataset based on the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette) (1819–1945); Thomas Schmidt (corresponding author), Jan Kamlah, Stefan Weil (all: University of Mannheim, University Library, Schloss Schneckenhof, 68161 Mannheim); http://www.sciencedirect.com/science/article/pii/S2352340924002439; OCR; Text recognition; Ground truth; Historical newspapers
title	Reichsanzeiger-GT: An OCR ground truth dataset based on the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette) (1819–1945)
topic	OCR; Text recognition; Ground truth; Historical newspapers
url	http://www.sciencedirect.com/science/article/pii/S2352340924002439
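
The description reports counts of text regions, table regions, text lines, and words in PAGE XML files. As a rough illustration of how such counts could be tallied, the following Python sketch counts those element types in one PAGE XML document. It is not the authors' tooling: the function names, the inline sample page, and the namespace URI in the sample are assumptions made for illustration, and the code matches elements by local name so that it does not depend on which PAGE schema version a file declares.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Element types named in the dataset description.
WANTED = {"TextRegion", "TableRegion", "TextLine", "Word"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(root):
    """Count region, line, and word elements below a PAGE XML root."""
    names = (local_name(el.tag) for el in root.iter())
    return Counter(name for name in names if name in WANTED)


# Hypothetical miniature page, invented for this example; real files
# from the dataset will be far larger and may use another schema version.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Deutscher</Unicode></TextEquiv></Word>
        <Word id="w2"><TextEquiv><Unicode>Reichsanzeiger</Unicode></TextEquiv></Word>
      </TextLine>
    </TextRegion>
    <TableRegion id="t1">
      <TextRegion id="r2">
        <TextLine id="l2"/>
      </TextRegion>
    </TableRegion>
  </Page>
</PcGts>"""

counts = count_page_elements(ET.fromstring(SAMPLE))
for name in sorted(WANTED):
    print(name, counts[name])
```

For a file on disk, `ET.parse(path).getroot()` can be passed to `count_page_elements` instead of the parsed sample string; summing the resulting counters over all pages would give dataset-level totals under the same assumptions.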

Similar Items