Publishing an OCR ground truth data set for reuse in an unclear copyright setting. Two case studies with legal and technical solutions to enable a collective OCR ground truth data set effort

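We present an OCR ground truth data set for historical prints and show improvement of recognition results over baselines with training on this data. We reflect on reusability of the ground truth data set based on two experiments that look into the legal basis for reuse of digitized document images in the case of 19th century English and German books. We propose a framework for publishing ground truth data even when digitized document images cannot be easily redistributed.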

Bibliographic Details
Main Authors:	David Lassner, Julius Coburger, Clemens Neudecker, Anne Baillot
Format:	Article
Language:	deu
Published:	Forschungsverbund Marbach Weimar Wolfenbüttel 2021-09-01
Series:	Zeitschrift für digitale Geisteswissenschaften
Subjects:	computer science, machine learning, optical character recognition, copyright
Online Access:	https://www.zfdg.de/node/340

_version_	1797966257055596544
author	David Lassner Julius Coburger Clemens Neudecker Anne Baillot
author_sort	David Lassner
collection	DOAJ
description	We present an OCR ground truth data set for historical prints and show improvement of recognition results over baselines with training on this data. We reflect on reusability of the ground truth data set based on two experiments that look into the legal basis for reuse of digitized document images in the case of 19th century English and German books. We propose a framework for publishing ground truth data even when digitized document images cannot be easily redistributed.
first_indexed	2024-04-11T02:11:45Z
format	Article
id	doaj.art-6d4f24b55965459fbcebf31cc807db14
institution	Directory Open Access Journal
issn	2510-1358
language	deu
last_indexed	2024-04-11T02:11:45Z
publishDate	2021-09-01
publisher	Forschungsverbund Marbach Weimar Wolfenbüttel
record_format	Article
series	Zeitschrift für digitale Geisteswissenschaften
spelling	doaj.art-6d4f24b55965459fbcebf31cc807db142023-01-03T01:53:59ZdeuForschungsverbund Marbach Weimar WolfenbüttelZeitschrift für digitale Geisteswissenschaften2510-13582021-09-015610.17175/sb005_0061780168195Publishing an OCR ground truth data set for reuse in an unclear copyright setting. Two case studies with legal and technical solutions to enable a collective OCR ground truth data set effortDavid Lassnerhttps://orcid.org/0000-0001-9013-0834Julius Coburgerhttps://orcid.org/0000-0003-4502-7955Clemens Neudeckerhttps://orcid.org/0000-0001-5293-8322Anne Baillothttps://orcid.org/0000-0002-4593-059XWe present an OCR ground truth data set for historical prints and show improvement of recognition results over baselines with training on this data. We reflect on reusability of the ground truth data set based on two experiments that look into the legal basis for reuse of digitized document images in the case of 19th century English and German books. We propose a framework for publishing ground truth data even when digitized document images cannot be easily redistributed.https://www.zfdg.de/node/340informatik maschinelles lernen optische zeichenerkennung urheberrecht
title	Publishing an OCR ground truth data set for reuse in an unclear copyright setting. Two case studies with legal and technical solutions to enable a collective OCR ground truth data set effort
topic	informatik maschinelles lernen optische zeichenerkennung urheberrecht
url	https://www.zfdg.de/node/340

Similar Items