Publishing an OCR ground truth data set for reuse in an unclear copyright setting. Two case studies with legal and technical solutions to enable a collective OCR ground truth data set effort

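We present an OCR ground truth data set for historical prints and show improvement of recognition results over baselines with training on this data. We reflect on reusability of the ground truth data set based on two experiments that look into the legal basis for reuse of digitized document images in the case of 19th century English and German books. We propose a framework for publishing ground truth data even when digitized document images cannot be easily redistributed.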

Bibliographic Details
Main Authors: David Lassner, Julius Coburger, Clemens Neudecker, Anne Baillot
Format: Article
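Language: deu
Published: Forschungsverbund Marbach Weimar Wolfenbüttel 2021-09-01
Series: Zeitschrift für digitale Geisteswissenschaften
Subjects: computer science (Informatik), machine learning (maschinelles Lernen), optical character recognition (optische Zeichenerkennung), copyright (Urheberrecht)
Online Access: https://www.zfdg.de/node/340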
_version_ 1797966257055596544
author David Lassner
Julius Coburger
Clemens Neudecker
Anne Baillot
collection DOAJ
description We present an OCR ground truth data set for historical prints and show improvement of recognition results over baselines with training on this data. We reflect on reusability of the ground truth data set based on two experiments that look into the legal basis for reuse of digitized document images in the case of 19th century English and German books. We propose a framework for publishing ground truth data even when digitized document images cannot be easily redistributed.
first_indexed 2024-04-11T02:11:45Z
format Article
id doaj.art-6d4f24b55965459fbcebf31cc807db14
institution Directory Open Access Journal
issn 2510-1358
language deu
last_indexed 2024-04-11T02:11:45Z
publishDate 2021-09-01
publisher Forschungsverbund Marbach Weimar Wolfenbüttel
record_format Article
series Zeitschrift für digitale Geisteswissenschaften
spelling doaj.art-6d4f24b55965459fbcebf31cc807db14 | 2023-01-03T01:53:59Z | deu | Forschungsverbund Marbach Weimar Wolfenbüttel | Zeitschrift für digitale Geisteswissenschaften | 2510-1358 | 2021-09-015610.17175/sb005_0061780168195 | Publishing an OCR ground truth data set for reuse in an unclear copyright setting. Two case studies with legal and technical solutions to enable a collective OCR ground truth data set effort | David Lassner https://orcid.org/0000-0001-9013-0834 | Julius Coburger https://orcid.org/0000-0003-4502-7955 | Clemens Neudecker https://orcid.org/0000-0001-5293-8322 | Anne Baillot https://orcid.org/0000-0002-4593-059X | We present an OCR ground truth data set for historical prints and show improvement of recognition results over baselines with training on this data. We reflect on reusability of the ground truth data set based on two experiments that look into the legal basis for reuse of digitized document images in the case of 19th century English and German books. We propose a framework for publishing ground truth data even when digitized document images cannot be easily redistributed. | https://www.zfdg.de/node/340 | informatik | maschinelles lernen | optische zeichenerkennung | urheberrecht
title Publishing an OCR ground truth data set for reuse in an unclear copyright setting. Two case studies with legal and technical solutions to enable a collective OCR ground truth data set effort
topic informatik
maschinelles lernen
optische zeichenerkennung
urheberrecht
url https://www.zfdg.de/node/340