Generating an Ordered Data Set from an OCR Text File

This tutorial illustrates strategies for taking raw OCR output from a scanned text, parsing it to isolate and correct essential elements of metadata, and generating an ordered data set (a Python dictionary) from it. These illustrations are specific to a particular text, but the overall strategy, and some of the individual procedures, can be adapted to organize any scanned text, even if it doesn’t look like this one.

Bibliographic Details
Main Author: Jon Crump
Format: Article
Language: English
Published: Editorial Board of the Programming Historian 2014-11-01
Series: The Programming Historian
Subjects: data manipulation, OCR, Python, dataset
Online Access: http://programminghistorian.org/lessons/generating-an-ordered-data-set-from-an-OCR-text-file
author Jon Crump
collection DOAJ
description This tutorial illustrates strategies for taking raw OCR output from a scanned text, parsing it to isolate and correct essential elements of metadata, and generating an ordered data set (a Python dictionary) from it. These illustrations are specific to a particular text, but the overall strategy, and some of the individual procedures, can be adapted to organize any scanned text, even if it doesn’t look like this one.
first_indexed 2024-04-12T13:56:11Z
format Article
id doaj.art-96b59da1c0c64634b587542ac7d48265
institution Directory Open Access Journal
issn 2397-2068
language English
last_indexed 2024-04-12T13:56:11Z
publishDate 2014-11-01
publisher Editorial Board of the Programming Historian
record_format Article
series The Programming Historian
spelling doaj.art-96b59da1c0c64634b587542ac7d48265 | 2022-12-22T03:30:23Z | eng | Editorial Board of the Programming Historian | The Programming Historian | 2397-2068 | 2014-11-01 | Generating an Ordered Data Set from an OCR Text File | Jon Crump (Freelance digital humanist) | This tutorial illustrates strategies for taking raw OCR output from a scanned text, parsing it to isolate and correct essential elements of metadata, and generating an ordered data set (a Python dictionary) from it. These illustrations are specific to a particular text, but the overall strategy, and some of the individual procedures, can be adapted to organize any scanned text, even if it doesn’t look like this one. | http://programminghistorian.org/lessons/generating-an-ordered-data-set-from-an-OCR-text-file | data manipulation; OCR; Python; dataset
title Generating an Ordered Data Set from an OCR Text File
topic data manipulation
OCR
Python
dataset
url http://programminghistorian.org/lessons/generating-an-ordered-data-set-from-an-OCR-text-file