Cleaning OCR'd text with Regular Expressions

Optical Character Recognition (OCR), the conversion of scanned images to machine-encoded text, has proven a godsend for historical research. This process allows texts to be searchable on one hand and more easily parsed and mined on the other. But we’ve all noticed that the OCR for historic texts is far from perfect. Old typefaces and formats make for unique OCR. How might we improve poor quality OCR? The answer is Regular Expressions or “regex.”


Bibliographic Details
Main Author: Laura Turner O'Hara
Format: Article
Language: English
Published: Editorial Board of the Programming Historian 2013-05-01
Series: The Programming Historian
Subjects: Regular expressions; data manipulation
Online Access: http://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions
_version_ 1811262262624649216
author Laura Turner O'Hara
author_facet Laura Turner O'Hara
author_sort Laura Turner O'Hara
collection DOAJ
description Optical Character Recognition (OCR), the conversion of scanned images to machine-encoded text, has proven a godsend for historical research. This process allows texts to be searchable on one hand and more easily parsed and mined on the other. But we’ve all noticed that the OCR for historic texts is far from perfect. Old typefaces and formats make for unique OCR. How might we improve poor quality OCR? The answer is Regular Expressions or “regex.”
first_indexed 2024-04-12T19:22:04Z
format Article
id doaj.art-d9d36db33cc047dba63b2a459e31281c
institution Directory Open Access Journal
issn 2397-2068
language English
last_indexed 2024-04-12T19:22:04Z
publishDate 2013-05-01
publisher Editorial Board of the Programming Historian
record_format Article
series The Programming Historian
spelling doaj.art-d9d36db33cc047dba63b2a459e31281c | 2022-12-22T03:19:35Z | eng | Editorial Board of the Programming Historian | The Programming Historian | 2397-2068 | 2013-05-01 | Cleaning OCR'd text with Regular Expressions | Laura Turner O'Hara (Office of the Historian at the U.S. House of Representatives) | Optical Character Recognition (OCR), the conversion of scanned images to machine-encoded text, has proven a godsend for historical research. This process allows texts to be searchable on one hand and more easily parsed and mined on the other. But we’ve all noticed that the OCR for historic texts is far from perfect. Old typefaces and formats make for unique OCR. How might we improve poor quality OCR? The answer is Regular Expressions or “regex.” | http://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions | Regular expressions | data manipulation
spellingShingle Laura Turner O'Hara
Cleaning OCR'd text with Regular Expressions
The Programming Historian
Regular expressions
data manipulation
title Cleaning OCR'd text with Regular Expressions
title_full Cleaning OCR'd text with Regular Expressions
title_fullStr Cleaning OCR'd text with Regular Expressions
title_full_unstemmed Cleaning OCR'd text with Regular Expressions
title_short Cleaning OCR'd text with Regular Expressions
title_sort cleaning ocr d text with regular expressions
topic Regular expressions
data manipulation
url http://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions
work_keys_str_mv AT lauraturnerohara cleaningocrdtextwithregularexpressions