Cleaning OCR'd text with Regular Expressions

Bibliographic Details
Main Author: Laura Turner O'Hara
Format: Article
Language: English
Published: Editorial Board of the Programming Historian 2013-05-01
Series: The Programming Historian
Subjects:
Online Access: http://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions
Description
Summary: Optical Character Recognition (OCR), the conversion of scanned images to machine-encoded text, has proven a godsend for historical research. This process makes texts searchable on the one hand and easier to parse and mine on the other. But we’ve all noticed that the OCR for historic texts is far from perfect. Old typefaces and formats make for unique OCR errors. How might we improve poor-quality OCR? The answer is Regular Expressions, or “regex.”
ISSN: 2397-2068
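
To illustrate the kind of cleanup the summary describes, here is a minimal sketch in Python using the standard `re` module. It is not taken from the lesson itself: the function name `clean_ocr_line`, the sample string, and the three substitution rules are assumptions chosen for illustration, and real OCR output would likely need patterns tailored to the source.

```python
import re


def clean_ocr_line(line: str) -> str:
    """Apply a few illustrative regex fixes to one line of OCR output.

    Hypothetical helper: the rules below are examples, not the lesson's own.
    """
    # Rejoin a word split by a hyphen followed by whitespace ("histor- ical").
    # Assumption: such hyphens are line-break residue, not real hyphenation.
    line = re.sub(r"(\w)-\s+(\w)", r"\1\2", line)
    # Collapse runs of two or more spaces or tabs into a single space.
    line = re.sub(r"[ \t]{2,}", " ", line)
    # Drop stray whitespace that OCR inserted before punctuation.
    line = re.sub(r"\s+([,.;:])", r"\1", line)
    return line.strip()


sample = "The  histor- ical  record , as scanned ."
print(clean_ocr_line(sample))
```

Each rule is deliberately narrow. The hyphen rule in particular would also merge a genuinely hyphenated word that happens to be followed by a space, so it is worth checking a sample of matches before applying it to a whole text.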