Cleaning OCR'd text with Regular Expressions
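Optical Character Recognition (OCR), the conversion of scanned images to machine-encoded text, has proven a godsend for historical research. This process allows texts to be searchable on one hand and more easily parsed and mined on the other. But we’ve all noticed that the OCR for historic texts is far from perfect. Old typefaces and formats pose unique challenges for OCR. How might we improve poor-quality OCR? The answer is Regular Expressions or “regex.”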
Main Author: | Laura Turner O'Hara |
---|---|
Format: | Article |
Language: | English |
Published: | Editorial Board of the Programming Historian, 2013-05-01 |
Series: | The Programming Historian |
Subjects: | Regular expressions; data manipulation |
Online Access: | http://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions |
_version_ | 1811262262624649216 |
---|---|
author | Laura Turner O'Hara |
collection | DOAJ |
description | Optical Character Recognition (OCR), the conversion of scanned images to machine-encoded text, has proven a godsend for historical research. This process allows texts to be searchable on one hand and more easily parsed and mined on the other. But we’ve all noticed that the OCR for historic texts is far from perfect. Old typefaces and formats pose unique challenges for OCR. How might we improve poor-quality OCR? The answer is Regular Expressions or “regex.” |
first_indexed | 2024-04-12T19:22:04Z |
format | Article |
id | doaj.art-d9d36db33cc047dba63b2a459e31281c |
institution | Directory Open Access Journal |
issn | 2397-2068 |
language | English |
last_indexed | 2024-04-12T19:22:04Z |
publishDate | 2013-05-01 |
publisher | Editorial Board of the Programming Historian |
record_format | Article |
series | The Programming Historian |
spelling | doaj.art-d9d36db33cc047dba63b2a459e31281c; 2022-12-22T03:19:35Z; eng; Laura Turner O'Hara (Office of the Historian at the U.S. House of Representatives) |
title | Cleaning OCR'd text with Regular Expressions |
topic | Regular expressions; data manipulation |
url | http://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions |