Grammar-Based Recognition of Documentary Forms and Extraction of Metadata

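Metadata extraction is a critical aspect of ingestion of collections into digital archives and libraries. A method for automatically recognizing document types and extracting metadata from digital records has been developed. The method is based on a method for automatically annotating semantic categories such as person’s names, job titles, dates, and postal addresses that may occur in a record. It extends this method by using the semantic annotations to identify the intellectual elements of a document’s form, parsing these elements using context-free grammars that define documentary forms, and interpreting the elements of the form of the document to identify metadata such as the chronological date, author(s), addressee(s), and topic. Context-free grammars were developed for fourteen of the documentary forms occurring in Presidential records. In an experiment, the document type recognizer successfully recognized the documentary form and extracted the metadata of two-thirds of the records in a series of Presidential e-records containing twenty-one document types.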
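
The article's description outlines a pipeline: annotate semantic categories, parse the annotated elements against a context-free grammar for a documentary form, then read metadata off the recognized form. The following is a minimal sketch of that idea, not the author's implementation. The regular expressions, the category names (DATE, TO, FROM, SUBJECT, TEXT), the toy "Memo" grammar, and the sample record are all assumptions made up for illustration.

```python
import re
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical line-level annotator: each pattern assigns one semantic category.
PATTERNS = [
    ("DATE", re.compile(r"^([A-Z][a-z]+ \d{1,2}, \d{4})$")),
    ("TO", re.compile(r"^(?:MEMORANDUM FOR|TO):\s*(.+)$", re.I)),
    ("FROM", re.compile(r"^FROM:\s*(.+)$", re.I)),
    ("SUBJECT", re.compile(r"^SUBJECT:\s*(.+)$", re.I)),
]

# Toy context-free grammar for one documentary form (an assumption, not from the article).
# Keys are nonterminals; any symbol that is not a key is a terminal category.
GRAMMAR: Dict[str, List[List[str]]] = {
    "Memo": [["DATE", "Heading", "Body"]],
    "Heading": [["TO", "FROM", "SUBJECT"], ["FROM", "TO", "SUBJECT"]],
    "Body": [["TEXT", "Body"], ["TEXT"]],
}

# Which annotated category supplies which metadata field.
FIELDS = {"DATE": "date", "FROM": "author", "TO": "addressee", "SUBJECT": "topic"}


def annotate(text: str) -> List[Tuple[str, str]]:
    """Label every non-empty line with a (category, value) pair."""
    tokens = []
    for line in (raw.strip() for raw in text.splitlines()):
        if not line:
            continue
        for category, pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                tokens.append((category, match.group(1)))
                break
        else:
            tokens.append(("TEXT", line))
    return tokens


def derive(symbol: str, cats: List[str], pos: int) -> Iterator[int]:
    """Yield each position reachable after deriving `symbol` from cats[pos:]."""
    if symbol not in GRAMMAR:  # terminal: must match the category at pos
        if pos < len(cats) and cats[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        ends = [pos]
        for part in production:
            ends = [nxt for end in ends for nxt in derive(part, cats, end)]
        yield from ends


def recognize(text: str, form: str = "Memo") -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (form, metadata) if the whole record parses as `form`, else None."""
    tokens = annotate(text)
    cats = [category for category, _ in tokens]
    if len(cats) not in set(derive(form, cats, 0)):
        return None
    return form, {FIELDS[c]: value for c, value in tokens if c in FIELDS}


# Made-up sample record, used only to exercise the sketch.
SAMPLE = """March 3, 1990
MEMORANDUM FOR: John Doe
FROM: Jane Roe
SUBJECT: Budget review
Please see the attached figures.
"""

print(recognize(SAMPLE))
```

The sketch separates recognition (does the sequence of categories derive from the form's start symbol?) from interpretation (which annotated values become metadata). Supporting another documentary form would mean adding its productions to `GRAMMAR` and passing its start symbol to `recognize`.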

Bibliographic Details
Main Author: William Underwood
Format: Article
Language: English
Published: University of Edinburgh 2010-06-01
Series: International Journal of Digital Curation
Online Access: https://ijdc.net/index.php/ijdc/article/view/149
author William Underwood
collection DOAJ
description Metadata extraction is a critical aspect of ingestion of collections into digital archives and libraries. A method for automatically recognizing document types and extracting metadata from digital records has been developed. The method is based on a method for automatically annotating semantic categories such as person’s names, job titles, dates, and postal addresses that may occur in a record. It extends this method by using the semantic annotations to identify the intellectual elements of a document’s form, parsing these elements using context-free grammars that define documentary forms, and interpreting the elements of the form of the document to identify metadata such as the chronological date, author(s), addressee(s), and topic. Context-free grammars were developed for fourteen of the documentary forms occurring in Presidential records. In an experiment, the document type recognizer successfully recognized the documentary form and extracted the metadata of two-thirds of the records in a series of Presidential e-records containing twenty-one document types.
first_indexed 2024-03-08T05:35:20Z
format Article
id doaj.art-c9726e657f35460891f6602d4457c5b9
institution Directory Open Access Journal
issn 1746-8256
language English
last_indexed 2024-03-08T05:35:20Z
publishDate 2010-06-01
publisher University of Edinburgh
record_format Article
series International Journal of Digital Curation
title Grammar-Based Recognition of Documentary Forms and Extraction of Metadata
url https://ijdc.net/index.php/ijdc/article/view/149