Grammar-Based Recognition of Documentary Forms and Extraction of Metadata
Metadata extraction is a critical aspect of ingestion of collections into digital archives and libraries. A method for automatically recognizing document types and extracting metadata from digital records has been developed. The method is based on a method for automatically annotating semantic categ...
Main Author: | William Underwood |
---|---|
Format: | Article |
Language: | English |
Published: | University of Edinburgh, 2010-06-01 |
Series: | International Journal of Digital Curation |
Online Access: | https://ijdc.net/index.php/ijdc/article/view/149 |
_version_ | 1797323867137507328 |
---|---|
author | William Underwood |
collection | DOAJ |
description | Metadata extraction is a critical aspect of ingestion of collections into digital archives and libraries. A method for automatically recognizing document types and extracting metadata from digital records has been developed. The method is based on a method for automatically annotating semantic categories such as person’s names, job titles, dates, and postal addresses that may occur in a record. It extends this method by using the semantic annotations to identify the intellectual elements of a document’s form, parsing these elements using context-free grammars that define documentary forms, and interpreting the elements of the form of the document to identify metadata such as the chronological date, author(s), addressee(s), and topic. Context-free grammars were developed for fourteen of the documentary forms occurring in Presidential records. In an experiment, the document type recognizer successfully recognized the documentary form and extracted the metadata of two-thirds of the records in a series of Presidential e-records containing twenty-one document types. |
first_indexed | 2024-03-08T05:35:20Z |
format | Article |
id | doaj.art-c9726e657f35460891f6602d4457c5b9 |
institution | Directory Open Access Journal |
issn | 1746-8256 |
language | English |
last_indexed | 2024-03-08T05:35:20Z |
publishDate | 2010-06-01 |
publisher | University of Edinburgh |
record_format | Article |
series | International Journal of Digital Curation |
title | Grammar-Based Recognition of Documentary Forms and Extraction of Metadata |
url | https://ijdc.net/index.php/ijdc/article/view/149 |
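
The pipeline outlined in the description (annotate the text, parse the resulting elements with a context-free grammar that defines a documentary form, then interpret the parse to obtain metadata) can be illustrated with a toy sketch. This is not the author's implementation: the memorandum grammar, the label-based `ELEMENT_PATTERNS`, the function names, and the sample record are all assumptions invented for illustration, and the line-label matching stands in for the richer semantic annotation of names, job titles, dates, and addresses that the article describes.

```python
import re

# Hypothetical annotation rules: each maps a line pattern to a form-element tag.
ELEMENT_PATTERNS = [
    ("HEADING", re.compile(r"^MEMORANDUM\b", re.I)),
    ("TO", re.compile(r"^TO:\s*(?P<value>.+)$", re.I)),
    ("FROM", re.compile(r"^FROM:\s*(?P<value>.+)$", re.I)),
    ("DATE", re.compile(r"^DATE:\s*(?P<value>.+)$", re.I)),
    ("SUBJECT", re.compile(r"^SUBJECT:\s*(?P<value>.+)$", re.I)),
]

# Hypothetical context-free grammar for a memorandum form.
# Keys are nonterminals; any symbol that is not a key is a terminal tag.
GRAMMAR = {
    "Memo": [["HEADING", "Header", "Text"]],
    "Header": [["Field", "Header"], ["Field"]],
    "Field": [["TO"], ["FROM"], ["DATE"], ["SUBJECT"]],
    "Text": [["BODY", "Text"], ["BODY"]],
}

# How terminal tags are interpreted as metadata fields (an assumption).
FIELD_NAMES = {"TO": "addressee", "FROM": "author", "DATE": "date", "SUBJECT": "topic"}


def annotate(text):
    """Tag each non-blank line with a form element; unmatched lines are BODY."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        for tag, pattern in ELEMENT_PATTERNS:
            match = pattern.match(line)
            if match:
                tokens.append((tag, match.groupdict().get("value", line)))
                break
        else:
            tokens.append(("BODY", line))
    return tokens


def parse(symbol, tokens, pos):
    """Yield each token position reachable after deriving `symbol` from `pos`."""
    if symbol not in GRAMMAR:  # terminal: must match the tag of the next token
        if pos < len(tokens) and tokens[pos][0] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        positions = [pos]
        for part in production:
            positions = [end for begin in positions for end in parse(part, tokens, begin)]
        yield from positions


def recognize(tokens, start="Memo"):
    """Return True if the whole token sequence derives from the start symbol."""
    return any(end == len(tokens) for end in parse(start, tokens, 0))


def extract_metadata(text):
    """Return metadata for a recognized memorandum, or None if the form does not match."""
    tokens = annotate(text)
    if not recognize(tokens):
        return None
    metadata = {"form": "memorandum"}
    for tag, value in tokens:
        if tag in FIELD_NAMES:
            metadata.setdefault(FIELD_NAMES[tag], value)
    return metadata


# Invented sample record, not taken from any real collection.
SAMPLE = """MEMORANDUM
TO: Jane Doe
FROM: John Roe
DATE: June 1, 2010
SUBJECT: Records transfer

Please find the schedule attached.
"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

The grammar is right-recursive so that the simple backtracking recognizer terminates; a grammar with left recursion, or a larger set of documentary forms, would call for a chart parser instead.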