NER in Archival Finding Aids: Extended

The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country’s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format...


Bibliographic Details
Main Authors: Luís Filipe da Costa Cunha, José Carlos Ramalho
Format: Article
Language: English
Published: MDPI AG 2022-01-01
Series: Machine Learning and Knowledge Extraction
Subjects: named entity recognition, archival search aids, machine learning, deep learning, maximum entropy
Online Access: https://www.mdpi.com/2504-4990/4/1/3
_version_ 1797445817918816256
author Luís Filipe da Costa Cunha
José Carlos Ramalho
author_facet Luís Filipe da Costa Cunha
José Carlos Ramalho
author_sort Luís Filipe da Costa Cunha
collection DOAJ
description The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country’s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format; however, these data are not annotated, so their content is not always easy to analyze. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities in the archival finding aids. These named entities carry crucial information about their context and, when recognized with high confidence, can be used for several purposes, for example, the creation of smart browsing tools based on entity linking and record linking techniques. To achieve high scores, we annotated several corpora to train our own Machine Learning algorithms in this domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public through a web platform, NER@DI.
first_indexed 2024-03-09T13:32:15Z
format Article
id doaj.art-9383b91b50d44052a98644ca842f77e7
institution Directory Open Access Journal
issn 2504-4990
language English
last_indexed 2024-03-09T13:32:15Z
publishDate 2022-01-01
publisher MDPI AG
record_format Article
series Machine Learning and Knowledge Extraction
spelling doaj.art-9383b91b50d44052a98644ca842f77e7 | 2023-11-30T21:16:49Z | eng | MDPI AG | Machine Learning and Knowledge Extraction | 2504-4990 | 2022-01-01 | 4 | 1 | 42 | 65 | 10.3390/make4010003 | NER in Archival Finding Aids: Extended | Luís Filipe da Costa Cunha (0) | José Carlos Ramalho (1) | (0) Department of Informatics, University of Minho, 4710-057 Braga, Portugal | (1) Department of Informatics, University of Minho, 4710-057 Braga, Portugal | The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country’s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format; however, these data are not annotated, so their content is not always easy to analyze. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities in the archival finding aids. These named entities carry crucial information about their context and, when recognized with high confidence, can be used for several purposes, for example, the creation of smart browsing tools based on entity linking and record linking techniques. To achieve high scores, we annotated several corpora to train our own Machine Learning algorithms in this domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public through a web platform, NER@DI. | https://www.mdpi.com/2504-4990/4/1/3 | named entity recognition | archival search aids | machine learning | deep learning | maximum entropy
spellingShingle Luís Filipe da Costa Cunha
José Carlos Ramalho
NER in Archival Finding Aids: Extended
Machine Learning and Knowledge Extraction
named entity recognition
archival search aids
machine learning
deep learning
maximum entropy
title NER in Archival Finding Aids: Extended
title_full NER in Archival Finding Aids: Extended
title_fullStr NER in Archival Finding Aids: Extended
title_full_unstemmed NER in Archival Finding Aids: Extended
title_short NER in Archival Finding Aids: Extended
title_sort ner in archival finding aids extended
topic named entity recognition
archival search aids
machine learning
deep learning
maximum entropy
url https://www.mdpi.com/2504-4990/4/1/3
work_keys_str_mv AT luisfilipedacostacunha nerinarchivalfindingaidsextended
AT josecarlosramalho nerinarchivalfindingaidsextended