Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science

Data heterogeneity is a pressing issue and is further compounded if we have to deal with data from textual documents. The unstructured nature of such documents implies that collating, comparing and analysing the information contained therein can be a challenging task. Automating these processes can...

Bibliographic Details
Main Authors:	Vatsala Nundloll, Robert Smail, Carly Stevens, Gordon Blair
Format:	Article
Language:	English
Published:	Elsevier 2022-10-01
Series:	Heliyon
Subjects:	Data extraction, Unstructured data, Semantic integration, Natural language processing, Machine learning, Ontologies
Online Access:	http://www.sciencedirect.com/science/article/pii/S2405844022019983

author	Vatsala Nundloll, Robert Smail, Carly Stevens, Gordon Blair
author_sort	Vatsala Nundloll
collection	DOAJ
description	Data heterogeneity is a pressing issue and is further compounded when the data come from textual documents. The unstructured nature of such documents makes collating, comparing and analysing the information they contain a challenging task. Automating these processes can help to unlock knowledge that otherwise remains buried in them. Moreover, integrating the information extracted from the documents with other related information supports more information-rich queries. In this context, the paper presents a comprehensive review of text extraction and data integration techniques that enable this automation in an ecological setting. The paper investigates the extraction of valuable floristic information from a historical botany journal, with the aim of bringing to light relevant pieces of information contained within the document. The paper also explores the need to integrate the extracted information with related information from disparate sources. All the information is then rendered into a queryable form so that unified queries can be made. To achieve this, the paper combines Machine Learning, Natural Language Processing and Semantic Web techniques. The proposed approach is demonstrated through the information extracted from the journal and the information-rich queries made possible by the integration process. The paper shows that the approach has merit in extracting relevant information from the journal, discusses how the machine learning models were designed to classify complex information, and gives a measure of their performance. The paper also shows that the approach has merit in terms of query time when querying floristic information from a multi-source linked data model.
first_indexed	2024-04-11T23:54:36Z
format	Article
id	doaj.art-eb03e28d3d2f4d65a598068412287365
institution	Directory of Open Access Journals
issn	2405-8440
language	English
last_indexed	2024-04-11T23:54:36Z
publishDate	2022-10-01
publisher	Elsevier
record_format	Article
series	Heliyon
spelling	doaj.art-eb03e28d3d2f4d65a598068412287365; 2022-12-22T03:56:24Z; eng; Elsevier; Heliyon; ISSN 2405-8440; 2022-10-01; vol. 8, no. 10, e10710; Vatsala Nundloll (School of Computing and Communications, Lancaster University, Lancaster, UK; corresponding author); Robert Smail (Lancaster Environment Centre, Lancaster University, UK; note: Robert Smail worked at this organisation); Carly Stevens (Lancaster Environment Centre, Lancaster University, UK); Gordon Blair (School of Computing and Communications, Lancaster University, Lancaster, UK)
title	Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science
topic	Data extraction, Unstructured data, Semantic integration, Natural language processing, Machine learning, Ontologies
url	http://www.sciencedirect.com/science/article/pii/S2405844022019983

Similar Items