Finite State Automata on Multi-Word Units for Efficient Text-Mining

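Text mining is crucial for analyzing unstructured and semi-structured textual documents. This paper introduces a fast and precise text mining method based on a finite automaton to extract knowledge domains. Unlike simple words, multi-word units (such as credit card) are emphasized for their efficiency in identifying specific semantic areas due to their predominantly monosemic nature, their limited number and their distinctiveness. The method focuses on identifying multi-word units within terminological ontologies, where each multi-word unit is associated with a sub-domain of ontology knowledge. The algorithm, designed to handle the challenges posed by very long multi-word units composed of a variable number of simple words, integrates user-selected ontologies into a single finite automaton during a fast pre-processing step. At runtime, the automaton reads input text character by character, efficiently locating multi-word units even if they overlap. This approach is efficient for both short and long documents, requiring no prior training. Ontologies can be updated without additional computational costs. An early system prototype, tested on 100 short and medium-length documents, recognized the knowledge domains for the vast majority of texts (over 90%) analyzed. The authors suggest that this method could be a valuable semantic-based knowledge domain extraction technique in unstructured documents.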
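
As an illustration of the kind of approach the abstract summarizes (one finite automaton built from multi-word units, then a single character-by-character pass over the text), here is a minimal Python sketch. It uses the classic Aho-Corasick construction, which is an assumption made for illustration and not necessarily the paper's own algorithm. The function names `build_automaton` and `find_units`, the sample ontology, and its domain labels are all hypothetical. The sketch also ignores word boundaries and assumes lowercasing does not change the text length.

```python
from collections import Counter, deque

def build_automaton(ontology):
    """Pre-processing: merge {multi-word unit: sub-domain} into one automaton."""
    goto = [{}]   # goto[state][char] -> next state (a trie over the units)
    fail = [0]    # failure link of each state
    out = [[]]    # (unit, domain) pairs recognized on reaching each state
    for unit, domain in ontology.items():
        state = 0
        for ch in unit.lower():
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((unit, domain))
    # Breadth-first pass: compute failure links and inherit suffix matches.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out

def find_units(text, automaton):
    """Runtime: read the text one character at a time, reporting every unit."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text.lower()):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for unit, domain in out[state]:
            hits.append((i - len(unit) + 1, unit, domain))
    return hits

# Hypothetical ontology: each multi-word unit maps to a knowledge sub-domain.
ontology = {"credit card": "finance", "card reader": "hardware",
            "interest rate": "finance"}
automaton = build_automaton(ontology)
hits = find_units("The credit card reader failed.", automaton)
domains = Counter(domain for _, _, domain in hits)
```

In this example the two overlapping units "credit card" and "card reader" are both reported in one pass, because the failure links let the automaton continue from the shared word "card" instead of restarting. Counting the domains of the hits gives a rough indication of which knowledge domains a text touches.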


Bibliographic Details
Main Author:	Alberto Postiglione
Format:	Article
Language:	English
Published:	MDPI AG 2024-02-01
Series:	Mathematics
Subjects:	text mining; knowledge extraction; finite automata; ontology; multi-word units; natural language processing
Online Access:	https://www.mdpi.com/2227-7390/12/4/506

_version_	1797297612887425024
author	Alberto Postiglione
author_facet	Alberto Postiglione
author_sort	Alberto Postiglione
collection	DOAJ
description	Text mining is crucial for analyzing unstructured and semi-structured textual documents. This paper introduces a fast and precise text mining method based on a finite automaton to extract knowledge domains. Unlike simple words, multi-word units (such as credit card) are emphasized for their efficiency in identifying specific semantic areas due to their predominantly monosemic nature, their limited number and their distinctiveness. The method focuses on identifying multi-word units within terminological ontologies, where each multi-word unit is associated with a sub-domain of ontology knowledge. The algorithm, designed to handle the challenges posed by very long multi-word units composed of a variable number of simple words, integrates user-selected ontologies into a single finite automaton during a fast pre-processing step. At runtime, the automaton reads input text character by character, efficiently locating multi-word units even if they overlap. This approach is efficient for both short and long documents, requiring no prior training. Ontologies can be updated without additional computational costs. An early system prototype, tested on 100 short and medium-length documents, recognized the knowledge domains for the vast majority of texts (over 90%) analyzed. The authors suggest that this method could be a valuable semantic-based knowledge domain extraction technique in unstructured documents.
first_indexed	2024-03-07T22:23:18Z
format	Article
id	doaj.art-eff8a023b2b34c8d8d1fa1e25aa84045
institution	Directory of Open Access Journals
issn	2227-7390
language	English
last_indexed	2024-03-07T22:23:18Z
publishDate	2024-02-01
publisher	MDPI AG
record_format	Article
series	Mathematics
title	Finite State Automata on Multi-Word Units for Efficient Text-Mining
title_full	Finite State Automata on Multi-Word Units for Efficient Text-Mining
title_fullStr	Finite State Automata on Multi-Word Units for Efficient Text-Mining
title_full_unstemmed	Finite State Automata on Multi-Word Units for Efficient Text-Mining
title_short	Finite State Automata on Multi-Word Units for Efficient Text-Mining
title_sort	finite state automata on multi word units for efficient text mining
topic	text mining; knowledge extraction; finite automata; ontology; multi-word units; natural language processing
url	https://www.mdpi.com/2227-7390/12/4/506
work_keys_str_mv	AT albertopostiglione finitestateautomataonmultiwordunitsforefficienttextmining

