TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information

With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles’ content with valuable information that can be useful in documen...

Bibliographic Details
Main Authors: Daniel Voskergian, Burcu Bakir-Gungor, Malik Yousef
Format: Article
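Language: English
Published: Frontiers Media S.A. 2023-10-01
Series: Frontiers in Genetics
Subjects: text classification; feature selection; topic selection; topic projection; topic modeling; short text
Online Access: https://www.frontiersin.org/articles/10.3389/fgene.2023.1243874/full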
author Daniel Voskergian
Burcu Bakir-Gungor
Malik Yousef
collection DOAJ
description With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles’ content with valuable information that can be useful in document classification and categorization. However, the shortness, data sparseness, limited word occurrences, and inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms to these short texts, making their classification a challenging task. This study first explores the performance of our earlier approach, TextNetTopics, on short text. Second, we propose an advanced version called TextNetTopics Pro, a novel short-text classification framework that combines lexical features organized in topics of words with the topic distributions extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate the proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is from the biomedical field, and the second consists of computer science publications. Additionally, we compare the predictive performance of models generated with and without the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.
first_indexed 2024-03-11T19:48:57Z
format Article
id doaj.art-eea73878057c4c0f898a28ac55cb1ee9
institution Directory Open Access Journal
issn 1664-8021
language English
last_indexed 2024-03-11T19:48:57Z
publishDate 2023-10-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Genetics
spelling doaj.art-eea73878057c4c0f898a28ac55cb1ee9
2023-10-05T14:02:48Z
eng
Frontiers Media S.A.
Frontiers in Genetics
1664-8021
2023-10-01
14
10.3389/fgene.2023.1243874
1243874
TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information
Daniel Voskergian (0)
Burcu Bakir-Gungor (1)
Malik Yousef (2)
Malik Yousef (3)
(0) Computer Engineering Department, Faculty of Engineering, Al-Quds University, Jerusalem, Palestine
(1) Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Türkiye
(2) Department of Information Systems, Zefat Academic College, Zefat, Israel
(3) Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel
https://www.frontiersin.org/articles/10.3389/fgene.2023.1243874/full
text classification
feature selection
topic selection
topic projection
topic modeling
short text
title TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information
topic text classification
feature selection
topic selection
topic projection
topic modeling
short text
url https://www.frontiersin.org/articles/10.3389/fgene.2023.1243874/full