TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information
With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles’ content with valuable information that can be useful in documen...
Main Authors: | Daniel Voskergian, Burcu Bakir-Gungor, Malik Yousef |
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A. 2023-10-01 |
Series: | Frontiers in Genetics |
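Subjects: | text classification; feature selection; topic selection; topic projection; topic modeling; short text |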
Online Access: | https://www.frontiersin.org/articles/10.3389/fgene.2023.1243874/full |
_version_ | 1827799254128656384 |
---|---|
author | Daniel Voskergian; Burcu Bakir-Gungor; Malik Yousef |
author_facet | Daniel Voskergian; Burcu Bakir-Gungor; Malik Yousef |
author_sort | Daniel Voskergian |
collection | DOAJ |
description | With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles’ content with valuable information that can be useful in document classification and categorization. However, shortness, data sparseness, limited word occurrences, and the inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms on these short texts, making their classification a challenging task. This study firstly explores the performance of our earlier study, TextNetTopics on the short text. Secondly, here we propose an advanced version called TextNetTopics Pro, which is a novel short-text classification framework that utilizes a promising combination of lexical features organized in topics of words and topic distribution extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate our proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the Biomedical field, and the other one is related to Computer Science publications. Additionally, we comparatively evaluate the predictive performance of the models generated with and without using the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling the imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers. |
first_indexed | 2024-03-11T19:48:57Z |
format | Article |
id | doaj.art-eea73878057c4c0f898a28ac55cb1ee9 |
institution | Directory Open Access Journal |
issn | 1664-8021 |
language | English |
last_indexed | 2024-03-11T19:48:57Z |
publishDate | 2023-10-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Genetics |
spelling | doaj.art-eea73878057c4c0f898a28ac55cb1ee9; 2023-10-05T14:02:48Z; eng; Frontiers Media S.A.; Frontiers in Genetics; 1664-8021; 2023-10-01; 14; 10.3389/fgene.2023.1243874; 1243874; TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information; Daniel Voskergian [0]; Burcu Bakir-Gungor [1]; Malik Yousef [2]; Malik Yousef [3]; [0] Computer Engineering Department, Faculty of Engineering, Al-Quds University, Jerusalem, Palestine; [1] Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Türkiye; [2] Department of Information Systems, Zefat Academic College, Zefat, Israel; [3] Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel; With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles’ content with valuable information that can be useful in document classification and categorization. However, shortness, data sparseness, limited word occurrences, and the inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms on these short texts, making their classification a challenging task. This study firstly explores the performance of our earlier study, TextNetTopics on the short text. Secondly, here we propose an advanced version called TextNetTopics Pro, which is a novel short-text classification framework that utilizes a promising combination of lexical features organized in topics of words and topic distribution extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate our proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the Biomedical field, and the other one is related to Computer Science publications. Additionally, we comparatively evaluate the predictive performance of the models generated with and without using the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling the imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.; https://www.frontiersin.org/articles/10.3389/fgene.2023.1243874/full; text classification; feature selection; topic selection; topic projection; topic modeling; short text |
spellingShingle | Daniel Voskergian; Burcu Bakir-Gungor; Malik Yousef; Malik Yousef; TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information; Frontiers in Genetics; text classification; feature selection; topic selection; topic projection; topic modeling; short text |
title | TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information |
title_sort | textnettopics pro a topic model based text classification for short text by integration of semantic and document topic distribution information |
topic | text classification; feature selection; topic selection; topic projection; topic modeling; short text |
url | https://www.frontiersin.org/articles/10.3389/fgene.2023.1243874/full |