MCNN-LSTM: Combining CNN and LSTM to Classify Multi-Class Text in Imbalanced News Data
Searching, retrieving, and arranging text in ever-larger document collections necessitate more efficient information processing algorithms. Document categorization is a crucial component of various information processing systems for supervised learning. As the quantity of documents grows, the perfor...
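Main Authors: | Khan Md Hasib, Sami Azam, Asif Karim, Ahmed Al Marouf, F M Javed Mehedi Shamrat, Sidratul Montaha, Kheng Cher Yeo, Mirjam Jonkman, Reda Alhajj, Jon G. Rokne |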
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2023-01-01 |
Series: | IEEE Access |
Subjects: | Big data; text classification; imbalanced data; machine learning; MCNN-LSTM |
Online Access: | https://ieeexplore.ieee.org/document/10233873/ |
_version_ | 1797689566119854080 |
---|---|
author | Khan Md Hasib; Sami Azam; Asif Karim; Ahmed Al Marouf; F M Javed Mehedi Shamrat; Sidratul Montaha; Kheng Cher Yeo; Mirjam Jonkman; Reda Alhajj; Jon G. Rokne |
author_sort | Khan Md Hasib |
collection | DOAJ |
description | Searching, retrieving, and arranging text in ever-larger document collections requires more efficient information processing algorithms. Document categorization is a crucial component of many information processing systems that rely on supervised learning. As the quantity of documents grows, the performance of classic supervised classifiers deteriorates because of the number of document categories. Text classification is the task of assigning documents to a predetermined set of classes, and it is used extensively in a wide range of data-intensive applications. However, real-world implementations of these models have shortcomings that call for further investigation; in particular, imbalanced datasets hinder the most prevalent high-performance algorithms. In this paper, we propose an approach named multi-class Convolutional Neural Network (MCNN)-Long Short-Term Memory (LSTM), which combines two deep learning techniques, Convolutional Neural Network (CNN) and Long Short-Term Memory, for text classification in news data. CNNs serve as feature extractors for the LSTMs on text input data and capture the spatial structure of words in a sentence, paragraph, or document. Because the dataset is imbalanced, we use the Tomek-Link algorithm to balance it and then apply our model, which achieves a better F1-score (98%) and accuracy (99.71%) than existing works. The combination of deep learning techniques used in our approach is well suited to classifying imbalanced datasets with underrepresented categories, and our method outperformed other machine learning algorithms in text classification by a large margin. We also compare our results with traditional machine learning algorithms on both imbalanced and balanced datasets. |
first_indexed | 2024-03-12T01:47:25Z |
format | Article |
id | doaj.art-69f37c9defb54adcbdfbfad607178b1b |
institution | Directory of Open Access Journals |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-12T01:47:25Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-69f37c9defb54adcbdfbfad607178b1b; 2023-09-08T23:01:46Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11, pp. 93048-93063; DOI 10.1109/ACCESS.2023.3309697; document 10233873; MCNN-LSTM: Combining CNN and LSTM to Classify Multi-Class Text in Imbalanced News Data; Khan Md Hasib (https://orcid.org/0000-0001-6504-4192; Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka, Bangladesh); Sami Azam (https://orcid.org/0000-0001-7572-9750; Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, Australia); Asif Karim (https://orcid.org/0000-0001-8532-6816; Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, Australia); Ahmed Al Marouf (https://orcid.org/0000-0001-6520-0749; Department of Computer Science, University of Calgary, Calgary, Canada); F M Javed Mehedi Shamrat (https://orcid.org/0000-0001-9176-3537; Department of Computer System and Technology, University of Malaya, Kuala Lumpur, Malaysia); Sidratul Montaha (https://orcid.org/0000-0002-5276-3793; Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh); Kheng Cher Yeo (https://orcid.org/0000-0002-0453-3248; Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, Australia); Mirjam Jonkman (https://orcid.org/0000-0002-0396-8370; Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, Australia); Reda Alhajj (https://orcid.org/0000-0001-6657-9738; Department of Computer Science, University of Calgary, Calgary, Canada); Jon G. Rokne (https://orcid.org/0000-0002-3439-2917; Department of Computer Science, University of Calgary, Calgary, Canada); https://ieeexplore.ieee.org/document/10233873/; Big data; text classification; imbalanced data; machine learning; MCNN-LSTM |
title | MCNN-LSTM: Combining CNN and LSTM to Classify Multi-Class Text in Imbalanced News Data |
title_sort | mcnn lstm combining cnn and lstm to classify multi class text in imbalanced news data |
topic | Big data; text classification; imbalanced data; machine learning; MCNN-LSTM |
url | https://ieeexplore.ieee.org/document/10233873/ |
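
The description in this record says the dataset is balanced with the Tomek-Link algorithm before the model is applied. As a rough illustration only, the sketch below finds Tomek links (pairs of samples from different classes that are each other's nearest neighbor) and drops the majority-class member of each pair. The toy 2-D points, the labels, and all function names are assumptions made for this example; the record does not state which variant or implementation the authors used, and real text data would first have to be turned into feature vectors.

```python
from collections import Counter
from math import dist


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))


def tomek_links(points, labels):
    """Return pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_majority_in_links(points, labels):
    """Drop the sample from the more frequent class in every Tomek link (one common variant)."""
    counts = Counter(labels)
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if counts[labels[i]] >= counts[labels[j]] else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: "A" is the majority class, "B" the minority class.
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.2, 3.0), (6.0, 6.0), (0.0, 0.6)]
labels = ["A", "A", "A", "B", "B", "A"]

links = tomek_links(points, labels)
clean_points, clean_labels = remove_majority_in_links(points, labels)
```

Here the only Tomek link is the pair at indices 2 and 3, so the majority-class sample at index 2 is removed. Tomek-link removal is an undersampling and cleaning step, so it usually deletes only a few borderline samples; the `TomekLinks` class in the imbalanced-learn package is one existing implementation of the same idea.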