MCNN-LSTM: Combining CNN and LSTM to Classify Multi-Class Text in Imbalanced News Data

Bibliographic Details
Main Authors: Khan Md Hasib, Sami Azam, Asif Karim, Ahmed Al Marouf, F M Javed Mehedi Shamrat, Sidratul Montaha, Kheng Cher Yeo, Mirjam Jonkman, Reda Alhajj, Jon G. Rokne
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects: Big data; text classification; imbalanced data; machine learning; MCNN-LSTM
Online Access: https://ieeexplore.ieee.org/document/10233873/
_version_ 1797689566119854080
author Khan Md Hasib
Sami Azam
Asif Karim
Ahmed Al Marouf
F M Javed Mehedi Shamrat
Sidratul Montaha
Kheng Cher Yeo
Mirjam Jonkman
Reda Alhajj
Jon G. Rokne
author_facet Khan Md Hasib
Sami Azam
Asif Karim
Ahmed Al Marouf
F M Javed Mehedi Shamrat
Sidratul Montaha
Kheng Cher Yeo
Mirjam Jonkman
Reda Alhajj
Jon G. Rokne
author_sort Khan Md Hasib
collection DOAJ
description Searching, retrieving, and organizing text in ever-larger document collections requires more efficient information processing algorithms. Document categorization is a crucial component of many information processing systems that rely on supervised learning. As the quantity of documents grows, the performance of classic supervised classifiers deteriorates because of the number of document categories. Assigning documents to a predetermined set of classes is called text classification, and it is used extensively in a wide range of data-intensive applications. However, real-world implementations of these models have shortcomings that call for further investigation; in particular, imbalanced datasets hinder the most prevalent high-performance algorithms. In this paper, we propose an approach named multi-class Convolutional Neural Network (MCNN)-Long Short-Term Memory (LSTM), which combines two deep learning techniques, Convolutional Neural Networks (CNNs) and Long Short-Term Memory, for text classification in news data. CNNs serve as feature extractors for the LSTMs on text input data, capturing the spatial structure of words in a sentence, paragraph, or document. Because the dataset is imbalanced, we use the Tomek-Link algorithm to balance it and then apply our model, which achieves a higher F1-score (98%) and accuracy (99.71%) than existing works. The combination of deep learning techniques used in our approach is well suited to classifying imbalanced datasets with underrepresented categories, and our method outperformed the other machine learning algorithms in text classification by a large margin. We also compare our results with traditional machine learning algorithms on both the imbalanced and balanced datasets.
first_indexed 2024-03-12T01:47:25Z
format Article
id doaj.art-69f37c9defb54adcbdfbfad607178b1b
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-12T01:47:25Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-69f37c9defb54adcbdfbfad607178b1b
2023-09-08T23:01:46Z
eng
IEEE
IEEE Access
2169-3536
2023-01-01
11
93048
93063
10.1109/ACCESS.2023.3309697
10233873
MCNN-LSTM: Combining CNN and LSTM to Classify Multi-Class Text in Imbalanced News Data
Khan Md Hasib (https://orcid.org/0000-0001-6504-4192), Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka, Bangladesh
Sami Azam (https://orcid.org/0000-0001-7572-9750), Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, Australia
Asif Karim (https://orcid.org/0000-0001-8532-6816), Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, Australia
Ahmed Al Marouf (https://orcid.org/0000-0001-6520-0749), Department of Computer Science, University of Calgary, Calgary, Canada
F M Javed Mehedi Shamrat (https://orcid.org/0000-0001-9176-3537), Department of Computer System and Technology, University of Malaya, Kuala Lumpur, Malaysia
Sidratul Montaha (https://orcid.org/0000-0002-5276-3793), Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh
Kheng Cher Yeo (https://orcid.org/0000-0002-0453-3248), Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, Australia
Mirjam Jonkman (https://orcid.org/0000-0002-0396-8370), Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, Australia
Reda Alhajj (https://orcid.org/0000-0001-6657-9738), Department of Computer Science, University of Calgary, Calgary, Canada
Jon G. Rokne (https://orcid.org/0000-0002-3439-2917), Department of Computer Science, University of Calgary, Calgary, Canada
https://ieeexplore.ieee.org/document/10233873/
Big data
text classification
imbalanced data
machine learning
MCNN-LSTM
title MCNN-LSTM: Combining CNN and LSTM to Classify Multi-Class Text in Imbalanced News Data
topic Big data
text classification
imbalanced data
machine learning
MCNN-LSTM
url https://ieeexplore.ieee.org/document/10233873/