Malicious Text Identification: Deep Learning from Public Comments and Emails

Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, as these messages resemble real communications. From the...

Bibliographic Details
Main Authors:	Asma Baccouche, Sadaf Ahmed, Daniel Sierra-Sosa, Adel Elmaghraby
Format:	Article
Language:	English
Published:	MDPI AG 2020-06-01
Series:	Information
Subjects:	spam text filter; text mining; content-based classification; natural language processing; multi-label classification; LSTM
Online Access:	https://www.mdpi.com/2078-2489/11/6/312

author	Asma Baccouche, Sadaf Ahmed, Daniel Sierra-Sosa, Adel Elmaghraby
author_facet	Asma Baccouche, Sadaf Ahmed, Daniel Sierra-Sosa, Adel Elmaghraby
author_sort	Asma Baccouche
collection	DOAJ
description	Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, as these messages resemble real communications. From the Natural Language Processing (NLP) perspective, Deep Learning models are a good alternative for classifying text after it has been preprocessed. In particular, Long Short-Term Memory (LSTM) networks are among the models that perform well on binary and multi-label text classification problems. In this paper, an approach is presented that merges two different data sources, one intended for Spam classification in social media posts and the other for Fraud classification in emails. We designed a multi-label LSTM model and trained it on the joint datasets, including text with common bigrams extracted from each independent dataset. The experimental results show that our proposed model is capable of identifying malicious text regardless of the source. The LSTM model trained with the merged dataset outperforms the models trained independently on each dataset.
first_indexed	2024-03-10T19:15:50Z
format	Article
id	doaj.art-3bb5b7176d574560be4aface49bc8aa2
institution	Directory of Open Access Journals
issn	2078-2489
language	English
last_indexed	2024-03-10T19:15:50Z
publishDate	2020-06-01
publisher	MDPI AG
record_format	Article
series	Information
spelling	doaj.art-3bb5b7176d574560be4aface49bc8aa2 | 2023-11-20T03:24:56Z | eng | MDPI AG | Information | 2078-2489 | 2020-06-01 | vol. 11, no. 6, art. 312 | 10.3390/info11060312 | Malicious Text Identification: Deep Learning from Public Comments and Emails | Asma Baccouche, Sadaf Ahmed, Daniel Sierra-Sosa, Adel Elmaghraby (all: Department of Computer Science and Engineering, University of Louisville, Louisville, KY 40292, USA) | https://www.mdpi.com/2078-2489/11/6/312
title	Malicious Text Identification: Deep Learning from Public Comments and Emails
topic	spam text filter; text mining; content-based classification; natural language processing; multi-label classification; LSTM
url	https://www.mdpi.com/2078-2489/11/6/312

Similar Items