Malicious Text Identification: Deep Learning from Public Comments and Emails

Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, as these messages resemble real communications. From the...

Bibliographic Details
Main Authors: Asma Baccouche, Sadaf Ahmed, Daniel Sierra-Sosa, Adel Elmaghraby
Format: Article
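Language: English
Published: MDPI AG 2020-06-01
Series: Information
Subjects: spam text filter; text mining; content-based classification; natural language processing; multi-label classification; LSTM
Online Access: https://www.mdpi.com/2078-2489/11/6/312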
_version_ 1797565629912317952
author Asma Baccouche
Sadaf Ahmed
Daniel Sierra-Sosa
Adel Elmaghraby
collection DOAJ
description Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, as these messages resemble real communications. From the Natural Language Processing (NLP) perspective, Deep Learning models are a good alternative for classifying preprocessed text. In particular, Long Short-Term Memory (LSTM) networks are among the models that perform well on binary and multi-label text classification problems. This paper presents an approach that merges two different data sources, one intended for Spam classification in social media posts and the other for Fraud classification in emails. We designed a multi-label LSTM model and trained it on the joint datasets, including text with common bigrams extracted from each independent dataset. The experimental results show that our proposed model is capable of identifying malicious text regardless of the source. The LSTM model trained with the merged dataset outperforms the models trained independently on each dataset.
first_indexed 2024-03-10T19:15:50Z
format Article
id doaj.art-3bb5b7176d574560be4aface49bc8aa2
institution Directory Open Access Journal
issn 2078-2489
language English
last_indexed 2024-03-10T19:15:50Z
publishDate 2020-06-01
publisher MDPI AG
record_format Article
series Information
spelling doaj.art-3bb5b7176d574560be4aface49bc8aa2
2023-11-20T03:24:56Z
eng
MDPI AG
Information
2078-2489
2020-06-01
11(6), 312
10.3390/info11060312
Malicious Text Identification: Deep Learning from Public Comments and Emails
Asma Baccouche, Sadaf Ahmed, Daniel Sierra-Sosa, Adel Elmaghraby
Department of Computer Science and Engineering, University of Louisville, Louisville, KY 40292, USA (all four authors)
https://www.mdpi.com/2078-2489/11/6/312
spam text filter; text mining; content-based classification; natural language processing; multi-label classification; LSTM
title Malicious Text Identification: Deep Learning from Public Comments and Emails
topic spam text filter
text mining
content-based classification
natural language processing
multi-label classification
LSTM
url https://www.mdpi.com/2078-2489/11/6/312