Malicious Text Identification: Deep Learning from Public Comments and Emails
Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, as these messages resemble real communications. From the...
Main Authors: | Asma Baccouche, Sadaf Ahmed, Daniel Sierra-Sosa, Adel Elmaghraby |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2020-06-01 |
Series: | Information |
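Subjects: | spam text filter; text mining; content-based classification; natural language processing; multi-label classification; LSTM |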
Online Access: | https://www.mdpi.com/2078-2489/11/6/312 |
_version_ | 1797565629912317952 |
---|---|
author | Asma Baccouche; Sadaf Ahmed; Daniel Sierra-Sosa; Adel Elmaghraby |
author_facet | Asma Baccouche; Sadaf Ahmed; Daniel Sierra-Sosa; Adel Elmaghraby |
author_sort | Asma Baccouche |
collection | DOAJ |
description | Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, as these messages resemble real communications. From the Natural Language Processing (NLP) perspective, Deep Learning models are a good alternative for classifying preprocessed text. In particular, Long Short-Term Memory (LSTM) networks are among the models that perform well on binary and multi-label text classification problems. This paper presents an approach that merges two different data sources, one intended for Spam classification in social media posts and the other for Fraud classification in emails. We designed a multi-label LSTM model and trained it on the joint datasets, including text with common bigrams extracted from each independent dataset. The experimental results show that our proposed model is capable of identifying malicious text regardless of the source. The LSTM model trained on the merged dataset outperforms the models trained independently on each dataset. |
first_indexed | 2024-03-10T19:15:50Z |
format | Article |
id | doaj.art-3bb5b7176d574560be4aface49bc8aa2 |
institution | Directory Open Access Journal |
issn | 2078-2489 |
language | English |
last_indexed | 2024-03-10T19:15:50Z |
publishDate | 2020-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Information |
spelling | doaj.art-3bb5b7176d574560be4aface49bc8aa2; 2023-11-20T03:24:56Z; eng; MDPI AG; Information; 2078-2489; 2020-06-01; vol. 11, no. 6, art. 312; doi:10.3390/info11060312; Malicious Text Identification: Deep Learning from Public Comments and Emails; Asma Baccouche, Sadaf Ahmed, Daniel Sierra-Sosa, Adel Elmaghraby (all: Department of Computer Science and Engineering, University of Louisville, Louisville, KY 40292, USA); https://www.mdpi.com/2078-2489/11/6/312; spam text filter; text mining; content-based classification; natural language processing; multi-label classification; LSTM |
title | Malicious Text Identification: Deep Learning from Public Comments and Emails |
topic | spam text filter; text mining; content-based classification; natural language processing; multi-label classification; LSTM |
url | https://www.mdpi.com/2078-2489/11/6/312 |
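
The description in this record mentions joint datasets "including text with common bigrams" and a multi-label model, but gives no implementation details. The following is only a minimal sketch of one possible reading, not the authors' method. It assumes whitespace tokenization, that a bigram is "common" when it occurs at least once in both corpora, and that each merged sample carries a two-element label vector `[spam, fraud]`. All function names are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Return adjacent lowercase word pairs from whitespace-tokenized text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def corpus_bigram_counts(rows):
    """Count bigrams over a list of (text, label) rows."""
    counts = Counter()
    for text, _label in rows:
        counts.update(bigrams(text))
    return counts


def common_bigrams(spam_rows, fraud_rows):
    """Bigrams that occur at least once in both corpora (assumed definition)."""
    return set(corpus_bigram_counts(spam_rows)) & set(corpus_bigram_counts(fraud_rows))


def merge_datasets(spam_rows, fraud_rows):
    """Keep texts containing a shared bigram; attach [spam, fraud] label vectors."""
    shared = common_bigrams(spam_rows, fraud_rows)
    merged = []
    for text, label in spam_rows:
        if shared.intersection(bigrams(text)):
            merged.append((text, [label, 0]))  # fraud label unknown, assumed 0
    for text, label in fraud_rows:
        if shared.intersection(bigrams(text)):
            merged.append((text, [0, label]))  # spam label unknown, assumed 0
    return merged
```

The merged rows could then be tokenized and passed to a sequence classifier with two sigmoid outputs, which is a common way to set up multi-label classification; the record does not say whether the paper does this.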