A Systematic Literature Review on Phishing Email Detection Using Natural Language Processing Techniques

Every year, phishing results in losses of billions of dollars and is a major threat to the Internet economy. Phishing attacks are now most often carried out by email. To better comprehend the existing research trend of phishing email detection, several review studies have been performed. However, it...

Bibliographic Details
Main Authors: Said Salloum, Tarek Gaber, Sunil Vadera, Khaled Shaalan
Format: Article
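Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects: Phishing email detection, systematic literature review, natural language processing, machine learning
Online Access: https://ieeexplore.ieee.org/document/9795286/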
_version_ 1818217526762405888
author Said Salloum
Tarek Gaber
Sunil Vadera
Khaled Shaalan
collection DOAJ
description Every year, phishing results in losses of billions of dollars and is a major threat to the Internet economy. Phishing attacks are now most often carried out by email. Several review studies have been performed to better understand current research trends in phishing email detection. However, it is important to assess this issue from different perspectives. No survey has comprehensively studied the use of Natural Language Processing (NLP) techniques for phishing detection, except one that shed light on the use of NLP techniques for classification and training purposes while exploring a few alternatives. To bridge the gap, this study aims to systematically review and synthesise research on the use of NLP for detecting phishing emails. Based on specific predefined criteria, a total of 100 research articles published between 2006 and 2022 were identified and analysed. We study the key research areas in phishing email detection using NLP, the machine learning algorithms used in phishing email detection, the text features in phishing emails, the datasets and resources used for phishing email detection, and the evaluation criteria. The findings show that the main research area in phishing detection studies is feature extraction and selection, followed by methods for classifying and optimising the detection of phishing emails. Amongst the range of classification algorithms, support vector machines (SVMs) are heavily utilised for detecting phishing emails. The most frequently used NLP techniques are found to be TF-IDF and word embeddings. Furthermore, the most commonly used dataset for benchmarking phishing email detection methods is the Nazario phishing corpus. Also, Python is the most commonly used programming language for phishing email detection. It is expected that the findings of this paper can be helpful for the scientific community, especially in the field of NLP application in cybersecurity problems. This survey is also unique in the sense that it relates works to their openly available tools and resources. The analysis of the presented works revealed that not much work has been performed on Arabic-language phishing emails using NLP techniques. Therefore, many open issues are associated with Arabic phishing email detection.
first_indexed 2024-12-12T07:09:16Z
format Article
id doaj.art-c5e61ef2f5554b078965c9ec77bce98f
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-12T07:09:16Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-c5e61ef2f5554b078965c9ec77bce98f
2022-12-22T00:33:40Z
eng
IEEE
IEEE Access
2169-3536
2022-01-01
Volume 10, pages 65703-65727
DOI: 10.1109/ACCESS.2022.3183083
9795286
A Systematic Literature Review on Phishing Email Detection Using Natural Language Processing Techniques
Said Salloum (https://orcid.org/0000-0002-6073-3981), School of Science, Engineering and Environment, University of Salford, Salford, U.K.
Tarek Gaber (https://orcid.org/0000-0003-4065-4191), School of Science, Engineering and Environment, University of Salford, Salford, U.K.
Sunil Vadera (https://orcid.org/0000-0001-6041-2646), School of Science, Engineering and Environment, University of Salford, Salford, U.K.
Khaled Shaalan (https://orcid.org/0000-0003-0823-8390), Faculty of Engineering and IT, The British University in Dubai, Dubai, United Arab Emirates
title A Systematic Literature Review on Phishing Email Detection Using Natural Language Processing Techniques
topic Phishing email detection
systematic literature review
natural language processing
machine learning
url https://ieeexplore.ieee.org/document/9795286/