Natural language based malicious domain detection using machine learning and deep learning

Cyberattacks remain a challenge because their number grows day by day. Cybercriminals employ a variety of strategies to manipulate their targets and exploit their targets' vulnerabilities. Malicious URLs are one such strategy, used to target large groups on various social media platforms. To draw intern...

Bibliographic Details
Main Authors:	Abdul Samad Saleem Raja, Ganesan Pradeepa, Somasundaram Mahalakshmi, Manickam Sam Jayakumar
Format:	Article
Language:	English
Published:	Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University) 2023-04-01
Series:	Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki
Subjects:	malicious domain, phishing URL, NLP, machine learning, deep learning, ANN, CNN
Online Access:	https://ntv.ifmo.ru/file/article/21907.pdf

author	Abdul Samad Saleem Raja; Ganesan Pradeepa; Somasundaram Mahalakshmi; Manickam Sam Jayakumar
author_facet	Abdul Samad Saleem Raja; Ganesan Pradeepa; Somasundaram Mahalakshmi; Manickam Sam Jayakumar
author_sort	Abdul Samad Saleem Raja
collection	DOAJ
description	Cyberattacks remain a challenge because their number grows day by day. Cybercriminals employ a variety of strategies to manipulate their targets and exploit their targets' vulnerabilities. Malicious URLs are one such strategy, used to target large groups on various social media platforms. To attract internet users, these web addresses are disguised as safe ones. Deliberate or inadvertent use of such URLs exposes the user or the organization in cyberspace and opens the way for further attacks. Systems that use rule-based or machine learning algorithms to find malicious URLs usually rely on feature engineering, which requires domain expertise and experience. Even after features have been extracted, they may not fully exploit the information in the dataset. The proposed method employs Natural Language Processing (NLP) approaches to vectorize the words in the URLs and applies machine learning and deep learning models for classification. Vectorization reduces the effort of feature engineering and makes fuller use of the dataset. Three different vectorization methods are used to vectorize the URL text. The method is evaluated on two separate datasets (D1 and D2) that are regularly used in this research domain. On the D1 dataset, the highest accuracy, 92.4 %, is achieved both by the Decision Tree (DT) with the count vectorizer and by the Random Forest (RF) with the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer. On the D2 dataset, DT with the TF-IDF vectorizer reaches a higher accuracy of 99.5 %. The Artificial Neural Network (ANN) model achieves 89.6 % accuracy on the D1 dataset and 99.2 % accuracy on the D2 dataset.
first_indexed	2024-04-09T17:38:10Z
format	Article
id	doaj.art-231ca04bbf0245528b5519d2ca4ce7a0
institution	Directory of Open Access Journals
issn	2226-1494 2500-0373
language	English
last_indexed	2024-04-09T17:38:10Z
publishDate	2023-04-01
publisher	Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University)
record_format	Article
series	Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki
spelling	doaj.art-231ca04bbf0245528b5519d2ca4ce7a0 | 2023-04-17T09:32:57Z | eng | Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University) | Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki | 2226-1494 | 2500-0373 | 2023-04-01 | vol. 23, no. 2, pp. 304-312 | 10.17586/2226-1494-2023-23-2-304-312 | Natural language based malicious domain detection using machine learning and deep learning | Abdul Samad Saleem Raja (0), https://orcid.org/0000-0002-7203-1426 | Ganesan Pradeepa (1), https://orcid.org/0000-0002-5920-066X | Somasundaram Mahalakshmi (2), https://orcid.org/0009-0008-5059-4384 | Manickam Sam Jayakumar (3), https://orcid.org/0000-0002-5417-5960 | PhD, Lecturer, University of Technology and Applied Sciences, Shinas, 324, Oman, sc 56862209800 | Lecturer, University of Technology and Applied Sciences, Shinas, 324, Oman, sc 57673491800 | Assistant Professor, Vivekananda College of Arts and Sciences for Women, Tiruchengode, 637211, India | Lecturer, University of Technology and Applied Sciences, Shinas, 324, Oman | https://ntv.ifmo.ru/file/article/21907.pdf | malicious domain, phishing url, nlp, machine learning, deep learning, ann, cnn
title	Natural language based malicious domain detection using machine learning and deep learning
topic	malicious domain, phishing URL, NLP, machine learning, deep learning, ANN, CNN
url	https://ntv.ifmo.ru/file/article/21907.pdf
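
The abstract in this record describes vectorizing the words in a URL with a count vectorizer and a TF-IDF vectorizer before classification. As a rough illustration of that idea only, and not the authors' implementation, the sketch below tokenizes URLs and builds both kinds of vector using the Python standard library. The tokenization rule (split on non-alphanumeric characters), the smoothed IDF formula, the function names, and the example URLs are all assumptions made for this illustration; the article may use different choices.

```python
import math
import re
from collections import Counter


def tokenize_url(url):
    """Split a URL into lowercase word tokens on any non-alphanumeric character.

    Assumed tokenization rule; not taken from the article.
    """
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


def count_vectors(urls):
    """Return (vocabulary, rows); each row holds raw token counts for one URL."""
    docs = [Counter(tokenize_url(u)) for u in urls]
    vocab = sorted(set().union(*docs))
    rows = [[doc[term] for term in vocab] for doc in docs]
    return vocab, rows


def tfidf_vectors(urls):
    """Return (vocabulary, rows); each row holds count * idf weights.

    Uses a common smoothed variant, idf = ln((1 + n) / (1 + df)) + 1,
    and applies no length normalization (both are assumptions).
    """
    vocab, counts = count_vectors(urls)
    n_docs = len(urls)
    # Document frequency: number of URLs in which each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    rows = [[c * w for c, w in zip(row, idf)] for row in counts]
    return vocab, rows


# Hypothetical example URLs, for illustration only.
urls = [
    "http://example.com/login",
    "http://secure-login.example.net/verify/account",
]
vocab, counts = count_vectors(urls)
_, weights = tfidf_vectors(urls)
print(vocab)
print(counts)
print(weights)
```

Rows produced this way could be passed as feature vectors to a classifier such as a decision tree or random forest, which is the general pipeline the abstract outlines.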

Similar Items