Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data

With the development of internet technology, the number of illicit websites such as gambling and pornography has dramatically increased, posing serious threats to people’s physical and mental health, as well as their financial security. Currently, the governance of such illicit websites mainly focus...

Bibliographic Details
Main Authors: Bo Wang, Fan Shi, Haiyang Zheng
Format: Article
Language: English
Published: MDPI AG 2023-08-01
Series: Applied Sciences
Subjects: unsupervised learning; clustering; multimodal; network mapping
Online Access: https://www.mdpi.com/2076-3417/13/17/9837
_version_ 1827728252261629952
author Bo Wang
Fan Shi
Haiyang Zheng
collection DOAJ
description With the development of internet technology, the number of illicit websites, such as gambling and pornography sites, has increased dramatically, posing serious threats to people’s physical and mental health as well as their financial security. Currently, the governance of such websites mainly focuses on limited-scale detection through manual annotation. Effective governance, however, urgently requires the ability to rapidly acquire large volumes of existing website data from the internet. Web mapping engines can provide massive, near real-time web data, which plays a crucial role in the batch detection of illicit websites. Therefore, in this paper, we propose a method that uses big data from web mapping engines to perform unsupervised multimodal clustering (MDC) for illicit website discovery. By extracting features from webpage screenshots and OCR text with contrastive learning methods, we cluster websites by feature similarity to identify illicit ones. Our unsupervised clustering model achieved an overall accuracy of 84.1% across all confidence levels and an accuracy of 92.39% at confidence levels of 0.999 or higher. By applying the MDC model to 3.7 million real web mapping data records, we obtained 397,275 illicit websites, primarily gambling and pornography sites, with 14 attributes. This dataset is made publicly available.
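The feature-similarity clustering step named in the description can be illustrated with a small sketch. This is not the authors' MDC model: the fusion rule, the greedy clustering procedure, the 0.95 threshold, and the helper names `fuse` and `cluster` are all assumptions made for illustration, and the toy vectors merely stand in for features that a contrastive encoder might produce from a screenshot and from OCR text. The similarity score returned per page is only a rough stand-in for the confidence level mentioned in the description.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(image_vec, text_vec):
    """Hypothetical fusion: normalize each modality, concatenate, renormalize."""
    return l2_normalize(l2_normalize(image_vec) + l2_normalize(text_vec))


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def cluster(features, threshold=0.95):
    """Greedy single-pass clustering (an assumed stand-in, not the paper's method).

    Each feature joins the most similar existing cluster seed if the cosine
    similarity reaches the threshold; otherwise it starts a new cluster.
    Returns a cluster label and a similarity score for every feature.
    """
    seeds, labels, scores = [], [], []
    for feat in features:
        best, best_sim = -1, -1.0
        for i, seed in enumerate(seeds):
            sim = cosine(feat, seed)
            if sim > best_sim:
                best, best_sim = i, sim
        if best_sim >= threshold:
            labels.append(best)
            scores.append(best_sim)
        else:
            seeds.append(feat)
            labels.append(len(seeds) - 1)
            scores.append(1.0)  # a seed is trivially identical to itself
    return labels, scores


# Toy (screenshot feature, OCR-text feature) pairs; the values are invented.
pages = [
    ([1.0, 0.0], [0.9, 0.1]),      # page A
    ([0.98, 0.05], [0.88, 0.12]),  # page B, visually and textually close to A
    ([0.0, 1.0], [0.1, 0.9]),      # page C, unrelated to A and B
]
features = [fuse(img, txt) for img, txt in pages]
labels, scores = cluster(features, threshold=0.95)
print(labels)  # [0, 0, 1]
```

Pages A and B land in the same cluster because their fused features are nearly parallel, while page C starts a cluster of its own. Keeping only assignments whose score exceeds a high cut-off would mimic, very loosely, the trade-off the description reports between coverage and accuracy at higher confidence levels.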
first_indexed 2024-03-10T23:27:43Z
format Article
id doaj.art-73a2974e6c724da4a0b37fefffa8d994
institution Directory of Open Access Journals
issn 2076-3417
language English
last_indexed 2024-03-10T23:27:43Z
publishDate 2023-08-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-73a2974e6c724da4a0b37fefffa8d994
2023-11-19T07:52:26Z
eng
MDPI AG
Applied Sciences
2076-3417
2023-08-01
13
17
9837
10.3390/app13179837
Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data
Bo Wang (0)
Fan Shi (1)
Haiyang Zheng (2)
College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
With the development of internet technology, the number of illicit websites, such as gambling and pornography sites, has increased dramatically, posing serious threats to people’s physical and mental health as well as their financial security. Currently, the governance of such websites mainly focuses on limited-scale detection through manual annotation. Effective governance, however, urgently requires the ability to rapidly acquire large volumes of existing website data from the internet. Web mapping engines can provide massive, near real-time web data, which plays a crucial role in the batch detection of illicit websites. Therefore, in this paper, we propose a method that uses big data from web mapping engines to perform unsupervised multimodal clustering (MDC) for illicit website discovery. By extracting features from webpage screenshots and OCR text with contrastive learning methods, we cluster websites by feature similarity to identify illicit ones. Our unsupervised clustering model achieved an overall accuracy of 84.1% across all confidence levels and an accuracy of 92.39% at confidence levels of 0.999 or higher. By applying the MDC model to 3.7 million real web mapping data records, we obtained 397,275 illicit websites, primarily gambling and pornography sites, with 14 attributes. This dataset is made publicly available.
https://www.mdpi.com/2076-3417/13/17/9837
unsupervised learning
clustering
multimodal
network mapping
title Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data
topic unsupervised learning
clustering
multimodal
network mapping
url https://www.mdpi.com/2076-3417/13/17/9837