Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data
With the development of internet technology, the number of illicit websites such as gambling and pornography has dramatically increased, posing serious threats to people’s physical and mental health, as well as their financial security. Currently, the governance of such illicit websites mainly focus...
Main Authors: | Bo Wang, Fan Shi, Haiyang Zheng |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-08-01 |
Series: | Applied Sciences |
Subjects: | unsupervised learning, clustering, multimodal, network mapping |
Online Access: | https://www.mdpi.com/2076-3417/13/17/9837 |
_version_ | 1827728252261629952 |
---|---|
author | Bo Wang; Fan Shi; Haiyang Zheng |
author_facet | Bo Wang; Fan Shi; Haiyang Zheng |
author_sort | Bo Wang |
collection | DOAJ |
description | With the development of internet technology, the number of illicit websites such as gambling and pornography has dramatically increased, posing serious threats to people’s physical and mental health, as well as their financial security. Currently, the governance of such illicit websites mainly focuses on limited-scale detection through manual annotation. However, the need for effective solutions to govern illicit websites is urgent, requiring the ability to rapidly acquire large volumes of existing website data from the internet. Web mapping engines can provide massive, near real-time web data, which plays a crucial role in batch detection of illicit websites. Therefore, in this paper, we propose a method that combines web mapping engine big data to perform unsupervised multimodal clustering (MDC) for illicit website discovery. By extracting features based on contrastive learning methods from webpage screenshots and OCR text, we conduct feature similarity clustering to identify illicit websites. Finally, our unsupervised clustering model achieved an overall accuracy of 84.1% across all confidence levels and an accuracy of 92.39% at a confidence level of 0.999 or higher. By applying the MDC model to 3.7 million real web mapping records, we obtained 397,275 illicit websites, primarily focused on gambling and pornography, with 14 attributes each. This dataset is made publicly available. |
first_indexed | 2024-03-10T23:27:43Z |
format | Article |
id | doaj.art-73a2974e6c724da4a0b37fefffa8d994 |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-10T23:27:43Z |
publishDate | 2023-08-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-73a2974e6c724da4a0b37fefffa8d994; 2023-11-19T07:52:26Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2023-08-01; 13; 17; 9837; 10.3390/app13179837; Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data; Bo Wang [0]; Fan Shi [1]; Haiyang Zheng [2]; [0] College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China; [1] College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China; [2] College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China; With the development of internet technology, the number of illicit websites such as gambling and pornography has dramatically increased, posing serious threats to people’s physical and mental health, as well as their financial security. Currently, the governance of such illicit websites mainly focuses on limited-scale detection through manual annotation. However, the need for effective solutions to govern illicit websites is urgent, requiring the ability to rapidly acquire large volumes of existing website data from the internet. Web mapping engines can provide massive, near real-time web data, which plays a crucial role in batch detection of illicit websites. Therefore, in this paper, we propose a method that combines web mapping engine big data to perform unsupervised multimodal clustering (MDC) for illicit website discovery. By extracting features based on contrastive learning methods from webpage screenshots and OCR text, we conduct feature similarity clustering to identify illicit websites. Finally, our unsupervised clustering model achieved an overall accuracy of 84.1% across all confidence levels and an accuracy of 92.39% at a confidence level of 0.999 or higher. By applying the MDC model to 3.7 million real web mapping records, we obtained 397,275 illicit websites, primarily focused on gambling and pornography, with 14 attributes each. This dataset is made publicly available.; https://www.mdpi.com/2076-3417/13/17/9837; unsupervised learning; clustering; multimodal; network mapping |
spellingShingle | Bo Wang Fan Shi Haiyang Zheng Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data Applied Sciences unsupervised learning clustering multimodal network mapping |
title | Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data |
title_full | Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data |
title_fullStr | Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data |
title_full_unstemmed | Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data |
title_short | Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data |
title_sort | multi modal clustering discovery method for illegal websites based on network surveying and mapping big data |
topic | unsupervised learning; clustering; multimodal; network mapping |
url | https://www.mdpi.com/2076-3417/13/17/9837 |
work_keys_str_mv | AT bowang multimodalclusteringdiscoverymethodforillegalwebsitesbasedonnetworksurveyingandmappingbigdata AT fanshi multimodalclusteringdiscoverymethodforillegalwebsitesbasedonnetworksurveyingandmappingbigdata AT haiyangzheng multimodalclusteringdiscoverymethodforillegalwebsitesbasedonnetworksurveyingandmappingbigdata |
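
The description field above outlines a pipeline of multimodal feature extraction, feature similarity clustering, and confidence-based filtering. The following is a minimal sketch of that general idea only, not the authors' MDC model. Every name in it is hypothetical: the fusion by concatenation, the spherical k-means stand-in, the softmax confidence score, and the `temperature` value are all assumptions made for illustration, and the embeddings are toy arrays rather than outputs of a contrastive encoder.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def fuse(image_feats, text_feats):
    """Assumed fusion: concatenate the normalized modalities, then renormalize."""
    both = np.hstack([l2_normalize(image_feats), l2_normalize(text_feats)])
    return l2_normalize(both)


def init_centroids(feats, k):
    """Deterministic farthest-point start: add the row least similar to those chosen."""
    chosen = [0]
    for _ in range(k - 1):
        sims = feats @ feats[chosen].T
        chosen.append(int(sims.max(axis=1).argmin()))
    return feats[chosen].copy()


def spherical_kmeans(feats, k, iters=20):
    """Cluster unit vectors by cosine similarity (a stand-in, not the paper's model)."""
    centroids = init_centroids(feats, k)
    for _ in range(iters):
        labels = (feats @ centroids.T).argmax(axis=1)
        for j in range(k):
            members = feats[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        centroids = l2_normalize(centroids)
    return centroids


def assign_with_confidence(feats, centroids, temperature=0.05):
    """Return the nearest cluster per row and a softmax score used as confidence."""
    logits = (feats @ centroids.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1)


if __name__ == "__main__":
    # Toy stand-ins for screenshot and OCR-text embeddings of four pages.
    image = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    text = np.array([[1.0, 0.1], [1.0, 0.0], [0.0, 1.0], [0.1, 1.0]])
    feats = fuse(image, text)
    labels, conf = assign_with_confidence(feats, spherical_kmeans(feats, k=2))
    keep = conf >= 0.999  # threshold echoing the abstract's high-confidence band
    print(labels, keep)
```

Filtering on `conf >= 0.999` mirrors the abstract's trade-off: keeping only high-confidence assignments would be expected to raise precision at the cost of discarding ambiguous pages. How the published model actually defines confidence is not stated in this record.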