Locality-sensitive hashing K-means algorithm for large-scale datasets

Efficient processing of large datasets is a key enabler of intelligent coal mine construction, for example in intelligent safety monitoring and mining. To address the problem of insufficient clustering efficiency and accuracy of the K-means algorithm for large data...


Bibliographic Details
Main Authors:	WEI Feng, MA Long
Format:	Article
Language:	zho
Published:	Editorial Department of Industry and Mine Automation 2023-03-01
Series:	Gong-kuang zidonghua
Subjects:	intelligent mine; large-scale dataset; k-means clustering; locality-sensitive hashing; noise point filtering; density biased sampling
Online Access:	http://www.gkzdh.cn/article/doi/10.13272/j.issn.1671-251x.2022080018

_version_	1827942054482673664
author	WEI Feng; MA Long
author_facet	WEI Feng; MA Long
author_sort	WEI Feng
collection	DOAJ
description	Efficient processing of large datasets is a key enabler of intelligent coal mine construction, for example in intelligent safety monitoring and mining. To address the insufficient clustering efficiency and accuracy of the K-means algorithm on large datasets, an efficient K-means clustering algorithm based on locality-sensitive hashing (LSH) is proposed. First, LSH is used to optimize the sampling process, yielding a data grouping algorithm, LSH-G, which divides the large dataset into subgroups and effectively removes noise points. Second, LSH-G is used to optimize the subgroup division step of the density biased sampling (DBS) algorithm, yielding a data group sampling algorithm, LSH-GD, whose sample set reflects the distribution of the original dataset more accurately. The K-means algorithm is then applied to the generated sample set, so that useful data can be mined efficiently from large datasets with low time complexity. Experimental results show that the optimal cascade combination consists of 10 AND operations and 8 OR operations, which yields the smallest sum of squares due to error of class center (SSEC). On the artificial dataset, compared with the K-means algorithms based on multi-layer simple random sampling (M-SRS), DBS, and grid density biased sampling (G-DBS), the K-means algorithm based on LSH-GD improves clustering accuracy by an average of 56.63%, 54.59%, and 25.34%, respectively, and clustering efficiency by an average of 27.26%, 16.81%, and 7.07%, respectively. On the UCI standard dataset, the K-means clustering algorithm based on LSH-GD obtains the best SSEC and CPU time consumption (CPU-C).
first_indexed	2024-03-13T09:52:55Z
format	Article
id	doaj.art-cb79ded4755d40d5af1f9d68439a132c
institution	Directory of Open Access Journals
issn	1671-251X
language	zho
last_indexed	2024-03-13T09:52:55Z
publishDate	2023-03-01
publisher	Editorial Department of Industry and Mine Automation
record_format	Article
series	Gong-kuang zidonghua
title	Locality-sensitive hashing K-means algorithm for large-scale datasets
topic	intelligent mine; large-scale dataset; k-means clustering; locality-sensitive hashing; noise point filtering; density biased sampling
url	http://www.gkzdh.cn/article/doi/10.13272/j.issn.1671-251x.2022080018
work_keys_str_mv	AT weifeng localitysensitivehashingkmeansalgorithmforlargescaledatasets AT malong localitysensitivehashingkmeansalgorithmforlargescaledatasets
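
The pipeline summarized in the description (LSH grouping with noise removal, density biased sampling over the groups, then K-means on the sample) can be illustrated with a minimal Python sketch. This is not the authors' LSH-G or LSH-GD implementation. Everything below is an assumption made for illustration: the Gaussian random-projection hash family, the bucket width `w`, merging colliding points with union-find, the `min_size` noise rule, the sampling exponent `e`, and all function names. The toy run also uses 2 AND and 4 OR operations on 2-D points rather than the 10 and 8 reported in the record.

```python
import math
import random
from collections import defaultdict


def make_hash(dim, k, w, rng):
    """One AND-composite hash: k random-projection hashes concatenated (assumed family)."""
    params = [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, w))
              for _ in range(k)]

    def h(x):
        return tuple(math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)
                     for a, b in params)
    return h


def lsh_groups(points, k_and, l_or, w, min_size, seed=0):
    """Merge points that share a bucket in at least one of l_or tables (OR),
    where sharing a bucket needs all k_and hashes to agree (AND).
    Groups smaller than min_size are treated as noise and dropped (assumed rule)."""
    rng = random.Random(seed)
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for _ in range(l_or):
        h = make_hash(len(points[0]), k_and, w, rng)
        buckets = defaultdict(list)
        for i, p in enumerate(points):
            buckets[h(p)].append(i)
        for members in buckets.values():
            for j in members[1:]:
                parent[find(j)] = find(members[0])

    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    return [g for g in groups.values() if len(g) >= min_size]


def density_biased_sample(groups, m, e, seed=0):
    """Keep each point of a group of size n with probability alpha / n**e.
    alpha targets an expected sample size of m; probabilities are capped at 1."""
    rng = random.Random(seed)
    alpha = m / sum(len(g) ** (1 - e) for g in groups)
    sample = []
    for g in groups:
        p = min(1.0, alpha / len(g) ** e)
        sample.extend(i for i in g if rng.random() < p)
    return sample


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on a list of coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[c]
                   for c, cl in enumerate(clusters)]
    return centers


# Toy data: two dense blobs plus a few scattered points standing in for noise.
rng = random.Random(42)
data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(300)]
data += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(300)]
data += [(rng.uniform(-50, 50), rng.uniform(-50, 50)) for _ in range(10)]

groups = lsh_groups(data, k_and=2, l_or=4, w=2.0, min_size=5)
sample_idx = density_biased_sample(groups, m=100, e=0.5)
centers = kmeans([data[i] for i in sample_idx], k=2)
print(len(groups), len(sample_idx), centers)
```

In this sketch, adding AND operations makes each bucket more selective, while adding OR operations gives nearby points more chances to collide. For a single pair of points with per-hash collision probability p, the standard amplification result gives a collision probability of 1 - (1 - p^k)^L for k AND and L OR operations. The union-find merge makes grouping transitive, so actual groups can be larger than that pairwise figure suggests. With `e = 0` the sampler reduces to uniform sampling, and with `e = 1` every group contributes the same expected number of points.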

Locality-sensitive hashing K-means algorithm for large-scale datasets

Similar Items