Summary: | Unsupervised clustering methods are well established in the statistics and machine learning literature, and many sophisticated unsupervised classification techniques are available to handle a growing number of datasets. Because it is simple and efficient on large datasets, the <i>k</i>-means clustering algorithm remains popular and widely used in the machine learning community. However, as with other clustering methods, it requires the number of clusters to be chosen in advance. The primary aim of this paper is to develop a novel data-driven method for finding the optimum number of clusters, <i>k</i>. Taking the cluster symmetry property into account, the <i>k</i>-means algorithm is applied multiple times over a range of <i>k</i> values within which the balanced optimum <i>k</i> is expected to lie. The selection is based on the uniqueness and symmetry of the centroid values of the resulting clusters: the final <i>k</i> is the value for which symmetry is observed. We evaluated the proposed algorithm on simulated datasets with controlled parameters and on real datasets from the UCI Machine Learning Repository. We also evaluated it on remote sensing applications, such as monitoring deforestation and urbanization, using Sentinel-2B satellite images of the Islamabad region in Pakistan obtained from the United States Geological Survey. The experimental results and real data analysis indicate that the proposed algorithm achieves higher accuracy and lower root mean square error than existing methods.
|