Determining the Quality of a Dataset in Clustering Terms
The purpose of the theoretical considerations and research conducted was to indicate the instruments with which the quality of a dataset can be verified for the segmentation of observations occurring in the dataset. The paper proposes a novel way to deal with mixed datasets containing categorical an...
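Main Authors: | Alicja Rachwał, Emilia Popławska, Izolda Gorgol, Tomasz Cieplak, Damian Pliszczuk, Łukasz Skowron, Tomasz Rymarczyk |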
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-02-01 |
Series: | Applied Sciences |
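Subjects: | artificial intelligence, statistical learning, machine learning, decision-making based on data-driven models, data set quality, clustering |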
Online Access: | https://www.mdpi.com/2076-3417/13/5/2942 |
_version_ | 1797615773691150336 |
---|---|
author | Alicja Rachwał Emilia Popławska Izolda Gorgol Tomasz Cieplak Damian Pliszczuk Łukasz Skowron Tomasz Rymarczyk |
author_facet | Alicja Rachwał Emilia Popławska Izolda Gorgol Tomasz Cieplak Damian Pliszczuk Łukasz Skowron Tomasz Rymarczyk |
author_sort | Alicja Rachwał |
collection | DOAJ |
description | The purpose of the theoretical considerations and research conducted was to indicate the instruments with which the quality of a dataset can be verified for the segmentation of observations occurring in the dataset. The paper proposes a novel way to deal with mixed datasets containing categorical and continuous attributes in a customer segmentation task. The categorical variables were embedded using an innovative unsupervised model based on an autoencoder. The customers were then divided into groups using different clustering algorithms, based on similarity matrices. In addition to the classic k-means method and the more modern DBSCAN, three graph algorithms were used: the Louvain algorithm, the greedy algorithm and the label propagation algorithm. The research was conducted on two datasets: one containing retail customers and the other containing wholesale customers. The Calinski–Harabasz index, Davies–Bouldin index, NMI index, Fowlkes–Mallows index and silhouette score were used to assess the quality of the clustering. It was noted that the modularity parameter for graph methods was a good indicator of whether a given set could be meaningfully divided into groups. |
first_indexed | 2024-03-11T07:31:34Z |
format | Article |
id | doaj.art-2c60b35dc3b64a238f92f2d51b65ba8f |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-11T07:31:34Z |
publishDate | 2023-02-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-2c60b35dc3b64a238f92f2d51b65ba8f; 2023-11-17T07:17:09Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2023-02-01; 13; 5; 2942; 10.3390/app13052942; Determining the Quality of a Dataset in Clustering Terms; Alicja Rachwał (0); Emilia Popławska (1); Izolda Gorgol (2); Tomasz Cieplak (3); Damian Pliszczuk (4); Łukasz Skowron (5); Tomasz Rymarczyk (6); Faculty of Electrical Engineering and Computer Science, Lublin University of Technology, 20-618 Lublin, Poland; Faculty of Technology Fundamentals, Lublin University of Technology, 20-618 Lublin, Poland; Faculty of Technology Fundamentals, Lublin University of Technology, 20-618 Lublin, Poland; Faculty of Management, Lublin University of Technology, 20-618 Lublin, Poland; Netrix S.A. Research and Development Center, 20-704 Lublin, Poland; Faculty of Management, Lublin University of Technology, 20-618 Lublin, Poland; Netrix S.A. Research and Development Center, 20-704 Lublin, Poland; The purpose of the theoretical considerations and research conducted was to indicate the instruments with which the quality of a dataset can be verified for the segmentation of observations occurring in the dataset. The paper proposes a novel way to deal with mixed datasets containing categorical and continuous attributes in a customer segmentation task. The categorical variables were embedded using an innovative unsupervised model based on an autoencoder. The customers were then divided into groups using different clustering algorithms, based on similarity matrices. In addition to the classic k-means method and the more modern DBSCAN, three graph algorithms were used: the Louvain algorithm, the greedy algorithm and the label propagation algorithm. The research was conducted on two datasets: one containing retail customers and the other containing wholesale customers. The Calinski–Harabasz index, Davies–Bouldin index, NMI index, Fowlkes–Mallows index and silhouette score were used to assess the quality of the clustering. It was noted that the modularity parameter for graph methods was a good indicator of whether a given set could be meaningfully divided into groups.; https://www.mdpi.com/2076-3417/13/5/2942; artificial intelligence; statistical learning; machine learning; decision-making based on data-driven models; data set quality; clustering |
spellingShingle | Alicja Rachwał Emilia Popławska Izolda Gorgol Tomasz Cieplak Damian Pliszczuk Łukasz Skowron Tomasz Rymarczyk Determining the Quality of a Dataset in Clustering Terms Applied Sciences artificial intelligence statistical learning machine learning decision-making based on data-driven models data set quality clustering |
title | Determining the Quality of a Dataset in Clustering Terms |
title_full | Determining the Quality of a Dataset in Clustering Terms |
title_fullStr | Determining the Quality of a Dataset in Clustering Terms |
title_full_unstemmed | Determining the Quality of a Dataset in Clustering Terms |
title_short | Determining the Quality of a Dataset in Clustering Terms |
title_sort | determining the quality of a dataset in clustering terms |
topic | artificial intelligence statistical learning machine learning decision-making based on data-driven models data set quality clustering |
url | https://www.mdpi.com/2076-3417/13/5/2942 |
work_keys_str_mv | AT alicjarachwał determiningthequalityofadatasetinclusteringterms AT emiliapopławska determiningthequalityofadatasetinclusteringterms AT izoldagorgol determiningthequalityofadatasetinclusteringterms AT tomaszcieplak determiningthequalityofadatasetinclusteringterms AT damianpliszczuk determiningthequalityofadatasetinclusteringterms AT łukaszskowron determiningthequalityofadatasetinclusteringterms AT tomaszrymarczyk determiningthequalityofadatasetinclusteringterms |
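
The description in this record notes that the modularity parameter of graph methods was a good indicator of whether a set can be meaningfully divided into groups. As an illustration only, the sketch below computes the standard Newman modularity of a partition of an undirected, unweighted graph. It is not the authors' code: the function name `modularity`, the toy graph, and both partitions are hypothetical and chosen purely to show how the score separates a sensible split from an arbitrary one.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges        -- list of (u, v) pairs, one entry per undirected edge
    community_of -- dict mapping each node to its community label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (internal edge share - expected share)
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# A split that follows the two triangles, and an arbitrary alternating split.
good_split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
poor_split = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}

q_good = modularity(edges, good_split)
q_poor = modularity(edges, poor_split)
print(f"triangle split:    Q = {q_good:.3f}")
print(f"alternating split: Q = {q_poor:.3f}")
```

On this toy graph the triangle split scores Q = 5/14 (about 0.357), while the alternating split scores Q = -3/14 (about -0.214). A clearly positive value suggests that the groups are denser internally than chance would predict, which is the sense in which modularity can hint at whether a dataset divides meaningfully.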