Determining the Quality of a Dataset in Clustering Terms

The purpose of this theoretical and empirical study was to identify instruments for verifying whether a dataset is suitable for segmenting the observations it contains. The paper proposes a novel way to deal with mixed datasets containing categorical an...

Bibliographic Details
Main Authors: Alicja Rachwał, Emilia Popławska, Izolda Gorgol, Tomasz Cieplak, Damian Pliszczuk, Łukasz Skowron, Tomasz Rymarczyk
Format: Article
Language: English
Published: MDPI AG 2023-02-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/13/5/2942
collection DOAJ
description The purpose of this theoretical and empirical study was to identify instruments for verifying whether a dataset is suitable for segmenting the observations it contains. The paper proposes a novel way to deal with mixed datasets containing categorical and continuous attributes in a customer segmentation task. The categorical variables were embedded using an unsupervised model based on an autoencoder. The customers were then divided into groups by several clustering algorithms operating on similarity matrices. In addition to the classic k-means method and the more modern DBSCAN, three graph algorithms were used: the Louvain algorithm, the greedy algorithm and the label propagation algorithm. The research was conducted on two datasets: one containing retail customers and the other containing wholesale customers. The Calinski–Harabasz index, Davies–Bouldin index, normalized mutual information (NMI), Fowlkes–Mallows index and silhouette score were used to assess the quality of the clustering. The modularity of the partitions produced by the graph methods proved to be a good indicator of whether a given dataset could be meaningfully divided into groups.
first_indexed 2024-03-11T07:31:34Z
id doaj.art-2c60b35dc3b64a238f92f2d51b65ba8f
institution Directory of Open Access Journals
issn 2076-3417
last_indexed 2024-03-11T07:31:34Z
doi 10.3390/app13052942
volume 13
issue 5
article_number 2942
affiliations Alicja Rachwał: Faculty of Electrical Engineering and Computer Science, Lublin University of Technology, 20-618 Lublin, Poland
Emilia Popławska: Faculty of Technology Fundamentals, Lublin University of Technology, 20-618 Lublin, Poland
Izolda Gorgol: Faculty of Technology Fundamentals, Lublin University of Technology, 20-618 Lublin, Poland
Tomasz Cieplak: Faculty of Management, Lublin University of Technology, 20-618 Lublin, Poland
Damian Pliszczuk: Netrix S.A. Research and Development Center, 20-704 Lublin, Poland
Łukasz Skowron: Faculty of Management, Lublin University of Technology, 20-618 Lublin, Poland
Tomasz Rymarczyk: Netrix S.A. Research and Development Center, 20-704 Lublin, Poland
topic artificial intelligence
statistical learning
machine learning
decision-making based on data-driven models
data set quality
clustering