Determining the Quality of a Dataset in Clustering Terms

The purpose of the theoretical considerations and research conducted was to indicate the instruments with which the quality of a dataset can be verified for the segmentation of observations occurring in the dataset. The paper proposes a novel way to deal with mixed datasets containing categorical an...


Bibliographic Details
Main Authors:	Alicja Rachwał, Emilia Popławska, Izolda Gorgol, Tomasz Cieplak, Damian Pliszczuk, Łukasz Skowron, Tomasz Rymarczyk
Format:	Article
Language:	English
Published:	MDPI AG 2023-02-01
Series:	Applied Sciences
Subjects:	artificial intelligence; statistical learning; machine learning; decision-making based on data-driven models; data set quality; clustering
Online Access:	https://www.mdpi.com/2076-3417/13/5/2942

_version_	1797615773691150336
author	Alicja Rachwał, Emilia Popławska, Izolda Gorgol, Tomasz Cieplak, Damian Pliszczuk, Łukasz Skowron, Tomasz Rymarczyk
author_sort	Alicja Rachwał
collection	DOAJ
description	The aim of the theoretical considerations and the research conducted was to identify instruments for verifying the quality of a dataset with respect to segmenting the observations it contains. The paper proposes a novel way to deal with mixed datasets containing categorical and continuous attributes in a customer segmentation task. The categorical variables were embedded using an innovative unsupervised model based on an autoencoder. The customers were then divided into groups using different clustering algorithms, based on similarity matrices. In addition to the classic k-means method and the more modern DBSCAN, three graph algorithms were used: the Louvain algorithm, the greedy algorithm and the label propagation algorithm. The research was conducted on two datasets: one containing retail customers and the other containing wholesale customers. The Calinski–Harabasz index, Davies–Bouldin index, NMI index, Fowlkes–Mallows index and silhouette score were used to assess the quality of the clustering. It was noted that the modularity parameter for graph methods was a good indicator of whether a given set could be meaningfully divided into groups.
first_indexed	2024-03-11T07:31:34Z
format	Article
id	doaj.art-2c60b35dc3b64a238f92f2d51b65ba8f
institution	Directory Open Access Journal
issn	2076-3417
language	English
last_indexed	2024-03-11T07:31:34Z
publishDate	2023-02-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-2c60b35dc3b64a238f92f2d51b65ba8f; 2023-11-17T07:17:09Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2023-02-01; 13; 5; 2942; 10.3390/app13052942; Determining the Quality of a Dataset in Clustering Terms; Alicja Rachwał (Faculty of Electrical Engineering and Computer Science, Lublin University of Technology, 20-618 Lublin, Poland); Emilia Popławska (Faculty of Technology Fundamentals, Lublin University of Technology, 20-618 Lublin, Poland); Izolda Gorgol (Faculty of Technology Fundamentals, Lublin University of Technology, 20-618 Lublin, Poland); Tomasz Cieplak (Faculty of Management, Lublin University of Technology, 20-618 Lublin, Poland); Damian Pliszczuk (Netrix S.A. Research and Development Center, 20-704 Lublin, Poland); Łukasz Skowron (Faculty of Management, Lublin University of Technology, 20-618 Lublin, Poland); Tomasz Rymarczyk (Netrix S.A. Research and Development Center, 20-704 Lublin, Poland); https://www.mdpi.com/2076-3417/13/5/2942; artificial intelligence; statistical learning; machine learning; decision-making based on data-driven models; data set quality; clustering
title	Determining the Quality of a Dataset in Clustering Terms
topic	artificial intelligence; statistical learning; machine learning; decision-making based on data-driven models; data set quality; clustering
url	https://www.mdpi.com/2076-3417/13/5/2942

Determining the Quality of a Dataset in Clustering Terms
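
The record's description notes that the modularity of a graph partition indicated whether a dataset could be meaningfully divided into groups. The following is a minimal sketch of that idea, not the authors' implementation: it builds a similarity matrix with a Gaussian kernel (an assumption; the record does not say which similarity measure was used) and computes Newman modularity for a given grouping. The function names, the toy points and the `sigma` parameter are all hypothetical and chosen only for illustration.

```python
import math


def similarity_matrix(points, sigma=1.0):
    """Gaussian-kernel similarity between points (assumed kernel), zero diagonal."""
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            sim[i][j] = sim[j][i] = math.exp(-d2 / (2 * sigma ** 2))
    return sim


def modularity(sim, labels):
    """Newman modularity Q of a partition on a weighted, undirected graph.

    Q = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * [labels_i == labels_j]
    """
    n = len(sim)
    degree = [sum(row) for row in sim]  # weighted degree k_i
    two_m = sum(degree)                 # twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += sim[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Toy data: two well-separated groups versus one tight, structureless group.
separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
flat = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (0.4, 0.0), (0.5, 0.0)]
labels = [0, 0, 0, 1, 1, 1]

q_separated = modularity(similarity_matrix(separated), labels)
q_flat = modularity(similarity_matrix(flat), labels)
print(f"separated: Q = {q_separated:.3f}")
print(f"flat:      Q = {q_flat:.3f}")
```

In this toy setting the well-separated points yield a clearly higher modularity than the structureless ones under the same split, which is the kind of signal the description refers to. On real data the partition would come from a community detection algorithm such as Louvain rather than from fixed labels.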

Similar Items