On the Quality of Synthetic Generated Tabular Data

Bibliographic Details
Main Authors: Erica Espinosa, Alvaro Figueira
Format: Article
Language: English
Published: MDPI AG 2023-07-01
Series: Mathematics
Online Access: https://www.mdpi.com/2227-7390/11/15/3278
Collection: DOAJ
Description: Class imbalance is a common issue while developing classification models. In order to tackle this problem, synthetic data have recently been developed to enhance the minority class. These artificially generated samples aim to bolster the representation of the minority class. However, evaluating the suitability of such generated data is crucial to ensure their alignment with the original data distribution. Utility measures come into play here to quantify how similar the distribution of the generated data is to the original one. For tabular data, there are various evaluation methods that assess different characteristics of the generated data. In this study, we collected utility measures and categorized them based on the type of analysis they performed. We then applied these measures to synthetic data generated from two well-known datasets, Adults Income, and Liar+. We also used five well-known generative models, Borderline SMOTE, DataSynthesizer, CTGAN, CopulaGAN, and REaLTabFormer, to generate the synthetic data and evaluated its quality using the utility measures. The measurements have proven to be informative, indicating that if one synthetic dataset is superior to another in terms of utility measures, it will be more effective as an augmentation for the minority class when performing classification tasks.
Record ID: doaj.art-f910d4837d464af1a3f7f1466cf4f8fb
Institution: Directory Open Access Journal
ISSN: 2227-7390
First Indexed: 2024-03-11T00:23:00Z
Last Indexed: 2024-03-11T00:23:00Z
Citation: Erica Espinosa, Alvaro Figueira. On the Quality of Synthetic Generated Tabular Data. Mathematics, vol. 11, no. 15, article 3278 (2023-07-01). MDPI AG.
DOI: 10.3390/math11153278
Affiliations: Erica Espinosa, Department of Mathematics Engineering, Politecnico di Milano, 20133 Milan, Italy; Alvaro Figueira, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
Subjects: utility measures; synthetic data; class imbalance; tabular data