Enhancing Small Tabular Clinical Trial Dataset through Hybrid Data Augmentation: Combining SMOTE and WCGAN-GP

This study addressed the challenge of training generative adversarial networks (GANs) on small tabular clinical trial datasets for data augmentation, which are known to pose difficulties in training due to limited sample sizes. To overcome this obstacle, a hybrid approach is proposed, combining the...

Bibliographic Details
Main Authors: Winston Wang, Tun-Wen Pai
Format: Article
Language: English
Published: MDPI AG 2023-08-01
Series: Data
Subjects: clinical trial; GAN; multiple sclerosis; small tabular dataset; SMOTE; WCGAN-GP
Online Access: https://www.mdpi.com/2306-5729/8/9/135
_version_ 1797580643590209536
author Winston Wang
Tun-Wen Pai
collection DOAJ
description This study addresses the difficulty of training generative adversarial networks (GANs) for data augmentation on small tabular clinical trial datasets, where limited sample sizes make training difficult. To overcome this obstacle, a hybrid approach is proposed: the synthetic minority oversampling technique (SMOTE) first augments the original data to a more substantial size, and the enlarged dataset is then used to train a Wasserstein conditional generative adversarial network with gradient penalty (WCGAN-GP), known for its state-of-the-art performance and enhanced stability. The ultimate objective was to demonstrate that the synthetic tabular data generated by the final WCGAN-GP model maintain the structural integrity and statistical representation of the original small dataset. This focus is particularly relevant for clinical trials, where limited data availability due to privacy concerns and restricted access to subject enrollment are common challenges. Despite the limited data, the findings demonstrate that the hybrid approach generates synthetic data that closely preserve the characteristics of the original small dataset. By generating faithful synthetic data, the hybrid approach shows potential for enhancing data-driven research in drug clinical trials. This includes enabling robust analysis of small datasets, supplementing scarce clinical trial data, facilitating machine learning tasks, and even using the model for anomaly detection to improve quality control during clinical trial data collection, all while prioritizing data privacy and implementing strict data protection measures.
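The first stage of the hybrid approach described above, enlarging a small table by interpolation before any GAN is trained, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `smote_like_oversample`, the choice of `k=5`, and the toy three-column table are all assumptions made for illustration, and the sketch ignores class labels and categorical columns, which a real SMOTE implementation would handle. In the described pipeline, the enlarged table would then serve as training data for the WCGAN-GP; that stage is not shown here.

```python
import numpy as np


def smote_like_oversample(X, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a sampled row
    and one of its k nearest neighbours (the core idea behind SMOTE).
    Hypothetical helper for illustration; numeric columns only."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    k = min(k, n - 1)  # cannot have more neighbours than other rows

    # Pairwise Euclidean distances; a row is never its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base row and one of its neighbours for every synthetic sample.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])


# Toy stand-in for a small numeric trial table (made-up values, 30 rows).
toy_rng = np.random.default_rng(42)
X_small = toy_rng.normal(loc=[50.0, 120.0, 2.5], scale=[10.0, 15.0, 0.5], size=(30, 3))

X_synth = smote_like_oversample(X_small, n_new=300)
X_aug = np.vstack([X_small, X_synth])  # enlarged set for later GAN training
```

Because every synthetic row is a convex combination of two real rows, the augmented table stays inside the per-column range of the original data, which is one reason interpolation is a conservative way to enlarge a small dataset.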
first_indexed 2024-03-10T22:53:47Z
format Article
id doaj.art-ff50ad3621ce472d85dc39a9b241e25f
institution Directory Open Access Journal
issn 2306-5729
language English
last_indexed 2024-03-10T22:53:47Z
publishDate 2023-08-01
publisher MDPI AG
record_format Article
series Data
spelling doaj.art-ff50ad3621ce472d85dc39a9b241e25f
2023-11-19T10:11:36Z
eng
MDPI AG
Data
2306-5729
2023-08-01
Volume 8, Issue 9, Article 135
10.3390/data8090135
Enhancing Small Tabular Clinical Trial Dataset through Hybrid Data Augmentation: Combining SMOTE and WCGAN-GP
Winston Wang, Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
Tun-Wen Pai, Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
https://www.mdpi.com/2306-5729/8/9/135
clinical trial; GAN; multiple sclerosis; small tabular dataset; SMOTE; WCGAN-GP
title Enhancing Small Tabular Clinical Trial Dataset through Hybrid Data Augmentation: Combining SMOTE and WCGAN-GP
topic clinical trial
GAN
multiple sclerosis
small tabular dataset
SMOTE
WCGAN-GP
url https://www.mdpi.com/2306-5729/8/9/135