Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain

Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six widely-used models in the medical...

Bibliographic Details
Main Authors: Althnian, Alhanoof; AlSaeed, Duaa; Al-Baity, Heyam; Samha, Amani; Dris, Alanoud Bin; Alzakari, Najla; Abou Elwafa, Afnan; Kurdi, Heba
Format: Article
Published: Multidisciplinary Digital Publishing Institute 2021
Online Access: https://hdl.handle.net/1721.1/131330
collection MIT
description Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six models widely used in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, F-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model's robustness to limited data does not necessarily imply that it provides the best performance compared to other models.
first_indexed 2024-09-23T15:09:46Z
format Article
id mit-1721.1/131330
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T15:09:46Z
publishDate 2021
publisher Multidisciplinary Digital Publishing Institute
record_format dspace
spelling mit-1721.1/131330 2021-09-21T03:14:04Z 2021-09-20T14:16:14Z 2021-09-20T14:16:14Z 2021-01-15 2021-01-22T15:59:33Z Article http://purl.org/eprint/type/JournalArticle https://hdl.handle.net/1721.1/131330 Applied Sciences 11 (2): 796 (2021) PUBLISHER_CC http://dx.doi.org/10.3390/app11020796 Creative Commons Attribution https://creativecommons.org/licenses/by/4.0/ application/pdf Multidisciplinary Digital Publishing Institute
title Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain