Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain
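Main Authors: | Althnian, Alhanoof; AlSaeed, Duaa; Al-Baity, Heyam; Samha, Amani; Dris, Alanoud Bin; Alzakari, Najla; Abou Elwafa, Afnan; Kurdi, Heba |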
---|---|
Format: | Article |
Published: | Multidisciplinary Digital Publishing Institute, 2021 |
Online Access: | https://hdl.handle.net/1721.1/131330 |
_version_ | 1826211676752445440 |
---|---|
author | Althnian, Alhanoof; AlSaeed, Duaa; Al-Baity, Heyam; Samha, Amani; Dris, Alanoud Bin; Alzakari, Najla; Abou Elwafa, Afnan; Kurdi, Heba |
author_facet | Althnian, Alhanoof; AlSaeed, Duaa; Al-Baity, Heyam; Samha, Amani; Dris, Alanoud Bin; Alzakari, Najla; Abou Elwafa, Afnan; Kurdi, Heba |
author_sort | Althnian, Alhanoof |
collection | MIT |
description | Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six models widely used in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, f-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model's robustness to limited data does not necessarily imply that it provides the best performance compared to other models. |
first_indexed | 2024-09-23T15:09:46Z |
format | Article |
id | mit-1721.1/131330 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T15:09:46Z |
publishDate | 2021 |
publisher | Multidisciplinary Digital Publishing Institute |
record_format | dspace |
spelling | mit-1721.1/131330; 2021-09-21T03:14:04Z; Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain; Althnian, Alhanoof; AlSaeed, Duaa; Al-Baity, Heyam; Samha, Amani; Dris, Alanoud Bin; Alzakari, Najla; Abou Elwafa, Afnan; Kurdi, Heba; 2021-09-20T14:16:14Z; 2021-09-20T14:16:14Z; 2021-01-15; 2021-01-22T15:59:33Z; Article; http://purl.org/eprint/type/JournalArticle; https://hdl.handle.net/1721.1/131330; Applied Sciences 11 (2): 796 (2021); PUBLISHER_CC; http://dx.doi.org/10.3390/app11020796; Creative Commons Attribution; https://creativecommons.org/licenses/by/4.0/; application/pdf; Multidisciplinary Digital Publishing Institute |
title | Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain |
title_sort | impact of dataset size on classification performance an empirical evaluation in the medical domain |
url | https://hdl.handle.net/1721.1/131330 |
work_keys_str_mv | AT althnianalhanoof impactofdatasetsizeonclassificationperformanceanempiricalevaluationinthemedicaldomain AT alsaeedduaa impactofdatasetsizeonclassificationperformanceanempiricalevaluationinthemedicaldomain AT albaityheyam impactofdatasetsizeonclassificationperformanceanempiricalevaluationinthemedicaldomain AT samhaamani impactofdatasetsizeonclassificationperformanceanempiricalevaluationinthemedicaldomain AT drisalanoudbin impactofdatasetsizeonclassificationperformanceanempiricalevaluationinthemedicaldomain AT alzakarinajla impactofdatasetsizeonclassificationperformanceanempiricalevaluationinthemedicaldomain AT abouelwafaafnan impactofdatasetsizeonclassificationperformanceanempiricalevaluationinthemedicaldomain AT kurdiheba impactofdatasetsizeonclassificationperformanceanempiricalevaluationinthemedicaldomain |
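
The description above names six evaluation measures: accuracy, precision, recall, f-score, specificity, and AUC. As a reading aid, the following is a minimal sketch of how these measures are commonly computed for a binary classifier. It is not the authors' code: the function names `binary_metrics` and `rank_auc` are hypothetical, the f-score is assumed to be the F1 variant, and the AUC uses the standard pairwise-ranking formulation, since the record does not say how the study computed either.

```python
def binary_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from one binary confusion matrix.

    tp, fp, tn, fn are counts of true positives, false positives,
    true negatives, and false negatives. Undefined ratios return 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Assumption: "f-score" means F1, the harmonic mean of precision and recall.
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score,
    }


def rank_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise ranking.

    Equals the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Illustrative counts and scores only; these are not results from the study.
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
    print(rank_auc([0.9, 0.8, 0.4], [0.7, 0.3]))
```

Reporting specificity and AUC alongside accuracy matters for small medical datasets, where class imbalance can make accuracy alone misleading.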