Classification of breast cancer recurrence based on imputed data: a simulation study

Abstract Several studies have been conducted to classify various real-life events, but few are in medical fields, particularly on breast cancer recurrence using statistical techniques. To our knowledge, there is no reported comparison of statistical classification accuracy and classifiers’ discriminative...


Bibliographic Details
Main Authors: Rahibu A. Abassi, Amina S. Msengwa
Format: Article
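Language: English
Published: BMC 2022-12-01
Series: BioData Mining
Subjects: Classification accuracy; Imputed data; Missing data mechanisms; Missingness percentages; Simulation
Online Access: https://doi.org/10.1186/s13040-022-00316-8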
author Rahibu A. Abassi
Amina S. Msengwa
collection DOAJ
description Abstract Several studies have been conducted to classify various real-life events, but few are in medical fields, particularly on breast cancer recurrence using statistical techniques. To our knowledge, no published study compares the classification accuracy and discriminative ability of statistical classifiers for breast cancer recurrence in the presence of imputed missing data. This article aims to fill that gap by comparing the performance of three binary classifiers (logistic regression, linear discriminant analysis, and quadratic discriminant analysis) on datasets produced by imputation under various simulation conditions. The study adds to what is known about how imputed numerical data affect classifiers’ accuracy and discriminative ability when classifying a binary outcome variable. We simulated incomplete datasets with 15, 30, 45 and 60% missingness under the Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. The incomplete datasets were imputed using mean imputation, hot deck, k-nearest neighbour, multiple imputation by chained equations, expectation-maximisation, and predictive mean matching. For each classifier, classification accuracy and the area under the Receiver Operating Characteristic (ROC) curve were compared under the MAR and MCAR mechanisms. The linear discriminant classifier attained the highest classification accuracy (73.9%), on mean-imputed data at 45% missingness under the MCAR mechanism. Logistic regression on data imputed by predictive mean matching yielded the greatest area under the ROC curve (0.6418) at 30% missingness, while k-nearest neighbour yielded the highest value (0.6428) at 60% missingness under the MCAR mechanism.
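The simulation design summarised in the abstract (inject missingness at a chosen rate, impute, then score the result) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `make_mcar`, `make_mar`, `mean_impute` and `auc`, the toy Gaussian data, and the rule that rows above the median of an observed column are three times as likely to lose a value are all assumptions made for illustration. Only mean imputation is shown, and the imputed feature itself stands in for a classifier score.

```python
import random

def make_mcar(rows, col, rate, rng):
    """MCAR: blank a fixed share of column `col`, chosen uniformly at random."""
    n_missing = round(rate * len(rows))
    hit = rng.sample(range(len(rows)), n_missing)
    out = [list(r) for r in rows]          # copy so the input is not mutated
    for i in hit:
        out[i][col] = None
    return out

def make_mar(rows, col, driver, rate, rng):
    """MAR: missingness in `col` depends only on the fully observed `driver` column."""
    n_missing = round(rate * len(rows))
    median = sorted(r[driver] for r in rows)[len(rows) // 2]
    # Assumed rule: rows above the driver median get weight 3, the rest weight 1.
    weights = [3.0 if r[driver] > median else 1.0 for r in rows]
    # Weighted sampling without replacement via random keys u ** (1 / w).
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    hit = [i for _, i in sorted(keys, reverse=True)[:n_missing]]
    out = [list(r) for r in rows]
    for i in hit:
        out[i][col] = None
    return out

def mean_impute(rows, col):
    """Replace each missing value in `col` with the mean of the observed values."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [[mean if (j == col and v is None) else v for j, v in enumerate(r)]
            for r in rows]

def auc(scores, labels):
    """Area under the ROC curve as the share of (positive, negative) pairs
    ranked correctly, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): column 0 is the feature to damage, column 1 an
# always-observed covariate, column 2 the binary outcome.
rng = random.Random(0)
rows = [[rng.gauss(y, 1.0), rng.gauss(0.0, 1.0), y] for y in (0, 1) for _ in range(50)]

for rate in (0.15, 0.30, 0.45, 0.60):      # missingness levels named in the abstract
    for name, damaged in (("MCAR", make_mcar(rows, 0, rate, rng)),
                          ("MAR", make_mar(rows, 0, 1, rate, rng))):
        imputed = mean_impute(damaged, 0)
        score = auc([r[0] for r in imputed], [r[2] for r in imputed])
        print(f"{name} {rate:.0%}: AUC after mean imputation = {score:.3f}")
```

Swapping `mean_impute` for another imputation routine, or replacing the raw feature with the fitted score of a classifier such as logistic regression, would follow the same loop structure.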
first_indexed 2024-04-11T06:10:07Z
format Article
id doaj.art-c8a912b8a2994b56b89c5cd6dd290cc9
institution Directory of Open Access Journals
issn 1756-0381
language English
last_indexed 2024-04-11T06:10:07Z
publishDate 2022-12-01
publisher BMC
record_format Article
series BioData Mining
spelling doaj.art-c8a912b8a2994b56b89c5cd6dd290cc9
2022-12-22T04:41:19Z
eng
BMC
BioData Mining
1756-0381
2022-12-01
151113
10.1186/s13040-022-00316-8
Classification of breast cancer recurrence based on imputed data: a simulation study
Rahibu A. Abassi (Department of Natural Sciences, State University of Zanzibar)
Amina S. Msengwa (Department of Statistics, University of Dar es Salaam)
https://doi.org/10.1186/s13040-022-00316-8
Classification accuracy
Imputed data
Missing data mechanisms
Missingness percentages
Simulation
title Classification of breast cancer recurrence based on imputed data: a simulation study
topic Classification accuracy
Imputed data
Missing data mechanisms
Missingness percentages
Simulation
url https://doi.org/10.1186/s13040-022-00316-8