Diagnosis of acute myeloid leukaemia on microarray gene expression data using categorical gradient boosted trees

We define an iterative method for dimensionality reduction using categorical gradient boosted trees and Shapley values, and create four machine learning models that could potentially be used as diagnostic tests for acute myeloid leukaemia (AML). For the final CatBoost model we use a dataset of 2177...

Bibliographic Details
Main Authors: Athanasios Angelakis, Ioanna Soulioti, Michael Filippakis
Format: Article
Language: English
Published: Elsevier 2023-10-01
Series: Heliyon
Online Access: http://www.sciencedirect.com/science/article/pii/S2405844023077381
collection DOAJ
description We define an iterative method for dimensionality reduction using categorical gradient boosted trees and Shapley values, and create four machine learning models that could potentially be used as diagnostic tests for acute myeloid leukaemia (AML). The final CatBoost model uses a dataset of 2177 individuals, with 16 probe sets and age as features, to classify whether a person has AML or is healthy. The dataset is multicentric, comprising data from 27 organizations, 25 cities, 15 countries and 4 continents. Under ten-fold cross-validation, the final model achieves specificity 0.9909, sensitivity 0.9985, F1-score 0.9976 and ROC-AUC 0.9962. On an inference dataset, its performance is specificity 0.9909, sensitivity 0.9969, F1-score 0.9969 and ROC-AUC 0.9939. To the best of our knowledge, this is the best performance reported in the literature for the diagnosis of AML, whether on similar data or on other kinds of data. Moreover, no bibliographic reference has so far associated AML, or any other type of cancer, with the 16 probe sets used as features in our final model.
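The description reports specificity, sensitivity and F1-score. As a reading aid, the minimal Python sketch below shows how these three quantities follow from confusion-matrix counts, assuming AML is treated as the positive class. The function name `binary_metrics` and the counts passed to it are hypothetical illustrations; they are not taken from the article and do not reproduce its results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return specificity, sensitivity and F1 from confusion-matrix counts.

    tp/fn: positive cases (here assumed to be AML) classified right/wrong.
    tn/fp: negative cases (healthy) classified right/wrong.
    """
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # share of positive calls that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"specificity": specificity, "sensitivity": sensitivity, "f1": f1}


# Hypothetical counts for illustration only; not data from the article.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

ROC-AUC is left out of the sketch because it depends on the ranking of predicted scores rather than on a single set of counts.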
first_indexed 2024-03-11T15:04:18Z
format Article
id doaj.art-e188e06b8e4841d4a75f70166f3fd500
institution Directory Open Access Journal
issn 2405-8440
language English
last_indexed 2024-03-11T15:04:18Z
publishDate 2023-10-01
publisher Elsevier
record_format Article
series Heliyon
spelling doaj.art-e188e06b8e4841d4a75f70166f3fd500 2023-10-30T06:06:28Z eng Elsevier Heliyon 2405-8440 2023-10-01 9 10 e20530
Diagnosis of acute myeloid leukaemia on microarray gene expression data using categorical gradient boosted trees
Athanasios Angelakis (0), Ioanna Soulioti (1), Michael Filippakis (2)
(0) Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam Public Health Research Institute, University of Amsterdam Data Science Center, Netherlands; Corresponding author.
(1) Department of Biology, National and Kapodistrian University of Athens, Greece
(2) Department of Digital Systems, University of Piraeus, Greece
http://www.sciencedirect.com/science/article/pii/S2405844023077381
title Diagnosis of acute myeloid leukaemia on microarray gene expression data using categorical gradient boosted trees
url http://www.sciencedirect.com/science/article/pii/S2405844023077381