Doppelgänger spotting in biomedical gene expression data

Bibliographic Details
Main Authors: Li Rong Wang, Xin Yun Choy, Wilson Wen Bin Goh
Format: Article
Language: English
Published: Elsevier 2022-08-01
Series: iScience
Subjects: Bioinformatics, Genomics, Human Genetics
Online Access: http://www.sciencedirect.com/science/article/pii/S2589004222010604
author Li Rong Wang
Xin Yun Choy
Wilson Wen Bin Goh
collection DOAJ
description Summary: Doppelgänger effects (DEs) occur when samples exhibit chance similarities that, when the samples are split across training and validation sets, inflate the performance of the trained machine learning (ML) model. This inflation creates misleading confidence in the deployability of the model. Thus far, there are no tools for doppelgänger identification or standard practices to manage their confounding implications. We present doppelgangerIdentifier, a software suite for doppelgänger identification and verification. Applying doppelgangerIdentifier across a multitude of diseases and data types, we show the pervasive nature of DEs in biomedical gene expression data. We also provide guidelines toward proper doppelgänger identification by exploring how lingering batch effects from batch imbalances affect the sensitivity of our doppelgänger identification algorithm. We suggest doppelgänger verification as a useful procedure to establish baselines for model evaluation that may indicate whether feature selection and ML on the data set can yield meaningful insights.
first_indexed 2024-12-10T20:18:21Z
format Article
id doaj.art-5d1e1ca92eb7484f86a83b7332274446
institution Directory of Open Access Journals
issn 2589-0042
language English
last_indexed 2024-12-10T20:18:21Z
publishDate 2022-08-01
publisher Elsevier
record_format Article
series iScience
spelling doaj.art-5d1e1ca92eb7484f86a83b7332274446 2022-12-22T01:35:07Z eng Elsevier iScience 2589-0042 2022-08-01 25 8 104788
Doppelgänger spotting in biomedical gene expression data
Li Rong Wang (0), Xin Yun Choy (1), Wilson Wen Bin Goh (2)
(0) School of Computer Science and Engineering, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore
(1) School of Computer Science and Engineering, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore
(2) School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore; Lee Kong Chian School of Medicine, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore; Centre for Biomedical Informatics, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore; Corresponding author
http://www.sciencedirect.com/science/article/pii/S2589004222010604
Bioinformatics; Genomics; Human Genetics
title Doppelgänger spotting in biomedical gene expression data
topic Bioinformatics
Genomics
Human Genetics
url http://www.sciencedirect.com/science/article/pii/S2589004222010604