A pitfall for machine learning methods aiming to predict across cell types



Bibliographic Details
Main Authors: Jacob Schreiber, Ritambhara Singh, Jeffrey Bilmes, William Stafford Noble
Format: Article
Language: English
Published: BMC 2020-11-01
Series: Genome Biology
Subjects: Machine learning, Epigenomics, Genomics
Online Access: http://link.springer.com/article/10.1186/s13059-020-02177-y
collection DOAJ
description Abstract Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.
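The pitfall described in the abstract can be illustrated with a minimal sketch on synthetic data. This is not the authors' code: the data-generating process, the array sizes, and the helper names `average_activity_baseline` and `pearson` are assumptions made purely for illustration. The sketch builds an activity matrix in which most of the variance is shared by a locus across cell types, then scores a "model" that simply returns the per-locus average over the training cell types.

```python
import numpy as np


def average_activity_baseline(train):
    """Predict each locus as its mean activity over the training cell types."""
    return train.mean(axis=0)


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(0)
n_cell_types, n_loci = 10, 5000  # hypothetical sizes

# Assumed synthetic data: a strong per-locus effect shared by every cell type,
# plus a weaker cell-type-specific deviation.
locus_effect = rng.normal(0.0, 1.0, size=n_loci)
cell_specific = rng.normal(0.0, 0.3, size=(n_cell_types, n_loci))
activity = locus_effect + cell_specific  # shape: (cell types, loci)

# Hold out one cell type, but evaluate on the SAME loci seen in training.
train, test = activity[:-1], activity[-1]
baseline = average_activity_baseline(train)

# Correlation of the baseline with the held-out cell type across all loci.
global_r = pearson(baseline, test)

# Correlation of the baseline with the part of the held-out cell type that
# differs from the training average (the cell-type-specific signal).
residual_r = pearson(baseline, test - baseline)

print(f"baseline vs. held-out cell type: r = {global_r:.3f}")
print(f"baseline vs. cell-type-specific deviation: r = {residual_r:.3f}")
```

Under these assumptions the baseline correlates strongly with the held-out cell type even though it knows nothing about that cell type, while it is nearly uncorrelated with the cell-type-specific deviation. A model evaluated on shared loci is therefore only informative to the extent that it beats such a baseline.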
first_indexed 2024-12-11T04:23:28Z
format Article
id doaj.art-cee53112831b480bbf7ea471980d7d3e
institution Directory Open Access Journal
issn 1474-760X
language English
last_indexed 2024-12-11T04:23:28Z
publishDate 2020-11-01
publisher BMC
record_format Article
series Genome Biology
spelling doaj.art-cee53112831b480bbf7ea471980d7d3e (2022-12-22T01:21:03Z)
Jacob Schreiber: Paul G. Allen School of Computer Science & Engineering, University of Washington
Ritambhara Singh: Department of Genome Science, University of Washington
Jeffrey Bilmes: Paul G. Allen School of Computer Science & Engineering, University of Washington
William Stafford Noble: Paul G. Allen School of Computer Science & Engineering, University of Washington
Genome Biology (BMC), ISSN 1474-760X, 2020-11-01, vol. 21, no. 1, pp. 1-6
doi 10.1186/s13059-020-02177-y
topic Machine learning
Epigenomics
Genomics
url http://link.springer.com/article/10.1186/s13059-020-02177-y