The impact of different negative training data on regulatory sequence predictions.


Bibliographic Details
Main Authors: Louisa-Marie Krützfeldt, Max Schubach, Martin Kircher
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0237412
Collection: DOAJ (Directory of Open Access Journals)
Description: Regulatory regions, like promoters and enhancers, cover an estimated 5-15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell type, predicting cell-type-specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing the best overall results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes, without the need for hyperparameter optimization.
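The shuffle-based negative sets described above can be illustrated with a minimal sketch. Note this is an assumption-laden toy: it generates one negative per positive via a per-sequence mononucleotide shuffle, which preserves base composition and length but not dinucleotide frequencies; the study compares shuffles at different granularities, and the function name and interface here are hypothetical, not from the paper.

```python
import random

def shuffle_negatives(positives, seed=0):
    """Generate one shuffled negative per positive sequence.

    Illustrative only: a mononucleotide shuffle keeps each negative's
    length and base composition identical to its positive, so a model
    separating the two classes must rely on sequence order, not
    composition. (Dinucleotide-preserving shuffles are stricter.)
    """
    rng = random.Random(seed)  # fixed seed for reproducible negatives
    negatives = []
    for seq in positives:
        bases = list(seq)
        rng.shuffle(bases)
        negatives.append("".join(bases))
    return negatives

positives = ["ACGTACGTAC", "TTGACCAGGA"]
negatives = shuffle_negatives(positives)
```

Genomic-background negatives, by contrast, are real sequences sampled elsewhere in the genome; the abstract's observation that insufficient matching introduces biases suggests such sampling should be matched to the positives (e.g., in composition or chromosomal context).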
Record ID: doaj.art-a1fc825d71454cb08de3f646ae5e311b
ISSN: 1932-6203