A Comparative Study of Supervised Machine Learning Algorithms for the Prediction of Long-Range Chromatin Interactions

The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-base...

Bibliographic Details
Main Authors:	Thomas Vanhaeren, Federico Divina, Miguel García-Torres, Francisco Gómez-Vela, Wim Vanhoof, Pedro Manuel Martínez-García
Format:	Article
Language:	English
Published:	MDPI AG 2020-08-01
Series:	Genes
Subjects:	machine-learning; chromatin interactions; prediction; genomics; genome architecture
Online Access:	https://www.mdpi.com/2073-4425/11/9/985

author	Thomas Vanhaeren, Federico Divina, Miguel García-Torres, Francisco Gómez-Vela, Wim Vanhoof, Pedro Manuel Martínez-García
collection	DOAJ
description	The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental data on chromatin interactions are not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, different approaches vary both in the way they model chromatin interactions and in the machine learning-based strategy they rely on, making it challenging to compare the performance of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best-suited algorithm for this task and highlights cell-type-specific binding of transcription factors at the anchors as important determinants of chromatin wiring mediated by cohesin.
first_indexed	2024-03-10T16:54:16Z
format	Article
id	doaj.art-c08c8ae053f548348da45c5a2ec0121e
institution	Directory Open Access Journal
issn	2073-4425
language	English
last_indexed	2024-03-10T16:54:16Z
publishDate	2020-08-01
publisher	MDPI AG
record_format	Article
series	Genes
spelling	doaj.art-c08c8ae053f548348da45c5a2ec0121e; 2023-11-20T11:13:03Z; eng; MDPI AG; Genes; 2073-4425; 2020-08-01; vol. 11, no. 9, art. 985; 10.3390/genes11090985; A Comparative Study of Supervised Machine Learning Algorithms for the Prediction of Long-Range Chromatin Interactions; Thomas Vanhaeren (Division of Computer Science, Universidad Pablo de Olavide, 41013 Sevilla, Spain); Federico Divina (Division of Computer Science, Universidad Pablo de Olavide, 41013 Sevilla, Spain); Miguel García-Torres (Division of Computer Science, Universidad Pablo de Olavide, 41013 Sevilla, Spain); Francisco Gómez-Vela (Division of Computer Science, Universidad Pablo de Olavide, 41013 Sevilla, Spain); Wim Vanhoof (Faculty of Computer Science, University of Namur, 5000 Namur, Belgium); Pedro Manuel Martínez-García (Centro Andaluz de Biología Molecular y Medicina Regenerativa (CABIMER), CSIC-Universidad de Sevilla-Universidad Pablo de Olavide, 41092 Sevilla, Spain); https://www.mdpi.com/2073-4425/11/9/985; machine-learning; chromatin interactions; prediction; genomics; genome architecture
title	A Comparative Study of Supervised Machine Learning Algorithms for the Prediction of Long-Range Chromatin Interactions
topic	machine-learning; chromatin interactions; prediction; genomics; genome architecture
url	https://www.mdpi.com/2073-4425/11/9/985
