GBDTLRL2D Predicts LncRNA–Disease Associations Using MetaGraph2Vec and K-Means Based on Heterogeneous Network
In recent years, the long noncoding RNA (lncRNA) has been shown to be involved in many disease processes. The prediction of the lncRNA–disease association is helpful to clarify the mechanism of disease occurrence and bring some new methods of disease prevention and treatment. The current methods for...
Main Authors: | Tao Duan, Zhufang Kuang, Jiaqi Wang, Zhihao Ma |
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A., 2021-12-01 |
Series: | Frontiers in Cell and Developmental Biology |
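Subjects: | long noncoding RNA, heterogeneous network, MetaGraph2Vec, K-means, Gradient Boosting Decision Tree, logistic regression |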
Online Access: | https://www.frontiersin.org/articles/10.3389/fcell.2021.753027/full |
---|---|
author | Tao Duan Zhufang Kuang Jiaqi Wang Zhihao Ma |
author_sort | Tao Duan |
collection | DOAJ |
description | In recent years, long noncoding RNAs (lncRNAs) have been shown to be involved in many disease processes. Predicting lncRNA–disease associations helps clarify the mechanisms of disease occurrence and may suggest new methods of disease prevention and treatment. Current methods for predicting potential lncRNA–disease associations seldom consider heterogeneous networks with complex node paths, and they suffer from an imbalance between positive and negative samples. To address these problems, this paper proposes GBDTLRL2D, a method that combines the Gradient Boosting Decision Tree (GBDT) and logistic regression (LR) to predict lncRNA–disease associations. MetaGraph2Vec is used for feature learning, and negative sample sets are selected by K-means clustering. The innovation of GBDTLRL2D is twofold: the clustering algorithm selects a representative negative sample set, and MetaGraph2Vec better retains the semantic and structural features of heterogeneous networks. In 10-fold cross-validation, the average area under the receiver operating characteristic curve (AUC) values of GBDTLRL2D on the three datasets are 0.98, 0.98, and 0.96. |
first_indexed | 2024-12-13T19:01:53Z |
format | Article |
id | doaj.art-37c21304a25e4d08afe6ac94f61e42d8 |
institution | Directory Open Access Journal |
issn | 2296-634X |
language | English |
last_indexed | 2024-12-13T19:01:53Z |
publishDate | 2021-12-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Cell and Developmental Biology |
title | GBDTLRL2D Predicts LncRNA–Disease Associations Using MetaGraph2Vec and K-Means Based on Heterogeneous Network |
topic | long noncoding RNA heterogeneous network MetaGraph2Vec K-means Gradient Boosting Decision Tree logistic regression |
url | https://www.frontiersin.org/articles/10.3389/fcell.2021.753027/full |
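
The description above states that negative sample sets are selected by K-means clustering so that they are representative. The following is a minimal sketch of that idea, not the authors' implementation. It assumes each unlabeled lncRNA–disease pair is already a feature vector (for example, from MetaGraph2Vec) and that negatives are drawn evenly, round-robin, from each cluster. The record does not say how samples are picked within a cluster, so that rule, the function names, and the toy numbers are all hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input positions.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return labels


def select_negatives(unlabeled, n_negatives, k=3, seed=0):
    """Return sorted indices of up to n_negatives unlabeled samples,
    taken round-robin across clusters (an assumed selection rule)."""
    labels = kmeans(unlabeled, k, seed=seed)
    rng = random.Random(seed)
    clusters = [
        [i for i, label in enumerate(labels) if label == c] for c in range(k)
    ]
    for members in clusters:
        rng.shuffle(members)
    chosen = []
    while len(chosen) < n_negatives and any(clusters):
        for members in clusters:
            if members and len(chosen) < n_negatives:
                chosen.append(members.pop())
    return sorted(chosen)


# Toy feature vectors for unlabeled pairs (made-up numbers, not real data).
unlabeled = [
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
    [5.0, 5.1], [5.1, 4.9], [4.9, 5.0],
    [9.0, 0.1], [9.1, 0.0], [8.9, 0.2],
]
negatives = select_negatives(unlabeled, n_negatives=6, k=3)
print(negatives)
```

In a pipeline like the one described, `n_negatives` would typically be set to the number of known positive associations so that the two classes are balanced before training the GBDT and LR classifier.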