GBDTLRL2D Predicts LncRNA–Disease Associations Using MetaGraph2Vec and K-Means Based on Heterogeneous Network


Bibliographic Details
Main Authors: Tao Duan, Zhufang Kuang, Jiaqi Wang, Zhihao Ma
Format: Article
Language: English
Published: Frontiers Media S.A. 2021-12-01
Series: Frontiers in Cell and Developmental Biology
Online Access: https://www.frontiersin.org/articles/10.3389/fcell.2021.753027/full
collection DOAJ
description In recent years, long noncoding RNAs (lncRNAs) have been shown to be involved in many disease processes. Predicting lncRNA–disease associations helps to clarify the mechanisms of disease occurrence and may suggest new methods of disease prevention and treatment. Current methods for predicting potential lncRNA–disease associations seldom consider heterogeneous networks with complex node paths, and they suffer from an imbalance between positive and negative samples. To address these problems, this paper proposes GBDTLRL2D, a method based on the Gradient Boosting Decision Tree (GBDT) and logistic regression (LR) for predicting lncRNA–disease associations. MetaGraph2Vec is used for feature learning, and the negative sample set is selected using K-means clustering. The innovation of GBDTLRL2D is that a clustering algorithm is used to select a representative negative sample set, while MetaGraph2Vec better retains the semantic and structural features of the heterogeneous network. In 10-fold cross-validation, the average area under the receiver operating characteristic curve (AUC) of GBDTLRL2D on the three datasets is 0.98, 0.98, and 0.96.
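The description says only that the negative sample set is selected with K-means clustering; it does not give the exact procedure. The following is a minimal sketch of one plausible reading, under stated assumptions: each unlabeled lncRNA–disease pair is represented by a feature vector (for example, a concatenation of learned embeddings), the vectors are clustered, and negatives are drawn evenly across clusters so that the selected set covers the feature space. The function names `kmeans` and `select_negatives`, the round-robin draw, and all parameters are hypothetical and are not taken from the paper.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each non-empty cluster's centroid to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


def select_negatives(unlabeled, n_negatives, k, seed=0):
    """Pick indices of unlabeled pairs, drawing evenly from each cluster.

    Assumption: an even, round-robin draw per cluster is used here as a
    stand-in for "representative"; the source does not specify the rule.
    """
    rng = random.Random(seed)
    labels = kmeans(unlabeled, k, seed=seed)
    clusters = [[i for i, lab in enumerate(labels) if lab == c] for c in range(k)]
    for members in clusters:
        rng.shuffle(members)
    chosen = []
    # Take one index per cluster in turn until the quota is met
    # or every cluster is exhausted.
    while len(chosen) < n_negatives and any(clusters):
        for members in clusters:
            if members and len(chosen) < n_negatives:
                chosen.append(members.pop())
    return sorted(chosen)
```

In a balanced setup, `n_negatives` would typically be set to the number of known (positive) associations, and the selected pairs would then be combined with the positives to train the downstream classifier.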
first_indexed 2024-12-13T19:01:53Z
id doaj.art-37c21304a25e4d08afe6ac94f61e42d8
institution Directory of Open Access Journals
issn 2296-634X
last_indexed 2024-12-13T19:01:53Z
spelling doaj.art-37c21304a25e4d08afe6ac94f61e42d8; 2022-12-21T23:34:40Z; eng; Frontiers Media S.A.; Frontiers in Cell and Developmental Biology; ISSN 2296-634X; 2021-12-01; volume 9; DOI 10.3389/fcell.2021.753027; article 753027
topic long noncoding RNA
heterogeneous network
MetaGraph2Vec
K-means
Gradient Boosting Decision Tree
logistic regression