LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information
Background: Long non-coding RNAs (lncRNAs) are among the most important classes of transcripts: although they lack protein-coding ability, they play crucial regulatory roles in the development of cancers and other diseases. Short ORFs (sORFs) in lncRNAs were long assumed to have little capacity to be translated into proteins. However, recent re...
Main Authors: | Hongqi Feng, Shaocong Wang, Yan Wang, Xinye Ni, Zexi Yang, Xuemei Hu, Sen Yang |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2023-01-01 |
Series: | Computational and Structural Biotechnology Journal |
Subjects: | LncRNAs identification, Ensemble learning, ORF-attention features, Small ORF |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037023000582 |
_version_ | 1797384146001068032 |
---|---|
author | Hongqi Feng; Shaocong Wang; Yan Wang; Xinye Ni; Zexi Yang; Xuemei Hu; Sen Yang |
author_sort | Hongqi Feng |
collection | DOAJ |
description | Background: Long non-coding RNAs (lncRNAs) are among the most important classes of transcripts: although they lack protein-coding ability, they play crucial regulatory roles in the development of cancers and other diseases. Short ORFs (sORFs) in lncRNAs were long assumed to have little capacity to be translated into proteins. However, recent research has shown that sORFs can encode peptides, which makes lncRNAs harder to identify. Identifying lncRNAs that contain sORFs therefore helps in finding novel regulatory factors. Results: In this paper, we propose LncCat, which identifies lncRNAs using category boosting (CatBoost) and ORF-attention features. LncCat combines five types of features to encode transcript sequences and employs CatBoost to build a prediction model. A visualization comparison reveals that the ORF-attention features of lncRNAs and protein-coding transcripts are significantly distinct. Comparison results show that LncCat outperforms competing methods on several benchmark datasets. For the Matthews correlation coefficient (MCC), LncCat achieves 0.9503, 0.9219, 0.8591, 0.8672, and 0.9047 on the human, mouse, zebrafish, wheat, and chicken datasets, with improvements of 1.90–7.82%, 1.49–17.63%, 6.11–21.50%, 3.02–51.64%, and 5.35–26.90%, respectively. Moreover, LncCat improves the MCC by at least 11.90%, 12.96%, and 42.61% on the sORF test datasets of human, mouse, and zebrafish, respectively. Conclusions: Experiments indicate that LncCat performs better on both long-ORF and sORF datasets, and that ORF-attention features have a positive effect on lncRNA prediction. In brief, LncCat is a reliable method for identifying lncRNAs. Additionally, a user-friendly web server for academic use is available at http://cczubio.top/lnccat. |
first_indexed | 2024-03-08T21:31:16Z |
format | Article |
id | doaj.art-73149bc3e97d4763a8d5af53764dfdf3 |
institution | Directory of Open Access Journals |
issn | 2001-0370 |
language | English |
last_indexed | 2024-03-08T21:31:16Z |
publishDate | 2023-01-01 |
publisher | Elsevier |
record_format | Article |
series | Computational and Structural Biotechnology Journal |
spelling | doaj.art-73149bc3e97d4763a8d5af53764dfdf3; 2023-12-21T07:31:00Z; eng; Elsevier; Computational and Structural Biotechnology Journal; ISSN 2001-0370; 2023-01-01; vol. 21, pp. 1433–1447; LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information; Hongqi Feng (School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China); Shaocong Wang (School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China); Yan Wang (Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; School of Artificial Intelligence, Jilin University, Changchun 130012, China); Xinye Ni (The Affiliated Changzhou No.2 People’s Hospital of Nanjing Medical University, Changzhou 213164, China); Zexi Yang (School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China); Xuemei Hu (Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China); Sen Yang, corresponding author (School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China; The Affiliated Changzhou No.2 People’s Hospital of Nanjing Medical University, Changzhou 213164, China); http://www.sciencedirect.com/science/article/pii/S2001037023000582; Keywords: LncRNAs identification; Ensemble learning; ORF-attention features; Small ORF |
title | LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information |
topic | LncRNAs identification; Ensemble learning; ORF-attention features; Small ORF |
url | http://www.sciencedirect.com/science/article/pii/S2001037023000582 |
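
The description above reports performance as the Matthews correlation coefficient (MCC). For readers unfamiliar with the metric, the following is a minimal sketch of how MCC is computed from a binary confusion matrix. It is not the authors' evaluation code: the function names, the class labels `"lncRNA"` and `"mRNA"`, and the toy predictions are assumptions made purely for illustration.

```python
import math


def confusion_counts(y_true, y_pred, positive="lncRNA"):
    """Count TP, TN, FP, FN for a binary task.

    The label "lncRNA" as the positive class is an illustrative assumption.
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not data from the article).
    y_true = ["lncRNA", "lncRNA", "mRNA", "mRNA", "lncRNA", "mRNA"]
    y_pred = ["lncRNA", "mRNA", "mRNA", "mRNA", "lncRNA", "lncRNA"]
    print(mcc(*confusion_counts(y_true, y_pred)))
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect agreement). Because it uses all four cells of the confusion matrix, it stays informative when the two classes are imbalanced.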