Flnc: Machine Learning Improves the Identification of Novel Long Noncoding RNAs from Stand-Alone RNA-Seq Data

Long noncoding RNAs (lncRNAs) play critical regulatory roles in human development and disease. Although there are over 100,000 samples with available RNA sequencing (RNA-seq) data, many lncRNAs have yet to be annotated. The conventional approach to identifying novel lncRNAs from RNA-seq data is to f...

Bibliographic Details
Main Authors: Zixiu Li, Peng Zhou, Euijin Kwon, Katherine A. Fitzgerald, Zhiping Weng, Chan Zhou
Format: Article
Language: English
Published: MDPI AG 2022-10-01
Series: Non-Coding RNA
Subjects: lncRNA, machine learning, RNA-seq, tool, unannotated
Online Access: https://www.mdpi.com/2311-553X/8/5/70
_version_ 1797470654362025984
author Zixiu Li
Peng Zhou
Euijin Kwon
Katherine A. Fitzgerald
Zhiping Weng
Chan Zhou
author_facet Zixiu Li
Peng Zhou
Euijin Kwon
Katherine A. Fitzgerald
Zhiping Weng
Chan Zhou
author_sort Zixiu Li
collection DOAJ
description Long noncoding RNAs (lncRNAs) play critical regulatory roles in human development and disease. Although there are over 100,000 samples with available RNA sequencing (RNA-seq) data, many lncRNAs have yet to be annotated. The conventional approach to identifying novel lncRNAs from RNA-seq data is to find transcripts without coding potential, but this approach has a false discovery rate of 30–75%. Other existing methods either identify only multi-exon lncRNAs, missing single-exon lncRNAs, or require transcriptional initiation profiling data (such as H3K4me3 ChIP-seq data), which are unavailable for many samples with RNA-seq data. Because of these limitations, current methods cannot accurately identify novel lncRNAs from existing RNA-seq data. To address this problem, we have developed software, Flnc, to accurately identify both novel and annotated full-length lncRNAs, including single-exon lncRNAs, directly from RNA-seq data without requiring transcriptional initiation profiles. Flnc integrates machine learning models built by incorporating four types of features: transcript length, promoter signature, multiple exons, and genomic location. Flnc achieves state-of-the-art prediction power with an AUROC score over 0.92. Flnc significantly improves the prediction accuracy from less than 50% using the conventional approach to over 85%. Flnc is available on GitHub.
first_indexed 2024-03-09T19:39:05Z
format Article
id doaj.art-04db2740038f4ab382fc7794fc41cf29
institution Directory Open Access Journal
issn 2311-553X
language English
last_indexed 2024-03-09T19:39:05Z
publishDate 2022-10-01
publisher MDPI AG
record_format Article
series Non-Coding RNA
spelling doaj.art-04db2740038f4ab382fc7794fc41cf29
2023-11-24T01:42:31Z
eng
MDPI AG
Non-Coding RNA
2311-553X
2022-10-01
Volume 8, Issue 5, Article 70
10.3390/ncrna8050070
Flnc: Machine Learning Improves the Identification of Novel Long Noncoding RNAs from Stand-Alone RNA-Seq Data
Zixiu Li (0): Division of Biostatistics and Health Services Research, Department of Population and Quantitative Health Sciences, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
Peng Zhou (1): Division of Biostatistics and Health Services Research, Department of Population and Quantitative Health Sciences, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
Euijin Kwon (2): Division of Biostatistics and Health Services Research, Department of Population and Quantitative Health Sciences, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
Katherine A. Fitzgerald (3): Program in Innate Immunity, Division of Infectious Disease and Immunology, Department of Medicine, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
Zhiping Weng (4): Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
Chan Zhou (5): Division of Biostatistics and Health Services Research, Department of Population and Quantitative Health Sciences, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
https://www.mdpi.com/2311-553X/8/5/70
lncRNA
machine learning
RNA-seq
tool
unannotated
spellingShingle Zixiu Li
Peng Zhou
Euijin Kwon
Katherine A. Fitzgerald
Zhiping Weng
Chan Zhou
Flnc: Machine Learning Improves the Identification of Novel Long Noncoding RNAs from Stand-Alone RNA-Seq Data
Non-Coding RNA
lncRNA
machine learning
RNA-seq
tool
unannotated
title Flnc: Machine Learning Improves the Identification of Novel Long Noncoding RNAs from Stand-Alone RNA-Seq Data
title_full Flnc: Machine Learning Improves the Identification of Novel Long Noncoding RNAs from Stand-Alone RNA-Seq Data
title_fullStr Flnc: Machine Learning Improves the Identification of Novel Long Noncoding RNAs from Stand-Alone RNA-Seq Data
title_full_unstemmed Flnc: Machine Learning Improves the Identification of Novel Long Noncoding RNAs from Stand-Alone RNA-Seq Data
title_short Flnc: Machine Learning Improves the Identification of Novel Long Noncoding RNAs from Stand-Alone RNA-Seq Data
title_sort flnc machine learning improves the identification of novel long noncoding rnas from stand alone rna seq data
topic lncRNA
machine learning
RNA-seq
tool
unannotated
url https://www.mdpi.com/2311-553X/8/5/70
work_keys_str_mv AT zixiuli flncmachinelearningimprovestheidentificationofnovellongnoncodingrnasfromstandalonernaseqdata
AT pengzhou flncmachinelearningimprovestheidentificationofnovellongnoncodingrnasfromstandalonernaseqdata
AT euijinkwon flncmachinelearningimprovestheidentificationofnovellongnoncodingrnasfromstandalonernaseqdata
AT katherineafitzgerald flncmachinelearningimprovestheidentificationofnovellongnoncodingrnasfromstandalonernaseqdata
AT zhipingweng flncmachinelearningimprovestheidentificationofnovellongnoncodingrnasfromstandalonernaseqdata
AT chanzhou flncmachinelearningimprovestheidentificationofnovellongnoncodingrnasfromstandalonernaseqdata
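
The description reports an AUROC score over 0.92 for Flnc. As a reminder of what that metric measures, the following is a minimal sketch, not the authors' evaluation code: AUROC equals the probability that a randomly chosen positive example (here, a true lncRNA) receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `auroc` and the toy labels and scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for a positive example, 0 for a negative example.
    scores: classifier scores, where higher means "more likely positive".
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: three positives and three negatives.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auroc = auroc(toy_labels, toy_scores)  # 8 of 9 pairs ranked correctly
```

The pairwise form is quadratic in the number of examples, so it suits small illustrations only; on real data a rank-based implementation such as scikit-learn's `roc_auc_score` would be the usual choice.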