DNABERT-based explainable lncRNA identification in plant genome assemblies

Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small p...

Bibliographic Details
Main Authors: Monica F. Danilevicz, Mitchell Gill, Cassandria G. Tay Fernandez, Jakob Petereit, Shriprabha R. Upadhyaya, Jacqueline Batley, Mohammed Bennamoun, David Edwards, Philipp E. Bayer
Format: Article
Language: English
Published: Elsevier 2023-01-01
Series: Computational and Structural Biotechnology Journal
Subjects: LncRNAs; Natural language processing; Deep learning; Genomic motif; Cross-species prediction
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037023004397
author Monica F. Danilevicz
Mitchell Gill
Cassandria G. Tay Fernandez
Jakob Petereit
Shriprabha R. Upadhyaya
Jacqueline Batley
Mohammed Bennamoun
David Edwards
Philipp E. Bayer
collection DOAJ
description Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs; however, they often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be predicted from genomic sequences, with accuracy ranging from 57.9% for B. rapa to 83.4% for Z. mays, suggesting that genome assembly quality might affect the accuracy of lncRNA identification. Furthermore, we demonstrated the potential of using NLP models for cross-species prediction, with an average accuracy of 63.1% on target species not previously seen by the model. As more species are incorporated into the training datasets, we expect the accuracy to increase, making the approach a more reliable tool for uncovering novel lncRNAs. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important to lncRNA prediction, and that these motifs frequently flanked the lncRNA sequence.
format Article
id doaj.art-d14bf230c8f04ee88fecd66685d13436
institution Directory Open Access Journal
issn 2001-0370
language English
publishDate 2023-01-01
publisher Elsevier
series Computational and Structural Biotechnology Journal
spelling doaj.art-d14bf230c8f04ee88fecd66685d13436; 2023-12-21T07:32:32Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2023-01-01; volume 21, pages 5676-5685
Monica F. Danilevicz (School of Biological Sciences, University of Western Australia, Australia)
Mitchell Gill (School of Biological Sciences, University of Western Australia, Australia)
Cassandria G. Tay Fernandez (School of Biological Sciences, University of Western Australia, Australia)
Jakob Petereit (School of Biological Sciences, University of Western Australia, Australia)
Shriprabha R. Upadhyaya (School of Biological Sciences, University of Western Australia, Australia)
Jacqueline Batley (School of Biological Sciences, University of Western Australia, Australia)
Mohammed Bennamoun (School of Physics, Mathematics and Computing, University of Western Australia, Australia)
David Edwards (School of Biological Sciences, University of Western Australia, Australia)
Philipp E. Bayer (School of Biological Sciences, University of Western Australia, Australia; corresponding author)
title DNABERT-based explainable lncRNA identification in plant genome assemblies
topic LncRNAs
Natural language processing
Deep learning
Genomic motif
Cross-species prediction
url http://www.sciencedirect.com/science/article/pii/S2001037023004397