DNABERT-based explainable lncRNA identification in plant genome assemblies
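Main Authors: | Monica F. Danilevicz, Mitchell Gill, Cassandria G. Tay Fernandez, Jakob Petereit, Shriprabha R. Upadhyaya, Jacqueline Batley, Mohammed Bennamoun, David Edwards, Philipp E. Bayer |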
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2023-01-01 |
Series: | Computational and Structural Biotechnology Journal |
Subjects: | LncRNAs, Natural language processing, Deep learning, Genomic motif, Cross-species prediction |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037023004397 |
_version_ | 1797384096277594112 |
---|---|
author | Monica F. Danilevicz Mitchell Gill Cassandria G. Tay Fernandez Jakob Petereit Shriprabha R. Upadhyaya Jacqueline Batley Mohammed Bennamoun David Edwards Philipp E. Bayer |
author_sort | Monica F. Danilevicz |
collection | DOAJ |
description | Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs; however, they often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be accurately predicted from genomic sequences with the highest accuracy of 83.4% for Z. mays and the lowest accuracy of 57.9% for B. rapa, revealing that genome assembly quality might affect the accuracy of lncRNA identification. Furthermore, we demonstrated the potential of using NLP models for cross-species prediction with an average of 63.1% accuracy using target species not previously seen by the model. As more species are incorporated into the training datasets, we expect the accuracy to increase, becoming a more reliable tool for uncovering novel lncRNAs. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important to lncRNA prediction and that these motifs frequently flanked the lncRNA sequence. |
first_indexed | 2024-03-08T21:30:34Z |
format | Article |
id | doaj.art-d14bf230c8f04ee88fecd66685d13436 |
institution | Directory Open Access Journal |
issn | 2001-0370 |
language | English |
last_indexed | 2024-03-08T21:30:34Z |
publishDate | 2023-01-01 |
publisher | Elsevier |
record_format | Article |
series | Computational and Structural Biotechnology Journal |
spelling | doaj.art-d14bf230c8f04ee88fecd66685d13436; 2023-12-21T07:32:32Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2023-01-01; 21; 5676; 5685; DNABERT-based explainable lncRNA identification in plant genome assemblies; Monica F. Danilevicz (0), Mitchell Gill (1), Cassandria G. Tay Fernandez (2), Jakob Petereit (3), Shriprabha R. Upadhyaya (4), Jacqueline Batley (5), Mohammed Bennamoun (6), David Edwards (7), Philipp E. Bayer (8); (0)-(5), (7): School of Biological Sciences, University of Western Australia, Australia; (6): School of Physics, Mathematics and Computing, University of Western Australia, Australia; (8): School of Biological Sciences, University of Western Australia, Australia; Corresponding author; http://www.sciencedirect.com/science/article/pii/S2001037023004397; LncRNAs; Natural language processing; Deep learning; Genomic motif; Cross-species prediction |
title | DNABERT-based explainable lncRNA identification in plant genome assemblies |
topic | LncRNAs Natural language processing Deep learning Genomic motif Cross-species prediction |
url | http://www.sciencedirect.com/science/article/pii/S2001037023004397 |