Reducing the number of artifactual repeats in de novo assembly of RNA-Seq data by optimizing the assembly pipeline

One of the problems of de novo assembly is the occurrence of artifactual direct or inverted repeats that are mainly formed by misassembly of short sequencing reads and cannot be differentiated from real sequence repeats. In this study, we compared the frequency of artifactual repeats generated by fo...

Bibliographic Details
Main Authors: Lee, Wei Kang, Mohd Zainuddin, Nur Afiza, Teh, Hui Ying, Lim, Yi Yi, Jaafar, Mohd Uzair, Khoo, Jia Shiun, Ghazali, Ahmad Kamal, Namasivayam, Parameswari, Abdullah, Janna Ong, Ho, Chai Ling
Format: Article
Language: English
Published: Elsevier 2017
Online Access: http://psasir.upm.edu.my/id/eprint/62927/1/Reducing%20the%20number%20of%20artifactual%20repeats%20in%20de%20novo%20.pdf
collection UPM
description One of the problems of de novo assembly is the occurrence of artifactual direct or inverted repeats, which are mainly formed by misassembly of short sequencing reads and cannot be differentiated from real sequence repeats. In this study, we compared the frequency of artifactual repeats generated by four de novo assembly pipelines: (1) Velvet-Oases-The Gene Index Clustering Tool (TGICL), (2) Velvet-TGICL, (3) Trinity-TGICL and (4) SOAPdenovo-Trans-TGICL, by analysing the RNA-Seq data of Gracilaria changii. The overall completeness of these four de novo assemblies was in the range of 85.2–90.0% for complete Benchmarking Universal Single-Copy Orthologs (BUSCOs), with the Velvet-TGICL assembly having the highest percentage of single-copy and complete BUSCOs (78.9%). When Velvet-Oases-TGICL was used, a total of 2510 (8.44%) direct and 1967 (6.61%) inverted artifactual repeats were found among the assembled sequences. Polymerase chain reaction (PCR) analysis of 15 unigenes containing direct or inverted repeats confirmed that the repeats were due to assembly artifacts. When Oases was omitted from the assembly pipeline (i.e. Velvet-TGICL), the number of unigenes containing artifactual direct and inverted repeats decreased significantly to 238 (1.63%) and 8 (0.06%), respectively. Among the four de novo assemblies, the Velvet-Oases-TGICL and Velvet-TGICL assemblies had the highest and the lowest percentages of unigenes containing artifactual repeats, respectively. The occurrence of artifactual repeats in the transcriptome data may complicate downstream analyses such as identification of splice variants and gene fusions, but differential gene expression was less affected by the presence of artifactual repeats in this study. The information provided in this paper (based on a non-model seaweed, G. changii) could be useful for designing and optimizing assembly pipelines for future analysis of RNA-Seq data from organisms without a reference genome.
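The abstract does not say how direct and inverted repeats were detected in the assembled unigenes. As an illustration only, and not the authors' method, the sketch below flags candidate repeats inside a single sequence by exact k-mer matching: a direct repeat is the same k-mer occurring again downstream without overlap, and an inverted repeat is the reverse complement of a k-mer occurring downstream. The function names and the default k of 20 are hypothetical choices, and a realistic screen would need to tolerate mismatches.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (upper case only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_repeats(seq: str, k: int = 20):
    """Return (direct, inverted) lists of (i, j) start-position pairs.

    direct:   the k-mer starting at i occurs again at j, with j >= i + k
    inverted: the reverse complement of the k-mer at i occurs at j, j >= i + k
    Exact matching only; k is an assumed, illustrative parameter.
    """
    seq = seq.upper()
    # Index every k-mer by the positions at which it starts.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    direct, inverted = [], []
    for kmer, starts in positions.items():
        # .get() avoids adding new keys to the dict while iterating over it.
        rc_starts = positions.get(reverse_complement(kmer), ())
        for i in starts:
            direct.extend((i, j) for j in starts if j >= i + k)
            inverted.extend((i, j) for j in rc_starts if j >= i + k)
    return sorted(direct), sorted(inverted)


if __name__ == "__main__":
    # Toy sequences built for illustration; they are not data from the study.
    unit = "ACGTTGCAAG"
    direct_hits, _ = find_repeats(unit + "TTTTT" + unit, k=10)
    _, inverted_hits = find_repeats(unit + "CCCCC" + reverse_complement(unit), k=10)
    print("direct:", direct_hits)
    print("inverted:", inverted_hits)
```

Applied to each unigene in an assembly, the share of sequences with at least one hit would give a percentage comparable in spirit to those quoted in the abstract, although the exact figures depend on the detection criteria actually used in the paper.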
first_indexed 2024-03-06T09:43:30Z
format Article
id upm.eprints-62927
institution Universiti Putra Malaysia
language English
last_indexed 2024-03-06T09:43:30Z
publishDate 2017
publisher Elsevier
record_format dspace
spelling upm.eprints-62927 2018-09-28T10:36:50Z http://psasir.upm.edu.my/id/eprint/62927/ Elsevier 2017-12 Article PeerReviewed text en http://psasir.upm.edu.my/id/eprint/62927/1/Reducing%20the%20number%20of%20artifactual%20repeats%20in%20de%20novo%20.pdf Lee, Wei Kang and Mohd Zainuddin, Nur Afiza and Teh, Hui Ying and Lim, Yi Yi and Jaafar, Mohd Uzair and Khoo, Jia Shiun and Ghazali, Ahmad Kamal and Namasivayam, Parameswari and Abdullah, Janna Ong and Ho, Chai Ling (2017) Reducing the number of artifactual repeats in de novo assembly of RNA-Seq data by optimizing the assembly pipeline. Gene Reports, 9. pp. 7-12. ISSN 2452-0144 https://www.sciencedirect.com/science/article/pii/S2452014417300572 doi:10.1016/j.genrep.2017.08.003