Finding a suitable library size to call variants in RNA-Seq

Abstract. Background: RNA sequencing allows the study of both gene expression changes and transcribed mutations, providing a highly effective way to gain insight into cancer biology. When planning the sequencing of a large cohort of samples, library size is a fundamental factor affecting both the over...

Bibliographic Details
Main Authors:	Anna Quaglieri, Christoffer Flensburg, Terence P. Speed, Ian J. Majewski
Format:	Article
Language:	English
Published:	BMC 2020-12-01
Series:	BMC Bioinformatics
Subjects:	Cancer RNA-Seq, Variant calling, Library size, Sequencing depth
Online Access:	https://doi.org/10.1186/s12859-020-03860-4

collection	DOAJ
description	Abstract. Background: RNA sequencing allows the study of both gene expression changes and transcribed mutations, providing a highly effective way to gain insight into cancer biology. When planning the sequencing of a large cohort of samples, library size is a fundamental factor affecting both the overall cost and the quality of the results. Here we specifically address how overall library size influences the detection of somatic mutations in RNA-Seq data in two acute myeloid leukaemia datasets. Results: We simulated shallower sequencing depths by downsampling 45 acute myeloid leukaemia samples (100 bp PE) from the Leucegene project, which were originally sequenced at high depth. We compared the sensitivity of six methods in recovering validated mutations on the same samples. The six methods are the combinations of three popular callers (MuTect, VarScan, and VarDict) with two filtering strategies. We observed an incremental loss in sensitivity when simulating libraries of 80M, 50M, 40M, 30M and 20M fragments, with the largest loss detected with fewer than 30M fragments (sensitivity below 90%, average loss of 7%). The sensitivity in recovering insertions and deletions varied markedly between callers, with VarDict showing the highest sensitivity (60%). Single nucleotide variant sensitivity was relatively consistent across methods, apart from MuTect, whose default filters need adjustment when used on RNA-Seq data. We also analysed 136 RNA-Seq samples from the TCGA-LAML cohort (50 bp PE) and assessed the change in sensitivity between the initial libraries (average 59M fragments) and the same libraries downsampled to 40M fragments. For single nucleotide variants in recurrently mutated myeloid genes we found comparable performance, with a 6% average loss in sensitivity at 40M fragments. Conclusions: Between 30M and 40M 100 bp PE reads are needed to recover 90–95% of the initial variants in recurrently mutated myeloid genes. To extend this result to another cancer type, an exploration of the characteristics of its mutations and gene expression patterns is suggested.
first_indexed	2024-12-13T18:51:18Z
id	doaj.art-a5451a81538c4ba88f54c9380ccb9110
institution	Directory Open Access Journal
issn	1471-2105
last_indexed	2024-12-13T18:51:18Z
spelling	doaj.art-a5451a81538c4ba88f54c9380ccb9110; 2022-12-21T23:34:56Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2020-12-01; 211119; 10.1186/s12859-020-03860-4; Finding a suitable library size to call variants in RNA-Seq; Anna Quaglieri, Christoffer Flensburg, Terence P. Speed, Ian J. Majewski (all Walter and Eliza Hall Institute of Medical Research); https://doi.org/10.1186/s12859-020-03860-4; Cancer RNA-Seq; Variant calling; Library size; Sequencing depth

Similar Items