Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem

Abstract Background Transcriptomic structural variants (TSVs)—large-scale transcriptome sequence change due to structural variation - are common in cancer. TSV detection from high-throughput sequencing data is a computationally challenging problem. Among all the confounding factors, sample heterogen...

Full description

Bibliographic Details
Main Authors: Yutong Qiu, Cong Ma, Han Xie, Carl Kingsford
Format: Article
Language:English
Published: BMC 2020-05-01
Series:Algorithms for Molecular Biology
Subjects:
Online Access:http://link.springer.com/article/10.1186/s13015-020-00170-5
_version_ 1818512236002410496
author Yutong Qiu
Cong Ma
Han Xie
Carl Kingsford
author_facet Yutong Qiu
Cong Ma
Han Xie
Carl Kingsford
author_sort Yutong Qiu
collection DOAJ
description Abstract Background Transcriptomic structural variants (TSVs)—large-scale transcriptome sequence change due to structural variation - are common in cancer. TSV detection from high-throughput sequencing data is a computationally challenging problem. Among all the confounding factors, sample heterogeneity, where each sample contains multiple distinct alleles, poses a critical obstacle to accurate TSV prediction. Results To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome arrangements that maximize the number of reads that are concordant with at least one arrangement. This models a heterogeneous or diploid sample. We prove that MCAP is NP-complete and provide a $$\frac{1}{4}$$ 1 4 -approximation algorithm for $$k=1$$ k = 1 and a $$\frac{3}{4}$$ 3 4 -approximation algorithm for the diploid case ( $$k=2$$ k = 2 ) assuming an oracle for $$k=1$$ k = 1 . Combining these, we obtain a $$\frac{3}{16}$$ 3 16 -approximation algorithm for MCAP when $$k=2$$ k = 2 (without an oracle). We also present an integer linear programming formulation for general k. We characterize the conflict structures in the graph that require $$k>1$$ k > 1 alleles to satisfy read concordancy and show that such structures are prevalent. Conclusions We show that the solution to MCAP accurately addresses sample heterogeneity during TSV detection. Our algorithms have improved performance on TCGA cancer samples and cancer cell line samples compared to a TSV calling tool, SQUID. The software is available at https://github.com/Kingsford-Group/diploidsquid .
first_indexed 2024-12-10T23:44:02Z
format Article
id doaj.art-582d03c624724f7fbc489091978dfd9d
institution Directory Open Access Journal
issn 1748-7188
language English
last_indexed 2024-12-10T23:44:02Z
publishDate 2020-05-01
publisher BMC
record_format Article
series Algorithms for Molecular Biology
spelling doaj.art-582d03c624724f7fbc489091978dfd9d2022-12-22T01:28:58ZengBMCAlgorithms for Molecular Biology1748-71882020-05-0115111510.1186/s13015-020-00170-5Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements ProblemYutong Qiu0Cong Ma1Han Xie2Carl Kingsford3Computational Biology Department, Carnegie Mellon UniversityComputational Biology Department, Carnegie Mellon UniversityComputational Biology Department, Carnegie Mellon UniversityComputational Biology Department, Carnegie Mellon UniversityAbstract Background Transcriptomic structural variants (TSVs)—large-scale transcriptome sequence change due to structural variation - are common in cancer. TSV detection from high-throughput sequencing data is a computationally challenging problem. Among all the confounding factors, sample heterogeneity, where each sample contains multiple distinct alleles, poses a critical obstacle to accurate TSV prediction. Results To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome arrangements that maximize the number of reads that are concordant with at least one arrangement. This models a heterogeneous or diploid sample. We prove that MCAP is NP-complete and provide a $$\frac{1}{4}$$ 1 4 -approximation algorithm for $$k=1$$ k = 1 and a $$\frac{3}{4}$$ 3 4 -approximation algorithm for the diploid case ( $$k=2$$ k = 2 ) assuming an oracle for $$k=1$$ k = 1 . Combining these, we obtain a $$\frac{3}{16}$$ 3 16 -approximation algorithm for MCAP when $$k=2$$ k = 2 (without an oracle). We also present an integer linear programming formulation for general k. We characterize the conflict structures in the graph that require $$k>1$$ k > 1 alleles to satisfy read concordancy and show that such structures are prevalent. Conclusions We show that the solution to MCAP accurately addresses sample heterogeneity during TSV detection. Our algorithms have improved performance on TCGA cancer samples and cancer cell line samples compared to a TSV calling tool, SQUID. The software is available at https://github.com/Kingsford-Group/diploidsquid .http://link.springer.com/article/10.1186/s13015-020-00170-5Transcriptomic structural variationInteger linear programmingHeterogeneity
spellingShingle Yutong Qiu
Cong Ma
Han Xie
Carl Kingsford
Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem
Algorithms for Molecular Biology
Transcriptomic structural variation
Integer linear programming
Heterogeneity
title Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem
title_full Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem
title_fullStr Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem
title_full_unstemmed Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem
title_short Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem
title_sort detecting transcriptomic structural variants in heterogeneous contexts via the multiple compatible arrangements problem
topic Transcriptomic structural variation
Integer linear programming
Heterogeneity
url http://link.springer.com/article/10.1186/s13015-020-00170-5
work_keys_str_mv AT yutongqiu detectingtranscriptomicstructuralvariantsinheterogeneouscontextsviathemultiplecompatiblearrangementsproblem
AT congma detectingtranscriptomicstructuralvariantsinheterogeneouscontextsviathemultiplecompatiblearrangementsproblem
AT hanxie detectingtranscriptomicstructuralvariantsinheterogeneouscontextsviathemultiplecompatiblearrangementsproblem
AT carlkingsford detectingtranscriptomicstructuralvariantsinheterogeneouscontextsviathemultiplecompatiblearrangementsproblem