Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem
Abstract Background Transcriptomic structural variants (TSVs)—large-scale transcriptome sequence change due to structural variation - are common in cancer. TSV detection from high-throughput sequencing data is a computationally challenging problem. Among all the confounding factors, sample heterogen...
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Published: |
BMC
2020-05-01
|
Series: | Algorithms for Molecular Biology |
Subjects: | |
Online Access: | http://link.springer.com/article/10.1186/s13015-020-00170-5 |
_version_ | 1818512236002410496 |
---|---|
author | Yutong Qiu Cong Ma Han Xie Carl Kingsford |
author_facet | Yutong Qiu Cong Ma Han Xie Carl Kingsford |
author_sort | Yutong Qiu |
collection | DOAJ |
description | Abstract Background Transcriptomic structural variants (TSVs)—large-scale transcriptome sequence change due to structural variation - are common in cancer. TSV detection from high-throughput sequencing data is a computationally challenging problem. Among all the confounding factors, sample heterogeneity, where each sample contains multiple distinct alleles, poses a critical obstacle to accurate TSV prediction. Results To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome arrangements that maximize the number of reads that are concordant with at least one arrangement. This models a heterogeneous or diploid sample. We prove that MCAP is NP-complete and provide a $$\frac{1}{4}$$ 1 4 -approximation algorithm for $$k=1$$ k = 1 and a $$\frac{3}{4}$$ 3 4 -approximation algorithm for the diploid case ( $$k=2$$ k = 2 ) assuming an oracle for $$k=1$$ k = 1 . Combining these, we obtain a $$\frac{3}{16}$$ 3 16 -approximation algorithm for MCAP when $$k=2$$ k = 2 (without an oracle). We also present an integer linear programming formulation for general k. We characterize the conflict structures in the graph that require $$k>1$$ k > 1 alleles to satisfy read concordancy and show that such structures are prevalent. Conclusions We show that the solution to MCAP accurately addresses sample heterogeneity during TSV detection. Our algorithms have improved performance on TCGA cancer samples and cancer cell line samples compared to a TSV calling tool, SQUID. The software is available at https://github.com/Kingsford-Group/diploidsquid . |
first_indexed | 2024-12-10T23:44:02Z |
format | Article |
id | doaj.art-582d03c624724f7fbc489091978dfd9d |
institution | Directory Open Access Journal |
issn | 1748-7188 |
language | English |
last_indexed | 2024-12-10T23:44:02Z |
publishDate | 2020-05-01 |
publisher | BMC |
record_format | Article |
series | Algorithms for Molecular Biology |
spelling | doaj.art-582d03c624724f7fbc489091978dfd9d2022-12-22T01:28:58ZengBMCAlgorithms for Molecular Biology1748-71882020-05-0115111510.1186/s13015-020-00170-5Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements ProblemYutong Qiu0Cong Ma1Han Xie2Carl Kingsford3Computational Biology Department, Carnegie Mellon UniversityComputational Biology Department, Carnegie Mellon UniversityComputational Biology Department, Carnegie Mellon UniversityComputational Biology Department, Carnegie Mellon UniversityAbstract Background Transcriptomic structural variants (TSVs)—large-scale transcriptome sequence change due to structural variation - are common in cancer. TSV detection from high-throughput sequencing data is a computationally challenging problem. Among all the confounding factors, sample heterogeneity, where each sample contains multiple distinct alleles, poses a critical obstacle to accurate TSV prediction. Results To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome arrangements that maximize the number of reads that are concordant with at least one arrangement. This models a heterogeneous or diploid sample. We prove that MCAP is NP-complete and provide a $$\frac{1}{4}$$ 1 4 -approximation algorithm for $$k=1$$ k = 1 and a $$\frac{3}{4}$$ 3 4 -approximation algorithm for the diploid case ( $$k=2$$ k = 2 ) assuming an oracle for $$k=1$$ k = 1 . Combining these, we obtain a $$\frac{3}{16}$$ 3 16 -approximation algorithm for MCAP when $$k=2$$ k = 2 (without an oracle). We also present an integer linear programming formulation for general k. We characterize the conflict structures in the graph that require $$k>1$$ k > 1 alleles to satisfy read concordancy and show that such structures are prevalent. Conclusions We show that the solution to MCAP accurately addresses sample heterogeneity during TSV detection. Our algorithms have improved performance on TCGA cancer samples and cancer cell line samples compared to a TSV calling tool, SQUID. The software is available at https://github.com/Kingsford-Group/diploidsquid .http://link.springer.com/article/10.1186/s13015-020-00170-5Transcriptomic structural variationInteger linear programmingHeterogeneity |
spellingShingle | Yutong Qiu Cong Ma Han Xie Carl Kingsford Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem Algorithms for Molecular Biology Transcriptomic structural variation Integer linear programming Heterogeneity |
title | Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem |
title_full | Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem |
title_fullStr | Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem |
title_full_unstemmed | Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem |
title_short | Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem |
title_sort | detecting transcriptomic structural variants in heterogeneous contexts via the multiple compatible arrangements problem |
topic | Transcriptomic structural variation Integer linear programming Heterogeneity |
url | http://link.springer.com/article/10.1186/s13015-020-00170-5 |
work_keys_str_mv | AT yutongqiu detectingtranscriptomicstructuralvariantsinheterogeneouscontextsviathemultiplecompatiblearrangementsproblem AT congma detectingtranscriptomicstructuralvariantsinheterogeneouscontextsviathemultiplecompatiblearrangementsproblem AT hanxie detectingtranscriptomicstructuralvariantsinheterogeneouscontextsviathemultiplecompatiblearrangementsproblem AT carlkingsford detectingtranscriptomicstructuralvariantsinheterogeneouscontextsviathemultiplecompatiblearrangementsproblem |