Improved detection of clinically relevant fusion transcripts in cancer by machine learning classification

Abstract. Background: Genomic rearrangements in cancer cells can create fusion genes that encode chimeric proteins or alter the expression of coding and non-coding RNAs. In some cancer types, fusions involving specific kinases are used as targets for therapy. Fusion genes can be detected by whole geno...


Bibliographic Details
Main Authors:	Völundur Hafstað, Jari Häkkinen, Malin Larsson, Johan Staaf, Johan Vallon-Christersson, Helena Persson
Format:	Article
Language:	English
Published:	BMC 2023-12-01
Series:	BMC Genomics
Subjects:	Fusion transcript, Gene fusion, Cancer genomics, Tumor biology, Precision medicine, Machine learning
Online Access:	https://doi.org/10.1186/s12864-023-09889-y

collection	DOAJ
description	Abstract. Background: Genomic rearrangements in cancer cells can create fusion genes that encode chimeric proteins or alter the expression of coding and non-coding RNAs. In some cancer types, fusions involving specific kinases are used as targets for therapy. Fusion genes can be detected by whole genome sequencing (WGS) and targeted fusion panels, but RNA sequencing (RNA-Seq) has the advantageous capability of broadly detecting expressed fusion transcripts. Results: We developed a pipeline for validation of fusion transcripts identified in RNA-Seq data using matched WGS data from The Cancer Genome Atlas (TCGA) and applied it to 910 tumors from 11 different cancer types. This resulted in 4237 validated gene fusions, 3049 of them with at least one identified genomic breakpoint. Utilizing validated fusions as true positive events, we trained a machine learning classifier to predict true and false positive fusion transcripts from RNA-Seq data. The final precision and recall metrics of the classifier were 0.74 and 0.71, respectively, in an independent dataset of 249 breast tumors. Application of this classifier to all samples with RNA-Seq data from these cancer types vastly extended the number of likely true positive fusion transcripts and identified many potentially targetable kinase fusions. Further analysis of the validated gene fusions suggested that many are created by intrachromosomal amplification events with microhomology-mediated non-homologous end-joining. Conclusions: A classifier trained on validated fusion events increased the accuracy of fusion transcript identification in samples without WGS data. This allowed the analysis to be extended to all samples with RNA-Seq data, facilitating studies of tumor biology and increasing the number of detected kinase fusions. Machine learning could thus be used in identification of clinically relevant fusion events for targeted therapy. The large dataset of validated gene fusions generated here presents a useful resource for development and evaluation of fusion transcript detection algorithms.
first_indexed	2024-03-08T19:49:32Z
id	doaj.art-9ff2da0262514696b39df8415e9a8fba
institution	Directory Open Access Journal
issn	1471-2164
last_indexed	2024-03-08T19:49:32Z
spelling	doaj.art-9ff2da0262514696b39df8415e9a8fba; 2023-12-24T12:10:32Z; eng; BMC; BMC Genomics; 1471-2164; 2023-12-01; 24; 1; 1; 16; 10.1186/s12864-023-09889-y; Improved detection of clinically relevant fusion transcripts in cancer by machine learning classification; Völundur Hafstað (0), Jari Häkkinen (1), Malin Larsson (2), Johan Staaf (3), Johan Vallon-Christersson (4), Helena Persson (5); (0) Faculty of Medicine, Department of Clinical Sciences Lund, Oncology, Lund University Cancer Centre; (1) Faculty of Medicine, Department of Clinical Sciences Lund, Oncology, Lund University Cancer Centre; (2) Department of Physics, Chemistry and Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Linköping University; (3) Faculty of Medicine, Department of Laboratory Medicine, Translational Cancer Research, Lund University Cancer Centre; (4) Faculty of Medicine, Department of Clinical Sciences Lund, Oncology, Lund University Cancer Centre; (5) Faculty of Medicine, Department of Clinical Sciences Lund, Oncology, Lund University Cancer Centre

