CI-SpliceAI—Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites

Background: It is estimated that up to 50% of all disease-causing variants disrupt splicing. Due to its complexity, our ability to predict which variants disrupt splicing is limited, meaning missed diagnoses for patients. The emergence of machine learning for targeted medicine ho...

Bibliographic Details
Main Authors:	Yaron Strauch, Jenny Lord, Mahesan Niranjan, Diana Baralle
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2022-01-01
Series:	PLoS ONE
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9165884/?tool=EBI

collection	DOAJ
description	Background: It is estimated that up to 50% of all disease-causing variants disrupt splicing. Due to its complexity, our ability to predict which variants disrupt splicing is limited, meaning missed diagnoses for patients. The emergence of machine learning for targeted medicine holds great potential to improve prediction of splice-disrupting variants. The recently published SpliceAI algorithm utilises deep neural networks and has been reported to have greater accuracy than other commonly used methods. Methods and findings: The original SpliceAI was trained on splice sites included in primary isoforms combined with novel junctions observed in GTEx data, which might introduce noise and decorrelate the machine learning input from its output. Limiting the training data to only validated and manually annotated primary and alternatively spliced GENCODE sites may improve predictive abilities. All of these gene isoforms were collapsed (aggregated into one pseudo-isoform) and the SpliceAI architecture was retrained (CI-SpliceAI). Predictive performance on a newly curated dataset of 1,316 functionally validated variants from the literature was compared with the original SpliceAI, alongside MMSplice, MaxEntScan, and SQUIRLS. Both SpliceAI algorithms outperformed the other methods, with the original SpliceAI achieving an accuracy of ∼91%, and CI-SpliceAI showing an improvement at ∼92% overall. Predictive accuracy increased in the majority of curated variants. Conclusions: We show that including only manually annotated alternatively spliced sites in training data improves prediction of clinically relevant variants, and highlight avenues for further performance improvements.
id	doaj.art-3fa027916905467994647ecd5fc0b678
institution	Directory Open Access Journal
issn	1932-6203

Similar Items