Comparison of kNN and k-means optimization methods of reference set selection for improved CNV callers performance

Abstract. Background: There are over 25 tools dedicated to the detection of Copy Number Variants (CNVs) using Whole Exome Sequencing (WES) data based on read depth analysis. The reported tools consist of several steps, including: (i) calculation of read depth for each sequencing target, (ii) normaliz...


Bibliographic Details
Main Authors:	Wiktor Kuśmirek, Agnieszka Szmurło, Marek Wiewiórka, Robert Nowak, Tomasz Gambin
Format:	Article
Language:	English
Published:	BMC 2019-05-01
Series:	BMC Bioinformatics
Subjects:	Copy number variation Read depth Next-generation sequencing Clustering
Online Access:	http://link.springer.com/article/10.1186/s12859-019-2889-z

Record Details
ISSN:	1471-2105
DOI:	10.1186/s12859-019-2889-z
Volume/Issue:	20(1)
Pages:	1-10
Author Affiliation:	Institute of Computer Science, Warsaw University of Technology (all authors)
Collection:	DOAJ
Institution:	Directory Open Access Journal
Record ID:	doaj.art-5e493e90a282429899eda3cb0632ec5e
First Indexed:	2024-04-13T13:51:03Z
Last Indexed:	2024-04-13T13:51:03Z
Description:	Abstract. Background: There are over 25 tools dedicated to the detection of Copy Number Variants (CNVs) using Whole Exome Sequencing (WES) data based on read depth analysis. The reported tools consist of several steps, including: (i) calculation of read depth for each sequencing target, (ii) normalization, (iii) segmentation and (iv) actual CNV calling. The essential aspect of the entire process is the normalization stage, in which systematic errors and biases are removed and the reference sample set is used to increase the signal-to-noise ratio. Although some CNV calling tools use dedicated algorithms to obtain the optimal reference sample set, most of the advanced CNV callers do not include this feature. To our knowledge, this work is the first attempt to assess the impact of reference sample set selection on CNV detection performance. Methods: We used WES data from the 1000 Genomes project to evaluate the impact of various methods of reference sample set selection on the CNV calling performance of three chosen state-of-the-art tools: CODEX, CNVkit and exomeCopy. Two naive solutions (all samples as the reference set and random selection) as well as two clustering methods (k-means and k nearest neighbours (kNN) with a variable number of clusters or group sizes) were evaluated to find the best performing sample selection method. Results and Conclusions: The experiments showed that appropriate selection of the reference sample set may greatly improve the CNV detection rate. In particular, we found that smart reduction of the reference sample set size may significantly increase the algorithms' precision while having a negligible negative effect on sensitivity. We observed that a complete CNV calling process with the k-means algorithm as the selection method has significantly better time complexity than the kNN-based solution.
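
The two clustering-based selection strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of Euclidean distance on raw per-target read depth, the farthest-point initialisation of k-means, and the toy `depth` matrix are all assumptions made for illustration. With kNN, each sample gets its own reference set made of its k closest samples; with k-means, each sample uses the other members of its cluster.

```python
import numpy as np


def knn_reference_sets(depth, k):
    """Return, for each sample (row), the indices of its k nearest other samples."""
    depth = np.asarray(depth, dtype=float)
    # Pairwise Euclidean distances between read-depth profiles (assumed metric).
    diff = depth[:, None, :] - depth[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is never its own reference
    return np.argsort(dist, axis=1)[:, :k]


def kmeans_reference_sets(depth, n_clusters, n_iter=100):
    """Return, for each sample, the indices of the other samples in its cluster."""
    depth = np.asarray(depth, dtype=float)
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centers = [depth[0]]
    for _ in range(1, n_clusters):
        d = np.min([((depth - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(depth[d.argmax()])
    centers = np.array(centers)

    # Lloyd's iterations: assign samples to the nearest center, then recompute centers.
    for _ in range(n_iter):
        d = ((depth[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            depth[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    idx = np.arange(len(depth))
    return [idx[(labels == labels[i]) & (idx != i)] for i in idx]


# Toy read-depth matrix: 6 samples x 3 targets, two clearly separated groups.
depth = np.array([
    [100, 102, 98],
    [101, 99, 100],
    [99, 101, 102],
    [200, 198, 202],
    [201, 202, 199],
    [199, 200, 201],
])

knn_refs = knn_reference_sets(depth, k=2)
kmeans_refs = kmeans_reference_sets(depth, n_clusters=2)
```

In this sketch the kNN variant computes a full pairwise distance matrix and yields a separate reference set per sample, whereas the k-means variant yields one shared set per cluster. That difference suggests why the per-sample approach can cost more in a complete calling pipeline, although the abstract does not state the cause of the runtime gap it reports.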


Similar Items