Optimal tuning of weighted kNN- and diffusion-based methods for denoising single cell genomics data.

Bibliographic Details
Main Authors: Andreas Tjärnberg, Omar Mahmood, Christopher A Jackson, Giuseppe-Antonio Saldi, Kyunghyun Cho, Lionel A Christiaen, Richard A Bonneau
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2021-01-01
Series: PLoS Computational Biology
Online Access: https://doi.org/10.1371/journal.pcbi.1008569
author Andreas Tjärnberg
Omar Mahmood
Christopher A Jackson
Giuseppe-Antonio Saldi
Kyunghyun Cho
Lionel A Christiaen
Richard A Bonneau
collection DOAJ
description The analysis of single-cell genomics data presents several statistical challenges, and extensive efforts have been made to produce methods for the analysis of this data that impute missing values, address sampling issues and quantify and correct for noise. In spite of such efforts, no consensus on best practices has been established and all current approaches vary substantially based on the available data and empirical tests. The k-Nearest Neighbor Graph (kNN-G) is often used to infer the identities of, and relationships between, cells and is the basis of many widely used dimensionality-reduction and projection methods. The kNN-G has also been the basis for imputation methods using, e.g., neighbor averaging and graph diffusion. However, due to the lack of an agreed-upon optimal objective function for choosing hyperparameters, these methods tend to oversmooth data, thereby resulting in a loss of information with regard to cell identity and the specific gene-to-gene patterns underlying regulatory mechanisms. In this paper, we investigate the tuning of kNN- and diffusion-based denoising methods with a novel non-stochastic method for optimally preserving biologically relevant informative variance in single-cell data. The framework, Denoising Expression data with a Weighted Affinity Kernel and Self-Supervision (DEWÄKSS), uses a self-supervised technique to tune its parameters. We demonstrate that denoising with optimal parameters selected by our objective function (i) is robust to preprocessing methods using data from established benchmarks, (ii) disentangles cellular identity and maintains robust clusters over dimension-reduction methods, (iii) maintains variance along several expression dimensions, unlike previous heuristic-based methods that tend to oversmooth data variance, and (iv) rarely involves diffusion but rather uses a fixed weighted kNN graph for denoising. 
Together, these findings provide a new understanding of kNN- and diffusion-based denoising methods. Code and example data for DEWÄKSS are available at https://gitlab.com/Xparx/dewakss/-/tree/Tjarnberg2020branch.
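The tuning idea in the description above can be illustrated with a minimal sketch. It is not the DEWÄKSS implementation or its API. It assumes a dense NumPy expression matrix (cells x genes), a Gaussian-weighted kNN graph with no self-edges, and a self-supervised score given by the mean squared error between each cell and its prediction from neighbors only. The function names (`knn_affinity`, `self_supervised_mse`, `tune`), the kernel bandwidth choice, and the grid search are all hypothetical.

```python
import numpy as np


def knn_affinity(X, k):
    """Row-normalized weighted kNN matrix with a zero diagonal (illustrative)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped to avoid tiny negatives.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    np.fill_diagonal(d2, np.inf)  # a cell is never its own neighbor
    idx = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest cells
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    nd2 = d2[rows, cols].reshape(n, k)
    # Assumed bandwidth: mean squared distance to the k neighbors of each cell.
    sigma2 = nd2.mean(axis=1, keepdims=True) + 1e-12
    w = np.exp(-nd2 / sigma2)
    W = np.zeros((n, n))
    W[rows, cols] = w.ravel()
    return W / W.sum(axis=1, keepdims=True)


def self_supervised_mse(X, W, steps=1):
    """MSE between X and its prediction from neighbors, never from itself."""
    M = np.linalg.matrix_power(W, steps)  # steps > 1 corresponds to diffusion
    np.fill_diagonal(M, 0.0)  # remove self-contribution created by diffusion
    M = M / M.sum(axis=1, keepdims=True)
    return float(np.mean((X - M @ X) ** 2))


def tune(X, ks, steps_list):
    """Grid search returning (k, steps, mse) with the lowest score."""
    best = None
    for k in ks:
        W = knn_affinity(X, k)
        for t in steps_list:
            mse = self_supervised_mse(X, W, steps=t)
            if best is None or mse < best[2]:
                best = (k, t, mse)
    return best
```

Calling `tune(X, ks=[5, 10, 20], steps_list=[1, 2, 3])` on a float matrix with more rows than the largest `k` returns the selected neighbor count, number of diffusion steps, and score. Because every prediction excludes the cell itself, the score cannot be lowered by simply copying the input, which is what allows it to penalize both undersmoothing and oversmoothing.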
first_indexed 2024-12-23T19:22:56Z
format Article
id doaj.art-a8a4056d539749db956875c60e591791
institution Directory Open Access Journal
issn 1553-734X
1553-7358
language English
last_indexed 2024-12-23T19:22:56Z
publishDate 2021-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS Computational Biology
spelling doaj.art-a8a4056d539749db956875c60e591791 | 2022-12-21T17:34:06Z | eng | Public Library of Science (PLoS) | PLoS Computational Biology | 1553-734X, 1553-7358 | 2021-01-01 | vol. 17, no. 1, e1008569 | 10.1371/journal.pcbi.1008569 | https://doi.org/10.1371/journal.pcbi.1008569
title Optimal tuning of weighted kNN- and diffusion-based methods for denoising single cell genomics data.
url https://doi.org/10.1371/journal.pcbi.1008569