On triangle inequalities of correlation-based distances for gene expression profiles

Abstract Background Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function would output a low value if the profiles are strongly correlated—either negatively or positively—and vice versa. One popular distance function is the absolute corre...


Bibliographic Details
Main Authors: Jiaxing Chen, Yen Kaow Ng, Lu Lin, Xianglilan Zhang, Shuaicheng Li
Format: Article
Language: English
Published: BMC 2023-02-01
Series: BMC Bioinformatics
Subjects: Correlation; Distance; Triangle inequality; Clustering; Gene expression analysis; Single cell
Online Access: https://doi.org/10.1186/s12859-023-05161-y
author Jiaxing Chen
Yen Kaow Ng
Lu Lin
Xianglilan Zhang
Shuaicheng Li
collection DOAJ
description Abstract. Background: Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function should output a low value if the profiles are strongly correlated, either negatively or positively, and a high value otherwise. One popular distance function is the absolute correlation distance, $d_a = 1 - |\rho|$, where $\rho$ is a similarity measure such as Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangle inequality, which would have guaranteed better performance in vector quantization, allowed fast data localization, and accelerated data clustering. Results: In this work, we propose $d_r = \sqrt{1 - |\rho|}$ as an alternative. We prove that $d_r$ satisfies the triangle inequality when $\rho$ represents Pearson correlation, Spearman correlation, or cosine similarity. We show, both analytically and experimentally, that $d_r$ is better than $d_s = \sqrt{1 - \rho^2}$, another variant of $d_a$ that satisfies the triangle inequality. We empirically compared $d_r$ with $d_a$ in gene clustering and sample clustering experiments on real-world biological data. The two distances performed similarly in both gene clustering and sample clustering under hierarchical clustering and PAM (partitioning around medoids) clustering. However, $d_r$ demonstrated more robust clustering. In the bootstrap experiment, $d_r$ generated more robust sample pair partitions more frequently (P-value $< 0.05$). The statistics on the time at which a class "dissolved" also support the advantage of $d_r$ in robustness. Conclusion: $d_r$, as a variant of the absolute correlation distance, satisfies the triangle inequality and is capable of more robust clustering.
first_indexed 2024-03-13T07:21:34Z
format Article
id doaj.art-cb8d8a88663e4cdf8efab7ff149ecdff
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-03-13T07:21:34Z
publishDate 2023-02-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-cb8d8a88663e4cdf8efab7ff149ecdff
2023-06-04T11:40:09Z
eng
BMC
BMC Bioinformatics
1471-2105
2023-02-01
24
1
1
16
10.1186/s12859-023-05161-y
On triangle inequalities of correlation-based distances for gene expression profiles
Jiaxing Chen (Department of Computer Science, City University of Hong Kong)
Yen Kaow Ng (Department of Computer Science, City University of Hong Kong)
Lu Lin (Department of Computer Science, City University of Hong Kong)
Xianglilan Zhang (State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology)
Shuaicheng Li (Department of Computer Science, City University of Hong Kong)
title On triangle inequalities of correlation-based distances for gene expression profiles
topic Correlation
Distance
Triangle inequality
Clustering
Gene expression analysis
Single cell
url https://doi.org/10.1186/s12859-023-05161-y
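
The three distances named in the description can be compared on a single triple of profiles. The following is a minimal sketch, not the authors' code: it assumes cosine similarity as $\rho$ (one of the measures the abstract lists for $d_r$), and the vectors `x`, `y`, `z` and the helper `satisfies_triangle` are hypothetical names chosen for illustration. For this triple, $d_a$ violates the triangle inequality while $d_r$ and $d_s$ do not.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def d_a(rho):
    """Absolute correlation distance: 1 - |rho|."""
    return 1.0 - abs(rho)


def d_r(rho):
    """Proposed distance: sqrt(1 - |rho|), clamped against rounding error."""
    return math.sqrt(max(0.0, 1.0 - abs(rho)))


def d_s(rho):
    """Squared-correlation variant: sqrt(1 - rho^2), clamped likewise."""
    return math.sqrt(max(0.0, 1.0 - rho * rho))


def satisfies_triangle(dist, x, y, z, tol=1e-12):
    """Check dist(x, z) <= dist(x, y) + dist(y, z) for one triple only."""
    d_xz = dist(cosine_similarity(x, z))
    d_xy = dist(cosine_similarity(x, y))
    d_yz = dist(cosine_similarity(y, z))
    return d_xz <= d_xy + d_yz + tol


# Hypothetical profiles: y lies at 45 degrees between orthogonal x and z.
x, y, z = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

for name, dist in (("d_a", d_a), ("d_r", d_r), ("d_s", d_s)):
    print(name, satisfies_triangle(dist, x, y, z))
```

A single triple can only demonstrate a violation; it does not prove that $d_r$ or $d_s$ always satisfies the inequality, which is what the article establishes analytically.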