On triangle inequalities of correlation-based distances for gene expression profiles
Abstract Background Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function would output a low value if the profiles are strongly correlated—either negatively or positively—and vice versa. One popular distance function is the absolute corre...
Main Authors: | Jiaxing Chen, Yen Kaow Ng, Lu Lin, Xianglilan Zhang, Shuaicheng Li |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2023-02-01 |
Series: | BMC Bioinformatics |
Subjects: | Correlation, Distance, Triangle inequality, Clustering, Gene expression analysis, Single cell |
Online Access: | https://doi.org/10.1186/s12859-023-05161-y |
_version_ | 1797811296530333696 |
---|---|
author | Jiaxing Chen Yen Kaow Ng Lu Lin Xianglilan Zhang Shuaicheng Li |
collection | DOAJ |
description | Abstract Background Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function would output a low value if the profiles are strongly correlated, either negatively or positively, and vice versa. One popular distance function is the absolute correlation distance, $$d_a=1-|\rho|$$, where $$\rho$$ is a similarity measure, such as Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangle inequality, which would have guaranteed better performance at vector quantization, allowed fast data localization, and accelerated data clustering. Results In this work, we propose $$d_r=\sqrt{1-|\rho|}$$ as an alternative. We prove that $$d_r$$ satisfies the triangle inequality when $$\rho$$ represents Pearson correlation, Spearman correlation, or Cosine similarity. We show $$d_r$$ to be better than $$d_s=\sqrt{1-\rho^2}$$, another variant of $$d_a$$ that satisfies the triangle inequality, both analytically and experimentally. We empirically compared $$d_r$$ with $$d_a$$ in gene clustering and sample clustering experiments on real-world biological data. The two distances performed similarly in both gene clustering and sample clustering with hierarchical clustering and PAM (partitioning around medoids) clustering. However, $$d_r$$ demonstrated more robust clustering. According to the bootstrap experiment, $$d_r$$ generated more robust sample pair partitions more frequently (P-value $$<0.05$$). The statistics on the time a class “dissolved” also support the advantage of $$d_r$$ in robustness. Conclusion $$d_r$$, as a variant of the absolute correlation distance, satisfies the triangle inequality and is capable of more robust clustering. |
first_indexed | 2024-03-13T07:21:34Z |
format | Article |
id | doaj.art-cb8d8a88663e4cdf8efab7ff149ecdff |
institution | Directory Open Access Journal |
issn | 1471-2105 |
language | English |
last_indexed | 2024-03-13T07:21:34Z |
publishDate | 2023-02-01 |
publisher | BMC |
record_format | Article |
series | BMC Bioinformatics |
spelling | doaj.art-cb8d8a88663e4cdf8efab7ff149ecdff; 2023-06-04T11:40:09Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2023-02-01; 24 1 1 16; 10.1186/s12859-023-05161-y; On triangle inequalities of correlation-based distances for gene expression profiles; Jiaxing Chen (0), Yen Kaow Ng (1), Lu Lin (2), Xianglilan Zhang (3), Shuaicheng Li (4); (0) Department of Computer Science, City University of Hong Kong; (1) Department of Computer Science, City University of Hong Kong; (2) Department of Computer Science, City University of Hong Kong; (3) State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology; (4) Department of Computer Science, City University of Hong Kong; https://doi.org/10.1186/s12859-023-05161-y; Correlation; Distance; Triangle inequality; Clustering; Gene expression analysis; Single cell |
title | On triangle inequalities of correlation-based distances for gene expression profiles |
topic | Correlation; Distance; Triangle inequality; Clustering; Gene expression analysis; Single cell |
url | https://doi.org/10.1186/s12859-023-05161-y |
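
The abstract in this record states that $$d_a=1-|\rho|$$ can violate the triangle inequality while $$d_r=\sqrt{1-|\rho|}$$ satisfies it. The following is a minimal illustrative sketch of that claim for Pearson correlation, not the authors' code: the helper names (`pearson`, `d_a`, `d_r`, `triangle_holds`) and the three example profiles are assumptions chosen for this illustration. The profiles are built so that `x` and `z` are uncorrelated while `y` has correlation of about 0.707 with each.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def d_a(x, y):
    """Absolute correlation distance, 1 - |rho|."""
    return 1.0 - abs(pearson(x, y))


def d_r(x, y):
    """Square-root variant, sqrt(1 - |rho|)."""
    # The clamp guards against |rho| exceeding 1 through rounding error.
    return math.sqrt(max(0.0, 1.0 - abs(pearson(x, y))))


def triangle_holds(d, x, y, z, tol=1e-12):
    """Check d(x, z) <= d(x, y) + d(y, z) up to a small tolerance."""
    return d(x, z) <= d(x, y) + d(y, z) + tol


# Hypothetical profiles: x and z are uncorrelated, y lies "between" them.
x = [1.0, -1.0, 0.0]
z = [1.0, 1.0, -2.0]
y = [a / math.sqrt(2) + b / math.sqrt(6) for a, b in zip(x, z)]

print("d_a triangle inequality holds:", triangle_holds(d_a, x, y, z))
print("d_r triangle inequality holds:", triangle_holds(d_r, x, y, z))

# Spot check of d_r on random profiles (illustration only, not a proof).
rng = random.Random(0)
violations = 0
for _ in range(1000):
    p, q, r = ([rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(3))
    if not triangle_holds(d_r, p, q, r):
        violations += 1
print("d_r violations in 1000 random triples:", violations)
```

For the example triple, $$d_a(x,z)=1$$ while $$d_a(x,y)+d_a(y,z)\approx 0.586$$, so the inequality fails; under $$d_r$$ the corresponding values are 1 and about 1.082, so it holds.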