Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement

Bibliographic Details
Main Authors: Carriço João A, Pinto Francisco R, Ramirez Mário, Almeida Jonas S
Format: Article
Language: English
Published: BMC 2007-02-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/8/44
author Carriço João A
Pinto Francisco R
Ramirez Mário
Almeida Jonas S
collection DOAJ
description Abstract

Background: Biological information is commonly used to cluster or classify entities of interest such as genes, conditions, species or samples. However, different sources of data can be used to classify the same set of entities. Methods that compare the performance of two data sources, or that determine how well a given classification agrees with another, are therefore frequently needed, especially in the absence of a universally accepted "gold standard" classification.

Results: Here, we describe a novel measure, the Ranked Adjusted Rand (RAR) index. RAR differs from existing methods by evaluating the extent of agreement between any two groupings while taking intercluster distances into account. This characteristic is relevant for pairs of entities grouped in the same cluster by one method and separated by another: the latter method may assign them to closely neighbouring clusters or, on the contrary, to clusters that are far apart. RAR is applicable even when intercluster distance information is absent for one or both of the groupings. When it is absent for both, RAR is equal to its predecessor, the Adjusted Rand (HA) index. Artificially designed clusterings were used to demonstrate situations in which only RAR was able to detect differences in the grouping patterns. A study with larger simulated clusterings confirmed that, under realistic conditions, RAR effectively integrates distance and partition information. The new method was applied to biological examples to compare 1) two microbial typing methods, 2) two gene regulatory network distances and 3) microarray gene expression data with pathway information. In the first application, one of the methods does not provide intercluster distances while the other produces a hierarchical clustering. RAR proved more sensitive than HA when choosing the threshold for defining clusters in the hierarchical method that maximizes agreement between the results of the two methods.

Conclusion: The major advantage of RAR is that it combines cluster distance and partition information, whereas the previously available methods used only the latter. RAR should be used in research problems where HA was previously used, because in the absence of intercluster distance effects it is an equally effective measure, and in the presence of distance effects it is a more complete one.
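The abstract names the Adjusted Rand (HA) index as the baseline that RAR reduces to when no intercluster distance information is available. This record does not give the RAR formula, so the following is only a minimal sketch of the standard Hubert-Arabie Adjusted Rand index, not of RAR itself. The function name `adjusted_rand` and the label-list input format are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Hubert-Arabie Adjusted Rand index for two partitions of the same items.

    labels_a[i] and labels_b[i] are the cluster labels that each grouping
    assigns to item i (hypothetical input format, chosen for this sketch).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about

    # Pairs of items placed together by both groupings (contingency table cells).
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together by each grouping on its own (table margins).
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / total_pairs  # agreement expected by chance
    max_index = (pairs_a + pairs_b) / 2         # agreement of identical partitions
    if max_index == expected:
        # Both groupings are trivially identical (all singletons or one cluster).
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


if __name__ == "__main__":
    print(adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, relabelled
    print(adjusted_rand([0, 0, 1, 1], [0, 0, 1, 2]))  # one cluster split in two
```

The index is 1 for identical groupings and close to 0 for agreement at chance level. It treats every separated pair alike, regardless of how far apart the two clusters are, which is the limitation the abstract says RAR addresses.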
first_indexed 2024-12-18T14:51:15Z
format Article
id doaj.art-adf2b61b29144333baa33c832f3ffc4c
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-18T14:51:15Z
publishDate 2007-02-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-adf2b61b29144333baa33c832f3ffc4c | 2022-12-21T21:04:10Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2007-02-01 | 8 | 1 | 44 | 10.1186/1471-2105-8-44 | Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement | Carriço João A; Pinto Francisco R; Ramirez Mário; Almeida Jonas S | http://www.biomedcentral.com/1471-2105/8/44
title Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement
url http://www.biomedcentral.com/1471-2105/8/44