Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata

Abstract. Background: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional met...

Bibliographic Details
Main Authors: Wei Hu, Amrapali Zaveri, Honglei Qiu, Michel Dumontier
Format: Article
Language: English
Published: BMC 2017-09-01
Series: BMC Bioinformatics
Subjects: GEO, Metadata, Data quality, Clustering, Biomedical, Experimental data
Online Access: http://link.springer.com/article/10.1186/s12859-017-1832-4
author Wei Hu
Amrapali Zaveri
Honglei Qiu
Michel Dumontier
author_sort Wei Hu
collection DOAJ
description Abstract. Background: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, because no structured vocabulary guides submitters on which metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues, including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured and complete description of the data. Methods: In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together. Results: Our agglomerative clustering algorithm identified metadata keys that were similar to each other, based on (i) name, (ii) core concept and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. The algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). In the 18 clusters, keys were correctly identified as related to their cluster, except for 13 keys that were not related to the cluster in which they were placed. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-Score (0.63). Conclusion: Our algorithm identified keys that were similar to each other and grouped them together. The intuition that underpins cleaning by clustering is that dividing keys into different clusters resolves the scalability issues for data observation and cleaning, and that duplicates and errors among keys in the same cluster can easily be found. Our algorithm can also be applied to other biomedical data types.
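The clustering idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not describe their three similarity measures in detail, so a simple character-level name similarity from the standard library (`difflib.SequenceMatcher`) is used as a stand-in, with average linkage and a hypothetical merge threshold of 0.6. The pairwise F-score shown is one common way to score a clustering against a gold standard and is likewise an assumption; the record does not say how the reported F-Score was computed. All function names and example keys are illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive character similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_similarity(c1, c2) -> float:
    """Average-linkage similarity: mean similarity over all cross-cluster pairs."""
    total = sum(name_similarity(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerative_cluster(keys, threshold=0.6):
    """Repeatedly merge the two most similar clusters.

    Stops when the best available merge scores below `threshold`
    (a hypothetical value chosen for illustration).
    """
    clusters = [[k] for k in keys]  # start with one cluster per key
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            score = cluster_similarity(clusters[i], clusters[j])
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break
        i, j = best_pair  # i < j, so deleting j leaves index i valid
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def pairwise_f_score(predicted, gold) -> float:
    """F-score over pairs of keys that share a cluster (assumed metric)."""
    def same_cluster_pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(c, 2)}

    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    true_positives = len(pred_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative keys only; not taken from GEO or the gold standard.
    example_keys = ["age", "Age", "age (years)", "tissue", "tissue type", "cell line"]
    for cluster in agglomerative_cluster(example_keys):
        print(cluster)
```

This naive loop recomputes every cluster pair on each pass, so it only suits small key sets; the abstract describes the authors' algorithm as scalable, which implies a more efficient design than this sketch.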
first_indexed 2024-12-11T14:38:02Z
format Article
id doaj.art-95b281a1929041f8a5d6ce36331bf14f
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-11T14:38:02Z
publishDate 2017-09-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-95b281a1929041f8a5d6ce36331bf14f; 2022-12-22T01:02:05Z; eng; BMC; BMC Bioinformatics; ISSN 1471-2105; 2017-09-01; vol. 18, no. 1, pp. 1-12; doi 10.1186/s12859-017-1832-4; Wei Hu (State Key Laboratory for Novel Software Technology, Nanjing University); Amrapali Zaveri (Institute of Data Science, Maastricht University); Honglei Qiu (State Key Laboratory for Novel Software Technology, Nanjing University); Michel Dumontier (Institute of Data Science, Maastricht University)
title Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata
topic GEO
Metadata
Data quality
Clustering
Biomedical
Experimental data
url http://link.springer.com/article/10.1186/s12859-017-1832-4