Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata
Abstract Background: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional met...
Main Authors: | Wei Hu, Amrapali Zaveri, Honglei Qiu, Michel Dumontier |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2017-09-01 |
Series: | BMC Bioinformatics |
Subjects: | GEO; Metadata; Data quality; Clustering; Biomedical; Experimental data |
Online Access: | http://link.springer.com/article/10.1186/s12859-017-1832-4 |
_version_ | 1818155163652718592 |
---|---|
author | Wei Hu, Amrapali Zaveri, Honglei Qiu, Michel Dumontier |
author_sort | Wei Hu |
collection | DOAJ |
description | Abstract Background: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, because there is no structured vocabulary to guide submitters on which metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues, including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured and complete description of the data. Methods: In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together. Results: Our agglomerative clustering algorithm identified metadata keys that were similar to each other, based on (i) name, (ii) core concept and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. The algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Within these 18 clusters, keys were correctly identified as related to their cluster, except for 13 keys that were not related to the cluster in which they were placed. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-Score (0.63). Conclusion: Our algorithm identified keys that were similar to each other and grouped them together. The intuition underpinning cleaning by clustering is that dividing keys into different clusters resolves the scalability issues for data observation and cleaning, and that duplicates and errors among keys in the same cluster can easily be found. Our algorithm can also be applied to other biomedical data types. |
first_indexed | 2024-12-11T14:38:02Z |
format | Article |
id | doaj.art-95b281a1929041f8a5d6ce36331bf14f |
institution | Directory Open Access Journal |
issn | 1471-2105 |
language | English |
last_indexed | 2024-12-11T14:38:02Z |
publishDate | 2017-09-01 |
publisher | BMC |
record_format | Article |
series | BMC Bioinformatics |
spelling | doaj.art-95b281a1929041f8a5d6ce36331bf14f; 2022-12-22T01:02:05Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2017-09-01; 18; 1; 1; 12; 10.1186/s12859-017-1832-4; Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata; Wei Hu (State Key Laboratory for Novel Software Technology, Nanjing University); Amrapali Zaveri (Institute of Data Science, Maastricht University); Honglei Qiu (State Key Laboratory for Novel Software Technology, Nanjing University); Michel Dumontier (Institute of Data Science, Maastricht University); http://link.springer.com/article/10.1186/s12859-017-1832-4; GEO; Metadata; Data quality; Clustering; Biomedical; Experimental data |
title | Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata |
topic | GEO; Metadata; Data quality; Clustering; Biomedical; Experimental data |
url | http://link.springer.com/article/10.1186/s12859-017-1832-4 |
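
The abstract describes comparing metadata keys with similarity measures, grouping similar keys by agglomerative clustering, and scoring the result with an F-Score. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes a single name-based similarity (Python's `difflib.SequenceMatcher`, standing in for only one of the three measures the record mentions), average linkage, and an arbitrary threshold of 0.6. The function names `name_similarity`, `cluster_keys` and `f_score`, and the sample keys, are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized key names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def cluster_keys(keys, threshold=0.6):
    """Average-linkage agglomerative clustering of metadata keys.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    clusters = [[k] for k in keys]  # start with one cluster per key
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            score = sum(name_similarity(a, b) for a, b in pairs) / len(pairs)
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # no remaining pair of clusters is similar enough
        i, j = best_pair  # i < j, so deleting j leaves index i valid
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def f_score(predicted, gold):
    """F1 of one predicted cluster against one gold-standard cluster."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical keys of the kind the abstract describes (e.g. variants of "age").
keys = ["age", "Age", "age (years)", "tissue", "Tissue type", "cell line"]
clusters = cluster_keys(keys)
```

Recomputing every pairwise average on each pass keeps the sketch short but scales poorly; handling the millions of key-value pairs mentioned in the abstract would need the scalable design the paper refers to, which this sketch does not attempt to reproduce.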