An ontology-guided semantic data integration framework to support integrative data analysis of cancer survival

Abstract Background Cancer is the second leading cause of death in the United States, exceeded only by heart disease. Extant cancer survival analyses have primarily focused on individual-level factors due to limited data availability from a single data source. There is a need to integrate data from...

Bibliographic Details
Main Authors: Hansi Zhang, Yi Guo, Qian Li, Thomas J. George, Elizabeth Shenkman, François Modave, Jiang Bian
Format: Article
Language: English
Published: BMC 2018-07-01
Series: BMC Medical Informatics and Decision Making
Subjects: Semantic data integration; Ontology; Semantic web; Cancer survival; Integrative data analysis
Online Access: http://link.springer.com/article/10.1186/s12911-018-0636-4
author Hansi Zhang
Yi Guo
Qian Li
Thomas J. George
Elizabeth Shenkman
François Modave
Jiang Bian
collection DOAJ
description Abstract
Background: Cancer is the second leading cause of death in the United States, exceeded only by heart disease. Extant cancer survival analyses have primarily focused on individual-level factors because of the limited data available from any single data source. There is a need to integrate data from different sources to study as many risk factors as possible simultaneously. Thus, we proposed an ontology-based approach to integrate heterogeneous datasets and address key data integration challenges.
Methods: Following best practices in ontology engineering, we created the Ontology for Cancer Research Variables (OCRV) by adapting existing semantic resources such as the National Cancer Institute (NCI) Thesaurus. Using the global-as-view data integration approach, we created mapping axioms to link the data elements in different sources to OCRV. We built a data integration pipeline on the Ontop platform to query, extract, and transform data in relational databases using semantic queries, producing a pooled dataset that meets the needs of the downstream multi-level Integrative Data Analysis (IDA).
Results: Based on our use cases in the cancer survival IDA, we created tailored ontological structures in OCRV to facilitate the data integration tasks. Specifically, we created a flexible framework addressing key integration challenges: (1) using a shared, controlled vocabulary to make data understandable to both humans and computers, (2) explicitly modeling the semantic relationships so that it is possible to compute and reason with the data, (3) linking patients to contextual and environmental factors through geographic variables, and (4) documenting the data manipulation and integration processes clearly in the ontologies.
Conclusions: Using an ontology-based data integration approach not only standardizes the definitions of data variables through a common, controlled vocabulary, but also makes the semantic relationships among variables from different sources explicit and clear to all users of the same datasets. Such an approach resolves the ambiguity in the variable selection, extraction, and integration processes and thus improves the reproducibility of the IDA.
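The global-as-view idea named in the abstract, together with the linking of patients to contextual factors through a geographic variable, can be illustrated with a small sketch. This is plain Python, not Ontop mapping syntax and not the authors' pipeline; the table names, column names, sample values, and the `ocrv:` term names are all hypothetical and chosen only for illustration.

```python
"""Toy global-as-view (GAV) integration.

Each global (ontology) term is defined as a view over one source column.
All names and values below are hypothetical, not taken from the paper.
"""

# Hypothetical individual-level source (e.g. a cancer registry extract).
REGISTRY = [
    {"pid": "p1", "dx_age": 61, "vital": "dead", "cnty": "12001"},
    {"pid": "p2", "dx_age": 47, "vital": "alive", "cnty": "12086"},
    {"pid": "p3", "dx_age": 70, "vital": "alive", "cnty": "99999"},
]

# Hypothetical contextual-level source keyed by a county code.
COUNTY_SURVEY = [
    {"fips": "12001", "smoking_pct": 14.2},
    {"fips": "12086", "smoking_pct": 11.5},
]

# GAV mappings: global term -> (source table, source column).
MAPPINGS = {
    "ocrv:patient_id": ("registry", "pid"),
    "ocrv:age_at_diagnosis": ("registry", "dx_age"),
    "ocrv:vital_status": ("registry", "vital"),
    "ocrv:county_code": ("registry", "cnty"),
    "ocrv:county_smoking_rate": ("county_survey", "smoking_pct"),
}


def pooled_dataset(registry, county_survey, mappings):
    """Return one record per patient, expressed only in global terms.

    The join between the two sources goes through the geographic key
    (registry 'cnty' = county_survey 'fips'), an assumption of this sketch.
    """
    county_by_fips = {row["fips"]: row for row in county_survey}
    pooled = []
    for patient in registry:
        # Patients whose county is absent from the survey get an empty row.
        county = county_by_fips.get(patient["cnty"], {})
        sources = {"registry": patient, "county_survey": county}
        record = {
            term: sources[table].get(column)
            for term, (table, column) in mappings.items()
        }
        pooled.append(record)
    return pooled


if __name__ == "__main__":
    for row in pooled_dataset(REGISTRY, COUNTY_SURVEY, MAPPINGS):
        print(row)
```

Because downstream analysis sees only the global terms, a source column can be renamed or a source replaced by editing the mapping table alone; a patient whose county code has no match keeps a missing value rather than being dropped.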
first_indexed 2024-12-21T02:10:28Z
format Article
id doaj.art-19466a0e94b04b2b90a43aa06457525e
institution Directory Open Access Journal
issn 1472-6947
language English
last_indexed 2024-12-21T02:10:28Z
publishDate 2018-07-01
publisher BMC
record_format Article
series BMC Medical Informatics and Decision Making
spelling doaj.art-19466a0e94b04b2b90a43aa06457525e 2022-12-21T19:19:23Z eng
BMC Medical Informatics and Decision Making (BMC), ISSN 1472-6947, 2018-07-01, volume 18, issue S2, pages 129-147, doi:10.1186/s12911-018-0636-4
Hansi Zhang: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida
Yi Guo: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida
Qian Li: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida
Thomas J. George: Division of Hematology and Oncology, Department of Medicine, College of Medicine, University of Florida
Elizabeth Shenkman: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida
François Modave: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida
Jiang Bian: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida
title An ontology-guided semantic data integration framework to support integrative data analysis of cancer survival
topic Semantic data integration
Ontology
Semantic web
Cancer survival
Integrative data analysis
url http://link.springer.com/article/10.1186/s12911-018-0636-4