TCGA Expedition: A Data Acquisition and Management System for TCGA Data.

The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data are currently over 1.2 petabytes in size and include w...

Bibliographic Details
Main Authors: Uma R Chandran, Olga P Medvedeva, M Michael Barmada, Philip D Blood, Anish Chakka, Soumya Luthra, Antonio Ferreira, Kim F Wong, Adrian V Lee, Zhihui Zhang, Robert Budden, J Ray Scott, Annerose Berndt, Jeremy M Berg, Rebecca S Jacobson
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2016-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC5082933?pdf=render
author Uma R Chandran
Olga P Medvedeva
M Michael Barmada
Philip D Blood
Anish Chakka
Soumya Luthra
Antonio Ferreira
Kim F Wong
Adrian V Lee
Zhihui Zhang
Robert Budden
J Ray Scott
Annerose Berndt
Jeremy M Berg
Rebecca S Jacobson
collection DOAJ
description The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data are currently over 1.2 petabytes in size and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but many challenges exist in navigating and using data obtained from these sites. We developed TCGA Expedition to support the research community focused on computational methods for cancer research. Data obtained, versioned, and archived using TCGA Expedition support command-line access at high-performance computing facilities as well as some functionality with third-party tools. For a subset of TCGA data collected at the University of Pittsburgh, we also re-associate TCGA data with de-identified data from the electronic health records. Here we describe the software as well as the architecture of our repository, methods for loading TCGA data to multiple platforms, and security and regulatory controls that conform to federal best practices. TCGA Expedition software consists of a set of scripts written in Bash, Python, and Java that download, extract, harmonize, version, and store all TCGA data and metadata. The software generates a versioned, participant- and sample-centered, local TCGA data directory with metadata structures that directly reference the local data files as well as the original data files. The software supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools.
Using this software, we created a collaborative repository, the Pittsburgh Genome Resource Repository (PGRR), that enabled investigators at our institution to work with all TCGA data formats and to interrogate these data with analysis pipelines and associated tools. WGS data are especially challenging for individual investigators to use because of issues with downloading, storage, and processing; having locally accessible WGS BAM files has proven invaluable. Our open-source, freely available TCGA Expedition software can be used to create a local collaborative infrastructure for acquiring, managing, and analyzing TCGA data and other large public datasets.
first_indexed 2024-04-13T09:01:24Z
format Article
id doaj.art-cbff19bf021f4dc4a1edfcf54f7fcaea
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-04-13T09:01:24Z
publishDate 2016-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-cbff19bf021f4dc4a1edfcf54f7fcaea; 2022-12-22T02:53:08Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2016-01-01; vol. 11, no. 10, e0165395; doi:10.1371/journal.pone.0165395
title TCGA Expedition: A Data Acquisition and Management System for TCGA Data.
url http://europepmc.org/articles/PMC5082933?pdf=render