Simple integrative preprocessing preserves what is shared in data sources

Abstract. Background: Bioinformatics data analysis toolbox needs general-purpose, fast and easily interpretable preprocessing tools that perform data integration during exploratory data analysis. Our focus is on vector-valued data sources, each consisting...


Bibliographic Details
Main Authors: Klami Arto, Tripathi Abhishek, Kaski Samuel
Format: Article
Language: English
Published: BMC 2008-02-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/9/111
_version_ 1818163406486634496
author Klami Arto
Tripathi Abhishek
Kaski Samuel
author_facet Klami Arto
Tripathi Abhishek
Kaski Samuel
author_sort Klami Arto
collection DOAJ
description Abstract
Background: Bioinformatics data analysis toolbox needs general-purpose, fast and easily interpretable preprocessing tools that perform data integration during exploratory data analysis. Our focus is on vector-valued data sources, each consisting of measurements of the same entity but on different variables, and on tasks where source-specific variation is considered noisy or not interesting. Principal components analysis of all sources combined together is an obvious choice if it is not important to distinguish between data source-specific and shared variation. Canonical Correlation Analysis (CCA) focuses on mutual dependencies and discards source-specific "noise" but it produces a separate set of components for each source.
Results: It turns out that components given by CCA can be combined easily to produce a linear and hence fast and easily interpretable feature extraction method. The method fuses together several sources, such that the properties they share are preserved. Source-specific variation is discarded as uninteresting. We give the details and implement them in a software tool. The method is demonstrated on gene expression measurements in three case studies: classification of cell cycle regulated genes in yeast, identification of differentially expressed genes in leukemia, and defining stress response in yeast. The software package is available at http://www.cis.hut.fi/projects/mi/software/drCCA/.
Conclusion: We introduced a method for the task of data fusion for exploratory data analysis, when statistical dependencies between the sources and not within a source are interesting. The method uses canonical correlation analysis in a new way for dimensionality reduction, and inherits its good properties of being simple, fast, and easily interpretable as a linear projection.
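The abstract does not spell out how the CCA components are combined, so the following is only an illustrative sketch under a stated assumption: each source is whitened separately, the whitened sources are concatenated, and ordinary PCA is applied to the result (a standard formulation of generalized CCA). The function names `whiten`, `top_pcs` and `fuse_sources` and the synthetic data are hypothetical and are not taken from the drCCA package. The sketch also contrasts the fused projection with plain PCA on the raw concatenation, which tends to follow the largest source-specific variance.

```python
import numpy as np


def whiten(X, eps=1e-8):
    """Center X (samples x features) and transform it to identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    evals, evecs = np.linalg.eigh(cov)
    keep = evals > eps  # drop numerically degenerate directions
    return Xc @ (evecs[:, keep] / np.sqrt(evals[keep]))


def top_pcs(Z, n_components):
    """Project Z onto its leading principal components (linear projection)."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:n_components].T


def fuse_sources(sources, n_components):
    """Assumed fusion recipe: whiten each source, concatenate, then PCA."""
    Z = np.hstack([whiten(X) for X in sources])
    return top_pcs(Z, n_components)


# Synthetic example: one weak shared signal, strong source-specific noise.
rng = np.random.default_rng(0)
n = 500
shared = rng.normal(size=(n, 1))
X1 = np.hstack([shared + 0.1 * rng.normal(size=(n, 1)), 5.0 * rng.normal(size=(n, 3))])
X2 = np.hstack([shared + 0.1 * rng.normal(size=(n, 1)), 5.0 * rng.normal(size=(n, 3))])

fused = fuse_sources([X1, X2], n_components=1)
plain = top_pcs(np.hstack([X1, X2]), n_components=1)

corr_fused = abs(np.corrcoef(fused[:, 0], shared[:, 0])[0, 1])
corr_pca = abs(np.corrcoef(plain[:, 0], shared[:, 0])[0, 1])
print("correlation with shared signal, fused:", corr_fused)
print("correlation with shared signal, plain PCA:", corr_pca)
```

In this toy setting the whitening step removes the scale advantage of the source-specific noise, so the leading component of the fused representation is driven by what the two sources have in common, while plain PCA is driven by whichever directions have the largest raw variance.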
first_indexed 2024-12-11T16:49:03Z
format Article
id doaj.art-96332db2105341cdae85316e7831f8c5
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-11T16:49:03Z
publishDate 2008-02-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-96332db2105341cdae85316e7831f8c5 2022-12-22T00:58:08Z eng BMC BMC Bioinformatics 1471-2105 2008-02-01 91111 10.1186/1471-2105-9-111 Simple integrative preprocessing preserves what is shared in data sources Klami Arto Tripathi Abhishek Kaski Samuel http://www.biomedcentral.com/1471-2105/9/111
title Simple integrative preprocessing preserves what is shared in data sources
url http://www.biomedcentral.com/1471-2105/9/111