Simple integrative preprocessing preserves what is shared in data sources
<p>Abstract</p> <p>Background</p> <p>The bioinformatics data analysis toolbox needs general-purpose, fast and easily interpretable preprocessing tools that perform data integration during exploratory data analysis. Our focus is on vector-valued data sources, each consisting...
Main Authors: | Klami Arto, Tripathi Abhishek, Kaski Samuel |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2008-02-01 |
Series: | BMC Bioinformatics |
Online Access: | http://www.biomedcentral.com/1471-2105/9/111 |
_version_ | 1818163406486634496 |
---|---|
author | Klami Arto Tripathi Abhishek Kaski Samuel |
author_facet | Klami Arto Tripathi Abhishek Kaski Samuel |
author_sort | Klami Arto |
collection | DOAJ |
description | <p>Abstract</p> <p>Background</p> <p>The bioinformatics data analysis toolbox needs general-purpose, fast and easily interpretable preprocessing tools that perform data integration during exploratory data analysis. Our focus is on vector-valued data sources, each consisting of measurements of the same entity but on different variables, and on tasks where source-specific variation is considered noisy or not interesting. Principal component analysis of all sources combined is an obvious choice if it is not important to distinguish between source-specific and shared variation. Canonical Correlation Analysis (CCA) focuses on mutual dependencies and discards source-specific "noise", but it produces a separate set of components for each source.</p> <p>Results</p> <p>It turns out that the components given by CCA can be combined easily to produce a linear, and hence fast and easily interpretable, feature extraction method. The method fuses several sources such that the properties they share are preserved. Source-specific variation is discarded as uninteresting. We give the details and implement them in a software tool. The method is demonstrated on gene expression measurements in three case studies: classification of cell cycle regulated genes in yeast, identification of differentially expressed genes in leukemia, and defining stress response in yeast. The software package is available at <url>http://www.cis.hut.fi/projects/mi/software/drCCA/</url>.</p> <p>Conclusion</p> <p>We introduced a method for data fusion in exploratory data analysis, for cases where statistical dependencies between the sources, and not within a source, are interesting. The method uses canonical correlation analysis in a new way for dimensionality reduction, and inherits its good properties of being simple, fast, and easily interpretable as a linear projection.</p> |
first_indexed | 2024-12-11T16:49:03Z |
format | Article |
id | doaj.art-96332db2105341cdae85316e7831f8c5 |
institution | Directory Open Access Journal |
issn | 1471-2105 |
language | English |
last_indexed | 2024-12-11T16:49:03Z |
publishDate | 2008-02-01 |
publisher | BMC |
record_format | Article |
series | BMC Bioinformatics |
spelling | doaj.art-96332db2105341cdae85316e7831f8c5 2022-12-22T00:58:08Z eng BMC BMC Bioinformatics 1471-2105 2008-02-01 91111 10.1186/1471-2105-9-111 Simple integrative preprocessing preserves what is shared in data sources Klami Arto Tripathi Abhishek Kaski Samuel http://www.biomedcentral.com/1471-2105/9/111 |
spellingShingle | Klami Arto Tripathi Abhishek Kaski Samuel Simple integrative preprocessing preserves what is shared in data sources BMC Bioinformatics |
title | Simple integrative preprocessing preserves what is shared in data sources |
title_full | Simple integrative preprocessing preserves what is shared in data sources |
title_fullStr | Simple integrative preprocessing preserves what is shared in data sources |
title_full_unstemmed | Simple integrative preprocessing preserves what is shared in data sources |
title_short | Simple integrative preprocessing preserves what is shared in data sources |
title_sort | simple integrative preprocessing preserves what is shared in data sources |
url | http://www.biomedcentral.com/1471-2105/9/111 |
work_keys_str_mv | AT klamiarto simpleintegrativepreprocessingpreserveswhatissharedindatasources AT tripathiabhishek simpleintegrativepreprocessingpreserveswhatissharedindatasources AT kaskisamuel simpleintegrativepreprocessingpreserveswhatissharedindatasources |
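
The abstract describes combining CCA components into a single linear projection that keeps what the sources share and discards source-specific variation. A minimal sketch of one way to realize that idea follows, assuming the combination is done by whitening each source separately and then running PCA on the concatenated whitened data. This is an illustration, not the drCCA package's actual code; the function names `whiten` and `fuse_sources` and the `eps` threshold are hypothetical.

```python
import numpy as np


def whiten(X, eps=1e-8):
    """Center X (samples x features) and transform it to identity covariance.

    Directions whose variance is below `eps` are dropped to avoid dividing
    by (near) zero. `eps` is an illustrative choice, not a tuned value.
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    vals, vecs = np.linalg.eigh(cov)
    keep = vals > eps
    return Xc @ (vecs[:, keep] / np.sqrt(vals[keep]))


def fuse_sources(sources, n_components):
    """Fuse co-occurring sources into one low-dimensional representation.

    Assumption: every array in `sources` has the same number of rows, and
    row i in each array describes the same entity.
    """
    # After whitening, within-source variance is the same in every direction,
    # so only between-source dependencies can make a direction stand out.
    Z = np.hstack([whiten(X) for X in sources])
    # PCA of the concatenation via SVD; Z is already centered.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    # Linear projection onto the leading (most shared) directions.
    return Z @ Vt[:n_components].T
```

Because whitening removes all within-source variance structure, the leading principal directions of the concatenated data are driven by correlations between sources. The whole pipeline remains a single linear map per source, which is the property the abstract highlights as fast and easy to interpret.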