Genome Data Exploration Using Correspondence Analysis

Recent developments in sequencing technologies that allow the production of massive amounts of genomic and genotyping data have highlighted the need for synthetic data representation and pattern recognition methods that can mine and help discover biologically meaningful knowledge included in such...

Bibliographic Details
Main Author: Fredj Tekaia
Format: Article
Language: English
Published: SAGE Publishing 2016-01-01
Series: Bioinformatics and Biology Insights
Online Access: https://doi.org/10.4137/BBI.S39614
author Fredj Tekaia
collection DOAJ
description Recent developments in sequencing technologies that allow the production of massive amounts of genomic and genotyping data have highlighted the need for synthetic data representation and pattern recognition methods that can mine and help discover biologically meaningful knowledge contained in such large data sets. Correspondence analysis (CA) is an exploratory descriptive method designed to analyze two-way data tables containing some measure of association between rows and columns. It constructs linear combinations of variables, known as factors. CA has been used for decades to study high-dimensional data, and remarkable inferences from large data tables have been obtained by reducing the dimensionality to a few orthogonal factors that account for the largest amount of variability in the data. Herein, I review CA and highlight its use through examples of high-dimensional data that can be constructed from genomic and genetic studies. The examples cover amino acid compositions of large sets of species (viruses, phages, yeast, and fungi) and pairwise shared orthologs in a set of yeast and fungal species, as obtained from their proteome comparisons. For the first time, results show striking segregations between yeasts and fungi as well as between viruses and phages. Distributions obtained from shared orthologs show clusters of yeast and fungal species corresponding to their phylogenetic relationships. A direct comparison with the principal component analysis method is discussed using a recently published example of genotyping data related to newly discovered traces of an ancient hominid that was compared to modern human populations in the search for ancestral similarities. CA offers more detailed results, highlighting links between modern humans and the ancient hominid and their characterizations. Compared to the popular principal component analysis method, CA allows easier and more effective interpretation of results, particularly through its ability to relate individual patterns to their corresponding characteristic variables.
first_indexed 2024-12-14T07:52:50Z
format Article
id doaj.art-4cb46c7fe6084b068777188fc863deb3
institution Directory Open Access Journal
issn 1177-9322
language English
last_indexed 2024-12-14T07:52:50Z
publishDate 2016-01-01
publisher SAGE Publishing
record_format Article
series Bioinformatics and Biology Insights
spelling doaj.art-4cb46c7fe6084b068777188fc863deb3; 2022-12-21T23:10:38Z; eng; SAGE Publishing; Bioinformatics and Biology Insights; 1177-9322; 2016-01-01; 10; 10.4137/BBI.S39614; Genome Data Exploration Using Correspondence Analysis; Fredj Tekaia (Institut Pasteur, Unit of Structural Microbiology, CNRS URA 3528 and University Paris Diderot, Sorbonne Paris Cité, Paris, France); https://doi.org/10.4137/BBI.S39614
title Genome Data Exploration Using Correspondence Analysis
url https://doi.org/10.4137/BBI.S39614
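
The description characterizes CA as a method that takes a two-way table and constructs a few orthogonal factors capturing most of the variability. A minimal sketch of the standard textbook computation follows, using a singular value decomposition of the standardized residuals. It is not taken from the article: the function name `correspondence_analysis`, the parameter `n_factors`, and the toy count table are all assumptions made for illustration.

```python
import numpy as np


def correspondence_analysis(table, n_factors=2):
    """Textbook CA of a two-way table of non-negative counts (illustrative sketch)."""
    counts = np.asarray(table, dtype=float)
    P = counts / counts.sum()          # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    expected = np.outer(r, c)          # frequencies expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_factors
    # Principal coordinates of rows and columns on the first k factors.
    row_coords = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]
    col_coords = (Vt.T[:, :k] * sv[:k]) / np.sqrt(c)[:, None]
    inertia = sv ** 2                  # variability carried by each factor
    return row_coords, col_coords, inertia


# Hypothetical toy table: 3 rows (e.g. species) x 4 columns (e.g. amino acids).
# The numbers are made up and do not come from the article.
toy_counts = np.array([
    [20, 10, 5, 15],
    [10, 25, 10, 5],
    [5, 10, 30, 5],
])
rows, cols, inertia = correspondence_analysis(toy_counts)
print("share of inertia per factor:", inertia / inertia.sum())
```

Because rows and columns are projected onto the same factors, their coordinates can be plotted together, which is the property the description points to when it says CA relates individual patterns to their characteristic variables.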