Detecting Novel Associations in Large Data Sets

Bibliographic Details
Main Authors: Reshef, David N., Reshef, Yakir, Grossman, Sharon Rachel, Finucane, Hilary Kiyo, McVean, Gilean, Turnbaugh, Peter J., Mitzenmacher, Michael, Sabeti, Pardis C., Lander, Eric Steven
Other Authors: Whitaker College of Health Sciences and Technology
Format: Article
Language: en_US
Published: American Association for the Advancement of Science (AAAS) 2014
Online Access: http://hdl.handle.net/1721.1/84636
https://orcid.org/0000-0001-6463-4203
https://orcid.org/0000-0001-5410-7274
https://orcid.org/0000-0002-3355-6983
Description
Summary: Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
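
To give a rough feel for what a grid-based, information-theoretic dependence score looks like, the following is a minimal sketch and not the authors' algorithm. It assumes a simplification: only equal-frequency grids are tried, so the result is at best a lower bound on a true MIC value. The function names (`approx_mic`, `equal_frequency_bins`, `mutual_information`) are hypothetical, and the grid-size budget of n^0.6 cells is an assumption made for this illustration.

```python
import numpy as np


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    ranks = np.argsort(np.argsort(values))  # rank of each value, 0..n-1
    return (ranks * k) // len(values)


def mutual_information(x_bins, y_bins, nx, ny):
    """Mutual information (in nats) of the binned joint distribution."""
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x_bins, y_bins), 1)   # count points per grid cell
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over x bins
    py = joint.sum(axis=0, keepdims=True)   # marginal over y bins
    nz = joint > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def approx_mic(x, y, exponent=0.6):
    """Simplified MIC-style score: maximum normalized mutual information
    over equal-frequency grids with nx * ny <= n ** exponent (assumed budget)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    budget = n ** exponent
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        for ny in range(2, int(budget // nx) + 1):
            mi = mutual_information(
                equal_frequency_bins(x, nx), equal_frequency_bins(y, ny), nx, ny
            )
            # Normalize so the score lies in [0, 1] for every grid shape.
            best = max(best, mi / np.log(min(nx, ny)))
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 500)
    print("noiseless linear:", approx_mic(x, x))
    print("independent noise:", approx_mic(rng.random(500), rng.random(500)))
```

In this sketch a noiseless functional relationship scores close to 1 and independent variables score close to 0, which mirrors the behavior the summary describes. Any real analysis should rely on a vetted MINE implementation rather than on this simplification.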