Detecting Novel Associations in Large Data Sets

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and fo...

Bibliographic Details
Main Authors: Reshef, David N., Reshef, Yakir, Grossman, Sharon Rachel, Finucane, Hilary Kiyo, McVean, Gilean, Turnbaugh, Peter J., Mitzenmacher, Michael, Sabeti, Pardis C., Lander, Eric Steven
Other Authors: Whitaker College of Health Sciences and Technology
Format: Article
Language: en_US
Published: American Association for the Advancement of Science (AAAS) 2014
Online Access: http://hdl.handle.net/1721.1/84636
https://orcid.org/0000-0001-6463-4203
https://orcid.org/0000-0001-5410-7274
https://orcid.org/0000-0002-3355-6983
description Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
institution Massachusetts Institute of Technology
Record ID: mit-1721.1/84636 (2022-09-30T21:28:33Z)
Departments: Whitaker College of Health Sciences and Technology; Massachusetts Institute of Technology. Department of Biology; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Additional author listing: Reshef, David N.; Reshef, Yakir; Grossman, Sharon Rachel; Lander, Eric S.
Sponsor: National Institute of General Medical Sciences (U.S.) (Medical Scientist Training Program)
Dates: 2014-02-03T13:18:52Z; 2011-12; 2011-03
Type: Article (http://purl.org/eprint/type/JournalArticle)
ISSN: 0036-8075; 1095-9203
Citation: Reshef, D. N., Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. “Detecting Novel Associations in Large Data Sets.” Science 334, no. 6062 (December 15, 2011): 1518-1524.
DOI: http://dx.doi.org/10.1126/science.1205438
Journal: Science
License: Creative Commons Attribution-Noncommercial-Share Alike 3.0 (http://creativecommons.org/licenses/by-nc-sa/3.0/)
File format: application/pdf
Source: PMC