Balancing utility and privacy of high-dimensional datasets : mobile phone metadata

Thesis: S.M. in Technology and Policy, Massachusetts Institute of Technology, Institute for Data, Systems, and Society, Technology and Policy Program, 2015.

Bibliographic Details
Main Author: Noriega Campero, Alejandro
Other Authors: Alex 'Sandy' Pentland.
Format: Thesis
Language: English
Published: Massachusetts Institute of Technology 2016
Subjects:
Online Access: http://hdl.handle.net/1721.1/103573
Department: Massachusetts Institute of Technology. Engineering Systems Division; Massachusetts Institute of Technology. Institute for Data, Systems, and Society; Technology and Policy Program.
Notes: Cataloged from PDF version of thesis. Includes bibliographical references (pages 75-76).

Abstract: Large-scale datasets of human behavior have the potential to fundamentally transform the way we develop cities, fight disease and crime, and respond to natural disasters. However, understanding the privacy of these datasets is key to their broad use and potential impact, because they contain sensitive information such as citizens' geo-location. Moreover, recent research has demonstrated adversarial methods that successfully link sensitive information in the datasets to individuals, even when all personal identifiers are pseudonymized. This thesis conceptualizes, relates, and generalizes salient methodologies for disclosure analysis of pseudonymized data developed in the last two decades, such as k-anonymity, t-closeness, and unicity. Data at the core of the so-called "big data" revolution is fundamentally high-dimensional. We show that the implications of high dimensionality are paradigmatic to modern disclosure analysis. Consequently, we propose and analyze a methodological framework that couples information-theoretic concepts from t-closeness and δ-disclosure with the partial adversarial knowledge model introduced by unicity [1] [2], as well as its possible extensions.
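The partial adversarial knowledge model behind unicity can be illustrated with a minimal sketch. It assumes the common definition of unicity (the fraction of users whose trace is the only one containing p points drawn at random from it); the function name `estimate_unicity`, the toy traces, and the data layout are hypothetical and are not taken from the thesis.

```python
import random
from collections import defaultdict


def estimate_unicity(traces, p, seed=0):
    """Estimate unicity: the share of users singled out by p known points.

    traces: dict mapping a user id to a set of (location, time_bin) tuples.
    p: number of points the adversary is assumed to know (p >= 1).
    Hypothetical helper for illustration only.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    rng = random.Random(seed)

    # Inverted index: point -> set of users whose trace contains that point.
    index = defaultdict(set)
    for user, points in traces.items():
        for point in points:
            index[point].add(user)

    # Only users with at least p points can be attacked with p known points.
    eligible = [u for u, pts in traces.items() if len(pts) >= p]
    if not eligible:
        return 0.0

    unique = 0
    for user in eligible:
        # The adversary learns p points drawn at random from the user's trace.
        known = rng.sample(sorted(traces[user]), p)
        # Users whose traces are consistent with every known point.
        candidates = set.intersection(*(index[point] for point in known))
        if candidates == {user}:
            unique += 1
    return unique / len(eligible)


# Made-up toy data, not from the thesis dataset.
toy_traces = {
    "u1": {("zipA", 1), ("zipB", 2), ("zipC", 3)},
    "u2": {("zipA", 1), ("zipB", 2), ("zipD", 3)},
    "u3": {("zipE", 1), ("zipF", 2), ("zipG", 3)},
}
print(estimate_unicity(toy_traces, p=3))
```

Coarsening the data in space or time merges points across users, which enlarges the candidate sets and lowers the estimate.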
The methodologies were applied and compared on a large dataset of mobile phone records (CDRs), where results empirically showed ordinal equivalence between unicity measures and the information-distance measures EM-disclosure and KL-disclosure. Advantages of the proposed framework are highlighted, and avenues for future research are identified. We also investigate the tradeoff between data privacy and data usefulness for mobile phone metadata (CDRs) and its real-world applications. On the disclosure side, four spatio-temporal points were enough to uniquely identify more than 95% of individuals at a [ZIP code, 1 hour] spatio-temporal granularity, consistent with the main results in the literature. As the dataset was coarsened in space and time, this ratio (unicity) decreased to values below 0.2% for data specified at [District, 1 week] granularity or coarser. We confirmed the existence of a utility-privacy tradeoff for the 10 experts surveyed for this study, i.e., a positive relationship between reidentification risk and data utility. However, Pareto analysis revealed that several granularity levels (generalization profiles) are Pareto-suboptimal, so the tradeoff is not strict. Non-strictness implies that not all privacy gains entail utility loss and, conversely, not all utility gains entail privacy loss. The results thus suggest that data policy decisions should rest on an understanding of the underlying privacy-utility tradeoff; otherwise inefficient policies may be implemented that incur unnecessary privacy or utility losses. Lastly, we show that, because of the ordinal equivalence observed on the CDR dataset, Pareto properties are preserved, and these results on the utility-privacy tradeoff are therefore invariant to assessing disclosure with information-distance measures such as EM-disclosure and KL-disclosure. This work helps shed light on the privacy-utility tradeoff inherent to high-dimensional datasets of large societal systems.
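The Pareto argument can be made concrete with a small sketch. It assumes each generalization profile is summarized by a (reidentification risk, utility) pair, where lower risk and higher utility are both preferred. The profile names and scores are invented for illustration and are not results from the thesis.

```python
def pareto_front(profiles):
    """Return the names of profiles that no other profile dominates.

    profiles: dict mapping a profile name to a (risk, utility) pair.
    A profile dominates another if it is no worse on both criteria
    (risk no higher, utility no lower) and strictly better on at least one.
    """
    def dominates(a, b):
        no_worse = a[0] <= b[0] and a[1] >= b[1]
        strictly_better = a[0] < b[0] or a[1] > b[1]
        return no_worse and strictly_better

    return [
        name
        for name, score in profiles.items()
        if not any(dominates(other, score) for other in profiles.values())
    ]


# Invented (risk, utility) scores for four hypothetical granularity levels.
toy_profiles = {
    "fine": (0.90, 0.9),
    "medium_a": (0.40, 0.5),
    "medium_b": (0.50, 0.4),  # higher risk and lower utility than medium_a
    "coarse": (0.01, 0.2),
}
print(pareto_front(toy_profiles))
```

Here `medium_b` is Pareto-suboptimal: switching to `medium_a` improves privacy and utility at once. Dominance depends only on how profiles are ordered along each axis, so replacing the risk score with any measure that ranks the profiles identically leaves the front unchanged, which is the sense in which ordinal equivalence preserves Pareto properties.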
The results and methodology of this work are relevant to actors in both academic and policy domains, and germane as society debates technological and legal frameworks for potentially ubiquitous data generation and use.

Statement of responsibility: by Alejandro Noriega Campero.
Degree: S.M. in Technology and Policy
Dates: 2016-07-11T14:44:29Z; 2015
Identifier: 938937787
Rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582
Extent: 76 pages
File format: application/pdf