Improving official statistics in emerging markets using machine learning and mobile phone data

Mobile phones are among the fastest-growing technologies in the developing world, with global penetration rates reaching 90%. Mobile phone data, also called call detail records (CDR), are generated every time a phone is used and are recorded by carriers at scale. CDR have generated groundbreaking insights in public health, of...

Bibliographic Details
Main Authors: Sundsøy, Pål, Bjelland, Johannes, Bengtsson, Linus, de Montjoye, Yves-Alexandre, Jahani, Eaman, Pentland, Alex Paul
Other Authors: Massachusetts Institute of Technology. Institute for Data, Systems, and Society
Format: Article
Language: English
Published: Springer 2017
Online Access: http://hdl.handle.net/1721.1/109143
https://orcid.org/0000-0003-3879-4275
https://orcid.org/0000-0002-8053-9983
_version_ 1826210494058332160
author Sundsøy, Pål
Bjelland, Johannes
Bengtsson, Linus
de Montjoye, Yves-Alexandre
Jahani, Eaman
Pentland, Alex Paul
author2 Massachusetts Institute of Technology. Institute for Data, Systems, and Society
author_facet Massachusetts Institute of Technology. Institute for Data, Systems, and Society
Sundsøy, Pål
Bjelland, Johannes
Bengtsson, Linus
de Montjoye, Yves-Alexandre
Jahani, Eaman
Pentland, Alex Paul
author_sort Sundsøy, Pål
collection MIT
description Mobile phones are among the fastest-growing technologies in the developing world, with global penetration rates reaching 90%. Mobile phone data, also called call detail records (CDR), are generated every time a phone is used and are recorded by carriers at scale. CDR have generated groundbreaking insights in public health, official statistics, and logistics. However, because most phones in developing countries are prepaid, the data lack key information about the user, including gender and other demographic variables. This precludes numerous uses of the data in social science and development economics research. It also severely hinders the development of humanitarian applications, such as using mobile phone data to target aid towards the most vulnerable groups during crises. We developed a framework to extract more than 1400 features from standard mobile phone data and used them to predict useful individual characteristics and group estimates. Here we present a systematic cross-country study of the applicability of machine learning for dataset augmentation at low cost. We validate our framework by showing how it can be used to reliably predict gender and other information for more than half a million people in two countries. We show that standard machine learning algorithms trained on only 10,000 users are sufficient to predict an individual’s gender with an accuracy ranging from 74.3 to 88.4% in a developed country and from 74.5 to 79.7% in a developing country, using only metadata. This is significantly higher than previous approaches and, once calibrated, gives highly accurate estimates of gender balance in groups. Performance suffers only marginally if we reduce the training set to 5,000 users, but decreases significantly with smaller training sets.
Finally, we use factor analysis to show that our indicators capture a large range of behavioral traits, and we show that the framework can be used to predict other indicators of vulnerability such as age or socio-economic status. Mobile phone data have great potential for good, and our framework allows these data to be augmented with vulnerability and other information at a fraction of the cost.
first_indexed 2024-09-23T14:50:51Z
format Article
id mit-1721.1/109143
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T14:50:51Z
publishDate 2017
publisher Springer
record_format dspace
spelling mit-1721.1/109143 2022-10-01T22:54:07Z Improving official statistics in emerging markets using machine learning and mobile phone data Sundsøy, Pål Bjelland, Johannes Bengtsson, Linus de Montjoye, Yves-Alexandre Jahani, Eaman Pentland, Alex Paul Massachusetts Institute of Technology. Institute for Data, Systems, and Society Program in Media Arts and Sciences (Massachusetts Institute of Technology) Jahani, Eaman Pentland, Alex Paul 2017-05-17T15:09:55Z 2017-05-17T15:09:55Z 2017-05 2016-11 2017-05-17T04:54:17Z Article http://purl.org/eprint/type/JournalArticle 2193-1127 http://hdl.handle.net/1721.1/109143 Jahani, Eaman; Sundsøy, Pål; Bjelland, Johannes; Bengtsson, Linus; Pentland, Alex ‘Sandy’ and de Montjoye, Yves-Alexandre. "Improving official statistics in emerging markets using machine learning and mobile phone data." EPJ Data Science 6, no. 3 (May 2017): 1-21. © 2017 The Author(s) https://orcid.org/0000-0003-3879-4275 https://orcid.org/0000-0002-8053-9983 en http://dx.doi.org/10.1140/epjds/s13688-017-0099-3 EPJ Data Science Creative Commons Attribution http://creativecommons.org/licenses/by/4.0/ The Author(s) application/pdf Springer Springer Berlin Heidelberg
spellingShingle Sundsøy, Pål
Bjelland, Johannes
Bengtsson, Linus
de Montjoye, Yves-Alexandre
Jahani, Eaman
Pentland, Alex Paul
Improving official statistics in emerging markets using machine learning and mobile phone data
title Improving official statistics in emerging markets using machine learning and mobile phone data
title_full Improving official statistics in emerging markets using machine learning and mobile phone data
title_fullStr Improving official statistics in emerging markets using machine learning and mobile phone data
title_full_unstemmed Improving official statistics in emerging markets using machine learning and mobile phone data
title_short Improving official statistics in emerging markets using machine learning and mobile phone data
title_sort improving official statistics in emerging markets using machine learning and mobile phone data
url http://hdl.handle.net/1721.1/109143
https://orcid.org/0000-0003-3879-4275
https://orcid.org/0000-0002-8053-9983
work_keys_str_mv AT sundsøypal improvingofficialstatisticsinemergingmarketsusingmachinelearningandmobilephonedata
AT bjellandjohannes improvingofficialstatisticsinemergingmarketsusingmachinelearningandmobilephonedata
AT bengtssonlinus improvingofficialstatisticsinemergingmarketsusingmachinelearningandmobilephonedata
AT demontjoyeyvesalexandre improvingofficialstatisticsinemergingmarketsusingmachinelearningandmobilephonedata
AT jahanieaman improvingofficialstatisticsinemergingmarketsusingmachinelearningandmobilephonedata
AT pentlandalexpaul improvingofficialstatisticsinemergingmarketsusingmachinelearningandmobilephonedata