Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study

Background: Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored as this requires ground truth data for study. Objective: This study sought to identify relationships b...


Bibliographic Details
Main Authors: Daughton, Ashlynn R; Chunara, Rumi; Paul, Michael J
Format: Article
Language: English
Published: JMIR Publications 2020-04-01
Series: JMIR Public Health and Surveillance
Online Access: http://publichealth.jmir.org/2020/2/e14986/
_version_ 1818877271752048640
author Daughton, Ashlynn R
Chunara, Rumi
Paul, Michael J
collection DOAJ
description Background: Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored as this requires ground truth data for study.
Objective: This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset.
Methods: This study leveraged a unique dataset of self-reported surveys, microbiological laboratory tests, and social media data from the same individuals to assess the validity of individual-level constructs pertaining to influenza-like illness in social media data. Logistic regression models were used to identify illness in Twitter posts using user posting behaviors and topic model features extracted from users’ tweets.
Results: Of 396 original study participants, only 81 met the inclusion criteria for this study. Of these participants’ tweets, we identified only two instances that were related to health and occurred within 2 weeks (before or after) of a survey indicating symptoms. It was not possible to predict when participants reported symptoms using features derived from topic models (area under the curve [AUC]=0.51; P=.38), though it was possible using behavior features, albeit with a very small effect size (AUC=0.53; P≤.001). Individual symptoms were generally not predictable either. The study sample and a random sample from Twitter were predictably different on held-out data (AUC=0.67; P≤.001), meaning that the content posted by people who participated in this study was predictably different from that posted by random Twitter users. Individuals in the random sample and the GoViral sample used Twitter with similar frequencies (similar @ mentions, number of tweets, and number of retweets; AUC=0.50; P=.19).
Conclusions: To our knowledge, this is the first instance of an attempt to use a ground truth dataset to validate infectious disease observations in social media data. The lack of signal, the lack of predictability among behaviors or topics, and the demonstrated volunteer bias in the study population are important findings for the large and growing body of disease surveillance using internet-sourced data.
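The AUC values reported in the description (0.50 to 0.67) are easier to read with the definition in hand: AUC is the probability that a randomly chosen positive example receives a higher model score than a randomly chosen negative one, so 0.5 corresponds to chance. The following is a minimal illustrative sketch of that definition only, not the authors' analysis code; the function name `roc_auc` and the toy labels and scores are assumptions made for this example.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return the AUC as the fraction of (positive, negative) pairs in which
    the positive example has the higher score; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = symptoms reported, 0 = no symptoms reported.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

A model that assigns every example the same score gets 0.5 under this definition, which is why values such as 0.51 and 0.53 indicate little or no usable signal.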
first_indexed 2024-12-19T13:55:38Z
format Article
id doaj.art-6917ca39cb2547d8bfddd29b052d2568
institution Directory Open Access Journal
issn 2369-2960
language English
last_indexed 2024-12-19T13:55:38Z
publishDate 2020-04-01
publisher JMIR Publications
record_format Article
series JMIR Public Health and Surveillance
spelling doaj.art-6917ca39cb2547d8bfddd29b052d2568 | 2022-12-21T20:18:37Z | eng | JMIR Publications | JMIR Public Health and Surveillance | 2369-2960 | 2020-04-01 | 6 | 2 | e14986 | 10.2196/14986 | Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study | Daughton, Ashlynn R; Chunara, Rumi; Paul, Michael J | http://publichealth.jmir.org/2020/2/e14986/
title Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study
url http://publichealth.jmir.org/2020/2/e14986/