Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing

Abstract: We used social media data from the “covid19positive” subreddit, from 03/2020 to 03/2022, to identify COVID-19 cases and automatically extract their reported symptoms using natural language processing (NLP). We trained a Bidirectional Encoder Representations from Transformers classification model...

Bibliographic Details
Main Authors: Muzhe Guo, Yong Ma, Efe Eworuke, Melissa Khashei, Jaejoon Song, Yueqin Zhao, Fang Jin
Format: Article
Language: English
Published: Nature Portfolio 2023-08-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-023-39986-7
author Muzhe Guo
Yong Ma
Efe Eworuke
Melissa Khashei
Jaejoon Song
Yueqin Zhao
Fang Jin
collection DOAJ
description Abstract: We used social media data from the “covid19positive” subreddit, from 03/2020 to 03/2022, to identify COVID-19 cases and automatically extract their reported symptoms using natural language processing (NLP). We trained a Bidirectional Encoder Representations from Transformers (BERT) classification model with chunking to identify COVID-19 cases, and we developed a novel QuadArm model, which incorporates Question-answering, dual-corpus expansion, Adaptive rotation clustering, and mapping, to extract symptoms. Our classification model achieved 91.2% accuracy for the early period (03/2020–05/2020) and was applied to the Delta (07/2021–09/2021) and Omicron (12/2021–03/2022) periods for case identification. We identified 310, 8,794, and 12,094 COVID-positive authors in the three periods, respectively. The five most common symptoms extracted in the early period were coughing (57%), fever (55%), loss of sense of smell (41%), headache (40%), and sore throat (40%). During the Delta period, these remained the top five symptoms, but the percentage of authors reporting them was reduced to half or less of that in the early period. During the Omicron period, loss of sense of smell was reported less often, while sore throat was reported more often. Our study demonstrated that NLP can be used to identify COVID-19 cases accurately and to extract symptoms efficiently.
first_indexed 2024-03-09T15:15:38Z
format Article
id doaj.art-786f56c08c1747619f44c43de2173c5d
institution Directory Open Access Journal
issn 2045-2322
language English
last_indexed 2024-03-09T15:15:38Z
publishDate 2023-08-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj.art-786f56c08c1747619f44c43de2173c5d
2023-11-26T13:07:06Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2023-08-01
Volume 13, Issue 1, pp. 1–13
10.1038/s41598-023-39986-7
Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing
Muzhe Guo (Department of Statistics, George Washington University)
Yong Ma (Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration (FDA))
Efe Eworuke (Epidemiology and Drug Safety, IQVIA Real World Solutions)
Melissa Khashei (Division of Epidemiology II, Office of Pharmacovigilance and Epidemiology, Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, Food and Drug Administration (FDA))
Jaejoon Song (Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration (FDA))
Yueqin Zhao (Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration (FDA))
Fang Jin (Department of Statistics, George Washington University)
https://doi.org/10.1038/s41598-023-39986-7
title Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing
url https://doi.org/10.1038/s41598-023-39986-7
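
The abstract in this record says the BERT classifier was trained "with chunking" but gives no further detail. The Python sketch below illustrates one common way such chunking is done: a long post is split into overlapping token windows that fit the model's input limit, each window is scored separately, and the window scores are combined into one label for the post. Everything specific here is an assumption for illustration and is not taken from the article: the function names `chunk_tokens` and `aggregate_chunk_scores`, the window size of 510 tokens, the stride of 255, and the choice to average the window scores.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], max_len: int = 510, stride: int = 255) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most max_len tokens.

    max_len and stride are illustrative defaults, not values from the article.
    """
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("require max_len > 0 and 0 < stride <= max_len")
    if not tokens:
        return []
    chunks: List[List[str]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + max_len]))
        # Stop once the current window reaches the end of the post.
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks


def aggregate_chunk_scores(scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Label the whole post positive if the mean chunk score reaches threshold.

    Averaging is one possible aggregation rule (assumed here); the article's
    actual rule is not stated in this record.
    """
    if not scores:
        raise ValueError("need at least one chunk score")
    return sum(scores) / len(scores) >= threshold
```

In use, each window returned by `chunk_tokens` would be passed to a classifier that returns a probability of the post describing a COVID-19 case, and `aggregate_chunk_scores` would turn those probabilities into a single decision. Requiring the stride to be no larger than the window size ensures that no token is skipped between windows.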