Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing
Abstract We used social media data from the “covid19positive” subreddit, from 03/2020 to 03/2022, to identify COVID-19 cases and extract their reported symptoms automatically using natural language processing (NLP). We trained a Bidirectional Encoder Representations from Transformers (BERT) classification model...
Main Authors: | Muzhe Guo, Yong Ma, Efe Eworuke, Melissa Khashei, Jaejoon Song, Yueqin Zhao, Fang Jin |
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio 2023-08-01 |
Series: | Scientific Reports |
Online Access: | https://doi.org/10.1038/s41598-023-39986-7 |
_version_ | 1797452920557404160 |
---|---|
author | Muzhe Guo Yong Ma Efe Eworuke Melissa Khashei Jaejoon Song Yueqin Zhao Fang Jin |
author_facet | Muzhe Guo Yong Ma Efe Eworuke Melissa Khashei Jaejoon Song Yueqin Zhao Fang Jin |
author_sort | Muzhe Guo |
collection | DOAJ |
description | Abstract We used social media data from the “covid19positive” subreddit, from 03/2020 to 03/2022, to identify COVID-19 cases and extract their reported symptoms automatically using natural language processing (NLP). We trained a Bidirectional Encoder Representations from Transformers (BERT) classification model with chunking to identify COVID-19 cases, and we developed a novel QuadArm model, which incorporates Question-answering, dual-corpus expansion, Adaptive rotation clustering, and mapping, to extract symptoms. Our classification model achieved 91.2% accuracy for the early period (03/2020–05/2020) and was applied to the Delta (07/2021–09/2021) and Omicron (12/2021–03/2022) periods for case identification. We identified 310, 8794, and 12,094 COVID-positive authors in the three periods, respectively. The five most common symptoms extracted in the early period were coughing (57%), fever (55%), loss of sense of smell (41%), headache (40%), and sore throat (40%). During the Delta period, these remained the top five symptoms, with the percentage of authors reporting them reduced to half or less of that in the early period. During the Omicron period, loss of sense of smell was reported less while sore throat was reported more. Our study demonstrated that NLP can be used to identify COVID-19 cases accurately and extract symptoms efficiently. |
first_indexed | 2024-03-09T15:15:38Z |
format | Article |
id | doaj.art-786f56c08c1747619f44c43de2173c5d |
institution | Directory of Open Access Journals |
issn | 2045-2322 |
language | English |
last_indexed | 2024-03-09T15:15:38Z |
publishDate | 2023-08-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Reports |
spelling | doaj.art-786f56c08c1747619f44c43de2173c5d; 2023-11-26T13:07:06Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2023-08-01; 13; 1; 1; 13; 10.1038/s41598-023-39986-7; Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing; Muzhe Guo 0; Yong Ma 1; Efe Eworuke 2; Melissa Khashei 3; Jaejoon Song 4; Yueqin Zhao 5; Fang Jin 6; Department of Statistics, George Washington University; Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration (FDA); Epidemiology and Drug Safety, IQVIA Real World Solutions; Division of Epidemiology II, Office of Pharmacovigilance and Epidemiology, Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, Food and Drug Administration (FDA); Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration (FDA); Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration (FDA); Department of Statistics, George Washington University; Abstract We used social media data from the “covid19positive” subreddit, from 03/2020 to 03/2022, to identify COVID-19 cases and extract their reported symptoms automatically using natural language processing (NLP). We trained a Bidirectional Encoder Representations from Transformers (BERT) classification model with chunking to identify COVID-19 cases, and we developed a novel QuadArm model, which incorporates Question-answering, dual-corpus expansion, Adaptive rotation clustering, and mapping, to extract symptoms. Our classification model achieved 91.2% accuracy for the early period (03/2020–05/2020) and was applied to the Delta (07/2021–09/2021) and Omicron (12/2021–03/2022) periods for case identification. We identified 310, 8794, and 12,094 COVID-positive authors in the three periods, respectively. The five most common symptoms extracted in the early period were coughing (57%), fever (55%), loss of sense of smell (41%), headache (40%), and sore throat (40%). During the Delta period, these remained the top five symptoms, with the percentage of authors reporting them reduced to half or less of that in the early period. During the Omicron period, loss of sense of smell was reported less while sore throat was reported more. Our study demonstrated that NLP can be used to identify COVID-19 cases accurately and extract symptoms efficiently.; https://doi.org/10.1038/s41598-023-39986-7 |
spellingShingle | Muzhe Guo Yong Ma Efe Eworuke Melissa Khashei Jaejoon Song Yueqin Zhao Fang Jin Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing Scientific Reports |
title | Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing |
title_sort | identifying covid 19 cases and extracting patient reported symptoms from reddit using natural language processing |
url | https://doi.org/10.1038/s41598-023-39986-7 |