Predicting Emerging Themes in Rapidly Expanding COVID-19 Literature With Unsupervised Word Embeddings and Machine Learning: Evidence-Based Study

Background: Evidence from peer-reviewed literature is the cornerstone for designing responses to global threats such as COVID-19. In massive and rapidly growing corpuses, such as COVID-19 publications, assimilating and synthesizing information is challenging. Leveraging a robus...

Bibliographic Details
Main Authors:	Ridam Pal, Harshita Chopra, Raghav Awasthi, Harsh Bandhey, Aditya Nagori, Tavpritesh Sethi
Format:	Article
Language:	English
Published:	JMIR Publications 2022-11-01
Series:	Journal of Medical Internet Research
Online Access:	https://www.jmir.org/2022/11/e34067

_version_	1797734696156659712
author	Ridam Pal Harshita Chopra Raghav Awasthi Harsh Bandhey Aditya Nagori Tavpritesh Sethi
author_sort	Ridam Pal
collection	DOAJ
description	Background: Evidence from peer-reviewed literature is the cornerstone for designing responses to global threats such as COVID-19. In massive and rapidly growing corpuses, such as COVID-19 publications, assimilating and synthesizing information is challenging. Leveraging a robust computational pipeline that evaluates multiple aspects, such as network topological features, communities, and their temporal trends, can make this process more efficient. Objective: We aimed to show that new knowledge can be captured and tracked using the temporal change in the underlying unsupervised word embeddings of the literature. Further, imminent themes can be predicted using machine learning on the evolving associations between words. Methods: Frequently occurring medical entities were extracted from the abstracts of more than 150,000 COVID-19 articles in the World Health Organization database, collected at monthly intervals starting from February 2020. Word embeddings trained on each month’s literature were used to construct networks of entities with cosine similarities as edge weights. Topological features of the subsequent month’s network were forecasted based on prior patterns, and new links were predicted using supervised machine learning. Community detection and alluvial diagrams were used to track biomedical themes that evolved over the months. Results: Thromboembolic complications were detected as an emerging theme as early as August 2020. A shift toward the symptoms of long COVID complications was observed during March 2021, and neurological complications gained significance in June 2021. A prospective validation of the link prediction models achieved an area under the receiver operating characteristic curve of 0.87. Predictive modeling revealed predisposing conditions, symptoms, cross-infection, and neurological complications as dominant research themes in COVID-19 publications based on the patterns observed in previous months. Conclusions: Machine learning–based prediction of emerging links can contribute toward steering research by capturing themes represented by groups of medical entities, based on patterns of semantic relationships over time.
first_indexed	2024-03-12T12:47:07Z
format	Article
id	doaj.art-20d245763cd948f1b4cab9ba116fffcc
institution	Directory Open Access Journal
issn	1438-8871
language	English
last_indexed	2024-03-12T12:47:07Z
publishDate	2022-11-01
publisher	JMIR Publications
record_format	Article
series	Journal of Medical Internet Research
spelling	doaj.art-20d245763cd948f1b4cab9ba116fffcc; 2023-08-28T23:13:10Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2022-11-01; 24(11):e34067; 10.2196/34067; Ridam Pal (https://orcid.org/0000-0003-1561-1173); Harshita Chopra (https://orcid.org/0000-0003-3331-2003); Raghav Awasthi (https://orcid.org/0000-0002-6643-4333); Harsh Bandhey (https://orcid.org/0000-0002-4113-0616); Aditya Nagori (https://orcid.org/0000-0002-6389-2179); Tavpritesh Sethi (https://orcid.org/0000-0002-4776-7941); https://www.jmir.org/2022/11/e34067
title	Predicting Emerging Themes in Rapidly Expanding COVID-19 Literature With Unsupervised Word Embeddings and Machine Learning: Evidence-Based Study
title_sort	predicting emerging themes in rapidly expanding covid 19 literature with unsupervised word embeddings and machine learning evidence based study
url	https://www.jmir.org/2022/11/e34067

Similar Items