Applying Artificial Intelligence Methods for the Estimation of Disease Incidence: The Utility of Language Models

Background: AI-driven digital health tools often rely on estimates of disease incidence or prevalence, but obtaining these estimates is costly and time-consuming. We explored the use of machine learning models that leverage contextual information about diseases from unstructured text, to estimate di...

Bibliographic Details
Main Authors:	Yuanzhao Zhang, Robert Walecki, Joanne R. Winter, Felix J. S. Bragman, Sara Lourenco, Christopher Hart, Adam Baker, Yura Perov, Saurabh Johri
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2020-12-01
Series:	Frontiers in Digital Health
Subjects:	natural language processing, disease incidence, health statistic data, deep learning, machine learning
Online Access:	https://www.frontiersin.org/articles/10.3389/fdgth.2020.569261/full

_version_	1818853295443148800
author	Yuanzhao Zhang, Robert Walecki, Joanne R. Winter, Felix J. S. Bragman, Sara Lourenco, Christopher Hart, Adam Baker, Yura Perov, Saurabh Johri
author_sort	Yuanzhao Zhang
collection	DOAJ
description	Background: AI-driven digital health tools often rely on estimates of disease incidence or prevalence, but obtaining these estimates is costly and time-consuming. We explored the use of machine learning models that leverage contextual information about diseases from unstructured text to estimate disease incidence. Methods: We used a class of machine learning models, called language models, to extract contextual information relating to disease incidence. We evaluated three language models: BioBERT, Global Vectors for Word Representation (GloVe), and the Universal Sentence Encoder (USE), as well as an approach that uses all three jointly. The output of these models is a mathematical representation of the underlying data, known as “embeddings.” We used these embeddings to train neural network models to predict disease incidence. The neural networks were trained and validated using data from the Global Burden of Disease study, and tested using independent data sourced from the epidemiological literature. Findings: A variety of language models can be used to encode contextual information about diseases. We found that, on average, BioBERT embeddings were the best for disease names across multiple tasks. In particular, BioBERT was the best-performing model when predicting specific disease-country pairs, whilst a fusion model combining BioBERT, GloVe, and USE performed best on average when predicting disease incidence in unseen countries. We also found that GloVe embeddings performed better than BioBERT embeddings when applied to country names. However, the models were limited in their ability to predict previously unseen diseases. Further limitations were substantial variation across age groups and notably lower performance for diseases that are highly dependent on location and climate. Interpretation: We demonstrate that context-aware machine learning models can be used for estimating disease incidence. This method is quicker to implement than traditional epidemiological approaches. We therefore suggest it complements existing modeling efforts where data is required more rapidly or at larger scale. This may particularly benefit AI-driven digital health products where the data will undergo further processing and a validated approximation of the disease incidence is adequate.
first_indexed	2024-12-19T07:34:33Z
format	Article
id	doaj.art-5a40111c5bf440f5b050c37761d75fce
institution	Directory Open Access Journal
issn	2673-253X
language	English
last_indexed	2024-12-19T07:34:33Z
publishDate	2020-12-01
publisher	Frontiers Media S.A.
record_format	Article
series	Frontiers in Digital Health
spelling	doaj.art-5a40111c5bf440f5b050c37761d75fce; 2022-12-21T20:30:36Z; eng; Frontiers Media S.A.; Frontiers in Digital Health; 2673-253X; 2020-12-01; 2; 10.3389/fdgth.2020.569261; 569261; Applying Artificial Intelligence Methods for the Estimation of Disease Incidence: The Utility of Language Models; Yuanzhao Zhang; Robert Walecki; Joanne R. Winter; Felix J. S. Bragman; Sara Lourenco; Christopher Hart; Adam Baker; Yura Perov; Saurabh Johri; https://www.frontiersin.org/articles/10.3389/fdgth.2020.569261/full; natural language processing; disease incidence; health statistic data; deep learning; machine learning
title	Applying Artificial Intelligence Methods for the Estimation of Disease Incidence: The Utility of Language Models
topic	natural language processing, disease incidence, health statistic data, deep learning, machine learning
url	https://www.frontiersin.org/articles/10.3389/fdgth.2020.569261/full

Similar Items