An evaluation of GPT models for phenotype concept recognition

Abstract

Objective: Clinical deep phenotyping and phenotype annotation play a critical role both in the diagnosis of patients with rare disorders and in building computationally tractable knowledge in the rare disorders field. These processes rely on using ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (usually supported by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift towards large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation.

Materials and methods: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0), and two established gold-standard corpora for phenotype recognition, one consisting of publication abstracts and the other of clinical observations.

Results: The best run, using in-context learning, achieved a document-level F1 score of 0.58 on publication abstracts and 0.75 on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best-in-class tool. Without in-context learning, however, performance falls significantly below that of existing approaches.

Conclusion: Our experiments show that gpt-4.0 surpasses state-of-the-art performance if the task is constrained to a subset of the target ontology for which there is prior knowledge of the terms expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost, and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.
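The in-context-learning setup described in the abstract can be illustrated with a short sketch. The system instruction, worked example, and message layout below are hypothetical, assembled only to show the general shape of such a prompt; the record does not reproduce the seven prompts actually used in the study.

```python
# Hypothetical sketch of an in-context-learning prompt for phenotype
# concept recognition against the Human Phenotype Ontology (HPO).
# All strings below are illustrative assumptions, not the study's prompts.

SYSTEM = ("You are a clinical phenotype annotator. List every phenotype "
          "mentioned in the text as an HPO ID with its label.")

# One worked example supplies the in-context demonstration.
EXAMPLE_INPUT = "The patient presented with seizures and microcephaly."
EXAMPLE_OUTPUT = "HP:0001250 (Seizure); HP:0000252 (Microcephaly)"

def build_prompt(text: str) -> list[dict]:
    """Assemble a chat-style message list: instruction, demonstration, query."""
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": EXAMPLE_INPUT},
        {"role": "assistant", "content": EXAMPLE_OUTPUT},
        {"role": "user", "content": text},
    ]

messages = build_prompt("He has a long philtrum and global developmental delay.")
print(len(messages))  # 4
```

The zero-shot condition the abstract contrasts this with would simply drop the demonstration pair, leaving only the system instruction and the query.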

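The abstract reports both document-level and mention-level F1 scores. As a minimal sketch of the document-level variant, the function below micro-averages precision, recall, and F1 over per-document sets of predicted HPO IDs; the gold and predicted annotations are invented for illustration.

```python
# Micro-averaged document-level F1: each document contributes the set of
# HPO concept IDs annotated in it. The toy data below are invented examples.

def micro_f1(gold: list[set], pred: list[set]) -> tuple[float, float, float]:
    """Return micro-averaged (precision, recall, F1) over per-document sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted IDs
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious predictions
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed gold IDs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold_docs = [{"HP:0001250", "HP:0001263"}, {"HP:0000252"}]
pred_docs = [{"HP:0001250"}, {"HP:0000252", "HP:0001999"}]
p, r, f1 = micro_f1(gold_docs, pred_docs)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```

A mention-level score would instead be computed over individual annotated text spans, penalising every missed or spurious mention separately; the aggregation is otherwise the same.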

Bibliographic Details
Main Authors: Tudor Groza, Harry Caufield, Dylan Gration, Gareth Baynam, Melissa A. Haendel, Peter N. Robinson, Christopher J. Mungall, Justin T. Reese
Format: Article
Language: English
Published: BMC, 2024-01-01
Series: BMC Medical Informatics and Decision Making
ISSN: 1472-6947
Subjects: Large language models; Generative pretrained transformer; Artificial intelligence; Phenotype concept recognition; Human Phenotype Ontology
Online Access: https://doi.org/10.1186/s12911-024-02439-w
Author affiliations:
Tudor Groza: Rare Care Centre, Perth Children's Hospital
Harry Caufield: Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory
Dylan Gration: Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital
Gareth Baynam: Rare Care Centre, Perth Children's Hospital
Melissa A. Haendel: University of Colorado Anschutz Medical Campus
Peter N. Robinson: The Jackson Laboratory for Genomic Medicine
Christopher J. Mungall: Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory
Justin T. Reese: Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory