An evaluation of GPT models for phenotype concept recognition

Abstract

Objective: Clinical deep phenotyping and phenotype annotation play a critical role both in the diagnosis of patients with rare disorders and in building computationally tractable knowledge in the rare disorders field. These processes rely on using ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (usually supported by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift towards large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation.

Materials and methods: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0), and two established gold-standard corpora for phenotype recognition, one consisting of publication abstracts and the other of clinical observations.

Results: The best run, using in-context learning, achieved a document-level F1 score of 0.58 on publication abstracts and 0.75 on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best-in-class tool. Without in-context learning, however, performance falls significantly below that of existing approaches.

Conclusion: Our experiments show that gpt-4.0 surpasses state-of-the-art performance if the task is constrained to a subset of the target ontology for which there is prior knowledge of the terms expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost, and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.
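The in-context-learning setup described in the abstract can be illustrated with a short sketch. The system instruction, worked example, and message layout below are hypothetical, assembled only to show the general shape of such a prompt; the record does not reproduce the seven prompts actually used in the study.

```python
# Hypothetical sketch of an in-context-learning prompt for phenotype
# concept recognition against the Human Phenotype Ontology (HPO).
# All strings below are illustrative assumptions, not the study's prompts.

SYSTEM = ("You are a clinical phenotype annotator. List every phenotype "
          "mentioned in the text as an HPO ID with its label.")

# One worked example supplies the in-context demonstration.
EXAMPLE_INPUT = "The patient presented with seizures and microcephaly."
EXAMPLE_OUTPUT = "HP:0001250 (Seizure); HP:0000252 (Microcephaly)"

def build_prompt(text: str) -> list[dict]:
    """Assemble a chat-style message list: instruction, demonstration, query."""
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": EXAMPLE_INPUT},
        {"role": "assistant", "content": EXAMPLE_OUTPUT},
        {"role": "user", "content": text},
    ]

messages = build_prompt("He has a long philtrum and global developmental delay.")
print(len(messages))  # 4
```

The zero-shot condition the abstract contrasts this with would simply drop the demonstration pair, leaving only the system instruction and the query.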

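The abstract reports both document-level and mention-level F1 scores. As a minimal sketch of the document-level variant, the function below micro-averages precision, recall, and F1 over per-document sets of predicted HPO IDs; the gold and predicted annotations are invented for illustration.

```python
# Micro-averaged document-level F1: each document contributes the set of
# HPO concept IDs annotated in it. The toy data below are invented examples.

def micro_f1(gold: list[set], pred: list[set]) -> tuple[float, float, float]:
    """Return micro-averaged (precision, recall, F1) over per-document sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted IDs
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious predictions
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed gold IDs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold_docs = [{"HP:0001250", "HP:0001263"}, {"HP:0000252"}]
pred_docs = [{"HP:0001250"}, {"HP:0000252", "HP:0001999"}]
p, r, f1 = micro_f1(gold_docs, pred_docs)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```

A mention-level score would instead be computed over individual annotated text spans, penalising every missed or spurious mention separately; the aggregation is otherwise the same.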

Bibliographic Details
Main Authors: Tudor Groza, Harry Caufield, Dylan Gration, Gareth Baynam, Melissa A. Haendel, Peter N. Robinson, Christopher J. Mungall, Justin T. Reese
Format: Article
Language: English
Published: BMC, 2024-01-01
Series: BMC Medical Informatics and Decision Making
ISSN: 1472-6947
Subjects: Large language models; Generative pretrained transformer; Artificial intelligence; Phenotype concept recognition; Human Phenotype Ontology
Online Access: https://doi.org/10.1186/s12911-024-02439-w
Author affiliations:
Tudor Groza: Rare Care Centre, Perth Children's Hospital
Harry Caufield: Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory
Dylan Gration: Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital
Gareth Baynam: Rare Care Centre, Perth Children's Hospital
Melissa A. Haendel: University of Colorado Anschutz Medical Campus
Peter N. Robinson: The Jackson Laboratory for Genomic Medicine
Christopher J. Mungall: Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory
Justin T. Reese: Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory