An evaluation of GPT models for phenotype concept recognition
Main Authors: | Tudor Groza, Harry Caufield, Dylan Gration, Gareth Baynam, Melissa A. Haendel, Peter N. Robinson, Christopher J. Mungall, Justin T. Reese |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2024-01-01 |
Series: | BMC Medical Informatics and Decision Making |
Subjects: | Large language models; Generative pretrained transformer; Artificial intelligence; Phenotype concept recognition; Human Phenotype Ontology |
Online Access: | https://doi.org/10.1186/s12911-024-02439-w |
_version_ | 1797274414039957504 |
---|---|
author | Tudor Groza; Harry Caufield; Dylan Gration; Gareth Baynam; Melissa A. Haendel; Peter N. Robinson; Christopher J. Mungall; Justin T. Reese
author_facet | Tudor Groza; Harry Caufield; Dylan Gration; Gareth Baynam; Melissa A. Haendel; Peter N. Robinson; Christopher J. Mungall; Justin T. Reese
author_sort | Tudor Groza |
collection | DOAJ |
description | Abstract Objective Clinical deep phenotyping and phenotype annotation play a critical role both in the diagnosis of patients with rare disorders and in building computationally-tractable knowledge in the rare disorders field. These processes rely on using ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (usually supported by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift toward the use of large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation. Materials and methods The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0) and two established gold-standard corpora for phenotype recognition, one consisting of publication abstracts and the other of clinical observations. Results The best run, using in-context learning, achieved a document-level F1 score of 0.58 on publication abstracts and 0.75 on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best-in-class tool. Without in-context learning, however, performance is significantly below that of existing approaches. Conclusion Our experiments show that gpt-4.0 surpasses state-of-the-art performance if the task is constrained to a subset of the target ontology where there is prior knowledge of the terms that are expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost, and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task. |
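The abstract reports results at two granularities: document-level and mention-level F1. As a clarifying illustration only (this is not the authors' evaluation code, and all names and example data below are hypothetical), the following sketch shows how a document-level F1 score over Human Phenotype Ontology concept IDs might be computed, treating each document's annotations as a set of ontology IDs:

```python
# Minimal sketch, assuming gold and predicted annotations per document are
# available as sets of HPO concept IDs (e.g., "HP:0001250" for Seizure).
# Variable names and the example data below are hypothetical.

def doc_level_f1(gold: dict[str, set[str]], pred: dict[str, set[str]]) -> float:
    """Micro-averaged document-level F1 over HPO concept ID sets."""
    tp = fp = fn = 0
    for doc_id, gold_ids in gold.items():
        pred_ids = pred.get(doc_id, set())
        tp += len(gold_ids & pred_ids)  # concepts correctly recognised
        fp += len(pred_ids - gold_ids)  # spurious concepts
        fn += len(gold_ids - pred_ids)  # missed concepts
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return (2 * precision * recall / (precision + recall)
            if (precision + recall) else 0.0)

# Toy example: three gold concepts in one abstract, two recognised correctly
# plus one spurious prediction -> precision = recall = 2/3, F1 ~ 0.67.
gold = {"doc1": {"HP:0001250", "HP:0001263", "HP:0000252"}}
pred = {"doc1": {"HP:0001250", "HP:0001263", "HP:0004322"}}
print(f"{doc_level_f1(gold, pred):.2f}")  # 0.67
```

Mention-level scoring would instead compare individual (span, concept) annotations, so a concept occurring several times in a document counts once per mention rather than once per document, which is why the two F1 figures in the abstract can differ.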
first_indexed | 2024-03-07T14:57:58Z |
format | Article |
id | doaj.art-f4fb3d1859b74b69815dbe8519d05b0e |
institution | Directory Open Access Journal |
issn | 1472-6947 |
language | English |
last_indexed | 2024-03-07T14:57:58Z |
publishDate | 2024-01-01 |
publisher | BMC |
record_format | Article |
series | BMC Medical Informatics and Decision Making |
spelling | doaj.art-f4fb3d1859b74b69815dbe8519d05b0e (2024-03-05T19:19:44Z); English; BMC; BMC Medical Informatics and Decision Making, ISSN 1472-6947, Vol 24, Iss 1, Pp 1-12 (2024-01-01); https://doi.org/10.1186/s12911-024-02439-w; An evaluation of GPT models for phenotype concept recognition. Author affiliations: Tudor Groza and Gareth Baynam, Rare Care Centre, Perth Children’s Hospital; Harry Caufield, Christopher J. Mungall and Justin T. Reese, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory; Dylan Gration, Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital; Melissa A. Haendel, University of Colorado Anschutz Medical Campus; Peter N. Robinson, The Jackson Laboratory for Genomic Medicine. Abstract and subject keywords as in the description and topic fields above. |
spellingShingle | Tudor Groza; Harry Caufield; Dylan Gration; Gareth Baynam; Melissa A. Haendel; Peter N. Robinson; Christopher J. Mungall; Justin T. Reese; An evaluation of GPT models for phenotype concept recognition; BMC Medical Informatics and Decision Making; Large language models; Generative pretrained transformer; Artificial intelligence; Phenotype concept recognition; Human Phenotype Ontology
title | An evaluation of GPT models for phenotype concept recognition |
title_full | An evaluation of GPT models for phenotype concept recognition |
title_fullStr | An evaluation of GPT models for phenotype concept recognition |
title_full_unstemmed | An evaluation of GPT models for phenotype concept recognition |
title_short | An evaluation of GPT models for phenotype concept recognition |
title_sort | evaluation of gpt models for phenotype concept recognition |
topic | Large language models; Generative pretrained transformer; Artificial intelligence; Phenotype concept recognition; Human Phenotype Ontology
url | https://doi.org/10.1186/s12911-024-02439-w |
work_keys_str_mv | AT tudorgroza anevaluationofgptmodelsforphenotypeconceptrecognition AT harrycaufield anevaluationofgptmodelsforphenotypeconceptrecognition AT dylangration anevaluationofgptmodelsforphenotypeconceptrecognition AT garethbaynam anevaluationofgptmodelsforphenotypeconceptrecognition AT melissaahaendel anevaluationofgptmodelsforphenotypeconceptrecognition AT peternrobinson anevaluationofgptmodelsforphenotypeconceptrecognition AT christopherjmungall anevaluationofgptmodelsforphenotypeconceptrecognition AT justintreese anevaluationofgptmodelsforphenotypeconceptrecognition AT tudorgroza evaluationofgptmodelsforphenotypeconceptrecognition AT harrycaufield evaluationofgptmodelsforphenotypeconceptrecognition AT dylangration evaluationofgptmodelsforphenotypeconceptrecognition AT garethbaynam evaluationofgptmodelsforphenotypeconceptrecognition AT melissaahaendel evaluationofgptmodelsforphenotypeconceptrecognition AT peternrobinson evaluationofgptmodelsforphenotypeconceptrecognition AT christopherjmungall evaluationofgptmodelsforphenotypeconceptrecognition AT justintreese evaluationofgptmodelsforphenotypeconceptrecognition |