PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature

Bibliographic Details
Main Authors: Samar Binkheder, Heng-Yi Wu, Sara K. Quinney, Shijun Zhang, Md. Muntasir Zitu, Chien‐Wei Chiang, Lei Wang, Josette Jones, Lang Li
Format: Article
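Language: English
Published: BMC 2022-06-01
Series: Journal of Biomedical Semantics
Subjects: Adverse drug events; Biomedical corpus; Electronic health records; Phenotype definitions; Text mining
Online Access: https://doi.org/10.1186/s13326-022-00272-6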
author Samar Binkheder
Heng-Yi Wu
Sara K. Quinney
Shijun Zhang
Md. Muntasir Zitu
Chien‐Wei Chiang
Lei Wang
Josette Jones
Lang Li
collection DOAJ
description Abstract
Background: Adverse events induced by drug-drug interactions are a major concern in the United States. Current research is moving toward using electronic health record (EHR) data, including for the discovery of adverse drug events. One of the first steps in EHR-based studies is to define a phenotype for establishing a cohort of patients. However, phenotype definitions are not readily available for all phenotypes. One of the first steps in developing automated text mining tools is building a corpus. Therefore, this study aimed to develop annotation guidelines and a gold standard corpus to facilitate building future automated approaches for mining phenotype definitions contained in the literature. Furthermore, we aimed to improve the understanding of how these published phenotype definitions are presented in the literature and how we annotate them for future text mining tasks.
Results: Two annotators manually annotated the corpus at the sentence level for the presence of evidence for phenotype definitions. Three major categories (inclusion, intermediate, and exclusion) with a total of ten dimensions were proposed, characterizing major contextual patterns and cues for presenting phenotype definitions in published literature. The developed annotation guidelines were used to annotate the corpus, which contained 3971 sentences: 1923 out of 3971 (48.4%) for the inclusion category, 1851 out of 3971 (46.6%) for the intermediate category, and 2273 out of 3971 (57.2%) for the exclusion category. The highest number of annotated sentences was 1449 out of 3971 (36.5%) for the “Biomedical & Procedure” dimension. The lowest number of annotated sentences was 49 out of 3971 (1.2%) for “The use of NLP”. The overall percent inter-annotator agreement was 97.8%. Percent and Kappa statistics also showed high inter-annotator agreement across all dimensions.
Conclusions: The corpus and annotation guidelines can serve as a foundational informatics approach for annotating and mining phenotype definitions in the literature, and can be used later for text mining applications.
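The abstract reports inter-annotator agreement as a percent figure and a Kappa statistic. As a rough illustration of how the two measures differ, the sketch below computes both for two annotators who label each sentence as 1 (evidence present) or 0 (absent). The toy labels and function names are hypothetical, and Cohen's kappa is an assumption here: the record does not state which Kappa variant the authors used.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of sentences on which both annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently at their own observed rates.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data, NOT taken from the PhenoDEF corpus.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]

print(f"percent agreement: {percent_agreement(annotator_1, annotator_2):.3f}")
print(f"Cohen's kappa:     {cohen_kappa(annotator_1, annotator_2):.3f}")
```

Kappa is lower than raw percent agreement whenever some agreement is expected by chance, which is presumably why the abstract reports both measures per dimension.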
first_indexed 2024-04-12T15:36:26Z
format Article
id doaj.art-27997d7d695640b7988d6d73e3318b51
institution Directory Open Access Journal
issn 2041-1480
language English
last_indexed 2024-04-12T15:36:26Z
publishDate 2022-06-01
publisher BMC
record_format Article
series Journal of Biomedical Semantics
spelling doaj.art-27997d7d695640b7988d6d73e3318b51
2022-12-22T03:26:56Z
eng
BMC
Journal of Biomedical Semantics
2041-1480
2022-06-01
131117
10.1186/s13326-022-00272-6
PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature
Samar Binkheder (0)
Heng-Yi Wu (1)
Sara K. Quinney (2)
Shijun Zhang (3)
Md. Muntasir Zitu (4)
Chien‐Wei Chiang (5)
Lei Wang (6)
Josette Jones (7)
Lang Li (8)
0 Department of Biohealth Informatics, Indiana University School of Informatics and Computing
1 Development Science Informatics, Genentech
2 Department of Obstetrics and Gynecology, Indiana University School of Medicine
3 Department of Biomedical Informatics, College of Medicine, The Ohio State University
4 Department of Biomedical Informatics, College of Medicine, The Ohio State University
5 Department of Biomedical Informatics, College of Medicine, The Ohio State University
6 Department of Biomedical Informatics, College of Medicine, The Ohio State University
7 Department of Biohealth Informatics, Indiana University School of Informatics and Computing
8 Department of Biomedical Informatics, College of Medicine, The Ohio State University
title PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature
topic Adverse drug events
Biomedical corpus
Electronic health records
Phenotype definitions
Text mining
url https://doi.org/10.1186/s13326-022-00272-6