SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks

Abstract Background The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algo...

Bibliographic Details
Main Authors:	Lucas Emanuel Silva e Oliveira, Ana Carolina Peters, Adalniza Moura Pucca da Silva, Caroline Pilatti Gebeluca, Yohan Bonescki Gumiel, Lilian Mie Mukai Cintho, Deborah Ribeiro Carvalho, Sadid Al Hasan, Claudia Maria Cabral Moro
Format:	Article
Language:	English
Published:	BMC 2022-05-01
Series:	Journal of Biomedical Semantics
Subjects:	Natural language processing, Semantic annotation, Clinical narratives, Corpora, Gold standard
Online Access:	https://doi.org/10.1186/s13326-022-00269-1

_version_	1818250463086116864
author	Lucas Emanuel Silva e Oliveira, Ana Carolina Peters, Adalniza Moura Pucca da Silva, Caroline Pilatti Gebeluca, Yohan Bonescki Gumiel, Lilian Mie Mukai Cintho, Deborah Ribeiro Carvalho, Sadid Al Hasan, Claudia Maria Cabral Moro
author_sort	Lucas Emanuel Silva e Oliveira
collection	DOAJ
description	Abstract. Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluations of the annotations. Results: This study resulted in SemClinBr, a corpus of 1000 clinical notes labeled with 65,117 entities and 11,263 relations. In addition, both negation cue and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (strict match) to 0.92 (relaxed match, which accepts partial overlaps and hierarchically related semantic types). The extrinsic evaluation, in which the corpus was applied to two downstream NLP tasks, demonstrated the reliability and usefulness of the annotations, with the systems achieving results that were consistent with the agreement scores. Conclusion: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community and boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.
first_indexed	2024-12-12T15:52:47Z
format	Article
id	doaj.art-dca2a5a5934743c598bfb5786c86df86
institution	Directory of Open Access Journals
issn	2041-1480
language	English
last_indexed	2024-12-12T15:52:47Z
publishDate	2022-05-01
publisher	BMC
record_format	Article
series	Journal of Biomedical Semantics
spelling	doaj.art-dca2a5a5934743c598bfb5786c86df86 | 2022-12-22T00:19:33Z | eng | BMC | Journal of Biomedical Semantics | 2041-1480 | 2022-05-01 | 131119 | 10.1186/s13326-022-00269-1 | SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks | Lucas Emanuel Silva e Oliveira [0], Ana Carolina Peters [1], Adalniza Moura Pucca da Silva [2], Caroline Pilatti Gebeluca [3], Yohan Bonescki Gumiel [4], Lilian Mie Mukai Cintho [5], Deborah Ribeiro Carvalho [6], Sadid Al Hasan [7], Claudia Maria Cabral Moro [8] | [0]-[6], [8] Health Technology Program, Pontifical Catholic University of Paraná; [7] AI Lab, Philips Research North America | https://doi.org/10.1186/s13326-022-00269-1 | Natural language processing, Semantic annotation, Clinical narratives, Corpora, Gold standard
title	SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks
topic	Natural language processing, Semantic annotation, Clinical narratives, Corpora, Gold standard
url	https://doi.org/10.1186/s13326-022-00269-1

Similar Items