The OpenDeID corpus for patient de-identification

Abstract For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured ele...

Bibliographic Details
Main Authors: Jitendra Jonnagaddala, Aipeng Chen, Sean Batongbacal, Chandini Nekkantti
Format: Article
Language: English
Published: Nature Portfolio 2021-10-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-021-99554-9
author Jitendra Jonnagaddala
Aipeng Chen
Sean Batongbacal
Chandini Nekkantti
collection DOAJ
description Abstract For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist the development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings: serial annotations, parallel annotations, and pre-annotations. The quality of the annotations was evaluated for each setting. Our results suggest that the pre-annotation approach is not reliable in terms of quality when compared to serial annotations, but it can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients, with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.
first_indexed 2024-12-13T16:26:21Z
format Article
id doaj.art-bf43a13da7444d1aa4e7e7642ca37eee
institution Directory Open Access Journal
issn 2045-2322
language English
last_indexed 2024-12-13T16:26:21Z
publishDate 2021-10-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj.art-bf43a13da7444d1aa4e7e7642ca37eee
2022-12-21T23:38:36Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2021-10-01
11118
10.1038/s41598-021-99554-9
The OpenDeID corpus for patient de-identification
Jitendra Jonnagaddala (School of Population Health, UNSW Sydney)
Aipeng Chen (School of Computer Science and Engineering, UNSW Sydney)
Sean Batongbacal (School of Computer Science and Engineering, UNSW Sydney)
Chandini Nekkantti (CGD Health)
https://doi.org/10.1038/s41598-021-99554-9
title The OpenDeID corpus for patient de-identification
url https://doi.org/10.1038/s41598-021-99554-9