The OpenDeID corpus for patient de-identification
Abstract For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured ele...
Main Authors: | Jitendra Jonnagaddala, Aipeng Chen, Sean Batongbacal, Chandini Nekkantti |
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio 2021-10-01 |
Series: | Scientific Reports |
Online Access: | https://doi.org/10.1038/s41598-021-99554-9 |
_version_ | 1818343171609853952 |
---|---|
author | Jitendra Jonnagaddala; Aipeng Chen; Sean Batongbacal; Chandini Nekkantti |
author_facet | Jitendra Jonnagaddala; Aipeng Chen; Sean Batongbacal; Chandini Nekkantti |
author_sort | Jitendra Jonnagaddala |
collection | DOAJ |
description | Abstract: For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings. The quality of the annotations was evaluated for each setting. Specifically, we employed serial annotations, parallel annotations, and pre-annotations. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers. |
first_indexed | 2024-12-13T16:26:21Z |
format | Article |
id | doaj.art-bf43a13da7444d1aa4e7e7642ca37eee |
institution | Directory Open Access Journal |
issn | 2045-2322 |
language | English |
last_indexed | 2024-12-13T16:26:21Z |
publishDate | 2021-10-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Reports |
spelling | doaj.art-bf43a13da7444d1aa4e7e7642ca37eee; 2022-12-21T23:38:36Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2021-10-01; 11118; 10.1038/s41598-021-99554-9; The OpenDeID corpus for patient de-identification; Jitendra Jonnagaddala (0), Aipeng Chen (1), Sean Batongbacal (2), Chandini Nekkantti (3); (0) School of Population Health, UNSW Sydney; (1) School of Computer Science and Engineering, UNSW Sydney; (2) School of Computer Science and Engineering, UNSW Sydney; (3) CGD Health; Abstract: For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings. The quality of the annotations was evaluated for each setting. Specifically, we employed serial annotations, parallel annotations, and pre-annotations. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.; https://doi.org/10.1038/s41598-021-99554-9 |
spellingShingle | Jitendra Jonnagaddala; Aipeng Chen; Sean Batongbacal; Chandini Nekkantti; The OpenDeID corpus for patient de-identification; Scientific Reports |
title | The OpenDeID corpus for patient de-identification |
title_full | The OpenDeID corpus for patient de-identification |
title_fullStr | The OpenDeID corpus for patient de-identification |
title_full_unstemmed | The OpenDeID corpus for patient de-identification |
title_short | The OpenDeID corpus for patient de-identification |
title_sort | opendeid corpus for patient de identification |
url | https://doi.org/10.1038/s41598-021-99554-9 |
work_keys_str_mv | AT jitendrajonnagaddala theopendeidcorpusforpatientdeidentification AT aipengchen theopendeidcorpusforpatientdeidentification AT seanbatongbacal theopendeidcorpusforpatientdeidentification AT chandininekkantti theopendeidcorpusforpatientdeidentification AT jitendrajonnagaddala opendeidcorpusforpatientdeidentification AT aipengchen opendeidcorpusforpatientdeidentification AT seanbatongbacal opendeidcorpusforpatientdeidentification AT chandininekkantti opendeidcorpusforpatientdeidentification |