An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation

Background: With the popularization of electronic health records in China, the utilization of digitalized data has great potential for the development of real-world medical research. However, the data usually contains a great deal of protected health information and the direct...

Bibliographic Details
Main Authors:	Peng Wang, Yong Li, Liang Yang, Simin Li, Linfeng Li, Zehan Zhao, Shaopei Long, Fei Wang, Hongqian Wang, Ying Li, Chengliang Wang
Format:	Article
Language:	English
Published:	JMIR Publications 2022-08-01
Series:	JMIR Medical Informatics
Online Access:	https://medinform.jmir.org/2022/8/e38154

author	Peng Wang Yong Li Liang Yang Simin Li Linfeng Li Zehan Zhao Shaopei Long Fei Wang Hongqian Wang Ying Li Chengliang Wang
collection	DOAJ
description	Background: With the popularization of electronic health records in China, the utilization of digitalized data has great potential for the development of real-world medical research. However, the data usually contains a great deal of protected health information and the direct usage of this data may cause privacy issues. The task of deidentifying protected health information in electronic health records can be regarded as a named entity recognition problem. Existing rule-based, machine learning–based, or deep learning–based methods have been proposed to solve this problem. However, these methods still face the difficulties of insufficient Chinese electronic health record data and the complex features of the Chinese language.
Objective: This paper proposes a method to overcome the difficulties of overfitting and a lack of training data for deep neural networks to enable Chinese protected health information deidentification.
Methods: We propose a new model that merges TinyBERT (bidirectional encoder representations from transformers) as a text feature extraction module and the conditional random field method as a prediction module for deidentifying protected health information in Chinese medical electronic health records. In addition, a hybrid data augmentation method that integrates a sentence generation strategy and a mention-replacement strategy is proposed for overcoming insufficient Chinese electronic health records.
Results: We compare our method with 5 baseline methods that utilize different BERT models as their feature extraction modules. Experimental results on the Chinese electronic health records that we collected demonstrate that our method had better performance (microprecision: 98.7%, microrecall: 99.13%, and micro-F1 score: 98.91%) and higher efficiency (40% faster) than all the BERT-based baseline methods.
Conclusions: Compared to baseline methods, the efficiency advantage of TinyBERT on our proposed augmented data set was kept while the performance improved for the task of Chinese protected health information deidentification.
first_indexed	2024-03-12T12:49:56Z
format	Article
id	doaj.art-c67adb9b501b4c718dd84430123992f7
institution	Directory Open Access Journal
issn	2291-9694
language	English
last_indexed	2024-03-12T12:49:56Z
publishDate	2022-08-01
publisher	JMIR Publications
record_format	Article
series	JMIR Medical Informatics
spelling	doaj.art-c67adb9b501b4c718dd84430123992f7; 2023-08-28T22:58:36Z; eng; JMIR Publications; JMIR Medical Informatics; 2291-9694; 2022-08-01; 10; 8; e38154; 10.2196/38154; An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation; Peng Wang https://orcid.org/0000-0002-5571-3425; Yong Li https://orcid.org/0000-0001-7937-810X; Liang Yang https://orcid.org/0000-0002-7981-0764; Simin Li https://orcid.org/0000-0001-8184-9505; Linfeng Li https://orcid.org/0000-0003-4949-8906; Zehan Zhao https://orcid.org/0000-0001-9508-1701; Shaopei Long https://orcid.org/0000-0001-6955-6945; Fei Wang https://orcid.org/0000-0002-2890-8964; Hongqian Wang https://orcid.org/0000-0002-1432-5012; Ying Li https://orcid.org/0000-0001-5153-5441; Chengliang Wang https://orcid.org/0000-0003-0877-1064; https://medinform.jmir.org/2022/8/e38154
title	An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation
url	https://medinform.jmir.org/2022/8/e38154

Similar Items