Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients
The de-identification of clinical reports is essential to protect the confidentiality of patients. The natural-language-processing-based named entity recognition (NER) model is a widely used technique of automatic clinical de-identification. The performance of such a machine learning model relies la...
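Main Authors: | Tanmoy Paul, Humayera Islam, Nitesh Singh, Yaswitha Jampani, Teja Venkat Pavan Kotapati, Preethi Aishwarya Tautam, Md Kamruz Zaman Rana, Vasanthi Mandhadi, Vishakha Sharma, Michael Barnes, Richard D. Hammer, Abu Saleh Mohammad Mosa |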
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-10-01 |
Series: | Applied Sciences |
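Subjects: | protected health information; natural language processing (NLP); named entity recognition (NER); de-identification; conditional random field (CRF) |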
Online Access: | https://www.mdpi.com/2076-3417/12/19/9976 |
_version_ | 1797480575462801408 |
---|---|
author | Tanmoy Paul Humayera Islam Nitesh Singh Yaswitha Jampani Teja Venkat Pavan Kotapati Preethi Aishwarya Tautam Md Kamruz Zaman Rana Vasanthi Mandhadi Vishakha Sharma Michael Barnes Richard D. Hammer Abu Saleh Mohammad Mosa |
author_sort | Tanmoy Paul |
collection | DOAJ |
description | The de-identification of clinical reports is essential to protect patient confidentiality. Natural-language-processing-based named entity recognition (NER) is a widely used technique for automatic clinical de-identification. The performance of such a machine learning model depends largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. Natural language processing (NLP) toolkits were used to annotate the protected health information (PHI) in a total of 10,239 radiology reports, which were divided into seven types. A total of 10 features were extracted by the toolkit, and NER models were built using these features and their combinations. The performance of the models was evaluated based on their precision, recall, and F<sub>1</sub>-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape. These features outperformed the others across all types of reports. The dataset was large in volume and divided into multiple types of reports. Such a diverse dataset ensured that the results did not depend on a small number of structured texts from which a machine learning model can easily learn the features. The manual de-identification of large-scale clinical reports is impractical. This study helps to identify, from the wide array of features mentioned in the literature, the best-performing features for building an NER model for automatic de-identification. |
first_indexed | 2024-03-09T22:02:02Z |
format | Article |
id | doaj.art-e0a2cb981dc5479694e7a9dccdcb2ccd |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-09T22:02:02Z |
publishDate | 2022-10-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-e0a2cb981dc5479694e7a9dccdcb2ccd; 2023-11-23T19:48:48Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2022-10-01; vol. 12, no. 19, art. 9976; DOI 10.3390/app12199976; Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients; Authors and affiliations: Tanmoy Paul (Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA); Humayera Islam (NextGen Biomedical Informatics Center, University of Missouri, Columbia, MO 65211, USA); Nitesh Singh (NextGen Biomedical Informatics Center, University of Missouri, Columbia, MO 65211, USA); Yaswitha Jampani (NextGen Biomedical Informatics Center, University of Missouri, Columbia, MO 65211, USA); Teja Venkat Pavan Kotapati (Department of Health Management and Informatics, University of Missouri, Columbia, MO 65211, USA); Preethi Aishwarya Tautam (Department of Health Management and Informatics, University of Missouri, Columbia, MO 65211, USA); Md Kamruz Zaman Rana (NextGen Biomedical Informatics Center, University of Missouri, Columbia, MO 65211, USA); Vasanthi Mandhadi (Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA); Vishakha Sharma (Roche Diagnostics, F. Hoffmann-La Roche, Santa Clara, CA 95050, USA); Michael Barnes (Roche Diagnostics, F. Hoffmann-La Roche, Santa Clara, CA 95050, USA); Richard D. Hammer (Department of Pathology and Anatomical Sciences, University of Missouri, Columbia, MO 65211, USA); Abu Saleh Mohammad Mosa (Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA); https://www.mdpi.com/2076-3417/12/19/9976; Keywords: protected health information; natural language processing (NLP); named entity recognition (NER); de-identification; conditional random field (CRF) |
title | Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients |
title_sort | utility of features in a natural language processing based clinical de identification model using radiology reports for advanced nsclc patients |
topic | protected health information natural language processing (NLP) named entity recognition (NER) de-identification conditional random field (CRF) |
url | https://www.mdpi.com/2076-3417/12/19/9976 |
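
The description above names n-gram, prefix-suffix, and word shape among the best-performing features and reports precision, recall, and F<sub>1</sub>-score. The following is a minimal sketch of how such token-level features and scores are commonly computed, not the authors' implementation. The function names, the `X`/`x`/`d` shape convention, the affix length of 3, and the `<BOS>`/`<EOS>` padding markers are all illustrative assumptions; word embeddings and the CRF training step itself are omitted.

```python
import re


def word_shape(token: str) -> str:
    """Assumed convention: uppercase -> 'X', lowercase -> 'x', digit -> 'd'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens, i, affix_len=3):
    """Build a feature dict for the token at position i (hypothetical helper)."""
    token = tokens[i]
    lower = token.lower()
    # Neighbouring tokens, padded with boundary markers at the sentence edges.
    prev_tok = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_tok = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "word.lower": lower,
        "shape": word_shape(token),
        "prefix": token[:affix_len],
        "suffix": token[-affix_len:],
        # Word bigrams around the current token (one form of n-gram feature).
        "bigram.prev": f"{prev_tok} {lower}",
        "bigram.next": f"{lower} {next_tok}",
    }


def sentence_features(tokens):
    """One feature dict per token, the usual input shape for a CRF tagger."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    for feats in sentence_features(["Dr.", "Smith", "on", "01/02/2022"]):
        print(feats)
    print(precision_recall_f1(tp=8, fp=2, fn=2))
```

Word shape is useful for PHI such as dates and record numbers because tokens with different surface text share one shape (for example, any date written like `01/02/2022` maps to `dd/dd/dddd`), which a sequence model can generalize from.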