Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients

The de-identification of clinical reports is essential to protect the confidentiality of patients. The natural-language-processing-based named entity recognition (NER) model is a widely used technique of automatic clinical de-identification. The performance of such a machine learning model relies la...


Bibliographic Details
Main Authors: Tanmoy Paul, Humayera Islam, Nitesh Singh, Yaswitha Jampani, Teja Venkat Pavan Kotapati, Preethi Aishwarya Tautam, Md Kamruz Zaman Rana, Vasanthi Mandhadi, Vishakha Sharma, Michael Barnes, Richard D. Hammer, Abu Saleh Mohammad Mosa
Format: Article
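Language: English
Published: MDPI AG 2022-10-01
Series: Applied Sciences
Subjects: protected health information; natural language processing (NLP); named entity recognition (NER); de-identification; conditional random field (CRF)
Online Access: https://www.mdpi.com/2076-3417/12/19/9976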
author Tanmoy Paul
Humayera Islam
Nitesh Singh
Yaswitha Jampani
Teja Venkat Pavan Kotapati
Preethi Aishwarya Tautam
Md Kamruz Zaman Rana
Vasanthi Mandhadi
Vishakha Sharma
Michael Barnes
Richard D. Hammer
Abu Saleh Mohammad Mosa
author_sort Tanmoy Paul
collection DOAJ
description The de-identification of clinical reports is essential to protect patient confidentiality. Named entity recognition (NER) based on natural language processing is a widely used technique for automatic clinical de-identification. The performance of such a machine learning model relies largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. Natural language processing (NLP) toolkits were used to annotate the protected health information (PHI) in a total of 10,239 radiology reports that were divided into seven types. Multiple features were extracted by the toolkit, and NER models were built using these features and their combinations. A total of 10 features were extracted, and the performance of the models was evaluated based on their precision, recall, and F1-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape. These features outperformed the others across all types of reports. The dataset was large in volume and divided into multiple types of reports. Such a diverse dataset ensured that the results did not depend on a small number of structured texts from which a machine learning model can easily learn the features. The manual de-identification of large-scale clinical reports is impractical. This study helps to identify, from the wide array of features mentioned in the literature, the best-performing features for building an NER model for automatic de-identification.
first_indexed 2024-03-09T22:02:02Z
format Article
id doaj.art-e0a2cb981dc5479694e7a9dccdcb2ccd
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-09T22:02:02Z
publishDate 2022-10-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-e0a2cb981dc5479694e7a9dccdcb2ccd
2023-11-23T19:48:48Z
eng
MDPI AG
Applied Sciences
2076-3417
2022-10-01
12
19
9976
10.3390/app12199976
Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients
Tanmoy Paul (0)
Humayera Islam (1)
Nitesh Singh (2)
Yaswitha Jampani (3)
Teja Venkat Pavan Kotapati (4)
Preethi Aishwarya Tautam (5)
Md Kamruz Zaman Rana (6)
Vasanthi Mandhadi (7)
Vishakha Sharma (8)
Michael Barnes (9)
Richard D. Hammer (10)
Abu Saleh Mohammad Mosa (11)
(0) Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
(1) NextGen Biomedical Informatics Center, University of Missouri, Columbia, MO 65211, USA
(2) NextGen Biomedical Informatics Center, University of Missouri, Columbia, MO 65211, USA
(3) NextGen Biomedical Informatics Center, University of Missouri, Columbia, MO 65211, USA
(4) Department of Health Management and Informatics, University of Missouri, Columbia, MO 65211, USA
(5) Department of Health Management and Informatics, University of Missouri, Columbia, MO 65211, USA
(6) NextGen Biomedical Informatics Center, University of Missouri, Columbia, MO 65211, USA
(7) Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
(8) Roche Diagnostics, F. Hoffmann-La Roche, Santa Clara, CA 95050, USA
(9) Roche Diagnostics, F. Hoffmann-La Roche, Santa Clara, CA 95050, USA
(10) Department of Pathology and Anatomical Sciences, University of Missouri, Columbia, MO 65211, USA
(11) Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
https://www.mdpi.com/2076-3417/12/19/9976
protected health information
natural language processing (NLP)
named entity recognition (NER)
de-identification
conditional random field (CRF)
title Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients
topic protected health information
natural language processing (NLP)
named entity recognition (NER)
de-identification
conditional random field (CRF)
url https://www.mdpi.com/2076-3417/12/19/9976
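
Three of the best-performing features named in the description (n-gram, prefix-suffix, and word shape) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not say which toolkit or feature definitions the study used, so the function names, the three-character affix length, the bigram window, and the `<BOS>`/`<EOS>` padding markers are all assumptions chosen for illustration. Word embeddings are left out because they require a trained model. The output is one feature dictionary per token, a shape that CRF toolkits commonly accept.

```python
import re


def word_shape(token: str) -> str:
    """Coarse shape of a token: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    # Digits are mapped last so the inserted "d" is not rewritten to "x".
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens: list[str], i: int, affix_len: int = 3) -> dict[str, str]:
    """Feature dictionary for tokens[i] (hypothetical feature names)."""
    tok = tokens[i]
    # Neighbouring tokens, padded at the sentence boundaries.
    prev_tok = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_tok = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),              # word-shape feature
        "prefix": tok[:affix_len],             # prefix feature
        "suffix": tok[-affix_len:],            # suffix feature
        "bigram_prev": f"{prev_tok} {tok.lower()}",  # n-gram (n = 2) to the left
        "bigram_next": f"{tok.lower()} {next_tok}",  # n-gram (n = 2) to the right
    }


def sentence_features(tokens: list[str]) -> list[dict[str, str]]:
    """One feature dictionary per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    example = ["Seen", "by", "Dr.", "Smith", "on", "01/02/2020"]
    for tok, feats in zip(example, sentence_features(example)):
        print(tok, feats)
```

Shape and affix features of this kind give a CRF a way to generalize to names and dates it has not seen, since tokens such as surnames or formatted dates share shapes even when the surface strings differ.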