Deep Phenotyping of Chinese Electronic Health Records by Recognizing Linguistic Patterns of Phenotypic Narratives With a Sequence Motif Discovery Tool: Algorithm Development and Validation

Background: Phenotype information in electronic health records (EHRs) is mainly recorded in unstructured free text, which cannot be directly used for clinical research. EHR-based deep-phenotyping methods can structure phenotype information in EHRs with high fidelity, making it...

Bibliographic Details
Main Authors:	Shicheng Li, Lizong Deng, Xu Zhang, Luming Chen, Tao Yang, Yifan Qi, Taijiao Jiang
Format:	Article
Language:	English
Published:	JMIR Publications 2022-06-01
Series:	Journal of Medical Internet Research
Online Access:	https://www.jmir.org/2022/6/e37213

_version_	1797735006883282944
author	Shicheng Li, Lizong Deng, Xu Zhang, Luming Chen, Tao Yang, Yifan Qi, Taijiao Jiang
author_facet	Shicheng Li, Lizong Deng, Xu Zhang, Luming Chen, Tao Yang, Yifan Qi, Taijiao Jiang
author_sort	Shicheng Li
collection	DOAJ
description	Background: Phenotype information in electronic health records (EHRs) is mainly recorded in unstructured free text, which cannot be directly used for clinical research. EHR-based deep-phenotyping methods can structure phenotype information in EHRs with high fidelity, making it the focus of medical informatics. However, developing a deep-phenotyping method for non-English EHRs (ie, Chinese EHRs) is challenging. Although numerous EHR resources exist in China, fine-grained annotation data that are suitable for developing deep-phenotyping methods are limited. It is challenging to develop a deep-phenotyping method for Chinese EHRs in such a low-resource scenario. Objective: In this study, we aimed to develop a deep-phenotyping method with good generalization ability for Chinese EHRs based on limited fine-grained annotation data. Methods: The core of the methodology was to identify linguistic patterns of phenotype descriptions in Chinese EHRs with a sequence motif discovery tool and perform deep phenotyping of Chinese EHRs by recognizing linguistic patterns in free text. Specifically, 1000 Chinese EHRs were manually annotated based on a fine-grained information model, PhenoSSU (Semantic Structured Unit of Phenotypes). The annotation data set was randomly divided into a training set (n=700, 70%) and a testing set (n=300, 30%). The process for mining linguistic patterns was divided into three steps. First, free text in the training set was encoded as single-letter sequences (P: phenotype, A: attribute). Second, a biological sequence analysis tool, MEME (Multiple Expectation Maximums for Motif Elicitation), was used to identify motifs in the single-letter sequences. Finally, the identified motifs were reduced to a series of regular expressions representing linguistic patterns of PhenoSSU instances in Chinese EHRs. Based on the discovered linguistic patterns, we developed a deep-phenotyping method for Chinese EHRs, including a deep learning–based method for named entity recognition and a pattern recognition–based method for attribute prediction. Results: In total, 51 sequence motifs with statistical significance were mined from 700 Chinese EHRs in the training set and were combined into six regular expressions. It was found that these six regular expressions could be learned from a mean of 134 (SD 9.7) annotated EHRs in the training set. The deep-phenotyping algorithm for Chinese EHRs could recognize PhenoSSU instances with an overall accuracy of 0.844 on the test set. For the subtask of entity recognition, the algorithm achieved an F1 score of 0.898 with the Bidirectional Encoder Representations from Transformers–bidirectional long short-term memory and conditional random field model; for the subtask of attribute prediction, the algorithm achieved a weighted accuracy of 0.940 with the linguistic pattern–based method. Conclusions: We developed a simple but effective strategy to perform deep phenotyping of Chinese EHRs with limited fine-grained annotation data. Our work will promote the second use of Chinese EHRs and give inspiration to other non–English-speaking countries.
first_indexed	2024-03-12T12:52:43Z
format	Article
id	doaj.art-9b7f82f6494645cfb6f2e6795344b883
institution	Directory of Open Access Journals
issn	1438-8871
language	English
last_indexed	2024-03-12T12:52:43Z
publishDate	2022-06-01
publisher	JMIR Publications
record_format	Article
series	Journal of Medical Internet Research
spelling	doaj.art-9b7f82f6494645cfb6f2e6795344b883; 2023-08-28T22:13:21Z; eng; JMIR Publications; Journal of Medical Internet Research; ISSN 1438-8871; 2022-06-01; vol. 24, no. 6, e37213; doi:10.2196/37213; Shicheng Li (https://orcid.org/0000-0002-5893-8822), Lizong Deng (https://orcid.org/0000-0001-9314-262X), Xu Zhang (https://orcid.org/0000-0002-2270-9286), Luming Chen (https://orcid.org/0000-0003-2468-8631), Tao Yang (https://orcid.org/0000-0002-7521-4295), Yifan Qi (https://orcid.org/0000-0002-0665-2611), Taijiao Jiang (https://orcid.org/0000-0002-6280-6347); https://www.jmir.org/2022/6/e37213
title	Deep Phenotyping of Chinese Electronic Health Records by Recognizing Linguistic Patterns of Phenotypic Narratives With a Sequence Motif Discovery Tool: Algorithm Development and Validation
url	https://www.jmir.org/2022/6/e37213
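
The Methods in the description above encode annotated text as single-letter sequences (P: phenotype, A: attribute) and then match regular expressions over those sequences. The following is a minimal sketch of that idea only, not the authors' implementation. The token labels, the extra letter "O" for other tokens, the function names, and the pattern are all assumptions made for illustration; the pattern is hand-written and is not one of the six published expressions, and the MEME motif discovery step is not shown.

```python
import re

# Hypothetical labelled tokens, as a named entity recognition step might emit them.
# "P" = phenotype, "A" = attribute, "O" = any other token (an assumed extra label).
EXAMPLE = [
    ("无", "A"),    # "no" (negation attribute)
    ("发热", "P"),  # "fever"
    ("，", "O"),    # punctuation
    ("剧烈", "A"),  # "severe"
    ("咳嗽", "P"),  # "cough"
]

# Illustrative pattern only: a phenotype with any attributes immediately
# before or after it. The real patterns were derived from MEME motifs.
PATTERN = re.compile(r"A*PA*")


def encode(tokens):
    """Turn labelled tokens into a single-letter sequence, one letter per token."""
    return "".join(label for _, label in tokens)


def group_phenotypes(tokens):
    """Attach adjacent attributes to each phenotype by matching PATTERN."""
    sequence = encode(tokens)
    units = []
    for match in PATTERN.finditer(sequence):
        # One letter per token, so match offsets are also token indices.
        span = tokens[match.start():match.end()]
        phenotype = next(text for text, label in span if label == "P")
        attributes = [text for text, label in span if label == "A"]
        units.append({"phenotype": phenotype, "attributes": attributes})
    return units


if __name__ == "__main__":
    print(encode(EXAMPLE))
    for unit in group_phenotypes(EXAMPLE):
        print(unit)
```

Because each token maps to exactly one letter, a match position in the encoded string can be used directly as an index into the token list, which is what makes the single-letter encoding convenient for pattern matching.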

Deep Phenotyping of Chinese Electronic Health Records by Recognizing Linguistic Patterns of Phenotypic Narratives With a Sequence Motif Discovery Tool: Algorithm Development and Validation

Similar Items