MARIE: A Context-Aware Term Mapping with String Matching and Embedding Vectors

With growing interest in machine learning, text standardization is becoming an increasingly important aspect of data pre-processing within biomedical communities. As performances of machine learning algorithms are affected by both the amount and the quality of their training data, effective data sta...

Bibliographic Details
Main Authors:	Han Kyul Kim, Sae Won Choi, Ye Seul Bae, Jiin Choi, Hyein Kwon, Christine P. Lee, Hae-Young Lee, Taehoon Ko
Format:	Article
Language:	English
Published:	MDPI AG 2020-11-01
Series:	Applied Sciences
Subjects:	text standardization; unsupervised term mapping; unsupervised concept normalization; biomedical text pre-processing
Online Access:	https://www.mdpi.com/2076-3417/10/21/7831

_version_	1827702862645297152
author	Han Kyul Kim, Sae Won Choi, Ye Seul Bae, Jiin Choi, Hyein Kwon, Christine P. Lee, Hae-Young Lee, Taehoon Ko
author_sort	Han Kyul Kim
collection	DOAJ
description	With growing interest in machine learning, text standardization is becoming an increasingly important aspect of data pre-processing within biomedical communities. Because the performance of machine learning algorithms is affected by both the amount and the quality of their training data, effective data standardization is needed to guarantee consistent data integrity. Furthermore, biomedical organizations, depending on their geographical locations or affiliations, rely on different text standardization schemes in practice. To facilitate machine learning-related collaborations between these organizations, an effective yet practical text data standardization method is needed. In this paper, we introduce MARIE (a context-aware term mapping method with string matching and embedding vectors), an unsupervised learning-based tool, to find standardized clinical terminologies for queries, such as a hospital’s own codes. By incorporating both string matching methods and term embedding vectors generated by BioBERT (bidirectional encoder representations from transformers for biomedical text mining), it utilizes both structural and contextual information to calculate similarity measures between source and target terms. Compared to previous term mapping methods, MARIE shows improved mapping accuracy. Furthermore, it can be easily extended to incorporate any string matching or term embedding method. Because it requires no additional model training, it is not only an effective but also a practical term mapping method for text data standardization and pre-processing.
first_indexed	2024-03-10T15:05:32Z
format	Article
id	doaj.art-d8af4db735f040e8bb336d19e486921b
institution	Directory of Open Access Journals
issn	2076-3417
language	English
last_indexed	2024-03-10T15:05:32Z
publishDate	2020-11-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-d8af4db735f040e8bb336d19e486921b; 2023-11-20T19:49:23Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2020-11-01; vol. 10, no. 21, art. 7831; DOI 10.3390/app10217831; MARIE: A Context-Aware Term Mapping with String Matching and Embedding Vectors; Han Kyul Kim [0], Sae Won Choi [1], Ye Seul Bae [2], Jiin Choi [3], Hyein Kwon [4], Christine P. Lee [5], Hae-Young Lee [6], Taehoon Ko [7]; [0]-[5] Office of Hospital Information, Seoul National University Hospital, Seoul 03080, Korea; [6] Department of Internal Medicine, Seoul National University Hospital, Seoul 03080, Korea; [7] Department of Medical Informatics, The Catholic University of Korea, Seoul 03080, Korea; https://www.mdpi.com/2076-3417/10/21/7831; text standardization; unsupervised term mapping; unsupervised concept normalization; biomedical text pre-processing
title	MARIE: A Context-Aware Term Mapping with String Matching and Embedding Vectors
title_sort	marie a context aware term mapping with string matching and embedding vectors
topic	text standardization; unsupervised term mapping; unsupervised concept normalization; biomedical text pre-processing
url	https://www.mdpi.com/2076-3417/10/21/7831
work_keys_str_mv	AT hankyulkim marieacontextawaretermmappingwithstringmatchingandembeddingvectors AT saewonchoi marieacontextawaretermmappingwithstringmatchingandembeddingvectors AT yeseulbae marieacontextawaretermmappingwithstringmatchingandembeddingvectors AT jiinchoi marieacontextawaretermmappingwithstringmatchingandembeddingvectors AT hyeinkwon marieacontextawaretermmappingwithstringmatchingandembeddingvectors AT christineplee marieacontextawaretermmappingwithstringmatchingandembeddingvectors AT haeyounglee marieacontextawaretermmappingwithstringmatchingandembeddingvectors AT taehoonko marieacontextawaretermmappingwithstringmatchingandembeddingvectors
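
The description field in this record says that MARIE combines string matching with term embedding vectors to score the similarity between source and target terms. The sketch below illustrates that general idea only. It is not the authors' implementation: the weighted sum, the `alpha` parameter, the use of `difflib.SequenceMatcher` as the string matcher, and the hand-written `TOY_EMBEDDINGS` (standing in for BioBERT vectors) are all assumptions made for illustration.

```python
import math
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Structural similarity in [0, 1] from a generic string matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine_similarity(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def combined_score(source, target, embeddings, alpha=0.5):
    """Weighted sum of string and embedding similarity.

    alpha is a hypothetical mixing weight: 1.0 uses string matching only,
    0.0 uses embedding similarity only.
    """
    s = string_similarity(source, target)
    e = cosine_similarity(embeddings[source], embeddings[target])
    return alpha * s + (1 - alpha) * e


def best_match(source, targets, embeddings, alpha=0.5):
    """Return the target term with the highest combined score."""
    return max(targets, key=lambda t: combined_score(source, t, embeddings, alpha))


# Hypothetical 3-dimensional vectors standing in for real term embeddings.
TOY_EMBEDDINGS = {
    "heart attack": [0.9, 0.1, 0.0],
    "myocardial infarction": [0.88, 0.12, 0.05],
    "heart rate": [0.1, 0.9, 0.1],
}

if __name__ == "__main__":
    candidates = ["myocardial infarction", "heart rate"]
    for alpha in (1.0, 0.5):
        match = best_match("heart attack", candidates, TOY_EMBEDDINGS, alpha)
        print(f"alpha={alpha}: {match}")
```

With these toy vectors, string matching alone (`alpha=1.0`) maps "heart attack" to the lexically closer "heart rate", while equal weighting (`alpha=0.5`) maps it to "myocardial infarction". This is the kind of case where contextual information can correct a purely structural match.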

Similar Items