Leveraging Wikipedia Knowledge for Distant Supervision in Medical Concept Normalization

Bibliographic Details
Main Authors: Ningtyas, Annisa Maulida; El-Ebshihy, Alaa; Herwanto, Guntur Budi; Piroi, Florina; Hanbury, Allan
Format: Article
Published: 2022
Description
Summary: The majority of recent research has approached the Medical Concept Normalization (MCN) task as supervised text classification. However, even when combined, the currently available training datasets for this task (CADEC, PsyTAR, COMETA) cover only a small fraction of the concepts contained in the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT). In this work, we propose a distant supervision approach to broaden the training data coverage of the SNOMED-CT concepts by tapping into Wikipedia as a source of informal medical phrases. Based on our observations, components of Wikipedia articles (article summaries, Wikipedia’s redirect pages, wikilinks data) contain informal medical terms that can be generalized to those used in social media posts. We extract the article summaries, redirect pages, and wikilinks data from Wikipedia articles on medical topics and pair this data with the corresponding SNOMED-CT concepts. Our distant supervision approach doubled the concept coverage provided by the public MCN datasets. Our experiments show that the proposed distant supervision approach improved model performance on the three publicly available MCN datasets. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
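The pairing step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the `WikiArticle` container, the `title_to_concept` lookup (how Wikipedia articles are matched to SNOMED-CT concepts is not stated in this record), the lowercase/whitespace normalization, and the coverage helper are all assumptions made for illustration, and the concept identifiers in the usage note are placeholders rather than real SNOMED-CT codes.

```python
from dataclasses import dataclass, field


@dataclass
class WikiArticle:
    """Hypothetical container for the three article components named in the summary."""
    title: str
    summary: str
    redirects: list = field(default_factory=list)         # titles of redirect pages
    wikilink_anchors: list = field(default_factory=list)  # anchor texts linking to the article


def build_training_pairs(articles, title_to_concept):
    """Pair each informal phrase with the concept of the article it came from.

    title_to_concept is an assumed, pre-built lookup from article title to a
    concept identifier. Articles without an entry are skipped.
    """
    pairs = set()
    for article in articles:
        concept_id = title_to_concept.get(article.title)
        if concept_id is None:
            continue  # no known concept for this article, so no distant label
        phrases = [article.summary, *article.redirects, *article.wikilink_anchors]
        for phrase in phrases:
            # Assumed normalization: lowercase and collapse runs of whitespace.
            text = " ".join(phrase.lower().split())
            if text:
                pairs.add((text, concept_id))
    return sorted(pairs)


def concept_coverage(pairs, all_concepts):
    """Fraction of the target concepts that have at least one training phrase."""
    targets = set(all_concepts)
    if not targets:
        return 0.0
    covered = {concept_id for _, concept_id in pairs} & targets
    return len(covered) / len(targets)
```

As a usage example, an article titled "Myocardial infarction" with the redirect "Heart attack" and a lookup entry `{"Myocardial infarction": "CONCEPT_A"}` would yield the pair `("heart attack", "CONCEPT_A")`. Calling `concept_coverage` on the pairs before and after adding the Wikipedia-derived data gives a simple way to compare coverage of the kind the summary reports.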