Improving the CONTES method for normalizing biomedical text entities with concepts from an ontology with (almost) no training data

Entity normalization, called entity linking in the general domain, is an information extraction task that aims to annotate words or expressions in raw text with semantic references, such as concepts of an ontology. An ontology consists minimally of a formally organized vocabulary or hierarchy...


Bibliographic Details
Main Authors: Arnaud Ferré, Mouhamadou Ba, Robert Bossy
Format: Article
Language: English
Published: Korea Genome Organization 2019-06-01
Series: Genomics & Informatics
Subjects: biomedical text mining, entity normalization, ontology, word embedding
Online Access: http://genominfo.org/upload/pdf/gi-2019-17-2-e20.pdf
author Arnaud Ferré
Mouhamadou Ba
Robert Bossy
collection DOAJ
description Entity normalization, called entity linking in the general domain, is an information extraction task that aims to annotate words or expressions in raw text with semantic references, such as concepts of an ontology. An ontology consists minimally of a formally organized vocabulary or hierarchy of terms, which captures knowledge of a domain. Presently, machine-learning methods, often coupled with distributional representations, achieve good performance. However, these methods require large training datasets, which are not always available, especially for tasks in specialized domains. CONTES (CONcept-TErm System) is a supervised method that addresses entity normalization with ontology concepts using small training datasets. CONTES has some limitations: it does not scale well with very large ontologies, it tends to overgeneralize predictions, and it lacks valid representations for out-of-vocabulary words. Here, we propose to assess different methods for reducing the dimensionality of the ontology representation. We also propose to calibrate parameters to make the predictions more accurate, and to address the problem of out-of-vocabulary words with a specific method.
first_indexed 2024-12-22T03:44:03Z
format Article
id doaj.art-2dcffa9df61f4c95b39887f3a76ee59d
institution Directory Open Access Journal
issn 2234-0742
language English
last_indexed 2024-12-22T03:44:03Z
publishDate 2019-06-01
publisher Korea Genome Organization
record_format Article
series Genomics & Informatics
spelling doaj.art-2dcffa9df61f4c95b39887f3a76ee59d; 2022-12-21T18:40:11Z; eng; Korea Genome Organization; Genomics & Informatics; 2234-0742; 2019-06-01; 17; 2; 10.5808/GI.2019.17.2.e20; 562; Improving the CONTES method for normalizing biomedical text entities with concepts from an ontology with (almost) no training data; Arnaud Ferré 0; Mouhamadou Ba 1; Robert Bossy 2; MaIAGE, INRA, Paris-Saclay University, 78350 Jouy-en-Josas, France; MaIAGE, INRA, Paris-Saclay University, 78350 Jouy-en-Josas, France; MaIAGE, INRA, Paris-Saclay University, 78350 Jouy-en-Josas, France; http://genominfo.org/upload/pdf/gi-2019-17-2-e20.pdf; biomedical text mining; entity normalization; ontology; word embedding
title Improving the CONTES method for normalizing biomedical text entities with concepts from an ontology with (almost) no training data
topic biomedical text mining
entity normalization
ontology
word embedding
url http://genominfo.org/upload/pdf/gi-2019-17-2-e20.pdf