Thesaurus-based disambiguation of gene symbols

Bibliographic Details
Main Authors: Wain Hester M, van Mulligen Erik M, Schuemie Martijn J, Weeber Marc, Mons Barend, Schijvenaars Bob JA, Kors Jan A
Format: Article
Language: English
Published: BMC 2005-06-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/6/149
collection DOAJ
description Abstract
Background: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck.
Results: We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set.
Conclusion: The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.
first_indexed 2024-04-12T18:38:28Z
format Article
id doaj.art-1950ff10c6064d5eb83480c3d51c066c
institution Directory Open Access Journal
issn 1471-2105
doi 10.1186/1471-2105-6-149
title Thesaurus-based disambiguation of gene symbols
url http://www.biomedcentral.com/1471-2105/6/149