Thesaurus-based disambiguation of gene symbols
Abstract. Background: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck. Results: ...
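Main Authors: | Wain Hester M, van Mulligen Erik M, Schuemie Martijn J, Weeber Marc, Mons Barend, Schijvenaars Bob JA, Kors Jan A |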
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2005-06-01 |
Series: | BMC Bioinformatics |
Online Access: | http://www.biomedcentral.com/1471-2105/6/149 |
_version_ | 1811259829385166848 |
---|---|
author | Wain Hester M; van Mulligen Erik M; Schuemie Martijn J; Weeber Marc; Mons Barend; Schijvenaars Bob JA; Kors Jan A |
author_sort | Wain Hester M |
collection | DOAJ |
description | Abstract. Background: Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck. Results: We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set. Conclusion: The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications. |
first_indexed | 2024-04-12T18:38:28Z |
format | Article |
id | doaj.art-1950ff10c6064d5eb83480c3d51c066c |
institution | Directory Open Access Journal |
issn | 1471-2105 |
language | English |
last_indexed | 2024-04-12T18:38:28Z |
publishDate | 2005-06-01 |
publisher | BMC |
record_format | Article |
series | BMC Bioinformatics |
spelling | doaj.art-1950ff10c6064d5eb83480c3d51c066c; 2022-12-22T03:20:52Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2005-06-01; 6; 1; 149; 10.1186/1471-2105-6-149; Thesaurus-based disambiguation of gene symbols; Wain Hester M; van Mulligen Erik M; Schuemie Martijn J; Weeber Marc; Mons Barend; Schijvenaars Bob JA; Kors Jan A; http://www.biomedcentral.com/1471-2105/6/149 |
title | Thesaurus-based disambiguation of gene symbols |
url | http://www.biomedcentral.com/1471-2105/6/149 |