Improving named entity recognition accuracy of gene and protein in biomedical text

The plethora of biomedical material on the WWW is one of the factors that have sustained interest in automatic methods for extracting information from biomedical documents, which can help biologists in their research. To extract useful knowledge from the biomedical literature, we must be able to rec...


Bibliographic Details
Main Author:	Tohidi, Hossein
Format:	Thesis
Language:	English
Published:	2011
Subjects:	Biomedical materials - Data processing; Text processing (Computer science); Data mining
Online Access:	http://psasir.upm.edu.my/id/eprint/27703/1/FSKTM%202011%2026R.pdf

_version_	1796971184809574400
author	Tohidi, Hossein
author_facet	Tohidi, Hossein
author_sort	Tohidi, Hossein
collection	UPM
description	The plethora of biomedical material on the WWW is one of the factors that have sustained interest in automatic methods for extracting information from biomedical documents, which can help biologists in their research. To extract useful knowledge from the biomedical literature, we must be able to recognize the names of biomedical entities such as genes, proteins, cells, and diseases, which are called named entities (NEs). The task of recognizing entity-denoting expressions in natural language documents is called Named Entity Recognition (NER). Among biomedical entity types such as gene, protein, virus, and cell, the most important for recognition are gene and protein, which are the scope of this research. Most researchers focus on gene and protein named entities because of the complex nature of these types: their names show character-level variation, word-level variation, and word-order variation in the biomedical literature. There are typically four approaches to Named Entity Recognition: dictionary-based, rule-based, statistical and machine learning, and hybrid. In this study, to handle these issues in recognizing gene and protein names, a statistical similarity measure is proposed as a pattern-matching function. Our approach assumes that a named entity occurs within a noun group, which is extracted using the Brill part-of-speech tagger. The strength of the proposed approach lies in a Statistical Character-Based Syntax Similarity (SCSS) algorithm, which measures the similarity between each extracted candidate and the well-known biomedical named entities from a corpus. For this study, we used the GENIA V3.0 corpus, which is the largest annotated corpus in the molecular biology domain.
The proposed approach is evaluated using two measures, recall and precision, which are used to calculate a balanced F-measure. We compared our pattern-matching function with other methods, and the results are satisfactory: for recognizing both gene and protein names, precision is 98.5%, recall is 96.4%, and the F-measure is 97.5; for recognizing protein names, precision is 99.3%, recall is 99.1%, and the F-measure is 99.1.
first_indexed	2024-03-06T08:09:12Z
format	Thesis
id	upm.eprints-27703
institution	Universiti Putra Malaysia
language	English
last_indexed	2024-03-06T08:09:12Z
publishDate	2011
record_format	dspace
spelling	upm.eprints-27703 2014-04-10T05:19:35Z http://psasir.upm.edu.my/id/eprint/27703/ Improving named entity recognition accuracy of gene and protein in biomedical text Tohidi, Hossein 2011-08 Thesis NonPeerReviewed application/pdf en http://psasir.upm.edu.my/id/eprint/27703/1/FSKTM%202011%2026R.pdf Tohidi, Hossein (2011) Improving named entity recognition accuracy of gene and protein in biomedical text. Masters thesis, Universiti Putra Malaysia. Biomedical materials - Data processing Text processing (Computer science) Data mining English
spellingShingle	Biomedical materials - Data processing Text processing (Computer science) Data mining Tohidi, Hossein Improving named entity recognition accuracy of gene and protein in biomedical text
title	Improving named entity recognition accuracy of gene and protein in biomedical text
title_full	Improving named entity recognition accuracy of gene and protein in biomedical text
title_fullStr	Improving named entity recognition accuracy of gene and protein in biomedical text
title_full_unstemmed	Improving named entity recognition accuracy of gene and protein in biomedical text
title_short	Improving named entity recognition accuracy of gene and protein in biomedical text
title_sort	improving named entity recognition accuracy of gene and protein in biomedical text
topic	Biomedical materials - Data processing; Text processing (Computer science); Data mining
url	http://psasir.upm.edu.my/id/eprint/27703/1/FSKTM%202011%2026R.pdf
work_keys_str_mv	AT tohidihossein improvingnamedentityrecognitionaccuracyofgeneandproteininbiomedicaltext
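
The description in this record combines precision and recall into a single balanced F value. As a minimal sketch, assuming this is the standard harmonic mean of precision and recall, the calculation could look like the following. The function name `f_measure` and the `beta` parameter are illustrative and are not taken from the thesis. Recomputed from the rounded percentages quoted in the description, the values come out within about 0.1 of the reported figures.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    With beta == 1 this is the balanced F value (often written F1).
    Inputs may be fractions or percentages; the result uses the same scale.
    """
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was recognized
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall quoted in the record's description, as percentages.
reported = {
    "gene and protein names": (98.5, 96.4),
    "protein names": (99.3, 99.1),
}

for label, (p, r) in reported.items():
    print(f"{label}: balanced F = {f_measure(p, r):.2f}")
```

Small differences from the published F values are expected, since the precision and recall in the description are themselves rounded.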


Similar Items