pGenN, a gene normalization tool for plant genes and proteins in scientific literature.

Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databa...

Bibliographic Details
Main Authors: Ruoyao Ding, Cecilia N Arighi, Jung-Youn Lee, Cathy H Wu, K Vijay-Shanker
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2015-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC4530884?pdf=render
author Ruoyao Ding
Cecilia N Arighi
Jung-Youn Lee
Cathy H Wu
K Vijay-Shanker
collection DOAJ
description Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra-species normalization. We have developed new heuristics to improve each of these steps. We evaluated the performance of pGenN on an in-house, expertly annotated corpus consisting of 104 plant-relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-the-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/).
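The three steps named in the description (dictionary-based mention detection, species assignment, intra-species normalization) can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the toy dictionary, the species cue words, the placeholder identifiers, and the function names are hypothetical, and the pivot-based heuristics of pGenN itself are not reproduced here.

```python
import re

# Hypothetical toy lexicon: gene name -> (species, placeholder identifier) pairs.
# These entries are illustrative only and are not taken from pGenN.
GENE_DICTIONARY = {
    "fls2": [("Arabidopsis thaliana", "ID_AT_FLS2"), ("Oryza sativa", "ID_OS_FLS2")],
    "phyb": [("Arabidopsis thaliana", "ID_AT_PHYB")],
}

# Hypothetical cue words that signal a species in the text.
SPECIES_CUES = {
    "arabidopsis": "Arabidopsis thaliana",
    "rice": "Oryza sativa",
}


def detect_mentions(text):
    """Step 1: dictionary-based gene mention detection (exact token lookup)."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t for t in tokens if t.lower() in GENE_DICTIONARY]


def assign_species(text):
    """Step 2: species assignment; here simply the earliest species cue in the text."""
    lowered = text.lower()
    hits = [(lowered.find(cue), name) for cue, name in SPECIES_CUES.items() if cue in lowered]
    return min(hits)[1] if hits else None


def normalize(mention, species):
    """Step 3: intra-species normalization; pick the identifier listed for that species."""
    for candidate_species, identifier in GENE_DICTIONARY[mention.lower()]:
        if candidate_species == species:
            return identifier
    return None


def run_pipeline(text):
    """Chain the three steps and map each detected mention to an identifier (or None)."""
    species = assign_species(text)
    return {mention: normalize(mention, species) for mention in detect_mentions(text)}
```

In this sketch an ambiguous name such as FLS2 resolves to a single identifier only once a species has been assigned, which is why species assignment sits between detection and normalization.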
first_indexed 2024-04-12T08:05:55Z
format Article
id doaj.art-a1581755afec4497b96b9f7c059f81a3
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-04-12T08:05:55Z
publishDate 2015-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-a1581755afec4497b96b9f7c059f81a3 2022-12-22T03:41:10Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2015-01-01 10 8 e0135305 10.1371/journal.pone.0135305 pGenN, a gene normalization tool for plant genes and proteins in scientific literature. Ruoyao Ding; Cecilia N Arighi; Jung-Youn Lee; Cathy H Wu; K Vijay-Shanker http://europepmc.org/articles/PMC4530884?pdf=render
title pGenN, a gene normalization tool for plant genes and proteins in scientific literature.
url http://europepmc.org/articles/PMC4530884?pdf=render