Similarity corpus on microbial transcriptional regulation


Bibliographic Details
Main Authors: Oscar Lithgow-Serrano, Socorro Gama-Castro, Cecilia Ishida-Gutiérrez, Citlalli Mejía-Almonte, Víctor H. Tierrafría, Sara Martínez-Luna, Alberto Santos-Zavaleta, David Velázquez-Ramírez, Julio Collado-Vides
Format: Article
Language: English
Published: BMC 2019-05-01
Series: Journal of Biomedical Semantics
Subjects: Corpus, Similarity, Transcriptional-regulation, Genomics
Online Access: http://link.springer.com/article/10.1186/s13326-019-0200-x
author Oscar Lithgow-Serrano
Socorro Gama-Castro
Cecilia Ishida-Gutiérrez
Citlalli Mejía-Almonte
Víctor H. Tierrafría
Sara Martínez-Luna
Alberto Santos-Zavaleta
David Velázquez-Ramírez
Julio Collado-Vides
collection DOAJ
description Abstract. Background: The ability to express the same meaning in different ways is a well-known property of natural language. This amazing property is the source of major difficulties in natural language processing. Given the constant increase in published literature, its curation and information extraction would strongly benefit from efficient automatic processes, for which corpora of sentences evaluated by experts are a valuable resource. Results: Given our interest in applying such approaches to the curation of the biomedical literature, specifically the literature on gene regulation in microbial organisms, we built a corpus with graded textual similarity, evaluated by curators and designed specifically for our purposes. Based on the predefined statistical power of future analyses, we defined features of the design, including sampling, selection criteria, balance, and size, among others. A non-fully crossed study design was applied. Each pair of sentences was evaluated by 3 annotators from a total of 7; the scale used in the semantic similarity assessment task within the Semantic Evaluation workshop (SEMEVAL) was adapted to our goals in four successive iterative sessions, with clear improvements in the agreed guidelines and in interrater reliability results. Alternatives for such a corpus evaluation have been widely discussed. Conclusions: To the best of our knowledge, this is the first similarity corpus (a dataset of pairs of sentences for which human experts rate the semantic similarity of each pair) in this domain of knowledge. We have initiated its incorporation in our research towards high-throughput curation strategies based on natural language processing.
first_indexed 2024-12-11T01:55:12Z
format Article
id doaj.art-f133b23bf096497ca7efcbaa05005aec
institution Directory Open Access Journal
issn 2041-1480
language English
last_indexed 2024-12-11T01:55:12Z
publishDate 2019-05-01
publisher BMC
record_format Article
series Journal of Biomedical Semantics
spelling doaj.art-f133b23bf096497ca7efcbaa05005aec; 2022-12-22T01:24:38Z; eng; BMC; Journal of Biomedical Semantics; ISSN 2041-1480; 2019-05-01; vol. 10, no. 1, pp. 1-14; doi:10.1186/s13326-019-0200-x
affiliation (all nine authors) Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P.
title Similarity corpus on microbial transcriptional regulation
topic Corpus
Similarity
Transcriptional-regulation
Genomics
url http://link.springer.com/article/10.1186/s13326-019-0200-x