Assessment of NER solutions against the first and second CALBC Silver Standard Corpus

Abstract Background Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of the GSC is time-consuming and costly and the fin...

Bibliographic Details
Main Authors:	Rebholz-Schuhmann Dietrich, Yepes Antonio, Li Chen, Kafkas Senay, Lewin Ian, Kang Ning, Corbett Peter, Milward David, Buyko Ekaterina, Beisswanger Elena, Hornbostel Kerstin, Kouznetsov Alexandre, Witte René, Laurila Jonas B, Baker Christopher JO, Kuo Cheng-Ju, Clematide Simone, Rinaldi Fabio, Farkas Richárd, Móra György, Hara Kazuo, Furlong Laura I, Rautschka Michael, Neves Mariana, Pascual-Montano Alberto, Wei Qi, Collier Nigel, Chowdhury Md, Lavelli Alberto, Berlanga Rafael, Morante Roser, Van Asch Vincent, Daelemans Walter, Marina José, van Mulligen Erik, Kors Jan, Hahn Udo
Format:	Article
Language:	English
Published:	BMC 2011-10-01
Series:	Journal of Biomedical Semantics
Online Access:	http://www.jbiomedsem.com/content/2/S5/S11

_version_	1828421409712373760
author	Rebholz-Schuhmann Dietrich Yepes Antonio Li Chen Kafkas Senay Lewin Ian Kang Ning Corbett Peter Milward David Buyko Ekaterina Beisswanger Elena Hornbostel Kerstin Kouznetsov Alexandre Witte René Laurila Jonas B Baker Christopher JO Kuo Cheng-Ju Clematide Simone Rinaldi Fabio Farkas Richárd Móra György Hara Kazuo Furlong Laura I Rautschka Michael Neves Mariana Pascual-Montano Alberto Wei Qi Collier Nigel Chowdhury Md Lavelli Alberto Berlanga Rafael Morante Roser Van Asch Vincent Daelemans Walter Marina José van Mulligen Erik Kors Jan Hahn Udo
author_sort	Rebholz-Schuhmann Dietrich
collection	DOAJ
description	Abstract. Background: Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). Preparing a GSC is time-consuming and costly, and the final corpus consists of at most a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups by harmonising annotations from automatic text mining solutions: the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus was used for the First CALBC Challenge, which asked participants to annotate the corpus with their text processing solutions. Results: All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their genuine annotation system, or could train a machine-learning approach on the provided pre-annotated data. In general, the annotation solutions performed worse on entities from the CHED and PRGE categories than on entities categorised as DISO and SPE. The best performance over all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I. The data sets from participants who did not use the annotated SSC-I data for training were used to generate the harmonised Silver Standard Corpus II (SSC-II). The performance of the participants’ solutions was then measured against the SSC-II and again showed better results for DISO and SPE than for CHED and PRGE. Conclusions: The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier, leading to an average F-measure of 85%. Benchmarking against the SSC-II leads to better performance for the CPs’ annotation solutions than benchmarking against the SSC-I.
first_indexed	2024-12-10T15:30:05Z
format	Article
id	doaj.art-99b973fcd07c4f17beca293d4bfb3610
institution	Directory Open Access Journal
issn	2041-1480
language	English
last_indexed	2024-12-10T15:30:05Z
publishDate	2011-10-01
publisher	BMC
record_format	Article
series	Journal of Biomedical Semantics
spelling	doaj.art-99b973fcd07c4f17beca293d4bfb3610 2022-12-22T01:43:25Z eng BMC Journal of Biomedical Semantics 2041-1480 2011-10-01 2 Suppl 5 S11 10.1186/2041-1480-2-S5-S11 Assessment of NER solutions against the first and second CALBC Silver Standard Corpus http://www.jbiomedsem.com/content/2/S5/S11
title	Assessment of NER solutions against the first and second CALBC Silver Standard Corpus
url	http://www.jbiomedsem.com/content/2/S5/S11

Similar Items