PADI-web corpus: Labeled textual data in animal health domain

Monitoring animal health worldwide, especially the early detection of outbreaks of emerging pathogens, is one of the means of preventing the introduction of infectious diseases in countries (Collier et al., 2008) [3]. In this context, we developed PADI-web, a Platform for Automated extraction of ani...

Bibliographic Details
Main Authors: Julien Rabatel, Elena Arsevska, Mathieu Roche
Format: Article
Language: English
Published: Elsevier 2019-02-01
Series: Data in Brief
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340918316032
author Julien Rabatel
Elena Arsevska
Mathieu Roche
collection DOAJ
description Monitoring animal health worldwide, especially the early detection of outbreaks of emerging pathogens, is one of the means of preventing the introduction of infectious diseases in countries (Collier et al., 2008) [3]. In this context, we developed PADI-web, a Platform for Automated extraction of animal Disease Information from the Web (Arsevska et al., 2016, 2018). PADI-web is a text-mining tool that automatically detects, categorizes and extracts disease outbreak information from Web news articles. PADI-web currently monitors the Web for five emerging animal infectious diseases, i.e., African swine fever, avian influenza including highly pathogenic and low pathogenic avian influenza, foot-and-mouth disease, bluetongue, and Schmallenberg virus infection. PADI-web collects Web news articles in near-real time through RSS feeds. Currently, PADI-web collects disease information from Google News because of its international and multiple language coverage. We implemented machine learning techniques to identify the relevant disease information in texts (i.e., location and date of an outbreak, affected hosts, their numbers and clinical signs). In order to train the model for Information Extraction (IE) from news articles, a corpus in English has been manually labeled by domain experts. This labeled corpus (Rabatel et al., 2017) is presented in this data paper.
first_indexed 2024-04-13T09:53:46Z
format Article
id doaj.art-60c3e36215074d44a9cd4ac870cd56a6
institution Directory Open Access Journal
issn 2352-3409
language English
last_indexed 2024-04-13T09:53:46Z
publishDate 2019-02-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj.art-60c3e36215074d44a9cd4ac870cd56a6 | 2022-12-22T02:51:31Z | eng | Elsevier | Data in Brief | 2352-3409 | 2019-02-01 | 22 | 643 | 646 | PADI-web corpus: Labeled textual data in animal health domain | Julien Rabatel [0]; Elena Arsevska [1]; Mathieu Roche [2] | [0] Cirad, Montpellier, France | [1] ASTRE, Cirad, INRA, Montpellier, France; Cirad, Montpellier, France | [2] TETIS, Univ. of Montpellier, AgroParisTech, Cirad, CNRS, Irstea, Montpellier, France; Cirad, Montpellier, France; Corresponding author. | Monitoring animal health worldwide, especially the early detection of outbreaks of emerging pathogens, is one of the means of preventing the introduction of infectious diseases in countries (Collier et al., 2008) [3]. In this context, we developed PADI-web, a Platform for Automated extraction of animal Disease Information from the Web (Arsevska et al., 2016, 2018). PADI-web is a text-mining tool that automatically detects, categorizes and extracts disease outbreak information from Web news articles. PADI-web currently monitors the Web for five emerging animal infectious diseases, i.e., African swine fever, avian influenza including highly pathogenic and low pathogenic avian influenza, foot-and-mouth disease, bluetongue, and Schmallenberg virus infection. PADI-web collects Web news articles in near-real time through RSS feeds. Currently, PADI-web collects disease information from Google News because of its international and multiple language coverage. We implemented machine learning techniques to identify the relevant disease information in texts (i.e., location and date of an outbreak, affected hosts, their numbers and clinical signs). In order to train the model for Information Extraction (IE) from news articles, a corpus in English has been manually labeled by domain experts. This labeled corpus (Rabatel et al., 2017) is presented in this data paper. | http://www.sciencedirect.com/science/article/pii/S2352340918316032
title PADI-web corpus: Labeled textual data in animal health domain
url http://www.sciencedirect.com/science/article/pii/S2352340918316032