«La Repubblica» Corpus

This paper reviews a large resource of contemporary Italian newspaper language, the «La Repubblica» corpus. The corpus contains articles that appeared in the Italian daily newspaper La Repubblica from 1985 to 2000 and comprises more than 380 million tokens. Apart from being tokenized, it i...

Bibliographic Details
Main Author: Rebecca Sierig
Format: Article
Language: deu
Published: Institut für Dokumentologie und Editorik e. V. 2017-09-01
Series: RIDE
Subjects: 20th century, interface, italian, linguistic search, newspaper, pos-tagging, tei, text collection
Online Access: https://ride.i-d-e.de/issues/issue-6/la-repubblica-corpus/
_version_ 1797365965374095360
author Rebecca Sierig
collection DOAJ
description This paper reviews a large resource of contemporary Italian newspaper language, the «La Repubblica» corpus. The corpus contains articles that appeared in the Italian daily newspaper La Repubblica from 1985 to 2000 and comprises more than 380 million tokens. Besides being tokenized, it is PoS-tagged, enriched with TEI-conformant structural markup, and categorized by topic and genre. The first part of the paper addresses the data and their preparation; the second part deals with access to the corpus. When the review was written, the corpus could be accessed in two ways: through the ‘old’ interface hosted directly by the Institute of Translational Studies at the University of Bologna (SSLMIT), or through the ‘new’ one running on NoSketch Engine. The two are compared to point out the changes.
first_indexed 2024-03-08T16:57:29Z
format Article
id doaj.art-0aea1776475e4832b11a6592ff546c87
institution Directory Open Access Journal
issn 2363-4952
language deu
last_indexed 2024-03-08T16:57:29Z
publishDate 2017-09-01
publisher Institut für Dokumentologie und Editorik e. V.
record_format Article
series RIDE
spelling doaj.art-0aea1776475e4832b11a6592ff546c87 | 2024-01-04T18:19:37Z | deu | Institut für Dokumentologie und Editorik e. V. | RIDE | 2363-4952 | 2017-09-01 | 6 | 10.18716/ride.a.6.9 | «La Repubblica» Corpus | Rebecca Sierig | 0 | https://orcid.org/0000-0002-5323-4543 | University of Leipzig | https://ride.i-d-e.de/issues/issue-6/la-repubblica-corpus/ | 20th century, interface, italian, linguistic search, newspaper, pos-tagging, tei, text collection
title «La Repubblica» Corpus
topic 20th century
interface
italian
linguistic search
newspaper
pos-tagging
tei
text collection
url https://ride.i-d-e.de/issues/issue-6/la-repubblica-corpus/