Seimo posėdžių stenogramų tekstynas autorystės nustatymo bei autoriaus profilio sudarymo tyrimams | Corpus of transcribed parliamentary speeches for authorship attribution and author profiling tasks

In our paper we present a corpus of transcribed Lithuanian parliamentary speeches. The corpus is prepared in a specific format, appropriate for different authorship identification tasks. The corpus consists of approximately 111 thousand texts (24 million words). Each text matches one parliamentary s...


Bibliographic Details
Main Authors:	Jurgita Kapočiūtė-Dzikienė, Andrius Utka, Ligita Šarkutė
Format:	Article
Language:	deu
Published:	Vilnius University 2014-12-01
Series:	Kalbotyra
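Subjects:	Seimo posėdžių stenogramos (transcripts of Seimas sittings); autorystės nustatymo tekstynas (authorship attribution corpus); stilometrija (stylometry); individualių autorių autorystės nustatymas (authorship attribution for individual authors); autorių profilio nustatymas (author profiling)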
Online Access:	http://www.kalbotyra.flf.vu.lt/wp-content/uploads/2015/01/Kalbotyra_66_27_45.pdf

collection	DOAJ
description	In this paper we present a corpus of transcribed Lithuanian parliamentary speeches, prepared in a format suited to a range of authorship identification tasks. The corpus consists of approximately 111 thousand texts (24 million words). Each text corresponds to one parliamentary speech delivered during an ordinary session within the 7 parliamentary terms between March 10, 1990 and December 23, 2013. The texts are grouped into 147 categories corresponding to individual authors, so they can be used for authorship attribution tasks; they are also grouped by age, gender and political views, which makes them suitable for author profiling tasks. Because short texts make an author's speaking style harder to recognize and are easily confused with the style of other authors, we included only texts of at least 100 words. To make each category as comprehensive and representative as possible, we included only authors who delivered at least 200 speeches. All texts are lemmatized, morphologically and syntactically annotated, and tokenized into character n-grams. Statistical information about the corpus is also available. We also demonstrate that the corpus can be used effectively in authorship attribution and author profiling tasks with supervised machine learning methods. The corpus structure also supports unsupervised machine learning methods, the creation of rule-based methods, and various linguistic analyses.
id	doaj.art-d4f79e01ead64882b1c90f792dc26b15
institution	Directory Open Access Journal
issn	1392-1517 2029-8315
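The inclusion criteria stated in the description (texts of at least 100 words, authors with at least 200 speeches) and the character n-gram tokenization can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `(author, text)` input shape, whitespace-based word counting, the n-gram size of 3, and applying the speech threshold after the length filter are all assumptions made for illustration.

```python
from collections import defaultdict

MIN_WORDS = 100      # from the description: texts of at least 100 words
MIN_SPEECHES = 200   # from the description: authors with at least 200 speeches


def filter_corpus(speeches, min_words=MIN_WORDS, min_speeches=MIN_SPEECHES):
    """Group (author, text) pairs by author, keeping only qualifying texts and authors.

    Hypothetical helper: the record does not describe the actual implementation.
    """
    by_author = defaultdict(list)
    for author, text in speeches:
        # Assumption: a "word" is a whitespace-separated token.
        if len(text.split()) >= min_words:
            by_author[author].append(text)
    # Assumption: the speech-count threshold is applied after the length filter.
    return {a: ts for a, ts in by_author.items() if len(ts) >= min_speeches}


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a text (n=3 is an assumed default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For example, `filter_corpus(pairs)` returns a dictionary mapping each retained author to that author's retained texts, and `char_ngrams(text, 3)` can then be applied to each text to obtain character trigram features.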


Similar Items