Eesti keele ühendkorpuste sari 2013–2021: mahukaim eestikeelsete digitekstide kogu

The Institute of the Estonian Language and the software company Lexical Computing Ltd. have jointly compiled a series of national corpora that now comprises four versions: the Estonian National Corpus 2013, 2017, 2019 and 2021. These are the largest corpora of Estonian by size, and their applications are wide-ranging, from lexi...


Bibliographic Details
Main Authors:	Kristina Koppel, Jelena Kallas
Format:	Article
Language:	English
Published:	Eesti Rakenduslingvistika Ühing (Estonian Association for Applied Linguistics) 2022-04-01
Series:	Eesti Rakenduslingvistika Ühingu Aastaraamat
Subjects:	eesti keele ühendkorpus; tekstikorpused; korpusleksikograafia; korpuspäringusüsteem; eesti keel; estonian national corpus; corpora; corpus lexicography; corpus query system; estonian
Online Access:	http://arhiiv.rakenduslingvistika.ee/ajakirjad/index.php/aastaraamat/article/view/ERYa18.12

_version_	1797814239624167424
author	Kristina Koppel Jelena Kallas
author_facet	Kristina Koppel Jelena Kallas
author_sort	Kristina Koppel
collection	DOAJ
description	The Institute of the Estonian Language and the software company Lexical Computing Ltd. have jointly compiled a series of national corpora that now comprises four versions: the Estonian National Corpus 2013, 2017, 2019 and 2021. These are the largest corpora of Estonian by size, and their applications are wide-ranging, from lexicographic research to building language models for machine learning. The article focuses on the most recent one, the Estonian National Corpus 2021, which consists largely of texts collected from the web. We describe the principles of collecting, post-processing and cleaning web texts and the subcorpora of the national corpus, and give an overview of how the source texts are classified. In addition, using the corpus query system Sketch Engine as an example, we introduce new possibilities for analysing corpus data and outline future perspectives and needs for corpus development. *** Estonian National Corpus 2013–2021: The largest collection of Estonian language data. The paper describes the Estonian National Corpus 2021 (Estonian NC 2021), the latest and largest edition in the Estonian National Corpora series. The entire series consists of four corpora: Estonian NC 2013, 2017, 2019 and 2021. The series was compiled by the Institute of the Estonian Language in cooperation with the software company Lexical Computing Ltd. All corpora are accessible through the Sketch Engine interface, a corpus query system developed and maintained by Lexical Computing Ltd. The data are also stored in the repository Entu at the Center of Estonian Language Resources. The Estonian National Corpus 2021 contains eleven sub-corpora (Web 2013, Web 2017, Web 2019, Web 2021, Feeds 2014–2021, Wikipedia 2021, Wikipedia Talk 2017, the Open Access Journals (DOAJ), Literature, the Balanced Corpus, and the Reference Corpus) totalling 2.4 billion words. In addition, the corpus is divided into genres and topics. The most extensive part of the Estonian NC 2021 is the Estonian Web Corpora, i.e. texts crawled from the web. In the paper, we outline the process of crawling the web, the process of cleaning and post-processing the crawled data, and the methodology for classifying web texts into genres and topics. We also introduce new tools for the analysis of corpus data in Sketch Engine, and suggest further perspectives and needs for corpus development.
first_indexed	2024-03-13T08:04:40Z
format	Article
id	doaj.art-df724b1fc0c94beaadbdf59b1e310a33
institution	Directory Open Access Journal
issn	1736-2563 2228-0677
language	English
last_indexed	2024-03-13T08:04:40Z
publishDate	2022-04-01
publisher	Eesti Rakenduslingvistika Ühing (Estonian Association for Applied Linguistics)
record_format	Article
series	Eesti Rakenduslingvistika Ühingu Aastaraamat
spelling	doaj.art-df724b1fc0c94beaadbdf59b1e310a33; 2023-06-01T09:33:08Z; eng; Eesti Rakenduslingvistika Ühing (Estonian Association for Applied Linguistics); Eesti Rakenduslingvistika Ühingu Aastaraamat; 1736-2563; 2228-0677; 2022-04-01; 18; 207; 228; 10.5128/ERYa18.12; Eesti keele ühendkorpuste sari 2013–2021: mahukaim eestikeelsete digitekstide kogu; Kristina Koppel; Jelena Kallas; http://arhiiv.rakenduslingvistika.ee/ajakirjad/index.php/aastaraamat/article/view/ERYa18.12
title	Eesti keele ühendkorpuste sari 2013–2021: mahukaim eestikeelsete digitekstide kogu
topic	eesti keele ühendkorpus; tekstikorpused; korpusleksikograafia; korpuspäringusüsteem; eesti keel; estonian national corpus; corpora; corpus lexicography; corpus query system; estonian
url	http://arhiiv.rakenduslingvistika.ee/ajakirjad/index.php/aastaraamat/article/view/ERYa18.12


Similar Items