arTenTen: Arabic Corpus and Word Sketches

We present arTenTen, a web-crawled corpus of Arabic, gathered in 2012. arTenTen consists of 5.8 billion words. A chunk of it has been lemmatized and part-of-speech (POS) tagged with the MADA tool and subsequently loaded into Sketch Engine, a leading corpus query tool, where it is open for all to use...

Bibliographic Details
Main Authors: Tressy Arts, Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Vit Suchomel
Format: Article
Language: English
Published: Elsevier 2014-12-01
Series: Journal of King Saud University: Computer and Information Sciences
Subjects: Corpora, Lexicography, Morphology, Concordance, Arabic
Online Access: http://www.sciencedirect.com/science/article/pii/S1319157814000330
author Tressy Arts
Yonatan Belinkov
Nizar Habash
Adam Kilgarriff
Vit Suchomel
collection DOAJ
description We present arTenTen, a web-crawled corpus of Arabic, gathered in 2012. arTenTen consists of 5.8 billion words. A chunk of it has been lemmatized and part-of-speech (POS) tagged with the MADA tool and subsequently loaded into Sketch Engine, a leading corpus query tool, where it is open for all to use. We have also created ‘word sketches’: one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behavior. We use examples to demonstrate what the corpus can show us regarding Arabic words and phrases and how this can support lexicography and inform linguistic research. The article also presents the ‘sketch grammar’ (the basis for the word sketches) in detail, describes the process of building and processing the corpus, and considers the role of the corpus in additional research on Arabic.
first_indexed 2024-12-12T08:37:48Z
format Article
id doaj.art-cb1681c59bb14f3295ceed6c5997ee09
institution Directory of Open Access Journals
issn 1319-1578
language English
last_indexed 2024-12-12T08:37:48Z
publishDate 2014-12-01
publisher Elsevier
record_format Article
series Journal of King Saud University: Computer and Information Sciences
spelling doaj.art-cb1681c59bb14f3295ceed6c5997ee09
2022-12-22T00:30:52Z
eng
Elsevier
Journal of King Saud University: Computer and Information Sciences
1319-1578
2014-12-01
Volume 26, Issue 4, Pages 357-371
10.1016/j.jksuci.2014.06.009
arTenTen: Arabic Corpus and Word Sketches
Tressy Arts (Chief Editor Oxford Arabic Dictionary, UK)
Yonatan Belinkov (MIT, USA)
Nizar Habash (New York University Abu Dhabi, United Arab Emirates)
Adam Kilgarriff (Lexical Computing Ltd, UK)
Vit Suchomel (Masaryk Univ., Czech Republic)
http://www.sciencedirect.com/science/article/pii/S1319157814000330
Corpora
Lexicography
Morphology
Concordance
Arabic
title arTenTen: Arabic Corpus and Word Sketches
topic Corpora
Lexicography
Morphology
Concordance
Arabic
url http://www.sciencedirect.com/science/article/pii/S1319157814000330