arTenTen: Arabic Corpus and Word Sketches

We present arTenTen, a web-crawled corpus of Arabic, gathered in 2012. arTenTen consists of 5.8-billion words. A chunk of it has been lemmatized and part-of-speech (POS) tagged with the MADA tool and subsequently loaded into Sketch Engine, a leading corpus query tool, where it is open for all to use...

Full description

Bibliographic Details
Main Authors: Tressy Arts, Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Vit Suchomel
Format: Article
Language:English
Published: Elsevier 2014-12-01
Series:Journal of King Saud University: Computer and Information Sciences
Subjects:
Online Access:http://www.sciencedirect.com/science/article/pii/S1319157814000330