Summary: | Abstract
A linguistic corpus is a collection of linguistic data derived from language texts that shows researchers the real patterns of language use. The advantage of a corpus over other linguistic resources stems from the amount of data it represents and the possibility of using computers in linguistic studies. The present study introduces an annotated monolingual corpus of Light Verb Constructions (LVCs) of the Persian language (LCP) developed by the authors. The corpus contains more than 6,000 LVCs, which are used in more than 2,000,000 linguistic contexts. A comparison of the number of LVCs with the number of simple verbs in Persian is enough to indicate the importance of this type of language resource. The annotated corpus presents LVCs formed by 21 Persian Light Verbs (LVs) as they are used in real contexts. This unprecedented resource can readily provide researchers with a large body of machine-readable data for assessing existing hypotheses and putting forward new ones.
Keywords: Persian Language, Language Resources, Linguistic Corpus, Light Verb Constructions, Natural Language Processing
Introduction
Light verbs are a group of verbs that have lost part of their semantic content in the course of language evolution. In Persian, these light verbs combine with a preverbal element such as a noun, adjective, or prepositional phrase to form Light Verb Constructions (LVCs). The study of LVCs is important not only theoretically but also practically, and all the more so in Persian, whose verbal system consists largely of LVCs. Nevertheless, many studies have pointed out the challenges that Persian LVCs pose for computational systems. They have emphasized the lack of appropriate computational resources and the need for studies that provide researchers with the standard patterns of these constructions in Persian (Maerefat, 2004; Hasas Sediqi, 2010; Taslimipoor, 2012; Askariyan, 2012; and Barfi, 2016, among others). Although valuable Persian corpora have already been developed by specialists such as Bijan Khan (2004, 2018), Asi (2005), and Al-e-Ahmad et al. (2010), no corpus comprehensively represents the LVCs of all productive Persian Light Verbs (LVs). The only available corpus dealing with Persian LVCs is PersPred (Samvellian & Faqiri, 2013), which covers the LVCs of only one of the twenty-one productive Persian LVs (zadan). To address this need, we developed the first corpus of Persian LVCs to cover all of these verbs.[1] This annotated corpus presents the LVCs formed by 21 Persian LVs as they are used in real contexts. This unprecedented resource can readily provide researchers with a large body of machine-readable data.
Materials and Methods
The corpus was developed in the following steps: designing the structure of the corpus, selecting a base corpus, normalizing the texts, defining the search nodes, writing macros in Visual Basic for Applications (VBA) to build the search software, extracting all the sentences containing the verbs under investigation (whether used as light or lexical verbs), extracting the sentences with LVCs, and finally selecting an annotation model and applying it to the results. The corpus was designed to be a synchronic monolingual corpus of Persian LVCs. We chose the corpus developed by Bijan Khan (2018) as the basis. It was developed at the Research Institute of Information and Communication Technology and contains 950,000 text files. First, we normalized the texts and then used VBA macros to extract the LVCs formed by 21 Persian LVs (da:shtan: have; kardan: do; shodan: become; gashtan: turn; goza:shtan: put; keshidan: pull; didan: see; da:dan: give; bakhshidan: give, grant; gereftan: get; yaftan: obtain; ?a:madan: come; ?a:vardan: bring; residan: arrive; raftan: go; ?ofta:dan: fall; ?anda:khtan: throw; bordan: take; khordan: collide; zadan: hit; and bastan: tie). Then, constituency tests (topicalization, coordination, deletion, and substitution) were applied to distinguish LVCs from lexical verbs. The LVCs were annotated at the word level within a Distributed Morphology framework (Halle & Marantz, 1993; Marantz, 2013). Preverbal elements and LVs were treated as categoryless elements (annotated as Pre-Verbs (PVs)) and categorizers (annotated as LVs), respectively. In addition, the present and past lemmas of each LVC were given, and its separability or inseparability was annotated as SEP or INSEP.
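The first extraction step described above, collecting every sentence that contains one of the verbs whether light or lexical, can be sketched as follows. This is a minimal illustration in Python rather than the authors' VBA macros, which are not reproduced here. The stem table, the prefix list, and the function names are assumptions made for the example, and it operates on transliterated, whitespace-tokenized text.

```python
# Hypothetical search nodes: past and present stems of two of the 21 LVs,
# written in the transliteration used in this paper. The actual node list
# used to build the corpus is not given in the text.
SEARCH_NODES = {
    "zadan": ("zad", "zan"),
    "kardan": ("kard", "kon"),
}

# Verbal prefixes stripped before matching (a simplifying assumption).
PREFIXES = ("ne", "na", "mi", "be")


def strip_prefixes(token):
    """Remove leading verbal prefixes such as ne-, na-, mi-, be-."""
    changed = True
    while changed:
        changed = False
        for prefix in PREFIXES:
            if token.startswith(prefix) and len(token) > len(prefix):
                token = token[len(prefix):]
                changed = True
    return token


def matching_verbs(sentence):
    """Return the infinitives whose stems begin some token of the sentence."""
    hits = set()
    for token in sentence.lower().split():
        core = strip_prefixes(token)
        for infinitive, stems in SEARCH_NODES.items():
            if any(core.startswith(stem) for stem in stems):
                hits.add(infinitive)
    return hits


def extract_candidates(sentences):
    """Keep the sentences that contain at least one search node."""
    candidates = []
    for sentence in sentences:
        verbs = sorted(matching_verbs(sentence))
        if verbs:
            candidates.append((sentence, verbs))
    return candidates


sample = ["ali harf mizanad", "man keta:b kha:ndam", "u ka:r nemikonad"]
for sentence, verbs in extract_candidates(sample):
    print(sentence, verbs)
```

Stem matching of this kind deliberately over-generates (for example, it cannot tell a verb form from a noun that happens to begin with the same letters), which is why a later filtering step, here the constituency tests, is still needed to separate true LVCs from other hits.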
It should be noted that, in line with Karimi-Doostan (2011), cases in which the preverbal element and the LV were separated by a negative particle (neg), the imperfective morpheme (mi), modals and auxiliaries such as ba:yad (should, must), kha:stan (will) as a future auxiliary verb, and da:shtan (to have) as a progressive auxiliary verb, or clitic pronouns like -esh (it) were annotated as INSEP. Table 1 presents these tags and the colors used for each of them.
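The annotation scheme and the INSEP rule above can be illustrated with a small Python sketch. The record layout, the tag names for intervening material, and the lemma format are hypothetical. The sketch also assumes that an LVC counts as SEP whenever anything other than the listed items stands between the PV and the LV, which the text implies but does not state.

```python
from dataclasses import dataclass

# Hypothetical tags for material that may stand between the preverbal
# element (PV) and the light verb (LV) without counting as separation:
# negation, imperfective mi, modals, auxiliaries, and clitic pronouns.
TRANSPARENT_TAGS = {"NEG", "IMPF", "MODAL", "AUX", "CLITIC"}


@dataclass
class LVCAnnotation:
    pv: str             # preverbal element, annotated as PV
    lv: str             # light verb (infinitive), annotated as LV
    past_lemma: str     # PV plus the past stem of the LV (assumed format)
    present_lemma: str  # PV plus the present stem of the LV (assumed format)
    separability: str   # "SEP" or "INSEP"


def classify_separability(intervening_tags):
    """Return INSEP when nothing, or only transparent items, intervene."""
    if all(tag in TRANSPARENT_TAGS for tag in intervening_tags):
        return "INSEP"
    return "SEP"


def annotate(pv, lv, past_stem, present_stem, intervening_tags):
    """Build one annotation record for an LVC occurrence."""
    return LVCAnnotation(
        pv=pv,
        lv=lv,
        past_lemma=f"{pv} {past_stem}",
        present_lemma=f"{pv} {present_stem}",
        separability=classify_separability(intervening_tags),
    )


# harf zadan with negation and mi between PV and LV stays INSEP;
# an intervening adjective (hypothetical tag "ADJ") makes it SEP.
print(annotate("harf", "zadan", "zad", "zan", ["NEG", "IMPF"]))
print(annotate("harf", "zadan", "zad", "zan", ["ADJ"]))
```

Keeping the transparent items in a single set makes the rule easy to adjust if the annotation guidelines treat further elements as non-separating.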
Discussion of Results and Conclusion
Light Verb Constructions (LVCs), as a subset of complex or multi-word predicates, are among the most challenging topics in the study of language. The present study developed a monolingual corpus of Persian LVCs with the aim of providing researchers with a large body of machine-readable data on these challenging constructions and improving the reliability of studies conducted in this field. The corpus includes about 6,000 LVCs in more than 2,000,000 contexts. In contrast, the number of lexical verbs in Persian is about 200. This comparison highlights how significant this kind of linguistic resource can be for a language and its researchers. Such resources can be used in machine translation, artificial intelligence and language processing programs, data recovery programs, language learning, grammar books, and dictionaries.
[1]. The corpus of Light Verb Constructions of Persian is available at https://literature.ut.ac.ir/compound-verb.
|