Sunday, November 21, 2010

Nepali POS tagset: The Nelralec Tagset (Nt-01)

The Nelralec (Nepali Language Resources and Localization for Education and Communication) tagset for Nepali was developed by a team comprising the linguists Yogendra Yadava, Ram Lohani, Bhim Regmi and Andrew Hardie on the basis of the EAGLES guidelines for morphosyntactic annotation of corpora. The Nelralec tagset is fully hierarchical: in a tag such as VVYN1F, the initial letter V indicates the grammatical category, i.e. verb; the following V indicates that the verb is finite; and the letter Y indicates third person. The fully specified tag VVYN1F indicates a very tightly defined, narrow category: feminine singular non-honorific third person finite verbs, such as "chE".
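The hierarchical reading of a tag can be sketched in a few lines of Python. This is only an illustration, not part of the tagset documentation: the table covers the single tag VVYN1F, the names VERB_POSITIONS, decode and ancestors are hypothetical, and the values for N, 1 and F are inferred from the gloss of VVYN1F rather than stated letter by letter in the text.

```python
# Hypothetical lookup table covering only the verb tag VVYN1F.
# The first three values are stated in the text; N, 1 and F are
# inferred from the gloss "feminine singular non-honorific".
VERB_POSITIONS = [
    ("category", {"V": "verb"}),
    ("finiteness", {"V": "finite"}),
    ("person", {"Y": "third person"}),
    ("honorific", {"N": "non-honorific"}),
    ("number", {"1": "singular"}),
    ("gender", {"F": "feminine"}),
]


def decode(tag):
    """Return (attribute, value) pairs, one per character of a verb tag."""
    features = []
    for char, (attribute, values) in zip(tag, VERB_POSITIONS):
        features.append((attribute, values.get(char, "unknown")))
    return features


def ancestors(tag):
    """List every prefix of the tag, from most general to most specific."""
    return [tag[:i] for i in range(1, len(tag) + 1)]


print(ancestors("VVYN1F"))
for attribute, value in decode("VVYN1F"):
    print(attribute, "=", value)
```

Because each additional character narrows the category, every prefix of a tag can be read as a broader category that contains it, which is what "fully hierarchical" amounts to in practice.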
The tagset is compiled with respect to standard Nepali; hence, dialectal differences are not taken into consideration. Although it is primarily based on previous analyses of Nepali grammar, for instance Acharya (1991), the tagset has two main structural features that distinguish it from a standard grammatical analysis of Nepali. This is because the tagset was conceived and developed as a model of Nepali grammar for the purpose of POS annotation. In other words, it is an abstraction designed to form a basis for manual and automatic POS annotation of tokens.
First, a single graphical token that contains multiple elements is broken into several tokens, each of which is annotated accordingly. A form that is detached from the start or end of another token and made into a separate token of its own is called a 'clitic' in this tagging scheme. Both the split token and the 'clitic' are marked with the symbol #. For example, the Nepali postpositions, which are preferentially written as affixes on the noun or other word that they govern, are treated as separate tokens in this scheme of analysis. This gives the tagset the flexibility needed to handle the very large array of possible configurations of case markers. Second, tense, aspect and mood are not marked on finite verbs, which are classified solely according to their agreement marking. This is a necessary simplification for dealing with the complex verbal inflections of Nepali, which, together with the use of compound verbs, could not be indicated by the tagset without thousands of additional categories.
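The splitting of a postposition from its host can be sketched as follows. This is a minimal illustration under stated assumptions: the POSTPOSITIONS list is a small illustrative subset written in the transliteration used here, not the inventory of the tagset, and placing # at the split point on both the host and the clitic is an assumed convention (the text only says that both parts are marked with #).

```python
# Illustrative subset of postpositions; an assumption, not the
# tagset's actual inventory.
POSTPOSITIONS = ["lAI", "kO", "mA", "lE"]


def split_clitic(token):
    """Split one trailing postposition off a token.

    Both resulting tokens carry '#' at the split point; this exact
    placement is an assumption made for the sketch.
    """
    # Try longer postpositions first so that the longest match wins.
    for post in sorted(POSTPOSITIONS, key=len, reverse=True):
        if token.endswith(post) and len(token) > len(post):
            host = token[: -len(post)]
            return [host + "#", "#" + post]
    # No known postposition at the end: leave the token whole.
    return [token]


print(split_clitic("chOrAlAI"))
print(split_clitic("chOrI"))
```

Plain suffix matching of this kind would wrongly split any word that merely happens to end in one of the listed strings, so a real tokeniser would need a lexicon or manual checking on top of it.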
On the other hand, the treatment of compound nouns is very different from that of 'clitics'. In Nepali, compound as well as reduplicated words can be written in one of three ways, as shown below:

(22) chOrA chOrI (as two separate tokens) (lit. Son daughter - ‘children’)

(23) chOrA-chOrI (with hyphen) (lit. Son-daughter - ‘children’)

(24) chOrAchOrI (as a single token) (lit. Sondaughter - ‘children’)

(22) will be tagged as two separate tokens. (23) and (24) are tagged according to the nature of the last element of the compound, i.e. the tag is the same as that of "chOrI".
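The rule for (23) and (24) can be sketched as a small function. The LEXICON below is a hypothetical two-word lexicon built from the example, and looking up the longest known word at the end of a fused compound is an assumed strategy for finding the "last element"; the source does not say how that element is identified.

```python
# Hypothetical mini-lexicon; NN is the common-noun tag of the tagset.
LEXICON = {"chOrA": "NN", "chOrI": "NN"}


def tag_compound(token, lexicon=LEXICON):
    """Tag a hyphenated or fused compound by its last element."""
    if "-" in token:
        # Hyphenated compound, as in (23): take the part after the last hyphen.
        last = token.rsplit("-", 1)[1]
        return lexicon.get(last)
    # Fused compound, as in (24): find the longest known word at the end.
    for start in range(len(token)):
        if token[start:] in lexicon:
            return lexicon[token[start:]]
    return None


# (22): two separate tokens, each tagged on its own.
print([LEXICON[t] for t in "chOrA chOrI".split()])
# (23) and (24): one token, tagged like its last element.
print(tag_compound("chOrA-chOrI"))
print(tag_compound("chOrAchOrI"))
```

In this particular example both elements are common nouns, so the rule only becomes visible in the token count: (22) yields two tags, while (23) and (24) yield one.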

Nouns are classified into two types: proper and common. Case and number endings are tokenised separately from the noun token, and case endings are treated as postpositions. A model of number-gender in Nepali is developed for the purpose of POS tagging. Gender markers in Nepali such as -O, -I and -A, as in "chOrO", "chOrI" and "chOrA" (son, daughter and son, respectively), are ignored on nouns on two grounds. Firstly, these features are lexical-derivational on nouns. Secondly, the markers lack exact symmetric counterparts with regard to gender. For example, a token ending in -I such as AImAI (woman) has no masculine counterpart built on the same stem; compare marda (man). On the other hand, the same features on pronouns, adjectives, non-finite verbs, etc., where the distinction is motivated by agreement, are tagged accordingly. Even noun tokens with honorific markers like sara, sAheba, jyU, etc. are tagged as NN (common) or NP (proper). In the Nelralec tagset, postpositions are those clitics (as defined in this tagging scheme) that are detached from the noun token, such as case markers, the plural suffix, etc. Similarly, Nepali classifiers are annotated separately.
Nepali adjectives are divided into five types according to their morphological behaviour. The types are primarily based on gender-number agreement: masculine singular, feminine singular, other (for masculine and feminine plurals), unmarked (for indeclinable adjectives), and a common tag for both comparative and superlative adjectives.
As this POS tagset is developed as a model of Nepali grammar for POS annotation, pronominals are organised differently from traditional/descriptive grammar. Pronouns are divided into personal and reflexive. Personal pronouns are organised by person as first, second and other (for unspecified person), and honorificity is marked on five levels (see Hardie et al. 2005). In Nepali, the genitive case alters the phonetic form of the pronoun and cannot be separated from it as it can from a noun. Hence, the genitive pronoun is treated as a single unit with a tag like PMXKM, i.e. pronoun, first person, unmarked for honorific, possessive, masculine, for mErO/hAmrO. Similarly, ergative/instrumental case markers are inseparable from the pronoun.
The pronoun-determiner is organised as a separate tag category, subdivided into demonstrative, interrogative, relative and general (the mnemonics for the interrogative and relative types are labelled according to their form in Nepali). As a pronoun-determiner can function either as a demonstrative or as a pronoun in Nepali, it is imperative to tag the tokens on the basis of the local phrasal context.
Nepali has a large number of TAM combinations, and if every possible combination were tagged separately, the number of tags would be unmanageably large. Therefore, for a verb that contains two verb roots in a single token, the Nelralec tagset follows the convention that the last identifiable verb is the one considered for annotation. For example, in "garnEcha" (do-subjective mood.BE.prs), "cha" is taken into account for annotating the verb token. However, two separate verb tokens receive individual tags. Consequently, there is no distinction between main and auxiliary verbs in the tagset. Since the idea behind the Nelralec tagset is to accomplish POS annotation, certain aspects of Nepali verb morphology are ignored, viz. passive, causative and negative. Forms with these features are annotated like their counterparts, i.e. active, non-causative and positive, respectively.
Within the verbal domain, finiteness is distinguished on the basis of person marking. A verb with person marking is considered finite, and one without it non-finite. The non-finite verb forms comprise the participles, such as "gardO, gardI, gardA, gardai, garE, garnE, garEra", the subjunctive e-form, such as "garE" (note that it is phonetically the same as a participle), and the i-form, for instance "garI"; each is grouped accordingly. Similarly, command verb forms are tagged separately according to their honorific status.
In Nepali finite verbs, distinctions operate on person (first, second and third), number (singular and plural), gender (masculine and feminine) and honorific (non-honorific and medial). Theoretically, 24 tags can be derived from these (3 x 2 x 2 x 2); however, only 10 tags are required since not all combinations of these morphosyntactic features have separate forms in Nepali. Separate tags are also provided in this tagset for optative verbs, as they behave differently in many ways from the other finite verbs.
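The figure of 24 theoretical combinations can be checked by enumerating the feature values listed above. The variable names are illustrative only, and the sketch does not attempt to say which 10 combinations actually receive tags, since the text does not list them.

```python
from itertools import product

PERSON = ["first", "second", "third"]
NUMBER = ["singular", "plural"]
GENDER = ["masculine", "feminine"]
HONORIFIC = ["non-honorific", "medial"]

# Every theoretically possible combination of the four features.
combinations = list(product(PERSON, NUMBER, GENDER, HONORIFIC))

print(len(combinations))  # 3 * 2 * 2 * 2 = 24
```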
In the Nelralec tagset, the mnemonics of the tag elements follow the Nepali forms, such as M for first person (after "ma" (I)) and T for second person (after "timI" (you)). However, there is no uniform scheme for organising tags by type and attribute. For example, nouns are NN and NP, showing category and type (common and proper, respectively). Conversely, adjectives are distinguished as JM, JF, JO, JX and JT on the basis of morphosyntactic attributes, namely gender and degree (see Hardie et al. ibid.: 5-11 for details of other categories). In other words, the Nelralec tagset allows underspecification of both types and attributes among its 112 tags.
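The contrast between the two naming schemes can be laid out as simple lookup tables. This is a sketch covering only the tags named in this section: the pairing of JM, JF, JO, JX and JT with the five adjective types is assumed from the order in which the types are listed above, Y for third person is taken from the VVYN1F example, and the function name describe is hypothetical.

```python
# Person mnemonics: M and T are explained in the text; Y comes from
# the VVYN1F example.
PERSON_MNEMONICS = {"M": "first person", "T": "second person", "Y": "third person"}

# Noun tags: the second letter encodes the type.
NOUN_TAGS = {"NN": "common noun", "NP": "proper noun"}

# Adjective tags: the second letter encodes an attribute. The pairing
# below is assumed from the order of the five types in the text.
ADJECTIVE_TAGS = {
    "JM": "masculine singular",
    "JF": "feminine singular",
    "JO": "other (masculine and feminine plural)",
    "JX": "unmarked (indeclinable)",
    "JT": "comparative or superlative",
}


def describe(tag):
    """Look up a tag in the two small tables above."""
    return NOUN_TAGS.get(tag) or ADJECTIVE_TAGS.get(tag) or "not covered in this sketch"


for tag in ["NN", "NP", "JM", "JT", "VVYN1F"]:
    print(tag, "->", describe(tag))
```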

[From: Mallikarjun, B., Yoonus, M., Sinha, Samar & Vadivel, A. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Languages: pp. 22-25. ISBN-81-7342-197-8]

