Lingua Parasitica: 2010

Tuesday, November 30, 2010

Hampi 112010

Sunday, November 21, 2010

Nepali POS tagset: The Nelralec Tagset (Nt-01)

The Nelralec (Nepali Language Resources and Localization for Education and Communication) tagset for Nepali was developed by a team comprising the linguists Yogendra Yadava, Ram Lohani, Bhim Regmi and Andrew Hardie on the basis of the EAGLES guidelines for morphosyntactic annotation of corpora. The Nelralec tagset is fully hierarchical: in a tag such as VVYN1F, the initial letter V indicates the grammatical category, i.e. verb. The following V indicates that the verb is finite, and the letter Y indicates third person. The fully specified tag VVYN1F indicates a very tightly defined, narrow category: feminine singular non-honorific third person finite verbs, such as "chE".
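The hierarchical reading of a tag can be sketched in a few lines of Python. This is only an illustration: the labels for the first three positions are taken from the description above, the remaining positions are deliberately left undecoded, and the function names are hypothetical.

```python
# Labels for the first three positions come from the text above.
# The later positions (N, 1, F) are not decoded here because the
# text does not spell out which position carries which attribute.
LEVELS = {
    "V": "verb",
    "VV": "finite verb",
    "VVY": "third-person finite verb",
}


def ancestors(tag):
    """Return every prefix of a hierarchical tag, coarsest first."""
    return [tag[:i] for i in range(1, len(tag) + 1)]


def describe(tag):
    """Pair each prefix that has a known label with that label."""
    return [(p, LEVELS[p]) for p in ancestors(tag) if p in LEVELS]


def belongs_to(tag, category):
    """A tag belongs to a coarser category if the category is a prefix of it."""
    return tag.startswith(category)


print(ancestors("VVYN1F"))
print(describe("VVYN1F"))
print(belongs_to("VVYN1F", "VV"))
```

Because each prefix is itself a category, a search for all finite verbs reduces to a prefix match on "VV".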

The tagset is compiled with respect to standard Nepali; hence, dialectal differences are not taken into consideration. Interestingly, the tagset has two main structural features that distinguish it from a standard grammatical analysis of Nepali, even though it is primarily based on previous analyses of Nepali grammar, for instance Acharya (1991). As a matter of fact, the tagset is conceived and developed as a model of Nepali grammar for the purpose of POS annotation. In other words, it is an abstraction designed to form a basis for manual and automatic POS annotation of tokens.

First, a single graphical token which contains multiple elements is tokenised as separate tokens, i.e. the graphical unit is broken into several tokens, and each of them is annotated accordingly. The form which is detached from the start or end of another token and made into a separate token of its own is sometimes called a 'clitic' (in this tagging scheme). The split token and the 'clitic' are marked by the symbol #. For example, the Nepali postpositions, which are preferentially written as affixes on the noun or other word that they govern, are treated as separate tokens in this scheme of analysis. This gives the tagset the flexibility needed to handle the very large array of possible configurations of case markers. Second, tense, aspect and mood are not marked up on finite verbs, which are classified solely according to their agreement marking -- a necessary simplification for dealing with the complex verbal inflections of Nepali, which, together with the use of compound verbs, could not be indicated by the tagset without the use of thousands of additional categories.

On the other hand, the treatment of compound nouns is very different from that of 'clitics'. In Nepali, compound as well as reduplicated words can be written in one of three ways, as shown below:

(22) chOrA chOrI (as two separate tokens) (lit. Son daughter - ‘children’)

(23) chOrA-chOrI (with hyphen) (lit. Son-daughter - ‘children’)

(24) chOrAchOrI (as a single token) (lit. Sondaughter - ‘children’)

(22) will be tagged as two separate tokens. (23) and (24) are tagged according to the nature of the last element of the compound, i.e. they receive the same tag as "chOrI".
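The "last element" convention for compounds can be sketched as follows. The mini-lexicon, the choice of NN for both words and the longest-suffix lookup for fused compounds are assumptions made for illustration; the text only states that the tag follows the last element.

```python
# Hypothetical mini-lexicon: both words are assumed to be common nouns (NN).
LEXICON = {"chOrA": "NN", "chOrI": "NN"}


def tag_compound(token, lexicon=LEXICON):
    """Return (element used for tagging, tag), or (token, None) if unknown."""
    if token in lexicon:
        return token, lexicon[token]
    if "-" in token:
        # Hyphenated compound: the part after the last hyphen decides.
        last = token.rsplit("-", 1)[1]
        return last, lexicon.get(last)
    # Fused compound: look for the longest proper suffix found in the lexicon.
    for start in range(1, len(token)):
        suffix = token[start:]
        if suffix in lexicon:
            return suffix, lexicon[suffix]
    return token, None


# Written as two tokens, each word is tagged on its own.
print([tag_compound(t) for t in "chOrA chOrI".split()])
# Hyphenated and fused spellings are tagged after the last element.
print(tag_compound("chOrA-chOrI"))
print(tag_compound("chOrAchOrI"))
```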

Nouns are classified into two types: proper and common. Case and number endings are split off from the noun token and tokenised separately; the former are treated as postpositions. A model of number-gender in Nepali is developed for the purpose of POS tagging. Gender markers in Nepali like -O, -I and -A, as in "chOrO", "chOrI" and "chOrA" (son - dir.sg, daughter - dir/obl.sg, son - obl.lt/sg, respectively), are ignored on nouns on two grounds. Firstly, these features on nouns are lexical-derivational. Secondly, the markers lack exact symmetric counterparts with regard to gender. For example, there is no masculine counterpart of a token ending in -I like AImAI (woman); the corresponding masculine noun is marda (man). On the other hand, the same features on pronouns, adjectives, non-finite verbs, etc., where the distinction is motivated by agreement, are tagged accordingly. Even noun tokens with honorific markers like sara, sAheba, jyU, etc. are tagged as NN (common) or NP (proper). In the Nelralec tagset, postpositions are those clitics (as defined in this tagging scheme) that are detached from the noun token, like case markers, the plural suffix, etc. Similarly, Nepali classifiers are annotated separately.

Nepali adjectives, depending upon the nature of their morphological behaviour, are divided into five types. These types are primarily based on gender-number agreement i.e. masculine singular, feminine singular, other for masculine and feminine plurals, unmarked for undeclinable adjectives, and a common tag for both comparative and superlative adjectives.

As this POS tagset is developed as a model of Nepali grammar for POS annotation, pronominals are organised differently from traditional/descriptive grammar. Pronouns are organised as personal and reflexive. The former are organised on the basis of person as first, second and other (for unspecified person), and honorificity is marked on five levels (see Hardie et al. 2005). Interestingly, in Nepali the genitive case alters the phonetic form of the pronoun and cannot be separated as it is in the noun. Hence, the pronoun is treated as a single unit with a tag like PMXKM, i.e. Pronoun-1P-unmarked for honorific-possessive-masculine, for mErO/hAmrO. Similarly, ergative/instrumental case markers are also inseparable from the pronoun.

The pronoun-determiner is organised as a separate tag, and is subdivided into demonstrative, interrogative, relative and general (mnemonics are labelled according to their form in Nepali for the two interrogative and relative). As the pronoun-determiner functions as demonstrative and as a pronoun in Nepali, it is imperative to tag the tokens on the basis of the local phrasal context.

Nepali has a large number of TAM combinations, and if every possible combination were to be tagged separately, the number of tags would be unmanageably large. Therefore, in the case of a verb which has two verb roots but is a single token, the Nelralec tagset follows the convention that the last identifiable verb is taken into consideration for annotation. For example, in "garnEcha" (do-subjective mood.BE.prs), "cha" is taken into account for annotating the verb token. However, two separate verb tokens will receive individual tags. Consequently, there is no distinction between main and auxiliary verbs in the tagset. Since the idea behind the Nelralec tagset is to accomplish POS annotation, certain aspects of Nepali verb morphology are ignored, viz. passive, causative and negative. These are annotated as their counterparts, i.e. active, non-causative and positive, respectively.

Within the verbal domain, finiteness is distinguished on the basis of person marking. A verb with person marking is considered finite, and one without person marking non-finite. Among the non-finite verb forms, the participles like "gardO, gardI, gardA, gardai, garE, garnE, garEra", the subjunctive e-form like garE (note that it is phonetically the same as a participle) and the i-form, for instance "garI", are grouped accordingly. Similarly, command verb forms are tagged separately according to honorific status.

In Nepali finite verbs, the distinction operates on Person (First, Second and Third), Number (Singular and Plural), Gender (Masculine and Feminine) and Honorific (Non-honorific and Medial). From the above, theoretically speaking 24 tags can be derived; however, only 10 tags are required since not all the combinations of these morphosyntactic features have separate forms in Nepali. Interestingly, separate tags are designed in this tagset for optative verbs as they behave differently in many ways from the other finite verbs.
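The figure of 24 theoretical tags is simply the product of the sizes of the four feature sets (3 x 2 x 2 x 2). A short Python sketch makes the enumeration explicit; the feature names and values are taken from the paragraph above, while the variable names are illustrative.

```python
from itertools import product

# Feature values as listed in the text for Nepali finite verbs.
FEATURES = {
    "person": ["first", "second", "third"],
    "number": ["singular", "plural"],
    "gender": ["masculine", "feminine"],
    "honorific": ["non-honorific", "medial"],
}

# Every theoretically possible combination of the four features.
combinations = list(product(*FEATURES.values()))
print(len(combinations))  # 3 * 2 * 2 * 2 = 24
```

Which of these 24 combinations collapse into the 10 tags actually used depends on which forms Nepali distinguishes, and is not derivable from the arithmetic alone.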

In the Nelralec tagset, the mnemonics of the tag elements are schematised according to the Nepali form like M for first person (after "ma" (I)), T for second person (after "timI" (you)). Interestingly, there is no uniform scheme in organising tags on the basis of their types and attributes. For example, Nouns are NN and NP showing category and type - common and proper, respectively. Conversely, Adjectives are distinguished as JM, JF, JO, JX, and JT on the basis of the morphosyntactic attributes - gender and degree (see Hardie et al. ibid.: 5-11 for details of other categories). In other words, the Nelralec tagset assumes underspecification of both types and attributes among its 112 tags.

[From: Mallikarjun, B, Yoonus, M. Sinha, Samar & A, Vadivel. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Language: pp. 22-25. ISBN-81-7342-197-8]

Issues in POS Tagset Design

In the initial phase of POS tagset development for NLP purposes, tagsets were designed and developed from the machine learning point of view rather than from a linguistic point of view. Under such considerations, language is treated simply as a sequence of tokens to be associated with a given set of tags. In formal terms, a set of strings over Σ (i.e. any subset of Σ*) is called a formal language over Σ. Linguistic knowledge was thus neglected in designing tagsets.

However, with the growing realisation that linguistic knowledge is essential in any work on language, the issues involved in designing POS tagset are discussed from the linguistic perspectives too as these issues have wide implications on the annotation of the linguistic data, and the resultant output and application based on it.

In this section, the following conceptual design issues are discussed with relevant illustrations from Indian languages.

1. Theoretical Background

In the development of a new tagset, the developers will analyse linguistic data in light of a particular linguistic theory that they advocate. The development of tagset, therefore, is not theory independent or theory-neutral as one often wishes it to be due to the conflicting assumptions. Consequently, the theoretical assumptions play an important role in deciding many other aspects of tagset design.

However, it is also possible that the developers are application-oriented rather than linguistic-theory oriented. For example, the Machine Learning researchers using a POS tagged corpus for their experiments are primarily concerned with Machine-Learnable tagging than with a specific linguistic theory. Therefore, such researchers will develop POS tagset accordingly. Paradoxically, this view has dominated the development of POS tagset to a large extent.

English being the first language of corpus linguistics, the grammatical frameworks chosen to describe its POS were Generalised Phrase Structure Grammar and Lexical Functional Grammar, which promoted the notion that a category is composed of a bundle of features. In the Indian language POS tagset scenario, the IIIT-Hindi tagset and the Telugu tagset developed by CALTS, Hyderabad (Sree R. J. et al. 2008) are based on the Paninian perspective (for details see section 6). However, it is desirable that a tagset is not theory-laden but still supports linguistic analysis.

2. Form and Function

One of the major decisions that the tagging schema needs to resolve is a tagging decision between form and function of a token in the text. As a given word/token may function differently in different syntactic contexts, they may be assigned different tags depending upon the function rather than on the form. Such cases, however, pose a computational complexity for automatic tagging, since more than one tag is given for the same form but with different contextual syntactic functions. On the other hand, two syntactic functions of a token/word may be assigned a single tag on the basis of its form. This also leads to information loss.

To maintain a firm decision between form and function, different approaches have been adopted for POS tagging, and each approach has an underlying assumption to validate the decision. For instance, a token is POS tagged on the basis of form rather than function in AnnCorra (Bharati et al. 2006). This decision rests on the argument that it removes the choices involved in manual tagging and establishes a token-tag relation which leads to efficient machine learning. In contrast to AnnCorra, the Stuttgart-Tübingen Tag-Set (STTS) for German (Atwell ms.) makes a linguistically motivated distinction between attributive and predicative adjectives. However, there are other approaches where there is a division of labour across the hierarchy with respect to form and function. The ILPOSTS-based Hindi tagset developed by MSRI is one such tagset: it takes morphosyntactic form into account for assigning attribute values (the lowest level in the hierarchy), and function for annotating the Type (the mid level of the hierarchy).

Knowles & Don (2003) have devised another approach for Malay, a language in which words change their function according to context. For example, "masuk" is a verb in one context, but it is a noun, "entrance", in the context of a building, car-park, etc. Acknowledging this linguistic fact, Knowles & Don's tagset for Malay separates lexical class or form from syntactic function, and gives each word in the lexicon only one class-tag. They use the term ‘tag’ to label a lexical class, and ‘slot’ to refer to a position in syntactic structure (see Atwell ms.: 19).

Yet another view on this dichotomy is expressed as the following. To illustrate the form and function dichotomy, “maaDi” in Kannada is ambiguous between plural imperative and past verbal participle. A tagger needs to resolve such ambiguity through context. However, there is no need to consider those distinctions which are entirely within the scope of syntax. For example, syntax allows, as a general, universal rule, that nouns can act as adjectival modifiers. This rule is very much a part of any syntactic system. Hence, a tagger need not tag a noun as an adjective because of its function. This is unnecessary and it adds to the complexity of machine learning (Kavi Narayana Murthy, p.c.).

This view asserts that ambiguity arising out of form needs to be disambiguated at POS level provided there are no syntactic rules to account for its function. In other words, POS tagging is primarily based on form, and function is a secondary concern of tagging to be carried out as a last resort for disambiguation.

3. Granularity: Coarse Vs. Fine

One of the important concerns in developing a tagset for a language is granularity: coarseness and fineness. These refer to broad and fine annotation, respectively, of any grammatical category. The aim of corpus annotation is to maximise information content so that the tagged corpus can be used for a variety of applications. But since the applications are not known in advance, the level of linguistic annotation required is also unknown. General corpus developers, as a principle, prefer to maximise linguistic enrichment by designing the tagset in such a way that the annotation can be customised according to the needs of the application.

However, in POS tagset design, there are two schemes for granularity. Coarse annotation has far fewer tags than fine-grained annotation, and aids higher accuracy in manual tagging and efficient machine learning. Despite such advantages, a coarse-grained POS tagset is of less use as it does not capture much relevant information on POS. On the other hand, finer annotation provides a very large amount of information but also creates a problem for automatic tagging, as it maximises the tag options for a given token, leading to computational complexity.

In view of the above-mentioned advantages and disadvantages of the two schemes, an ideal POS tagset design strikes a subtle balance. However, it is important to remember that not all linguistic information can be annotated at the POS level, nor can all other linguistic information be recovered from other levels of annotation. As a rule of thumb, it is imperative to capture optimal information at this level of annotation. In other words, POS design has to be such that coarse as well as fine information can be retrieved as per the needs of the application.

In this context of granularity, the hierarchical architecture has an edge over the flat architecture as it allows information to be modularised accordingly. This is usually conceived along the levels of the hierarchy: the deeper the level, the finer the features encoded. A flat tagset, on the other hand, may be too coarse or too fine, or may lose relevant information in the POS tagged corpus.

The Text Analytics and Natural Language Processing (Tanl) tagset (Attardi & Simi (Ms)) used for the EVALITA09 POS tagging is one such tagset designed for both coarse- and fine-grained annotation. It consists of 328 tags, and provides three levels of POS tags: coarse-grain, fine-grain and morphed tags. The coarse-grain level consists of 14 categories; the fine-grain level has 36 tags, such as indefinite, personal, possessive, interrogative and relative pronoun within the pronoun category; and the morphed level consists of 328 categories, which include morphological information like person, number, gender, tense, mode, and clitic.

4. Orthographic Conventions

One of the major issues that one faces in designing a tagset is to account for orthographic practices that are beyond the known linguistic principles of categorisation. It is a known linguistic fact that meaning is not necessarily expressed by a single token; it may be expressed by a group of tokens. Such a linguistic unit has come to be known as a multi token word (MTW) (also commonly known as a multiword expression in the computational literature). For example, the complex postposition के लिए in Hindi collectively expresses a single meaning of "benefaction/purposive" (as a case marker). In isolation, के is a masculine genitive case marker and लिए has no semantic content. Ideally, therefore, के लिए can be tagged in one of three ways:

(4) के\ and लिए\ as two separate POS labels (though tag for लिए is an issue).

(5) [के\ लिए\]\ as a single complex postposition with two different POS labels.

(6) [के लिए]\ as a complex but a single POS label.

One of the major decisions that a tagset design has to take firmly concerns the POS labelling of the different tokens of a single lexical word. It is often the case that such issues are tagged ad hoc/arbitrarily at the POS level of annotation, and are resolved at a higher level like local token grouping/chunking, where a group of tokens is assigned a single tag.

Apart from MTWs, contractions pose a major issue with respect to tokens and linguistic annotation. In contrast to MTWs, contractions are orthographic forms that are shorter than the usual form, reflecting the spoken form while partially retaining the usual orthographic form. For example in Nepali, भा'थ्यो or भा'-थ्यो is the contracted form of भएको थियो. The contracted form भा is a contraction of the participial भएको, which is different from the dubitative particle भा. Similarly, थ्यो is a contracted form of थियो.

There are two known approaches to tackle this orthographic convention. The first approach considers the form as an orthographic convention reflecting a spoken form of two known distinct tokens. Therefore, the contracted forms are pre-processed, tokenised as two separate tokens after separating punctuation markers from these tokens, and tagged accordingly. In the case mentioned above, भा will be tagged as a participial and थ्यो as a verb, assuming them to be alternative orthographic forms of their respective categories. Alternatively, the contracted form is considered as a single token reflecting a linguistic reality in the mind of the speaker/author, and the token is tagged in accordance with the language use.
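The pre-processing step of the first approach can be sketched as a small splitting function. This is a minimal illustration, assuming that the contraction is marked by an apostrophe (straight or curly), optionally followed by a hyphen, as in the two spellings quoted above; the function name is hypothetical.

```python
import re

# An apostrophe (straight or curly), optionally followed by a hyphen,
# is treated as the punctuation that separates the two contracted parts.
_CONTRACTION_MARK = re.compile(r"['\u2019]-?")


def split_contraction(token):
    """Split a contracted form into its parts, dropping the punctuation."""
    return [part for part in _CONTRACTION_MARK.split(token) if part]


print(split_contraction("भा'थ्यो"))
print(split_contraction("भा'-थ्यो"))
```

Each resulting part can then be tagged on its own. Under the second approach this step is simply skipped and the whole form is tagged as one token.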

5. Computational Complexity

One of the important functions of POS tagging is to resolve category level ambiguity. Paradoxically, in practice, there remain many issues where ambiguity remains unresolved or partially resolved even after POS tagging, and becomes a source of ambiguity for further processing. In this context, it is important to remember that the ambiguities are related with token-tag rather than semantic or structural ambiguity.

One of the most commonly cited examples is case syncretism, where the same form is used for different case markers. For example, the Hindi dative and accusative case markers have the same phonetic/orthographic form, को. In a form-based approach, let us assume that को is consistently tagged as dative, irrespective of linguistic contexts in which it is accusative. This results in a loss of the linguistic information that को is also an accusative case marker in Hindi. This approach makes it easier for a machine learning algorithm to POS tag, but the resultant output loses relevant linguistic information. Though such an approach solves the issue ad hoc at the POS level of annotation, its result needs to be recategorised and reassigned the appropriate tag in association with other levels of annotation, like semantic tagging, in order to regain the lost linguistic information, which is significant for higher level processing.

In a function-based approach, the annotator has to distinguish each case and tag accordingly, which adds cognitive load on the annotator (see section 5.6), but each piece of linguistic information is tagged appropriately despite the similar forms. For the machine, however, it is a more difficult task to distinguish the POS tags, technically to disambiguate, as there is no additional linguistic cue to distinguish the two (see Bhat & Richa (ms.) for a detailed discussion of the issue). Thus, a system requires other tools and techniques to disambiguate, adding to computational complexity.

As a matter of fact, these approaches are in a tug-of-war between detailed linguistic tagging and ease of cognitive load on the annotator and/or subsequent automatic tagging. In an ideal tagging scheme, these two aspects are finely balanced so that it remains optimal with respect to the design scheme and the various processes at both the manual and the machine level. Therefore, it is imperative to validate a POS tagset at, and across, various NLP processes in order to achieve computational as well as manual optimality.

6. Cognitive Load on Annotator

One of the major objectives of corpus linguistics is to design a tagger which minimises the human labour of annotating text. Such an automatic tagger, however, requires linguistic knowledge. Ideally, an automatic tagger is "trained" by giving it the results of a manually annotated corpus, also called a "training corpus". It is on the basis of the training corpus that the automatic tagger gains linguistic knowledge, in association with machine learning techniques.

With respect to POS tagging, automatic tagger is trained to acquire knowledge to establish a tag-token relationship. The tagger acquires this knowledge from "training corpus", which is manually POS annotated. This, in turn, establishes a work flow that manual annotation forms the backbone of all kinds of annotation for NLP tasks.

Given the importance of manual annotation, and of POS annotation specifically, for NLP tasks, it is important to ensure that manual POS annotation has zero errors. Since manual tagging is a tedious process, it is always desirable to reduce the tagging load on the annotator to ensure such a standard. The annotation process should be simple, intuitive and pleasant, so that the cognitive load on the annotator is reduced as far as possible. The first requisite is to make the user comfortable with the GUI-based tool. The look and feel of the tool can be customised by the user so that it provides an environment in which the user can work comfortably.

To reduce cognitive load on the annotator, the tool can be designed in such a way that it reduces the number of manual annotation steps, which in turn minimises human error in tagging. For example, in Nepali, Direct Case has the value "0" for Case Marker, and Oblique takes a morphological Case Marker as given in the values. The tool needs to be programmed in accordance with such linguistic facts, so that the value for Direct Case is assigned automatically whereas for Oblique the value is assigned manually. As a consequence of such a filtering program, the chances of error with respect to Direct Case are reduced. The tool, therefore, needs to be flexible enough to be customised with filters to accommodate language-specific tagging facts while tagging data from many languages.
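A filter of this kind can be sketched as a function that pre-fills whatever follows from the linguistic facts and leaves the rest to the annotator. The function name and the string values "direct" and "oblique" are hypothetical; only the rule that Direct Case takes "0" as its Case Marker comes from the text.

```python
def prefill_case_marker(case):
    """Return the Case Marker value the tool can fill in by itself.

    None means the value cannot be predicted and the annotator must supply it.
    """
    if case == "direct":
        # Direct Case always has "0" as its Case Marker value.
        return "0"
    if case == "oblique":
        # Oblique takes a morphological case marker chosen by the annotator.
        return None
    raise ValueError(f"unknown case value: {case!r}")


print(prefill_case_marker("direct"))
print(prefill_case_marker("oblique"))
```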

It is also desirable to annotate finite lists of items automatically. For example, punctuation markers are finite, and the tool can be designed to tag them automatically, sparing the manual annotator the repetition that would otherwise be required.
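As a rough sketch of such automatic pre-tagging, a closed list can be looked up before the annotator sees the text. The punctuation set below is illustrative rather than exhaustive, and the label "PUNC" is a hypothetical tag name, not one taken from any of the tagsets discussed here.

```python
# Illustrative closed list of punctuation markers (not exhaustive).
PUNCTUATION = {".", ",", "?", "!", ";", ":", "।"}
PUNC_TAG = "PUNC"  # hypothetical label


def pretag(tokens):
    """Tag punctuation automatically and leave every other token untagged (None)."""
    return [(tok, PUNC_TAG if tok in PUNCTUATION else None) for tok in tokens]


print(pretag(["chOrA", "chOrI", "।"]))
```

Tokens left as None are the only ones the annotator still has to handle.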

The development and incorporation of such heuristics and linguistic facts into a tool based primarily on the POS tagset can help ease the cognitive load on the annotator and ensure a zero-error standard.

POS Annotation Vis-A-Vis Corpus Annotation

Annotation is a process of ascribing grammatical categories to the tokens/words of a corpus. Prior to corpus annotation, the Text Encoding Initiative (TEI) makes annotation of a corpus reader friendly and suggests universal grammatical categories for annotation, enabling corpora to be stored and transferred. Moreover, TEI uses Standard Generalised Mark-up Language (SGML), an ISO standard (ISO 8879) technology for defining generalised mark-up languages for documents, for text encoding and annotation purposes, and more recently XML has been adopted. This makes it possible to encode any textual resource in a manner that is hardware, software, and application independent.

Leech (1993) describes seven maxims for annotation of text corpora:

Reversibility: Annotation should be removable and the annotated corpus can be reverted back to raw corpus.
Extractability: Annotations should be able to be separated from the corpus text.
Reader Friendliness: Annotation has to be such that it is reader friendly.
Maker Explicitness: Manual as well as automatic tagging should make difference to the corpus user.
Potentiality: Annotation is a potential representation rather than absolute representation.
Mentality: Annotation should be theory independent.
Non-Standardness: None of the annotation scheme is regarded as the a priori standard. Standards emerge through practical consensus, and the set of corpus tags will, very likely, be revised many times during the course, in order to find an optimal set for each language.

As a design measure of POS tagset, it is widely accepted that the POS tagset will not include any derivational, etymological, syntactic, semantic or discourse information (Hardie 2003). However, composition of tags does have its significance in annotation. Leech (1997) suggests the following criteria for labelling tag:

Conciseness: It is more convenient to use concise label than verbose, lengthy ones. For example, "mas" rather than masculine.
Perspicuity: Interpretable labels are more user friendly than labels that cannot be interpreted. Cloeren (1999) writes, "For reasons of readability there is a preference for mnemonic tags. Full-length names may be clearer individually, but make the annotated text virtually unreadable." For example, "NMZ" is more easily interpreted as nominaliser than "NML".
Analysability: Decomposed labels are friendly to human annotator as well as machine.
Compositionality: A tag needs to be logically composed as a string of symbols representing levels of taxonomic categories. For example, a tag NC.mas.sg.dir.0.n.n in Hindi is for Category Noun, Type Common, and Attributes Gender, Number, Case, Case Marker, Distributive and Honorificity along with its valuation.
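The compositional Hindi tag quoted above can be decomposed mechanically, which also shows how a fine-grained tag can be projected onto coarser levels. The sketch below assumes that the first field is always a one-letter category followed by a one-letter type, and that the attributes always appear in the order listed in the text; the function and key names are illustrative.

```python
# Attribute order as given in the text for the Hindi example.
ATTRIBUTES = ["gender", "number", "case", "case_marker", "distributive", "honorificity"]


def parse_tag(tag):
    """Split a dotted tag into category, type and named attribute values."""
    head, *values = tag.split(".")
    if len(head) != 2 or len(values) != len(ATTRIBUTES):
        raise ValueError(f"unexpected tag shape: {tag!r}")
    parsed = {"category": head[0], "type": head[1]}
    parsed.update(zip(ATTRIBUTES, values))
    return parsed


def coarsen(tag, level):
    """Project a full tag onto a coarser level of the hierarchy."""
    head = tag.split(".")[0]
    if level == "category":
        return head[0]
    if level == "type":
        return head
    if level == "full":
        return tag
    raise ValueError(f"unknown level: {level!r}")


print(parse_tag("NC.mas.sg.dir.0.n.n"))
print(coarsen("NC.mas.sg.dir.0.n.n", "category"))
print(coarsen("NC.mas.sg.dir.0.n.n", "type"))
```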

Leech & Smith (1990: 27) point out that syntactic parsing is arguably the central task of NLP, and POS tagging, being a prerequisite to parsing, is "the most central area of corpus processing." Though a POS tag has a limited scope for syntactic disambiguation, it shares a network of relationships with various other intermediate tasks in constructing an optimal system. In the architecture of corpus annotation, different levels of annotation feed POS tagging, and vice versa. POS tagging being a mid level NLP process, ideally it can and should make use of lexical level processes and should yield the results that are desired for syntactic parsing. Therefore, both these processes are to be considered in designing POS tagsets.

Bharati et al. (2006), among others, show that features from a Morph Analyser can be used for enhancing the performance of a POS tagger. In fact, they argue that the Morph Analyser itself can identify the parts of speech in most cases, and a POS tagger can be used to disambiguate the multiple answers provided by the Morph Analyser. In turn, POS tagged data is used for other higher level processes like chunking, parsing, etc.

A similar view on POS tagging is expressed by Kavi Narayana Murthy (p.c.), who considers POS annotation a mid level process depending on lexical level annotation and processes. In his view, the Lexicon can, in a sense, be considered a tagger in that it tags root forms of words/lemmata with 'all possible' tags. The Morph Analyser deals with inflected and/or derived forms of words and assigns 'all possible tags' to all valid forms of all valid words in a given language. Given this, a tagger that tags words/tokens in running text can be viewed as a disambiguator rather than as an assigner of tags. All possible tags have already been assigned by the Lexicon/Morph Analyser, and the task of POS level annotation is only to eliminate or at least reduce ambiguities, if any. Further, he opines that this approach to POS tagging has several advantages:

Impossible tags are never assigned.
Words which have only one possible tag need not even be considered, only ambiguous cases need to be considered by the tagger.
The degree and nature of ambiguities can be studied, both at the root word level (from the Lexicon) and at the running text level (from a tagged corpus).
Not all ambiguities are of the same nature. Therefore, different strategies can be formulated to handle them. For example, some kinds of ambiguities are easily solved using local context while others may inherently need long distance dependencies to be considered.
In the context of Indian languages, he emphasises that Indian languages are morphologically very rich and a Morph Analyser is the most essential component for processing which can substantially reduce the ambiguities at the Lexicon though it can also introduce some ambiguities of its own. But the ambiguities introduced by the Morph Analyser are always uniform and fully rule governed. This helps us to design a judicious combination of linguistic (rule based/knowledge based) and statistical/machine-learning approaches.
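The view of the tagger as a disambiguator can be sketched as follows. The candidate table stands in for the output of a Lexicon/Morph Analyser, and both its contents and the tag labels are invented for illustration; `choose` is a placeholder for whatever rule-based or statistical strategy is used on the ambiguous cases.

```python
def tag_sentence(tokens, candidates, choose):
    """Tag tokens by choosing only among the tags already proposed for them.

    candidates maps each token to its list of possible tags.
    choose(tokens, index, options) resolves the ambiguous cases.
    """
    result = []
    for i, tok in enumerate(tokens):
        options = candidates[tok]  # a KeyError signals a token the analyser did not cover
        if len(options) == 1:
            # Unambiguous tokens never reach the disambiguator.
            result.append(options[0])
        else:
            picked = choose(tokens, i, options)
            if picked not in options:
                raise ValueError("the disambiguator returned an impossible tag")
            result.append(picked)
    return result


# Hypothetical candidate lists and labels.
CANDIDATES = {"they": ["PRON"], "can": ["AUX", "NOUN", "VERB"], "fish": ["NOUN", "VERB"]}


def first_option(tokens, index, options):
    """Trivial stand-in strategy: always take the first proposed tag."""
    return options[0]


print(tag_sentence(["they", "can", "fish"], CANDIDATES, first_option))
```

By construction an impossible tag is never assigned, and the share of tokens that reach `choose` gives a direct measure of how much ambiguity the running text contains.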

[From: Mallikarjun, B, Yoonus, M. Sinha, Samar & A, Vadivel. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Language: pp. 2-4. ISBN-81-7342-197-8]

POS Annotation

The ultimate goal of research on Natural Language Processing (henceforth, NLP) is to understand human language, and to facilitate human-machine interaction through human, natural language (weak AI) and to model theory of mind (strong AI). To achieve such a Promethean mission, research on NLP has focussed on various intermediate tasks that make partial sense of language structure without requiring complete understanding; consequently, contributing in developing a successful system. Part-of-Speech (henceforth, POS) tagging is one such task.

In corpus linguistics, POS tagging, also called grammatical tagging or word-category disambiguation, is a classification system: a process of marking up the words in a text corpus as corresponding to particular parts of speech, based on both their definition and their context, i.e. their relationship with adjacent and related words in a phrase, clause, or sentence. It is the most common form of corpus annotation and is widely accepted as the first stage of a more comprehensive syntactic annotation. It serves a wide range of applications, such as speech synthesis and recognition, information extraction, partial parsing, machine translation and lexicography, to name a few.

However, it is important to remember that a POS tag is different from a part-of-speech label (as understood in general parlance). The latter captures the basic grammatical category of a word/token in a given language, without any specific information about its morphosyntactic content or about punctuation markers. A POS tag, on the other hand, annotates a word/token in its entirety, following the writing conventions of the language/script/orthography concerned, including punctuation. Hence, it also needs to accommodate many issues that are non-linguistic but arise from writing conventions.

As a process, POS tagging assigns a tag to a specific unit of natural language text. Hence, the text to be tagged is first passed through a tokeniser, which applies various formatting rules to divide the text into tokens, i.e. units of written material separated by white space.
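The tokenisation step can be sketched in a few lines of Python. This is only a minimal illustration, not the tokeniser used for any particular corpus: the function name `tokenise` and the choice to peel punctuation off the edges of each white-space chunk are assumptions made here to stand in for the "formatting rules" mentioned above.

```python
import string

# Assumption: ASCII punctuation plus the Devanagari danda marks count as punctuation.
PUNCTUATION = set(string.punctuation) | {"\u0964", "\u0965"}


def tokenise(text):
    """Split text on white space, then make edge punctuation into separate tokens."""
    tokens = []
    for chunk in text.split():
        start, end = 0, len(chunk)
        # Find the span of the chunk that is not leading/trailing punctuation.
        while start < end and chunk[start] in PUNCTUATION:
            start += 1
        while end > start and chunk[end - 1] in PUNCTUATION:
            end -= 1
        tokens.extend(chunk[:start])       # leading punctuation, one token each
        if start < end:
            tokens.append(chunk[start:end])
        tokens.extend(chunk[end:])         # trailing punctuation, one token each
    return tokens


print(tokenise("Ram went home."))
```

Punctuation inside a chunk (for example a hyphen in the middle of a word) is deliberately left alone; a real tokeniser would need language-specific rules for such cases.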

At a formal level of description, POS tagging can be stated as in (1), where a sequence of tokens W= w1...wn corresponds to a sequence of tags T=t1...tn, drawn from a set of tags {T}.

(1) $S = \arg\max_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)$ (Dandapat 2008: 4)

This description (1) implies that the input to the tagger is a whole sentence and the output is a whole sequence of tags. Such a formalisation also assumes that tagging is an independent process, independent of the dictionary and morphology: the task is simply to assign tags. It is therefore possible for a word/token to be tagged with a category to which it can never belong; for example, "in" can be tagged as a verb by a POS tagger. In other words, (1) states that a token is given a tag in the context of the adjacent tokens, and posits no relationship between a token and a tag based on the morphosyntactic cues of the former. On the other hand, tagsets are designed to capture finer morphosyntactic details; consequently, a large number of tags are devised as a relationship between a token and a tag without any dependence on the former's contextual position. Under such criteria, POS tagging can be stated as,

(2) $S = \arg\max_{t} P(t \mid w)$ (Kavi Narayana Murthy, p.c.)

However, a large number of POS tagsets are designed to annotate on the basis of both the form and the function of a token in a given clause/sentence. Consequently, such tagging is based neither solely on form nor solely on function, and can be formally expressed as,

(3) $S = \arg\max_{t_1 \ldots t_n} P(t_i \mid w_1 \ldots w_i \ldots w_n)$ (Kavi Narayana Murthy, p.c.)
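The difference between the sequence-level formulation (1) and the token-level formulation (2) can be made concrete with a toy example. The sketch below is purely illustrative: the tag names, the probability tables and the way the sequence score is factorised (a lexical probability multiplied by a tag-to-tag transition probability) are all assumptions invented here, not a model taken from the works cited. The search is brute force, which is feasible only for very short sentences.

```python
from itertools import product

TAGS = ["DET", "NOUN", "VERB"]

# Made-up P(tag | word) values for two English tokens.
LEXICAL = {
    "the": {"DET": 1.0},
    "can": {"NOUN": 0.4, "VERB": 0.6},
}

# Made-up P(tag | previous tag) values; "<s>" marks the sentence start.
TRANSITION = {
    ("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.1, ("<s>", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
}


def tag_per_token(words):
    """Formulation (2): choose the best tag for each word in isolation."""
    return [max(TAGS, key=lambda t: LEXICAL[w].get(t, 0.0)) for w in words]


def sequence_score(words, tags):
    """Score a whole tag sequence under the assumed factorisation."""
    score, previous = 1.0, "<s>"
    for word, tag in zip(words, tags):
        score *= LEXICAL[word].get(tag, 0.0) * TRANSITION.get((previous, tag), 0.0)
        previous = tag
    return score


def tag_sequence(words):
    """Formulation (1): choose the best tag sequence for the whole sentence."""
    candidates = product(TAGS, repeat=len(words))
    return list(max(candidates, key=lambda tags: sequence_score(words, tags)))


sentence = ["the", "can"]
print(tag_per_token(sentence))   # token-level decision
print(tag_sequence(sentence))    # sentence-level decision
```

With these invented numbers, the token-level tagger prefers VERB for "can" because that is its most probable tag in isolation, whereas the sequence-level tagger prefers NOUN because a noun is far more likely after a determiner. Formulation (3), which conditions each tag on the whole word sequence, is not implemented here.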

The basic requirements for POS tagging are a POS tagset, a tagging scheme, and practical definitions of each tag and tag element, with the words and contexts to which each tag and tag element applies. Manual annotation requires an annotation tool with a graphical user interface (GUI), designed to assign tags from a specific tagset. Automatic tagging requires a tagger, i.e. a program that assigns a tag to each token in the corpus by implementing the tagset and tagging scheme in a tag assignment algorithm. Ideally, an automatic tagger is “trained” by giving it the results of a manually annotated corpus. The tagger then tags unseen text corpora on the basis of a set of rules or of a statistical analysis of the manually tagged corpus. A large number of methods, techniques and free/open source tools are available for automatic tagging (visit http://www-nlp.stanford.edu/links/statnlp.html#Taggers).
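What "training" on a manually annotated corpus means can be shown with the simplest possible statistical tagger, one that remembers the most frequent tag of each token. The sketch below is a baseline chosen here for illustration only; the function names, the toy corpus and the `UNK` placeholder for unseen tokens are assumptions, and none of the tools linked above is claimed to work this way.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn, for every token, the tag it most often carries in the training corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for token, tag in sentence:
            counts[token][tag] += 1
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}


def tag(tokens, model, default="UNK"):
    """Tag unseen text; tokens absent from the training corpus get the default tag."""
    return [(token, model.get(token, default)) for token in tokens]


# A toy "manually annotated corpus" (invented for this example).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("runs", "NOUN")],
    [("dog", "NOUN"), ("runs", "VERB")],
]

model = train(corpus)
print(tag(["the", "cat", "runs"], model))
```

Such a tagger ignores context altogether, so it cannot resolve the kind of ambiguity that needs the adjacent tokens; it is useful mainly as a baseline against which rule-based and statistical taggers are compared.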

Glossary

Ambiguity: In computational linguistics, ambiguity refers to a state in which more than one tag can be assigned to a given token.

Annotation Tool: A tool used for tagging.

Decomposable: A tag is known as decomposable if the string representing the tag contains one or more shorter sub-strings that are meaningful out of the context of the original tag. It is a desirable feature of the hierarchical tagset.

Hierarchical: The term “hierarchical”, when used of a tagset, means that the categories in that tagset are structured relative to one another. Rather than a large number of independent categories, a hierarchical tagset will contain a small number of categories, each of which contains a number of sub-categories, each of which may contain sub-sub-categories, and so on, in a tree-like structure (Hardie 2003: 48).

Lexicon: A list of possible tags for the root forms of all the valid words in a given language.

Local Token Grouping: A group of tokens that form part of a single linguistic word.

Morph Analyser: A tool that splits a given word into its constituent morphemes and identifies their corresponding grammatical categories.

Multi Token Word: A collection of separate tokens that together form a single lexical expression in a language, though they are written separately. Independently, these tokens may have a meaning of their own or none, but not the meaning of the single lexical expression.

Part-of-speech: Categories [that] group lexical items which perform similar grammatical functions (Greene & Rubin 1971: 3).

POS Tag: A POS label given to a token (optionally along with its morphosyntactic attributes).

Pre-processing: A process of normalisation of text before tokenisation.

Tag element: A part of a tag that provides information about one of the individual elements that make up the tag. Prototypically, it includes Type and other morpho-syntactic Attributes.

Tagset: A set of defined tags. A set of word categories to be applied to the word tokens of a text (Hardie 2003).

Tagging: The process of assigning a tag to a token. Also known as annotation.

Token: A printed item separated by white space.

Training corpus: A manually annotated corpus on which an automatic or semi-automatic tagger is trained to acquire linguistic knowledge.

Underspecification: A lack of feature in a given tagset in comparison with another tagset.
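The glossary entries for "Decomposable" and "Hierarchical" can be illustrated with the Nelralec tag VVYN1F. The sketch below treats every prefix of a tag as a node on the path from the most general category to the most specific one. The labels are the ones given in the description of the Nelralec tagset; prefixes that the description does not gloss are left unlabelled rather than guessed, and the assumption that every prefix is meaningful is made here only for illustration.

```python
# Labels as given in the description of the Nelralec tagset.
KNOWN_LABELS = {
    "V": "verb",
    "VV": "finite verb",
    "VVY": "third-person finite verb",
    "VVYN1F": "feminine singular non-honorific third-person finite verb",
}


def decompose(tag):
    """Return every prefix of a hierarchical tag, from most general to most specific."""
    return [tag[:i] for i in range(1, len(tag) + 1)]


def is_subcategory(tag, ancestor):
    """True if `tag` lies at or below `ancestor` in the tag hierarchy."""
    return tag.startswith(ancestor)


for prefix in decompose("VVYN1F"):
    print(prefix, "->", KNOWN_LABELS.get(prefix, "(not glossed here)"))
```

A practical consequence of decomposability is that a search for all finite verbs only needs a prefix test such as `is_subcategory(tag, "VV")`, instead of a list of every fully specified finite-verb tag.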

Monday, October 25, 2010

Darjeeling Photos 102010


सूर्यास्त पछिको एक क्षण

A moment after sunset as seen from Rangbull during Dashain. Pic from the verandah at Rangbull home.


धान बारी

Paddy field before harvest at Bong Busty, Kalimpong.


धानको बाला

A ripening stalk of paddy at Bong Busty, Kalimpong.


रुख टमाटर

An egg-shaped native Latin American edible fruit Tamarillo (Solanum betaceum (syn. Cyphomandra betacea)) at Rangbull.

पारी भित्तामा घाम-छायाँ

The play of the rays of the rising sun and shadow on the Pokhrebong and the Mirik ranges. Pic from Rangbull home.

काराङ-किरिङ बाटो र खोला

A winding tea garden road at Raney and the Balasun and her tributaries in the foothills. Pic from Rani Kup, Kurseong.


नेभाराको पात पछिको दृश्य

A view of Namchi where the statue of Guru Rinpoche, the patron saint of Sikkim is erected. Pic from Lopchu.

हरियो गोदावरी

A green chrysanthemum plant grown by my maila mama at Rangbull.

हाम्रो घर, रुख-पात माझमा

A view of my home at Rangbull which is 100 steps down from NH 55.

टाइगर हिल, दार्जीलिङ्ग

A view of the Tiger Hill at dusk from Bhotay Busty, Darjeeling.

नीलो पहाड़ र नीलै बादल, एक साझँमा

A view from Bhotay Busty, Darjeeling on an autumn evening.

Saturday, September 25, 2010

Mysore Photo 092010

सीता हरण
A lump of monsoon clouds brings an image to my mind as titled. Pic. at Mysore Palace

वर्ण मन्थन-गन्थन

A causerie on churning of speech sounds

Speech (वाक) and sign (सङ्केत) (of the Deaf) are the two basic manifestations of language (भाषा). Knowing how to speak/sign is as instinctive to people as knowing how to spin webs is to spiders, even though both manifestations are transitory in nature. On the other hand, writing (लेखाई) is an artifact – a significant cultural accomplishment designed to capture and record transitory speech.

Acknowledging this transitory nature of speech, shiksha (शिक्षा meaning phonetics), one of the six vedanga (lit. 'limbs of the Vedas') under the purview of linguistics in the Indian grammatical tradition, was intended for proper pronunciation of the sacred text (Cardona 1994). In a similar spirit, the Brahmi script and its derivatives like Devanagari, Sarada, Gurumukhi, Bangla, Assamese, Grantha, Kannada, Malayalam, etc., collectively called the Indic scripts, were primarily based on articulatory phonetics, and the units of orthography were designed to exhibit one-to-one correspondence with the speech sounds (see Murthy 2006: 273). This shows that the script bears a systematic relationship to language and has a systematic internal organisation. In other words, it follows certain principles and rules which qualify to be called a grammar, or grammar at the level of scripts. Panini's Shiva Sutra is one such formalisation for the Sanskrit language, which Kiparsky (1991) refers to as an akshar-samamnaya, an enumeration or exhaustive listing of the sounds of Sanskrit.

1. Writing system and Orthography

In the study of writing, a writing system refers to a set of visible or tactile signs used to represent units of language in a systematic way (Coulmas 1999: 560). These signs are individually termed 'character' (which, in the Unicode Standard 4, includes letter, diacritic, numeral, punctuation and technical notation) and collectively called a script (लिपी). A single language may utilise several scripts. For example, Nepali can be written in the Devanagari script as well as in the Roman script, as we do in SMS. However, we follow at least one set of rules and conventions for using a script in a particular language, which is understood and shared by a community. A set of rules which defines the set of signs used and how to write these signs, including punctuation and spelling, is called orthography (lit. 'correct writing') (वर्णविन्यास in Nepali).

The Nagari or Devanagari is used to write many languages of the Indian sub-continent, viz. Hindi, Marathi, Bodo, Konkani, Sanskrit and Nepali. Though these languages share a single script, their orthographies differ from each other. To illustrate, at the level of speech sounds Marathi has ळ whereas Nepali does not. In other words, each language uses a subset of the Devanagari to form its own orthography. Similarly, Hindi case markers are written separately with nouns, as in राम ने, but conjoined with pronouns, as in उसको, whereas Nepali case markers are attached in both cases, as in चामेले, उसलाई. Interestingly, a document entitled 'Sikkim Debt Law of 1910' shows that, in the past, case markers were written separately from Nepali nouns.

2. Devanagari: Principles & Organisation

The Devanagari script is based on phonetic principles which consider both the place and the manner of articulation of the speech sounds (Bright 1996: 384). In the context of writing, the written forms of these speech sounds are called वर्ण (lit. 'letters representing modulation of voice'), and the systematic arrangement of the वर्ण is usually referred to as वर्णमाला (lit. 'garland of letters'). However, in the Indian grammatical tradition, varna-samamnaya (as in Taittirīya-Prātiśākhya) and akshar-samamnaya are often used as near synonyms, though they belong to different knowledge domains (Kapoor 2007: 6 fn. 34). Vishnumitra-Vritti on the Rig Veda Pratisakhya uses the term mala (ibid.).

वर्णमाला is organised broadly on the basis of vowels (स्वर lit. 'voice') and consonants (व्यञ्जन lit. 'embellishment'), and the series of vowels and consonants are called स्वर वर्ण and व्यञ्जन वर्ण, respectively. The canonical order of स्वर वर्ण proceeds from each short vowel (ह्रस्व स्वर) to the corresponding long vowel (दीर्घ स्वर), followed by the diphthongs (संयुक्त स्वर/द्विस्वर). The names of vowels consist of their sounds, sometimes followed by कार (lit. 'maker'); thus अ is called अ-कार (ibid.). It is interesting to note that the Devanagari vowels other than अ, when following a consonant, are written with the मात्रा corresponding to each vowel, as in पा, पि, etc. In other words, compositionally पा is made up of प् + आ, and पि is made up of प् + इ.

व्यञ्जन वर्ण is primarily based on the place of articulation (उच्चारण-स्थान). Further, speech sounds are organised on the basis of manner (प्रयत्न) – stops (स्पर्श), nasal (नासिक्य), approximant (अन्त:स्थ), fricative (उष्म/संघर्षी); voicing (घोषत्व - घोष/अघोष) and aspiration (प्राणत्व - अल्पप्राण/महाप्राण).

The series of speech sounds belonging to स्पर्श and नासिक्य (lit. 'pertaining to the nose') (collectively called occlusives) are organised into वर्ग (lit. 'class') on the basis of their symmetric articulatory phonetic properties, and each वर्ग is usually known by its initial letter in the वर्णमाला. Such व्यञ्जन are called वर्गीय व्यञ्जन.

The fundamental principle of the Devanagari script is that each consonant carries an inherent schwa vowel (अमूर्त/निरपेक्ष स्वर) अ. This principle is graphically represented by a vertical bar called कन in all the consonants like क, ग, ट, etc., with the exception of र. For example, प् + अ = प.

Apart from the वर्ण, there is a group of characters which are collectively called diacritics (उपचिन्ह). In the articulation of speech sounds, it is observed that a nasal sound is assimilated to the following sound, sharing its place of articulation. This kind of regressive assimilation is called nasal homorganicity. Since the Devanagari is based on the articulation of the speech sounds, the convention is to write the respective nasal of each वर्ग before a consonant of that वर्ग. However, when followed by an अवर्गीय व्यञ्जन – य र ल व श ष स ह – the nasal is written as अनुस्वर (lit. ‘after-sound’) <ं> (शिर बिन्दु in Nepali). The following examples illustrate nasal homorganicity.

ङ in क-वर्ग : अङ्क

ञ in च-वर्ग : कञ्चन

ण in ट-वर्ग : कण्ठ

न in त-वर्ग : अन्त

म in प-वर्ग : चुम्बक

<ं> in अवर्गीय व्यञ्जन : वंश

The Devanagari स्वर वर्ण are inherently oral in nature; however, they can undergo nasalisation and consequently result in nasalised vowels. अनुनासिक ('after-nasalisation') (popularly चन्द्रबिन्दु in Nepali) <ँ> is used to represent a nasalised vowel (अनुनासिक स्वर), as in अँ from अ. विसर्ग (lit. 'discharge') < : > is used to represent non-syllabic ह. Since अनुस्वर, अनुनासिक and विसर्ग are not pronounced independently, and their pronunciation depends upon another sound, they are collectively known as अयोगवाह (lit. 'formed in union with') in Sanskrit.
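The convention for writing a nasal before another consonant, as listed in the nasal homorganicity examples, amounts to a small lookup. The following sketch is only an illustration of that convention: the function name `nasal_before` and the table layout are assumptions made here, and the table covers just the four non-nasal stops of each वर्ग and the eight अवर्गीय व्यञ्जन named in the text.

```python
HALANT = "\u094d"     # cancels the inherent vowel of the nasal
ANUSVARA = "\u0902"   # written before consonants that belong to no varga

# Each varga nasal mapped to the other four consonants of its varga.
VARGA = {
    "ङ": "कखगघ",
    "ञ": "चछजझ",
    "ण": "टठडढ",
    "न": "तथदध",
    "म": "पफबभ",
}
AVARGIYA = "यरलवशषसह"


def nasal_before(consonant):
    """Return the written form of a nasal that precedes the given consonant."""
    for nasal, members in VARGA.items():
        if consonant in members:
            return nasal + HALANT
    if consonant in AVARGIYA:
        return ANUSVARA
    raise ValueError("not a consonant covered by this sketch")


# Rebuild two of the examples: a nasal before क, and a nasal before श.
print("अ" + nasal_before("क") + "क")
print("व" + nasal_before("श") + "श")
```

The sketch describes the articulatory convention only; actual spelling practice, as the discussion of अनुस्वर and अनुनासिक in Hindi and Nepali shows, is far less consistent.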

An additional diacritic, a subscribed dot, nowadays popularly known as तल थोप्ली/थोप्ली (in Nepali; the Hindi equivalent is नुक्ता (from Arabic 'point')), is used to adapt an existing character to represent a similar sound. Such a convention is also found in Classical Sanskrit, as in य and य़ (see Cardona 2003). अवग्रह < ऽ > is used to indicate the elision or coalescence of a vowel as a result of sandhi, as in सदाऽऽत्मा (lit. 'the self, always') from सदा + आत्मा.

हलन्त (Sanskrit विराम lit. 'termination, end') <्> is employed to cancel or silence the inherent vowel of a consonant, and thus represents a consonant without a vowel. It is a slanting stroke drawn at the bottom right of a consonant, to be precise at the कन, turning प into प्.

Where the inherent vowel is cancelled, consonants are conjoined together, and such conjuncts are called ligatures (संयुक्त अक्षर). Some of these ligatures take a distinct graphical representation called a glyph, like क् + ष = क्ष; others follow linear expansion, as in त् + व = त्व and र् + य = र्‍य (an eye-lash र /परेली-र in Nepali); and some others follow vertical stacking, like ट् + र = ट्र, क् + र = क्र and र् + क = र्क (the superscribed hook is called रेफ in Nepali).
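The way हलन्त joins consonants into a conjunct can also be seen in how such text is stored as Unicode characters: the consonants are kept in sequence with the virama (U+094D) between them, and the choice of glyph, linear form or stack is left to the font. The helper names `conjunct` and `code_points` below are hypothetical, introduced only for this sketch.

```python
import unicodedata

HALANT = "\u094d"  # DEVANAGARI SIGN VIRAMA


def conjunct(*consonants):
    """Join consonants with a halant after every consonant except the last."""
    return HALANT.join(consonants)


def code_points(text):
    """List each character of a string with its code point and Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


ksha = conjunct("क", "ष")
print(ksha)
for line in code_points(ksha):
    print(line)
```

Whether the result is displayed as a distinct glyph, a linear sequence or a vertical stack depends on the font in use, which is why the same stored sequence can look different from one system to another.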

The Devanagari is written from left to right, and is recognizable by a distinctive horizontal line running along the tops of the letters, called headstroke (डिको in Nepali), that links them together as a word/token.

3. Akshar

In Indian philosophy, akshar is a conceptually laden term. Akshar (masculine neuter in Sanskrit) originally refers to the syllable. In its application to writing, akshar (अक्षर = अ + क्षर, meaning 'indelible') is a group of one or more glyphs (of characters) that form one unit in writing or printing (Bhaskararao 2003: 388). It is directly related to the glyph, and is understood to have an obligatory vowel ending (Salomon 2003: 70). Moreover, it is interesting to note that in the Devanagari, diacritics like अनुस्वर, अनुनासिक, विसर्ग, अवग्रह and हलन्त are not separate akshars but part of the akshar to which they are attached. The akshar is often referred to as a graphic syllable, although it does not necessarily share a one-to-one correspondence with the phonetic syllable, as shown below. It is, hence, inappropriate to equate akshar with syllable (Sproat 2000: 45).

Glyph string: अस्त्र

Character string: अ + स + ् + त + ् + र

Akshar: अ – स्त्र

Syllable: as-tra (slightly adapted from Bhaskararao 2003: 388)
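The step from character string to akshar in the example above can be approximated by a simple rule: a consonant or an independent vowel starts a new akshar unless the preceding character is a हलन्त, and every other sign attaches to the akshar before it. The sketch below implements only this rule. It is an assumption-laden simplification (the function names are invented here, letters outside the basic range U+0904 to U+0939 are not treated as akshar-initial, and a word-final consonant with हलन्त is returned as a unit of its own), not a full account of the akshar.

```python
HALANT = "\u094d"


def is_base(ch):
    """Independent vowels (U+0904-U+0914) and consonants (U+0915-U+0939)."""
    return "\u0904" <= ch <= "\u0939"


def akshars(word):
    """Group the characters of a Devanagari word into akshars."""
    units = []
    for ch in word:
        joins_cluster = bool(units) and units[-1].endswith(HALANT)
        if not units or (is_base(ch) and not joins_cluster):
            units.append(ch)      # start a new akshar
        else:
            units[-1] += ch       # matra, diacritic, halant, or clustered consonant
    return units


print(akshars("अस्त्र"))
```

For अस्त्र the function returns two units, अ and स्त्र, which matches the akshar division given above and differs from the phonetic syllabification as-tra.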

In the Indian grammatical tradition, the akshars that are listed in the varnamala are called mulakshar. The Classical Sanskrit varnamala has twelve mulakshar belonging to the स्वर वर्ण; and a series of व्यञ्जन mulakshars with the matras came to be known as बाह्रखड़ी, which is derived from barhaakshari (lit. 'twelve akshars'). It is important to note that बाह्रखड़ी is specific to the Classical Sanskrit varnamala though nowadays it is generically used as a term for consonant-matra combination for all other varnamalas too.

Devanagari as a writing system follows the akshar system (Salomon 2003: 71), as opposed to other systems like alphabetic, abjad, syllabary, alpha-syllabary (Bright 1996), logographic, and even abugida (Daniels 1996; see Bhaskararao 2003 for an opposing view). It is worth noting that one of the motivating factors behind the akshar system of writing was to aid the memorisation, recitation and reproduction of orally preserved texts. Hence, the akshar is not only a psychological and perceptual unit of the Indic writing system, but also a basis of grammar at the level of script.

4. Devanagari and Orthographies

Salomon (2003: 75) writes that “Nagari script as used for Sanskrit serves as the prototype for its application, with minor variations or additions, to other languages.” This explains the fact that, despite sharing the same script, the respective orthographies of Sanskrit, Hindi, Marathi, Bodo, Konkani, Dogri and Nepali differ from each other. The differences are due to the qualitative and quantitative characteristics of the speech sounds specific to each language.

The main characteristics of the Sanskrit, Hindi, Marathi, Bodo and Dogri orthographies are sketched below in a nutshell. Sanskrit shows the full range of characters, several of which did not surface later in other languages. It has 13 vowels and 33 consonants; ॡ, though not a phoneme of Sanskrit, is included to maintain short-long vowel symmetry (Salomon 2003: 75). ळ and ळ्ह, which are allophones of ड and ढ, respectively, in intervocalic position, are also part of the varnamala of Vedic Sanskrit (ibid.).

Salomon (2003: 75) observes that in Hindi, ॠ, ऌ and ॡ are omitted; न, ञ, ण and ष are retained in the varnamala but only for the tatsams; and ड़ and ढ़ are added, using नुक्ता to adapt an existing character to a similar sound from another language. In the course of time, नुक्ता has gained prominence in accommodating Arabic and Persian borrowings in Hindi, and क़, ख़, ग़, ज़, फ़ are a part of the Hindi orthography. On the use of diacritics in Hindi, Shapiro (2003: 257-258) mentions inconsistencies/interchangeability in the use of अनुस्वर and अनुनासिक. Marathi, like Hindi, excludes ॠ, ऌ and ॡ, but retains ळ of Vedic Sanskrit, which in common parlance is now identified as “Marathi ल”. It also has an eye-lash र as a glyph.

As recently as 1976, Bodo, a Tibeto-Burman tonal language with 6 vowels and 16 consonants, adopted the Devanagari as its main script (Baro 1996, 1990/2007). The Bodo orthography has < ' > as अ's matra, and ओ is used for the unrounded back vowel. However, tone (तान) is not marked; therefore, in the present-day orthography, जा is ambiguous between 'to eat' and 'to be', which are otherwise distinguished by high tone and low tone, respectively (Bridul Basumatary, p. c.). Dogri, a tonal Indo-Aryan language, has घ, झ, ढ, ध, भ, ढ़ in its orthography, but they are not pronounced as such; as a matter of fact, they are realised as tonal differences. सुर चिन्ह < ' > is used to mark the high falling tone in the Dogri orthography (Sunil Kumar, p.c.).

5. Nepali Orthography

The earliest evidence of Nepali written in the Devanagari is the anonymous बाज परीक्षा, which dates back to 943 A.D. (Pokhrel 2043 B.S.). Through all these centuries, the Devanagari has remained the main script for writing Nepali. However, with the advent of published Nepali grammars and textbooks, there are consistent inconsistencies in the Nepali orthography, particularly in the varnamala. In the most significant linguistic documentation of Indian languages, the Linguistic Survey of India (1891-1927), Grierson (1927) makes reference to the Nagari script but does not document the Nepali varnamala. Interestingly, he notes (ibid.: 21), “[T]he only peculiarity which occurs is the occasional use of the dots, thus < ^..> instead of <ँ> , as the sign of Anunasika or nasalisation” (< > is mine). A similar observation is also made by Pandit (2051 B.S.: 3) (see Appendix I). A corpus study by Acharya (1991: 70) mentions inconsistencies in the use of अनुस्वर and अनुनासिक in Nepali. A well-known dictionary, नेपाली वृहत् शब्दकोष (1983: 19-21), acknowledges the lack of standardisation of the Nepali orthography, and points towards different issues and debates (see Clark 1969). A brief survey of the Nepali varnas and their वर्णक्रम in the वर्णमाला exemplifies this very fact.

To summarise, Table 4 shows that the number of नेपाली स्वर वर्ण ranges from 6 to 16, and that there are 7 types of स्वर वर्णक्रम. Similarly, Table 5 shows a range of 26 to 39 व्यञ्जन वर्ण, and 9 types of व्यञ्जन वर्णक्रम (see Appendix II).

6. Aftermath

Though one can safely attribute the paradox to tradition, convention and approach (see Acharya 1991: 63-64), the observed variations and the lack of a standard Nepali varnamala have actually opened a Pandora's box. Among other consequences, in the realm of pedagogy and in actual practice, Nepali learners following different textbooks will never end up learning the same Nepali varnamala. Similarly, the need for a standard Nepali varnamala is relevant to developing the script grammar of the Nepali language, to meet the demands of modern technological advancement and its use in the emerging domains of language use.

At another level, in the Indian context, it is not just a distinct script which is a part of a language's identity (Perso-Arabic and Devanagari for Urdu and Hindi, respectively; see Masica 1991: 144), but also orthography, as witnessed in Hindi and Marathi despite their sharing the same script. In this context, it is imperative to mention that orthography contributes in a high degree to the formation of a sense of solidarity and of ethno-linguistic consciousness. Hence, as an effort towards carving out a distinct orthographic identity, Nepali is a language worthy of possessing a varnamala of its own, and indeed the Nepali varnamala, apart from its spelling system (see Turner 1931: xvii). Finally, a spoiler – the ideology, tradition, convention, trend and approach behind the (above-mentioned) Nepali varnamalas, and the contemporary cleavages, claims and contentions regarding the Nepali orthography, which certainly need further investigation, are a part of a sequel.

Appendix I

The following available texts are used (as numerically listed in the tables 4 and 5):

1. Ayton, Jas Alex. 1820. A Grammar of the Népalese Language. Calcutta.
2. Turnbull, A. 1923 (1982 3rd edn.). Nepali Grammar and Vocabulary. New Delhi: Asian Academic Services.
3. Pradhan, Parasmani & Pradhan, Rudramani. 1970. नयाँ साउँ अक्षर. Calcutta: Macmillan & Company.
4. Sharma, Radhakrishna. 1981. नेपाली सरल पाठ. Gangtok: Education Directorate, Govt. of Sikkim.
5. Sinha, Gokul. 1983 (2nd edn.). सरल नेपाली. Sonada, Darjeeling: Ramesh Bandhu Prakashan.
6. 1985 (11th edn.). माध्मिक नेपाली व्याकरण र रचना. Gangtok: Rashtriya Pustak Prakashan.
7. Pradhan, Bhai Chand & Pradhan, Manbahadur. 1987 (2nd edn.). सुगम नेपाली व्याकरण र निबन्ध रचना. Kalimpong: D. P. Upasak & Sons.
8. Sigdel, Somnath. 2050 B.S./1993 A.D. (23rd edn.). मध्यचन्द्रिका. Kathmandu: Sajha Prakashan.
9. Pandit, Gururaj Hemraj. 2051 B.S./1994 A.D. (2nd edn.). चन्द्रिका. Kathmandu: Sajha Prakashan.
10. Acharya, Jayaraj. 1991. A Descriptive Grammar of Nepali and an Analyzed Corpus. Washington, DC: Georgetown University Press.
11. Yonjon, Nainasingh. 2002. शिशुपाठ प्रथम भाग. Darjeeling: Shyam Prakashan.
12. Hutt, Michael & Subedi, Abhi. 2003. Teach Yourself Nepali. A complete course in understanding, speaking and writing Nepali. London: Hodder Headline.
13. Upadhyay, Tarapati & Upadhyay, Dron Kumar. 2005. आदर्श नेपाली व्याकरण. Udalguri, Assam.
14. Kumari, Shyamala B. & Sinha, Gokul. 2005. An Intensive Course in Nepali. Mysore: CIIL.
15. Nepal, Ghanashyam & Lama, Kavita. 2006. उच्च माध्मिक नेपाली व्याकरण र रचना. Siliguri: Ekta Book House.
16. Nepal, Ghanashyam & Parajuli, Pushkar. 2007. माध्मिक नेपाली व्याकरण र रचना. Siliguri: Ekta Book House.
17. Sarma, Khagen et al. (eds.). 2008. हाम्रो भाषा. Assam: Nepali Academic Council.
18. (Year of publication not available). हाम्रो वर्णमाला. Darjeeling: Shyam Prakashan.
19. Nepal, Ghanashyam & Rai, Jeena. (in press). नेपाली सरल व्याकरण. Siliguri: Ekta Book House.
20. Sharma, Shivraj. 2009. भाषावैज्ञानिक परिचय: नेपाली वर्णमालाका वर्णहरू र मेरा अन्य लेख. Darjeeling: Sriraj Prakashan.

Appendix II

Native Nepali terminology of स्वरवर्ण-मात्रा (Sharma 2009: 7-11).

अ साँउ अक्षर

आ कान्नानी/कान्दानी

इ बाइमात्रा

ई दाइँमात्रा

उ तलकुरे/तर्कुल्ले उ

ऊ बर्धने ऊ

ऋ रिकार

ए एकलख/एकखुट्टे ए

ऐ दोलख ऐ

ओ लखकानो ओ

औ दोलखकन्ना औ

अं शिरबिन्दे

आँ चन्द्रबिन्दे

अ: दबासबिन्दे

A popular Nepali Varnamala rendition (from Sarma et al. 2008) among others.

कपुरी क

खरायो ख

गाई गाड़े ग

घर जस्तो घ

मास गेड़ी ङ

चरीचुच्चे च

छाते छ

डाड़ु ज

खुट्टो झर्‍यो झ

गोरु सिङे ञ

ओठ काट्यो ट

ओठ मिलायो ठ

डाङडुङे ड

कुकुरपुछ्रे ढ

तीन धर्के ण

कोदाली त

घोरमुखा थ

दयेंली द

काँध लौरी ध

निहुरमुन्टे न

पाटी प

पिठ्यूँ बोकी फ

पेटकाट्यो ब

भकारी भ

राम्रो म

बूढ़ो य

खाँबे र

हात भाँचियो ल

बाटुलो व

मोटो श

पेट चिरो ष

पातलो स

हलिगोड़े ह

तल थोप्ली ड़

तल थोप्ली ढ़

छेपारी क्ष

दुई धर्के त्र

गाँठो पारी ज्ञ

P.S. The Nepali Varnamala Troupe (tiny tots and their instructors Subash Shanker, Bishal Sewa and Karma) enthralled the audience with their soulful Varnamala rendition in their own style at Rachna Books, Gangtok, Sikkim on February 26, 2010. “We want to make the children learn their school lessons in the form of music,” said Debasish Mothey. A similar show was organised at Rambi Primary School, Sikkim.

(Source: http://www.in.com/news/entertainment/fullstory-nepali-varnamala-troupe-makes-learning-fun-through-music-13062879-586d945285fea17780c6cefb6f6e5718a0fb0a89-1.html).

Acknowledgements

I am grateful to Prof. Kavi Narayana Murthy, Dr. Mallikarjun B, Dr. Gokul Sinha, Prof. Ghanshyam Nepal, Dr. Khagen Sarma, Jeena Rai, Umesh Chamling, Rupesh Rai, Bridul Basumatary and Sunil Kumar for their valuable comments on the earlier draft as well as for discussion on the issue. Needless to say, all errors are mine. I wish to dedicate this causerie to my first teacher, who taught me नेपाली वर्णमाला, Radhika Gurung (Pran Nath Nursery School, Kalimpong), whom we fondly call बड़ी आन्टी.

Selected Bibliography

Agarwala, V. S. 1966. The Devanagari Script. In Indian Systems of Writing. 12-16. Delhi: Publications Division.

Bhaskararao, Peri. 2003. Elements of Indian Indic Scripts. In Peri Bhaskararao (ed.), Working Papers of International Symposium on Indic Scripts: Past and Future. 382-391. Tokyo: ILCAA.

Baro, Madhu Ram. 1996. गोजौ रावखान्थि. Guwahati: Assam Higher Secondary Education Council.

Baro, Madhu Ram. 1991/2007. The Historical Development of Boro Language. Guwahati: N.L. Publications.

Bright, William. 1996. The Devanagari Script. In Peter T. Daniels & William Bright (eds.), The World's Writing Systems. 384-390. New York: Oxford University Press.

Cardona, George. 1994. Indian linguistics. In Giulio Lepschy (ed.), History of Linguistics. Vol 1. The Eastern Traditions of Linguistics. 25-60. London/New York: Longman.

Clark, T. W. 1969. Nepali and Pahari. In Thomas A Sebeok (ed.), Current Trends in Linguistics. Vol. 5. 249-276. The Hague/Paris : Mouton.

Coulmas, Florian. 1999 (ed.). The Blackwell Encyclopaedia of Writing Systems. Oxford: Blackwell.

Daniels, Peter T. 1996. The Study of Writing systems. In Peter T. Daniels & William Bright (eds.), The World's Writing Systems. 3-17. New York: Oxford University Press.

Gautam, Deviprasad. 2049 वि० स०. नेपाली भाषा-परिचय. काठमाडौं : साझा प्रकाशन

Grierson, George Abraham. 1916/68. The Linguistic Survey of India. Vol IX Part IV. Delhi: Motilal Banarasidass.

Kiparsky, Paul. 1991. Economy and the construction of the Sivasutras. In M. M. Deshpande and S. Bhate (eds.), Paninian Studies. Michigan: Ann Arbor. Available at http://www.stanford.edu/~kiparsky/

Masica, Colin P. 1991. The Indo-Aryan Languages. Cambridge: Cambridge University Press.

Murthy, Kavi Narayana. 2006. Natural Language Processing : An Information Access Perspective . New Delhi: Ess Ess Publications.

Naik, Bapurao S. 1971. Typology of Devanagari. Vol. 1-3. Bombay: Directorate of Languages.

Kapoor, Kapil. 2007. Auṁ: Akshar in Indian Thought. In P. G. Patel, Pramod Pandey & Dilip Rajgor (eds.), The Indic Scripts: Paleographic and Linguistic Perspectives. 1-8. New Delhi: DK Printworld (P) Ltd.

Pokhrel, Balkrishna. 2043 वि० स०. पाँच सय वर्ष. काठमाडौं : साझा प्रकाशन

Salomon, Richard. 2003. Writing systems of the Indo-Aryan Languages. In George Cardona & Dhanesh Jain (eds.), The Indo-Aryan Languages. 67-103. London: Routledge Language Family Series.

Shapiro, Michael C. 2003. Hindi. In George Cardona & Dhanesh Jain (eds.), The Indo-Aryan Languages. 250–285. London: Routledge Language Family Series.

Sinha, R. M. K. 2009. A Journey from Indian Scripts Processing to Indian Languages Processing. IEEE Annals of History of Computing. January-March. 8-31.

Sinha, Samar. ms. Nepali Varnamala: Emergence, Divergence & Convergence. LDCIL, CIIL, Mysore.

Sproat, Richard. 2000. A Computational Theory of Writing Systems. Cambridge: Cambridge University Press.

Sharma, Shivraj. 2009. भाषावैज्ञानिक परिचय: नेपाली वर्णमालाका वर्णहरू र मेरा अन्य लेख. Darjeeling: Sriraj Prakashan.

Sthapit, Shishir Kumar. 2003. Nepali Orthography: A Descriptive Analysis. In Peri Bhaskararao (ed.), Working Papers of International Symposium on Indic Scripts: Past and Future. 62-91. Tokyo: ILCAA.

Turner, Ralph Lilley. 1931. A Comparative and Etymological Dictionary of the Nepali Language. London: K. Paul, Trench, Trubner.

नेपाली वृहत् शब्दकोष .1983. काठमाडौं : नेपाली राजकीय प्रज्ञा प्रतिष्ठान

Sikkim Debt Law of 1910 http://www.digitalhimalaya.com/collections/rarebooks/

http://www.darjeelingtimes.com/dtnews/opinions/social/1262-2010-08-19-18-20-43.html