Thursday, June 30, 2016

Ceaseless rain
Tea and book

There are
no msgs
no chats
no calls
Not even
wind or cloud
to pass
across the ridge
There is
conversing within us.

Sunday, March 15, 2015

Gandhi on Valentine's Day 2015

Thursday, March 12, 2015

Safdar Hashmi Memorial Trust. Addressing Gandhi. 1995. New Delhi: SAHMAT. Pg. 190. Hb. 21 x 27.5 cms. ISBN 81-86219-22-6. Price: Rs. 450 (pb), Rs. 900 (hb)

Reviewed by Samar Sinha

Addressing Gandhi: 125 Years of Mahatma Gandhi is a bilingual book in English and Hindi comprising seven articles on Gandhiji along with over 100 sketches and photographs of him. The form of the book suggests a mere coffee-table book, but the contents elevate its stature, in both matter and spirit, to that of a man who is remembered even after 125 years and continues to have relevance in the world to come. The book was, in fact, brought out by SAHMAT to mark Gandhiji's 125th birth anniversary. Since Gandhiji's life and thought have had an enormous impact on the way we think today, Addressing Gandhi brings to light how this man is revered and remembered in various pursuits. The book deserves review, as the reader deserves to understand his significance in newer contexts. Moreover, the volume under review brings together different perspectives on Gandhiji.
The first article, titled 'Gandhiji', is by Irfan Habib, a well-known historian. It is a biographical sketch drawing on the events that shaped Gandhiji throughout his life. Based on Gandhiji's autobiography, Habib portrays the evolution of a baniya boy from Porbandar into a barrister from England; his realisation at Pietermaritzburg that he was a 'coolie-barrister'; and his subsequent transformations. In the following narrative, Habib mainly traces Gandhiji's political endeavours, with brief reference to his political and social thought. Of particular interest is Habib's sketch of Gandhiji's post-independence days. He highlights Gandhiji's efforts to douse the communal violence, his appeal and persuasion to pay Pakistan the promised sum, and his wish to develop a cordial relationship with the colonial cousin. Habib associates Gandhiji's assassin Nathuram Godse with the RSS and the Hindu Mahasabha, and leaves his reader to think when he concludes, '...he was not alone in the plot' (pg. 24).
In the second essay, 'Gandhi and India's Transition to Bourgeois Modernity,' Ravinder Kumar discusses issues that are essential in understanding past and present India. Drawing a parallel with the Sakyamuni, the author tries to understand modernity in India in the sense that '...Gandhi represents our entry into bourgeois modernity as a comprehensive social, political and cultural process' (pg. 29). Kumar highlights that Gandhiji's satyagraha drew diverse streams of thought and communities into a unified national movement. Although the literature does not credit Gandhiji with ushering in modernity within the country, Kumar argues that Gandhiji's concerns were rooted in Indian conditions rather than in the European experience. Equally, his concerns were for 'appropriate technology,' 'sustainable development,' the conservation of folk and popular culture and aesthetic traditions, and a pluralistic society. Moreover, Gandhiji stressed moderation and self-restraint for the effective functioning of institutions.
Suresh Sharma's essay 'Swaraj and the Quest for Freedom: Rabindranath Tagore's Critique of Gandhi's Non-Cooperation' focusses on the debate between the poet and the Mahatma on Swaraj and Swadeshi. Around the historical events, both parties build their arguments and counter-arguments over what each meant by Swaraj. Tagore and Gandhiji differed fundamentally in that the poet reposed implicit faith in the sheer power of the word, whereas the Mahatma held that the power of the word is modulated by a deep sense of imperfection inherent in human nature, and that truth needs to be affirmed. Sharma notes that the dialogue between them was not limited to a critique of each other but was also a self-critique.
In the following interview, in Hindi, Madhukar Upadhayay asks Ramchandra Gandhi about the contemporary relevance of Gandhi and his thoughts. Equally, the interview presses on the role Gandhi would have played (had he been alive) in the contemporary scenario related to Ayodhya.
One of the most important contributions to this volume is Nandalal Bose's essay titled 'Bapuji', which he dedicates as an offering to Gandhiji. In fact, the article is republished from the Visva-Bharati Quarterly (1984). It is a rare article on Gandhiji by the master who created an iconic lino-cut of Bapuji during his Dandi March, and who was a close associate of Bapuji. Bose gives an autobiographical account of Bapuji and of how his thoughts influenced the artist and gave meaning to his life. He recalls his meeting with Bapuji at the Congress session at Lucknow in 1936; Gandhiji's instruction to build Faizpur (Gram Congress) using only rural materials, country craftsmen and an indigenous conception; and Gandhiji's request for a miniature bamboo chariot. One of the lesser-known facets of Gandhiji is well explored by Bose in this essay: his love of art and the artistic heritage. Moreover, Bose narrates that Gandhiji loved music, and would have dedicated his life to music had he not had to fight the colonisers. In sharp contrast to machine-made things, which fail to satisfy the aesthetic need, Gandhiji championed the artistic urge. Bose dubs Bapuji '...a patron of artists' (pg. 131).

KG Subramanyan's 'Remembering Mahatma Gandhi' is an autobiographical exploration of Gandhi in the course of his life. In other words, he builds a mosaic of Gandhiji through what he is known as: a freedom fighter, one who restored a sense of dignity to a large number of people, one who won over the enemy without wiping them out. The author also dwells on Gandhiji's philosophy of the interdependence of the human individual, society and environment. Subramanyan also regarded Gandhiji as a national leader with a truly global perspective.
The final essay, 'Locating Gandhi in Indian Art History: Nandalal and Ramkinkar' by Tapati Guha-Thakurta, focusses on the art of these two masters. Ramkinkar Baij, a sculptor, was disappointed with Debiprasad's sculpture of Gandhiji, and wished to sculpt a Gandhiji full of life and movement, in open space, with volume, dimension and materials to experiment with. On the other hand, Nandalal Bose's association with Gandhiji as part of Gandhiji's political programme, and his iconic monochrome lino-cut, provide an interesting juncture where art, nationalism, artists and their subjects intermingle to create a narrative. Guha-Thakurta explores this narrative: Gandhiji in Indian art history as a charismatic motif, '...signifying certain ideological thrusts and motivations in nationalisms...redefining the very notions of 'art' and 'Indian-ness'' (pg. 141). Nandalal Bose's Bapuji (1930), the iconic lino-cut, constitutes a new national public art. Through his association with Gandhiji, Bose came to the forefront as an artist who fitted Gandhiji's political programme through public art. Nandalal Bose's 1938 images were a novel nationalist construction of the Indian panorama, stripped of classicism and enhanced with folkish, playful motifs. In her comparative study of the two masters, she concludes that Baij remains non-canonical whereas Bose has become canonical.
Addressing Gandhi is enriched with an index. The book, undoubtedly, is not only a novel way to mark Gandhiji's 125th anniversary but also a commemorative contribution to Gandhi art (or how he is represented), which has kept Gandhi alive as a subject for artists till date. The volume also includes sketches, paintings, photographs and collages by contemporary artists like Bulbul Sharma, Jogen Chowdhury, Adimoolan, Walter D'Souza, Jatin Das and Shuvaprasanna, to name a few. To put the matter short and direct, this volume is a welcome publication, and provides a platform to examine Gandhiji's conception of art, art as a political programme, and its relation with his thoughts on various aspects and facets of his personality and pursuits. Finally, the book emphasises Gandhi as the artist's subject, one of the least explored subjects in Gandhian studies.

Tuesday, November 30, 2010

Hampi 112010

Sunday, November 21, 2010

Nepali POS tagset: The Nelralec Tagset (Nt-01)

The Nelralec (Nepali Language Resources and Localization for Education and Communication) tagset for Nepali was developed by a team comprising the linguists Yogendra Yadava, Ram Lohani and Bhim Regmi, and Andrew Hardie, on the basis of the EAGLES guidelines for morphosyntactic annotation of corpora. The Nelralec tagset is fully hierarchical: in a tag such as VVYN1F, the initial letter V indicates the grammatical category, i.e. verb. The following V indicates that the verb is finite, and the letter Y indicates third person. The fully specified tag VVYN1F thus picks out a very tightly defined, narrow category: feminine singular non-honorific third person finite verbs, such as "chE".
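Read positionally, such a tag can be unpacked field by field. A minimal sketch, assuming the positional reading described above; the letter-to-value maps for the honorificity, number and gender positions are illustrative guesses, not the official code list:

```python
def decode_verb_tag(tag):
    """Unpack a fully specified finite-verb tag such as VVYN1F."""
    if not tag.startswith("VV"):
        raise ValueError("not a finite verb tag")
    # position 3: person (M = first, T = second, Y = third)
    person = {"M": "first", "T": "second", "Y": "third"}[tag[2]]
    # positions 4-6: honorificity, number, gender (assumed codes)
    honorific = {"N": "non-honorific", "M": "medial"}[tag[3]]
    number = {"1": "singular", "2": "plural"}[tag[4]]
    gender = {"M": "masculine", "F": "feminine"}[tag[5]]
    return {"category": "verb", "finite": True, "person": person,
            "honorific": honorific, "number": number, "gender": gender}
```

Decoding VVYN1F this way yields the "feminine singular non-honorific third person finite verb" reading given above.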
The tagset is compiled with respect to standard Nepali; hence, dialectal differences are not taken into consideration. Interestingly, the tagset has two main structural features that distinguish it from a standard grammatical analysis of Nepali, even though it is primarily based on previous analyses of Nepali grammar, for instance Acharya (1991). As a matter of fact, the tagset is conceived and developed as a model of Nepali grammar for the purpose of POS annotation. In other words, it is an abstraction designed to form a basis for the manual and automatic POS annotation of tokens.
First, a single graphical token which contains multiple elements is tokenised as separate tokens, i.e. the graphical unit is broken into several tokens, and each of them is annotated accordingly. The form which is disjoined from the start or end of another token and made into a separate token of its own is called a 'clitic' in this tagging scheme. The split token and the 'clitic' are both marked by the symbol #. To illustrate, the Nepali postpositions, which are preferentially written as affixes on the noun or other word that they govern, are treated as separate tokens in this scheme of analysis. This gives the tagset the flexibility needed to handle the very large array of potentially possible configurations of case markers. Second, tense, aspect and mood are not marked up on finite verbs, which are classified solely according to their agreement marking -- a necessary simplification for dealing with the complex verbal inflections of Nepali, which, together with the use of compound verbs, could not be indicated by the tagset without the use of thousands of additional categories.
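The splitting step can be sketched as a small pre-tokeniser. The postposition list below is a tiny illustrative sample (romanised), and the placement of the # marker follows the convention described above:

```python
# Illustrative sample of Nepali postpositions (romanised), not the
# full inventory used by the Nelralec scheme.
POSTPOSITIONS = ("lAI", "bATa", "le", "mA", "kO")

def split_clitics(token):
    """Split a graphical token into host and postposition 'clitic',
    marking both sides of the split with '#'."""
    for pp in POSTPOSITIONS:
        if token.endswith(pp) and len(token) > len(pp):
            host = token[:-len(pp)]
            return [host + "#", "#" + pp]
    return [token]
```

Each resulting token is then annotated on its own, which is what lets the scheme cover arbitrary stacks of case markers without multiplying tags.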
On the other hand, the treatment of compound nouns is very different from that of 'clitics'. In Nepali, compound as well as reduplicated words can be written in one of three ways, as shown below:

(22) chOrA chOrI (as two separate tokens) (lit. son daughter - 'children')

(23) chOrA-chOrI (with hyphen) (lit. son-daughter - 'children')

(24) chOrAchOrI (as a single token) (lit. sondaughter - 'children')

(22) will be tagged as two separate tokens. (23) and (24) are tagged according to the nature of the last element of the compound, i.e. the tag would be the same as that of "chOrI".
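The last-element convention for (23) and (24) can be sketched as a lexicon lookup; the lexicon is an illustrative stub, with NN as the common-noun tag mentioned below:

```python
# Illustrative lexicon stub: last elements and their tags.
LEXICON = {"chOrA": "NN", "chOrI": "NN"}

def tag_compound(token):
    """Tag a hyphenated or fused compound by its last element,
    as in examples (23) and (24)."""
    last = token.split("-")[-1]
    # longest match first, so fused compounds like chOrAchOrI resolve
    # to their final element
    for entry in sorted(LEXICON, key=len, reverse=True):
        if last.endswith(entry):
            return LEXICON[entry]
    return "UNK"
```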

Nouns are classified into two types: proper and common. Case and number endings are tokenised separately from the noun token; the former are treated as postpositions. A model of number and gender in Nepali was developed for the purpose of POS tagging. The gender markers in Nepali such as -O, -I and -A, as in "chOrO" (son, direct), "chOrI" (daughter) and "chOrA" (son, oblique), are ignored on nouns on two grounds. Firstly, these features on nouns are lexical-derivational features, and hence ignored. Secondly, the markers lack exact symmetric counterparts with regard to gender. For example, there is no masculine counterpart of a token ending with -I in a pair like AImAI (woman) -- marda (man). On the other hand, the same features on pronouns, adjectives, non-finite verbs, etc., where the distinction is motivated by agreement, are tagged accordingly. Even noun tokens with honorific markers like sara, sAheba, jyU, etc. are tagged as NN (common) or NP (proper). In the Nelralec tagset, postpositions are those clitics (as defined in this tagging scheme) that are detached from the noun token, like case markers, the plural suffix, etc. Similarly, Nepali classifiers are annotated separately.
Nepali adjectives, depending upon their morphological behaviour, are divided into five types. These types are primarily based on gender-number agreement: masculine singular, feminine singular, 'other' for masculine and feminine plurals, unmarked for undeclinable adjectives, and a common tag for both comparative and superlative adjectives.
As this POS tagset is developed as a model of Nepali grammar for POS annotation, pronominals are organised unlike in traditional/descriptive grammar. Pronouns are organised as personal and reflexive. The former are organised on the basis of person -- first, second, and 'other' for unspecified person -- and honorificity is marked on five levels (see Hardie et al. 2005). Interestingly, in Nepali the genitive case alters the phonetic form of the pronoun and cannot be separated as in the noun. Hence, it is treated as a single unit with a tag like PMXKM, i.e. pronoun, first person, unmarked for honorificity, possessive, masculine, for mErO/hAmrO. Similarly, the ergative/instrumental case markers are also inseparable from the pronoun.
The pronoun-determiner is organised under a separate tag, subdivided into demonstrative, interrogative, relative and general (the mnemonics for the interrogative and relative are labelled according to their forms in Nepali). As the pronoun-determiner functions both as a demonstrative and as a pronoun in Nepali, it is imperative to tag the tokens on the basis of the local phrasal context.
Nepali has a large number of TAM combinations, and if every possible combination were to be tagged separately, the tags would be unmanageably numerous. Therefore, in the case of a verb which has two verb roots but forms a single token, the Nelralec tagset follows the convention that the last identifiable verb is taken into consideration for annotation. For example, in "garnEcha" (do-subjunctive mood.BE.prs), "cha" is taken into account for annotating the verb token. However, two separate verb tokens will receive individual tags. Consequently, there is no distinction between main and auxiliary verbs in the tagset. Since the idea behind the Nelralec tagset is to accomplish POS annotation, certain aspects of Nepali verb morphology are ignored, viz. passive, causative and negative. These aspects of morphology are annotated as their counterparts, i.e. active, non-causative and positive, respectively.
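The last-identifiable-verb convention can be sketched as follows; the list of recognisable final verb forms is an illustrative stub, not the tagset's actual inventory:

```python
def last_identifiable_verb(token, final_forms=("thiyO", "cha", "hO")):
    """Return the verb form that determines a fused token's tag,
    per the last-identifiable-verb convention."""
    # longest form first, so e.g. 'thiyO' is not shadowed by a shorter form
    for form in sorted(final_forms, key=len, reverse=True):
        if token.endswith(form):
            return form
    return token  # single-root verb: the token itself is tagged
```

So "garnEcha" is annotated on the basis of "cha", exactly as described above, while two orthographically separate verb tokens would each be tagged on their own.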
Within the verbal domain, finiteness is distinguished on the basis of person marking: a verb with person marking is considered finite, as opposed to one without a person marker, which is non-finite. Under the non-finite verb forms, the participles like "gardO, gardI, gardA, gardai, garE, garnE, garEra", the subjunctive e-form like "garE" (note that it is phonetically the same as a participle), and the i-form, for instance "garI", are grouped accordingly. Similarly, command (imperative) verb forms are tagged separately according to honorific status.
In Nepali finite verbs, the distinction operates on person (first, second and third), number (singular and plural), gender (masculine and feminine) and honorificity (non-honorific and medial). Theoretically, 24 tags can be derived from the above; however, only 10 tags are required, since not all the combinations of these morphosyntactic features have separate forms in Nepali. Interestingly, separate tags are designed in this tagset for optative verbs, as they behave differently in many ways from the other finite verbs.
In the Nelralec tagset, the mnemonics of the tag elements are schematised according to the Nepali forms, like M for first person (after "ma" (I)) and T for second person (after "timI" (you)). Interestingly, there is no uniform scheme for organising tags on the basis of their types and attributes. For example, nouns are NN and NP, showing category and type (common and proper, respectively). Conversely, adjectives are distinguished as JM, JF, JO, JX and JT on the basis of the morphosyntactic attributes of gender and degree (see Hardie et al. ibid.: 5-11 for details of other categories). In other words, the Nelralec tagset assumes underspecification of both types and attributes among its 112 tags.

[From: Mallikarjun, B., Yoonus, M., Sinha, Samar & Vadivel, A. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Languages: pp. 22-25. ISBN 81-7342-197-8]

Issues in POS Tagset Design

In the initial phase of POS tagset development for NLP purposes, tagsets were designed and developed from the machine-learning point of view rather than the linguistic point of view. Under such considerations, a language is treated simply as a sequence of tokens to be associated with a given set of tags. In formal terms, a set of strings over Σ (i.e. any subset of Σ*) is called a formal language over Σ. The role of linguistic knowledge in designing tagsets was thus neglected.
However, with the growing realisation that linguistic knowledge is essential in any work on language, the issues involved in designing a POS tagset are now discussed from linguistic perspectives too, as these issues have wide implications for the annotation of linguistic data and for the resultant output and the applications based on it.
In this section, the following conceptual design issues are discussed with relevant illustrations from Indian languages.

1. Theoretical Background
In the development of a new tagset, the developers analyse linguistic data in light of a particular linguistic theory that they advocate. The development of a tagset, therefore, is not theory-independent or theory-neutral, however much one may wish it to be, because of conflicting theoretical assumptions. Consequently, those assumptions play an important role in deciding many other aspects of tagset design.
However, it is also possible that the developers are application-oriented rather than linguistic-theory-oriented. For example, machine-learning researchers using a POS tagged corpus for their experiments are primarily concerned with machine-learnable tagging rather than with a specific linguistic theory, and will develop their POS tagsets accordingly. This view has, in fact, dominated the development of POS tagsets to a large extent.
English being the first language of corpus linguistics, the grammatical frameworks chosen to describe its POS were Generalised Phrase Structure Grammar and Lexical Functional Grammar, which promoted the notion that a category is composed of a bundle of features. In the Indian-language POS tagset scenario, the IIIT-Hindi tagset and the Telugu tagset developed by CALTS, Hyderabad (Sree R. J. et al. 2008) are based on the Paninian perspective (for details see section 6). Even so, it is desirable that a tagset not be theory-laden, while still supporting linguistic analysis.
2. Form and Function
One of the major decisions that a tagging scheme needs to resolve is the choice between tagging the form and tagging the function of a token in the text. As a given word/token may function differently in different syntactic contexts, it may be assigned different tags depending upon function rather than form. Such cases, however, pose a computational complexity for automatic tagging, since more than one tag is given for the same form with different contextual syntactic functions. On the other hand, two syntactic functions of a token/word may be assigned a single tag on the basis of its form. This, in turn, leads to information loss.
To maintain a firm decision between form and function, different approaches to POS tagging have been adopted, and each approach has an underlying assumption to validate the decision. To illustrate such an assumption: a token is POS tagged on the basis of form rather than function in AnnCorra (Bharti et al. 2006). This decision is based on the priority that it eliminates the choices involved in manual tagging and establishes a token-tag relation which leads to efficient machine learning. In contrast to AnnCorra, the Stuttgart-Tübingen Tag-Set (STTS) for German (Atwell ms.) makes a linguistically motivated distinction between attributive and predicative adjectives. There are also approaches with a division of labour between form and function across the hierarchy. The ILPOSTS-based Hindi tagset developed by MSRI is one such tagset, which takes morphosyntactic form into account for assigning attribute-values (the lowest level of the hierarchy), and function for annotating the Type (the mid-level of the hierarchy).
Knowles & Don (2003) devised another approach for Malay, a language in which words change their function according to context. For example, "masuk" is a verb in one context but a noun, 'entrance', in the context of a building, car park, etc. Acknowledging this linguistic fact, Knowles & Don's tagset for Malay separates lexical class or form from syntactic function, and gives each word in the lexicon only one class-tag. They use the term 'tag' to label a lexical class, and 'slot' to refer to a position in syntactic structure in Malay (see Atwell ms.: 19).
Yet another view on this dichotomy is expressed as follows. To illustrate the form-function dichotomy, "maaDi" in Kannada is ambiguous between a plural imperative and a past verbal participle. A tagger needs to resolve such ambiguity through context. However, there is no need to consider those distinctions which are entirely within the scope of syntax. For example, syntax allows, as a general, universal rule, that nouns can act as adjectival modifiers. This rule is very much a part of any syntactic system. Hence, a tagger need not tag a noun as an adjective because of its function: this is unnecessary, and it adds to the complexity of machine learning (Kavi Narayana Murthy, p.c.).
This view asserts that ambiguity arising out of form needs to be disambiguated at POS level provided there are no syntactic rules to account for its function. In other words, POS tagging is primarily based on form, and function is a secondary concern of tagging to be carried out as a last resort for disambiguation.
3. Granularity: Coarse Vs. Fine
One of the important concerns in developing a tagset for a language is granularity: coarseness versus fineness, referring respectively to broad and finer annotation of any grammatical category. The aim of corpus annotation is to maximise information content so that the tagged corpus can be used for a variety of applications. But as a matter of fact, the applications are not known in advance; hence, the level of linguistic annotation required is also unknown. General corpus developers, as a principle, prefer to maximise linguistic enrichment by designing the tagset in such a way that the annotation can be customised according to the needs of the application.
In POS tagset design, there are two schemes for granularity. Coarse annotation has far fewer tags than fine-grained annotation, and aids both higher accuracy in manual tagging and efficient machine learning. Despite such advantages, a coarse-grained POS tagset is of less use, as it does not capture much relevant information about POS. A finer annotation, on the other hand, provides a very large amount of information but also creates a problem for automatic tagging, as it maximises the tag options for a given token, leading to computational complexity.
In view of the above-mentioned advantages and disadvantages, an ideal POS tagset design strikes a subtle balance. However, it is important to remember that not all linguistic information can be annotated at the POS level, just as not all linguistic information can be recovered from other levels of annotation. As a rule of thumb, it is imperative to capture optimal information at this level of annotation. In other words, the POS design has to be such that coarse as well as fine information can be retrieved as per the needs of the application.
In this context of granularity, a hierarchical architecture provides an edge over a flat architecture, as it allows information to be modularised accordingly. This is usually conceived along the levels of the hierarchy: the deeper the level, the finer the features encoded. A flat tagset, on the other hand, may be too coarse or too fine, and may lose relevant information in the POS tagged corpus.
The Text Analytics and Natural Language Processing (Tanl) tagset (Attardi & Simi (ms.)), used for EVALITA09 POS tagging, is one such tagset designed for both coarse- and fine-grained annotation. It consists of 328 tags and provides three levels of POS tags: coarse-grain, fine-grain and morphed tags. The coarse-grain tags consist of 14 categories; the fine-grain tags number 36, with distinctions such as indefinite, personal, possessive, interrogative and relative among the pronouns; and the morphed tags consist of 328 categories, which include morphological information like person, number, gender, tense, mood and clitic.
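The three-level idea can be sketched with tags whose string structure encodes the hierarchy, so the coarse and fine readings are recoverable from the morphed tag by truncation; the tag shapes below are illustrative, not the actual Tanl inventory:

```python
def fine(morphed_tag):
    """Drop the morphological features: 'PR:3sm' -> 'PR'
    (illustratively, a relative pronoun)."""
    return morphed_tag.split(":")[0]

def coarse(morphed_tag):
    """Keep only the category letter: 'PR:3sm' -> 'P' (pronoun)."""
    return fine(morphed_tag)[0]
```

A corpus annotated once at the deepest level can then serve applications that need only the coarse or fine grain, which is the practical advantage of the hierarchical architecture noted above.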

4. Orthographic Conventions
One of the major issues that one faces in designing a tagset is accounting for orthographic practices that are beyond the known linguistic principles of categorisation. It is a known linguistic fact that meaning need not be expressed by a single token; it may be expressed by a group of tokens. Such a linguistic unit has come to be known as a multi-token word (MTW) (also commonly known as a multiword expression in the computational literature). For example, the complex postposition के लिए in Hindi collectively expresses the single meaning of benefaction/purposive (as a case marker). In isolation, के is a masculine genitive case marker and लिए has no semantic content. Ideally, therefore, के लिए can be tagged in one of three ways:

(4) के\ and लिए\ as two separate POS labels (though tag for लिए is an issue).
(5) [के\ लिए\]\ as a single complex postposition with two different POS labels.
(6) [के लिए]\ as a complex but a single POS label.

Assigning POS labels to the different tokens of a single lexical word is one of the major decisions that a tagset design has to take firmly. It is often the case that such issues are tagged ad hoc/arbitrarily at the POS level of annotation, and are resolved at a higher level, like local token grouping/chunking, where a group of tokens is assigned a single tag.
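The chunking-style resolution can be sketched as a second pass that merges known MTW token sequences under one label; the MTW table and its label are illustrative assumptions, romanised here for readability:

```python
# Illustrative MTW table: the complex postposition ke lie (के लिए)
# chunked under a single hypothetical label.
MTWS = {("ke", "lie"): "PSP"}

def chunk_mtws(tokens):
    """Merge adjacent tokens that form a known MTW into one chunk
    carrying a single tag; other tokens pass through untagged."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MTWS:
            out.append((" ".join(pair), MTWS[pair]))
            i += 2
        else:
            out.append((tokens[i], None))
            i += 1
    return out
```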
Apart from MTWs, contractions pose a major issue with respect to tokens and linguistic annotation. In contrast to MTWs, contractions are orthographic forms that are shortened from the usual form, reflecting the spoken form while partially retaining the usual orthographic form. For example, in Nepali, भा'थ्यो or भा'-थ्यो is the contracted form of भएको थियो. The contracted form भा is a contraction of the participial भएको, which is different from the dubitative particle भा. Similarly, थ्यो is a contracted form of थियो.
There are two known approaches to tackling this orthographic convention. The first considers the form as an orthographic convention reflecting the spoken form of two known distinct tokens. The contracted forms are therefore pre-processed and tokenised as two separate tokens, after separating punctuation markers from them, and tagged accordingly. To illustrate with the case mentioned above, भा will be tagged as a participial and थ्यो as a verb, assuming them to be alternative orthographic forms of their respective categories. Alternatively, the contracted form is considered a single token reflecting a linguistic reality in the mind of the speaker/author, and the token is tagged in accordance with this language use.
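The first (pre-processing) approach can be sketched as a lookup that expands contractions into their full token sequences before tagging; the table below holds just the example above:

```python
# Expansion table for contracted forms (the one example from the text).
CONTRACTIONS = {"भा'थ्यो": ["भएको", "थियो"]}

def expand(token):
    """Expand a contracted token into its full token sequence;
    non-contractions pass through unchanged."""
    token = token.replace("'-", "'")  # normalise the hyphenated variant
    return CONTRACTIONS.get(token, [token])
```

Under the second approach, this pass would simply be omitted and भा'थ्यो would be tagged as a single token.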

5. Computational Complexity
One of the important functions of POS tagging is to resolve category-level ambiguity. Paradoxically, in practice there remain many cases where ambiguity is unresolved or only partially resolved even after POS tagging, and becomes a source of ambiguity for further processing. In this context, it is important to remember that these ambiguities concern the token-tag relation rather than semantic or structural ambiguity.
One of the most commonly cited examples is case syncretism, where the same form of a marker is used for different cases. For example, the Hindi dative and accusative case markers have the same phonetic/orthographic form, को. In a form-based approach, let us assume that को is consistently assigned the dative tag, irrespective of the linguistic contexts in which it is accusative. In the process, this results in a loss of linguistic information: that को is also an accusative case marker in Hindi. This approach facilitates ease of POS tagging for a machine-learning algorithm, but the resultant output loses relevant linguistic information. Though such an approach solves the issue ad hoc at the POS level of annotation, its result needs to be recategorised and reassigned the appropriate tag in association with other levels of annotation, like semantic tagging, in order to regain the lost linguistic information, which is significant for higher-level processing.
In a function-based approach, though it demands that the annotator distinguish each case and tag accordingly, which in turn adds cognitive load on the annotator (see section 5.6), each piece of linguistic information is tagged appropriately despite the similar forms. For the machine, however, it is a more difficult task to distinguish the POS tags -- technically, to disambiguate them -- as there is no formal cue to distinguish the two (see Bhat & Richa (ms.) for a detailed discussion of the issue). Thus, the system requires other tools and techniques to disambiguate, adding to computational complexity.
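The trade-off can be made concrete with two toy taggers for को (romanised "ko" here): the form-based one is trivially learnable but loses the dative/accusative distinction, while the function-based one must inspect context. The tag names and the context heuristic are illustrative assumptions, not a real disambiguator:

```python
def tag_ko_form_based(context):
    """Always dative: easy to learn, but accusative uses are mis-tagged,
    losing linguistic information."""
    return "PSP_DAT"

def tag_ko_function_based(context, transitive_verbs=("dekhA", "mArA")):
    """Toy stand-in for real disambiguation: call it accusative when a
    (sample) transitive verb appears in the context."""
    if any(v in context for v in transitive_verbs):
        return "PSP_ACC"
    return "PSP_DAT"
```

The second tagger recovers the distinction but needs exactly the extra machinery (context inspection, lexical resources) that the text identifies as the source of computational complexity.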
As a matter of fact, these approaches represent a tug-of-war between detailed linguistic tagging on the one hand and, on the other, reduced cognitive load on the annotator and/or ease of subsequent automatic tagging. In an ideal tagging scheme, these two aspects are balanced finely so that the scheme remains optimal with respect to its design and the various processes at both the manual and the machine level. Therefore, it is imperative to validate a POS tagset at, and across, the various NLP processes in order to achieve computational as well as manual optimality.

6.  Cognitive Load on Annotator
One of the major objectives of corpus linguistics is to design taggers which minimise the human labour of annotating text. Such an automatic tagger, however, requires linguistic knowledge. Ideally, an automatic tagger is "trained" by giving it the results of a manually annotated corpus, also called a "training corpus." It is on the basis of the training corpus that the automatic tagger gains linguistic knowledge, in association with machine-learning techniques.
With respect to POS tagging, an automatic tagger is trained to acquire the knowledge to establish a tag-token relationship. The tagger acquires this knowledge from the training corpus, which is manually POS annotated. This, in turn, establishes a workflow in which manual annotation forms the backbone of all kinds of annotation for NLP tasks.
Given the importance of manual annotation, and of POS annotation specifically, for NLP tasks, it is important to ensure that manual POS annotation is as close to error-free as possible. Since manual tagging is a tedious process, it is always desirable to reduce the tagging load on the annotator to ensure such a standard. It is desirable that the annotation process be simple, intuitive and comfortable, so that the cognitive load on the annotator is reduced as far as possible. The first requisite is to make the user comfortable with the GUI-based tool. The look and feel of the tool can be customised so that it provides an environment in which the user can work comfortably.
To reduce the cognitive load on the annotator, the tool can be designed to reduce the number of manual annotation decisions, which in turn minimises human error in tagging. For example, in Nepali, the Direct case has the value "0" for Case Marker, whereas the Oblique case takes a morphological Case Marker from the given values. The tool needs to be programmed in accordance with such linguistic facts, so that value assignment for the Direct case happens automatically while for the Oblique case it is carried out manually. As a consequence of such a filtering program, the chances of error with respect to the Direct case are reduced. The tool therefore needs to be flexible enough to be customised with filters that accommodate language-specific tagging facts when tagging data from many languages.
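Such a filter can be sketched as follows. The value symbols ("d", "o", "0") and the sample marker "le" are illustrative assumptions, not the actual encoding of the tool described:

```python
def assign_case_marker(case_value, marker=None):
    """Filter sketch: the Direct case gets Case Marker "0" automatically;
    the Oblique case requires the annotator to supply the marker."""
    if case_value == "d":   # Direct case: no manual intervention needed
        return "0"
    if case_value == "o":   # Oblique case: annotator enters the marker
        if marker is None:
            raise ValueError("Oblique case: marker must be entered manually")
        return marker
    raise ValueError(f"unknown case value: {case_value}")

print(assign_case_marker("d"))        # assigned automatically by the tool
print(assign_case_marker("o", "le"))  # entered manually by the annotator
```

Because the Direct-case branch never reaches the annotator, one whole class of potential tagging errors is eliminated by construction.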
It is also desirable to annotate finite lists of items automatically. For example, punctuation markers form a finite set, and the tool can be designed to tag them automatically, reducing the iterations that a manual annotator would otherwise have to carry out.
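A pre-tagging pass over such a finite list might look like the sketch below; the tag label "PUNC" and the membership of the list are assumptions for illustration:

```python
# Finite list of items the tool can tag without human intervention.
PUNCTUATION_TAGS = {".": "PUNC", ",": "PUNC", "?": "PUNC", "!": "PUNC", ";": "PUNC"}

def pre_tag(tokens):
    """Tag members of the finite punctuation list automatically;
    everything else (None) is left for the human annotator."""
    return [(tok, PUNCTUATION_TAGS.get(tok)) for tok in tokens]

print(pre_tag(["Rain", "fell", ",", "ceaselessly", "."]))
# [('Rain', None), ('fell', None), (',', 'PUNC'), ('ceaselessly', None), ('.', 'PUNC')]
```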
The development and incorporation of such heuristic as well as linguistic facts into the tool, primarily based on the POS tagset, can help ease the cognitive load on the annotator and so ensure the zero-error standard.
[From: Mallikarjun, B., Yoonus, M., Sinha, Samar & Vadivel, A. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Languages: pp. 7-13. ISBN 81-7342-197-8]

POS Annotation Vis-A-Vis Corpus Annotation

Annotation is the process of ascribing grammatical categories to the tokens/words of a corpus. The Text Encoding Initiative (TEI) makes corpus annotation reader friendly and suggests universal grammatical categories for annotation, enabling corpora to be stored and transferred. Moreover, TEI uses Standard Generalised Mark-up Language (SGML), an ISO 8879 standard technology for defining generalised mark-up languages for documents, for text encoding and annotation; more recently, XML has been adopted. This makes it possible to encode any textual resource in a manner that is hardware, software, and application independent.
Leech (1993) describes seven maxims for annotation of text corpora:
  1. Reversibility: Annotation should be removable, so that the annotated corpus can be reverted to the raw corpus.
  2. Extractability: Annotations should be separable from the corpus text.
  3. Reader Friendliness: Annotation should be reader friendly.
  4. Maker Explicitness: It should be made clear to the corpus user how the annotation, whether manual or automatic, was carried out.
  5. Potentiality: Annotation is a potentially useful representation rather than an absolute one.
  6. Mentality: Annotation should be theory independent.
  7. Non-Standardness: No annotation scheme is regarded as the a priori standard. Standards emerge through practical consensus, and the set of corpus tags will very likely be revised many times in due course, in order to find an optimal set for each language.
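The first two maxims, reversibility and extractability, can be illustrated concretely. The `word/TAG` format below is one common annotation convention assumed here for the sake of the example:

```python
def strip_annotation(annotated):
    """Reversibility: recover the raw text from a word/TAG annotated line."""
    return " ".join(tok.rsplit("/", 1)[0] for tok in annotated.split())

def extract_tags(annotated):
    """Extractability: separate the annotations from the corpus text."""
    return [tok.rsplit("/", 1)[1] for tok in annotated.split()]

line = "the/DET dog/N_NN barks/V_VM"
print(strip_annotation(line))  # "the dog barks"
print(extract_tags(line))      # ['DET', 'N_NN', 'V_VM']
```

An annotation scheme that cannot be stripped or extracted in this way fails Leech's first two maxims by design.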
As a design measure, it is widely accepted that a POS tagset should not include any derivational, etymological, syntactic, semantic or discourse information (Hardie 2003). The composition of tags, however, does have its significance in annotation. Leech (1997) suggests the following criteria for labelling tags:
  1. Conciseness: It is more convenient to use concise labels than verbose, lengthy ones: for example, "mas" rather than "masculine".
  2. Perspicuity: Interpretable labels are more user friendly than those which are not. Cloeren (1999) writes, "For reasons of readability there is a preference for mnemonic tags. Full-length names may be clearer individually, but make the annotated text virtually unreadable." For example, "NMZ" is more easily interpreted as a nominaliser than "NML".
  3. Analysability: Decomposable labels are friendly to the human annotator as well as to the machine.
  4. Compositionality: A tag needs to be logically composed as a string of symbols representing levels of taxonomic categories. For example, a tag in Hindi encodes the Category Noun, the Type Common, and the Attributes Gender, Number, Case, Case Marker, Distributive and Honorificity, along with their values.
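Compositionality and analysability are two sides of the same coin: a tag composed as an ordered string of level symbols can be mechanically decomposed again. The field order, separator and value symbols below are illustrative assumptions, not the actual Hindi tagset:

```python
def compose_tag(category, type_, attributes):
    """Compose a tag as a string of symbols ordered by taxonomic level:
    Category, Type, then attribute values (e.g. Gender.Number.Case)."""
    return ".".join([category, type_] + list(attributes))

def decompose_tag(tag):
    """Analysability: the same tag splits back into its levels."""
    parts = tag.split(".")
    return {"category": parts[0], "type": parts[1], "attributes": parts[2:]}

tag = compose_tag("N", "NC", ["m", "sg", "d"])  # common noun, masc., sg., direct
print(tag)                # "N.NC.m.sg.d"
print(decompose_tag(tag))
```

Both the human annotator and the machine can thus read a single concise label while still having access to every taxonomic level it encodes.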
Leech & Smith (1990: 27) point out that syntactic parsing is arguably the central task of NLP, and that POS tagging, being a prerequisite to parsing, is "the most central area of corpus processing." Though a POS tag has a limited scope of syntactic disambiguation, it shares a network of relationships with various other intermediate tasks in constructing an optimal system. In the architecture of corpus annotation, different levels of annotation feed POS tagging, and vice versa. POS tagging being a mid-level NLP process, it ideally can and should make use of lexical-level processes, and should yield the results required for syntactic parsing. Both of these processes, therefore, are to be considered in designing POS tagsets.
Bharati et al. (2006), among others, show that features from a Morph Analyser can be used to enhance the performance of a POS tagger. In fact, they argue that the Morph Analyser itself can identify the part of speech in most cases, and that a POS tagger can be used to disambiguate the multiple answers provided by the Morph Analyser. In turn, POS-tagged data is used for other, higher-level processes like chunking, parsing, etc.
A similar view on POS tagging is expressed by Kavi Narayana Murthy (p.c.), who considers POS annotation a mid-level process depending on lexical-level annotation and processes. In his view, the Lexicon can, in a sense, be considered a tagger in that it tags the root forms of words/lemmata with 'all possible' tags. The Morph Analyser deals with inflected and/or derived forms of words and assigns 'all possible tags' to all valid forms of all valid words in a given language. Given this, a tagger that tags words/tokens in running text can be viewed as a disambiguator rather than as an assigner of tags. All possible tags have already been assigned by the Lexicon/Morph Analyser, and the task of POS-level annotation is only to eliminate, or at least reduce, ambiguities if any. Further, he opines that this approach to POS tagging has several advantages:
  1. Impossible tags are never assigned.
  2. Words which have only one possible tag need not even be considered, only ambiguous cases need to be considered by the tagger.
  3. The degree and nature of ambiguities can be studied, both at the root word level (from the Lexicon) and at the running text level (from a tagged corpus).
  4. Not all ambiguities are of the same nature. Different strategies can therefore be formulated to handle them: for example, some kinds of ambiguities are easily resolved using local context, while others may inherently require long-distance dependencies to be considered.
  5. In the context of Indian languages, he emphasises that Indian languages are morphologically very rich, and that a Morph Analyser is the most essential component for processing them: it can substantially reduce the ambiguities from the Lexicon, though it can also introduce some ambiguities of its own. But the ambiguities introduced by the Morph Analyser are always uniform and fully rule-governed. This helps in designing a judicious combination of linguistic (rule-based/knowledge-based) and statistical/machine-learning approaches.
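The tagger-as-disambiguator view can be sketched as follows. The lexicon entries, tag labels and the single local-context rule are toy assumptions for illustration, not Murthy's actual system:

```python
# Hypothetical Lexicon/Morph Analyser output: every token already
# carries ALL its possible tags.
LEXICON = {
    "book": ["N_NN", "V_VM"],   # ambiguous: noun or verb
    "the": ["DET"],             # unambiguous: never reaches the tagger
    "read": ["V_VM"],
}

def disambiguate(candidates, previous_tag):
    """Toy disambiguator: only ambiguous tokens are considered;
    a determiner on the left favours the nominal reading (local context)."""
    if len(candidates) == 1:              # unambiguous words skip the tagger,
        return candidates[0]              # and impossible tags are never assigned
    if previous_tag == "DET" and "N_NN" in candidates:
        return "N_NN"                     # local-context rule
    return candidates[0]                  # fallback for unresolved ambiguity

def tag_sentence(tokens):
    tags, prev = [], None
    for tok in tokens:
        chosen = disambiguate(LEXICON[tok], prev)
        tags.append((tok, chosen))
        prev = chosen
    return tags

print(tag_sentence(["the", "book"]))
# [('the', 'DET'), ('book', 'N_NN')]
```

Note how the first two advantages fall out of the structure: an impossible tag can never be emitted because only lexicon-supplied candidates are considered, and unambiguous tokens bypass the disambiguation logic entirely.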
[From: Mallikarjun, B., Yoonus, M., Sinha, Samar & Vadivel, A. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Languages: pp. 2-4. ISBN 81-7342-197-8]