Tuesday, November 30, 2010

Hampi 112010

Sunday, November 21, 2010

Nepali POS tagset: The Nelralec Tagset (Nt-01)

The Nelralec (Nepali Language Resources and Localization for Education and Communication) tagset for Nepali was developed by a team comprising the linguists Yogendra Yadava, Ram Lohani, Bhim Regmi and Andrew Hardie, on the basis of the EAGLES guidelines for morphosyntactic annotation of corpora. The Nelralec tagset is fully hierarchical: in a tag such as VVYN1F, the initial letter V indicates the grammatical category, i.e. verb; the following V indicates that the verb is finite; and the letter Y indicates third person. The fully specified tag VVYN1F thus indicates a very tightly defined, narrow category - feminine singular non-honorific third person finite verbs, such as "chE".
The tagset is compiled with respect to standard Nepali; hence, dialectal differences are not taken into consideration. Interestingly, the tagset has two main structural features that distinguish it from a standard grammatical analysis of Nepali, even though it is primarily based on previous analyses of Nepali grammar, for instance Acharya (1991). As a matter of fact, the tagset is conceived and developed as a model of Nepali grammar for the purpose of POS annotation. In other words, it is an abstraction designed to form a basis for manual and automatic POS annotation of tokens.
First, a single graphical token which contains multiple elements is split into several tokens, and each of them is annotated accordingly. A form which is detached from the start or end of another token and made into a separate token of its own is sometimes called a 'clitic' (in this tagging scheme). Both the token that has been split and the 'clitic' are marked by the symbol #. For example, the Nepali postpositions, which are preferentially written as affixes on the noun or other word that they govern, are treated as separate tokens in this scheme of analysis. This gives the tagset the flexibility needed to handle the very large array of possible configurations of case markers. Second, tense, aspect and mood are not marked up on finite verbs, which are classified solely according to their agreement marking -- a necessary simplification for dealing with the complex verbal inflections of Nepali, which, together with the use of compound verbs, could not be indicated by the tagset without the use of thousands of additional categories.
On the other hand, the treatment of compound nouns is very different from that of 'clitics'. In Nepali, compound as well as reduplicated words can be written in one of three ways, as shown below:
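The positional reading of a hierarchical tag can be sketched in a few lines of Python. This is only an illustration: the text states the meaning of the first three positions (V, V, Y); the reading of N, 1 and F as non-honorific, singular and feminine is an assumption inferred from the gloss of VVYN1F, and the names POSITIONS, decode and generalise are hypothetical.

```python
# Positional reading of a hierarchical Nelralec-style tag.
# Positions 1-3 follow the text; positions 4-6 are assumptions
# inferred from the gloss of VVYN1F.
POSITIONS = [
    ("category", {"V": "verb"}),
    ("finiteness", {"V": "finite"}),
    ("person", {"Y": "third"}),
    ("honorific", {"N": "non-honorific"}),  # assumption
    ("number", {"1": "singular"}),          # assumption
    ("gender", {"F": "feminine"}),          # assumption
]


def decode(tag):
    """Return the feature named by each successive character of the tag."""
    features = {}
    for char, (name, values) in zip(tag, POSITIONS):
        features[name] = values.get(char, "unknown(" + char + ")")
    return features


def generalise(tag, depth):
    """Truncate a tag to its first `depth` positions, giving a coarser category."""
    return tag[:depth]


print(decode("VVYN1F"))
print(generalise("VVYN1F", 2))  # 'VV', i.e. any finite verb
```

Because each further character only narrows the category, a shorter prefix of the tag is itself a meaningful, more general category.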

(22) chOrA chOrI (as two separate tokens) (lit. Son daughter - ‘children’)

(23) chOrA-chOrI (with hyphen) (lit. Son-daughter - ‘children’)

(24) chOrAchOrI (as a single token) (lit. Sondaughter - ‘children’)

(22) will be tagged as two separate tokens. (23) and (24) are each tagged according to the nature of the last element of the compound, i.e. the tag would be the same as that of "chOrI".
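A minimal sketch of this convention in Python is given below. The two-word lexicon and the helper names are hypothetical, and the way a fused compound is segmented (longest known word ending the token) is an assumption made for the example, not a description of the Nelralec tools.

```python
# Hypothetical mini-lexicon; NN (common noun) is the tag named in the text.
LEXICON = {"chOrA": "NN", "chOrI": "NN"}


def last_element(token):
    """Find the last element of a hyphenated or fused compound."""
    if "-" in token:
        return token.rsplit("-", 1)[1]
    # Fused compound: take the longest known word that ends the token.
    endings = [word for word in LEXICON if token.endswith(word)]
    return max(endings, key=len) if endings else token


def tag_text(text):
    """Tag every whitespace-separated token by its last element."""
    return [(tok, LEXICON.get(last_element(tok), "UNK")) for tok in text.split()]


print(tag_text("chOrA chOrI"))   # (22): two tokens, two tags
print(tag_text("chOrA-chOrI"))   # (23): one token, tagged like chOrI
print(tag_text("chOrAchOrI"))    # (24): one token, tagged like chOrI
```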

Nouns are classified into two types: proper and common. Case and number endings are tokenised separately from the noun token. The former is treated as a postposition. A model of number-gender in Nepali is developed for the purpose of POS tagging. The gender markers in Nepali like -O, -I and -A, as in "chOrO", "chOrI" and "chOrA" (son - dir.sg, daughter - dir/obl.sg, son - obl.lt/sg, respectively), are ignored on nouns on two grounds. Firstly, these features on nouns are lexical-derivational features. Secondly, the markers lack exact symmetric counterparts with regard to gender. For example, there is no masculine counterpart of a token ending with -I like AImAI (woman), whose counterpart is marda (man). On the other hand, the same features on pronouns, adjectives, non-finite verbs, etc., where the distinction is motivated by agreement, are tagged accordingly. Even noun tokens with honorific markers like sara, sAheba, jyU, etc. are tagged as NN (common) or NP (proper). In the Nelralec tagset, postpositions are those clitics (as defined in this tagging scheme) that are detached from the noun token, like case markers, the plural suffix, etc. Similarly, Nepali classifiers are annotated separately.
Nepali adjectives, depending upon their morphological behaviour, are divided into five types. These types are primarily based on gender-number agreement, i.e. masculine singular, feminine singular, 'other' for masculine and feminine plurals, unmarked for indeclinable adjectives, and a common tag for both comparative and superlative adjectives.
As this POS tagset is developed as a model of Nepali grammar for POS annotation, pronominals are organised differently from traditional/descriptive grammar. Pronouns are organised as personal and reflexive. The former are organised on the basis of person as first, second and 'other' for unspecified person, and honorificity is marked on five levels (see Hardie et al. 2005). Interestingly, in Nepali the genitive case alters the phonetic form of the pronoun and cannot be separated from it as it can from a noun. Hence, the pronoun is treated as a single unit with a tag like PMXKM, i.e. Pronoun-1P-unmarked for honorific-possessive-masculine, for mErO/hAmrO. Similarly, the ergative/instrumental case markers are also inseparable from the pronoun.
The pronoun-determiner is organised as a separate tag, and is subdivided into demonstrative, interrogative, relative and general (for the interrogative and relative, the mnemonics are labelled according to their form in Nepali). As the pronoun-determiner functions both as a demonstrative and as a pronoun in Nepali, it is imperative to tag the tokens on the basis of the local phrasal context.
Nepali has a large number of TAM combinations, and if every possible combination were tagged separately, the number of tags would be unmanageably large. Therefore, for a verb which has two verb roots but forms a single token, the Nelralec tagset follows the convention that the last identifiable verb is the one considered for annotation. For example, in "garnEcha" (do-subjective mood.BE.prs), "cha" is taken into account for annotating the verb token. Two separate verb tokens, however, receive individual tags. Consequently, there is no distinction between main and auxiliary verbs in the tagset. Since the idea behind the Nelralec tagset is to accomplish POS annotation, certain aspects of Nepali verb morphology are ignored, viz. passive, causative and negative. These forms are annotated like their counterparts, i.e. active, non-causative and positive, respectively.
Within the verbal domain, finiteness is distinguished on the basis of person marking. A verb with person marking is considered finite, and a verb without a person marker non-finite. The non-finite verb forms comprise the participles like "gardO, gardI, gardA, gardai, garE, garnE, garEra", the subjunctive e-form like garE (note that it is phonetically the same as a participle) and the i-form, for instance "garI", and are grouped accordingly. Similarly, command verb forms are tagged separately according to honorific status.
In Nepali finite verbs, the distinction operates on Person (First, Second and Third), Number (Singular and Plural), Gender (Masculine and Feminine) and Honorific (Non-honorific and Medial). From the above, theoretically speaking 24 tags can be derived; however, only 10 tags are required since not all the combinations of these morphosyntactic features have separate forms in Nepali. Interestingly, separate tags are designed in this tagset for optative verbs as they behave differently in many ways from the other finite verbs.
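The figure of 24 is simply the product of the number of values of each feature (3 x 2 x 2 x 2). A short Python check, using the feature values listed above (the variable names are only illustrative):

```python
from itertools import product

PERSON = ["first", "second", "third"]
NUMBER = ["singular", "plural"]
GENDER = ["masculine", "feminine"]
HONORIFIC = ["non-honorific", "medial"]

# Every theoretically possible combination of the four agreement features.
combinations = list(product(PERSON, NUMBER, GENDER, HONORIFIC))
print(len(combinations))  # 3 * 2 * 2 * 2 = 24
```

Only 10 of these 24 combinations need a tag of their own, since the remaining ones do not have separate forms in Nepali.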
In the Nelralec tagset, the mnemonics of the tag elements are schematised according to the Nepali form like M for first person (after "ma" (I)), T for second person (after "timI" (you)). Interestingly, there is no uniform scheme in organising tags on the basis of their types and attributes. For example, Nouns are NN and NP showing category and type - common and proper, respectively. Conversely, Adjectives are distinguished as JM, JF, JO, JX, and JT on the basis of the morphosyntactic attributes - gender and degree (see Hardie et al. ibid.: 5-11 for details of other categories). In other words, the Nelralec tagset assumes underspecification of both types and attributes among its 112 tags.  

[From: Mallikarjun, B, Yoonus, M. Sinha, Samar & A, Vadivel. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Language: pp. 22-25. ISBN-81-7342-197-8]

Issues in POS Tagset Design

In the initial phase of POS tagset development for NLP purposes, tagsets were designed and developed from the machine learning point of view rather than the linguistic point of view. Under such considerations, language is arbitrarily considered as a sequence of tokens to be associated with a given set of tags. In formal terms, a set of strings over an alphabet Σ (i.e. any subset of Σ*) is called a formal language over Σ. Moreover, linguistic knowledge was neglected in tagset design.
However, with the growing realisation that linguistic knowledge is essential in any work on language, the issues involved in designing a POS tagset are now also discussed from the linguistic perspective, as these issues have wide implications for the annotation of linguistic data and for the resulting output and the applications based on it.
In this section, the following conceptual design issues are discussed with relevant illustrations from Indian languages.

1. Theoretical Background
In the development of a new tagset, the developers will analyse linguistic data in light of the particular linguistic theory that they advocate. The development of a tagset, therefore, is not theory-independent or theory-neutral, as one often wishes it to be, because theories make conflicting assumptions. Consequently, theoretical assumptions play an important role in deciding many other aspects of tagset design.
However, it is also possible that the developers are application-oriented rather than linguistic-theory oriented. For example, Machine Learning researchers using a POS tagged corpus for their experiments are more concerned with machine-learnable tagging than with a specific linguistic theory. Such researchers will therefore develop a POS tagset accordingly. Paradoxically, this view has dominated the development of POS tagsets to a large extent.
English being the first language of corpus linguistics, the grammatical frameworks chosen to describe its POS were Generalised Phrase Structure Grammar and Lexical Functional Grammar, which promoted the notion that a category is composed of a bundle of features. In the Indian language POS tagset scenario, the IIIT-Hindi tagset and the Telugu tagset developed by CALTS, Hyderabad (Sree R. J. et al. 2008) are based on the Paninian perspective (for details see section 6). However, it is desirable that a tagset is not theory-laden but still supports linguistic analysis.

2. Form and Function
One of the major decisions that a tagging scheme needs to make is whether to tag a token in the text by its form or by its function. As a given word/token may function differently in different syntactic contexts, it may be assigned different tags depending upon its function rather than its form. Such cases, however, add computational complexity to automatic tagging, since more than one tag is given to the same form with different contextual syntactic functions. On the other hand, two syntactic functions of a token/word may be assigned a single tag on the basis of its form. This, in turn, leads to information loss.
To settle the choice between form and function, different approaches to POS tagging have been adopted, and each approach has an underlying assumption that justifies the decision. For example, in AnnCorra (Bharati et al. 2006) a token is POS tagged on the basis of its form rather than its function. This decision rests on the consideration that it eliminates the choices involved in manual tagging and establishes a token-tag relation which leads to efficient machine learning. In contrast to AnnCorra, the Stuttgart-Tübingen Tag-Set (STTS) for German (Atwell ms.) makes a linguistically motivated distinction between attributive and predicative adjectives. However, there are other approaches in which there is a division of labour between form and function across the levels of the hierarchy. The ILPOSTS-based Hindi tagset developed by MSRI is one such tagset: it takes morphosyntactic form into account for assigning attribute values (the lowest level in the hierarchy), and function for annotating the Type (the mid level of the hierarchy).
Knowles & Don (2003) have devised another approach for Malay, a language in which words change their function according to context. For example, "masuk" is a verb in one context but a noun, "entrance", in the context of a building, car park, etc. Acknowledging this linguistic fact, Knowles & Don's tagset for Malay separates lexical class or form from syntactic function, and gives each word in the lexicon only one class-tag. They use the term ‘tag’ to label a lexical class, and ‘slot’ to refer to a position in syntactic structure (see Atwell ms.: 19).
Yet another view on this dichotomy is the following. “maaDi” in Kannada is ambiguous between plural imperative and past verbal participle, and a tagger needs to resolve such ambiguity through context. However, there is no need to consider those distinctions which are entirely within the scope of syntax. For example, syntax allows, as a general, universal rule, that nouns can act as adjectival modifiers. This rule is very much a part of any syntactic system. Hence, a tagger need not tag a noun as an adjective because of its function. Doing so is unnecessary and adds to the complexity of machine learning (Kavi Narayana Murthy, p.c.).
This view asserts that ambiguity arising out of form needs to be resolved at the POS level provided there are no syntactic rules to account for its function. In other words, POS tagging is primarily based on form, and function is a secondary concern of tagging, to be used as a last resort for disambiguation.

3. Granularity: Coarse vs. Fine
One of the important concerns in developing a tagset for a language is granularity - coarseness versus fineness, which refer to broad and finer annotation, respectively, of any grammatical category. The aim of corpus annotation is to maximise information content so that the tagged corpus can be used for a variety of applications. But the applications are not known in advance; hence, the level of linguistic annotation required is also unknown. General corpus developers, as a principle, prefer to maximise linguistic enrichment by designing the tagset in such a way that the annotation can be customised according to the needs of the application.
In POS tagset design, there are two schemes for granularity. Coarse annotation has far fewer tags than fine-grained annotation, and aids higher accuracy in manual tagging and efficient machine learning. Despite such advantages, a coarse-grained POS tagset is of less use as it does not capture much relevant information on POS. A finer annotation, on the other hand, provides a great deal of information but creates a problem for automatic tagging, as it maximises the tag options for a given token, leading to computational complexity.
In view of the above-mentioned advantages and disadvantages of the two schemes, an ideal POS tagset design strikes a subtle balance between them. However, it is important to remember that not all linguistic information can be annotated at the POS level, nor can all the remaining linguistic information be recovered from other levels of annotation. As a rule of thumb, it is imperative to capture optimal information at this level of annotation. In other words, the POS design has to be such that coarse as well as fine information can be retrieved as per the needs of the application.
In this context of granularity, a hierarchical architecture has an edge over a flat architecture as it allows information to be modularised accordingly. This is usually conceived along the levels of the hierarchy - the deeper the level, the finer the features encoded. A flat tagset, on the other hand, may be too coarse or too fine, or may lose relevant information in the POS tagged corpus.
The Text Analytics and Natural Language Processing (Tanl) tagset (Attardi & Simi ms.) used for the EVALITA09 POS tagging task is one such tagset designed for both coarse and fine-grained annotation. It consists of 328 tags, and provides three levels of POS tags: coarse-grain, fine-grain and morphed tags. The coarse-grain tags consist of 14 categories; the fine-grain tags comprise 36 tags, for example indefinite pronoun, personal pronoun, possessive pronoun, interrogative pronoun and relative pronoun among the pronouns; and the morphed tags consist of 328 categories, which include morphological information like person, number, gender, tense, mode, and clitic.

4. Orthographic Conventions
One of the major issues that one faces in designing a tagset is how to account for orthographic practices that lie beyond the known linguistic principles of categorisation. It is a known linguistic fact that meaning is not necessarily expressed by a single token; it may be expressed by a group of tokens. Such a linguistic unit has come to be known as a multi-token word (MTW) (also commonly known as a multiword expression in the computational literature). For example, a complex postposition in Hindi, के लिए, collectively expresses a single meaning of "benefaction/purposive" (as a case marker). In isolation, के is a masculine genitive case marker and लिए has no semantic content. Ideally, therefore, के लिए can be tagged in one of three ways:

(4) के\ and लिए\ as two separate POS labels (though tag for लिए is an issue).
(5) [के\ लिए\]\ as a single complex postposition with two different POS labels.
(6) [के लिए]\ as a complex but a single POS label.

A tagset design has to take a firm decision on how POS labels are assigned to the different tokens of a single lexical word. It is often the case that such issues are handled ad hoc/arbitrarily at the POS level of annotation, and are resolved at a higher level like local token grouping/chunking, where a group of tokens is assigned a single tag.
Apart from MTWs, contractions pose a major issue with respect to tokenisation and linguistic annotation. In contrast to MTWs, contractions are orthographic forms that are shorter than the usual form, reflecting the spoken form while partially retaining the usual orthography. For example, in Nepali, भा'थ्यो or भा'-थ्यो is the contracted form of भएको थियो. The contracted form भा is a contraction of the participial भएको, and is different from the dubitative particle भा. Similarly, थ्यो is a contracted form of थियो.
There are two known approaches to tackling this orthographic convention. The first approach considers the form as an orthographic convention reflecting a spoken form of two known distinct tokens. Therefore, the contracted forms are pre-processed: the punctuation markers are separated, the form is tokenised as two separate tokens, and each is tagged accordingly. In the case mentioned above, भा will be tagged as a participial and थ्यो as a verb, on the assumption that they are alternative orthographic forms of their respective categories. Alternatively, the contracted form is considered a single token reflecting a linguistic reality in the mind of the speaker/author, and the token is tagged in accordance with language use.
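The first approach can be sketched as a small pre-processing step in Python. The sketch assumes that the contraction is marked by an apostrophe, optionally followed by a hyphen, as in the two spellings above; the function names are hypothetical, and the table of full forms contains only the two pairs given in the text.

```python
import re

# Full forms of the contracted parts, as given in the text.
FULL_FORMS = {"भा": "भएको", "थ्यो": "थियो"}


def split_contraction(token):
    """Split a contracted form on apostrophes and hyphens, dropping the marks."""
    return [part for part in re.split(r"['\u2019\-]+", token) if part]


def expand(token):
    """Return the parts of a contraction with their full forms where known."""
    return [FULL_FORMS.get(part, part) for part in split_contraction(token)]


print(split_contraction("भा'थ्यो"))   # two tokens
print(split_contraction("भा'-थ्यो"))  # the same two tokens
print(expand("भा'थ्यो"))
```

Each resulting token can then be tagged like its full form, i.e. भा as a participial and थ्यो as a verb. Under the second approach no splitting is done and the whole form receives a single tag.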

5. Computational Complexity
One of the important functions of POS tagging is to resolve category-level ambiguity. Paradoxically, in practice, there are many cases where ambiguity remains unresolved or only partially resolved even after POS tagging, and becomes a source of ambiguity for further processing. In this context, it is important to remember that these ambiguities concern the token-tag relation rather than semantic or structural ambiguity.
One of the most commonly cited examples is case syncretism, where the same form serves as the marker of different cases. For example, the Hindi dative and accusative case markers have the same phonetic/orthographic form, को. In a form-based approach, let us assume that को is consistently assigned dative, even in linguistic contexts in which it is accusative. This results in a loss of the linguistic information that को is also an accusative case marker in Hindi. The approach makes it easier for a machine learning algorithm to POS tag, but the resulting output loses relevant linguistic information. Though such an approach solves the issue ad hoc at the POS level of annotation, its result needs to be recategorised and reassigned the appropriate tag in association with other levels of annotation, like semantic tagging, in order to regain the lost linguistic information, which is significant for higher-level processing.
In a function-based approach, the annotator is required to distinguish each case and tag accordingly, which adds cognitive load on the annotator (see section 5.6), but each piece of linguistic information is tagged appropriately despite the similar forms. For the machine, however, it is a more difficult task to distinguish the POS tags, technically to disambiguate, as there is no linguistic cue in the form to distinguish the two (see Bhat & Richa (ms.) for a detailed discussion of the issue). Thus, a system requires other tools and techniques to disambiguate, which adds to computational complexity.
As a matter of fact, these approaches are in a tug-of-war between detailed linguistic tagging on the one hand and a lighter cognitive load on the annotator and/or easier subsequent automatic tagging on the other. In an ideal tagging scheme, these two aspects are finely balanced so that the scheme remains optimal with respect to its design and to the various processes at both the manual and the machine level. Therefore, it is imperative to validate a POS tagset at, and across, various NLP processes in order to achieve computational as well as manual optimality.

6. Cognitive Load on Annotator
One of the major objectives of corpus linguistics is to design a tagger which minimises the human labour of annotating text. Such an automatic tagger, however, requires linguistic knowledge. Ideally, an automatic tagger is “trained” by giving it a manually annotated corpus, also called a "training corpus". It is on the basis of the training corpus, in association with machine learning techniques, that the automatic tagger gains linguistic knowledge.
With respect to POS tagging, an automatic tagger is trained to acquire the knowledge needed to establish a tag-token relationship. The tagger acquires this knowledge from a training corpus which is manually POS annotated. This, in turn, establishes a workflow in which manual annotation forms the backbone of all kinds of annotation for NLP tasks.
Given the importance of manual annotation, and of POS annotation specifically, for NLP tasks, it is important to ensure that manual POS annotation is error-free. Since manual tagging is a tedious process, it is always desirable to reduce the tagging load on the annotator to ensure such a standard. The annotation process should be simple, intuitive, easy and pleasant, so that the cognitive load on the annotator is reduced as far as possible. The first requisite is to make the user comfortable with the GUI-based tool. The look and feel of the tool can be customised by the user so that it provides an environment in which the user can work comfortably.
To reduce the cognitive load on the annotator, the tool can be designed in such a way that it reduces the number of human interventions in annotation, which in turn minimises human error in tagging. For example, in Nepali, Direct Case has the value "0" for Case Marker, and Oblique takes a morphological Case Marker as given in the values. The tool needs to be programmed in accordance with such linguistic facts, so that the value for Direct Case is assigned automatically whereas for Oblique the value is assigned manually. As a consequence of such a filtering program, the chances of error with respect to Direct Case are reduced. The tool, therefore, needs to be flexible enough to be customised with filters that accommodate language-specific tagging facts while tagging data from many languages.
It is also desirable to annotate finite lists of items automatically. For example, punctuation markers are finite, and the tool can be designed to tag them automatically, reducing the repetitive work that a manual annotator otherwise has to carry out.
The development and incorporation of such heuristics and linguistic facts into a tool based primarily on the POS tagset can ease the cognitive load on the annotator and help to ensure a zero-error standard.
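Both kinds of automation can be sketched as simple filters in Python. The punctuation list, the tag label PUNC and the function names below are hypothetical and only illustrate the idea; they are not part of any existing annotation tool.

```python
# A closed list of punctuation markers (illustrative, not exhaustive).
PUNCTUATION = {".", ",", "?", "!", ";", ":", "।"}


def pretag(token):
    """Tag closed-class items automatically; return None to leave the token
    to the human annotator."""
    return "PUNC" if token in PUNCTUATION else None


def fill_case_marker(case):
    """Direct Case always has the Case Marker value '0' and is filled in
    automatically; Oblique needs a manual decision, signalled by None."""
    return "0" if case == "direct" else None


print(pretag("।"))                  # 'PUNC'
print(pretag("ghar"))               # None: left for manual tagging
print(fill_case_marker("direct"))   # '0'
print(fill_case_marker("oblique"))  # None: annotator chooses the marker
```

Every value filled in by such a filter is one decision fewer for the annotator, and one opportunity fewer for a manual error.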
[From: Mallikarjun, B, Yoonus, M. Sinha, Samar & A, Vadivel. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Language: pp. 7-13. ISBN-81-7342-197-8]

POS Annotation Vis-A-Vis Corpus Annotation

Annotation is a process of ascribing grammatical categories to the tokens/words of a corpus. As a preliminary to corpus annotation, the Text Encoding Initiative (TEI) makes corpus annotation reader friendly and suggests universal grammatical categories for annotation, enabling corpora to be stored and transferred. Moreover, TEI uses Standard Generalised Mark-up Language (SGML), an ISO standard (ISO 8879) technology for defining generalised mark-up languages for documents, for text encoding and annotation; more recently XML has been adopted. This makes it possible to encode any textual resource in a manner that is hardware, software, and application independent.
Leech (1993) describes seven maxims for annotation of text corpora:
  1. Reversibility: Annotation should be removable, so that the annotated corpus can be reverted to the raw corpus.
  2. Extractability: Annotations should be able to be separated from the corpus text.
  3. Reader Friendliness: Annotation has to be such that it is reader friendly.
  4. Maker Explicitness: It should be made clear to the corpus user whether the tagging was done manually or automatically.
  5. Potentiality: Annotation is a potential representation rather than absolute representation.
  6. Mentality: Annotation should be theory independent.
  7. Non-Standardness: No annotation scheme is regarded as the a priori standard. Standards emerge through practical consensus, and the set of corpus tags will, very likely, be revised many times in the course of the work in order to find an optimal set for each language.
As a design measure for a POS tagset, it is widely accepted that the tagset will not include any derivational, etymological, syntactic, semantic or discourse information (Hardie 2003). However, the composition of tags does have its significance in annotation. Leech (1997) suggests the following criteria for labelling tags:
  1. Conciseness: It is more convenient to use concise labels than verbose, lengthy ones. For example, "mas" rather than "masculine".
  2. Perspicuity: Labels that can be interpreted are more user friendly than labels that cannot. Cloeren (1999) writes, "For reasons of readability there is a preference for mnemonic tags. Full-length names may be clearer individually, but make the annotated text virtually unreadable." For example, "NMZ" is more easily interpreted as nominaliser than "NML".
  3. Analysability: Decomposable labels are friendly to the human annotator as well as to the machine.
  4. Compositionality: A tag needs to be logically composed as a string of symbols representing levels of taxonomic categories. For example, the tag NC.mas.sg.dir.0.n.n in Hindi stands for Category Noun, Type Common, and the Attributes Gender, Number, Case, Case Marker, Distributive and Honorificity along with their values.
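Analysability and compositionality mean that such a tag can be taken apart mechanically. The Python sketch below decomposes the example tag; it assumes that the first letter of the first field is the Category and the rest of that field is the Type, and the function names are hypothetical.

```python
# Attribute order as listed for the example tag NC.mas.sg.dir.0.n.n.
ATTRIBUTES = ["gender", "number", "case", "case_marker",
              "distributive", "honorificity"]


def parse_tag(tag):
    """Decompose a dotted tag into category, type and attribute values."""
    head, *values = tag.split(".")
    # Assumption: one letter for the Category, the remainder for the Type.
    parsed = {"category": head[0], "type": head[1:]}
    parsed.update(zip(ATTRIBUTES, values))
    return parsed


def coarsen(tag, level):
    """Level 1: Category only; level 2: Category and Type; level 3: full tag."""
    head = tag.split(".")[0]
    if level == 1:
        return head[0]
    if level == 2:
        return head
    return tag


print(parse_tag("NC.mas.sg.dir.0.n.n"))
print(coarsen("NC.mas.sg.dir.0.n.n", 1))  # 'N'
print(coarsen("NC.mas.sg.dir.0.n.n", 2))  # 'NC'
```

The same composed tag thus serves an application that needs only the coarse category as well as one that needs every attribute value.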
Leech & Smith (1990: 27) point out that syntactic parsing is arguably the central task of NLP, and POS tagging, being a prerequisite to parsing, is "the most central area of corpus processing." Though a POS tag has a limited scope for syntactic disambiguation, POS tagging shares a network of relationships with various other intermediate tasks in constructing an optimal system. In the architecture of corpus annotation, different levels of annotation feed POS tagging, and vice versa. POS tagging being a mid-level NLP process, ideally it can and should make use of lexical-level processes and should yield the results that are desired for syntactic parsing. Therefore, both these processes are to be considered in designing POS tagsets.
Bharati et al. (2006), among others, show that features from a Morph Analyser can be used to enhance the performance of a POS tagger. In fact, they argue that the Morph Analyser itself can identify the parts of speech in most cases, and a POS tagger can be used to disambiguate the multiple answers provided by the Morph Analyser. In turn, POS tagged data is used for other higher-level processes like chunking, parsing, etc.
A similar view on POS tagging is expressed by Kavi Narayana Murthy (p.c.), who considers POS annotation a mid-level process depending on lexical-level annotation and processes. In his view, the Lexicon, in a sense, can be considered a tagger in that it tags the root forms of words/lemmata with 'all possible' tags. The Morph Analyser deals with inflected and/or derived forms of words and assigns 'all possible' tags to all valid forms of all valid words in a given language. Given this, a tagger that tags words/tokens in a running text can be viewed as a disambiguator rather than as an assigner of tags. All possible tags have already been assigned by the Lexicon/Morph Analyser, and the task of POS-level annotation is only to eliminate, or at least reduce, ambiguities, if any. Further, he opines that this approach to POS tagging has several advantages:
  1. Impossible tags are never assigned.
  2. Words which have only one possible tag need not even be considered, only ambiguous cases need to be considered by the tagger.
  3. The degree and nature of ambiguities can be studied, both at the root word level (from the Lexicon) and at the running text level (from a tagged corpus).
  4. All types of ambiguities are not of the same nature. Therefore, different strategies can be formulated to handle them. For example, some kinds of ambiguities are easily solved using local context while others may inherently need long distance dependencies to be considered.
  5. In the context of Indian languages, he emphasises that Indian languages are morphologically very rich and a Morph Analyser is the most essential component for processing which can substantially reduce the ambiguities at the Lexicon though it can also introduce some ambiguities of its own. But the ambiguities introduced by the Morph Analyser are always uniform and fully rule governed. This helps us to design a judicious combination of linguistic (rule based/knowledge based) and statistical/machine-learning approaches.
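The view of a tagger as a disambiguator can be sketched as follows in Python. The toy lexicon, the tag labels and the single context rule are all hypothetical and merely stand in for the output of a real Lexicon/Morph Analyser and a real disambiguation strategy.

```python
# Hypothetical Lexicon/Morph Analyser output: token -> all possible tags.
POSSIBLE_TAGS = {
    "the": {"DET"},
    "book": {"NOUN", "VERB"},
    "in": {"PREP"},
}


def candidates(tokens):
    """Look up all possible tags for each token (empty set if unknown)."""
    return [(tok, POSSIBLE_TAGS.get(tok, set())) for tok in tokens]


def ambiguous(tokens):
    """Only tokens with more than one possible tag need the tagger at all."""
    return [tok for tok, tags in candidates(tokens) if len(tags) > 1]


def disambiguate(tokens):
    """Toy local-context rule: after a determiner, prefer NOUN if allowed.
    A tag outside the candidate set is never assigned."""
    result, previous = [], None
    for tok, tags in candidates(tokens):
        if len(tags) == 1:
            tag = next(iter(tags))
        elif previous == "DET" and "NOUN" in tags:
            tag = "NOUN"
        else:
            tag = None  # left unresolved for another strategy
        result.append((tok, tag))
        previous = tag
    return result


print(ambiguous(["the", "book", "in"]))  # ['book']
print(disambiguate(["the", "book"]))     # [('the', 'DET'), ('book', 'NOUN')]
```

Since the tagger only chooses among the candidates, an impossible tag, such as a verb tag for "in", cannot be produced.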
[From: Mallikarjun, B, Yoonus, M. Sinha, Samar & A, Vadivel. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Language: pp. 2-4. ISBN-81-7342-197-8]

POS Annotation

The ultimate goal of research on Natural Language Processing (henceforth, NLP) is to understand human language, to facilitate human-machine interaction through natural human language (weak AI), and to model theory of mind (strong AI). To achieve such a Promethean mission, research on NLP has focussed on various intermediate tasks that make partial sense of language structure without requiring complete understanding, thereby contributing to the development of a successful system. Part-of-Speech (henceforth, POS) tagging is one such task.
In corpus linguistics, POS tagging, also called grammatical tagging or word-category disambiguation, is a classification process that marks up the words in a text corpus as corresponding to particular parts of speech, based on both their definition and their context, i.e., their relationship with adjacent and related words in a phrase, clause, or sentence. It is the most common form of corpus annotation and is widely accepted as the first stage of a more comprehensive syntactic annotation. It serves a wide range of applications, such as speech synthesis and recognition, information extraction, partial parsing, machine translation, and lexicography.
However, it is important to remember that a POS tag is different from a part-of-speech label (as understood in general parlance). The latter captures the basic grammatical category of a word/token in a given language, without any specific information about its morphosyntactic content or about punctuation markers. A POS tag, on the other hand, annotates a word/token in its entirety, following the writing conventions of the language/script/orthography, which include punctuation and other conventions of writing. Hence, it also needs to accommodate many issues that are not linguistic but arise from writing conventions.
As a process, POS tagging assigns a tag to a specific unit of natural language text. Hence, the text to be tagged is first passed through a tokeniser, which applies various formatting rules to divide the text into tokens, i.e., units of written material separated by white space.
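A minimal tokeniser along these lines might look as follows. It is a sketch, not a description of any particular tool: the function name, the single "formatting rule" (peeling punctuation off the edges of each white-space-delimited chunk) and the set of punctuation marks are assumptions made for illustration.

```python
import string
from typing import List

# Assumption: ASCII punctuation plus the Devanagari danda are the marks
# that should become tokens of their own.
PUNCTUATION = set(string.punctuation) | {"।"}


def tokenise(text: str) -> List[str]:
    """Split on white space, then detach punctuation from each chunk's edges."""
    tokens: List[str] = []
    for chunk in text.split():  # split() breaks on any run of white space
        leading: List[str] = []
        trailing: List[str] = []
        while chunk and chunk[0] in PUNCTUATION:
            leading.append(chunk[0])
            chunk = chunk[1:]
        while chunk and chunk[-1] in PUNCTUATION:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))  # restore the original order
    return tokens


if __name__ == "__main__":
    print(tokenise("Tagging assigns a tag, token by token."))
```

Punctuation inside a chunk (as in "mid-level") is deliberately left alone; whether such forms should be split is a decision for the tagging scheme, not for this sketch.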
At a formal level of description, POS tagging can be stated as in (1), where a sequence of tokens W= w1...wn corresponds to a sequence of tags T=t1...tn, drawn from a set of tags {T}.

(1) S = argmax P (t1...tn | w1...wn)           (Dandapat 2008: 4)

The description in (1) implies that the input to the tagger is a whole sentence and the output is a whole sequence of tags. Such a formalisation also assumes that tagging is an independent process, independent of the dictionary and morphology. The task is simply to assign tags, so it is possible for a word/token to be tagged with a category that is not possible for it at all. For example, "in" can be tagged as a verb by a POS tagger. The formalisation does, however, make explicit that a token is given a tag in the context of the adjacent tokens; it posits no relationship between a token and a tag based on the morphosyntactic cues of the former. On the other hand, tagsets are designed to capture finer morphosyntactic details; consequently, a large number of tags are devised as a relationship between a token and a tag, without any dependence on the token's contextual position. Under such criteria, POS tagging can be stated as,

(2) S = argmax P (t | w)              (Kavi Narayana Murthy, p.c.)
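The difference between (1) and (2) can be made concrete with a small Python sketch. Everything here is an assumption made for illustration: the five-sentence training corpus and its tags are invented, the probabilities are plain relative frequencies without smoothing, and formulation (1) is scored with a bigram factorisation (tag-to-tag transitions times tag-to-word emissions) that the formulas above do not themselves prescribe. The best sequence is found by brute force, which is only feasible for toy inputs.

```python
from collections import Counter, defaultdict
from itertools import product

START = "<s>"  # hypothetical sentence-start symbol

# Hypothetical manually tagged training corpus.
TRAINING_CORPUS = [
    [("they", "PRON"), ("can", "AUX"), ("fish", "VERB")],
    [("the", "DET"), ("can", "NOUN"), ("rusts", "VERB")],
    [("they", "PRON"), ("fish", "VERB")],
    [("the", "DET"), ("fish", "NOUN"), ("swims", "VERB")],
    [("they", "PRON"), ("can", "AUX"), ("swim", "VERB")],
]


def train(corpus):
    """Collect the counts needed by both formulations."""
    word_tag = defaultdict(Counter)    # word -> tag counts, for P(t | w)
    emission = defaultdict(Counter)    # tag -> word counts, for P(w | t)
    transition = defaultdict(Counter)  # previous tag -> tag counts
    for sentence in corpus:
        previous = START
        for word, tag in sentence:
            word_tag[word][tag] += 1
            emission[tag][word] += 1
            transition[previous][tag] += 1
            previous = tag
    return word_tag, emission, transition


def relative_frequency(counter, key):
    total = sum(counter.values())
    return counter[key] / total if total else 0.0


def tag_per_token(words, word_tag):
    """Formulation (2): argmax P(t | w) for each token, ignoring context.

    Assumes every word was seen in training.
    """
    return [word_tag[word].most_common(1)[0][0] for word in words]


def tag_sequence(words, emission, transition):
    """Formulation (1): argmax over whole tag sequences, by brute force."""
    tagset = sorted(emission)

    def score(tags):
        probability, previous = 1.0, START
        for word, tag in zip(words, tags):
            probability *= relative_frequency(transition[previous], tag)
            probability *= relative_frequency(emission[tag], word)
            previous = tag
        return probability

    return list(max(product(tagset, repeat=len(words)), key=score))


if __name__ == "__main__":
    word_tag, emission, transition = train(TRAINING_CORPUS)
    sentence = ["the", "can", "rusts"]
    print(tag_per_token(sentence, word_tag))
    print(tag_sequence(sentence, emission, transition))
```

In this toy corpus "can" is more often an auxiliary than a noun, so the context-free choice in (2) tags it AUX even after "the", whereas the sequence-level choice in (1) prefers DET NOUN VERB because no other sequence has a non-zero score. Note that neither function consults a lexicon, so nothing prevents a tag that is impossible for a word from being considered.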

On the other hand, a large number of POS tagsets are designed to annotate on the basis of both the form and the function of a token in a given clause/sentence. Consequently, such tagging is based neither solely on form nor solely on function, and can be formally expressed as,

(3) S = argmax P (ti | w1...wi...wn)             (Kavi Narayana Murthy, p.c.)

The basic requirements for POS tagging are a POS tagset, a tagging scheme, and practical definitions of each tag and tag element, together with the words and contexts in which each tag and tag element applies. Manual annotation requires an annotation tool with a graphical user interface (GUI), designed to assign tags from a specific tagset. Automatic tagging requires a tagger, i.e., a program that assigns a tag to each token in the corpus by implementing the tagset and tagging scheme in a tag assignment algorithm. Ideally, an automatic tagger is “trained” by giving it the results of a manually annotated corpus. The tagger then tags unseen text corpora using a set of rules or a statistical analysis of the manually tagged corpus. A large number of methods, techniques and free/open-source tools are available for automatic tagging (visit http://www-nlp.stanford.edu/links/statnlp.html#Taggers).

Ambiguity: In computational linguistics, ambiguity refers to a state in which more than one tag is possible for a given token.
Annotation Tool: A tool used for tagging.
Decomposable: A tag is known as decomposable if the string representing the tag contains one or more shorter sub-strings that are meaningful out of the context of the original tag. It is a desirable feature of a hierarchical tagset.
Hierarchical: The term “hierarchical”, when used of a tagset, means that the categories in that tagset are structured relative to one another. Rather than a large number of independent categories, a hierarchical tagset will contain a small number of categories, each of which contains a number of sub-categories, each of which may contain sub-sub-categories, and so on, in a tree-like structure (Hardie 2003: 48).
Lexicon: A list of possible tags for the root forms of all the valid words in a given language.
Local Token Grouping: A group of tokens that form part of a single linguistic word.
Morph Analyser: A tool that splits a given word into its constituent morphemes and identifies their corresponding grammatical categories.
Multi Token Word: A collection of separate tokens which together form a single lexical expression in a language, though they are written separately. Taken independently, these tokens may have a meaning of their own, or none, but not that of the single lexical expression.
Part-of-speech: Categories [that] group lexical items which perform similar grammatical functions (Greene & Rubin 1971: 3).
POS Tag: A POS label given to a token (optionally along with its morphosyntactic attributes).
Pre-processing: A process of normalisation of text before tokenisation.
Tag element: A part of a tag which provides information about one of the individual elements that make up the tag. Prototypically, it includes Type and other morpho-syntactic Attributes.
Tagset: A set of defined tags. A set of word categories to be applied to the word tokens of a text (Hardie 2003).
Tagging: The process of assigning a tag to a token. Also known as annotation.
Token: A printed item separated by white space.
Training corpus: A manually annotated corpus on which an automatic or semi-automatic tagger is trained to acquire linguistic knowledge.
Underspecification: The lack of a feature in a given tagset in comparison with another tagset.
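
The entries "Decomposable" and "Hierarchical" can be illustrated with the Nelralec tag VVYN1F, in which the initial V marks a verb, the following V a finite verb, and Y third person. The Python sketch below treats every prefix of a tag as one level of the hierarchy. That one-character-per-level reading is an assumption made for illustration, as are the function names; prefixes whose meaning is not stated in the tagset description are left unlabelled rather than guessed.

```python
from typing import List, Tuple

# Labels taken from the description of VVYN1F in the Nelralec tagset;
# intermediate prefixes without a stated meaning are not labelled.
KNOWN_CATEGORIES = {
    "V": "verb",
    "VV": "finite verb",
    "VVY": "third person finite verb",
    "VVYN1F": "feminine singular non-honorific third person finite verb",
}


def ancestors(tag: str) -> List[str]:
    """Return every prefix of a tag, from most general to most specific."""
    return [tag[:i] for i in range(1, len(tag) + 1)]


def belongs_to(tag: str, category: str) -> bool:
    """A tag belongs to a category if the category is a prefix of the tag."""
    return tag.startswith(category)


def describe(tag: str) -> List[Tuple[str, str]]:
    """Pair each level of the hierarchy with its label, where one is known."""
    return [(p, KNOWN_CATEGORIES.get(p, "(not labelled here)")) for p in ancestors(tag)]


if __name__ == "__main__":
    for prefix, label in describe("VVYN1F"):
        print(prefix, "-", label)
    print(belongs_to("VVYN1F", "VV"))
```

A practical consequence of such a design is that a search for all finite verbs needs only the prefix VV, without enumerating every fully specified tag beneath it.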

[From: Mallikarjun, B., Yoonus, M., Sinha, Samar & Vadivel, A. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Languages: pp. 1-2, 26. ISBN-81-7342-197-8]