Lingua Parasitica: Issues in POS Tagset Design

In the initial phase of POS tagset development for NLP purposes, tagsets were designed and developed from the machine learning point of view rather than from a linguistic point of view. Under such considerations, language is treated arbitrarily as a sequence of tokens to be associated with a given set of tags. In formal terms, a set of strings over Σ (i.e. any subset of Σ*) is called a formal language over Σ. Linguistic knowledge, moreover, was neglected in the design of tagsets.

However, with the growing realisation that linguistic knowledge is essential in any work on language, the issues involved in designing a POS tagset are now discussed from a linguistic perspective too, as these issues have wide implications for the annotation of linguistic data and for the resultant output and the applications based on it.

In this section, the following conceptual design issues are discussed with relevant illustrations from Indian languages.

1. Theoretical Background

In the development of a new tagset, the developers analyse linguistic data in the light of the particular linguistic theory that they advocate. The development of a tagset, therefore, is not theory-independent or theory-neutral, as one often wishes it to be, because different theories make conflicting assumptions. Consequently, theoretical assumptions play an important role in deciding many other aspects of tagset design.

However, it is also possible that the developers are application-oriented rather than oriented towards a linguistic theory. For example, machine learning researchers using a POS tagged corpus for their experiments are concerned primarily with machine-learnable tagging rather than with a specific linguistic theory, and they will develop a POS tagset accordingly. Paradoxically, this view has dominated the development of POS tagsets to a large extent.

English being the first language of corpus linguistics, the grammatical frameworks chosen to describe its POS were Generalised Phrase Structure Grammar and Lexical Functional Grammar, which promoted the notion that a category is composed of a bundle of features. In the Indian language POS tagset scenario, the IIIT-Hindi tagset and the Telugu tagset developed by CALTS, Hyderabad (Sree R. J. et al. 2008) are based on the Paninian perspective (for details see section 6). However, it is desirable that a tagset is not theory-laden and yet supports linguistic analysis.

2. Form and Function

One of the major decisions that a tagging schema needs to make is whether to tag a token in the text by its form or by its function. As a given word/token may function differently in different syntactic contexts, it may be assigned different tags depending on its function rather than on its form. Such cases, however, add computational complexity to automatic tagging, since the same form receives more than one tag according to its contextual syntactic function. On the other hand, two syntactic functions of a token/word may be assigned a single tag on the basis of its form. This, in turn, leads to information loss.

To settle the decision between form and function, different approaches have been adopted for POS tagging, and each approach has an underlying assumption to justify the decision. For instance, in AnnCorra (Bharti et al. 2006) a token is POS tagged on the basis of its form rather than its function. This decision rests on the reasoning that it eliminates the choices involved in manual tagging and establishes a token-tag relation which leads to efficient machine learning. In contrast to AnnCorra, the Stuttgart-Tübingen Tag-Set (STTS) for German (Atwell ms.) makes a linguistically motivated distinction between attributive and predicative adjectives. There are, however, other approaches in which there is a division of labour between form and function across the levels of a hierarchy. The ILPOSTS-based Hindi tagset developed by MSRI is one such tagset: it takes morphosyntactic form into account for assigning attribute-values (the lowest level in the hierarchy) and function for annotating the Type (the middle level of the hierarchy).

Knowles & Don (2003) have devised another approach for Malay, a language in which words change their function according to context. For example, "masuk" is a verb in one context, but it is a noun meaning "entrance" in the context of a building, car-park, etc. Acknowledging this linguistic fact, Knowles & Don's tagset for Malay separates lexical class, or form, from syntactic function, and gives each word in the lexicon only one class-tag. They use the term ‘tag’ to label a lexical class, and ‘slot’ to refer to a position in syntactic structure (see Atwell ms.: 19).

Yet another view on this dichotomy runs as follows. In Kannada, “maaDi” is ambiguous between a plural imperative and a past verbal participle, and a tagger needs to resolve such ambiguity through context. However, there is no need to consider those distinctions which fall entirely within the scope of syntax. For example, syntax allows, as a general, universal rule, that nouns can act as adjectival modifiers. This rule is very much a part of any syntactic system. Hence, a tagger need not tag a noun as an adjective because of its function; doing so is unnecessary and adds to the complexity of machine learning (Kavi Narayana Murthy, p.c.).

This view asserts that ambiguity arising out of form needs to be disambiguated at the POS level provided there are no syntactic rules to account for its function. In other words, POS tagging is primarily based on form, and function is a secondary concern, to be taken into account only as a last resort for disambiguation.

3. Granularity: Coarse vs. Fine

One of the important concerns in developing a tagset for a language is granularity, i.e. coarseness and fineness, which refer to the broad and the finer annotation of a grammatical category, respectively. The aim of corpus annotation is to maximise information content so that the tagged corpus can be used for a variety of applications. As the applications are not known in advance, however, the level of linguistic annotation required is also unknown. Developers of general corpora therefore prefer, as a principle, to maximise linguistic enrichment by designing the tagset in such a way that the annotation can be customised according to the needs of the application.

In POS tagset design, there are two schemes for granularity. A coarse annotation has far fewer tags than a fine-grained annotation, and this aids accuracy in manual tagging and efficiency in machine learning. Despite such advantages, a coarse-grained POS tagset is of less use, as it does not capture much of the relevant information on POS. A finer annotation, on the other hand, provides a great deal of information but creates a problem for automatic tagging, as it multiplies the tag options for a given token and so adds to computational complexity.

In view of the above-mentioned advantages and disadvantages of the two schemes, an ideal POS tagset design strikes a fine balance between them. It is important to remember that not all linguistic information can be annotated at the POS level, nor can all the remaining linguistic information be recovered from other levels of annotation. As a rule of thumb, it is imperative to capture optimal information at this level of annotation. In other words, the POS design has to be such that coarse as well as fine information can be retrieved as per the needs of the application.

In this context of granularity, a hierarchical architecture has an edge over a flat architecture, as it allows information to be modularised accordingly. This is usually conceived along the levels of the hierarchy: the deeper the level, the finer the features that are encoded. A flat tagset, on the other hand, may be too coarse or too fine, or may lose relevant information in the POS tagged corpus.
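The advantage of a hierarchical layout can be sketched in a few lines of Python. The labels used here (`N`, `NC`, and the attribute names) are hypothetical and are not taken from any published tagset; the sketch only illustrates that a fine annotation can be truncated to a coarser one on demand, whereas a flat coarse tag cannot be expanded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HierTag:
    category: str            # top level, e.g. "N" for noun (hypothetical label)
    type_: str = ""          # middle level, e.g. "NC" for common noun
    attributes: tuple = ()   # lowest level: (name, value) pairs

    def at_level(self, level: int) -> str:
        """Render the tag truncated to the given depth (1 is the coarsest)."""
        parts = [self.category]
        if level >= 2 and self.type_:
            parts.append(self.type_)
        if level >= 3:
            parts.extend(f"{name}={value}" for name, value in self.attributes)
        return ".".join(parts)

# One fine-grained annotation, read back at three granularities.
tag = HierTag("N", "NC", (("gender", "mas"), ("number", "sg")))
coarse = tag.at_level(1)
mid = tag.at_level(2)
fine = tag.at_level(3)
```

An application that needs only broad categories would read `coarse`, while a morphological analyser could read `fine` from the same annotated corpus.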

The Text Analytics and Natural Language Processing (Tanl) tagset (Attardi & Simi ms.) used for the EVALITA09 POS tagging task is one such tagset designed for both coarse- and fine-grained annotation. It consists of 328 tags and provides three levels of POS tags: coarse-grain, fine-grain and morphed tags. The coarse-grain level consists of 14 categories; the fine-grain level has 36 tags, which distinguish, for example, indefinite, personal, possessive, interrogative and relative pronouns within the pronoun category; and the morphed level consists of 328 categories, which include morphological information like person, number, gender, tense, mode, and clitic.

4. Orthographic Conventions

One of the major issues that one faces in designing a tagset is to account for orthographic practices that lie beyond the known linguistic principles of categorisation. It is a known linguistic fact that meaning is not necessarily expressed by a single token; it may instead be expressed by a group of tokens. Such a linguistic unit has come to be known as a multi-token word (MTW) (also commonly known as a multiword expression in the computational literature). For example, the complex postposition के लिए in Hindi collectively expresses a single meaning of "benefaction/purposive" (as a case marker). In isolation, के is a masculine genitive case marker and लिए has no semantic content. Ideally, therefore, के लिए can be tagged in one of three ways:

(4) के\ and लिए\ as two separate POS labels (though tag for लिए is an issue).

(5) [के\ लिए\]\ as a single complex postposition with two different POS labels.

(6) [के लिए]\ as a complex but a single POS label.

A tagset design has to take a firm decision on how POS labels are assigned to the different tokens of a single lexical word. It is often the case that such tokens are tagged ad hoc or arbitrarily at the level of POS annotation, and that the issue is resolved at a higher level such as local token grouping/chunking, where a group of tokens is assigned a single tag.
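The third option, treating the MTW as one unit with one label, amounts to grouping tokens before or after tagging. The following is a minimal sketch, assuming a small hand-built lexicon of two-token MTWs; the lexicon, the function name `group_mtw` and the sample phrase are illustrative only.

```python
# Hypothetical lexicon of two-token MTWs; a real one would be language-specific.
MTW_LEXICON = {("के", "लिए")}

def group_mtw(tokens, lexicon=MTW_LEXICON):
    """Merge adjacent token pairs that are listed in the lexicon into one unit."""
    grouped, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in lexicon:
            grouped.append(" ".join(pair))  # the pair becomes a single taggable unit
            i += 2
        else:
            grouped.append(tokens[i])
            i += 1
    return grouped

# Illustrative phrase: "a book for Ram".
grouped = group_mtw(["राम", "के", "लिए", "किताब"])
```

A lexicon lookup of this kind only covers listed MTWs; unlisted ones would still reach the tagger as separate tokens.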

Apart from MTWs, contractions pose a major issue with respect to tokens and linguistic annotation. In contrast to MTWs, contractions are orthographic forms that are shorter than the usual form and reflect the spoken form, yet partially retain the usual orthographic form. For example, in Nepali, भा'थ्यो or भा'-थ्यो is the contracted form of भएको थियो. The contracted form भा is a contraction of the participial भएको, and is distinct from the dubitative particle भा. Similarly, थ्यो is a contracted form of थियो.

There are two known approaches to tackling this orthographic convention. The first approach considers the form to be an orthographic convention reflecting the spoken form of two known, distinct tokens. The contracted forms are therefore pre-processed: the punctuation markers are separated off, and the form is tokenised as two separate tokens, which are tagged accordingly. In the case mentioned above, भा will be tagged as a participial and थ्यो as a verb, on the assumption that each is an alternative orthographic form of its respective category. Alternatively, the contracted form is considered to be a single token reflecting a linguistic reality in the mind of the speaker/author, and the token is tagged in accordance with the language use.
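The two approaches can be contrasted in a short sketch. The tag labels (`PARTICIPLE`, `VERB`, `VERB_CONTRACTED`) and the lookup table are hypothetical placeholders, and the splitting rule assumes that the contraction is always marked by an apostrophe, optionally followed by a hyphen, as in the Nepali example.

```python
import re

# Hypothetical tag labels for the two parts of the Nepali example.
PART_TAGS = {"भा": "PARTICIPLE", "थ्यो": "VERB"}

def split_contraction(token):
    """Split a contracted form at an apostrophe, optionally followed by a hyphen."""
    return [part for part in re.split(r"['’]-?", token) if part]

def tag_contraction(token, split=True, single_tag="VERB_CONTRACTED"):
    """split=True: first approach, two tokens tagged separately.
    split=False: second approach, one token with one tag."""
    if not split:
        return [(token, single_tag)]
    return [(part, PART_TAGS.get(part)) for part in split_contraction(token)]

as_two_tokens = tag_contraction("भा'थ्यो")
as_one_token = tag_contraction("भा'थ्यो", split=False)
```

Note that the lookup tags भा as a participial only because it occurs inside a contraction; a free-standing भा could still be the dubitative particle.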

5. Computational Complexity

One of the important functions of POS tagging is to resolve category-level ambiguity. Paradoxically, in practice there remain many cases where ambiguity is unresolved or only partially resolved even after POS tagging, and becomes a source of ambiguity for further processing. In this context, it is important to remember that the ambiguities concern the token-tag relation rather than semantic or structural ambiguity.

One of the most commonly cited examples is case syncretism, where the same form of marker is used for different cases. For example, the Hindi dative and accusative case markers share the same phonetic/orthographic form, को. In a form-based approach, let us assume that को is consistently tagged as dative, irrespective of the linguistic contexts in which it is accusative. This results in the loss of the linguistic information that को is also an accusative case marker in Hindi. The approach makes it easier for a machine learning algorithm to POS tag, but the resultant output has lost relevant linguistic information. Though such an approach solves the issue ad hoc at the level of POS annotation, its output needs to be recategorised and reassigned the appropriate tag in association with other levels of annotation, like semantic tagging, in order to regain the lost linguistic information, which is significant for higher-level processing.

In a function-based approach, the annotator is required to distinguish each case and tag it accordingly, which adds to the annotator's cognitive load (see section 5.6), but each piece of linguistic information is tagged appropriately despite the similar forms. For the machine, however, it is a more difficult task to distinguish the POS tags, technically to disambiguate them, as there is no additional linguistic cue in the form to distinguish the two (see Bhat & Richa (ms.) for a detailed discussion of the issue). Thus, a system requires other tools and techniques to disambiguate them, which adds to computational complexity.

In effect, these approaches are in a tug-of-war between detailed linguistic tagging on the one hand and a lighter cognitive load on the annotator and/or easier subsequent automatic tagging on the other. In an ideal tagging scheme, these two aspects are finely balanced so that the scheme remains optimal with respect to its design and the various processes at both the manual and the machine level. Therefore, it is imperative to validate a POS tagset at, and across, various NLP processes in order to achieve computational as well as manual optimality.
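The trade-off between the two approaches can be made concrete with a toy annotation. In the sketch below the tag names (`DAT`, `ACC`, `ERG`) and the four-token sample are invented for illustration; the code counts how many distinct tags each form carries under a function-based scheme and then collapses them as a form-based scheme would.

```python
from collections import defaultdict

# Toy function-based annotation as (form, tag) pairs; tag names are hypothetical.
function_based = [("को", "DAT"), ("को", "ACC"), ("को", "DAT"), ("ने", "ERG")]

def tags_per_form(pairs):
    """Collect the set of tags observed for each form."""
    seen = defaultdict(set)
    for form, tag in pairs:
        seen[form].add(tag)
    return dict(seen)

def collapse_to_form(pairs, canonical):
    """Form-based scheme: every occurrence of a listed form gets its one canonical tag."""
    return [(form, canonical.get(form, tag)) for form, tag in pairs]

form_based = collapse_to_form(function_based, {"को": "DAT"})
ambiguity_before = tags_per_form(function_based)
ambiguity_after = tags_per_form(form_based)
```

After collapsing, को has a single tag, so the tagger has nothing to disambiguate, but the accusative reading is no longer recoverable from the POS layer alone.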

6. Cognitive Load on Annotator

One of the major objectives of corpus linguistics is to design taggers which minimise the human labour of annotating text. Such an automatic tagger, however, requires linguistic knowledge. Ideally, an automatic tagger is “trained” by giving it the results of a manually annotated corpus, also called a "training corpus". It is on the basis of the training corpus, in association with machine learning techniques, that the automatic tagger gains linguistic knowledge.

With respect to POS tagging, an automatic tagger is trained to acquire the knowledge needed to establish a tag-token relationship. The tagger acquires this knowledge from the training corpus, which is manually POS annotated. This, in turn, establishes a workflow in which manual annotation forms the backbone of all kinds of annotation for NLP tasks.

Given the importance of manual annotation, and of POS annotation specifically, for NLP tasks, it is important to ensure that manual POS annotation meets a zero-error standard. Since manual tagging is a tedious process, it is always desirable to reduce the tagging load on the annotator in order to meet such a standard. The annotation process should be simple, intuitive, easy and pleasant, so that the cognitive load on the annotator is reduced as far as possible. The first requisite is to make the user comfortable with the GUI-based tool. The look and feel of the tool can be made customisable, so that users can set up an environment in which they would like to work comfortably.

To reduce the cognitive load on the annotator, the tool can be designed in such a way that it reduces the number of human interventions in annotation, which in turn minimises human error in tagging. For example, in Nepali, Direct Case has the value "0" for Case Marker, whereas Oblique takes a morphological Case Marker as given in the values. The tool needs to be programmed in accordance with these linguistic facts, such that the value for Direct Case is assigned automatically whereas for Oblique the value is assigned manually. As a consequence of such a filtering program, the chances of error with respect to Direct Case are reduced. The tool, therefore, needs to be flexible enough to be customised with filters that accommodate language-specific tagging facts while tagging data from many languages.

It is also desirable to automate the annotation of finite lists of items. For example, punctuation markers are finite, and the tool can be designed to tag them automatically, sparing the manual annotator the repetition that would otherwise be required.

The development and incorporation of such heuristics and linguistic facts into the tool, based primarily on the POS tagset, can ease the cognitive load on the annotator and help to ensure a zero-error standard.
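A filter of this kind might look like the following sketch. The field names (`case`, `case_marker`, `needs_manual`) are assumptions made for illustration and do not describe any particular annotation tool; only the rule itself, Direct Case implies Case Marker "0", comes from the Nepali example.

```python
def fill_case_marker(annotation):
    """Return a copy of the annotation with Case Marker auto-filled for Direct Case."""
    result = dict(annotation)
    if result.get("case") == "direct":
        # Direct Case always has the value "0", so no human decision is needed.
        result["case_marker"] = "0"
        result["needs_manual"] = False
    else:
        # Oblique (or unknown) case: leave the marker for the annotator to supply.
        result.setdefault("case_marker", None)
        result["needs_manual"] = result["case_marker"] is None
    return result

direct = fill_case_marker({"case": "direct"})
oblique = fill_case_marker({"case": "oblique"})
```

Keeping such rules in a per-language table rather than in the tool's code would let the same tool serve several languages.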
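Pre-tagging punctuation can be sketched as below. The tag label `PUNC` is a placeholder, and the sketch identifies punctuation through Unicode general categories (those beginning with "P") rather than through a hand-written list, which is one possible design among several.

```python
import unicodedata

def pretag_punctuation(tokens, punct_tag="PUNC"):
    """Tag tokens made up only of punctuation; leave the rest for the annotator."""
    tagged = []
    for token in tokens:
        is_punct = bool(token) and all(
            unicodedata.category(ch).startswith("P") for ch in token
        )
        tagged.append((token, punct_tag if is_punct else None))
    return tagged

# The danda "।" is pre-tagged; the word is left untagged (None).
pretagged = pretag_punctuation(["घर", "।"])
```

Tokens left with `None` are the only ones the annotator has to visit.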

[From: Mallikarjun, B, Yoonus, M. Sinha, Samar & A, Vadivel. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Language: pp. 7-13. ISBN-81-7342-197-8]


Sunday, November 21, 2010
