Annotation is the process of ascribing grammatical categories to the tokens/words of a corpus. The Text Encoding Initiative (TEI) makes corpus annotation reader friendly and suggests universal grammatical categories for annotation, enabling corpora to be stored and transferred. For text encoding and annotation, TEI originally used the Standard Generalised Markup Language (SGML), an ISO 8879 standard technology for defining generalised markup languages for documents; more recently, XML has been adopted. This makes it possible to encode any textual resource in a manner that is hardware, software, and application independent.
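To illustrate, a POS-annotated sentence might be encoded in TEI-style XML along the following lines (a minimal sketch: the `<w>` element with `lemma` and `pos` attributes follows TEI conventions, while the tag values themselves are illustrative):

```xml
<s>
  <w lemma="the" pos="DET">The</w>
  <w lemma="cat" pos="NN">cat</w>
  <w lemma="sit" pos="VBD">sat</w>
</s>
```

Because the annotation is carried in plain markup rather than in a tool-specific binary format, the encoded corpus can be stored, exchanged and processed independently of any particular hardware or software.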
Leech (1993) describes seven maxims for annotation of text corpora:
- Reversibility: Annotation should be removable, so that the annotated corpus can be reverted to the raw corpus.
- Extractability: Annotations should be separable from the corpus text.
- Reader Friendliness: Annotation should be presented in a form that is easy for the corpus user to read.
- Maker Explicitness: It should be clear to the corpus user how the annotation was applied, and whether it was done manually or automatically.
- Potentiality: Annotation is offered as a potentially useful representation rather than an absolute one.
- Mentality: Annotation should be theory independent.
- Non-Standardness: No annotation scheme is regarded as an a priori standard. Standards emerge through practical consensus, and the set of corpus tags will very likely be revised many times in the course of annotation, in order to find an optimal set for each language.
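The first two maxims, reversibility and extractability, can be sketched concretely. Assuming a simple word/TAG annotation format (the tags here are illustrative), the raw text and the annotation stream can each be recovered from the annotated corpus:

```python
# A minimal sketch of Leech's reversibility and extractability maxims,
# assuming a simple word/TAG annotation format (hypothetical tags).
annotated = "the/DET cat/NN sat/VBD"

# Split each token into its word and tag parts.
tokens = [pair.rsplit("/", 1) for pair in annotated.split()]

# Reversibility: the annotation can be removed to recover the raw text.
raw_text = " ".join(word for word, tag in tokens)

# Extractability: the annotations can be stored apart from the text.
tags = [tag for word, tag in tokens]

print(raw_text)  # the cat sat
print(tags)      # ['DET', 'NN', 'VBD']
```

The same round trip holds for any annotation scheme that keeps text and tags formally separable, which is precisely what the two maxims require.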
As a design principle for POS tagsets, it is widely accepted that a POS tagset should not include any derivational, etymological, syntactic, semantic or discourse information (Hardie 2003). However, the composition of tags does have significance in annotation. Leech (1997) suggests the following criteria for labelling tags:
- Conciseness: It is more convenient to use concise labels than verbose, lengthy ones. For example, "mas" rather than "masculine".
- Perspicuity: Labels that can be readily interpreted are more user friendly than those that cannot. Cloeren (1999) writes, "For reasons of readability there is a preference for mnemonic tags. Full-length names may be clearer individually, but make the annotated text virtually unreadable." For example, "NMZ" is more easily interpreted as nominaliser than "NML".
- Analysability: Labels that can be decomposed are friendly to human annotators as well as machines.
- Compositionality: A tag needs to be logically composed as a string of symbols representing levels of taxonomic categories. For example, the Hindi tag NC.mas.sg.dir.0.n.n encodes the Category Noun and Type Common, followed by the values of the Attributes Gender, Number, Case, Case Marker, Distributive and Honorificity.
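A compositional tag of this kind can be decomposed mechanically, which is what makes it analysable by both humans and machines. The sketch below splits the Hindi example from the text into named attributes; the field names and their order are assumptions based on the description above:

```python
# Decomposing a compositional tag such as the Hindi example
# NC.mas.sg.dir.0.n.n into its taxonomic levels. The attribute
# order follows the description in the text; the field names
# are illustrative assumptions.
FIELDS = ["category_type", "gender", "number", "case",
          "case_marker", "distributive", "honorificity"]

def decompose(tag):
    """Split a dot-separated compositional tag into named attributes."""
    return dict(zip(FIELDS, tag.split(".")))

print(decompose("NC.mas.sg.dir.0.n.n"))
```

The inverse operation, joining the attribute values with dots, reconstructs the tag, so the representation is both analysable and compositional in Leech's sense.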
Leech & Smith (1990: 27) point out that syntactic parsing is arguably the central task of NLP, and that POS tagging, being a prerequisite to parsing, is "the most central area of corpus processing". Though a POS tag has limited scope for syntactic disambiguation, it shares a network of relationships with various other intermediate tasks in constructing an optimal system. In the architecture of corpus annotation, different levels of annotation feed POS tagging, and vice versa. Being a mid-level NLP process, POS tagging ideally can and should make use of lexical-level processes and should yield results suited to syntactic parsing. Therefore, both these processes are to be considered in designing POS tagsets.
Bharati et al. (2006), among others, show that features from a Morph Analyser can be used to enhance the performance of a POS tagger. In fact, they argue that the Morph Analyser itself can identify the parts of speech in most cases, and that a POS tagger can be used to disambiguate the multiple answers provided by the Morph Analyser. In turn, POS-tagged data is used for other higher-level processes such as chunking and parsing.
A similar view on POS tagging is expressed by Kavi Narayana Murthy (p.c.), who considers POS annotation a mid-level process depending on lexical-level annotation and processes. In his view, the Lexicon can, in a sense, be considered a tagger in that it tags root forms of words/lemmata with 'all possible' tags. The Morph Analyser deals with inflected and/or derived forms of words and assigns 'all possible' tags to all valid forms of all valid words in a given language. Given this, a tagger that tags words/tokens in running text can be viewed as a disambiguator rather than as an assigner of tags. All possible tags have already been assigned by the Lexicon/Morph Analyser, and the task of POS-level annotation is only to eliminate, or at least reduce, ambiguities, if any. Further, he holds that this approach to POS tagging has several advantages:
- Impossible tags are never assigned.
- Words which have only one possible tag need not even be considered; only ambiguous cases need be handled by the tagger.
- The degree and nature of ambiguities can be studied, both at the root word level (from the Lexicon) and at the running text level (from a tagged corpus).
- Not all ambiguities are of the same nature, so different strategies can be formulated to handle them. For example, some kinds of ambiguity are easily resolved using local context, while others may inherently require long-distance dependencies to be considered.
- In the context of Indian languages, he emphasises that they are morphologically very rich and that a Morph Analyser is the most essential processing component: it can substantially reduce the ambiguities present at the Lexicon level, though it can also introduce some ambiguities of its own. However, the ambiguities introduced by the Morph Analyser are uniform and fully rule-governed. This helps in designing a judicious combination of linguistic (rule-based/knowledge-based) and statistical/machine-learning approaches.
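The tagger-as-disambiguator architecture described above can be sketched in a few lines. In this toy sketch the lexicon entries, the tags, and the single local-context rule are all illustrative assumptions, not a real tagset or grammar:

```python
# A sketch of the "tagger as disambiguator" view: the Lexicon/Morph
# Analyser assigns all possible tags, and the tagger only resolves
# the ambiguous cases, here with a toy local-context rule.
# Words, tags, and the rule are illustrative assumptions.
LEXICON = {
    "the": {"DET"},
    "can": {"AUX", "NN"},   # ambiguous: auxiliary verb vs. noun
    "run": {"VB", "NN"},    # ambiguous: verb vs. noun
    "dog": {"NN"},
}

def tag(sentence):
    output = []
    prev = None
    for word in sentence:
        candidates = LEXICON[word]   # all possible tags, pre-assigned
        if len(candidates) == 1:
            # Unambiguous words need not be considered by the tagger.
            chosen = next(iter(candidates))
        else:
            # Toy disambiguation rule: after a determiner, prefer NN.
            # Impossible tags are never assigned, since the tagger
            # only chooses among the Lexicon's candidates.
            if prev == "DET" and "NN" in candidates:
                chosen = "NN"
            else:
                chosen = sorted(candidates)[0]
        output.append((word, chosen))
        prev = chosen
    return output

print(tag(["the", "can", "run"]))
```

Because the tagger only ever selects from the candidate sets supplied by the Lexicon/Morph Analyser, the advantages listed above fall out directly: impossible tags cannot be produced, unambiguous words bypass the tagger, and the remaining ambiguities can be inventoried and matched to suitable rule-based or statistical strategies.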
[From: Mallikarjun, B., Yoonus, M., Sinha, Samar & Vadivel, A. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Languages, pp. 2-4. ISBN 81-7342-197-8]