Lingua Parasitica: POS Annotation

The ultimate goal of research on Natural Language Processing (henceforth, NLP) is to understand human language, to facilitate human-machine interaction through natural human language (weak AI), and to model theory of mind (strong AI). To achieve such a Promethean mission, research on NLP has focussed on various intermediate tasks that make partial sense of language structure without requiring complete understanding, and that thereby contribute to the development of a successful system. Part-of-Speech (henceforth, POS) tagging is one such task.

In corpus linguistics, POS tagging, also called grammatical tagging or word-category disambiguation, is a classification process that marks up each word in a text corpus as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, clause, or sentence. It is the most common form of corpus annotation and is widely accepted as the first stage of a more comprehensive syntactic annotation. It serves a wide range of applications, such as speech synthesis and recognition, information extraction, partial parsing, machine translation, and lexicography.

However, it is important to remember that a POS tag is different from a part-of-speech label (as understood in general parlance). The latter captures the basic grammatical category of a word/token in a given language without any specific information about its morphosyntactic content or about punctuation markers. A POS tag, on the other hand, annotates a word/token in its entirety, following the writing conventions of the language/script/orthography, which include punctuation and other conventions followed in writing. Hence, it also needs to accommodate many issues that are not linguistic but arise from writing conventions.

As a process, POS tagging assigns a tag to a specific unit of natural language text. Hence, the text to be tagged is first passed through a tokeniser, which applies various formatting rules to divide the text into tokens, i.e., units of written material separated by white space.
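The tokenisation step can be sketched in a few lines of Python. This is a minimal illustration, not the tokeniser used by any particular tagger: the function name `tokenise` and the single formatting rule it applies (setting punctuation marks apart from adjacent words) are assumptions made for the example.

```python
import re

def tokenise(text):
    """Divide text into tokens separated by white space.

    Assumed formatting rule (not from the source): each punctuation mark
    listed below is set apart from adjacent words so that it becomes a
    token of its own.
    """
    # Surround every listed punctuation mark with spaces.
    spaced = re.sub(r'([.,;:!?()"])', r" \1 ", text)
    # str.split() without arguments splits on any run of white space.
    return spaced.split()

print(tokenise("The dog barked, and the cat ran."))
```

A rule this crude would also split abbreviations and decimal numbers at the full stop, which is one reason real tokenisers apply a larger set of formatting rules.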

At a formal level of description, POS tagging can be stated as in (1), where a sequence of tokens W= w1...wn corresponds to a sequence of tags T=t1...tn, drawn from a set of tags {T}.

(1) S = argmax_{t1...tn} P(t1...tn | w1...wn) (Dandapat 2008: 4)
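To make the argmax in (1) concrete, the sketch below enumerates every candidate tag sequence for a short sentence and keeps the highest-scoring one. Everything in it is an assumption made for illustration: the three-tag set, the probability values, and the scoring model itself, which factorises the score into emission and transition probabilities in the manner of a hidden Markov model. The source does not prescribe any of these.

```python
from itertools import product

TAGS = ["DET", "NOUN", "VERB"]  # hypothetical tagset

# Hypothetical toy probabilities, invented for this example only.
EMISSION = {  # P(word | tag)
    ("the", "DET"): 0.9,
    ("dog", "NOUN"): 0.5,
    ("barks", "VERB"): 0.4,
    ("barks", "NOUN"): 0.1,
}
TRANSITION = {  # P(tag | previous tag); "<s>" marks the sentence start
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05, ("DET", "DET"): 0.05,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.1,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}

def score(words, tags):
    """Unnormalised score for one tag sequence under the toy model."""
    total = 1.0
    previous = "<s>"
    for word, tag in zip(words, tags):
        # Pairs missing from the tables are treated as impossible (0.0).
        total *= TRANSITION.get((previous, tag), 0.0)
        total *= EMISSION.get((word, tag), 0.0)
        previous = tag
    return total

def best_sequence(words):
    """Return the tag sequence t1...tn that maximises the score."""
    candidates = product(TAGS, repeat=len(words))
    return max(candidates, key=lambda tags: score(words, tags))

print(best_sequence(["the", "dog", "barks"]))
```

Exhaustive enumeration grows exponentially with sentence length, so it is only workable for toy inputs; taggers built on this kind of model typically use dynamic programming, such as the Viterbi algorithm, to find the same maximum.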

This description (1) implies that the input to the tagger is a whole sentence and the output is a whole sequence of tags. Such a formalisation also assumes that tagging is an independent process, independent of the dictionary and morphology. The task is simply to assign tags, so a word/token may be tagged with a category that it can never have; for example, "in" can be tagged as a verb by a POS tagger. The formalisation makes explicit, however, that a token is given a tag in the context of the adjacent tokens, and that no relationship is posited between a token and a tag on the basis of the token's morphosyntactic cues. Tagsets, on the other hand, are designed to capture finer morphosyntactic details; consequently, a large number of tags are devised as a relationship between a token and a tag, without any dependence on the token's contextual position. Under such criteria, POS tagging can be stated as:

(2) S = argmax P (t | w) (Kavi Narayana Murthy, p.c.)
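Formulation (2) can be illustrated with a context-free tagger that picks, for each token, the tag it most often carries in a hand-tagged sample. This is a sketch under stated assumptions: the sample data, the tag names, the function names `train` and `tag_token`, and the "UNK" fallback for unseen words are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sample; words and tag names are illustrative.
TAGGED_SAMPLE = [
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
    ("they", "PRON"), ("run", "VERB"), ("daily", "ADV"),
    ("we", "PRON"), ("run", "VERB"),
]

def train(tagged_tokens):
    """Count how often each word occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return counts

def tag_token(counts, word, default="UNK"):
    """Return argmax over t of P(t | w), estimated from relative counts."""
    if word not in counts:
        return default  # assumed fallback for words absent from the sample
    return counts[word].most_common(1)[0][0]

counts = train(TAGGED_SAMPLE)
print([(word, tag_token(counts, word)) for word in ["the", "run", "was"]])
```

Because the decision depends on the token alone, "run" receives the same tag in every sentence, whatever its neighbours are. That is exactly the independence from contextual position that (2) expresses.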

On the other hand, a large number of POS tagsets are designed to annotate on the basis of both the form and the function of a token in a given clause/sentence. Consequently, such tagging is based neither solely on form nor solely on function, and can be formally expressed as:

(3) S = argmax_{t1...tn} P(ti | w1...wi...wn) (Kavi Narayana Murthy, p.c.)

The basic requirements for POS tagging are a POS tagset, a tagging scheme, and practical definitions of each tag and tag element, with the words and contexts to which each tag and tag element applies. Manual annotation requires an annotation tool with a graphical user interface (GUI), designed to assign tags from a specific tagset. Automatic tagging requires a tagger, a program that assigns a tag to each token in the corpus by implementing the tagset and tagging scheme in a tag assignment algorithm. Ideally, an automatic tagger is “trained” by giving it the results of a manually annotated corpus. The tagger then tags unseen text corpora using either a set of rules or a statistical analysis of the manually tagged corpus. A large number of methods, techniques and free/open source tools are available for automatic tagging (visit http://www-nlp.stanford.edu/links/statnlp.html#Taggers).

Glossary

Ambiguity: In computational linguistics, ambiguity refers to a state in which more than one tag can be assigned to a given token.

Annotation Tool: A tool used for tagging.

Decomposable: A tag is known as decomposable if the string representing the tag contains one or more shorter sub-strings that are meaningful out of the context of the original tag. It is a desirable feature of a hierarchical tagset.

Hierarchical: The term “hierarchical”, when used of a tagset, means that the categories in that tagset are structured relative to one another. Rather than a large number of independent categories, a hierarchical tagset will contain a small number of categories, each of which contains a number of sub-categories, each of which may contain sub-sub-categories, and so on, in a tree-like structure (Hardie 2003: 48).

Lexicon: A list of possible tags for the root forms of all the valid words in a given language.

Local Token Grouping: A group of tokens that form part of a single linguistic word.

Morph Analyser: A tool that splits a given word into its constituent morphemes and identifies their corresponding grammatical categories.

Multi Token Word: A collection of separate tokens that together form a single lexical expression in a language, although they are written separately. Independently, these tokens may have a meaning of their own or none at all, but that meaning differs from the one they have as a single lexical expression.

Part-of-speech: Categories [that] group lexical items which perform similar grammatical functions (Greene & Rubin 1971: 3).

POS Tag: A POS label given to a token (optionally along with its morphosyntactic attributes).

Pre-processing: A process of normalisation of text before tokenisation.

Tag element: A part of a tag that provides information about one of the individual elements that make up the tag. Prototypically, it includes Type and other morpho-syntactic Attributes.

Tagset: A set of defined tags. A set of word categories to be applied to the word tokens of a text (Hardie 2003).

Tagging: The process of assigning a tag to a token. Also known as annotation.

Token: A printed item separated by white space.

Training corpus: A manually annotated corpus on which an automatic or semi-automatic tagger is trained to acquire linguistic knowledge.

Underspecification: The lack of a feature in a given tagset in comparison with another tagset.
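Several of the glossary terms (Ambiguity, Decomposable, Hierarchical, Lexicon) can be tied together in a short sketch. The tag format used here, with hierarchy levels joined by dots from coarse to fine, is a hypothetical convention invented for the example and is not taken from any published tagset; the lexicon entries and function names are likewise assumptions.

```python
# Hypothetical lexicon: each word maps to the set of tags it may receive.
# Tags are written as dot-separated levels, coarse to fine (assumed format).
LEXICON = {
    "run": {"N.common.sg", "V.main.base"},
    "the": {"DET.def"},
}

def decompose(tag):
    """Split a decomposable tag into its meaningful sub-strings."""
    return tag.split(".")

def falls_under(tag, ancestor):
    """True if `tag` is `ancestor` itself or one of its sub-categories."""
    tag_levels = decompose(tag)
    ancestor_levels = decompose(ancestor)
    # Compare whole levels, so "N.c" does not count as a parent of "N.common".
    return tag_levels[:len(ancestor_levels)] == ancestor_levels

def is_ambiguous(word):
    """A token is ambiguous if the lexicon offers more than one tag for it."""
    return len(LEXICON.get(word, set())) > 1

print(decompose("N.common.sg"))
print(falls_under("N.common.sg", "N"))
print(is_ambiguous("run"), is_ambiguous("the"))
```

With decomposable tags, a coarse query such as "all nouns" reduces to a prefix comparison over levels, which is one practical reason decomposability is desirable in a hierarchical tagset.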

[From: Mallikarjun, B, Yoonus, M., Sinha, Samar & A, Vadivel. 2010. Indian Languages and Part-of-Speech Annotation. Mysore: Linguistic Data Consortium for Indian Languages, pp. 1-2, 26. ISBN 81-7342-197-8]


Sunday, November 21, 2010
