

Task Overview

A text chunker divides sentences into phrases: nonoverlapping, nonrecursive sequences of consecutive words that are syntactically related. In the early nineties, [Abney(1991)] suggested using chunking as a preprocessing step for a parser. Ten years later, most statistical parsers contained a chunking phase (for example, [Ratnaparkhi(1998)]). In this study, we divide chunking into two subtasks: finding only noun phrases and identifying arbitrary chunks.

Machine learning approaches to noun phrase chunking started with work by [Church(1988)], who used bracket frequencies associated with POS tags to find noun phrase boundaries in text. In an influential paper on chunking, [Ramshaw and Marcus(1995)] showed that chunking can be regarded as a tagging task. Even more importantly, they proposed training and test data sets that are still being used for comparing text chunking methods. These data sets were extracted from the Wall Street Journal part of the Penn Treebank II corpus [Marcus et al.(1993)]: sections 15-18 serve as training data and section 20 as test data. In principle, the noun phrase chunks present in the material are noun phrases that do not include other noun phrases, with initial material (determiners, adjectives, etc.) up to the head but without postmodifying phrases (prepositional phrases or clauses) [Ramshaw and Marcus(1995)].

The noun phrase chunking data produced by [Ramshaw and Marcus(1995)] contain a couple of nontrivial features. First, unlike in the Penn Treebank, possessives between two noun phrases have been attached to the second noun phrase rather than the first. In the example ( Nigel Lawson ) ( 's restated commitment ), where round brackets mark chunk boundaries, the possessive 's has been moved from Nigel Lawson to restated commitment. Second, the Treebank annotation sometimes leads to unexpected noun phrase chunks: in British Chancellor of ( the Exchequer ) Nigel Lawson, only one noun chunk has been marked. The problem here is that neither British Chancellor nor Nigel Lawson has been annotated as a separate noun phrase in the Treebank. Both British ... Exchequer and British ... Lawson are annotated as noun phrases in the Treebank, but these phrases could not be used as noun chunks because they contain the smaller noun phrase the Exchequer.

[Ramshaw and Marcus(1995)] proposed encoding chunks with tags: I for words that are inside a noun chunk and O for words that are outside a chunk. When one noun phrase immediately follows another, they use the tag B on the first word of the second phrase in order to show that a new phrase starts there. With the three tags I, O and B, any chunk structure can be encoded. This representation has two advantages. First, it enables trainable POS taggers to be used as chunkers by simply changing their training data. Second, it avoids the consistency errors that appear with the bracket representation, where open and close brackets generated by the learner may not be balanced. Here is an example sentence, first with noun phrases encoded by pairs of brackets and then with the Ramshaw and Marcus IOB representation:

In ( early trading ) in ( Hong Kong ) ( Monday ) , ( gold ) was quoted
at ( $ 366.50 ) ( an ounce ) .

In/O early/I trading/I in/O Hong/I Kong/I Monday/B ,/O gold/I was/O quoted/O
at/O $/I 366.50/I an/B ounce/I ./O
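The encoding step can be sketched in a few lines of code. This is a minimal illustration, not code from the paper; the function name and the span format (start, end) with exclusive end are my own choices:

```python
def to_iob1(tokens, chunks):
    """Encode non-overlapping chunks as IOB1 tags.

    tokens: list of words; chunks: sorted list of (start, end) index
    pairs, end exclusive. In IOB1, I marks words inside a chunk and B
    is only used when a chunk immediately follows another chunk.
    """
    tags = ["O"] * len(tokens)
    prev_end = None
    for start, end in chunks:
        for i in range(start, end):
            tags[i] = "I"
        if start == prev_end:  # chunk directly follows the previous one
            tags[start] = "B"
        prev_end = end
    return tags

tokens = ("In early trading in Hong Kong Monday , gold was quoted "
          "at $ 366.50 an ounce .").split()
chunks = [(1, 3), (4, 6), (6, 7), (8, 9), (12, 14), (14, 16)]
print(" ".join(to_iob1(tokens, chunks)))
# O I I O I I B O I O O O I I B I O
```

Note that B appears only twice, on Monday and an, the two places where one noun chunk borders the next.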

[Tjong Kim Sang and Veenstra(1999)] present three variants of the Ramshaw and Marcus representation and show that the bracket representation can also be regarded as a tagging representation with two streams of brackets. They named the variants IOB2, IOE1 and IOE2, and used IOB1 as the name for the Ramshaw and Marcus representation. IOB2 is the same as IOB1 except that every chunk-initial word receives the tag B. IOE1 differs from IOB1 in that, rather than the tag B, a tag E is used for the final word of a noun chunk that is immediately followed by another chunk. IOE2 is a variant of IOE1 in which every chunk-final word is tagged with E. The bracket representations use open brackets for phrase-initial words, close brackets for phrase-final words and a period for all other words. Table 1 contains the tag sequences for all six representations applied to the example sentence.


Table 1: The chunk tag sequences for the example sentence In early trading in Hong Kong Monday , gold was quoted at $ 366.50 an ounce . for six different tagging formats. The I tag has been used for words inside a chunk, O for words outside a chunk, B and [ for chunk-initial words, E and ] for chunk-final words, and periods for words that are neither chunk-initial nor chunk-final.

     In early trading in Hong Kong Monday , gold was quoted at $ 366.50 an ounce .
IOB1  O   I    I     O   I    I    B    O   I    O    O    O  I   I     B   I    O
IOB2  O   B    I     O   B    I    B    O   B    O    O    O  B   I     B   I    O
IOE1  O   I    I     O   I    E    I    O   I    O    O    O  I   E     I   I    O
IOE2  O   I    E     O   I    E    E    O   E    O    O    O  I   E     I   E    O
O     .   [    .     .   [    .    [    .   [    .    .    .  [   .     [   .    .
C     .   .    ]     .   .    ]    ]    .   ]    .    .    .  .   ]     .   ]    .
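Each of these tag sequences can be decoded back into chunk spans without ambiguity. A minimal sketch for the IOB1 case (the function name and span format are illustrative, not from the paper):

```python
def iob1_to_chunks(tags):
    """Decode an IOB1 tag sequence into (start, end) chunk spans,
    end exclusive. A chunk starts at a B tag, or at an I tag that is
    not preceded by I or B; it ends before the next O or B tag."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and (i == 0 or tags[i - 1] == "O")):
            if start is not None:      # B closes the running chunk
                chunks.append((start, i))
            start = i                  # and opens a new one
        elif tag == "O" and start is not None:
            chunks.append((start, i))
            start = None
    if start is not None:              # chunk running at sentence end
        chunks.append((start, len(tags)))
    return chunks

print(iob1_to_chunks(list("OIIOIIBOIOOOIIBIO")))
# [(1, 3), (4, 6), (6, 7), (8, 9), (12, 14), (14, 16)]
```

The spans correspond to the bracketed phrases of the example sentence: ( early trading ), ( Hong Kong ), ( Monday ), ( gold ), ( $ 366.50 ) and ( an ounce ).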


The representation variants are interesting because a learner will make different errors when trained on data encoded in different representations. This means that we can train one learner with five data representations and obtain five different analyses of the data, which we can combine with system combination techniques. Thus the different data representations may enable us to improve the performance of the chunker. The data representations can be used both for noun phrase chunking and for arbitrary chunking. In the latter task, more than one chunk type exists, so the tags need to be expanded with type-specific suffixes, for example B-VP, I-VP, E-VP, [-VP and ]-VP.
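The four I/O encodings differ only in where they place the B and E tags, so a single encoder can produce all of them, with an optional type suffix for the arbitrary chunking case. This is a sketch under my own naming conventions, not code from the cited papers:

```python
def encode(n_tokens, chunks, style, chunk_type=None):
    """Encode non-overlapping (start, end) chunk spans (end exclusive)
    in one of the styles IOB1, IOB2, IOE1 or IOE2. If chunk_type is
    given, non-O tags get a type suffix, e.g. I-VP."""
    tags = ["O"] * n_tokens
    prev_end = None
    for start, end in chunks:
        for i in range(start, end):
            tags[i] = "I"
        if style == "IOB2":
            tags[start] = "B"              # every chunk-initial word
        elif style == "IOE2":
            tags[end - 1] = "E"            # every chunk-final word
        elif style == "IOB1" and start == prev_end:
            tags[start] = "B"              # only between adjacent chunks
        elif style == "IOE1" and start == prev_end:
            tags[start - 1] = "E"          # only between adjacent chunks
        prev_end = end
    if chunk_type:
        tags = [t if t == "O" else t + "-" + chunk_type for t in tags]
    return tags

chunks = [(1, 3), (4, 6), (6, 7), (8, 9), (12, 14), (14, 16)]
for style in ("IOB1", "IOB2", "IOE1", "IOE2"):
    print(style, " ".join(encode(17, chunks, style)))
```

Running this reproduces the four I/O rows of Table 1; `encode(3, [(0, 3)], "IOB2", "VP")` illustrates the suffixed tags used for arbitrary chunking, yielding B-VP I-VP I-VP.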

The arbitrary chunking task was more difficult to design because many interesting phrase types often contain parts which belong to other phrases [Tjong Kim Sang and Buchholz(2000)]. For example, verb phrases may contain noun phrases, and prepositional phrases often include a noun phrase. Furthermore, noun phrases may contain quantifier or adjective phrases which prevent them from being identified as noun chunks. The noun, verb and prepositional phrases should be included, and therefore the following measures were taken when constructing the data for the arbitrary chunking task. First, some phrase types, for example quantifier phrases and adjective phrases, have been removed from places where they prevented the identification of noun phrases. This made it possible to annotate more phrases as noun chunks. Second, some phrase types in the annotated data, for example verb phrases and prepositional phrases, lack material that has already been included in a phrase of another type. Third, adjacent verb clusters have been put in one flat verb phrase, unlike in the Treebank, where often each verb starts a new phrase. Fourth, adverbial phrase boundaries have been removed from adjective phrases and verb phrases so that all material can be included in the mother phrase.

This chunk definition scheme generates data in which most of the tokens are assigned to a chunk of some type; the odd tokens that fall out are usually punctuation signs. The scheme has been used for generating the training and test data of the CoNLL-2000 shared task [Tjong Kim Sang and Buchholz(2000)]. The data contain the same segments of the Wall Street Journal part of the Penn Treebank as the noun phrase data of [Ramshaw and Marcus(1995)]: sections 15-18 as training data and section 20 as test data. We will use these data sets in our arbitrary chunking experiments.

The training and the test data contain two types of features: words and POS tags. The words have been taken from the Penn Treebank. The POS tags of the Treebank have been manually checked, and for that reason they should not be used in the chunking data: in future applications, the chunking process will be applied to text with POS tags that have been generated automatically. These POS tags will contain errors, so the performance of the chunker will be worse than on Treebank text with manually checked POS tags. If we want to obtain realistic performance rates, we should work with automatically generated POS tags in our shallow parsing experiments. In conformance with earlier work such as that of [Ramshaw and Marcus(1995)], we have used POS tags that were generated by the Brill tagger [Brill(1994)].


Erik Tjong Kim Sang 2002-03-13