Table 7 shows a selection of the best results published for the noun phrase chunking task. As far as we know, the results presented in this paper (line MBL) are the third-best results. We have participated in producing the second-best result [Tjong Kim Sang et al.(2000)], which was produced by combining the results of five different learning techniques. The best results for this data set have been generated with Support Vector Machines [Kudoh and Matsumoto(2001)]. A statistical analysis of our current result revealed that all performances outside the region 92.93-93.73 are significantly different from ours. This means that all results in the table, except the one with F = 93.26, are significantly different from ours.
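In code, this check amounts to testing interval membership. The following sketch illustrates it; the region comes from our analysis, while the scores other than 93.26 are hypothetical examples rather than values from Table 7:

```python
# Sketch of the significance check as an interval membership test. The
# region 92.93-93.73 comes from the analysis in the text; apart from
# 93.26, the example scores are hypothetical, not values from Table 7.
LOW, HIGH = 92.93, 93.73

def differs_significantly(f_score: float) -> bool:
    """A score outside the significance region differs significantly from ours."""
    return not (LOW <= f_score <= HIGH)

for f in (93.26, 91.00, 94.10):  # 91.00 and 94.10 are made-up examples
    status = "differs" if differs_significantly(f) else "does not differ"
    print(f"F = {f:.2f} {status} significantly from ours")
```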
A topic to which we have paid little attention is the analysis of the errors that our approach makes. Such an analysis would provide insight into the weaknesses of the system and might provide clues to methods for improving it. For noun phrase chunking we have performed a limited error analysis by manually evaluating the errors made in the first section of a 10-fold cross-validation experiment on the training data, using the chunker described by [Tjong Kim Sang(2000a)]. This analysis revealed that the majority of the errors were caused by errors in the part-of-speech tags (28% of the false positives/29% of the false negatives). In order to acquire reasonable results, it is customary not to use the part-of-speech tags from the Treebank, but to use tags that have been generated by a part-of-speech tagger. This prevents the system performance from reaching levels that would be unattainable for texts for which no perfect part-of-speech tags are available. Unfortunately, the tagger makes errors, and some of these cause the noun phrase segmentation to become incorrect.
The second most frequent cause of errors was conjunctions of noun phrases (16%/18%). Deciding whether a phrase like red dwarfs and giants consists of one or two noun phrases requires semantic knowledge and might be too ambitious for present-day systems to solve. The other major causes of errors all relate to similarly hard cases: attachment of punctuation signs (15%/12%; treated inconsistently in the Treebank), deciding whether ambiguous phrases without conjunctions should be one or two noun phrases (11%/12%), adverb attachment (5%/4%), noun phrases containing the word to (3%/3%), Treebank noun phrase segmentation errors (3%/1%) and noun phrases consisting of the word that (0%/2%). Apart from these hard cases there were also quite a few errors for which we could not determine an obvious cause (19%/19%).
The most obvious suggestion for improvement that came out of the error
analysis was to use a better part-of-speech tagger.
We are currently using the Brill tagger [Brill(1994)].
Better taggers are available nowadays, but we used the Brill tags here
to enable a comparison of our approach with earlier studies, which
used the Brill tags as well.
The error analysis did not produce other immediate suggestions for
improving our noun phrase chunking approach.
This is a relief, since systematic errors produced by our chunker
would have been an embarrassment.
However, there is a trivial way to improve the results of the noun
phrase chunker: by using more training data.
Different studies have shown that increasing the training data size
by 300% may cause the F error rate to drop by as much as 25%
[Ramshaw and Marcus(1995),Tjong Kim Sang(2000a),Kudoh and Matsumoto(2001)].
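To make this effect concrete, the following back-of-the-envelope sketch projects such a 25% error reduction; the starting F rate of 93.0 is a hypothetical value chosen for illustration, not a result from the table:

```python
# Back-of-the-envelope sketch of the reported effect: quadrupling the
# training data (a 300% increase) cutting the F error (100 - F) by 25%.
# The starting F of 93.0 is a hypothetical value, not a result from Table 7.
f_current = 93.0
error = 100.0 - f_current            # F error: 7.0 points
error_after = error * (1.0 - 0.25)   # 25% relative reduction: 5.25 points
f_projected = 100.0 - error_after    # projected F: 94.75
print(f"F {f_current:.2f} -> {f_projected:.2f} with 4x the training data")
```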
Another study for a different problem, confusion set disambiguation,
has shown that a further cut in the error rate is possible with even
larger training data sets [Banko and Brill(2001)].
In order to test this for noun phrase chunking, we would need a
hand-parsed corpus larger than anything presently available.
Table 8 contains a selection of the best results published
for the arbitrary chunking data used in the CoNLL-2000 shared
task. Our chunker [Tjong Kim Sang(2000b)] is the fifth-best on this list.
Immediately obvious is the imbalance between precision and recall:
the system identifies relatively few phrases, but with high precision.
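Since the F rate is the harmonic mean of precision and recall, F(β=1) = 2 × precision × recall / (precision + recall), such an imbalance pulls F below the higher of the two scores: for example, a hypothetical precision of 95% combined with a recall of 90% yields F ≈ 92.4, closer to the lower value.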
We assume that this imbalance is primarily caused by our method for
generating balanced structures from streams of open and close brackets.
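The following hypothetical sketch illustrates the general idea of such a pairing step; it is not the exact procedure used by our chunker, but it shows how discarding unmatched brackets leads to fewer phrases at higher precision:

```python
# Hypothetical sketch: turn predicted open/close bracket positions into a
# balanced, non-crossing phrase structure by greedy pairing. Brackets that
# cannot be paired are dropped. This illustrates the general idea only; it
# is not the exact procedure used by our chunker.
def pair_brackets(opens, closes, n_tokens):
    """opens/closes: token positions with predicted brackets.
    Returns (start, end) spans for brackets that could be paired."""
    phrases, stack = [], []
    open_set, close_set = set(opens), set(closes)
    for i in range(n_tokens):
        if i in open_set:
            stack.append(i)
        if i in close_set and stack:
            phrases.append((stack.pop(), i))  # close the innermost open phrase
    return phrases

# Example: opens predicted at tokens 0 and 3, closes at tokens 1 and 4
print(pair_brackets([0, 3], [1, 4], 6))  # [(0, 1), (3, 4)]
```

Because only phrases with both a matched open and a matched close bracket survive, a scheme of this kind naturally produces fewer phrases at higher precision.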
We have performed a bootstrap resampling test on the chunk tag
sequence associated with this result.
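A minimal sketch of one common variant of such a test is given below; it assumes that per-sentence counts of true positives, false positives and false negatives are available, and the percentile method shown is illustrative rather than a description of our exact procedure:

```python
# Minimal sketch of a bootstrap significance interval for the F rate,
# assuming per-sentence (tp, fp, fn) counts are available. Names and the
# percentile method are illustrative, not necessarily our exact procedure.
import random

def f_rate(tp, fp, fn):
    """F(beta=1) as a percentage, from true/false positive/negative counts."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 100.0 * 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def bootstrap_interval(sentences, n_samples=10000, alpha=0.05):
    """sentences: list of (tp, fp, fn) tuples, one per sentence.
    Resamples sentences with replacement and returns a (1 - alpha)
    percentile interval for the overall F rate."""
    scores = []
    for _ in range(n_samples):
        sample = random.choices(sentences, k=len(sentences))
        tp, fp, fn = (sum(col) for col in zip(*sample))
        scores.append(f_rate(tp, fp, fn))
    scores.sort()
    lo = scores[int(alpha / 2 * n_samples)]
    hi = scores[int((1 - alpha / 2) * n_samples) - 1]
    return lo, hi
```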
An evaluation of 10,000 pairs indicated that the significance interval
for our system (F = 92.50) is 92.18-92.81, which means that
all systems ahead of ours perform significantly better and all
systems behind ours perform significantly worse.
We are not sure what is causing these large performance differences.
At this moment we assume that our approach has difficulty with
classification tasks when the number of different output classes
increases.