Table 7 shows a selection of the best results published for the noun phrase chunking task. As far as we know, the results presented in this paper (line MBL) are the third-best results. We have participated in producing the second-best result [Tjong Kim Sang et al.(2000)], which was produced by combining the results of five different learning techniques. The best results for this data set have been generated with Support Vector Machines [Kudoh and Matsumoto(2001)]. A statistical analysis of our current result revealed that all performances outside the region 92.93-93.73 are significantly different from ours. This means that all results in the table, except the one with F = 93.26, are significantly different from ours.
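In code, this check amounts to testing interval membership. The following sketch illustrates it; the region comes from our analysis, while the scores other than 93.26 are hypothetical examples rather than values from Table 7:

```python
# Sketch of the significance check as an interval membership test. The
# region 92.93-93.73 comes from the analysis in the text; apart from
# 93.26, the example scores are hypothetical, not values from Table 7.
LOW, HIGH = 92.93, 93.73

def differs_significantly(f_score: float) -> bool:
    """A score outside the significance region differs significantly from ours."""
    return not (LOW <= f_score <= HIGH)

for f in (93.26, 91.00, 94.10):  # 91.00 and 94.10 are made-up examples
    status = "differs" if differs_significantly(f) else "does not differ"
    print(f"F = {f:.2f} {status} significantly from ours")
```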
A topic to which we have paid little attention is the analysis of the errors that our approach makes. Such an analysis would provide insight into the weaknesses of the system and might provide clues to methods for improving it. For noun phrase chunking we have performed a limited error analysis by manually evaluating the errors made in the first section of a 10-fold cross-validation experiment on the training data, using the chunker described by [Tjong Kim Sang(2000a)]. This analysis revealed that the majority of the errors were caused by errors in the part-of-speech tags (28% of the false positives/29% of the false negatives). In order to acquire reasonable results, it is customary not to use the part-of-speech tags from the Treebank, but to use tags that have been generated by a part-of-speech tagger. This prevents the system performance from reaching levels that would be unattainable for texts for which no perfect part-of-speech tags are available. Unfortunately, the tagger makes errors, and some of these cause the noun phrase segmentation to become incorrect.
The second most frequent cause of errors was conjunctions of noun phrases (16%/18%). Deciding whether a phrase like red dwarfs and giants consists of one or two noun phrases requires semantic knowledge and might be too ambitious for present-day systems to solve. The other major causes of errors all relate to similarly hard cases: attachment of punctuation signs (15%/12%; treated inconsistently in the Treebank), deciding whether ambiguous phrases without conjunctions should be one or two noun phrases (11%/12%), adverb attachment (5%/4%), noun phrases containing the word to (3%/3%), Treebank noun phrase segmentation errors (3%/1%) and noun phrases consisting of the word that (0%/2%). Apart from these hard cases there were also quite a few errors for which we could not determine an obvious cause (19%/19%).
The most obvious suggestion for improvement that came out of the error
analysis was to use a better part-of-speech tagger.
We are currently using the Brill tagger [Brill(1994)].
Better taggers are available nowadays, but we used the Brill tags here
to enable a comparison of our approach with earlier studies, which
used the Brill tags as well.
The error analysis did not produce other immediate suggestions for
improving our noun phrase chunking approach.
This is a relief, since systematic errors produced by our chunker
would have been an embarrassment.
However, there is a trivial way to improve the results of the noun
phrase chunker: by using more training data.
Different studies have shown that increasing the training data size
by 300% may cause the F error rate to drop by as much as 25%
[Ramshaw and Marcus(1995),Tjong Kim Sang(2000a),Kudoh and Matsumoto(2001)].
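To make this effect concrete, the following back-of-the-envelope sketch projects such a 25% error reduction; the starting F rate of 93.0 is a hypothetical value chosen for illustration, not a result from the table:

```python
# Back-of-the-envelope sketch of the reported effect: quadrupling the
# training data (a 300% increase) cutting the F error (100 - F) by 25%.
# The starting F of 93.0 is a hypothetical value, not a result from Table 7.
f_current = 93.0
error = 100.0 - f_current            # F error: 7.0 points
error_after = error * (1.0 - 0.25)   # 25% relative reduction: 5.25 points
f_projected = 100.0 - error_after    # projected F: 94.75
print(f"F {f_current:.2f} -> {f_projected:.2f} with 4x the training data")
```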
Another study for a different problem, confusion set disambiguation,
has shown that a further cut in the error rate is possible with even
larger training data sets [Banko and Brill(2001)].
In order to test this for noun phrase chunking, we would need a
hand-parsed corpus larger than anything presently available.
Table 8 contains a selection of the best results published
for the arbitrary chunking data used in the CoNLL-2000 shared
task. Our chunker [Tjong Kim Sang(2000b)] is the fifth-best on this list.
Immediately obvious is the imbalance between precision and recall:
the system identifies relatively few phrases, but with high precision.
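Since the F rate is the harmonic mean of precision and recall, F(β=1) = 2 × precision × recall / (precision + recall), such an imbalance pulls F below the higher of the two scores: for example, a hypothetical precision of 95% combined with a recall of 90% yields F ≈ 92.4, closer to the lower value.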
We assume that this imbalance is primarily caused by our method for
generating balanced structures from streams of open and close brackets.
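The following hypothetical sketch illustrates the general idea of such a pairing step; it is not the exact procedure used by our chunker, but it shows how discarding unmatched brackets leads to fewer phrases at higher precision:

```python
# Hypothetical sketch: turn predicted open/close bracket positions into a
# balanced, non-crossing phrase structure by greedy pairing. Brackets that
# cannot be paired are dropped. This illustrates the general idea only; it
# is not the exact procedure used by our chunker.
def pair_brackets(opens, closes, n_tokens):
    """opens/closes: token positions with predicted brackets.
    Returns (start, end) spans for brackets that could be paired."""
    phrases, stack = [], []
    open_set, close_set = set(opens), set(closes)
    for i in range(n_tokens):
        if i in open_set:
            stack.append(i)
        if i in close_set and stack:
            phrases.append((stack.pop(), i))  # close the innermost open phrase
    return phrases

# Example: opens predicted at tokens 0 and 3, closes at tokens 1 and 4
print(pair_brackets([0, 3], [1, 4], 6))  # [(0, 1), (3, 4)]
```

Because only phrases with both a matched open and a matched close bracket survive, a scheme of this kind naturally produces fewer phrases at higher precision.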
We have performed a bootstrap resampling test on the chunk tag
sequence associated with this result.
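A minimal sketch of one common variant of such a test is given below; it assumes that per-sentence counts of true positives, false positives and false negatives are available, and the percentile method shown is illustrative rather than a description of our exact procedure:

```python
# Minimal sketch of a bootstrap significance interval for the F rate,
# assuming per-sentence (tp, fp, fn) counts are available. Names and the
# percentile method are illustrative, not necessarily our exact procedure.
import random

def f_rate(tp, fp, fn):
    """F(beta=1) as a percentage, from true/false positive/negative counts."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 100.0 * 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def bootstrap_interval(sentences, n_samples=10000, alpha=0.05):
    """sentences: list of (tp, fp, fn) tuples, one per sentence.
    Resamples sentences with replacement and returns a (1 - alpha)
    percentile interval for the overall F rate."""
    scores = []
    for _ in range(n_samples):
        sample = random.choices(sentences, k=len(sentences))
        tp, fp, fn = (sum(col) for col in zip(*sample))
        scores.append(f_rate(tp, fp, fn))
    scores.sort()
    lo = scores[int(alpha / 2 * n_samples)]
    hi = scores[int((1 - alpha / 2) * n_samples) - 1]
    return lo, hi
```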
An evaluation of 10,000 pairs indicated that the significance interval
for our system (F = 92.50) is 92.18-92.81, which means that
all systems ahead of ours perform significantly better and all
systems behind ours perform significantly worse.
We are not sure what is causing these large performance differences.
At this moment we assume that our approach has difficulty with
classification tasks when the number of different output classes
increases.