A complete overview of the clause identification results of the CoNLL-2001 shared task can be found in Table 9 [Tjong Kim Sang and Déjean(2001)]. Our approach was the third best. A bootstrap resampling test with a population of 10,000 random samples generated from our results produced the 90% significance interval 66.66-68.95 for our system, which means that our result is not significantly different from the second-best result. The boosted decision trees used by [Carreras and Màrquez(2001)] performed considerably better than the other systems. In Section 4.1 we compared the performance of this system with ours and concluded that the performance differences were caused both by the choice of learning system and by a difference in the features chosen for representing the task.
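The bootstrap test mentioned above can be sketched as follows. This is an illustrative reconstruction, not the evaluation code of the shared task: the per-sentence (tp, fp, fn) count representation and the function names are assumptions introduced for this example.

```python
import random

def f_score(tp, fp, fn):
    # F(beta=1): harmonic mean of precision and recall, as a fraction in [0, 1]
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def bootstrap_interval(per_sentence, samples=10000, level=0.90, seed=0):
    """Resample sentences with replacement `samples` times and return the
    central `level` interval of the resulting F-score distribution."""
    rng = random.Random(seed)
    n = len(per_sentence)
    scores = []
    for _ in range(samples):
        tp = fp = fn = 0
        for _ in range(n):
            s_tp, s_fp, s_fn = per_sentence[rng.randrange(n)]
            tp += s_tp
            fp += s_fp
            fn += s_fn
        scores.append(f_score(tp, fp, fn))
    scores.sort()
    lo = scores[int((1 - level) / 2 * samples)]
    hi = scores[int((1 + level) / 2 * samples) - 1]
    return lo, hi
```

Two system results whose bootstrap intervals overlap, as in the comparison above, cannot be called significantly different at the chosen confidence level.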
The noun phrase parsing task has not received much attention in the
research community and there are only a few results to compare with.
[Osborne(1999)] used a grammar-extension method based on Minimal
Description Length and applied it to a Definite Clause Grammar.
His system used different training and test segments of the Penn
Treebank than we did.
At best, it obtained an F rate of 60.0 on the test data
(precision 53.2% and recall 68.7%).
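The F rates quoted in this section are the harmonic mean of precision and recall (beta = 1). A quick check, with a hypothetical helper name, reproduces Osborne's figure from his reported precision and recall:

```python
def f_beta1(precision, recall):
    # F(beta=1) = 2PR / (P + R), the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f_beta1(53.2, 68.7), 1))  # -> 60.0
```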
[Krymolowski and Dagan(2000)] applied a memory-based learning technique
specialized for learning sequences to a noun phrase parsing task.
Their system obtained F=83.7 (precision 88.5% and recall
79.3%) on yet another segment of the Treebank.
This performance is very close to that of our approach (F=83.79).
The memory-based sequence learner used much more training data than
ours (about four times as much) but, unlike our method, it generated
its output without using lexical information, which is impressive.
The performance of the Collins parser on the subtask of noun phrase
parsing which we mentioned in Section 4.2 (F=89.8) shows that there
is room for improvement left for all systems that were discussed here.
A selection of results for parsing the Penn Treebank can be found in
Table 10.
The F error rate of the best systems is about half of that
of ours.
A more detailed comparison of the output data of our memory-based
parser and one of the versions of the Collins parser
(model 2, [Collins(1999)]) has shown that the large performance
difference is caused by the way nonbase phrases are processed
[Tjong Kim Sang(2001b)].
Our chunker performs reasonably well compared with the first stage of
the Collins parser (F=49.30 compared with 49.85).
Especially at the first few levels after the base levels, our parser
loses F points compared with the Collins parser.
The initial difference of 0.65 at the base level grows to 2.92 after
three more levels, 5.16 after six and 6.13 after nine levels with a
final difference of 6.59 after 20 levels [Tjong Kim Sang(2001b)].
At the end of Section 4.3, we have put forward some
suggestions for improving our parser.
However, we have also noted that further improvement might not
be worthwhile because it would make our parser even slower than it
already is.