A complete overview of the clause identification results of the CoNLL-2001 shared task can be found in Table 9 [Tjong Kim Sang and Déjean(2001)]. Our approach was the third best. A bootstrap resampling test with a population of 10,000 random samples generated from our results produced the 90% significance interval 66.66-68.95 for our system, which means that our result is not significantly different from the second-best result. The boosted decision trees used by [Carreras and Màrquez(2001)] performed considerably better than the other systems. In Section 4.1 we compared the performance of this system with ours and concluded that the performance differences were caused both by the choice of learning system and by a difference in the features chosen for representing the task.
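The bootstrap test mentioned above can be sketched as follows. This is an illustrative reconstruction, not the evaluation code of the shared task: the per-sentence (tp, fp, fn) count representation and the function names are assumptions introduced for this example.

```python
import random

def f_score(tp, fp, fn):
    # F(beta=1): harmonic mean of precision and recall, as a fraction in [0, 1]
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def bootstrap_interval(per_sentence, samples=10000, level=0.90, seed=0):
    """Resample sentences with replacement `samples` times and return the
    central `level` interval of the resulting F-score distribution."""
    rng = random.Random(seed)
    n = len(per_sentence)
    scores = []
    for _ in range(samples):
        tp = fp = fn = 0
        for _ in range(n):
            s_tp, s_fp, s_fn = per_sentence[rng.randrange(n)]
            tp += s_tp
            fp += s_fp
            fn += s_fn
        scores.append(f_score(tp, fp, fn))
    scores.sort()
    lo = scores[int((1 - level) / 2 * samples)]
    hi = scores[int((1 + level) / 2 * samples) - 1]
    return lo, hi
```

Two system results whose bootstrap intervals overlap, as in the comparison above, cannot be called significantly different at the chosen confidence level.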
The noun phrase parsing task has not received much attention in the
research community and there are only a few results to compare with.
[Osborne(1999)] used a grammar-extension method based on Minimal
Description Length and applied it to a Definite Clause Grammar.
His system used different training and test segments of the Penn
Treebank than we did.
At best, it obtained an F rate of 60.0 on the test data
(precision 53.2% and recall 68.7%).
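The F rates quoted in this section are the harmonic mean of precision and recall (beta = 1). A quick check, with a hypothetical helper name, reproduces Osborne's figure from his reported precision and recall:

```python
def f_beta1(precision, recall):
    # F(beta=1) = 2PR / (P + R), the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f_beta1(53.2, 68.7), 1))  # -> 60.0
```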
[Krymolowski and Dagan(2000)] applied a memory-based learning technique
specialized for learning sequences to a noun phrase parsing task.
Their system obtained F=83.7 (precision 88.5% and recall
79.3%) on yet another segment of the Treebank.
This performance is very close to that of our approach (F=83.79).
The memory-based sequence learner used much more training data than
ours (about four times as much) but, unlike our method, it generated
its output without using lexical information, which is impressive.
The performance of the Collins parser on the subtask of noun phrase
parsing which we mentioned in Section 4.2 (F=89.8) shows that there
is room for improvement left for all systems that were discussed here.
A selection of results for parsing the Penn Treebank can be found in
Table 10.
The F error rate of the best systems is about half of that
of ours.
A more detailed comparison of the output data of our memory-based
parser and one of the versions of the Collins parser
(model 2, [Collins(1999)]) has shown that the large performance
difference is caused by the way nonbase phrases are processed
[Tjong Kim Sang(2001b)].
Our chunker performs reasonably well compared with the first stage of
the Collins parser (F=49.30 compared with 49.85).
Especially at the first few levels after the base levels, our parser
loses F points compared with the Collins parser.
The initial difference of 0.65 at the base level grows to 2.92 after
three more levels, 5.16 after six and 6.13 after nine levels with a
final difference of 6.59 after 20 levels [Tjong Kim Sang(2001b)].
At the end of Section 4.3, we have put forward some
suggestions for improving our parser.
However, we have also noted that further improvement might not
be worthwhile because it would make our parser even slower than it
already is.