We will compare the results of a shallow parser with an available
hand-parsed corpus.
For this purpose we will use the precision and recall of the phrases
in the results.
Precision is the percentage of phrases found by the learner that are
correct according to the corpus.
Recall is the percentage of corpus phrases found by the learner.
It is easier to optimize a system configuration based on one
evaluation score and therefore we combine precision and recall
in the F rate [Van Rijsbergen(1975)]:
$$F_{\beta} = \frac{(\beta^{2} + 1) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^{2} \cdot \mathrm{precision} + \mathrm{recall}} \qquad (3)$$

$\beta$ can be used for giving precision a larger ($\beta < 1$) or
smaller ($\beta > 1$) weight than recall.
We do not have a preference for one or the other and therefore we use
$\beta = 1$.
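As a concrete illustration, the sketch below computes precision, recall and the F rate of equation (3) from a set of phrases found by a learner and a set of corpus phrases. The `(start, end, type)` phrase representation and the function name are illustrative assumptions, not part of any particular evaluation software.

```python
from typing import Set, Tuple

# Illustrative phrase representation: (start index, end index, phrase type).
Phrase = Tuple[int, int, str]

def f_rate(found: Set[Phrase], gold: Set[Phrase], beta: float = 1.0) -> float:
    """Compute the F rate of equation (3) from learner output and corpus phrases."""
    correct = len(found & gold)                       # phrases found that match the corpus
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Example: two of the three corpus phrases are found, plus one spurious phrase.
gold = {(0, 1, "NP"), (2, 4, "VP"), (5, 6, "NP")}
found = {(0, 1, "NP"), (2, 4, "VP"), (5, 7, "NP")}
print(f_rate(found, gold))   # precision = recall = 2/3, so F(beta=1) = 2/3
```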
In previous work on shallow parsing, a word-related accuracy
rate has often been used as the evaluation criterion.
We do not believe that this is a good method for evaluating the results of
phrase detection algorithms.
Accuracy rates assign positive values to correctly identified
non-phrase words and to partially identified phrases.
Furthermore, they will produce different numbers for the same analysis
depending on the data representation used.
For these reasons, the relation between accuracy rates and F
rates is weak and preference should be given to the latter.
Accuracy rates have one advantage over F rates: standard
statistical tests can be used for determining if the difference
between two accuracy rates is significant.
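For example, if two systems were evaluated on independent test sets, the significance of an accuracy difference could be checked with a standard two-proportion z-test; the sketch below is a minimal illustration under that independence assumption (for paired comparisons on the same test data a test such as McNemar's would be more appropriate). The function name and the example counts are hypothetical.

```python
from math import erf, sqrt

def accuracy_difference_p_value(c1: int, n1: int, c2: int, n2: int) -> float:
    """Two-sided p-value of a two-proportion z-test on accuracies c1/n1 and c2/n2."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)                       # pooled accuracy under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Normal tail probability computed via the error function.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Example: 9300/10000 versus 9200/10000 correctly classified words.
print(accuracy_difference_p_value(9300, 10000, 9200, 10000))  # ~0.007: significant at 5%
```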
Accuracy is a relatively simple function

$$A = \frac{c}{n}$$

where $n$ is the number of items that have been processed and
$c$ is the number of items that received the correct class.
Unfortunately, $F_{\beta=1}$ is more complex: after some arithmetic
we get

$$F_{\beta=1} = \frac{2c}{n_f + n_c}$$

where $n_f$ is the number of
phrases found by the learner,
$c$ the number of phrases found
that were correct and $n_c$ the number of phrases in the corpus
according to some gold standard.
The values of $n_f$ and $n_c$ are upper bounds on the variable $c$.
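The arithmetic referred to is a direct substitution: with precision $= c/n_f$ and recall $= c/n_c$, equation (3) with $\beta = 1$ reduces as follows (the variable names are those introduced above):

$$F_{\beta=1} = \frac{2 \cdot \frac{c}{n_f} \cdot \frac{c}{n_c}}{\frac{c}{n_f} + \frac{c}{n_c}} = \frac{\frac{2c^{2}}{n_f n_c}}{\frac{c\,(n_c + n_f)}{n_f n_c}} = \frac{2c}{n_f + n_c}$$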
The complexity of the $F_{\beta=1}$ computation makes it hard to
apply standard statistical tests to $F_{\beta=1}$ rates.
[Yeh(2000)] offers a method for computing significance values for
F rate comparisons by using computationally intensive
randomization tests.
His approach requires test data classifications for all systems that
need to be compared.
Usually we only have access to the test data classifications of our
own system and therefore we have used a variant of the
randomization tests he presents: bootstrap resampling
[Noreen(1989)].
The basic idea of this approach is to regard the test data
classifications as a population of cases.
A random sample of this population can be created by arbitrarily
choosing cases with replacement.
We can create many random samples of the same size as the test data
and compute an average $F_{\beta=1}$ rate over the samples and a
standard deviation for this average.
These statistical measures can be used for deciding if the performance
of another system is significantly different from that of our own system.
Since we do not know whether the performance of our system follows a
normal distribution, we determine the
significance boundaries in such a way that 5% of the samples evaluate
worse (or better) than the chosen boundary.
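A minimal sketch of this bootstrap procedure is given below, assuming the test data classifications are stored per sentence as (phrases found, gold phrases) pairs; the function name and data layout are illustrative assumptions, and the percentile-based boundaries implement the 5% criterion without assuming normality.

```python
import random
from typing import List, Set, Tuple

Phrase = Tuple[int, int, str]
Case = Tuple[Set[Phrase], Set[Phrase]]   # (phrases found, gold phrases) for one sentence

def bootstrap_f_rates(cases: List[Case], samples: int = 1000,
                      seed: int = 1) -> List[float]:
    """Resample the test cases with replacement and return one F(beta=1) rate per sample."""
    rng = random.Random(seed)
    rates = []
    for _ in range(samples):
        sample = [rng.choice(cases) for _ in cases]   # same size as the test data
        n_found = sum(len(found) for found, _ in sample)
        n_gold = sum(len(gold) for _, gold in sample)
        correct = sum(len(found & gold) for found, gold in sample)
        # F(beta=1) = 2c / (n_f + n_c), as derived above.
        rates.append(2 * correct / (n_found + n_gold) if n_found + n_gold else 0.0)
    return sorted(rates)

# The sorted rates give the average, the standard deviation and the 5% boundaries:
# another system scoring below the lower boundary (or above the upper one)
# is taken to perform significantly worse (or better) than ours.
# rates = bootstrap_f_rates(cases)
# lower, upper = rates[int(0.05 * len(rates))], rates[int(0.95 * len(rates))]
```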