

Evaluation

We will compare the results of a shallow parser with an available hand-parsed corpus. For this purpose we use the precision and recall of the phrases in the results. Precision is the percentage of phrases found by the learner that are correct according to the corpus. Recall is the percentage of corpus phrases found by the learner. It is easier to optimize a system configuration based on a single evaluation score, and therefore we combine precision and recall in the F$_{\beta}$ rate [Van Rijsbergen(1975)]:


\begin{displaymath}
F_{\beta} = \frac{(\beta^2+1) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}
\end{displaymath} (3)

$\beta$ can be used for giving recall a larger ($\beta>$1) or smaller ($\beta<$1) weight than precision. We do not have a preference for one or the other and therefore we use $\beta$=1. In previous work on shallow parsing, a word-level accuracy rate has often been used as evaluation criterion. We do not believe that this is a good method for evaluating the results of phrase detection algorithms. Accuracy rates assign positive values to correctly identified non-phrase words and to partially identified phrases. Furthermore, they produce different numbers for the same analysis depending on the data representation used. For these reasons, the relation between accuracy rates and F$_{\beta}$ rates is poor, and preference should be given to the latter.
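
To make the metric concrete, the following sketch computes precision, recall and F$_{\beta}$ from two phrase lists. This is a minimal illustration under our own assumptions: the representation of phrases as hashable (begin, end, type) tuples and the function name are illustrative, not part of any evaluation software discussed here.

\begin{verbatim}
def f_beta(found_phrases, corpus_phrases, beta=1.0):
    """Precision, recall and F_beta for learner vs. corpus phrases.

    Phrases are assumed to be hashable identifiers, e.g.
    (begin, end, type) tuples; only exact matches count as correct.
    """
    found = set(found_phrases)
    corpus = set(corpus_phrases)
    correct = len(found & corpus)   # phrases found that are correct
    precision = correct / len(found) if found else 0.0
    recall = correct / len(corpus) if corpus else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)
\end{verbatim}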

Accuracy rates have one advantage over F$_{\beta}$ rates: standard statistical tests can be used for determining whether the difference between two accuracy rates is significant. Accuracy is a relatively simple function, $correct/processed$, where $processed$ is the number of items that have been processed and $correct$ is the number of items that received the correct class. Unfortunately, F$_{\beta =1}$ is more complex: after some arithmetic (shown below) we get $2*correct/(found+corpus)$, where $found$ is the number of phrases found by the learner, $correct$ is the number of found phrases that are correct, and $corpus$ is the number of phrases in the corpus according to some gold standard. The value of the $corpus$ variable is an upper bound on the variable $correct$. The complexity of the F$_{\beta =1}$ computation makes it hard to apply standard statistical tests to F$_{\beta =1}$ rates.
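
The arithmetic is a direct substitution of $precision = correct/found$ and $recall = correct/corpus$ into the definition of F$_{\beta =1}$:

\begin{displaymath}
F_{\beta =1} = \frac{2 \cdot \frac{correct}{found} \cdot \frac{correct}{corpus}}{\frac{correct}{found} + \frac{correct}{corpus}} = \frac{2 \cdot correct^2}{correct \cdot (found + corpus)} = \frac{2 \cdot correct}{found + corpus}
\end{displaymath}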

[Yeh(2000)] offers a method for computing significance values for comparisons of F$_{\beta =1}$ rates: computationally intensive randomization tests. His approach requires the test data classifications of all systems that are to be compared. Usually we only have access to the test data classifications of our own system, and therefore we have used a variant of these randomization tests: bootstrap resampling [Noreen(1989)]. The basic idea of this approach is to regard the test data classifications as a population of cases. A random sample of this population can be created by arbitrarily choosing cases with replacement. We create many random samples of the same size as the test data and compute an average F$_{\beta =1}$ rate over the samples, together with a standard deviation for this average. These statistics can be used for deciding whether the performance of another system is significantly different from that of our system. Since we do not know whether the performance of our system follows a normal distribution, we determine the significance boundaries in such a way that 5% of the samples evaluate worse (or better) than the chosen boundary.
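
A minimal sketch of this bootstrap procedure is given below. The representation of a test case as a (found, correct, corpus) count triple for one sentence, the sample count, and all names are our own illustrative assumptions.

\begin{verbatim}
import random

def bootstrap_f1(cases, samples=1000, seed=42):
    """Bootstrap resampling of F_{beta=1} over test data classifications.

    Each case is a (found, correct, corpus) count triple for one test
    sentence.  Returns the sampled F1 rates sorted ascending, from which
    the 5% significance boundaries can be read off.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(samples):
        # draw a sample of the same size as the test data, with replacement
        sample = [rng.choice(cases) for _ in cases]
        found = sum(f for f, c, g in sample)
        correct = sum(c for f, c, g in sample)
        corpus = sum(g for f, c, g in sample)
        rates.append(2 * correct / (found + corpus) if found + corpus else 0.0)
    return sorted(rates)

# Another system is significantly worse (better) than ours at the 5%
# level if its F1 falls below (above) these boundaries:
#   rates = bootstrap_f1(cases)
#   lower, upper = rates[len(rates) // 20], rates[-len(rates) // 20]
\end{verbatim}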

