Parameter Tuning

In this paper, we will compare different learner set-ups and apply the best one to standard data sets. For example, we will examine different data representations and test different system combination techniques. We should be careful not to tune the system to the test data, and therefore we will only use the available training data for finding the best configuration for the learner. This can be done with 10-fold cross-validation [Weiss and Kulikowski(1991)]: the training data is divided into ten sections of similar size, and each section is processed by a system trained on the other nine. The overall performance on all ten sections is regarded as the performance of the system.
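The cross-validation procedure just described can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `train` and `classify` functions stand for a generic, hypothetical learner interface.

```python
def cross_validate(data, train, classify, n_folds=10):
    """10-fold cross-validation sketch: split `data` into ten sections of
    similar size, classify each section with a model trained on the other
    nine, and collect the predictions over all sections.

    `train` and `classify` are placeholders for an arbitrary learner."""
    fold_size = (len(data) + n_folds - 1) // n_folds
    sections = [data[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    predictions = []
    for i, test_section in enumerate(sections):
        # Training data is everything outside the current held-out section.
        train_data = [item for j, s in enumerate(sections) if j != i
                      for item in s]
        model = train(train_data)
        predictions.extend(classify(model, test_section))
    return predictions
```

The overall score is then computed over the concatenated predictions for all ten held-out sections, so every training item is evaluated exactly once.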

In our experiments, we will process the data twice. First, we will let the learner generate a classification of the data. The learner will then process the data a second time, now including the classifications found earlier in the context of each data item. When working with n-fold cross-validation, we must be careful that information from a test section is not accidentally used in its training data. In the first processing phase we generate classes for the first section while training on the other nine sections. Thus, information about the classes in, for example, section two is encoded in the classes produced for section one. If in the second phase we use the classifications of the first section while processing section two, we are analyzing a section while having (indirect) access to information about its classes: information about the classes in section two may leak into this process via the training data, which is undesirable.

There are two ways of preventing this form of information leaking. Both involve being stricter when creating the training data of the second system. In a cascaded 10-fold cross-validation experiment, the second-phase training data for section x must be constructed without using this section. This means that instead of running one 10-fold cross-validation experiment with the first system, we need to run ten 9-fold cross-validation experiments in order to obtain correct training data for the ten sections in the second system. Section one will be trained with the 9-fold cross-validation results from sections 2-10, section two with sections 1 and 3-10, and so on. If at any time we need to add a third phase to the cascade of systems, we need to run 8-fold cross-validation experiments with the first system and 9-fold cross-validation experiments with the second. With each extra system, the number of extra runs increases and the amount of training data available to the first system decreases.
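The first method can be sketched as follows: for each section x, the first-phase labels of the other nine sections are produced by a 9-fold cross-validation experiment that never trains on section x. Again, `train` and `classify` are hypothetical learner functions, not code from the paper.

```python
def cascaded_training_labels(sections, train, classify):
    """For each section x, produce first-phase classifications for the other
    nine sections via 9-fold cross-validation over those sections only, so
    that no label used when processing section x encodes information about
    section x itself. `train`/`classify` are placeholder learner functions."""
    n = len(sections)
    labels_for = {}
    for x in range(n):
        others = [i for i in range(n) if i != x]
        per_section = {}
        for i in others:  # 9-fold cross-validation over the remaining sections
            train_data = [item for j in others if j != i
                          for item in sections[j]]
            model = train(train_data)
            per_section[i] = classify(model, sections[i])
        labels_for[x] = per_section
    return labels_for
```

Note the cost this sketch makes explicit: one 10-fold experiment becomes ten 9-fold experiments, and each inner model sees only eight of the ten sections.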

The second method for preventing training information from one processing phase leaking into the classifications of the next is to only use results from previous phases in the test data. In the training data we use the perfect classes rather than the output of the previous phase. This has two disadvantages. First, we cannot use a feature containing the class of the focus word, because this feature would be identical to the output class. This means that we can only use the classes of neighboring words. Second, the opportunity to correct errors made in the first phase will be restricted, because the training data no longer contains information about the errors made by that phase. The advantage of this approach is that we can use all training data in all training phases, so the problem of a diminishing quantity of training data disappears. This approach is especially useful with longer cascades of learners, as required for example in full parsing.
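The second method can be illustrated with a small sketch: when building a second-phase item, the neighboring class features come from the gold ("perfect") annotation for training data, and from the first-phase output for test data. The function and tag names below are illustrative assumptions, not taken from the paper.

```python
def second_phase_item(i, words, tags, gold, predicted, is_training):
    """Build a second-phase feature tuple for position i.

    `gold` holds the perfect classes, `predicted` the first-phase output.
    Training items use gold neighboring classes; test items use predicted
    ones. Only neighbors are used: the focus class would equal the output."""
    classes = gold if is_training else predicted
    left = classes[i - 1] if i > 0 else "_"           # c_{i-1}
    right = classes[i + 1] if i < len(words) - 1 else "_"  # c_{i+1}
    return (words[i], tags[i], left, right)
```

Because the gold classes exist for the full training set, no cross-validation cascade is needed to produce the training features, which is why all training data remains available in every phase.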

Here is an example illustrating the two methods: suppose that in the second phase of a ten-fold cross-validation chunking experiment, a word in the sixth section is represented by the following eight features:

$w_{i-1}$ $w_i$ $w_{i+1}$ $p_{i-1}$ $p_i$ $p_{i+1}$ $c_{i-1}$ $c_{i+1}$

The goal is to find a chunk tag for word $w_i$. The word features $w_i$, $w_{i-1}$ and $w_{i+1}$ represent the word itself, the preceding word and the next word, respectively. The POS tag features $p_i$, $p_{i-1}$ and $p_{i+1}$ contain the POS tags of the three words. The two chunk features $c_{i-1}$ and $c_{i+1}$ hold the chunk tags of the preceding and the next word. The word and POS tag information have been taken from the training data. In the first method, the two chunk features are computed by a preceding phase. If this item is part of the training data for section x, $c_{i-1}$ and $c_{i+1}$ were generated by a nine-fold cross-validation experiment which uses all sections except section x. This means that the two chunk features have been generated by training with all sections except 6 and x. If the item is part of the test data, then the chunk features are computed by a ten-fold cross-validation experiment (training with sections 1-5 and 7-10). The second method generates chunk features for the test data in the same way, but for training data it takes $c_{i-1}$ and $c_{i+1}$ from the training data, thus preventing them from containing implicit information about the test sections.
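Building the eight-feature representation above is straightforward; a minimal sketch (with an assumed `"_"` padding symbol for sentence boundaries, not specified in the paper) is:

```python
def chunk_features(i, words, pos, chunks, pad="_"):
    """Return (w_{i-1}, w_i, w_{i+1}, p_{i-1}, p_i, p_{i+1}, c_{i-1}, c_{i+1})
    for position i. `chunks` holds the chunk tags available as context
    (gold or predicted, depending on the method and data split)."""
    def at(seq, j):
        return seq[j] if 0 <= j < len(seq) else pad  # pad at sentence edges
    return (at(words, i - 1), words[i], at(words, i + 1),
            at(pos, i - 1), pos[i], at(pos, i + 1),
            at(chunks, i - 1), at(chunks, i + 1))
```

The two methods differ only in where the `chunks` sequence comes from for training items, not in how the features are assembled.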


Erik Tjong Kim Sang 2002-03-13