When different machine learning systems are applied to the same task, they make different errors.
The combined results of these systems can be used to generate an analysis of the task that is usually better than that of any of the participating systems, for example by choosing the pattern analyses selected by the majority of the systems.
This approach eliminates errors made by only a minority of the systems.
Here is a made-up example: suppose we have five systems, $c_1$-$c_5$, which assign binary classes to patterns.
Their output for eight patterns, $p_1$-$p_8$, is as follows:
|       | $c_1$ | $c_2$ | $c_3$ | $c_4$ | $c_5$ | correct |
|-------|-------|-------|-------|-------|-------|---------|
| $p_1$ | 0 | 0 | 0 | 0 | 0 | 0 |
| $p_2$ | 1 | 1 | 1 | 1 | 1 | 1 |
| $p_3$ | 0 | 0 | 0 | 0 | 0 | 0 |
| $p_4$ | 1 | 0 | 1 | 1 | 1 | 1 |
| $p_5$ | 0 | 0 | 1 | 0 | 0 | 0 |
| $p_6$ | 1 | 1 | 1 | 1 | 0 | 1 |
| $p_7$ | 1 | 0 | 0 | 0 | 0 | 0 |
| $p_8$ | 1 | 1 | 1 | 0 | 1 | 1 |
Each of the five systems makes exactly one error. We can combine the five by choosing, for each pattern, the class that has been predicted most frequently. For the first three patterns this makes no difference because all systems predict the same class. For pattern 4 we will choose class 1, thereby eliminating the error of classifier 2. Pattern 5 will be assigned class 0, thus eliminating classifier 3's only error. Patterns 6, 7 and 8 will receive classes 1, 0 and 1 respectively, thereby eliminating the errors of classifiers 5, 1 and 4. Thus the majority choice generates a perfect analysis of the data.
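The computation can be made concrete with a minimal Python sketch over the toy data above (the data structures and function names are ours, for illustration only):

```python
from collections import Counter

# Output of the five systems c1-c5 for patterns p1-p8 (table above).
predictions = [
    # c1 c2 c3 c4 c5
    (0, 0, 0, 0, 0),  # p1
    (1, 1, 1, 1, 1),  # p2
    (0, 0, 0, 0, 0),  # p3
    (1, 0, 1, 1, 1),  # p4
    (0, 0, 1, 0, 0),  # p5
    (1, 1, 1, 1, 0),  # p6
    (1, 0, 0, 0, 0),  # p7
    (1, 1, 1, 0, 1),  # p8
]
correct = [0, 1, 0, 1, 0, 1, 0, 1]

def majority_vote(votes):
    """Return the class predicted by most systems."""
    return Counter(votes).most_common(1)[0][0]

combined = [majority_vote(row) for row in predictions]
assert combined == correct  # the combination makes no errors
```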
In this paper we will evaluate different techniques for combining system output, most of which have been put forward by [Van Halteren et al. (2001)]. We use four voting methods and three stacked classifiers. Voting methods assign weights to the output of the individual systems and, for each pattern, choose the class with the largest accumulated score. The simplest voting method is the one we used in the preceding example: Majority Voting. It gives all systems the same weight. A more elaborate method is accuracy voting (TotPrecision), which assigns each system a weight equal to its accuracy on some evaluation data.
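All four voting methods instantiate the same scheme with different weights; a hedged illustration of TotPrecision, continuing the sketch above (in practice the accuracies would be estimated on separate evaluation data, not on the test set):

```python
def weighted_vote(votes, weights):
    """Add each system's weight to the score of the class it
    predicts and return the class with the largest total score."""
    scores = {}
    for vote, weight in zip(votes, weights):
        scores[vote] = scores.get(vote, 0.0) + weight
    return max(scores, key=scores.get)

# TotPrecision: one weight per system, equal to its accuracy.
# On the toy data every system scores 7/8, so the result
# coincides with Majority Voting.
accuracies = [
    sum(row[i] == gold for row, gold in zip(predictions, correct)) / len(correct)
    for i in range(5)
]
combined = [weighted_vote(row, accuracies) for row in predictions]
```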
Some classes might be easier to predict than others, and for this reason we have also tested two voting methods which use weights based on accuracies for particular class tags.
The first is TagPrecision.
For each output value $X$ of a system $c_i$, it uses a weight which is equal to the precision $\mathrm{prec}(c_i, X)$ that the system obtained for this value on evaluation data.
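Continuing the sketch (per-tag precisions would in practice be estimated on held-out evaluation data; here we compute them on the toy set purely for illustration):

```python
def tag_precision(i):
    """Precision of system i per output value: how often the
    system is right when it predicts that value."""
    counts, hits = Counter(), Counter()
    for row, gold in zip(predictions, correct):
        counts[row[i]] += 1
        hits[row[i]] += (row[i] == gold)
    return {x: hits[x] / counts[x] for x in counts}

prec = [tag_precision(i) for i in range(5)]

def tagprecision_vote(votes):
    scores = {}
    for i, x in enumerate(votes):
        scores[x] = scores.get(x, 0.0) + prec[i][x]
    return max(scores, key=scores.get)
```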
The second method is Precision-Recall.
It starts from the same weights as TagPrecision but adds to these the probability that the systems producing a different output value have missed this value.
For example, suppose that there are two systems $c_1$ and $c_2$, and that for some data item $c_1$ predicts value $X$ while $c_2$ predicts something else.
In that case, the probability that $c_1$ is right is $\mathrm{prec}(c_1, X)$, while the probability that $c_2$ has missed $X$ is $1 - \mathrm{recall}(c_2, X)$.
Precision-Recall will assign the weight $\mathrm{prec}(c_1, X) + (1 - \mathrm{recall}(c_2, X))$ to the event of $c_1$ predicting $X$.
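A sketch of this weighting, again with statistics computed on the toy data (our reading of the scheme: every voter for $X$ contributes its precision for $X$, every dissenter contributes one minus its recall for $X$):

```python
def tag_recall(i):
    """Recall of system i per class: of the items whose correct
    class is x, the fraction that the system labelled x."""
    totals, hits = Counter(), Counter()
    for row, gold in zip(predictions, correct):
        totals[gold] += 1
        hits[gold] += (row[i] == gold)
    return {x: hits[x] / totals[x] for x in totals}

rec = [tag_recall(i) for i in range(5)]

def precision_recall_vote(votes):
    scores = {}
    for x in set(votes):
        scores[x] = sum(
            prec[i][x] if v == x else 1.0 - rec[i][x]
            for i, v in enumerate(votes)
        )
    return max(scores, key=scores.get)
```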
A stacked classifier is a classifier which processes the results of
other classifiers.
We have used three variants of stacked classifiers.
The first is called TagPair.
It examines pairs of values produced by two systems and estimates the
probability that a certain output value is associated with the pair.
In the case of two systems $c_1$ and $c_2$ producing the distinct values $X$ and $Y$, TagPair will examine evaluation data and find that the value pair $\langle X, Y \rangle$ is associated with, for example, $X$ in 20% of the cases, $Y$ in 70% and a third value $Z$ in 10%.
These numbers are used as weights for the three output values, and the value that has accumulated the largest score after all value pairs in the pattern have been examined is selected.
Unlike the voting methods, TagPair has the opportunity to choose the correct output tag even if all systems have made an incorrect prediction ($Z$ in this example).
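A sketch of TagPair under the assumption that the pair statistics are simple relative frequencies over evaluation data (the handling of pairs unseen during training is not specified here; this sketch simply skips them):

```python
from collections import defaultdict

def train_tagpair(eval_rows, eval_gold, n_systems):
    """Estimate P(correct = z | system i said x, system j said y)
    for every pair of systems (i, j) from evaluation data."""
    counts = defaultdict(Counter)
    for row, gold in zip(eval_rows, eval_gold):
        for i in range(n_systems):
            for j in range(i + 1, n_systems):
                counts[i, j, row[i], row[j]][gold] += 1
    return {
        key: {z: n / sum(ctr.values()) for z, n in ctr.items()}
        for key, ctr in counts.items()
    }

def tagpair_vote(votes, pair_probs):
    scores = Counter()
    for i in range(len(votes)):
        for j in range(i + 1, len(votes)):
            for z, p in pair_probs.get((i, j, votes[i], votes[j]), {}).items():
                scores[z] += p
    return scores.most_common(1)[0][0]
```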
The other two stacked classifiers are based on the memory-based learner itself. We have tested it in two modes: one in which only the output of the systems was included, and one in which we also included information about the test item. This extra information was the word that needed to be classified, its part-of-speech (POS) tag, and the context (words/POS tags) in which it appeared. The memory-based learner used the same settings as described earlier in this section: the Gain Ratio metric and a nearest neighborhood of size three.
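As a rough stand-in for this second-stage learner, one could train a plain 3-nearest-neighbour classifier on the systems' outputs; this omits the Gain Ratio feature weighting of the actual memory-based learner, and the reuse of the toy data below is for illustration only, since the combinator must be trained on data the first-stage systems were not trained on:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stage-1 outputs become stage-2 features; in the second mode the
# feature vectors would be extended with the focus word, its POS
# tag and the surrounding words/POS tags.
X = np.array(predictions)
y = np.array(correct)

stacker = KNeighborsClassifier(n_neighbors=3, metric="hamming")
stacker.fit(X, y)
stacked_output = stacker.predict(X)
```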
The weight assignment methods used by the voting methods and the stacked classifiers suffer from the same problem as Gain Ratio: they might fail to disregard irrelevant features. For this reason we have often tested the combination methods both with all available system results and with a subset of these, thus mimicking the feature selection method described earlier. Apart from Majority Voting, all voting methods and stacked classifiers require training data. This means that we need both training data for the individual systems and training data for the combinators. We will describe how we have selected this training data in the next section.