When different machine learning systems are applied to the same task, they make different errors.
The combined results of these systems can be used to generate an analysis of the task that is usually better than that of any of the participating systems, for example by choosing the pattern analyses selected by the majority of the systems.
This approach eliminates errors made by only a minority of the systems.
Here is a made-up example: suppose we have five systems, $c_1$-$c_5$, which assign binary classes to patterns.
Their output for eight patterns, $p_1$-$p_8$, is as follows:
|       | $c_1$ | $c_2$ | $c_3$ | $c_4$ | $c_5$ | correct |
|-------|-------|-------|-------|-------|-------|---------|
| $p_1$ | 0 | 0 | 0 | 0 | 0 | 0 |
| $p_2$ | 1 | 1 | 1 | 1 | 1 | 1 |
| $p_3$ | 0 | 0 | 0 | 0 | 0 | 0 |
| $p_4$ | 1 | 0 | 1 | 1 | 1 | 1 |
| $p_5$ | 0 | 0 | 1 | 0 | 0 | 0 |
| $p_6$ | 1 | 1 | 1 | 1 | 0 | 1 |
| $p_7$ | 1 | 0 | 0 | 0 | 0 | 0 |
| $p_8$ | 1 | 1 | 1 | 0 | 1 | 1 |
Each of the five systems makes exactly one error. We can combine the five by choosing, for each pattern, the class that has been predicted most frequently. For the first three patterns this makes no difference because all systems predict the same class. For pattern 4 we will choose class 1, thereby eliminating the error of classifier 2. Pattern 5 will be assigned class 0, thus eliminating classifier 3's only error. Patterns 6, 7 and 8 will receive classes 1, 0 and 1 respectively, thereby eliminating the errors of classifiers 5, 1 and 4. Thus the majority choice generates a perfect analysis of the data.
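The computation can be made concrete with a minimal Python sketch over the toy data above (the data structures and function names are ours, for illustration only):

```python
from collections import Counter

# Output of the five systems c1-c5 for patterns p1-p8 (table above).
predictions = [
    # c1 c2 c3 c4 c5
    (0, 0, 0, 0, 0),  # p1
    (1, 1, 1, 1, 1),  # p2
    (0, 0, 0, 0, 0),  # p3
    (1, 0, 1, 1, 1),  # p4
    (0, 0, 1, 0, 0),  # p5
    (1, 1, 1, 1, 0),  # p6
    (1, 0, 0, 0, 0),  # p7
    (1, 1, 1, 0, 1),  # p8
]
correct = [0, 1, 0, 1, 0, 1, 0, 1]

def majority_vote(votes):
    """Return the class predicted by most systems."""
    return Counter(votes).most_common(1)[0][0]

combined = [majority_vote(row) for row in predictions]
assert combined == correct  # the combination makes no errors
```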
In this paper we will evaluate different techniques for combining system output, most of which have been put forward by [Van Halteren et al. (2001)]. We use four voting methods and three stacked classifiers. Voting methods assign weights to the output of the individual systems and, for each pattern, choose the class with the largest accumulated score. The simplest voting method is the one we used in the preceding example: Majority Voting. It gives all systems the same weight. A more elaborate method is accuracy voting (TotPrecision), which assigns each system a weight equal to its accuracy on some evaluation data.
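All four voting methods instantiate the same scheme with different weights; a hedged illustration of TotPrecision, continuing the sketch above (in practice the accuracies would be estimated on separate evaluation data, not on the test set):

```python
def weighted_vote(votes, weights):
    """Add each system's weight to the score of the class it
    predicts and return the class with the largest total score."""
    scores = {}
    for vote, weight in zip(votes, weights):
        scores[vote] = scores.get(vote, 0.0) + weight
    return max(scores, key=scores.get)

# TotPrecision: one weight per system, equal to its accuracy.
# On the toy data every system scores 7/8, so the result
# coincides with Majority Voting.
accuracies = [
    sum(row[i] == gold for row, gold in zip(predictions, correct)) / len(correct)
    for i in range(5)
]
combined = [weighted_vote(row, accuracies) for row in predictions]
```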
Some classes might be easier to predict than others, and for this reason we have also tested two voting methods which use weights based on accuracies for particular class tags.
The first is TagPrecision.
For each output value $X$ of a system $c_i$, it uses a weight which is equal to the precision $\mathrm{prec}(c_i, X)$ that the system obtained for this value on evaluation data.
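Continuing the sketch (per-tag precisions would in practice be estimated on held-out evaluation data; here we compute them on the toy set purely for illustration):

```python
def tag_precision(i):
    """Precision of system i per output value: how often the
    system is right when it predicts that value."""
    counts, hits = Counter(), Counter()
    for row, gold in zip(predictions, correct):
        counts[row[i]] += 1
        hits[row[i]] += (row[i] == gold)
    return {x: hits[x] / counts[x] for x in counts}

prec = [tag_precision(i) for i in range(5)]

def tagprecision_vote(votes):
    scores = {}
    for i, x in enumerate(votes):
        scores[x] = scores.get(x, 0.0) + prec[i][x]
    return max(scores, key=scores.get)
```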
The second method is Precision-Recall.
It starts from the same weights as TagPrecision but adds to these the probability that the systems producing a different output value have missed this value.
For example, suppose that there are two systems $c_1$ and $c_2$, and that for some data item $c_1$ predicts value $X$ while $c_2$ predicts something else.
In that case, the probability that $c_1$ is right is $\mathrm{prec}(c_1, X)$, while the probability that $c_2$ has missed $X$ is $1 - \mathrm{recall}(c_2, X)$.
Precision-Recall will assign the weight $\mathrm{prec}(c_1, X) + (1 - \mathrm{recall}(c_2, X))$ to the event of $c_1$ predicting $X$.
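A sketch of this weighting, again with statistics computed on the toy data (our reading of the scheme: every voter for $X$ contributes its precision for $X$, every dissenter contributes one minus its recall for $X$):

```python
def tag_recall(i):
    """Recall of system i per class: of the items whose correct
    class is x, the fraction that the system labelled x."""
    totals, hits = Counter(), Counter()
    for row, gold in zip(predictions, correct):
        totals[gold] += 1
        hits[gold] += (row[i] == gold)
    return {x: hits[x] / totals[x] for x in totals}

rec = [tag_recall(i) for i in range(5)]

def precision_recall_vote(votes):
    scores = {}
    for x in set(votes):
        scores[x] = sum(
            prec[i][x] if v == x else 1.0 - rec[i][x]
            for i, v in enumerate(votes)
        )
    return max(scores, key=scores.get)
```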
A stacked classifier is a classifier which processes the results of
other classifiers.
We have used three variants of stacked classifiers.
The first is called TagPair.
It examines pairs of values produced by two systems and estimates the
probability that a certain output value is associated with the pair.
In the case of two systems $c_1$ and $c_2$ producing the distinct values $X$ and $Y$, TagPair will examine evaluation data and find that the value pair $\langle X, Y \rangle$ is associated with, for example, $X$ in 20% of the cases, $Y$ in 70% and a third value $Z$ in 10%.
These numbers are used as weights for the three output values, and the value that has accumulated the largest score after all value pairs in the pattern have been examined is selected.
Unlike the voting methods, TagPair has the opportunity to choose the correct output tag even if all systems have made an incorrect prediction ($Z$ in this example).
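A sketch of TagPair under the assumption that the pair statistics are simple relative frequencies over evaluation data (the handling of pairs unseen during training is not specified here; this sketch simply skips them):

```python
from collections import defaultdict

def train_tagpair(eval_rows, eval_gold, n_systems):
    """Estimate P(correct = z | system i said x, system j said y)
    for every pair of systems (i, j) from evaluation data."""
    counts = defaultdict(Counter)
    for row, gold in zip(eval_rows, eval_gold):
        for i in range(n_systems):
            for j in range(i + 1, n_systems):
                counts[i, j, row[i], row[j]][gold] += 1
    return {
        key: {z: n / sum(ctr.values()) for z, n in ctr.items()}
        for key, ctr in counts.items()
    }

def tagpair_vote(votes, pair_probs):
    scores = Counter()
    for i in range(len(votes)):
        for j in range(i + 1, len(votes)):
            for z, p in pair_probs.get((i, j, votes[i], votes[j]), {}).items():
                scores[z] += p
    return scores.most_common(1)[0][0]
```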
The other two stacked classifiers are based on the memory-based learner itself. We have tested it in two modes: one in which only the output of the systems was included, and one in which we also included information about the test item. This extra information was the word that needed to be classified, its part-of-speech (POS) tag, and the context (words/POS tags) in which it appeared. The memory-based learner used the same settings as described earlier in this section: the Gain Ratio metric and a nearest neighborhood of size three.
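As a rough stand-in for this second-stage learner, one could train a plain 3-nearest-neighbour classifier on the systems' outputs; this omits the Gain Ratio feature weighting of the actual memory-based learner, and the reuse of the toy data below is for illustration only, since the combinator must be trained on data the first-stage systems were not trained on:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stage-1 outputs become stage-2 features; in the second mode the
# feature vectors would be extended with the focus word, its POS
# tag and the surrounding words/POS tags.
X = np.array(predictions)
y = np.array(correct)

stacker = KNeighborsClassifier(n_neighbors=3, metric="hamming")
stacker.fit(X, y)
stacked_output = stacker.predict(X)
```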
The weight assignment methods used by the voting methods and the stacked classifiers suffer from the same problem as Gain Ratio: they might fail to disregard irrelevant features. For this reason we have often tested the combination methods both with all available system results and with a subset of these, thus mimicking the feature selection method described earlier. Apart from Majority Voting, all voting methods and stacked classifiers require training data. This means that we need both training data for the individual systems and training data for the combinators. We will describe how we have selected this training data in the next section.