Our work with chunks of arbitrary types is similar to that with noun phrase chunks, apart from two differences. First, we refrained from using feature selection methods: applying them gained us little for noun phrase chunking while requiring a lot of extra computation. We therefore returned to a fixed set of features in these experiments. The context size used here was four left and four right for words and POS tags in the first pass over the data; in the second pass it was three left and three right for words and POS tags, plus two left and two right, without the focus position, for chunk tags. This means that both the first and the second pass use 18 features. The second pass was only used for the four IO data representations. Table 2 shows that the second pass improved on the first pass only by a small margin for the two bracket representations O and C.
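The two feature windows above can be sketched as follows; this is our own minimal illustration of sliding-window feature extraction, and all helper names are hypothetical, not from the paper.

```python
# Sketch of the sliding-window feature extraction described above.
# Helper names and the PAD marker are our own, not from the paper.
PAD = "__PAD__"

def window(seq, i, left, right, include_focus=True):
    """Return the elements of seq in positions [i-left, i+right],
    padding past the sequence edges; optionally drop the focus itself."""
    padded = [PAD] * left + list(seq) + [PAD] * right
    j = i + left  # index of the focus element in the padded sequence
    feats = padded[j - left : j + right + 1]
    if not include_focus:
        feats = feats[:left] + feats[left + 1:]
    return feats

def first_pass_features(words, tags, i):
    # 4 left + focus + 4 right for words and for POS tags: 9 + 9 = 18 features
    return window(words, i, 4, 4) + window(tags, i, 4, 4)

def second_pass_features(words, tags, chunks, i):
    # 3/3 words and POS tags (with focus) plus 2/2 chunk tags without
    # the focus: 7 + 7 + 4 = 18 features
    return (window(words, i, 3, 3) + window(tags, i, 3, 3)
            + window(chunks, i, 2, 2, include_focus=False))
```

Both feature vectors come out at 18 elements, matching the counts given in the text.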
The second difference between this study and the one for noun phrase chunks stems from the fact that, apart from chunk boundaries, we need to find chunk types as well. We can approach this task in three ways. First, we can train the learner to identify chunk boundaries and chunk types at the same time; we have called this the Single-Phase Approach. Second, we can split the task: one learner identifies all chunk boundaries and feeds its output to a second classifier, which identifies the types of the chunks (Double-Phase Approach). Third, a more computationally intensive approach is to develop a separate learner for each chunk type. These learners identify chunks independently of each other, and words assigned to more than one chunk are disambiguated by choosing the chunk type that occurs most frequently in the training data (N-Phase Approach). Since we did not know in advance which of these three processing strategies would produce the best results, we have evaluated all three.
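The N-Phase tie-breaking rule can be stated compactly; the sketch below is our own illustration, and the frequency counts are hypothetical placeholders, not figures from the paper's training data.

```python
# Sketch of the N-Phase disambiguation rule described above: when the
# per-type chunkers assign a word to more than one chunk, keep the chunk
# whose type was most frequent in the training data.
# These counts are hypothetical, for illustration only.
TRAIN_TYPE_FREQ = {"NP": 100, "VP": 40, "PP": 35}

def disambiguate(candidate_types):
    """Pick the competing chunk type most frequent in training."""
    return max(candidate_types, key=lambda t: TRAIN_TYPE_FREQ.get(t, 0))
```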
In order to find the best processing strategy and the best combination technique, we performed several 10-fold cross-validation experiments on the training data. We processed this data with each processing strategy and in each of the six data representations used earlier for noun phrase chunking, and then applied the seven combination techniques presented in Section 2.3 to combine the results. The results can be found in Table 4. Of the three processing strategies, the N-Phase Approach generally performed best, with Double-Phase second and Single-Phase worst. Again, system combination improved all individual results. The differences between the seven combination techniques were small when compared within the same processing approach. The only exception was the pair of stacked MBL classifiers applied to the Single-Phase results, which performed about 0.3 points of F rate better than most of the other combination techniques.
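The simplest of the combination techniques, majority voting over the per-representation outputs, can be sketched as follows. This is our own simplification: it votes per token on tag streams and omits the representation conversions the paper's combination step performs.

```python
from collections import Counter

# Minimal sketch of per-token majority voting over the outputs of the
# chunkers trained on different data representations (a simplification
# of the combination step; conversion between representations omitted).
def majority_vote(predictions):
    """predictions: list of tag sequences, one per data representation."""
    combined = []
    for tags_at_position in zip(*predictions):
        tag, _ = Counter(tags_at_position).most_common(1)[0]
        combined.append(tag)
    return combined

# Example: three representations voting on a three-token sentence
votes = [["B-NP", "I-NP", "O"],
         ["B-NP", "B-NP", "O"],
         ["B-NP", "I-NP", "B-VP"]]
# majority_vote(votes) -> ["B-NP", "I-NP", "O"]
```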
The best result was obtained with the N-Phase Approach in combination with a stacked memory-based classifier (MBL, F=92.76). A bootstrap resampling test with 8000 random populations produced the 90% significance interval 92.60-92.90, which means that this result is significantly better than any Single-Phase or Double-Phase result. However, the N-Phase Approach carries a large computational overhead: the number of passes over the data is at least N times the number of representations. We have therefore chosen the Double-Phase Approach combined with Majority Voting for our further work, as it combines reasonable performance with computational efficiency. The Single-Phase Approach is potentially faster, but its performance is worse unless we use a stacked classifier, which requires extra training data for the combinator.
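A bootstrap resampling interval of the kind mentioned above can be sketched as follows. This is a generic illustration under our own assumptions (resampling per-item scores and reading off the central 90% of resampled means); the paper's exact resampling unit and scoring are not specified here.

```python
import random

# Hedged sketch of a bootstrap significance interval: resample the
# evaluation scores with replacement many times and take the central
# 90% of the resampled means (a simplification of the actual test).
def bootstrap_interval(scores, populations=8000, level=0.90, seed=0):
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n
                   for _ in range(populations))
    lo = means[int(populations * (1 - level) / 2)]
    hi = means[int(populations * (1 + level) / 2) - 1]
    return lo, hi
```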
When we applied the Double-Phase Approach combined with Majority Voting to the CoNLL-2000 data sets, we obtained an F rate of 92.50 (precision 94.04% and recall 91.00%). An overview of the performance rates for the different chunk types can be found in Table 5. Our system does well on the three most frequent chunk types, noun phrases, prepositional phrases and verb phrases, and less well on the other seven. The chunk type UCP, which occurred in the training data, was not present in the test data. With this result, our memory-based arbitrary chunker finished third of eleven participants in the CoNLL-2000 shared task. The two systems that performed better used Support Vector Machines (F=93.48, [Kudoh and Matsumoto(2000)]) and Weighted Probability Distribution Voting (F=93.32, [Van Halteren(2000)]).
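The F rate above is the standard F(beta) measure over precision and recall; with beta=1 it is their harmonic mean, and the precision and recall reported above reproduce the overall score:

```python
# F(beta) measure combining precision and recall; with beta=1 this is
# the harmonic mean, the F rate used in the chunking evaluation.
def f_rate(precision, recall, beta=1.0):
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

# f_rate(94.04, 91.00) is approximately 92.50, matching the reported F rate
```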