
Noun Phrase Recognition

We will use a memory-based learner to find noun phrase chunks in text. In order to determine the best configuration of the learner, we will test different system configurations on the standard training data sets put forward by [Ramshaw and Marcus(1995)]. We will evaluate different feature sets for representing words. Additionally, we will generate different system results with the five data representations and combine these results with system combination techniques.

In our experiments we will represent each word as a set of features: the word itself, its part-of-speech (POS) tag, and a left and right context of at most four words and four POS tags on each side, 18 features in total. We explained in Section 2.2 that memory-based learners equipped with the Gain Ratio metric have difficulty dealing with irrelevant features. Therefore we will use a feature selection method, bi-directional hill-climbing starting from zero features, to find the best subset of the 18 features for each data representation.
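The 18-feature window can be sketched as follows. This is an illustrative reconstruction, not the original implementation: the function name, the padding symbol and the feature order are assumptions; only the window shape (focus word and POS tag plus up to four words and four POS tags on each side) comes from the text.

```python
PAD = "_"  # assumed placeholder for positions outside the sentence

def window_features(words, tags, i, width=4):
    """Return the 18 features for position i: the word and POS tag at i
    plus up to `width` words and POS tags of context on each side."""
    feats = []
    for seq in (words, tags):
        for offset in range(-width, width + 1):
            j = i + offset
            feats.append(seq[j] if 0 <= j < len(seq) else PAD)
    return feats

words = ["The", "cat", "sat"]
tags = ["DT", "NN", "VBD"]
print(len(window_features(words, tags, 1)))  # 18
```

Positions that fall outside the sentence are padded, so every token yields a vector of the same length regardless of where it occurs.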

The memory-based learner will make two passes over the data. First, it will attempt to predict the noun phrases in the data as well as possible. It will then use the output of this first pass as information about the noun phrases in the immediate context of the current word. This means that the second pass has access to the 18 features of the first pass plus the chunk tags of the two words immediately before the current word and the chunk tags of the two words immediately following it. We chose this cascaded approach because it improved overall performance in our earlier work [Tjong Kim Sang and Veenstra(1999)]. We omitted the chunk tag of the current word because including it biased the chunker negatively: Gain Ratio would correctly identify it as a feature containing a lot of information about the output class, and the weight assigned to it would make it hard for the other features to influence the output class at all [Tjong Kim Sang and Veenstra(1999)].
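The second-pass feature extension described above can be sketched as follows; the function and the padding symbol are illustrative assumptions, but the choice of context positions (two chunk tags before and two after, skipping the current position) is taken from the text.

```python
def second_pass_features(first_pass_feats, chunk_tags, i):
    """Extend the first-pass features of position i with the chunk tags
    of the two words before and the two words after it. The chunk tag of
    the current word itself is deliberately omitted, as in the text."""
    PAD = "_"  # assumed placeholder for tags outside the sentence

    def tag(j):
        return chunk_tags[j] if 0 <= j < len(chunk_tags) else PAD

    return first_pass_feats + [tag(i - 2), tag(i - 1), tag(i + 1), tag(i + 2)]
```

At test time `chunk_tags` would hold the estimated tags from the first pass; at training time it would hold the corpus tags, as described later in this section.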

We performed a cascaded feature search for the six data representations on the training data of [Ramshaw and Marcus(1995)] in a 10-fold cross-validation approach. We prevented information leaking into the second pass, in conformance with Section 2.4, by using the estimated chunk tags for the test data and the corpus chunk tags in the training data. In this way we made sure that when the test data consisted of section x, no information about section x was available in the training data. The results of the 10-fold cross-validation experiments can be found in Table 2. The best feature sets of the first pass use most of the nine POS tag features (almost eight on average) but, interestingly enough, only a few of the word features (just over four on average). The best sets for the second pass use fewer POS tag features (under seven), fewer word features (just over two) and most of the chunk features (about three). The table shows that a wide context is more important for the POS features than for the chunk features, and less important for the word features.
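The bi-directional hill-climbing search can be sketched as follows. This is a minimal illustration under assumptions: the scoring function would in reality be a full 10-fold cross-validation run of the chunker, and the toy scorer below is invented for demonstration only.

```python
def hill_climb(all_features, score):
    """Bi-directional hill-climbing starting from zero features: at each
    step, try every single-feature addition or removal and keep the
    change that improves the score most; stop when no change helps.
    `score` maps a set of features to a quality measure (an F-rate in
    the experiments described in the text)."""
    current = frozenset()
    best = score(current)
    while True:
        candidates = [current | {f} for f in all_features - current]
        candidates += [current - {f} for f in current]
        step = max(candidates, key=score, default=current)
        if score(step) <= best:
            return current, best
        current, best = step, score(step)

# toy scorer (illustrative only): reward features 1 and 2, penalise others
score = lambda s: len(s & {1, 2}) - 0.5 * len(s - {1, 2})
print(hill_climb({1, 2, 3}, score))
```

Because removals are considered alongside additions, the search can drop a feature that helped early on but became redundant once other features entered the set.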


Table 2: Best F$_{\beta =1}$ found for six data representations in two passes while using a bi-directional hill-climbing feature search algorithm in a 10-fold cross-validation process applied to the training data for the noun phrase chunking task. Note that the rates obtained for the O (open bracket) and C (close bracket) representations are for phrase starts and phrase ends respectively, and thus higher than for the first four, which evaluate complete phrase identification.
train    Pass 1                                  Pass 2
Repr.    F$_{\beta =1}$  features                F$_{\beta =1}$  features
IOB1     91.88  word$_{-4..0}$ POS$_{-2..3}$     92.54  word$_{-2..0}$ POS$_{-4..3}$ chunk$_{-2,-1,1,2}$
IOB2     91.78  word$_{-1..0}$ POS$_{-4..3}$     92.29  word$_{-1..0}$ POS$_{-4..2}$ chunk$_{-1,1,2}$
IOE1     91.64  word$_{0..1}$ POS$_{-3..3}$      92.28  word$_{0..1}$ POS$_{-3..3}$ chunk$_{-1,1,2}$
IOE2     92.19  word$_{-3..4}$ POS$_{-4..4}$     92.59  word$_{0..1}$ POS$_{-1..3}$ chunk$_{-2,-1,1,2}$
O        96.04  word$_{-2..0}$ POS$_{-4..3}$     96.11  word$_{-1,0}$ POS$_{-4..1}$ chunk$_{-1,2}$
C        96.43  word$_{0..4}$ POS$_{-4..4}$      96.45  word$_{0..2}$ POS$_{-4..2}$ chunk$_{-2,-1,1}$


Our motive for processing six representations rather than one was to obtain different results which we could combine in order to improve performance. System combination can be seen as a second cascade behind passes one and two. For the reasons mentioned in Section 2.4, adding a second cascade in a 10-fold cross-validation experiment requires extra care to prevent information from leaking from the training data at one level into the training data of the next level. We have dealt with this problem by preparing the training data of the combination techniques with 9-fold cross-validation runs that were independent of the 10-fold cross-validation experiments used for generating the test data. For example, the test data for the first section was generated by training on sections 2-10 twice, first without information about context chunk tags and then with perfect information about the context chunk tags. The training data was generated with a 9-fold cross-validation process on sections 2-10, likewise first without context chunk tags and then with perfect context chunk tags. By working this way it was impossible for information about the first section to enter the training data of the combination processes.
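The nested split scheme described above can be sketched as follows; the function is an illustrative assumption, but the structure (an inner 9-fold cross-validation over only the nine folds that exclude the outer test fold) follows the text.

```python
def nested_cv_splits(n_folds=10):
    """For each outer test fold, the combinator's training material is
    produced by an inner (n-1)-fold cross-validation over the remaining
    folds only, so no information about the test fold can leak in.
    Yields (test_fold, train_folds, inner_splits) where inner_splits is
    a list of (dev_fold, inner_train_folds) pairs."""
    folds = list(range(n_folds))
    for test in folds:
        train = [f for f in folds if f != test]
        inner = [(dev, [f for f in train if f != dev]) for dev in train]
        yield test, train, inner
```

Every inner training set excludes both its own development fold and the outer test fold, which is exactly the guarantee the paragraph above requires.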

Most system combination techniques require results that are in the same format. We have results in six different formats, which means that we need to convert them to one format. Since we do not know which of the formats would suit the combination process best, we have evaluated all of them. The four IO formats can trivially be converted to each other and to the O and the C format. The conversion of the two bracket formats to the other four is nontrivial: the two data streams have been generated independently of each other, so they may contain inconsistencies. We have chosen to remove these by deleting all brackets which cannot be matched with the closest candidate. For example, in a structure like ( a ( b c ) d ), the first bracket will be removed because it cannot be matched with the second. The second and third will be kept because they match. Finally, the fourth will be removed because it cannot be matched with the third. We obtain the balanced structure a ( b c ) d, which can trivially be converted to the four IO formats.
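The bracket-removal rule can be sketched as follows. This is a reconstruction from the worked example in the text, not the original code; it relies on the fact that noun phrase chunks do not nest, so a second opening bracket cancels an earlier unmatched one and a closing bracket without an open partner is discarded.

```python
def balance(tokens):
    """Drop brackets that cannot be matched with their closest candidate,
    so that '( a ( b c ) d )' becomes 'a ( b c ) d'."""
    out, open_idx = [], None  # open_idx: position in `out` of an unmatched '('
    for tok in tokens:
        if tok == '(':
            if open_idx is not None:
                del out[open_idx]       # earlier '(' found no partner
            out.append('(')
            open_idx = len(out) - 1
        elif tok == ')':
            if open_idx is not None:
                out.append(')')         # matched pair, keep both
                open_idx = None
            # else: ')' has no open partner, drop it
        else:
            out.append(tok)
    if open_idx is not None:
        del out[open_idx]               # trailing unmatched '('
    return out

print(balance(['(', 'a', '(', 'b', 'c', ')', 'd', ')']))
# ['a', '(', 'b', 'c', ')', 'd']
```

The result is always a balanced, non-nested bracket structure, which can then be converted to any of the IO formats.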


Table 3: F$_{\beta =1}$ rates obtained in 10-fold cross-validation experiments on the noun phrase chunking data while combining results obtained with five different data representations. All five representations have been tested and the best rates have been obtained with the combined bracket representation O+C. All combination results are better than any result of the individual systems (92.59, see Table 2), and combining five systems generally led to better results than using only three or four. The best result has been obtained with a stacked memory-based classifier that used all system results except those generated with IOE1. However, the performance differences are small.
train              IOB1   IOB2   IOE1   IOE2   O+C
all systems
Majority           93.06  93.06  93.14  93.12  93.35
TotPrecision       93.06  93.05  93.13  93.05  93.35
TagPrecision       93.06  93.10  93.11  93.11  93.35
Precision-Recall   93.06  93.10  93.11  93.08  93.35
TagPair            93.05  93.14  93.10  93.13  93.36
MBL                93.14  93.12  93.07  92.92  93.35
MBL+               92.81  92.74  92.91  92.78  93.29
some systems
Majority           93.02  93.12  93.08  92.99  93.37
TotPrecision       93.02  93.12  93.08  92.99  93.37
TagPrecision       93.04  93.13  93.10  92.99  93.37
Precision-Recall   93.04  93.13  93.13  93.04  93.37
TagPair            93.08  93.16  93.12  93.05  93.37
MBL                93.12  93.18  93.18  93.03  93.38


We have combined the five results of pass two of the 10-fold cross-validation experiments on the noun phrase chunking training data (O and C have now been regarded as one data stream, O+C). We have used the system combination techniques described in Section 2.3: Majority Voting, TotPrecision, TagPrecision, Precision-Recall, TagPair and two variants of a stacked memory-based learner. The first stacked learner did not use any context information, while the second had access to a limited amount of context information: the current word, the current POS tag, or pairs containing the current POS tag and one of the current word, the previous POS tag or the next POS tag. We have performed combination experiments with all five data streams and with all subsets of three and four data streams. The results can be found in Table 3. For the second stacked classifier we only included the best results (obtained with the context feature current POS tag). System combination improved performance: the worst result of the combination techniques is still better than the best result of the individual systems. The differences between the combination techniques are small. Furthermore, system combination with the four IO data representations leads to similar results, but the combined bracket representation consistently obtains higher F$_{\beta =1}$ rates. It should be noted, though, that while combination of the data with the IO representations leads to similar precision and recall figures, O+C obtains its higher F$_{\beta =1}$ rates with high precision rates and lower recall rates.
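Majority voting over per-token predictions, the simplest of the techniques above, can be sketched as follows. The function is illustrative; the text does not specify how ties are broken, so the tie-breaking behaviour here (favouring the tag proposed by the earliest system) is an assumption.

```python
from collections import Counter

def majority_vote(streams):
    """Combine per-token chunk tag predictions of several systems by
    majority voting. On ties, Counter.most_common keeps the tag that
    was encountered first (insertion order is stable on equal counts)."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*streams)]

s1 = ['I', 'I', 'O']
s2 = ['I', 'O', 'O']
s3 = ['O', 'I', 'O']
print(majority_vote([s1, s2, s3]))  # ['I', 'I', 'O']
```

With five voting systems, at least three must agree on a tag for it to win outright, which is why adding more (reasonably accurate) systems tends to help.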

Since the performance differences between the combination techniques displayed in Table 3 are small, we are relatively free in selecting a technique for further processing. We chose Majority Voting because it is the simplest of the tested combination techniques: unlike the others, it does not require extra combinator training data. It seems reasonable to use the O+C representation during the combination process because the best results have been obtained with this representation. We will restrict ourselves to a few systems rather than combining all five, because Majority Voting in combination with the O+C representation obtained a slightly higher F$_{\beta =1}$ rate that way. The best rate was obtained with only the systems using the data representations IOB1, IOE2 and O+C, so we restrict ourselves to these three. This leaves us with the following processing scheme:

  1. Process the test data with a memory-based model generated from the training data. Use the features shown in Table 2 (Pass 1) and generate output data streams while using the representations IOB1, IOE2, O and C.
  2. Perform a second pass over the test data with another memory-based model obtained from the training data. Again use the features shown in Table 2 (Pass 2). In the test data, use the estimated chunk tags from the previous run as chunk tag features and in the training data use the corpus chunk tags as chunk features. Perform these passes four times, once for each of the data representations IOB1, IOE2, O and C.
  3. Convert the output for the data representations IOB1 and IOE2 to the O and the C format.
  4. Combine the three O data streams (IOB1, IOE2 and O) with Majority Voting and do the same for the three C data streams (IOB1, IOE2 and C).
  5. Remove brackets from the resulting O and C data streams which cannot be matched with other brackets. The resulting balanced bracket structure, the noun phrase analysis of the test data, is the output of the complete system.

We have applied this procedure to the data sets of [Ramshaw and Marcus(1995)]: sections 15-18 of the Wall Street Journal part of the Penn Treebank [Marcus et al.(1993)] as training data and section 20 of the same corpus as test data. The system obtained an F$_{\beta =1}$ rate of 93.34 (precision 94.01% and recall 92.67%). This is a modest improvement over our earlier work [Tjong Kim Sang(2000a)], in which we did not use feature selection and obtained an F$_{\beta =1}$ rate of 93.26. In order to estimate significance thresholds, we applied a bootstrap resampling test to the output of our system. We created 10,000 populations by randomly drawing sentences with replacement from the system results. The number of sentences in each population was the same as in the test corpus. The average F$_{\beta =1}$ of the 10,000 populations was 93.33 with a standard deviation of 0.24. For 5 percent of the populations the F$_{\beta =1}$ rate was equal to or lower than 92.93, and for another 5 percent it was equal to or higher than 93.73. Since 93.26 lies between these two significance boundaries, our current system does not perform significantly better than the previous version without feature selection.
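The bootstrap resampling test can be sketched as follows. This is a minimal illustration under assumptions: the per-sentence score representation (counts of correctly found, found and true phrases) and the function names are invented for the sketch; the procedure itself (draw sentences with replacement, population size equal to the test corpus, compute F$_{\beta =1}$ per population) follows the text.

```python
import random

def bootstrap_f_rates(sentence_scores, n_populations=10000, seed=1):
    """Bootstrap test: draw sentences with replacement until each
    population is as large as the test corpus, then compute the F-rate
    of every population. `sentence_scores` holds per-sentence tuples
    (found_correct, found, true)."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_populations):
        sample = rng.choices(sentence_scores, k=len(sentence_scores))
        correct = sum(c for c, _, _ in sample)
        found = sum(f for _, f, _ in sample)
        true = sum(t for _, _, t in sample)
        prec = correct / found if found else 0.0
        rec = correct / true if true else 0.0
        rates.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    rates.sort()
    return rates

# the 5 percent boundaries are then the values at the 5th and 95th
# percentiles of the sorted rates, e.g. rates[len(rates) // 20] and
# rates[-len(rates) // 20]
```

Note that the F-rate of each population is computed from the pooled counts, not by averaging per-sentence F-rates, which would bias the result.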


Erik Tjong Kim Sang 2002-03-13