In clause identification, the goal is to divide sentences into clauses, which typically contain a subject and a predicate. We have used the clause data of the CoNLL-2001 shared task [Tjong Kim Sang and Déjean(2001)], which was derived from the Wall Street Journal part of the Penn Treebank [Marcus et al.(1993)]. Here is an example sentence from the Treebank, with all information except words and clause brackets omitted:
(S Coach them in
(S-NOM handling complaints)
(SBAR-PRP so that
(S they can resolve problems immediately)
)
.
)
This sentence contains four clauses. In the data that we have worked with, the function and type information has been removed. This means that the type tags NOM and PRP have been omitted and that the SBAR tag has been replaced by S. Like the chunking data, these data sets contained words and part-of-speech tags which were generated by the Brill tagger [Brill(1994)]. Additionally they contained chunk tags which were computed by the arbitrary chunking method we discussed in the previous section.
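This preprocessing step can be illustrated with a small sketch (ours, not part of the original data preparation): it strips function and type suffixes such as NOM and PRP from the clause labels and maps SBAR to S.

import re

def simplify_clause_label(label):
    """Drop function/type suffixes (S-NOM -> S) and map SBAR to S."""
    base = label.split("-")[0]
    return "S" if base in ("S", "SBAR") else base

def simplify_brackets(bracketed):
    """Apply the label simplification to every open bracket in a bracketed string."""
    return re.sub(r"\(([A-Z]+(?:-[A-Z]+)*)",
                  lambda m: "(" + simplify_clause_label(m.group(1)),
                  bracketed)

print(simplify_brackets("(S Coach them in (S-NOM handling complaints) "
                        "(SBAR-PRP so that (S they can resolve problems immediately)) .)"))
# (S Coach them in (S handling complaints) (S so that (S they can resolve problems immediately)) .)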
We have approached identifying clauses in the following way: first we evaluated different memory-based learners for predicting the positions of open clause brackets and close clause brackets, regardless of their level of embedding. The two resulting bracket streams will be inconsistent, and in order to solve this we have developed a list of rules which change a possibly inconsistent set of brackets into a balanced structure. The evaluation of the learners and the development of the balancing rules will be done with 10-fold cross-validation of the CoNLL-2001 training data. Information leakage is prevented by using corpus clause tags, rather than clause tags computed in a previous learning phase, as context features in the training data of cascaded learners. The best learner configurations and balancing rules found will be applied to the data for the clause identification shared task.
As in our noun phrase chunking work, we have tested memory-based learners with different sets of features. At the time we performed these experiments, we did not have access to feature selection methods, and therefore we have only evaluated a few fixed feature configurations:
All feature groups were tested with four context sizes: no context information, or information about a symmetrical window of one, two or three words. As in our chunking work, we want to check whether improved performance can be obtained by using system combination. However, since we attempt to predict brackets at all levels in one step, we cannot use the five data representations here. Instead we have evaluated combinations of some of the feature configurations mentioned above: a majority vote of the three using a single type of information (1+2+3), a majority vote of the three using pairs of information (4+5+6), and a majority vote of the previous two and the one using three types of information (7+(1+2+3)+(4+5+6)). The last is a combination of three results, two of which are themselves combinations of three results.
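As a rough illustration of these two ingredients (the padding symbol, the bracket labels and the classifier outputs are placeholders, not the original memory-based learners), the following sketch builds a symmetrical context window around a focus word and combines several per-word bracket predictions by majority voting:

from collections import Counter

def window_features(tokens, i, size):
    """Features for a symmetrical window of `size` tokens around position i;
    positions outside the sentence are padded."""
    pad = "_"
    padded = [pad] * size + list(tokens) + [pad] * size
    return padded[i:i + 2 * size + 1]

def majority_vote(predictions):
    """Combine several equally long per-word label sequences by majority voting."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

print(window_features(["they", "can", "resolve", "problems", "immediately"], 0, 2))
# ['_', '_', 'they', 'can', 'resolve']
# Three hypothetical open bracket streams for a five-word sentence:
print(majority_vote([["(", "-", "-", "(", "-"],
                     ["(", "-", "(", "(", "-"],
                     ["-", "-", "-", "(", "-"]]))
# ['(', '-', '-', '(', '-']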
Clauses may contain many words and it is possible that the maximal context used by the learner, three words left and right, is not enough for predicting clause boundaries accurately. However, we cannot make the context size much larger than three because that would make it harder for the learner to generalize. We have tried to deal with this problem by evaluating another set of features which contain summaries of sentences rather than every word. Since we have chunk information of the sentences available, we can compress them by removing all words from each chunk except the main one, the head word. The head words can be generated by a set of rules put forward by [Magerman(1995)] and modified by [Collins(1999)]. After removing the non-head words from each chunk, we can replace the POS tag of the remaining word with the chunk tag and thus obtain data with words and chunk tags only (words outside of a chunk keep their POS tag). Again we have evaluated sets of features which hold a single type of information, words (w-) or chunk tags (c-), or pairs of information, words and chunk tags (wc-).
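The compression step can be sketched as follows; note that the head selection used here (the last word of each chunk) is a simplification chosen for illustration, not the Magerman/Collins head rules mentioned above, and IOB chunk notation is assumed:

def summarize(words, pos_tags, chunk_tags):
    """Keep only one word per chunk and replace its POS tag with the chunk type;
    words outside chunks keep their POS tag. Chunk tags are assumed to be in
    IOB notation (B-NP, I-NP, O, ...)."""
    summary, i = [], 0
    while i < len(words):
        if chunk_tags[i] == "O":                      # word outside any chunk
            summary.append((words[i], pos_tags[i]))
            i += 1
        else:
            chunk_type = chunk_tags[i].split("-")[1]
            j = i
            while j + 1 < len(words) and chunk_tags[j + 1] == "I-" + chunk_type:
                j += 1
            summary.append((words[j], chunk_type))    # simplification: last word as head
            i = j + 1
    return summary

print(summarize(["they", "can", "resolve", "problems", "immediately"],
                ["PRP", "MD", "VB", "NNS", "RB"],
                ["B-NP", "B-VP", "I-VP", "B-NP", "B-ADVP"]))
# [('they', 'NP'), ('resolve', 'VP'), ('problems', 'NP'), ('immediately', 'ADVP')]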
We have evaluated the twelve groups of feature sets while predicting the clause open and clause close brackets. The results can be found in Table 6. The learner performed best while predicting open clause brackets with information about the words immediately next to the current word (column 1). When more information was available, its performance dropped slightly. Of the different feature sets tested, the majority vote of sets that used pairs of information performed best (column 1, row 9). The classifiers that generated close brackets improved whenever extra context information became available. The best performance was reached while using a pair of words and chunk tags in the summarized format (column 3, row 13). We have performed an extra experiment to test whether the system improved when using four context words rather than three. With words and chunk tags in the summarized format, the system obtained F=81.72 for context size four compared with F=81.61 for context size three. This increase is small, so we have chosen context size three for our further experiments.
With the streams of open and close brackets, we attempted to generate balanced clause structures by modifying the data streams with a set of heuristic rules. In these rules we gave more confidence to the open bracket predictions since, as can be seen in Table 6, the system performs better in predicting open brackets than close brackets. After testing different rule sets created by hand and evaluating them on the available training data, we decided on the following rule set:
These rules were able to generate complete and consistent embedded clause structures for the output that the system generated for the training data of the CoNLL-2001 shared task. The rules have one main defect: they are incapable of predicting that two or more clauses start at the same position. This makes it impossible for the system to detect such clause starts but, unfortunately, according to our rule set evaluation, adding recognition facilities for multiple clause starts would have had a negative influence on overall performance.
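Since the actual rule set is listed separately, the following is only a minimal sketch of a balancing procedure of the kind described here: it trusts every predicted open bracket, drops close brackets that have no matching open clause, closes any remaining clauses at the end of the sentence, and shares the defect that at most one clause can start at a given position.

def balance(opens, closes):
    """Turn two per-word bracket streams into one consistent clause structure.
    opens[i] / closes[i] are True when a clause is predicted to start / end at word i."""
    structure, depth = [], 0
    for i in range(len(opens)):
        if opens[i]:                 # open bracket predictions are trusted
            structure.append((i, "("))
            depth += 1
        if closes[i] and depth > 0:  # only close a clause that is actually open
            structure.append((i, ")"))
            depth -= 1
    while depth > 0:                 # close remaining clauses at the last word
        structure.append((len(opens) - 1, ")"))
        depth -= 1
    return structure

# An inconsistent pair of streams for a six-word sentence: the extra close
# bracket predicted at word 5 has no open clause left and is dropped.
print(balance([True, False, False, True, False, False],
              [False, True, False, False, True, True]))
# [(0, '('), (1, ')'), (3, '('), (4, ')')]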
This set of rules obtained a clause F of 71.34 on the training data of this task when applied to the best results for open and close brackets. The rules did not change the open bracket positions, and on average the changes they made to the close bracket positions were an improvement (F=84.11 compared to 81.61).
One could argue that, since open bracket prediction is more accurate than close bracket prediction, the information about the open bracket positions could be used when predicting clause close brackets. We have attempted to do this by repeating the experiment with the best configuration for close brackets (wc- with context size 3) while adding a feature which stated the clause level of the current word according to earlier open and close brackets. This approach improved the F rate of the close bracket predictor from 81.61 to 83.50. However, after applying the balancing rules to the open brackets and the improved close brackets, we only obtained a clause F of 71.39, a minimal improvement over the previous 71.34. It seems that the extra performance gained by the close bracket predictor came from solving problems which could already be solved by the balancing rules.
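For illustration, such a clause level feature could be computed as follows (a hypothetical helper, not the original implementation): the level of the current word is the number of clauses opened up to and including it, minus the number of clauses already closed before it.

def clause_levels(opens, closes):
    """For every word, the clause embedding level implied by the open and
    close brackets predicted so far."""
    levels, depth = [], 0
    for open_here, close_here in zip(opens, closes):
        if open_here:
            depth += 1
        levels.append(depth)           # level of the current word
        if close_here:
            depth = max(0, depth - 1)  # the clause closes after this word
    return levels

print(clause_levels([True, False, False, True, False],
                    [False, False, False, False, True]))
# [1, 1, 1, 2, 2]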
We applied the balancing rules together with an open bracket predictor using a combination of pairs of feature types (context size 1) and a close bracket predictor using summarized pairs of words and chunk tags (context size 3) to the data files of the CoNLL-2001 shared task. Our clause identification method obtained an F rate of 67.79 for identifying complete clauses (precision 76.91% and recall 60.61%). In the CoNLL-2001 shared task, our system finished third out of six participants. One system outperformed the others by a large margin: the boosted decision tree method of [Carreras and Màrquez(2001)]. Their system obtained an F rate of 78.63 on this task. The main difference between their approach and ours is that they use a larger number of features, methods for predicting multiple co-occurring clause starts, and a more advanced statistical model for combining brackets into clauses.
In a post-conference study, we have attempted to estimate more precisely the cause of the performance difference between our method and the boosted decision trees used by [Carreras and Màrquez(2001)]. Our hypothesis was that not only the choice of system made a difference, but also the choice of features. For this purpose, Carreras and Màrquez kindly repeated an experiment in predicting open brackets, but this time using our feature set: pairs of information with a window of one word to the left and one to the right, with the results combined by majority voting (Table 6, left, row 9, column 1). The experiment was performed while testing on the CoNLL-2001 development data set. Originally the memory-based learner obtained F=89.80 on this data set, while their boosted decision tree approach reached 93.89. However, when using the memory-based feature set, the performance of the decision trees dropped to 91.32. When both systems use the same features, the boosted decision trees outperform the memory-based learner, but the decision trees perform even better with their own feature set. Our hypothesis was correct: the performance difference between the two approaches was caused both by the choice of learner and by the choice of feature set.
The next obvious question is whether the memory-based system would perform better with the feature set of the boosted decision trees. Providing an answer to this question was nontrivial. The feature set consisted of thousands of binary features, which was more than the memory-based learner could handle. After converting the features from binary-valued to multi-valued, there were about 70 features left. At best, the system obtained F=90.52 with this feature set. Since we feared that the number of features was still too large for the system to handle, we performed a forward sequential selection search in the feature space, starting with zero features. The memory-based learner reached an optimal performance with 13 features, at F=91.82. These results show that there is still room for improvement for the memory-based learner, but that cooperation with a feature selection method will be helpful.
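The forward sequential selection procedure referred to here is a standard greedy search; a minimal sketch is given below, with a placeholder evaluation function (in the actual experiments the score was the F rate obtained by the memory-based learner with the candidate feature set).

def forward_selection(all_features, evaluate):
    """Greedy forward feature selection: start from the empty set and repeatedly
    add the single feature that improves the score most, until no addition helps."""
    selected, best_score = [], float("-inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        score, feature = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:
            break                      # no remaining feature improves the score
        selected.append(feature)
        best_score = score
    return selected, best_score

# Toy evaluation function: rewards 'w0' and 'c0', penalizes large feature sets.
toy_score = lambda feats: len(set(feats) & {"w0", "c0"}) - 0.1 * len(feats)
print(forward_selection(["w-1", "w0", "w1", "c0", "p0"], toy_score))
# (['w0', 'c0'], 1.8)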