In clause identification, the goal is to divide sentences into clauses, which typically contain a subject and a predicate. We have used the clause data of the CoNLL-2001 shared task [Tjong Kim Sang and Déjean(2001)], which was derived from the Wall Street Journal part of the Penn Treebank [Marcus et al.(1993)]. Here is an example sentence from the Treebank, with all information except words and clause brackets omitted:
(S Coach them in
(S-NOM handling complaints)
(SBAR-PRP so that
(S they can resolve problems immediately)
)
.
)
This sentence contains four clauses. In the data that we have worked with, the function and type information has been removed. This means that the type tags NOM and PRP have been omitted and that the SBAR tag has been replaced by S. Like the chunking data, these data sets contained words and part-of-speech tags which were generated by the Brill tagger [Brill(1994)]. Additionally they contained chunk tags which were computed by the arbitrary chunking method we discussed in the previous section.
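This preprocessing step can be illustrated with a small sketch (ours, not part of the original data preparation): it strips function and type suffixes such as NOM and PRP from the clause labels and maps SBAR to S.

import re

def simplify_clause_label(label):
    """Drop function/type suffixes (S-NOM -> S) and map SBAR to S."""
    base = label.split("-")[0]
    return "S" if base in ("S", "SBAR") else base

def simplify_brackets(bracketed):
    """Apply the label simplification to every open bracket in a bracketed string."""
    return re.sub(r"\(([A-Z]+(?:-[A-Z]+)*)",
                  lambda m: "(" + simplify_clause_label(m.group(1)),
                  bracketed)

print(simplify_brackets("(S Coach them in (S-NOM handling complaints) "
                        "(SBAR-PRP so that (S they can resolve problems immediately)) .)"))
# (S Coach them in (S handling complaints) (S so that (S they can resolve problems immediately)) .)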
We have approached identifying clauses in the following way: first we evaluated different memory-based learners for predicting the positions of open clause brackets and close clause brackets, regardless of their level of embedding. The two resulting bracket streams will be inconsistent, and in order to solve this we have developed a list of rules which change a possibly inconsistent set of brackets into a balanced structure. The evaluation of the learners and the development of the balancing rules will be done with 10-fold cross-validation of the CoNLL-2001 training data. Information leakage is prevented by using corpus clause tags, rather than clause tags computed in a previous learning phase, as context features in the training data of cascaded learners. The best learner configurations and balancing rules found will be applied to the data for the clause identification shared task.
As in our noun phrase chunking work, we have tested memory-based learners with different sets of features. At the time we performed these experiments, we did not have access to feature selection methods, and therefore we have only evaluated a few fixed feature configurations:
All feature groups were tested with four context sizes: no context information, or information about a symmetrical window of one, two or three words. As in our chunking work, we want to check whether improved performance can be obtained by using system combination. However, since we attempt to predict brackets at all levels in one step, we cannot use the five data representations here. Instead we have evaluated combinations of some of the feature configurations mentioned above: a majority vote of the three using a single type of information (1+2+3), a majority vote of the three using pairs of information (4+5+6), and a majority vote of the previous two and the one using three types of information (7+(1+2+3)+(4+5+6)). The last is a combination of three results, two of which are themselves combinations of three results.
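As a rough illustration of these two ingredients (the padding symbol, the bracket labels and the classifier outputs are placeholders, not the original memory-based learners), the following sketch builds a symmetrical context window around a focus word and combines several per-word bracket predictions by majority voting:

from collections import Counter

def window_features(tokens, i, size):
    """Features for a symmetrical window of `size` tokens around position i;
    positions outside the sentence are padded."""
    pad = "_"
    padded = [pad] * size + list(tokens) + [pad] * size
    return padded[i:i + 2 * size + 1]

def majority_vote(predictions):
    """Combine several equally long per-word label sequences by majority voting."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

print(window_features(["they", "can", "resolve", "problems", "immediately"], 0, 2))
# ['_', '_', 'they', 'can', 'resolve']
# Three hypothetical open bracket streams for a five-word sentence:
print(majority_vote([["(", "-", "-", "(", "-"],
                     ["(", "-", "(", "(", "-"],
                     ["-", "-", "-", "(", "-"]]))
# ['(', '-', '-', '(', '-']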
Clauses may contain many words and it is possible that the maximal context used by the learner, three words left and right, is not enough for predicting clause boundaries accurately. However, we cannot make the context size much larger than three because that would make it harder for the learner to generalize. We have tried to deal with this problem by evaluating another set of features which contain summaries of sentences rather than every word. Since we have chunk information of the sentences available, we can compress them by removing all words from each chunk except the main one, the head word. The head words can be generated by a set of rules put forward by [Magerman(1995)] and modified by [Collins(1999)]. After removing the non-head words from each chunk, we can replace the POS tag of the remaining word with the chunk tag and thus obtain data with words and chunk tags only (words outside of a chunk keep their POS tag). Again we have evaluated sets of features which hold a single type of information, words (w-) or chunk tags (c-), or pairs of information, words and chunk tags (wc-).
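The compression step can be sketched as follows; note that the head selection used here (the last word of each chunk) is a simplification chosen for illustration, not the Magerman/Collins head rules mentioned above, and IOB chunk notation is assumed:

def summarize(words, pos_tags, chunk_tags):
    """Keep only one word per chunk and replace its POS tag with the chunk type;
    words outside chunks keep their POS tag. Chunk tags are assumed to be in
    IOB notation (B-NP, I-NP, O, ...)."""
    summary, i = [], 0
    while i < len(words):
        if chunk_tags[i] == "O":                      # word outside any chunk
            summary.append((words[i], pos_tags[i]))
            i += 1
        else:
            chunk_type = chunk_tags[i].split("-")[1]
            j = i
            while j + 1 < len(words) and chunk_tags[j + 1] == "I-" + chunk_type:
                j += 1
            summary.append((words[j], chunk_type))    # simplification: last word as head
            i = j + 1
    return summary

print(summarize(["they", "can", "resolve", "problems", "immediately"],
                ["PRP", "MD", "VB", "NNS", "RB"],
                ["B-NP", "B-VP", "I-VP", "B-NP", "B-ADVP"]))
# [('they', 'NP'), ('resolve', 'VP'), ('problems', 'NP'), ('immediately', 'ADVP')]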
We have evaluated the twelve groups of feature sets while predicting the clause open and clause close brackets. The results can be found in Table 6. The learner performed best while predicting open clause brackets with information about the words immediately next to the current word (column 1). When more information was available, its performance dropped slightly. Of the different feature sets tested, the majority vote of sets that used pairs of information performed best (column 1, row 9). The classifiers that generated close brackets improved whenever extra context information became available. The best performance was reached while using a pair of words and chunk tags in the summarized format (column 3, row 13). We have performed an extra experiment to test whether the system improved when using four context words rather than three. With words and chunk tags in the summarized format, the system obtained F=81.72 for context size four compared with F=81.61 for context size three. This increase is small, so we have chosen context size three for our further experiments.
With the streams of open and close brackets, we attempted to generate balanced clause structures by modifying the data streams with a set of heuristic rules. In these rules we gave more confidence to the open bracket predictions since, as can be seen in Table 6, the system performs better in predicting open brackets than close brackets. After testing different rule sets created by hand and evaluating them on the available training data, we decided on the following rule set:
These rules were able to generate complete and consistent embedded clause structures for the output that the system generated for the training data of the CoNLL-2001 shared task. The rules have one main defect: they are incapable of predicting that two or more clauses start at the same position. This makes it impossible for the system to detect such clause starts but, unfortunately, according to our rule set evaluation, adding recognition facilities for multiple clause starts would have had a negative influence on overall performance.
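Since the actual rule set is listed separately, the following is only a minimal sketch of a balancing procedure of the kind described here: it trusts every predicted open bracket, drops close brackets that have no matching open clause, closes any remaining clauses at the end of the sentence, and shares the defect that at most one clause can start at a given position.

def balance(opens, closes):
    """Turn two per-word bracket streams into one consistent clause structure.
    opens[i] / closes[i] are True when a clause is predicted to start / end at word i."""
    structure, depth = [], 0
    for i in range(len(opens)):
        if opens[i]:                 # open bracket predictions are trusted
            structure.append((i, "("))
            depth += 1
        if closes[i] and depth > 0:  # only close a clause that is actually open
            structure.append((i, ")"))
            depth -= 1
    while depth > 0:                 # close remaining clauses at the last word
        structure.append((len(opens) - 1, ")"))
        depth -= 1
    return structure

# An inconsistent pair of streams for a six-word sentence: the extra close
# bracket predicted at word 5 has no open clause left and is dropped.
print(balance([True, False, False, True, False, False],
              [False, True, False, False, True, True]))
# [(0, '('), (1, ')'), (3, '('), (4, ')')]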
This set of rules obtained a clause F of 71.34 on the training data of this task when applied to the best results for open and close brackets. The rules did not change the open bracket positions, and on average the changes they made to the close bracket positions were an improvement (F=84.11 compared to 81.61).
One could argue that, since open bracket prediction is more accurate than close bracket prediction, the information about the open bracket positions could be used when predicting clause close brackets. We have attempted to do this by repeating the experiment with the best configuration for close brackets (wc- with context size 3) while adding a feature which stated the clause level of the current word according to earlier open and close brackets. This approach improved the F rate of the close bracket predictor from 81.61 to 83.50. However, after applying the balancing rules to the open brackets and the improved close brackets, we only obtained a clause F of 71.39, a minimal improvement over the previous 71.34. It seems that the extra performance gained by the close bracket predictor came from solving problems which could already be solved by the balancing rules.
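For illustration, such a clause level feature could be computed as follows (a hypothetical helper, not the original implementation): the level of the current word is the number of clauses opened up to and including it, minus the number of clauses already closed before it.

def clause_levels(opens, closes):
    """For every word, the clause embedding level implied by the open and
    close brackets predicted so far."""
    levels, depth = [], 0
    for open_here, close_here in zip(opens, closes):
        if open_here:
            depth += 1
        levels.append(depth)           # level of the current word
        if close_here:
            depth = max(0, depth - 1)  # the clause closes after this word
    return levels

print(clause_levels([True, False, False, True, False],
                    [False, False, False, False, True]))
# [1, 1, 1, 2, 2]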
We applied the balancing rules together with an open bracket predictor using a combination of pairs of feature types (context size 1) and a close bracket predictor using summarized pairs of words and chunk tags (context size 3) to the data files of the CoNLL-2001 shared task. Our clause identification method obtained an F rate of 67.79 for identifying complete clauses (precision 76.91% and recall 60.61%). In the CoNLL-2001 shared task, our system finished third out of six participants. One system outperformed the others by a large margin: the boosted decision tree method of [Carreras and Màrquez(2001)]. Their system obtained an F rate of 78.63 on this task. The main difference between their approach and ours is that they use a larger number of features, methods for predicting multiple co-occurring clause starts, and a more advanced statistical model for combining brackets into clauses.
In a post-conference study, we have attempted to estimate more precisely the cause of the performance difference between our method and the boosted decision trees used by [Carreras and Màrquez(2001)]. Our hypothesis was that not only the choice of system made a difference, but also the choice of features. For this purpose, Carreras and Màrquez kindly repeated an experiment in predicting open brackets, but this time using our feature set: pairs of information with a window of one word to the left and one to the right, with the results combined by majority voting (Table 6, left, row 9, column 1). The experiment was performed while testing on the CoNLL-2001 development data set. Originally the memory-based learner obtained F=89.80 on this data set, while their boosted decision tree approach reached 93.89. However, when using the memory-based feature set, the performance of the decision trees dropped to 91.32. When both systems use the same features, the boosted decision trees outperform the memory-based learner, but the decision trees perform even better with their own feature set. Our hypothesis was correct: the performance difference between the two approaches was caused both by the choice of learner and by the choice of feature set.
The next obvious question is whether the memory-based system would perform better with the feature set of the boosted decision trees. Providing an answer to this question was nontrivial. The feature set consisted of thousands of binary features, which was more than the memory-based learner could handle. After converting the features from binary-valued to multi-valued, there were about 70 features left. At best, the system obtained F=90.52 with this feature set. Since we feared that the number of features was still too large for the system to handle, we performed a forward sequential selection search in the feature space, starting with zero features. The memory-based learner reached an optimal performance with 13 features, at F=91.82. These results show that there is still room for improvement for the memory-based learner, but that cooperation with a feature selection method will be helpful.
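The forward sequential selection procedure referred to here is a standard greedy search; a minimal sketch is given below, with a placeholder evaluation function (in the actual experiments the score was the F rate obtained by the memory-based learner with the candidate feature set).

def forward_selection(all_features, evaluate):
    """Greedy forward feature selection: start from the empty set and repeatedly
    add the single feature that improves the score most, until no addition helps."""
    selected, best_score = [], float("-inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        score, feature = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:
            break                      # no remaining feature improves the score
        selected.append(feature)
        best_score = score
    return selected, best_score

# Toy evaluation function: rewards 'w0' and 'c0', penalizes large feature sets.
toy_score = lambda feats: len(set(feats) & {"w0", "c0"}) - 0.1 * len(feats)
print(forward_selection(["w-1", "w0", "w1", "c0", "p0"], toy_score))
# (['w0', 'c0'], 1.8)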