The approach for parsing noun phrases outlined in the previous section can be used for generating parse trees containing phrases of arbitrary types as well. In that case we would be using chunking techniques for performing full parsing. This is not a new idea. [Ejerhed and Church(1983)] present a Swedish grammar which includes noun phrase chunk rules. [Abney(1991)] describes a chunk parser which consists of two parts: one that finds base chunks and another that attaches the chunks to each other in order to obtain parse trees. [Daelemans(1995)] suggested finding long-distance dependencies with a cascade of lazy learners, among which were constituent identifiers. [Ratnaparkhi(1998)] built a parser based on a chunker with an additional bottom-up process which determines at what position to start new phrases or to join constituents with earlier ones. With this approach he obtained state-of-the-art parsing results. [Brants(1999)] applied a cascade of Markov model chunkers to the task of parsing German sentences. We have extended our noun phrase parsing techniques to parsing arbitrary phrases [Tjong Kim Sang(2001b)]. We will present the main findings of this study here as well.
The standard data sets for testing statistical parsers are different from the ones we used for our earlier work on chunking and shallow parsing. The data sets have also been extracted from the Wall Street Journal (WSJ) part of the Penn Treebank [Marcus et al.(1993)] but they contain different segments. The training data consists of sections 02-21 (39,832 sentences) while section 23 is used as test data (2416 sentences). The data sets consist of words and part-of-speech tags which have been generated by the part-of-speech tagger described by [Ratnaparkhi(1996)]. In the data the phrase types ADVP and PRT have been collapsed into one category, and during evaluation the positions of punctuation marks in the parse tree have been ignored. These adaptations have been made by different authors in order to make it possible to compare the results of their systems with the first study that used these data sets [Magerman(1995)] and all follow-up work.
In our work on parsing arbitrary phrases, we were interested in finding answers to four questions. In order to obtain these answers, we have performed tests with smaller data sets which were taken from the standard training data for this task: WSJ sections 15-18 as training data and section 20 as test data. The first topic we were interested in was the influence of the context size and the examined nearest neighborhood size (parameter k of the memory-based learner) on the performance of the parser. We took the noun phrase parser developed in the previous section, lifted its restriction of generating noun phrases only and applied it to this data set while using different context sizes and values of parameter k for the classifiers that identified phrases above the base levels. The different types of the chunks were derived by using the Double-Phase Approach for chunking (see Section 3.3). The best configuration we found was a context of two left and two right words and POS tags with k = 1. This nearest neighborhood size is smaller than the one used in our earlier work (3) and the best context size is smaller than in our noun phrase chunking work (4). However, the best context size we found for this task is exactly the same as the one reported by [Ratnaparkhi(1998)].
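The classification step behind this tuning experiment can be sketched as follows. This is a minimal illustration, not the actual implementation: it assumes a plain overlap distance between feature vectors, whereas the memory-based learner we used employs more elaborate feature weighting; all names here are hypothetical.

```python
# Sketch of memory-based (k-nearest neighbor) classification over a
# symbolic context window of two words and POS tags on each side.
from collections import Counter

def make_instance(tokens, tags, i, width=2):
    """Build a context window of `width` word/POS-tag pairs on each
    side of focus position i, padded at the sentence edges."""
    pad = [("_", "_")] * width
    seq = pad + list(zip(tokens, tags)) + pad
    window = seq[i:i + 2 * width + 1]
    return tuple(feature for pair in window for feature in pair)

def classify(instance, memory, k=1):
    """Return the majority class among the k nearest stored training
    instances, using overlap distance (count of mismatching features)."""
    by_distance = sorted(
        memory,
        key=lambda item: sum(a != b for a, b in zip(instance, item[0])),
    )
    votes = Counter(cls for _, cls in by_distance[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 the classifier simply copies the class of the single most similar stored instance, which is the neighborhood size that worked best in our experiments.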
The second topic we were interested in was the type of training data
that should be used for finding phrases above the base level.
In our noun phrase parsing work, we found that the best performance
could be obtained by using only data of the current phrase level.
This causes a problem for our parser: the tree depth may
become as large as 31 in our corpus, but little training
material will be available for these high-level phrases if we use the same
training configuration as in our noun phrase parsing work.
We have tested two different training configurations to see if we
could use more training data for this task without losing performance.
With the first of these, using the current, previous and next phrase level,
performance (F=77.13) was about as good as when using only the
current level (77.17).
However, when we trained the cascade of chunkers while using brackets
of all phrase levels, the performance dropped to 67.49.
We have decided to keep on using the current phrase level only in the
training data despite its problems with identifying higher level
phrases.
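The three training configurations we compared can be summarized in a small sketch. The function and configuration names below are hypothetical, introduced only to make the comparison concrete:

```python
# Selecting which training brackets the chunker at a given phrase
# level sees, under the three tested configurations.
def select_brackets(brackets, level, config="current"):
    """brackets: list of (phrase_level, bracket) pairs.
    Returns the brackets available as training material."""
    if config == "all":                  # all levels: performance dropped to 67.49
        return [bracket for _, bracket in brackets]
    if config == "current":              # current level only: F=77.17
        keep = {level}
    else:                                # current plus neighbors: F=77.13
        keep = {level - 1, level, level + 1}
    return [bracket for lvl, bracket in brackets if lvl in keep]
```

The "current" configuration wins narrowly, but as noted above it leaves very little training material for the rarely reached higher levels.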
In the results that we have presented in this paper, the precision
rates have always been higher than the recall rates.
In part, this is caused by the method we use for balancing open
brackets and close brackets.
It removes all brackets which cannot be matched with another one, which
is approximately the same as accepting constituents which are likely to be
correct and throwing away all others.
We wanted to test if we could obtain more balanced precision and
recall rates because we hoped that these would lead to a better
F rate.
Therefore we have tested two alternative methods for combining
brackets.
The first disregarded the type of the open brackets and allowed close
brackets to be combined with open brackets of any type.
The second method allowed open brackets to match with close brackets
of any type.
Unfortunately neither the first (F=72.33) nor the second
method (76.06) managed to obtain the same F rate as our
standard method for combining brackets.
Therefore we decided to stick with the standard method.
The final issue which we wanted to examine is the performance
progression of the parser at the different levels of the process.
The recall of the parser should increase with every extra step in the
cascade of chunkers, but we would also like to know how precision and
the F rate progressed.
We have measured this for our small parameter tuning data set and
found that indeed recall increased until level 30 of a maximum of 32
and remained constant after that.
Precision dropped until the same level, remaining at the same value
afterwards, while the F rate peaked at level 19 and
dropped after that.
The reason for this later drop in F rate is that although
the recall is still rising, it cannot make up for the loss of
precision at later levels.
Since we want to optimize the F rate, we have decided to
restrict the number of cascaded chunkers in our parser to 19 levels.
We have added an extra post-processing step which, after the 19 levels
of processing, adds clause brackets (S) around sentences which have not
already been identified as clauses.
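The overall control flow of the capped cascade plus the post-processing step can be sketched schematically. The function names and the flat tree representation are hypothetical, chosen only to show the structure:

```python
# Schematic cascade: 19 chunking levels, then a post-processing step
# that wraps any sentence not already rooted in a clause in S brackets.
def parse(sentence, chunk_level, max_levels=19):
    """sentence: flat list of (word, tag) pairs.
    chunk_level: one cascaded chunker step (tree, level) -> tree."""
    tree = sentence
    for level in range(max_levels):
        tree = chunk_level(tree, level)      # group constituents at this level
    if not (len(tree) == 1 and tree[0][0] == "S"):
        tree = [("S", tree)]                 # add clause brackets around the sentence
    return tree
```

Capping the cascade at 19 levels trades away some recall at the deepest levels in exchange for the best F rate, as motivated above.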
We have applied the best parser configuration found to the standard
parsing data.
Our parser used an arbitrary chunker with the configuration described
in Section 3.3 (a Majority Vote of five systems using
different data representations) but trained with the relevant data for
this task.
Higher level phrases were identified by a cascade of 19 chunkers, each
of which had a pair of independent open and close bracket classifiers
which used a context of two left and two right words and POS tags
while being trained with brackets of the current level only.
At each level, open and close brackets were combined to chunks by
removing all brackets that could not be matched with a bracket of the
same type.
The parser contained a post-processing process which added clause
brackets around sentences which were not identified as a clause after
the 19 processing stages.
This chunk parser obtained an F rate of 80.49 on WSJ
section 23 (precision 82.34% and recall 78.72%).
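The reported F rate is the usual harmonic mean of precision and recall (F with beta=1), which can be checked directly from the two figures:

```python
# Verify the reported F rate from precision and recall on WSJ
# section 23, using F = 2PR / (P + R).
precision, recall = 82.34, 78.72
f_rate = 2 * precision * recall / (precision + recall)
print(round(f_rate, 2))  # 80.49
```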
The performance of our chunk parser is modest compared with
state-of-the-art statistical parsers, which obtain F rates of
around 90 [Collins(1999),Charniak(2000)].
However, we have a couple of suggestions for improving its
performance.
First, we could attempt giving the parser access to more information,
for example about lower phrase levels.
Currently, the parser only knows the head words and phrase types of
daughters of phrases that are being built and this might not be
enough.
Second, we could try to find a better method for predicting bracket
positions.
For reasons explained in the previous section, we could not use a
majority vote of systems using different representations.
This might have helped to obtain better performance.
Finally, we would like to change the greedy approach of our parser.
Currently it chooses the best segmentation of chunks at each level and
builds on that but ideally it would be able to remember some
next-to-best configurations as well and perform backtracking from the
earlier choices whenever necessary.
This approach would probably improve performance considerably
(as shown by [Ratnaparkhi(1998)], Table 6.5).
A practical problem which needs to be solved here is that in nearest
neighbor memory-based learning alternative classes do not receive
confidence measures.
Rather, sets of item-dependent distances are used to determine the
usability of the classes.
Comparing partial trees requires comparing sets of distances and
it is not obvious how this should be done.
These extensions will probably improve the performance of the chunk parser. However, it is questionable whether it is worthwhile continuing with this approach. The present version of the parser already requires a lot of memory and processing time: more than a second per word for chunking alone, compared with a mere 0.14 seconds per sentence for a statistical parser which performed better [Ratnaparkhi(1998)]. Extra extensions will probably slow down the parser even more, so we are not sure if extending this approach is worth the trouble.