The approach for parsing noun phrases outlined in the previous section can be used for generating parse trees containing phrases of arbitrary types as well. In that case we would be using chunking techniques for performing full parsing. This is not a new idea. [Ejerhed and Church(1983)] present a Swedish grammar which includes noun phrase chunk rules. [Abney(1991)] describes a chunk parser which consists of two parts: one that finds base chunks and another that attaches the chunks to each other in order to obtain parse trees. [Daelemans(1995)] suggested finding long-distance dependencies with a cascade of lazy learners, among which were constituent identifiers. [Ratnaparkhi(1998)] built a parser based on a chunker with an additional bottom-up process which determines at what position to start new phrases or to join constituents with earlier ones. With this approach he obtained state-of-the-art parsing results. [Brants(1999)] applied a cascade of Markov model chunkers to the task of parsing German sentences. We have extended our noun phrase parsing techniques to parsing arbitrary phrases [Tjong Kim Sang(2001b)]. We will present the main findings of this study here as well.
The standard data sets for testing statistical parsers are different from the ones we used for our earlier work on chunking and shallow parsing. The data sets have also been extracted from the Wall Street Journal (WSJ) part of the Penn Treebank [Marcus et al.(1993)] but they contain different segments. The training data consists of sections 02-21 (39,832 sentences) while section 23 is used as test data (2416 sentences). The data sets consist of words and part-of-speech tags which have been generated by the part-of-speech tagger described by [Ratnaparkhi(1996)]. In the data the phrase types ADVP and PRT have been collapsed into one category, and during evaluation the positions of punctuation marks in the parse tree have been ignored. These adaptations have been made by different authors in order to make it possible to compare the results of their systems with the first study that used these data sets [Magerman(1995)] and all follow-up work.
In our work on parsing arbitrary phrases, we were interested in finding answers to four questions. In order to obtain these answers, we have performed tests with smaller data sets which were taken from the standard training data for this task: WSJ sections 15-18 as training data and section 20 as test data. The first topic we were interested in was the influence of the context size and the examined nearest neighborhood size (parameter k of the memory-based learner) on the performance of the parser. We took the noun phrase parser developed in the previous section, lifted its restriction of generating noun phrases only and applied it to this data set while using different context sizes and values of parameter k for the classifiers that identified phrases above the base levels. The different types of the chunks were derived by using the Double-Phase Approach for chunking (see Section 3.3). The best configuration we found was a context of two left and two right words and POS tags with k = 1. This nearest neighborhood size is smaller than the one used in our earlier work (3) and the best context size is smaller than in our noun phrase chunking work (4). However, the best context size we found for this task is exactly the same as the one reported by [Ratnaparkhi(1998)].
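The classification step behind this tuning experiment can be sketched as follows. This is a minimal illustration, not the actual implementation: it assumes a plain overlap distance between feature vectors, whereas the memory-based learner we used employs more elaborate feature weighting; all names here are hypothetical.

```python
# Sketch of memory-based (k-nearest neighbor) classification over a
# symbolic context window of two words and POS tags on each side.
from collections import Counter

def make_instance(tokens, tags, i, width=2):
    """Build a context window of `width` word/POS-tag pairs on each
    side of focus position i, padded at the sentence edges."""
    pad = [("_", "_")] * width
    seq = pad + list(zip(tokens, tags)) + pad
    window = seq[i:i + 2 * width + 1]
    return tuple(feature for pair in window for feature in pair)

def classify(instance, memory, k=1):
    """Return the majority class among the k nearest stored training
    instances, using overlap distance (count of mismatching features)."""
    by_distance = sorted(
        memory,
        key=lambda item: sum(a != b for a, b in zip(instance, item[0])),
    )
    votes = Counter(cls for _, cls in by_distance[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 the classifier simply copies the class of the single most similar stored instance, which is the neighborhood size that worked best in our experiments.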
The second topic we were interested in was the type of training data
that should be used for finding phrases above the base level.
In our noun phrase parsing work, we found that the best performance
could be obtained by using only data of the current phrase level.
This causes a problem for our parser: the tree depth may
become as large as 31 in our corpus, but little training
material will be available for these high-level phrases if we use the same
training configuration as in our noun phrase parsing work.
We have tested two different training configurations to see if we
could use more training data for this task without losing performance.
With the first of these, using the current, previous and next phrase level,
performance (F=77.13) was about as good as when using only the
current level (77.17).
However, when we trained the cascade of chunkers while using brackets
of all phrase levels, the performance dropped to 67.49.
We have decided to keep on using the current phrase level only in the
training data despite its problems with identifying higher level
phrases.
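The three training configurations we compared can be summarized in a small sketch. The function and configuration names below are hypothetical, introduced only to make the comparison concrete:

```python
# Selecting which training brackets the chunker at a given phrase
# level sees, under the three tested configurations.
def select_brackets(brackets, level, config="current"):
    """brackets: list of (phrase_level, bracket) pairs.
    Returns the brackets available as training material."""
    if config == "all":                  # all levels: performance dropped to 67.49
        return [bracket for _, bracket in brackets]
    if config == "current":              # current level only: F=77.17
        keep = {level}
    else:                                # current plus neighbors: F=77.13
        keep = {level - 1, level, level + 1}
    return [bracket for lvl, bracket in brackets if lvl in keep]
```

The "current" configuration wins narrowly, but as noted above it leaves very little training material for the rarely reached higher levels.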
In the results that we have presented in this paper, the precision
rates have always been higher than the recall rates.
In part, this is caused by the method we use for balancing open
brackets and close brackets.
It removes all brackets which cannot be matched with another one, which
is approximately the same as accepting constituents which are likely to be
correct and throwing away all others.
We wanted to test if we could obtain more balanced precision and
recall rates because we hoped that these would lead to a better
F rate.
Therefore we have tested two alternative methods for combining
brackets.
The first disregarded the type of the open brackets and allowed close
brackets to be combined with open brackets of any type.
The second method allowed open brackets to match with close brackets
of any type.
Unfortunately neither the first (F=72.33) nor the second
method (76.06) managed to obtain the same F rate as our
standard method for combining brackets.
Therefore we decided to stick with the standard method.
The final issue which we wanted to examine is the performance
progression of the parser at the different levels of the process.
The recall of the parser should increase with every extra step in the
cascade of chunkers, but we would also like to know how precision and
the F rate progressed.
We have measured this for our small parameter tuning data set and
found that indeed recall increased until level 30 of a maximum of 32
and remained constant after that.
Precision dropped until the same level, remaining at the same value
afterwards, while the F rate peaked at level 19 and
dropped after that.
The reason for this later drop in F rate is that although
the recall is still rising, it cannot make up for the loss of
precision at later levels.
Since we want to optimize the F rate, we have decided to
restrict the number of cascaded chunkers in our parser to 19 levels.
We have added an extra post-processing step which, after the 19 levels
of processing, adds clause brackets (S) around sentences which have not
already been identified as clauses.
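The overall control flow of the capped cascade plus the post-processing step can be sketched schematically. The function names and the flat tree representation are hypothetical, chosen only to show the structure:

```python
# Schematic cascade: 19 chunking levels, then a post-processing step
# that wraps any sentence not already rooted in a clause in S brackets.
def parse(sentence, chunk_level, max_levels=19):
    """sentence: flat list of (word, tag) pairs.
    chunk_level: one cascaded chunker step (tree, level) -> tree."""
    tree = sentence
    for level in range(max_levels):
        tree = chunk_level(tree, level)      # group constituents at this level
    if not (len(tree) == 1 and tree[0][0] == "S"):
        tree = [("S", tree)]                 # add clause brackets around the sentence
    return tree
```

Capping the cascade at 19 levels trades away some recall at the deepest levels in exchange for the best F rate, as motivated above.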
We have applied the best parser configuration found to the standard
parsing data.
Our parser used an arbitrary chunker with the configuration described
in Section 3.3 (a Majority Vote of five systems using
different data representations) but trained with the relevant data for
this task.
Higher level phrases were identified by a cascade of 19 chunkers, each
of which had a pair of independent open and close bracket classifiers
which used a context of two left and two right words and POS tags
while being trained with brackets of the current level only.
At each level, open and close brackets were combined to chunks by
removing all brackets that could not be matched with a bracket of the
same type.
The parser contained a post-processing process which added clause
brackets around sentences which were not identified as a clause after
the 19 processing stages.
This chunk parser obtained an F rate of 80.49 on WSJ
section 23 (precision 82.34% and recall 78.72%).
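The reported F rate is the usual harmonic mean of precision and recall (F with beta=1), which can be checked directly from the two figures:

```python
# Verify the reported F rate from precision and recall on WSJ
# section 23, using F = 2PR / (P + R).
precision, recall = 82.34, 78.72
f_rate = 2 * precision * recall / (precision + recall)
print(round(f_rate, 2))  # 80.49
```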
The performance of our chunk parser is modest compared with
state-of-the-art statistical parsers, which obtain F rates of
around 90 [Collins(1999),Charniak(2000)].
However, we have a couple of suggestions for improving its
performance.
First, we could attempt giving the parser access to more information,
for example about lower phrase levels.
Currently, the parser only knows the head words and phrase types of
daughters of phrases that are being built and this might not be
enough.
Second, we could try to find a better method for predicting bracket
positions.
For reasons explained in the previous section, we could not use a
majority vote of systems using different representations.
This might have helped to obtain better performance.
Finally, we would like to change the greedy approach of our parser.
Currently it chooses the best segmentation of chunks at each level and
builds on that but ideally it would be able to remember some
next-to-best configurations as well and perform backtracking from the
earlier choices whenever necessary.
This approach would probably improve performance considerably
(as shown by [Ratnaparkhi(1998)], Table 6.5).
A practical problem which needs to be solved here is that in nearest
neighbor memory-based learning alternative classes do not receive
confidence measures.
Rather, sets of item-dependent distances are used to determine the
usability of the classes.
Comparing partial trees requires comparing sets of distances and
it is not obvious how this should be done.
These extensions will probably improve the performance of the chunk parser. However, it is questionable whether it is worthwhile continuing with this approach. The present version of the parser already requires a lot of memory and processing time: more than a second per word for chunking alone, compared with a mere 0.14 seconds per sentence for a statistical parser which performed better [Ratnaparkhi(1998)]. Extra extensions will probably slow down the parser even more, so we are not sure if extending this approach is worth the trouble.