\documentclass[twoside,11pt]{article}
\usepackage{jmlr2e}
\jmlrheading{1}{2002}{695-719}{9/01}{3/02}{Miles Osborne}
\ShortHeadings{Shallow Parsing and Noise}{Osborne}
\firstpageno{695}
\begin{document}


\title{Shallow Parsing using  Noisy and Non-Stationary Training Material}
\author {\name Miles Osborne \email osborne@cogsci.ed.ac.uk \\ 
\addr
 Division of Informatics \\  University of
  Edinburgh \\ 2
    Buccleuch Place \\ Edinburgh EH8 9LW,  Scotland.
 }

\editor{James Hammerton, Miles Osborne, Susan Armstrong  and Walter Daelemans}

\maketitle
\newtheorem{sent}{}
\newcommand{\labelRule}[2]{\begin{center}\begin{tabular}{ll} #1 & (#2) \end{tabular} \end{center}}
\newcommand{\lexical}[2]{\begin{tabular}{l} #1 $\mapsto$ #2 \end{tabular}}
\newcommand{\numberSentence}[1]{\begin{sent} \small #1 \end{sent}}
\newcommand{\cat}[1]{$\mbox{#1}$}
\newcommand{\lcat}[1]{$\backslash \mbox{#1}$}
\newcommand{\rcat}[1]{$/ \mbox{#1}$}
\newcommand{\psr}[2]{$#1 \rightarrow #2$}
%\setcounter{page}{61}
%
% Displayed natural-language sentence
%
\newcommand{\sentence}[1]
     {\begin{flushleft}{\it #1}\end{flushleft}}


\begin{abstract}
Shallow parsers are usually assumed to be trained on {\it noise-free}
material, drawn from the same distribution as the testing
material. However, when the training set is either {\it noisy} or
drawn from a {\it different} distribution,
performance may be degraded.  Using the parsed Wall Street Journal, we 
investigate the performance of four shallow parsers (maximum entropy,
memory-based learning, N-grams and ensemble learning)  trained using 
various types of artificially noisy material.  Our first set of results shows that shallow parsers
are surprisingly robust to synthetic noise, with performance gradually
decreasing as the rate of noise increases.  Further
results show that no single shallow parser performs best in all noise
situations. Final results show that simple, parser-specific extensions
can improve noise-tolerance.
Our second set of results addresses the question of whether naturally
occurring disfluencies undermine performance more than does a change
in distribution.  Results using the parsed Switchboard corpus suggest 
that, although naturally
occurring disfluencies might harm performance, differences in
distribution between the training set and the testing set are more
significant. 
\end{abstract}

\section{Introduction}


Shallow parsers are usually automatically 
constructed  using a training set of annotated  examples \citep{Rams95,Skut98,Dael99a}. 
 Without
loss of generality, let $X$ be a set of word sequences  and $Y$ be a set
of syntactic labels.  The training set will then be a sequence of
pairs of the form $\langle x_1, y_1 \rangle, \ldots ,\langle x_n, y_n
\rangle$, where $x_i \in X, y_i \in Y$.  Such a sequence would be
generated by some process $Q$. On the basis of such a
training set, a shallow parser would make predictions about 
future, unlabelled examples.


Arguably, significant progress in shallow parsing performance will only come from having
access to massive quantities of annotated training material \citep{Bank01}.
 Unsupervised
approaches (which do not require annotated material, and so can make
use of  unlimited quantities of raw text), by contrast, 
are unlikely to yield comparable levels of 
performance. Now, producing  annotated material on an industrial scale
is highly likely to result in the introduction of  significant {\it noise} levels and/or
{\it distributional differences} into the training set. Therefore,
successfully dealing
with lower-quality (but voluminous) annotated training material will
hinge upon the ability of shallow parsers to tolerate high noise levels.  
Until now, the need to deal with noise has not been 
identified as a potential problem within the 
 shallow parsing literature. Furthermore, issues involved with 
distributional differences  in the training set 
have not been dealt with either.
In this paper, we consider both of these issues (noise in the training set and
distributional differences) when training a variety of
shallow parsers. 

Our first set of results shows that a variety of shallow parsers
are surprisingly robust to synthetic noise, and only show a marked
decrease in performance when the noise rates are high.
Further
results show that no single shallow parser performs best in all noise
situations. Final results show that simple, parser-specific extensions
can improve noise-tolerance.

Our second set of results addresses the question of whether naturally
occurring disfluencies undermine performance more than does a change
in distribution.  Our results suggest that, although naturally
occurring disfluencies might harm performance, differences in
distribution between the training set and the testing set are more
significant. Note that this conclusion holds only for naturally
occurring noise levels; it need not hold when noise levels are higher (as
might be the case when large volumes of annotated material are
created quickly).

 
The rest of this paper is organised as follows.  Section \ref{shallow} briefly
surveys shallow parsers from the perspective of
noise-tolerance. Afterwards (Section \ref{parsers}) we introduce the
shallow parsers used in our experiments.  Next (Section \ref{noise}) 
we present a set of
simple, artificial noise models.  This then leads on to a set of
experiments (Section \ref{exp}) where we show what happens when various parsers are
trained upon noisy material, and how our various parsers can be made
more noise tolerant. Section \ref{swbd} deals with naturally occurring
disfluencies and distributional differences.  Finally, Section \ref{comments} summarises the paper.

We now look more closely at noise tolerance in a variety of shallow parsers.

\section{Noise and shallow parsing \label{shallow}}

%As was mentioned in the introduction, shallow parsers are usually
%trained on annotated examples. 
% Now, producing such a training set is both expensive and
%error-prone (as for example when the training set is derived from
%parsed corpora \citep{Marc93}). Errors (or {\it noise}) can occur in 
%the examples
%(as for example when phrases are {\it disfluent}) or in the labels (as for example when 
%phrases are mislabelled).\footnote{We tend to use the terms {\it
%noise} and {\it disfluency} interchangeable. This does not imply, for
%example, that naturally occurring disfluencies are uniformly
%distributed.} 
%Since the training set is used to specify
%the parser, noise in the training set may undermine
%performance.   

%Quite apart from noise in the training set, when the training set is
%generated by a {\it different} process to that which generates future,
%unseen examples, performance may also be reduced.  Such a mismatch
%occurs either when the training set is {\it non-stationary} (as 
%for example when the training set consists of material uttered by a
%variety of people, or when different genre are included).  

Shallow parsers (as reported in the literature)
usually do not have explicit mechanisms for dealing
with noise.  This is probably because they are assumed to be trained
upon relatively noise-free material.  However, this does not mean that
shallow parsers cannot deal with noise.  For example, shallow parsing
systems can be made more noise tolerant
by {\it smoothing}, by selecting a model which
is `simple', by post-processing parsed output, or by some
combination of all three methods. Such approaches might
also tackle the problem of non-stationary training material. 

Smoothing refers to  a general
set of techniques whereby a model estimated from a training set is
modified in some way, such that performance upon material unlike the
training material improves.  Such modification is motivated on {\it a
priori} grounds (meaning information outside of  the training set is
brought to bear). 
Since noise typically manifests as a model
having a complex surface, making it smoother may remove some of the
effects of noise.  Note that because smoothing is motivated on an {\it
  a
priori} basis,  smoothing may in some circumstances actually 
reduce performance.  
We shall see two different examples of
dealing with noise through smoothing in Sections \ref{mbl} and 
\ref{tnt}.  

Selecting a model which is `simple' can also be seen as a technique
for dealing with noise.  For example, maximum entropy will select a
model (from the set of possible maximum likelihood models) that has
parameters which are maximally compressible.  This is therefore a
preference for simplicity, which in turn is similar to explicit
smoothing. However, unlike smoothed maximum likelihood models (which are really maximum a posteriori models),
 maximum entropy is still maximum
likelihood estimation, and so (in principle) nothing need be lost
about the training material.  Note that a preference for simplicity is
nothing more than a noise-avoidance strategy, which may or may not
work, depending upon the nature of the sample \citep{Schaf93}.
Section \ref{maxent} describes a shallow
parser based upon maximum entropy.


Post-processing the test material, in the context of shallow parsing,
was advocated by \cite{Card98}.  They manually created a set of rules
to correct the output of their parser.  Whilst they were able to show
an improvement in performance, in general, such a technique is
arguably ad-hoc
and cannot
guarantee that all error patterns are treated. In subsequent work,
\cite{card99} automated the construction of these postprocessing rules.

Whether any single noise-tolerance strategy works
in all domains is an open question. Short of finding such a strategy, an alternative
method is to `average' over whatever noise-avoidance 
methods one has available. The
general idea is that no  single approach will always work, but that
when taken together, an {\it ensemble} of approaches can work better
than any single method within that ensemble \citep{diet99}. 

Ensemble approaches have
been shown to be successful in a number of domains.  For example, the
current best parse selection results using the Wall Street Journal
were obtained using ensemble techniques \citep{hend00}; POS
tagging is improved by ensemble methods \citep{Brill98}, and for the
 shallow parsing domain that
we consider, ensemble methods enjoy a clear advantage over single
learners \citep{kudo00}.  

The two key ideas behind ensemble approaches
are that the individual models should be better than chance (they are
`weak' learners) and that errors made by the learners are
uncorrelated. Under such circumstances, the ensemble will provably be
better than any of the component models. 
\cite{Asla93} have 
proved that when the appropriate assumptions are met, 
ensemble learning can reduce errors, even when the
training material is noisy.  
Ensemble learning does not always guarantee a performance
improvement. Should errors be correlated (or the component learners
be worse than chance) then the performance of the ensemble can be
worse than the performance of the components.  An example of an
ensemble shallow parser is given in Section \ref{ep}.


We now  describe what we mean by shallow parsing, and introduce 
the shallow parsers used in our experiments.
\section{Four shallow parsers \label{parsers}}

In this paper, we deal with the recovery of base phrases from the
(parsed) Wall Street Journal (henceforth called WSJ). 
Base phrasal recovery is a well-known
task, and various results exist for it.  See
\cite{Sang00} for an overview.

In brief, the task is to
annotate part-of-speech (POS) tagged sentences with 
non-overlapping phrases.  Words in
sentences are marked with a `chunk' label, which can be either an {\it O}
label
(meaning the word is not in a phrase), a {\it B} label (meaning the
associated word is the first word of a given phrase) or an {\it I}
label (meaning the word is within some given phrase).  Both {\it B}
and {\it I} labels are further divided into Verbal, Nominal,
Adjectival, Adverbial, Prepositional and a small variety of other
minor phrasal categories. In total, there are $22$ possible chunk
labels. Figure \ref{exsent} shows
 an example labelled sentence.
\begin{figure}[!htb]
\begin{center}
\begin{tabular}{lllllllll}
He & reckons & the & current & account & deficit & will & narrow & to \\
B-NP & B-VP & B-NP & I-NP & I-NP &  I-NP &  B-VP &    I-VP & I-VP \\
\\ \\
 only & \# & \$ &  1.8 & billion & in & September &  &  \\
I-VP & B-PP & B-NP & I-NP & I-NP   & B-PP & B-NP & & 
\end{tabular}
\end{center}
\caption{An example annotated Wall Street Journal sentence \label{exsent}}
\end{figure}
    Note that POS
labels have been suppressed.


We now present the shallow parsers used in our experiments.  The
motivation for using these  parsers is that they are broadly
representative of shallow parsers, and furthermore, help deal with a
potential criticism that any results obtained from a single parser
might reflect idiosyncrasies of that approach, and not more general
findings. 


\subsection{The maximum entropy-based shallow parser \label{maxent}}

We used Ratnaparkhi's maximum entropy-based POS tagger 
as the basis of our maximum entropy-based shallow parser \citep{Ratn96}.
In brief, the tagger   uses the following exponential model:
\begin{eqnarray*} 
P(h,t) = \pi \mu \prod_{j=1}^{k} \alpha_{j}^{f_{j}(h,t)}
\end{eqnarray*}
$\pi$ is a normalising constant, $\mu,\alpha_1 \ldots \alpha_k$ are
parameters and $f_1 \ldots f_k$ are `features'.  Each parameter
$\alpha_j$ corresponds to a feature $f_j$.  Each feature takes an integer
value representing how many times that feature was active on an example.
$h_i$ is the {\it history} available when predicting POS tag $t_i$.

Features capture aspects of the tagging task that are deemed important when
modelling.  Here, features  are in terms of the current word to be tagged, the
previous two words, the next two words and the previous two tags.  The
tagger also uses features such as prefixes of words, word suffixes,
etc. The interested reader should consult the original paper for a
description of the features used in the tagger \citep{Ratn96}.


The parameters $\mu,\alpha_1 \ldots \alpha_k$ are chosen to maximise
the probability of the training set:
\begin{equation}
P(D \mid M) = \prod^{n}_{i=1}P(h_i,t_i) \label{gislike}
\end{equation}
$D$ is the training set (tags and their respective histories), and $M$ is
the model (features and their associated weights). 
Parameters are estimated using {\it Generalised Iterative Scaling}
(GIS).

The tagger is trained by taking a set of sentences, and 
 for each word in each sentence, labelling that word with an
 intended POS tag.  Afterwards, the
tagger uses GIS to find the set of weights that maximises the
probability of seeing this data. GIS is an iterative training method:
after each iteration, the likelihood (Equation
\ref{gislike}) increases, up to a maximum.  
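
To make the training objective concrete, the likelihood can be written in
log form by substituting the exponential model into the product.  This is
a direct expansion of the two equations above rather than an additional
modelling assumption:
\begin{eqnarray*}
\log P(D \mid M) = \sum_{i=1}^{n} \left( \log \pi + \log \mu +
\sum_{j=1}^{k} f_{j}(h_i,t_i) \log \alpha_{j} \right)
\end{eqnarray*}
Each activation of a feature $f_j$ on a training example therefore
contributes $\log \alpha_j$ to the log-likelihood, while $\pi$ keeps the
distribution normalised.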

When tagging, the model tries to recover the most likely (unobserved) tag sequence,
given a sequence of observed words. 
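
One standard way to obtain the per-tag scores needed for this search is to
condition the joint model on the history.  The form below is a sketch
under that assumption (the text above only gives the joint model), and the
symbol $T$ for the set of possible tags is introduced here purely for
illustration:
\begin{eqnarray*}
P(t \mid h) = \frac{P(h,t)}{\sum_{t' \in T} P(h,t')} =
\frac{\prod_{j=1}^{k} \alpha_{j}^{f_{j}(h,t)}}
     {\sum_{t' \in T} \prod_{j=1}^{k} \alpha_{j}^{f_{j}(h,t')}}
\end{eqnarray*}
Note that $\pi$ and $\mu$ cancel, so under this reading only the feature
parameters $\alpha_j$ determine which tag is preferred for a given history.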


To see how this approach is applied to shallow parsing, consider the following fragment of the training set:
\begin{center}
\begin{tabular}{l|llll}   \hline 
Word & $w_1$ & $w_2$ & $w_3$  & $w_4$  \\
POS Tag & $t_1$ & $t_2$ & $t_3$  & $t_4$  \\ 
Chunk  & $c_1$ & $c_2$ & $c_3$ & $c_4$  \\ \hline  
\end{tabular}
\end{center}
Words are $w_1, w_2,w_3$ and $w_4$, 
tags are $t_1, t_2, t_3$ and $t_4$ and  chunk
labels are $c_1, c_2, c_3$ and $c_4$. The task is to predict the chunk
label for word $w_2$ (the current word).

In order to convince the part-of-speech tagger to shallow parse, we 
concatenate the POS
label of the current word ($t_2$), the POS label of the next word ($t_3$),
that of the following word ($t_4$), and finally the last four letters of the
current word ($w_2$).  This concatenation is then used in place of a more
conventional word (as expected by the tagger). In place of the
tagger's POS labels, we use chunk labels. The training set is then a
sequence of concatenation-chunk label pairs. The motivation
for encoding shallow parsing information in this way comes from a
tradeoff between context and accuracy.  The more context we encode,
the greater our potential accuracy; however, the more context, the sparser our features
will be.  \cite{Osbo00c} gives a more detailed description of some of the
issues involved with shallow parsing using this tagger.
Our experiments were all
 run for $100$ iterations of GIS (unless specified otherwise).
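
As an illustration of this encoding, consider the fragment {\it He reckons
the current account} from Figure \ref{exsent}.  The POS tags used below
(PRP, VBZ, DT, JJ, NN) and the `+' separator are assumptions made for this
sketch: the figure suppresses POS labels, and the description above does
not specify how the parts are joined.  Taking {\it reckons} as the
current word $w_2$:
\begin{center}
\begin{tabular}{ll} \hline
Component & Value \\ \hline
POS tag of current word ($t_2$) & VBZ \\
POS tag of next word ($t_3$) & DT \\
POS tag of following word ($t_4$) & JJ \\
Last four letters of $w_2$ & kons \\ \hline
Resulting pseudo-word & VBZ+DT+JJ+kons \\
Chunk label ($c_2$) & B-VP \\ \hline
\end{tabular}
\end{center}
Under these assumptions, the tagger would see the pair
(VBZ+DT+JJ+kons, B-VP) in place of a conventional word/tag pair.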

\subsection{The memory-based shallow parser \label{mbl}}
Our memory-based shallow parser was constructed using the
publicly available {\it TiMBL} system \citep{Dael00}.  Memory-based 
learning stores all training examples in memory.  Examples are vectors
of features and an associated label.  At run time, an unseen example
is classified in terms of how similar it is to seen examples.  The $k$
most similar  training examples are retrieved from memory, and
the label of the unseen example is then some function of these seen examples.

In more detail, let $D(X,Y)$ be the `distance' between vector $X$ and
vector $Y$. There are many possible distance metrics, but in this work
we used the
default {\it overlap} metric:
\begin{eqnarray*}
D(X,Y) = \sum_{i=1}^{n} w_i \sigma(x_i,y_i) \\
\sigma(x_i,y_i) = \left\{ \begin{array}{ll} 0 &\mbox{if}~ x_i = y_i \\
                                            1 & \mbox{otherwise}
                          \end{array} \right.
\end{eqnarray*}
Here, $x_i$ and $y_i$ are the $i^{th}$ components of the vectors $X$ and
$Y$. $w_i$ is a weighting which can bias matching in various
ways. We used the {\it gain ratio} weighting scheme (as it produced
reasonable results). Together, the similarity metric and weighting
scheme are called IB1-IG \citep{dael92b}.

This overlap metric prefers examples which have a minimal number
of (weighted) mismatches. We find the $k$ examples that minimise the
overlap metric, and select the final label as being the label that is
most common within the $k$ examples.
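
As a worked example of the overlap metric, suppose (purely for
illustration) that examples have $n=3$ features with weights
$w_1 = 0.5$, $w_2 = 0.3$ and $w_3 = 0.2$.  These values are hypothetical
and are not gain ratio weights computed from the training set.  For
$X = (\mbox{DT}, \mbox{JJ}, \mbox{NN})$ and
$Y = (\mbox{DT}, \mbox{NN}, \mbox{NN})$, only the second components
differ, so:
\begin{eqnarray*}
D(X,Y) = 0.5 \cdot 0 + 0.3 \cdot 1 + 0.2 \cdot 0 = 0.3
\end{eqnarray*}
A mismatch on a highly weighted feature therefore costs more than a
mismatch on a lowly weighted one, and an exact match has distance $0$.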


Throughout our main experiments (Section \ref{mainexp}), we used the
IB1-IG  similarity metric (as it produced acceptable results for our
task; the choice of metric is not crucial for our experiments),
and used $k=3$. We also experimented with various values of $k$ in our
secondary experiments (Section \ref{extensions}).  

When shallow parsing, we created vectors consisting of the current
word to be labelled, the current POS tag, the previous four POS
tags, and the POS tag of the next word. Better results can
be obtained by using even more context (at the expense of slower
testing).  Since we were more concerned with running large numbers of
experiments than with obtaining the best possible results, we 
were prepared to accept a reduced performance level
(our approach performed about $1\%$ worse than that of \cite{Veen00}).
Note that this shallow parser uses different information from the
maximum entropy parser.  However, when there is no noise, both models
yield very similar performance levels to each other.  

\subsection{The N-gram shallow parser \label{tnt}}

We use Brants' Markov-based POS tagger  as the basis of
our N-gram shallow parser \citep{brant00}.  Brants' tagger selects the most likely tag
sequence $t_1 \ldots t_n$ as follows:
\begin{eqnarray*}
\mbox{argmax}_{t_1 \ldots t_n} P(t_{n+1} \mid t_{n}) \prod_{i=1}^{n}
P(t_i \mid t_{i-1}, t_{i-2}) P(w_i \mid t_i)
\end{eqnarray*}
Here, $w_i$ is a word and  $t_{n+1}$ is a tag denoting the end of a
sentence.

The components of the previous equation are estimated using
N-grams. The  model uses linear interpolation as a smoothing
device (smoothing over the term $P(t_i \mid t_{i-1}, t_{i-2})$).  There is a mechanism for dealing with unknown words, based
upon suffixes.
The model is trained using maximum likelihood estimation, whilst the
weights for linear interpolation are found using deleted interpolation.
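
For concreteness, linear interpolation of this kind is usually written as
below.  The $\hat{P}$ terms denote maximum likelihood estimates and
$\lambda_1, \lambda_2, \lambda_3$ the interpolation weights; this notation
is introduced here for illustration and follows the standard presentation
of interpolated trigram models:
\begin{eqnarray*}
P(t_i \mid t_{i-1}, t_{i-2}) = \lambda_1 \hat{P}(t_i) +
\lambda_2 \hat{P}(t_i \mid t_{i-1}) +
\lambda_3 \hat{P}(t_i \mid t_{i-1}, t_{i-2})
\end{eqnarray*}
with $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and each weight
non-negative, so that the result is again a probability distribution.
Even when a trigram is unseen in the training set, the unigram and bigram
terms can still give the tag a non-zero probability.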

We used the default settings of the system.  In particular, this
means:  using statistics from singleton words when dealing with unknown
words, using linear
interpolation to smooth and using trigrams. Note that at times we used
unigrams and bigrams instead of trigrams.  These cases are clearly
marked in the paper.

When shallow parsing, we used the same training set (words as
concatenations of POS tags, etc) as was used when
training the maximum entropy shallow parser (mentioned in Section \ref{maxent}).

Later in this paper, whenever we refer to the `N-gram parser',
we mean Brants' tagger retrained as a shallow parser. 


\subsection{The ensemble parser \label{ep}}

Our ensemble shallow parser used all three previously mentioned
parsers, and arbitrated between them using a simple majority vote.
That is, we use all parsers to label material. For each word, we
count the number of times a label was predicted 
 by the various parsers.  If a label is predicted by a majority,
then we use that as the label predicted by the ensemble parser.
Otherwise, we default to the decision made by one of the component
parsers.  As alluded to in  Section \ref{exp}, varying the default decision
yields (marginally) different performance. 
Note that we have not used weighted majority voting, as at least one
of our shallow parsers is non-probabilistic. 
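
The voting rule can be sketched as follows, writing $c^{(1)}$, $c^{(2)}$
and $c^{(3)}$ for the labels that the three component parsers assign to a
given word, and $d \in \{1,2,3\}$ for the index of the default parser
(this notation is introduced here for illustration only):
\begin{eqnarray*}
c^{\rm ens} = \left\{ \begin{array}{ll}
  c & \mbox{if at least two of}~ c^{(1)}, c^{(2)}, c^{(3)} ~\mbox{equal}~ c \\
  c^{(d)} & \mbox{otherwise}
\end{array} \right.
\end{eqnarray*}
With three voters, the default parser is only consulted when all three
parsers disagree.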


\section{Noise models \label{noise}}


In this section, we introduce the four noise models that were used in
our experiments.  Although our models are not claimed to be realistic
accounts of disfluencies as found in naturally occurring language (nor
of errors made by annotators), they
do allow us to  introduce various types
of noise into the training set, such that the parsers are stressed in
different ways.   Our noise models furthermore are  parameterised, so we can
also vary the amount of noise.
Note that training on disfluent corpora (such as {\it
  Switchboard}, see Section \ref{swbd}) would allow us to observe the effects of realistic,
naturally occurring noise, but
only at a fixed rate.  


Our first model ({\tt white}) simulates classification errors, and corresponds to the {\it
  classification noise} model \citep{Angl88}.  Within the {\tt white}
  noise model, when training, each example
received by the learner is mislabelled randomly and independently
with some fixed probability.  For example, the fragment shown in
  Figure \ref{exfrag} might become corrupted as shown in Figure \ref{exwhite}.
\begin{figure}[!htb]
\begin{center}
\begin{tabular}{lllll} 
He & reckons & the & current & account \\
B-NP & B-VP & B-NP & I-NP & I-NP 
\end{tabular}
\end{center}
\caption{Example sentence fragment, without noise \label{exfrag}}
\end{figure}

\begin{figure}[!htb]
\begin{center}
\begin{tabular}{lllll} 
He & reckons & the & current & account \\
I-NP & B-VP & O & I-NP & I-NP 
\end{tabular}
\end{center}
\caption{An example annotated sentence with white noise in the labels
  (as produced by the {\tt white} noise model)\label{exwhite}}
\end{figure}

Mislabelling is within the set of possible labels, and no account is
taken of whether the corrupted example is well-formed (that is, whether a phrase
is introduced by an appropriate beginning phrasal marker).  This means
that examples corrupted in this manner pose a harder learning problem than do
examples which are mislabelled, yet still well-formed.

This model has been studied theoretically, and various
noise-tolerant learners are known to exist for it.  From a linguistic
perspective, this model is artificial, and it is unlikely that any
real training set would be corrupted in this manner. However, it is 
a useful baseline.
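
Stated as a sketch, and writing $\eta$ for the noise rate (a symbol
introduced here for illustration), each training label $y_i$ is replaced
by a possibly corrupted label $\tilde{y}_i$ such that:
\begin{eqnarray*}
\tilde{y}_i = \left\{ \begin{array}{ll}
  y_i & \mbox{with probability}~ 1 - \eta \\
  y' \in Y & \mbox{with probability}~ \eta
\end{array} \right.
\end{eqnarray*}
where the decision to corrupt is made independently for each example.
How the replacement label $y'$ is drawn from $Y$ is not fixed by the
description above; a uniform draw is one natural assumption.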

The second model ({\tt filled}) introduces `filled-pauses' into
sentences \citep{Shri94}.  A
filled-pause is an interjection, with no semantic content, which can
occur anywhere in a sentence.  For example, Figure \ref{exfp} shows
the insertion of a filled pause into the  fragment shown in
Figure \ref{exfrag}.

\begin{figure}[!htb]
\begin{center}
\begin{tabular}{llllll} 
He & reckons &  uh & the & current & account \\
B-NP & B-VP & B-VP & B-NP & I-NP & I-NP 
\end{tabular}
\end{center}
\caption{A sentence containing a filled pause (as produced by the {\tt
    filled} noise model)\label{exfp}}
\end{figure}
Filled pauses are introduced at a fixed  and independent rate. They
are assumed to simply have the same chunk label as the previous
word (this adds extra noise to the sequence of labels and to the
sequence of words). As such, a training set with filled-pauses
contains well-formed phrases. 


The third model ({\tt repeat}) 
introduces simple `repetitions' into sentences \citep{Shri94}.  Figure
\ref{exrep} shows an example repetition.
\begin{figure}[!htb]
\begin{center}
\begin{tabular}{llllll} 
He & reckons &  reckons & the & current & account \\
B-NP & B-VP & I-VP & B-NP & I-NP & I-NP 
\end{tabular}
\end{center}
\caption{A sentence containing a repetition (as produced by the {\tt
    repeat} noise model)\label{exrep}}
\end{figure}
Like filled-pauses, repetitions are introduced randomly, at a fixed
rate.  Repetitions are assumed to be in the same phrase as that of the
previous word (this prevents the introduction of illegal sequences)
and  are assumed to only occur once (as multiple occurrences of the
same symbol can easily be eliminated by post-processing).
They are  also constrained to only occur before function
words.\footnote{Constraining repetitions in this way is an attempt to
  model the fact that $40\%$ of disfluencies in natural text 
  behave in this manner \citep{Shri94}.}
Since repetitions
depend upon the surface form of a sentence, they will be distributed
differently from the changes made by our other models.  

The final noise model ({\tt all}) used all three previous models. In
more detail, we applied the {\tt repeat} model, then the {\tt filled}
model, and finally the {\tt white} model to the training set.  

As can be expected, our noise models are simple, and so arguably
create unrealistic training scenarios. For example, our 
approach largely predicts that disfluencies are linearly
distributed, whereas \cite{Shri94} found that
the rate of disfluencies in some texts was best described
exponentially.  Furthermore, the filled pause and repetition models
should ideally assign part-of-speech
tags by running a part-of-speech tagger over the corrupted material. 


The noise models are not that closely tied to the types of noise one
might expect when manually annotating material on a massive
scale. Instead, they are largely inspired by the disfluency
literature. In lieu of a study of the kinds of errors made when
annotating material, we conjecture that the {\tt white} noise model
might be seen as crudely approximating annotation errors. That 
is, the examples are noise-free, but the actual 
labels will at times be wrong. The other noise models may be
appropriate when training on material from another genre, for example when moving from speech to edited newswire text.
Interesting future work should attempt to characterise the classes of
errors made when annotating very large quantities of material, capture
such errors in noise models, and then study their effects.

In their defence, the noise models, even if unrealistic, do allow
experiments to be carried out in which noise rates can be
varied. Varying the noise rate cannot be done so easily using
naturally occurring material. 

We now present our experiments using synthetic noise in the training
sets.

\section{Experiments using artificial noise \label{exp}}

Throughout these experiments, we
used a training set consisting of $8{,}885$ sentences.  This was
automatically extracted from the parsed Wall Street Journal \citep{Marc93}.  We also
used a disjoint testing set ($2{,}011$ sentences), also drawn from the parsed Wall Street Journal. These training and testing
sets were the same as those used by the CoNLL00 shared task.\footnote{The  
site {\tt http://lcg-www.uia.ac.be/conll2000/chunking/}  describes the task and contains all the training, testing
  and evaluation material used here.} 

Note that the training and testing sets were re-tagged using the Brill
tagger \citep{bril92b}. This means that the data (with respect to the
original parsed Wall Street Journal) contains tagging errors (noise). 
However, as both the training and testing sets
were tagged in this manner, the effects of tagging errors will be
factored out. Whenever we say `uncorrupted' material, we really mean
the material as used in the CoNLL00 shared task.  Future work should repeat our
experiments using training and testing material as originally tagged
in the Wall Street Journal.

When shallow parsing (as was previously mentioned in Section \ref{parsers}),
 the task was to assign chunk labels to tagged
sentences. Evaluation was in terms of an `f-score'.  This is the harmonic
mean of recall and precision, where {\it recall} is the percentage of
exactly matching phrases in the testing set found by the system, and 
{\it precision} is the percentage of detected phrases that are
 correct.  Shallow parsers are usually measured using f-scores, and
 not in terms of tagging accuracy.  For the purposes of our
 experiments, the actual choice of metric is not that important.
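
For reference, writing $p$ for precision and $r$ for recall, this f-score
is the usual harmonic mean (the symbols and the numerical example are
introduced here for illustration only):
\begin{eqnarray*}
F = \frac{2 p r}{p + r}
\end{eqnarray*}
For instance, hypothetical values of $p = 90$ and $r = 80$ give
$F = 2 \cdot 90 \cdot 80 / 170 \approx 84.71$, which is slightly lower
than the arithmetic average of $85$.  The harmonic mean thus penalises
an imbalance between precision and recall.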


 The best
reported system (an ensemble) achieved an f-score of $93.84$ \citep{kudo00}.  
In
general, shallow parsing is harder than POS tagging since
the correct assignment of chunk labels can depend upon much larger
contexts than those that are used by POS taggers.  For example,
shallow parsing typically needs a context of five words and/or
POS labels.  POS taggers usually operate in
terms of three words and/or POS tags.

As a baseline, a system which made decisions according to
the most likely chunk label given a POS tag (using an
uncorrupted training set) achieved an f-score of $77.07$.  Across all
noise models and noise rates, this baseline performance was largely maintained,
except for the {\tt white} and {\tt all} models at a rate of $100\%$
(as shown in Figure \ref{baseline}). These results are the averages of
$10$ runs. That is, we used the same training and testing set on each
run, but because our noise models make decisions on a random basis, they
produced a different set of disfluencies on each run.

 All other experiments were single runs (as there were so
many of them). Note that at times, performance using noisy material is
marginally 
better (at least for our baseline model) than performance using
`clean' material.  This is probably an artifact.

\begin{figure}
\begin{center}
\begin{tabular}{c c c c  c c }
Noise Model & Rate    & F-Score &   Noise Model  & Rate & F-Score \\ 

{\tt White} & 25 & 77.10 & {\tt White} & 50 & 77.11 \\ 
{\tt Repeat} & 25 & 77.16 & {\tt Repeat} & 50 & 77.96 \\ 
{\tt Filled} & 25 & 77.06 &  {\tt Filled} & 50 & 77.06 \\ 
{\tt All} & 25 & 77.10 & {\tt All} & 50 & 77.98 \\ 
{\tt White} & 75 & 77.15 & {\tt White} & 100 & 2.93 \\ 
{\tt Repeat} & 75 & 78.27 & {\tt Repeat} & 100 & 78.27 \\ 
{\tt Filled} & 75 & 77.06 & {\tt Filled} & 100 & 77.06 \\ 
{\tt All} & 75 & 78.23 & {\tt All} & 100 & 3.57 
\end{tabular}
\end{center}
\caption{Baseline performance using various noise models.  {\it Rate}
  shows the percentage chance of a corruption occurring \label{baseline}}
\end{figure}

The reason for our baseline system having  high levels of performance, even when the noise
rates are high, follows from the highly peaked nature of the underlying
distributions. Figure \ref{d2} shows an example 
distribution from the baseline parser.  As can be seen, a few points
in this distribution account for the majority of the 
probability mass.  When noise is introduced by our models, these
distributions become flatter.  However, because the distributions are
so peaked, even with a large amount of noise, these peaks are still
prominent, and so performance is maintained.


\begin{figure}
\begin{center}
\begin{tabular}{lll} 
Label & $P(\mbox{label} \mid \mbox{VBN})$ & Count \\
B-ADVP & 0.0002 & 1 \\
I-ADVP & 0.0002 & 1 \\
O & 0.0015 & 7 \\
I-ADJP & 0.0092 & 44 \\
B-PP & 0.0164 & 78 \\
B-ADJP & 0.0193 & 92 \\
B-NP & 0.0393 & 187 \\
I-NP & 0.0951 & 453 \\
B-VP & 0.2297 & 1094 \\
I-VP & 0.5891 & 2806 \\
\end{tabular}
\end{center}
\caption{Distribution of chunk labels for
  $P(\mbox{label} \mid \mbox{VBN})$, trained upon uncorrupted WSJ \label{d2}}
\end{figure}


\subsection{Performance using noise models \label{mainexp}}

Here we show the performance of our  shallow parsers in the
presence of various kinds of noise.  Section \ref{ind} analyses these results.

Figure \ref{white} shows the
results using the {\tt white} noise model.  Here, as in Figures
\ref{filled}, \ref{repeat} and \ref{all}, we show the performance of
all single-model parsers, and the performance of the best ensemble
parser, as the noise rate increases. We have not shown all ensemble
parsers as, by-and-large, they are all quite similar to each
other. It should be clear which parser relates to which curve from the labels.

\begin{flushleft} 
\begin{figure}[!hbt] 
\begin{center} 
\epsfxsize=.5\linewidth 
\rotatebox{270}{\epsfbox{gnuplot/white.ps}} 
\end{center} 
\caption{Results using {\tt White} Noise Model \label{white}}
\end{figure} 
\end{flushleft} 

%\begin{figure}[!thb]
%\begin{center}
%\begin{tabular}{c|cccccc}
%Rate & Max & MBL & TNT & Vote (Max) & Vote (MBL) & Vote (TNT) \\ \hline
%0 & 90.87 & 90.22 & 84.08 & 91.63 & 91.57 & 91.44 \\ 
%5 & 90.32 & 89.11 & 82.42 & 91.12 & 91.27 & 91.09 \\ 
%10 & 89.79 & 88.3 & 81.21 & 90.73 & 90.96 & 90.59 \\ 
%15 & 89.21 & 87.27 & 79.71 & 90.29 & 90.72 & 90.18 \\ 
%20 & 88.47 & 85.96 & 78.81 & 89.86 & 90.38 & 89.78 \\ 
%25 & 88.02 & 84.52 & 77.39 & 89.51 & 90.01 & 89.32 \\ 
%30 & 87.21 & 82.58 & 75.48 & 88.88 & 89.34 & 88.6 \\ 
%35 & 85.95 & 80.73 & 73.77 & 88.09 & 88.67 & 87.59 \\ 
%40 & 85.05 & 78.15 & 71.42 & 87.38 & 87.72 & 86.61 \\ 
%45 & 83.44 & 75.24 & 69.61 & 86.01 & 85.9 & 84.8 \\ 
%50 & 81.83 & 72.01 & 67.59 & 84.56 & 84.19 & 83.16 \\ 
%55 & 79.15 & 67.84 & 64.37 & 82.2 & 81.8 & 80.3 \\ 
%60 & 76.52 & 63.52 & 62.02 & 80.15 & 79.29 & 77.5 \\ 
%65 & 72.84 & 57.89 & 59.35 & 76.64 & 75.08 & 73.44 \\ 
%70 & 66.5 & 53.1 & 55.42 & 71.29 & 70.23 & 68.31 \\ 
%75 & 59.82 & 46.67 & 50.53 & 65.34 & 64.54 & 62.21 \\ 
%80 & 50.49 & 38.53 & 44.85 & 56.24 & 56.05 & 54.17 \\ 
%85 & 40.07 & 29.87 & 38.51 & 45.21 & 45.33 & 44.8 \\ 
%90 & 25.36 & 20.6 & 28.93 & 29.59 & 31.35 & 32.05 \\ 
%95 & 11.49 & 11.61 & 18.4 & 13.88 & 16.04 & 18.99 \\ 
%100 & 3.5 & 3.08 & 3.74 & 3.55 & 3.33 & 3.7 \\ 
%\end{tabular}
%\end{center}
%\caption{Results using {\tt White} Noise Model \label{white}}
%\end{figure}

As can be seen, increasing the noise rate decreases the performance of
all our parsers.  Furthermore, we see that this degradation is
relatively graceful: no parser exhibits a catastrophic performance
drop as noise is introduced into the training set. 
When there is noise in the labels, we see that the maximum entropy
parser is more robust than the memory-based parser.  The N-gram
parser is mid-way between the two other single-model parsers.  

Turning now to the ensemble parsers (only the ensemble parser
defaulting to maximum entropy is shown), firstly, we find that all three
ensemble parsers usually outperform the single-model parsers.  Ensemble
learning can make shallow parsing more robust in the presence of white
noise.  Secondly, we see that
defaulting to maximum entropy usually produces the best results. 
Finally, at $100\%$ noise
the training set cannot be distinguished from the results of flipping a
coin, and so performance is minimal for all parsers.


%\begin{figure}[!thb]
%\begin{center}
%\begin{tabular}{c|cccccc}
%Rate & Max & MBL & TNT & Vote (Max) & Vote (MBL) & Vote (TNT) \\ \hline
%0 & 90.88 & 90.22 & 84.08 & 91.63 & 91.58 & 91.45 \\ 
%5 & 90.54 & 90.11 & 83.51 & 91.31 & 91.32 & 91.19 \\ 
%10 & 89.94 & 90.13 & 82.45 & 90.8 & 90.95 & 90.76 \\ 
%15 & 89.46 & 89.67 & 82.04 & 90.57 & 90.72 & 90.53 \\ 
%20 & 88.71 & 89.76 & 81.18 & 90.07 & 90.35 & 89.99 \\ 
%25 & 87.81 & 89.47 & 80.28 & 89.48 & 89.82 & 89.49 \\ 
%30 & 87.1 & 89.34 & 79.02 & 89.06 & 89.39 & 89.01 \\ 
%35 & 86 & 88.93 & 77.89 & 88.34 & 88.76 & 88.38 \\ 
%40 & 84.94 & 88.89 & 76.71 & 88 & 88.44 & 88.12 \\ 
%45 & 83.87 & 88.26 & 75.18 & 87.11 & 87.66 & 87.31 \\ 
%50 & 82.39 & 87.44 & 73.69 & 85.8 & 86.69 & 86.15 \\ 
%55 & 80.34 & 86.95 & 70.45 & 84.53 & 85.45 & 84.95 \\ 
%60 & 78.83 & 86.73 & 69.04 & 83.95 & 84.98 & 84.28 \\ 
%65 & 76.34 & 86.11 & 66.7 & 82.63 & 83.69 & 83.14 \\ 
%70 & 73.54 & 85.64 & 64.14 & 80.57 & 82.13 & 81.25 \\ 
%75 & 70.79 & 84.66 & 61.42 & 78.64 & 80.25 & 79.33 \\ 
%80 & 67.47 & 83.89 & 58.32 & 75.85 & 77.96 & 76.39 \\ 
%85 & 62.02 & 81.76 & 55.24 & 72.23 & 74.72 & 72.93 \\ 
%90 & 56.21 & 79.54 & 52.39 & 68.15 & 71.27 & 68.85 \\ 
%95 & 46.06 & 76.77 & 47.09 & 60.46 & 64.86 & 60.84 \\ 
%100 & 6.93 & 52.61 & 8.33 & 14.22 & 38.93 & 12.76 \\ 
%\end{tabular}
%\end{center}
%\caption{Results using {\tt Filled} Noise Model \label{filled}}
%\end{figure}

\begin{flushleft} 
\begin{figure}[!hbt] 
\begin{center} 
\epsfxsize=.5\linewidth 
\rotatebox{270}{\epsfbox{gnuplot/fill.ps}} 
\end{center} 
\caption{Results using {\tt Filled} Noise Model \label{filled}}
\end{figure} 
\end{flushleft} 

Looking now at the results using our {\tt filled} noise model (Figure
\ref{filled}), we see a different picture.  The main difference from
the previous results is that the memory-based parser is now the clear
winner, even generally outperforming the ensemble parsers.  With the
various ensemble parsers, we find that again, defaulting to the
behaviour of the best single parser yields marginally the best
ensemble parsing results.  However, we also found that
the other two ensemble parsers (maximum entropy default and N-gram
default) produce results that are similar to each other.  

%\begin{figure}[!thb]
%\begin{center}
%\begin{tabular}{c|cccccc}
%Rate & Max & MBL & TNT & Vote (Max) & Vote (MBL) & Vote (TNT) \\ \hline
%0 & 90.87 & 90.22 & 84.08 & 91.63 & 91.57 & 91.45 \\ 
%5 & 90.88 & 90.16 & 83.87 & 91.58 & 91.52 & 91.42 \\ 
%10 & 90.75 & 90.08 & 83.72 & 91.48 & 91.41 & 91.3 \\ 
%15 & 90.51 & 90.14 & 83.49 & 91.3 & 91.33 & 91.16 \\ 
%20 & 90.45 & 90.11 & 83.23 & 91.17 & 91.13 & 90.97 \\ 
%25 & 90.24 & 90.22 & 82.95 & 90.97 & 91.01 & 90.83 \\ 
%30 & 90.18 & 89.99 & 82.83 & 90.91 & 90.95 & 90.83 \\ 
%35 & 90 & 90.03 & 82.39 & 90.75 & 90.78 & 90.64 \\ 
%40 & 89.71 & 90.17 & 82.28 & 90.68 & 90.86 & 90.65 \\ 
%45 & 89.36 & 89.76 & 81.62 & 90.15 & 90.39 & 90.17 \\ 
%50 & 89.34 & 89.98 & 81.58 & 90.26 & 90.43 & 90.2 \\ 
%55 & 89.13 & 89.6 & 81.4 & 89.98 & 90.22 & 90.1 \\ 
%60 & 88.7 & 89.73 & 81.04 & 89.62 & 89.93 & 89.71 \\ 
%65 & 88.55 & 89.76 & 80.79 & 89.58 & 89.9 & 89.77 \\ 
%70 & 88.11 & 89.79 & 80.36 & 89.19 & 89.63 & 89.48 \\ 
%75 & 87.33 & 89.51 & 80.13 & 88.64 & 89.32 & 89.01 \\ 
%80 & 86.73 & 89.43 & 79.58 & 88.16 & 88.84 & 88.69 \\ 
%85 & 85.99 & 89.19 & 79.03 & 87.6 & 88.41 & 88.3 \\ 
%90 & 85.15 & 88.84 & 78.28 & 86.91 & 87.84 & 87.63 \\ 
%95 & 83.02 & 88.63 & 76.98 & 85.12 & 86.62 & 86.41 \\ 
%100 & 76.4 & 87.87 & 75.17 & 80.67 & 83.61 & 82.99 \\ 
%\end{tabular}
%\end{center}
%\caption{Results using {\tt Repeat} Noise Model \label{repeat}}
%\end{figure}

\begin{flushleft} 
\begin{figure}[!hbt] 
\begin{center} 
\epsfxsize=.5\linewidth 
\rotatebox{270}{\epsfbox{gnuplot/repeat.ps}} 
\end{center} 
\caption{Results using {\tt Repeat} Noise Model.  Note the change of scale. \label{repeat}}
\end{figure} 
\end{flushleft} 

The results using the {\tt repeat} noise model
 (Figure \ref{repeat}) are different again
from the previous two sets of results.  Generally, we see that
performance is much less affected than before.  

%\begin{figure}[!thb]
%\begin{center}
%\begin{tabular}{c|cccccc}
%Rate & Max & MBL & TNT & Vote (Max) & Vote (MBL) & Vote (TNT) \\ \hline
%0 & 90.86 & 90.23 & 84.12 & 91.63 & 91.57 & 91.45 \\ 
%5 & 89.58 & 89.03 & 82 & 90.65 & 90.93 & 90.69 \\ 
%10 & 88.26 & 87.73 & 79.92 & 89.82 & 90.16 & 89.86 \\ 
%15 & 86.65 & 86.63 & 78.01 & 88.76 & 89.45 & 88.86 \\ 
%20 & 85.24 & 85.19 & 75.63 & 87.69 & 88.54 & 87.74 \\ 
%25 & 83.09 & 84.23 & 72.92 & 86.46 & 87.55 & 86.28 \\ 
%30 & 80.95 & 82.29 & 69.44 & 85.01 & 86.38 & 84.69 \\ 
%35 & 77.49 & 80.54 & 67.17 & 82.99 & 84.8 & 83.07 \\ 
%40 & 74.36 & 77.35 & 63.91 & 80.69 & 82.81 & 80.45 \\ 
%45 & 71.04 & 75.27 & 60.49 & 78.62 & 80.83 & 78.09 \\ 
%50 & 65.64 & 71.59 & 57 & 74.28 & 77.61 & 74.72 \\ 
%55 & 59.2 & 67.91 & 53.45 & 69.75 & 73.83 & 70.32 \\ 
%60 & 54.87 & 63.45 & 49.22 & 65.6 & 70.57 & 65.99 \\ 
%65 & 46.42 & 58.64 & 45.51 & 57.73 & 64.73 & 59.99 \\ 
%70 & 38.6 & 53.09 & 41.29 & 50.43 & 59.93 & 54.13 \\ 
%75 & 32.86 & 48.93 & 36.99 & 43.81 & 54.26 & 47.54 \\ 
%80 & 24.88 & 42.25 & 30.89 & 34.53 & 46.21 & 39.36 \\ 
%85 & 15.56 & 34.25 & 26.3 & 23.73 & 36.44 & 30.82 \\ 
%90 & 9.33 & 25.17 & 17.19 & 13.9 & 24.84 & 19.24 \\ 
%95 & 5.53 & 17.77 & 8.95 & 7.47 & 16.08 & 9.48 \\ 
%100 & 3.11 & 3.04 & 3.23 & 3.14 & 3.18 & 3.06 \\ 
%\end{tabular}
%\end{center}
%\caption{Results using {\tt All} Noise Model \label{all}}
%\end{figure}  

\begin{flushleft} 
\begin{figure}[!hbt] 
\begin{center} 
\epsfxsize=.5\linewidth 
\rotatebox{270}{\epsfbox{gnuplot/all.ps}} 
\end{center} 
\caption{Results using {\tt All} Noise Model \label{all}}
\end{figure} 
\end{flushleft} 


Our final set of results, as produced by the {\tt all} noise model  (shown
in Figure \ref{all}) reveals behaviour that is again different from
the other results.  As shown previously in Figure \ref{filled}, the memory-based
parser does best out of the single-model parsers.  However, like the
results shown in Figure \ref{white}, the ensemble parsers mainly do
better than the single-model parsers. Unlike the previous results, we
see that defaulting to the N-gram parser, for high noise levels,
yields better results than when defaulting to the maximum entropy parser.

Figure \ref{bestsum} summarises which parser is best for each noise model at selected noise rates.


\begin{figure}[!htb]
\begin{center}
\begin{tabular}{c|cccccccc}
Rate & Model & Best & Model & Best & Model & Best & Model & Best \\ \hline
25 & {\tt white} & vote (MBL) & {\tt filled} & vote (MBL) & {\tt repeat} & vote (MBL) & {\tt all} & vote (MBL) \\
50 & {\tt white} & vote (maxent) & {\tt filled} & MBL & {\tt repeat} & vote (MBL) & {\tt all} & vote (MBL) \\
75 & {\tt white} & vote (maxent) & {\tt filled} & MBL & {\tt repeat} & MBL & {\tt all} & baseline \\
\end{tabular}
\end{center}

\caption{Summary of results \label{bestsum}}
\end{figure}


\subsection{Analysis \label{ind}}

Here we analyse the results.  The most important point to be made is that no
parser is uniformly best.  For example, the ensemble parser
(defaulting to maximum entropy) generally yields the best results with respect
to the {\tt white} noise model.  However, for the {\tt filled} noise
model, we see that the memory-based parser is usually best.  This
variation in performance is due to properties of the
noise-avoidance strategies used and, in the case of ensemble parsers,
to a violation of the assumption that errors are uncorrelated.  In more detail, we see
that for white noise, because the ensemble learners beat all the
single-model parsers, we can conclude that the errors made are
uncorrelated.  Hence, the individual parsers all deal with white noise
suboptimally (otherwise, there would be no gain to be made using
ensemble learning).  For filled pauses, we see that both maximum entropy and
N-grams make correlated errors (as the ensemble parser, defaulting to
memory-based learning, is worse than the single memory-based parser),
and so ensemble learning does not help.  
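
To see why uncorrelated errors matter, consider an idealised sketch (the assumptions here are ours): three parsers each label a token wrongly with the same probability $p$, independently of each other, and the majority vote is counted as wrong whenever at least two of them are wrong.  The error rate of the vote is then
\[
P(\mbox{vote wrong}) = 3p^2(1-p) + p^3 = 3p^2 - 2p^3 ,
\]
which is below $p$ whenever $0 < p < 1/2$; for example, $p = 0.1$ gives $0.028$.  At the other extreme, if the three parsers always made the same errors, the vote would be wrong exactly when each parser is wrong, with probability $p$, and voting would bring no gain.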

The results using noise in the labels (the {\tt white} model) on
average are worse than the results obtained using disfluencies in the
examples (the {\tt filled} and {\tt repeat} models).  This might be
because  the {\tt white} model introduces more randomness 
into the training set than do the other models.\footnote{An
  anonymous reviewer made the point that because this noise model does
  not guarantee that examples are well-formed, the  performance
  drop may also be due to the fact that the hypothesis space is
  potentially much
  larger. Hence, it would be possible to reduce this drop by making
  sure that examples are well-formed, possibly by some post-processing.} Another
reason might be that supervised learners are more sensitive to noise
in labels  than to noise in examples.

The noise-avoidance 
story which appears to have  emerged is that, out of a space of possible
noise-reduction techniques,  and for the set of parsers considered in
this paper, no single method appears to work 
uniformly well.  Theoretical support for this position is given
by \cite{Wolp95}.  Their {\it No Free Lunch} theorem (informally) states
that no single algorithm will work uniformly well across all problems.  In
 terms of noise, this means that an algorithm which
can deal well with (for example) errors in the labels of examples may
not do well with (for example) errors in the examples themselves. This
also suggests that a parser which does well in the noise-free case may
not do well in a noisy situation, and that some other parser may do better.

We do not conclude that any one technique is better than any other
approach. For example, we see that memory-based learning is closely
related to smoothing using backing-off \citep{Zavr97}.  Hence, we could
probably emulate the noise-tolerance of memory-based learning within
an N-gram parser that used backing-off. This is equivalent to using
some other model better suited to noise-tolerance than the one we did
use.  %Furthermore, all our parsers
%used different models (features), and as an anonymous reviewer
%mentioned, it is entirely possible that our results are a property of
%the particular model used (and not necessarily of the
%technique).\footnote{To control for this effect, one would need to
%  make sure that each parser used the same model. However, to repeat,
%  we are not trying to make statements about the relative merits of
%  any particular parser.  We are more interested in performance
%  differences for a given parser when confronted by various kinds of
%  noise.}


However, 
returning to the {\it No Free Lunch
  Theorem}, we conjecture
that no single parser will be uniformly superior in all noise
scenarios. So, if we made the N-gram parser tolerant to filled-pauses,
we would expect it to perform worse on some other problem.  Section
\ref{extensions} illustrates this point.


In summary, we see that robustness to noisy training material is a
many-faceted problem, and it is unlikely that any single strategy will
work well uniformly. Instead, we advocate understanding the properties
of the noise-avoidance strategy used by the parser and determining
whether it is suitable for some particular domain.  We do not believe
in a silver-bullet algorithm which will work well for any given problem.


\subsection{Parser-specific extensions \label{extensions}}

Here we consider, in the light of our previous experiments, whether we
can make our parsers more robust to noise.  We restrict ourselves to 
what can be easily achieved within the framework of a particular
parser. So, we do not build new parsers from scratch; instead we vary 
any available parameters that might improve results.  This restriction
is natural in that one is more likely to try to get the best results
from a given off-the-shelf parser than try to build one from scratch.

Note that we are not primarily interested in trying to maximise
 the performance
of a system for a given noisy training set. Instead, we wish to gain
an insight into how parser-specific extensions behave across a variety
of settings. As such, we report system performances for a variety of
noise models and  for different noise levels. We therefore have not
used a held-out set to optimise the various parser extensions. Clearly
however, in a practical setting, one would want to do this.

Turning to the maximum entropy parser, within the confines of the
implementation, the only parameter we can vary
is the number of training iterations allowed to fit the model.
As was mentioned in Section \ref{maxent}, after each iteration the
model fits the training material more closely. In general (modulo
overfitting),  after each iteration, performance also increases.  Now,
roughly speaking, after a few iterations, the model fits aspects of the
data of which there is ample evidence.  
  As the number of iterations increases, the model fits more
specific `patterns'.  Should    it be  possible to differentiate
noise from the
underlying, uncorrupted  machinery producing the data in this manner, then it is
possible that early stopping (running for only a few iterations rather
than for many iterations) would yield better performance than when
halting after carrying out many iterations.


Figure \ref{maxentk} shows 
the results we obtained when varying the number of iterations
(using the  {\tt all} noise model). The columns marked  `Iter'
refer to the number of iterations used to train a model.

\begin{figure}[!htb]
\begin{center}
\begin{tabular}{c|  c c c c c c c c c c}
Rate   & Iter & F-score  & Iter & F-score  & Iter &
F-score  & Iter & F-score  & Iter & F-score  \\ \hline
0 & 20 & 89.89   & 40 & 90.54   & 60 & 90.67   & 80 & 90.81   & 100 & 90.87   \\ 
 10  & 20 & 88   & 40 & 88.54   & 60 & 88.62   & 80 & 88.44   & 100 & 88.32   \\ 
 20  & 20 & 85.56   & 40 & 85.81   & 60 & 85.7   & 80 & 85.46   & 100 & 85.24   \\ 
 30  & 20 & 82.1   & 40 & 82.01   & 60 & 81.37   & 80 & 80.83   & 100 & 80.75   \\ 
 40  & 20 & 75.47   & 40 & 75.54   & 60 & 74.94   & 80 & 74.64   & 100 & 74.51   \\ 
 50  & 20 & 68.25   & 40 & 67.44   & 60 & 66.86   & 80 & 66.7   & 100 & 66.42   \\ 
 60  & 20 & 57.26   & 40 & 55.38   & 60 & 54.62   & 80 & 54.31   & 100 & 54.12   \\ 
 70  & 20 & 45.22   & 40 & 40.81   & 60 & 39.55   & 80 & 39.23   & 100 & 39.2   \\ 
 80  & 20 & 30.89   & 40 & 25.38   & 60 & 23.74   & 80 & 23.14   & 100 & 22.9   \\ 
 90  & 20 & 14.34   & 40 & 11.73   & 60 & 10.94   & 80 & 10.62   & 100 & 10.47   \\ 
 100  & 20 & 3.19   & 40 & 3.29   & 60 & 3.33   & 80 & 3.46   & 100 & 3.56 
\end{tabular}
\end{center}
\caption{Results for the maximum entropy parser, using early stopping,
  with the {\tt all} noise model \label{maxentk}}
\end{figure}
As can be seen, early stopping does improve results: as the noise rate
increases, it is better to use fewer and fewer iterations. 

Next, we considered whether we could make the memory-based parser more
resistant to noise in the labels. The TiMBL implementation allows one to
vary the number of nearest neighbours that are used when
classifying unseen examples.  When $k=1$, the system does no
smoothing over labels, and bases its decision upon just the single
best matching example.  As $k$ increases, the system becomes less
sensitive to individual labels.  So, varying $k$ ought to make the
parser more robust to labelling errors.
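
A simplified calculation (with assumptions of our own) suggests why.  Suppose the $k$ nearest neighbours of a test item, with $k$ odd, all share the same correct label before corruption, that labels are binary, and that each neighbour's label is flipped independently with probability $\varepsilon$.  A majority vote over the neighbours is then wrong with probability
\[
P_k(\varepsilon) = \sum_{j = (k+1)/2}^{k} {k \choose j} \varepsilon^{j} (1 - \varepsilon)^{k - j}.
\]
For $\varepsilon = 0.2$ this gives $P_1 = 0.2$ but $P_5 \approx 0.058$.  Chunk labelling is a multi-class problem and neighbours need not agree even without noise, so the formula only indicates the direction of the effect.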

\begin{figure}[!htb]
\begin{center}
\begin{tabular}{c|c c  c c c c c c c c c c}
Rate & $k$ & F-score & $k$ & F-score & $k$ & F-score & $k$ & F-score &
$k$ & F-score & $k$ & F-score 
\\ \hline
0 & 1 & 89.44 & 3 & 90.22   & 5 & 89.78   & 7 & 88.96   & 9 & 88.16   & 11 & 87.58   \\ 
 10 & 1 & 78.26 & 3 & 87.74   & 5 & 87.79   & 7 & 87.36   & 9 & 86.56   & 11 & 86.09   \\ 
 20 & 1 & 66.86 & 3 & 84.97   & 5 & 85.94   & 7 & 85.86   & 9 & 85.83   & 11 & 85.42   \\ 
 30 & 1 & 57.51 & 3 & 82.93   & 5 & 85.14   & 7 & 85.02   & 9 & 84.65   & 11 & 84.04   \\ 
 40 & 1 & 46.96 & 3 & 77.89   & 5 & 82.92   & 7 & 83.89   & 9 & 84.14   & 11 & 83.61   \\ 
 50 & 1 & 36.46 & 3 & 71.35   & 5 & 80.48   & 7 & 82.13   & 9 & 82.19   & 11 & 81.77   \\ 
 60 & 1 & 29.67 & 3 & 64.6   & 5 & 77.18   & 7 & 80.81   & 9 & 80.73   & 11 & 80.37   \\ 
 70 & 1 & 22.44 & 3 & 52.97   & 5 & 70.86   & 7 & 77.05   & 9 & 78.69   & 11 & 78.17   \\ 
 80 & 1 & 16.99 & 3 & 42.89   & 5 & 61.93   & 7 & 71.84   & 9 & 75.28   & 11 & 75.18   \\ 
 90 & 1 & 11.09 & 3 & 25.18   & 5 & 41.38   & 7 & 54.18   & 9 & 63.22   & 11 & 65.75   \\ 
 100 & 1 & 2.93 & 3 & 2.77   & 5 & 2.62   & 7 & 2.74   & 9 & 2.09   & 11 & 1.8 
\end{tabular}
\end{center}
\caption{Results for the memory-based parser, varying $k$ and using
  the {\tt all} noise model \label{mblk}}
\end{figure}

Figure \ref{mblk} shows the results obtained when training over
material again corrupted by the {\tt all} noise model.  For each noise level, we
varied $k$ and recorded the results.  


Clearly, varying the number of nearest neighbours ($k$) improves
results as the amount of noise increases. 
Note that what is best when training upon material with 
no added noise (a rate of 0) is not best when training upon material
that is noisy.  Also, we see that with no noise, the best value of
$k$ is not 1.  This suggests that the Wall Street Journal is noisy (a
well-known result).  Another possible reason %is that the Brill tagger
%introduced even more noise into the data than was present initially.
is that this is an artifact of the particular choice of features we
used. 

Our results can be seen as being largely contrary to the related work
of \cite{Dael99}. They also varied $k$ for a range of language
problems.  In general, they found that increasing $k$ (from $1$)
decreased performance.  Partly on the basis of this, they concluded
that one should not abstract (smooth) from the data.  Clearly we see
that this argument only holds when using clean training material.

 
Finally, we considered whether we could improve the results of the
N-gram parser.  The implementation we use allows one to vary the order
of the Markov model.\footnote{Note that it is also possible to use
  less specific features in an N-gram model.  This would also simplify
 the resulting model, and so make it less sensitive to noise.  An
 interesting set of experiments would be to see how this approach
 compares with varying the order of the model.} That is, we can use either unigrams (order 1),
bigrams (order 2) or else trigrams (order 3).  Clearly, the lower the
order, the simpler the model.  Simple models can sometimes help deal
with the effects of noise (as was argued in Section \ref{shallow}).
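
In generic terms (the notation is ours and is not taken from the implementation), a Markov model of order $n$ over a label sequence $c_1, \ldots, c_m$ uses the approximation
\[
P(c_1, \ldots, c_m) \approx \prod_{i=1}^{m} P(c_i \mid c_{i-n+1}, \ldots, c_{i-1}),
\]
so that order 1 conditions on nothing, order 2 on the previous label, and order 3 on the previous two labels.  With $|C|$ distinct labels there are $|C|^{n-1}$ possible contexts, each with its own distribution to estimate, which is one way of seeing why a lower order gives a simpler model with more training events per parameter.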

As before, we corrupted the training set with varying rates of noise
(using the {\tt all} model) and estimated models using these corrupted
training sets.  We also varied the order of the model.  Figure
\ref{ngramres} summarises our results.

\begin{figure}
\begin{center}
\begin{tabular}{c | c c c c c c c}
Rate & Model order & F-score & Model order & F-score  & Model order &
F-score \\ \hline
0 & 1              & 75.45 & 2 & 83.86 & 3 & 84.10 \\
10 & 1             & 74.20 & 2 & 80.21 & 3 & 80.23 \\
20 & 1             & 71.52 & 2 & 76.22 & 3 & 75.44 \\
30 & 1             & 69.21 & 2 & 71.52 & 3 & 69.86 \\
40 & 1             & 65.98 & 2 &     66.37  & 3 &     64.04 \\
50 & 1             & 60.35 & 2 &     59.48  & 3 &     56.64 \\
60 & 1             & 54.14 & 2 &     52.54  & 3 &     49.85 \\
70 & 1             & 46.43 & 2 &     43.71  & 3 &     41.42 \\
80 & 1             & 36.77 & 2 &     33.77  & 3 &     31.81 \\
90 & 1             & 20.88 & 2  &    17.79  & 3 &     17.03 \\
100 & 1            &  2.15  & 2  &   2.42  & 3 &      2.61 
\end{tabular}
\end{center}
\caption{Results for the N-gram-based parser, varying the model order
  and using the {\tt All} noise model \label{ngramres}}
\end{figure}
As expected, we see that with increasing amounts of noise, it becomes
better to use simpler models. Note that we could have further
simplified our models by reducing the information present within a
`word'.  If we recall, the N-gram parser is trained using words which
are concatenations of (for example) POS tags.  By reducing
the information content of a word, we would also simplify the model,
and so presumably improve upon our results.  The baseline parser
(which only considers POS tags of a unigram model) can be
seen as being such a simplified approach.

In summary, we see that in theory, all of our single-model parsers can be made
more noise-resistant.  Furthermore, we see that tackling noise means
abstracting from the training material. For example, early stopping
prevents maximum entropy from modelling all regularities in the data;
increasing the size of $k$ in memory-based learning forces the system
to abstract away from actual examples; reducing the order of a Markov
model means making the parser ignore more and more of the context in
the training set.  In practice (trying to optimise a parser for noise
in a realistic setting), our noise-avoidance strategies may become less
useful. Clearly this is an empirical matter.


In the next Section, we look at what happens 
when  the training set contains  naturally occurring disfluencies.

\section{Experiments using Switchboard \label{swbd}}

Since our previous experiments all used naturally occurring language
corrupted by artificial noise, we also trained our shallow parsers
using realistically disfluent text.  This allows us to investigate
naturally occurring noise. We did not have
access to (noisy) material produced by people annotating large
quantities of text and so could not carry out experiments measuring
noise related to annotator errors.

   The WSJ is not disfluent (being
edited newswire material) so studying the effects of naturally
occurring disfluencies requires non-edited  material.
The parsed {\it Switchboard Corpus}
(henceforth called SWBD) 
consists of manually parsed 
telephone conversations between pairs of people.  Since it
is composed of 
conversational speech, it contains numerous disfluencies (such as
interruptions, incomplete sentences, etc.).  Training upon SWBD would
therefore provide complementary results to those found using synthetic
noise. However, SWBD is a
different genre from WSJ, and so phrasal distributions will be different
between these two domains.  We therefore have to differentiate between
the effects of disfluencies and distributional variations before we
can come to any conclusions about the effects of naturally occurring
disfluencies on shallow parsing. 

For 
the experiments here, we used all files in section $2$ of the SWBD distribution.  
Using the same script (from the CoNLL website), we created a
training set that consisted of utterances marked with POS
tags and chunk labels.  Figure \ref{exswb} shows an example SWBD utterance annotated in this manner (POS
labels have been suppressed).  There were $35{,}706$ sentences in our
SWBD training set.  


\begin{figure}[!htb]
\begin{center}
\begin{tabular}{cccccccc}
And & , & um & , & she & had & a & fall \\ 
O & O & B-INTJ & O  & B-NP & B-VP & B-NP & I-NP
\end{tabular}
\caption{An example SWBD annotated utterance \label{exswb}}
\end{center}
\end{figure}


Using the same parser configurations described in Section \ref{exp},
and when testing using the same testing material (namely uncorrupted
parsed WSJ material), we obtained the 
results shown in Figure \ref{swbres1}.
Unsurprisingly, these results are all worse than when training upon
uncorrupted WSJ material (as shown, for example, by
the zero-noise results in Figure \ref{white}). 

\begin{figure}[!htb]
\begin{center}
\begin{tabular}{cc c c}
Parser & F-score & Parser & F-score \\ 
 Max & 82.97 & Vote (Max) & 84.62 \\
TNT & 70.73  & Vote (TNT) & 84.33 \\
MBL & 83.20  & Vote (MBL) & 85.08 \\
Baseline & 65.86 & - & - \\
\end{tabular}
\end{center}
\caption{Results when training upon SWBD \label{swbres1}}
\end{figure}

In order to determine whether this performance drop was due to disfluencies, 
or else due to a change in domain (or both), we ran further
experiments. The next Section (\ref{swbd1}) examines the influence of
disfluencies, whilst Section \ref{swbd2} looks at distributional
differences between SWBD and WSJ.  
 
\subsection{Estimating the influence of noise  \label{swbd1}}

The first experiment tried to see if we could reduce noise using one
of the techniques mentioned previously in Section \ref{extensions}. 
If we could reduce the effects of noise, then this would suggest that
SWBD contained significant levels of noise, which in turn was
contributing towards the performance drop.  This assumes that there is a connection between the
experimental setups in Figures \ref{mblk} and \ref{mblvaryk}.  
Figure
\ref{mblvaryk} shows the results obtained when training the
memory-based parser and varying $k$.  As can be seen, the best
performance occurs when $k$ is $3$.  Comparing these results with 
those obtained using artificial noise (and the WSJ) in
Figure \ref{mblk}, we see evidence that the noise rate is less than
$10\%$.\footnote{As an anonymous reviewer pointed out, caution should
  be exercised here.  We do not know for sure the relationship between
  Switchboard noise and our noise models. This means that it still
  remains to be seen exactly how the $k$ level relates to  noise
  levels. One way of dealing with this issue would be to try and model
  naturally occurring noise more closely and then see whether, for
  varying levels of noise, there was a relationship with $k$.}
   That is, in Figure \ref{mblk}, $k=3$ is best only when the noise rate is less than $10\%$. Noise may therefore not be a significant factor in
the performance drop.

\begin{figure}[!htb]
\begin{center}
\begin{tabular}{ccccccc}
$k$ & 1 & 3 & 5 & 7 & 9 & 11 \\
F-score & 80.35 & 83.20 & 82.63 & 81.68 & 80.94 & 80.32
\end{tabular}
\caption{Results when training upon SWBD, using the
  memory-based parser, varying $k$ \label{mblvaryk}}
\end{center}
\end{figure}

These results are not sufficient in themselves, so we also looked at what happened when annotated disfluencies were
removed from SWBD. If these
disfluencies  were significantly
contributing towards errors, then we might expect that this `cleaning' of
the data would improve our results.  We removed any words that were
marked as being an interjection from the training material.   As can
be seen in Figure \ref{swbdres3}, if anything, removing
interjections harms performance.  The reason for this lack of
improvement is that after
removing interjections, phrases are made contiguous that otherwise
would have been separated by interjections.

\begin{figure}[!htb]
\begin{center}
\begin{tabular}{cc c c}
Parser & F-score & Parser & F-score \\ 
 Max & 81.90 & Vote (Max) & 83.77 \\
TNT & 70.47  & Vote (TNT) & 83.57 \\
MBL & 82.70  & Vote (MBL) & 84.58 \\
Baseline & 72.24 & - & - \\
\end{tabular}
\end{center}
\caption{Results when training upon  SWBD with some disfluencies
removed \label{swbdres3}}
\end{figure}

Figure \ref{d6} shows a  distribution
 of the baseline parser (when trained upon the usual SWBD
 training set).  Comparing this distribution with the one obtained
 when the baseline parser was trained on WSJ
 material (Figure \ref{d2}), we see that
 there are differences. For example, comparing Figures \ref{d2} and \ref{d6},
 we see that words tagged with the POS label VBN are more
 likely in SWBD than in WSJ to be within a verbal chunk (I-VP) or to start one (B-VP),
 and much less likely to be part of a noun chunk (B-NP or I-NP).  In both corpora, such
 words are more likely to be within a verbal chunk than to start one.  Furthermore, we see
 that the distribution for SWBD is  not significantly more spread-out (more uniform)
 than the distribution for  WSJ.  If
 disfluencies were strongly present, and were uniformly distributed,
 then we might expect to see such a flattening.  
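
One way of quantifying how spread-out such a distribution is (our choice of measure, not one used in the experiments) is its entropy,
\[
H = - \sum_{l} P(l \mid \mbox{VBN}) \log_2 P(l \mid \mbox{VBN}).
\]
Computed from the rounded probabilities in Figures \ref{d2} and \ref{d6}, this gives roughly $1.7$ bits for WSJ and roughly $1.3$ bits for SWBD, so by this measure the SWBD distribution is, if anything, the more peaked of the two.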


\begin{figure}[!htb]
\begin{center}
\begin{tabular}{lll} 
Label & $P(\mbox{label} \mid \mbox{VBN})$ & Count \\
I-CONJP & 0.0002 & 1 \\
B-NP & 0.0035 & 18 \\
O & 0.0037 & 19 \\
B-PP & 0.0066 & 34 \\
I-NP & 0.0072 & 37 \\
I-ADJP & 0.0093 & 48 \\
B-ADJP & 0.0231 & 119 \\
B-VP & 0.2936 & 1512 \\
I-VP & 0.6527 & 3361 
\end{tabular}
\end{center}
\caption{Distribution of chunk labels for 
  $P(\mbox{label} \mid \mbox{VBN})$, trained upon SWBD \label{d6}}
\end{figure}

In summary, we are led to believe that naturally occurring
disfluencies are not that harmful to performance.  The next set of
experiments reinforces this opinion.

\subsection{Estimating the effects of distributional differences \label{swbd2}}


In order to determine whether the performance drop was due to
distributional differences,  we  examined what
happens when the SWBD training material is augmented
with various amounts of WSJ training
material. If SWBD simply lacked phrasal patterns found in WSJ, then we
would expect to find a significant improvement once
that missing information is added.

Figure \ref{ratiorecovery} shows what happens when various
randomly selected amounts of WSJ training material are
added to the SWBD material.  There were $120k$ tokens (words or
punctuation marks) in the SWBD material.  The total WSJ training set
consisted of $211k$ tokens.

Here, the column marked
`level' shows the number  of WSJ tokens added (in thousands).  So,
a level of $0$ indicates that the training material is entirely SWBD, whilst a level of $211$ indicates that the training material
consists of all of the SWBD and WSJ material. As can be seen, with
only a small amount of WSJ  material, performance significantly
increases. Also, when using all material, we see that performance is
very close to the performance obtained when using just the 
uncorrupted WSJ material (as shown in Figure
\ref{white}).  
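
To make the size of this early gain concrete, the following arithmetic uses
the Vote (MBL) column of Figure \ref{ratiorecovery}.  The choice of column and
the notation $F_{\ell}$ (the F-score at level $\ell$) are ours, and are
introduced for illustration only:
\begin{eqnarray*}
F_{21} - F_{0} & = & 88.24 - 84.66 = 3.58, \\
F_{211} - F_{0} & = & 91.03 - 84.66 = 6.37, \\
(F_{21} - F_{0}) / (F_{211} - F_{0}) & \approx & 0.56 .
\end{eqnarray*}
That is, for this parser, roughly the first tenth of the WSJ material ($21k$
of $211k$ tokens) accounts for over half of the total improvement.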

Now, these results may suggest that SWBD simply
lacks phrases found in WSJ, and that even a
moderate amount of material containing such information is better than
nothing. 
  However, it is also
possible that performance may simply be increasing due to more
accurate statistics.  To deal with this issue, we also used, in
addition to section $2$,  section
$3$ of SWBD.  Figure \ref{swbdres2} shows our results.
The training set roughly doubled in size, and 
now consisted of $61{,}515$ SWBD  sentences.


\begin{figure}[!thb]
\begin{center}
\begin{tabular}{c|cccccc}
Level & Max & MBL & TNT & Vote (Max) & Vote (MBL) & Vote (TNT) \\ \hline
0   &   82.31 & 82.83 & 69.99 & 84.15 &  84.66 &  83.89 \\
21  & 86.80 & 86.14 &  77.60 & 87.84  & 88.24 & 87.81 \\
42   & 87.91 & 87.45 & 79.46 & 88.80 &  89.18 & 88.87 \\
63   & 89.01 & 88.07 & 80.49 & 89.80 &  90.07 &  89.79 \\
84   & 89.27 & 88.29 & 81.11 & 89.96 &  90.08 &  89.89 \\
105   & 89.76 & 88.80 & 81.73 & 90.27 &  90.31 & 90.23 \\
126   & 90.19 & 89.02 & 82.02 & 90.65 &  90.71 & 90.61 \\
147   & 90.17 & 89.00 & 82.92 & 90.68 &  90.68 & 90.56 \\
168   & 90.55 & 89.39 & 82.91 & 91.01 &  91.05 &  90.92 \\
189   & 90.72 & 89.43 & 83.41 & 91.03 &  91.02 & 90.93 \\
211  & 90.84 & 89.63 & 83.65 & 91.03 &  91.03 & 90.92
\end{tabular}
\caption{Results when training upon SWBD and varying
  randomly sampled quantities of {\it WSJ} \label{ratiorecovery}}
\end{center}
\end{figure}


\begin{figure}[!htb]
\begin{center}
\begin{tabular}{cc c c}
Parser & F-score & Parser & F-score \\ 
 Max & 83.35 & Vote (Max) & 84.16 \\
TNT & 71.37  & Vote (TNT) & 84.98 \\
MBL & 83.39  & Vote (MBL) & 85.16 \\
Baseline & 63.71 & - & - \\
\end{tabular}
\end{center}
\caption{Results when training upon double the quantity of SWBD material\label{swbdres2}}
\end{figure}

As can be seen (by comparing Figures \ref{swbdres2} and \ref{swbres1}), performance increases only marginally, and so the
gains shown in Figure \ref{ratiorecovery} cannot be attributed simply to more
accurate estimation of models as a result of training upon more
material. 
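
The contrast can be made explicit with a small worked comparison.  As an
assumption made only for this sketch, we take the level $0$ row of Figure
\ref{ratiorecovery} as the reference point for training upon the original
SWBD material alone.  Each entry below gives the change in F-score relative
to that row, using Figure \ref{swbdres2} for the doubled SWBD set and the
level $21$ row of Figure \ref{ratiorecovery} for the added WSJ material:
\begin{center}
\begin{tabular}{lcc}
Parser & Doubling SWBD & Adding $21k$ WSJ tokens \\
Max & $83.35 - 82.31 = 1.04$ & $86.80 - 82.31 = 4.49$ \\
MBL & $83.39 - 82.83 = 0.56$ & $86.14 - 82.83 = 3.31$ \\
Vote (MBL) & $85.16 - 84.66 = 0.50$ & $88.24 - 84.66 = 3.58$
\end{tabular}
\end{center}
Under that assumption, roughly doubling the SWBD material yields about one
point or less for these parsers, whereas a far smaller quantity of WSJ
material yields between three and four and a half points.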


An error analysis of the mistakes made when training just on SWBD
material (using the best ensemble parser), and not made when training
upon WSJ material,  found that:
\begin{center}
\begin{itemize}
\item Noun phrases in the WSJ are predicted as being too short. For example, we see:
\begin{center}
\begin{tabular}{cccccccc}
Rockwell & International & Corp. & 's & Tulsa & unit & said & \ldots\\ 
         & I-NP / B-NP              &   I-NP / B-NP    &    &       &      &  
\end{tabular}
\end{center}
Here, each entry gives the correct label followed by the predicted label.
The words {\it International} and {\it Corp.}\ are labelled as
being the start of a base noun phrase, when in reality they should both
be within the same base noun phrase.
\item Base noun phrases with non-nominal premodifiers are seen as
being phrases other than base noun phrases:
\begin{center}
\begin{tabular}{ccccccccc}
\ldots & it & had & operating & profit & of & \$ & 10 & million \\
       &    &     &  B-NP / I-VP         &  I-NP / B-NP      &    &   &    &      
\end{tabular}
\end{center}
\item There was little evidence of `strange' parsing decisions, as one
would expect if disfluencies were responsible for parsing errors.
\end{itemize}
\end{center}

In conclusion then, we believe  that the disfluencies in SWBD, whilst
still a factor, are less harmful to performance than the
change of domain (speech to edited text).  Put another way,
even if SWBD were edited and all disfluencies were
removed, we would not expect training upon it to yield a
significant performance increase.  When the task is shallow parsing
newswire material, it is better to annotate more newswire material
than to train upon material drawn from a significantly different
distribution. \cite{gild01} made a related point when training full-scale
parsers using the Brown and WSJ parsed treebanks.

\section{Comments \label{comments}}
We mentioned in the introduction that shallow parsing could in
principle  be
undermined by either noise in the training set, or else by
distributional differences between the training set and the testing
set. The first set of results showed that the underlying distributions
are such that shallow parsing is inherently noise-tolerant, and that
only large quantities of noise will significantly undermine
performance. However, should one wish to improve performance when
noise is present, then simple extensions of our basic parsers help.
  Our experiments
showed that different kinds of noise favoured different parsers, and
that in general, no single technique emerged as being best in all
situations.  When dealing with a new problem
(with unknown noise), the best strategy is probably to consider a
range of techniques, optimise them using a held-out set, and then
decide which one to use on the basis of results.  The alternative, of
taking a favourite approach and then hoping that it will perform well,
is almost certain to lead to suboptimal results.

We do not wish to comment upon the individual parsers (for example,
arguing that memory-based learning is better than maximum entropy).
The reason for abstaining follows from our results, which tell us
that even if some parser appeared to do well in many situations, we
should not conclude that it would do well in some future  situation.
However, we were pleased with the performance of memory-based
learning, and suggest that it might be well suited  for domains such as
discourse parsing, which is known to be noisy.

The experiments with SWBD suggested that a mismatch in
distribution between examples in the training set and the testing set
undermined performance more than did disfluencies.  Furthermore, we
concluded that should one want to improve upon the performance of
shallow parsing, it is better to annotate more examples, drawn from
the target distribution (which in this case is the WSJ) than to train upon additional material drawn from other
distributions. Language is non-stationary, and pretending otherwise is
ill-advised.  

Drawing together the results from WSJ and SWBD, and with the caveat
that the noise models we used are artificial and so not necessarily
indicative of noise found in practice, one might conclude
that naturally occurring noise is not that much of an issue, and that instead, making sure
that the training and testing sets are both drawn from the same
distribution is the key factor.  With current treebanks, this is
almost certainly true.  However, should one want to greatly increase
the amount of annotated material in some industrial manner, then
undoubtedly
 such newly created material would become noisier.  Under such
circumstances, and again given the caveat regarding the relationship
between our noise models and naturally occurring noise, the techniques mentioned in this paper would become relevant.


In this paper, we only considered shallow parsing using base
phrases. Further work should consider what happens when parsing is
deeper. We conjecture that the distributions associated with deep
parsing would continue to be sharply peaked.  Consequently, we would
expect similar noise-performance relationships to those reported here.

\acks{We would like to thank Maria Wolters and Mark Core for
  discussions about disfluencies,  the authors of the various
  systems used in this work, and the three anonymous reviewers for
  very helpful comments.}
%\bibliographystyle{plain}

\bibliography{/home/osborne/docs/bib/mdl,/home/osborne/docs/bib/references,/home/osborne/docs/bib/rf,/home/osborne/docs/bib/noise,/home/osborne/docs/bib/np}

%\bibliography{/home/miles/docs/bib/mdl,/home/miles/docs/bib/references,/home/miles/docs/bib/rf,/home/miles/docs/bib/noise,/home/miles/docs/bib/np}
\end{document}