\documentclass[twoside,11pt,letterpaper]{article}
\usepackage{hammerton02a}

\jmlrheading{2}{2002}{551-558}{3/02}{3/02}{James Hammerton, Miles
Osborne, Susan Armstrong and Walter Daelemans} 

\ShortHeadings{Introduction}{Hammerton, Osborne, Armstrong and Daelemans}
\firstpageno{551}

\begin{document}

\title{Introduction to Special Issue on Machine Learning Approaches to Shallow Parsing}
\author{James Hammerton \email j.hammerton@let.rug.nl \\
\addr 
Alfa-Informatica \\
University of Groningen \\
The Netherlands \\
\AND
Miles Osborne \email osborne@cogsci.ed.ac.uk \\
\addr 
Division of Informatics \\
University of Edinburgh \\
Scotland \\
\AND
Susan Armstrong \email susan.armstrong@issco.unige.ch \\
\addr 
ISSCO/ETI \\
University of Geneva \\
Switzerland
\AND
Walter Daelemans \email walter.daelemans@uia.ua.ac.be \\
\addr 
Center for Dutch Language and Speech \\
University of Antwerp \\
Belgium
}
%\editor{James Hammerton, Miles Osborne, Susan Armstrong and Walter Daelemans}
%\date{}
%\bibliographystyle{plain}

\maketitle


\begin{abstract}%
This article introduces the problem of partial or shallow parsing
(assigning partial syntactic structure to sentences) and explains why
it is an important natural language processing (NLP) task. The
complexity of the task makes Machine Learning an attractive option in
comparison to the handcrafting of rules. On the other hand, because of the
same task complexity, shallow parsing makes an excellent benchmark
problem for evaluating machine learning algorithms. We sketch the
origins of shallow parsing as a specific task for machine learning of
language, and introduce the articles accepted for this special issue,
a representative sample of current research in this area. Finally,
future directions for machine learning of shallow parsing are
suggested. 
\end{abstract}

\section{Introduction}
In full parsing, a grammar and search strategy are used to assign a
complete syntactic structure to sentences. The main problem here is to
select the most plausible syntactic analysis given the often thousands
of possible analyses a typical parser with a sophisticated grammar may
return. Stochastic approaches can be used to order the analyses
according to their probability or to generate the most probable
parse(s) only. See~\cite{Jurafsky+2000} for an introduction to
traditional and stochastic approaches to parsing.

However, not all natural language processing (NLP) applications require
a complete syntactic analysis. A full parse often
provides more information than needed and sometimes less.  For example, in
Information Retrieval, it may be enough to find simple NPs (Noun
Phrases) and VPs (Verb Phrases). In Information Extraction, Summary
Generation, and Question Answering, we are interested especially in
information about specific syntactico-semantic relations such as
agent, object, location, time, etc.\ (basically, who did what to whom,
when, where and why), rather than elaborate configurational syntactic
analyses.

Partial or shallow parsing, the task of recovering only a limited
amount of syntactic information from natural language sentences, has
proved to be a useful technology for written and spoken language
domains.  For example, within the Verbmobil project, shallow parsers
were used to add robustness to a large speech-to-speech translation
system \citep{wahl00}.  Shallow parsers are also typically used to
reduce the search space for full-blown, `deep' parsers \citep{Coll96}.
Yet another application of shallow parsing is question-answering on
the World Wide Web, where there is a need to efficiently process large
quantities of (potentially) ill-formed documents
\citep{buch,Srihari+99}. More generally, shallow parsing is relevant
to text mining applications of all kinds, e.g.\ in
biology~\citep{Sekimizu+98}.


\cite{Abney91} is credited with being the first to argue for the
relevance of shallow parsing, both from the point of view of
psycholinguistic evidence and from the point of view of practical
applications. His own approach used hand-crafted cascaded Finite State
Transducers to get at a shallow parse.  

Typical modules within a shallow parser architecture include the
following:

\begin{enumerate}
\item
Part-of-Speech Tagging. Given a word and its context, decide what the
correct morphosyntactic class of that word is (noun, verb, etc.).  POS
tagging is a well-understood problem in NLP~\citep{Halteren99}, to
which machine learning approaches are routinely applied.
\item
Chunking. Given the words and their morphosyntactic class, decide
which words can be grouped as chunks (noun phrases, verb phrases,
complete clauses, etc.)
\item
Relation Finding. Given the chunks in a sentence, decide which
relations they have with the main verb (subject, object, location,
etc.)
\end{enumerate}

Because shallow parsers have to deal with natural languages
in their entirety, they are large, and frequently contain thousands of
rules (or rule analogues). For example, a rule might state that
determiners (words such as {\it the}) are good predictors of noun
phrases.  These rule sets also tend to be largely `soft', in that
exceptions abound. Continuing with our example, in the phrase:

\begin{quote}
\ldots fatalities on non-interstate  roads were about the same
\end{quote}
the word {\it the} is instead within the adjectival phrase {\it were
  about the same}. This example was taken from the Parsed Wall Street
  Journal \citep{Marc93}.

Building shallow parsers by hand is therefore a labour-intensive
task. Unsurprisingly, shallow parsers are usually built automatically,
using techniques originating within the machine learning (or
statistical) community.

The work by~\cite{Ramshaw+95} proved to be an important source of
inspiration for this work. By formulating the task of NP-chunking as a
tagging task, a large number of machine learning techniques suddenly
became available to solve the problem. In this approach, each word is
associated with one of three tags: I (for a word inside an NP), O
(for outside of an NP), and B (for between the end of one and the
start of another NP). The classification task can easily be extended
to other types of chunks and with some effort even to finding
relations~\citep{Buchholz+99}. For an extension of an HMM approach from
tagging to chunking, see~\cite{Skut+98}.
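
As a small illustration of this encoding (the sentence is a
hypothetical example of our own and is not taken from the cited work),
consider a sentence containing the NPs {\it the dog}, {\it the boy}
and {\it a bone}, with each word followed by its tag:
\begin{quote}
The/I dog/I gave/O the/I boy/I a/B bone/I
\end{quote}
The word {\it a} receives the tag B because it starts an NP directly
after the end of another NP; every other word inside an NP is simply
tagged I.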

Readers 
% with a background in machine learning 
are encouraged to visit the {\em Computational Natural Language
Learning} (CoNLL) shared task websites:\footnote{CoNLL is the yearly
conference of SIGNLL, the Special Interest Group of the Association
for Computational Linguistics on Machine Learning of Language; {\tt
http://www.aclweb.org/signll}.}
\begin{quote}
http://lcg-www.uia.ac.be/conll2000/chunking/
\end{quote}
and:
\begin{quote}
http://lcg-www.uia.ac.be/conll2001/clauses/
\end{quote}
for background reading, datasets and results of more than 20 shallow
parsing systems.  

Applying learning techniques is, however, not
necessarily straightforward:

\begin{itemize}

\item The amount of data to be processed will push batch systems to
the limit.  This means that learners will need to scale.

\item Labelled training material is frequently noisy and only exists
in relatively small quantities.  Here, `small' is with respect to a
language as a whole. Any learner must therefore deal with overfitting.

\item Real-world sentences tend to be long.  Learners which do not
operate in (near) linear time are simply unfit for the task.
\end{itemize}

Shallow parsing, like much of natural language processing, is
therefore a challenging domain for machine learning
research. 

Note that shallow parsing does not refer to a single technique.
Instead, it is better to consider it to refer to a family of related
methods, all of which attempt to recover some syntactic information,
at the possible expense of ignoring all other such information.

%This special issue presents a number of papers dealing
%This special issue arose from discussions amongst the participants in
%the Learning Computational Grammars (LCG) project\footnote{All the
%editors have been involved with the LCG project. The homepage for the
%project is at http://lcg-www.uia.ac.be/} about producing a collection
%of papers that represent the state of the art in the application of
%machine learning to natural language processing tasks. The topic of
%machine learning of shallow parsing was chosen because there has been
%considerable interest in this area in recent years and because various
%shallow parsing tasks, such as noun-phrase identification and
%chunking, also formed the main focus of the work of the LCG project
%participants.
%The papers presented in this special issue are broadly representative
%of  the (current) state
%of the art in   machine learning approaches to shallow parsing.



%\section{Machine Learning and Shallow Parsing}
%\label{MLSP}

%Shallow parsing involves extracting key pieces of syntactic or
%semantic information from text, and in recent years many machine
%learning methods have been applied to shallow parsing of large
%amounts of real world data. 


%Thus applying machine learning techniques to shallow parsing tasks
%will provide valuable insight into the real-world performance of a
%machine learning system. 


%\section{Earlier work on machine learning of shallow parsing}

%TO BE WRITTEN!

%\section{The Special Issue}

%This section discusses the papers from the special issue itself. These
%papers employ a wide variety of approaches in shallow parsing and look
%at issues such as how best to combine individual learners in ensemble
%systems and how well shallow parsers deal with noise. As such we
%believe they represent the state of the art in machine learning of
%shallow parsing, collectively illustrating the diversity of approaches
%as well as highlighting the issues involved in the learning of shallow
%parsing.

\section{Overview of Papers}
Here we briefly summarise the papers in this issue.
\subsection{Memory Based Shallow Parsing \label{mbl}}

\cite{tks02} considered the issues involved with applying memory-based
learning (MBL) to shallow parsing.  MBL consistently performs well for
a variety of shallow parsing tasks, often yielding (near) best
results \citep{Daelemans+99,Buchholz+99}.  From this, one might conclude
that MBL is a promising learning technique for extending
shallow parsing to {\it full} parsing. On full parsing, however, MBL fared
less well: its results were not as good as those of the other parsers
with which it was compared.
%As Tjong Kim Sang notes, this was most probably due to the way in
%which the MBL parser processes the sentences to find recursive
%structure -- it finds base phrases first, then it finds the phrases
%for the next level up in the tree and so on until all the phrases have
%been found. Since deeper structures are less common than shallower
%structures, MBL had less and less training data for each additional
%level of embedding.  
This does not mean that MBL is fundamentally unsuited for full-blown
parsing. Instead, it suggests that the task needs to be encoded in
some other manner.

The paper also identified a weakness of MBL: it can have difficulty
handling large numbers of features.  A feature selection
method, namely bidirectional hill climbing~\citep{caruana94}, was found to
yield insignificant gains in performance for NP parsing. However, it
did produce a significant improvement for clause identification.

Tjong Kim Sang also showed how ensemble learning techniques such as
(weighted) majority voting and stacking could improve performance.
All system combination methods improved on the results of the
individual MBL classifiers, and the best results were obtained by
employing MBL itself as a stacked classifier.
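
As a rough sketch of weighted voting (the notation is ours and is
assumed for illustration only), suppose $K$ classifiers $h_1, \ldots,
h_K$ each propose a label for a word $x$, and classifier $h_k$ is given
a weight $w_k \geq 0$. The combined label is then
\[
\hat{y}(x) = \arg\max_{c} \sum_{k=1}^{K} w_k \, \delta(h_k(x), c),
\]
where $\delta(a, b)$ is $1$ if $a = b$ and $0$ otherwise. Setting every
$w_k$ to $1$ gives simple majority voting. How the weights are
estimated varies between methods and is left unspecified here.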


%The work presented in this paper illustrates the importance of the
%themes mentioned in Section~\ref{MLSP}. Careful feature selection is
%important for good performance with MBL. %Give MBL too many features
%and performance can be degraded significantly, as the author demonstrates
%directly with the simple example of learning the exclusive-or
%problem. With no extra features learning is perfect. With only 4 extra
%random features performance is significantly degraded and 10 extra
%random features yields a performance not much better than guessing.
%Good selection of features can improve the performance as demonstrated
%by the results in the clause identification task. However for
%NP-chunking only a small improvement was observed suggesting the
%original set of features did not contain much irrelevant information.
%A conclusion of the paper is that good feature selection is important
%for successful application of MBL to shallow parsing.  %Related to this
%issue 
%
%The importance of the representation of the task is also
%highlighted. Not only does the variation in performance exhibited with
%%the different input representations and the feature selection
%illustrate this but also more subtly the relatively poor performance
%of MBL on full parsing. Using MBL for language processing tasks
%involves processing the sentences on a word-by-word basis and using a
%window of the current word, plus the next/previous N words, where N is
%usually $<$ 4. Thus MBL only takes into account the local context for
%each word. Thus when processing recursive structure, MBL has to
%perform multiple passes over each sentence in order to extract each
%level of structure. The inability to directly take into account more
%global information forces this form of processing and may also
%contribute to the relatively poor performance of MBL on full
%parsing. Nevertheless MBL is generally a strong performer for shallow
%parsing tasks, showing that a lot can be achieved using a simple
%classifier that employs similarity-based matching.

\subsection{Shallow Parsing using Specialized HMM}

\cite{molpla} presented a shallow parser based on Hidden Markov Models
(HMMs).  HMMs are routinely used in speech recognition and
part-of-speech tagging (POS tagging). Here, the HMM was used to find the
most probable sequence of output shallow parsing labels for the current
sequence of inputs. Unlike the MBL approach to shallow
parsing described above (which is classification-based), this approach is
generative.  Their generative model enabled information about the whole
sentence to be taken into account when determining the output shallow
parsing label for each word, since it is the probability of the whole
sequence of output tags occurring given the current input that is
maximised (and not just the probability of individual decisions). The
authors' HMMs are applied to a variety of shallow parsing tasks. 
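
In generic terms (a first-order sketch in our own notation; the
specialised models of the authors may differ in their order and in the
input and output symbols used), for inputs $w_1 \ldots w_n$ the chosen
tag sequence is
\begin{eqnarray*}
\hat{t}_1 \ldots \hat{t}_n & = & \arg\max_{t_1 \ldots t_n}
   P(t_1 \ldots t_n \mid w_1 \ldots w_n) \\
 & = & \arg\max_{t_1 \ldots t_n}
   \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i),
\end{eqnarray*}
where $t_0$ is a sentence-start marker. The second line follows from
Bayes' rule, from dropping the constant factor $P(w_1 \ldots w_n)$, and
from the usual HMM independence assumptions. The maximisation can be
carried out efficiently with the Viterbi algorithm.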

%With the chunking task, as if to highlight the
%importance of both feature selection and how the task is represented,
%Molina and Pla first train their HMMs using just POS tags as input and
%the original output tags as output. Then they train them using
%different selections of input features and modified output tags. The
%input features in each case consisted of the POS tags plus certain
%selected words (the criterion of selection being varied) whilst the
%modified output tags would consist of the original output tag
%concatenated with the POS tag and the current word if it was one of
%the selected words. 

Various ways of encoding the task were shown to produce different
results. Clearly, this suggests that feature specification is an
important issue.  Interestingly enough, the authors, whilst not using
ensemble learning methods, produced results comparable with systems
which did use such techniques. Here, the obvious comparison is with
the MBL paper mentioned in Section~\ref{mbl}. An interesting
possibility here is that their generative model (which allows previous
decisions to directly influence future decisions) emulates the ability
of ensemble learners to correct for classifiers which do not take
previous decisions into account.

%The performance on a development set ranged from an fscore of 84.34
%for the basic features and output tags to an fscore of 92.23 for basic
%%features enhanced with words selected by a combination of look at
%those whose error rate exceeded a threshold in the development set and
%words that belong to certain chunks with a high frequency in the
%training set and where the output tags consisted of the original tags
%%concatenated with the POS tags and the current word if it was
%selected. 

%This approach is also applied to the task of clause-identification,
%where the best results obtained placed the HMMs second in the CoNLL
%2001 shared task, with an fscore of 68.12 (it should be noted a HMM
%system with ``basic'' input and output tags was placed 4th with an
%f-score of 66.79).

%The HMMs are obviously strong performers for shallow parsing, but a
%question is begged as to whether if the other systems had modified the
%output tags involved in a similar manner they might not also get
%better results. At any rate, Molina and Pla's paper highlight the
%importance of task representation and feature selection, since their
%system, which does not employ system combination, obtains fscores
%comparable to those produced by the systems in the CoNLL shared tasks,
%that did employ system combination, such as MBL, largely due to careful
%selection of input features and careful modification of the output
%tags. 

\subsection{Text Chunking Based on a Generalization of Winnow}

\cite{zhang02} presented a generalised version of the Winnow
algorithm.  They observed that the original Winnow algorithm is only
guaranteed to converge on linearly separable data. So, given the
possibility that features for shallow parsing are not linearly
separable, the authors modified Winnow such that it would converge,
even for non-linearly separable features. They also showed that both
versions of Winnow were robust to irrelevant features.
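
For orientation, one common formulation of the basic (unmodified)
Winnow update is sketched here; the notation is ours, and the authors'
generalisation is not reproduced. With positive weights $w_j$, a
non-negative feature vector $x$, a threshold $\theta$ and a label
$y \in \{-1, +1\}$, the prediction is $+1$ if
$\sum_j w_j x_j \geq \theta$ and $-1$ otherwise. When the prediction is
wrong, each weight is updated multiplicatively,
\[
w_j \leftarrow w_j \exp(\eta \, y \, x_j),
\]
with learning rate $\eta > 0$, so that the weights of active features
grow after a missed positive example and shrink after a false
positive.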

The authors used a very large set of features, including those derived
from sources other than the training set.
%When applied to the chunking task, Winnow processes each sentence word
%by word, using a sliding window consisting of the current word and
%two  words either side of it. The input features consisted of the POS tags
%and words within the current window, the pairwise interactions between
%the POS tags, pairwise interactions between POS tags and words,
%predicted chunk tags for words before the current word, pairwise
%interactions between the predicted chunk tags and pairwise
%interactions between the chunk/POS tags. Additionally, the authors
%used extra linguistic information as features, derived from parsing
%using English Slot Grammar (ESG), to see if it improved
%performance. The syntactic roles assigned to each word by the ESG were
%used as features along with pairwise interactions between them and
%between them and the POS tags. Each feature was represented using
%orthogonal vectors and hash tables were used to minimise memory usage
%by the resulting large input vectors. 
%
%After some experiments to determine the best parameters to use for
%training, the authors test the chunker on the CoNLL 2000 shared task
%data, obtaining the following overall fscores on the test set: 
%
%\begin{itemize}
%\item 92.22 using first order features only (i.e. without the pairwise
%interactions). The original Winnow algorithm only gets 89.49 using
%these features.
%\item 93.57 with basic features.  
%\item 94.17 with enhanced features
%\end{itemize}
Winnow was found to be  a strong performer for this
task, giving the best results reported for a non-ensemble classifier
in the CoNLL 2000 shared task. Clearly, the ability to exploit very
large numbers of (potentially irrelevant) features is a crucial
component of a successful shallow parsing system.

%As with the papers already discussed, the importance of feature
%selection is illustrated here with higher order features and enhanced
%linguistic features both boosting performance significantly. Again,
%the question is begged as to whether the other algorithms that were
%used in the CoNLL 2000 shared task might also benefit from the use of
%higher order and extra linguistic features. 

%It is interesting to contrast MBL and Winnow here, in terms of feature
%selection and task representation -- both use features from a sliding
%window as input and use similarity based judgements and both have been
%applied successfully to shallow parsing tasks through careful feature
%selection. Winnow's main difference with MBL in this respect is that
%it can learn which features are important and is robust to irrelevant
%features.

\subsection{Shallow Parsing With PoS Taggers and Linguistic Knowledge}

\cite{megyesi} retrained three POS taggers for shallow parsing. Unlike the
other papers, she dealt with shallow parsing for Swedish, and not English.

%\begin{itemize}
%\item FNTBL\citep{fntbl}, an implementation transformation based
%learning\citep{tbl}, an error-driven rule-based learner.
%\item MXPOST\citep{mxpost} which is based on a maximum entropy model where
%contextual information is represented as binary features.
%\item TNT\citep{tnt}, which is based on Hidden Markov Models, using the Viterbi
%%alogrithm and beam search. The states represent tags and the
%transition probabilities depend on pairs of tags.
%\end{itemize}

%They are applied to the following tasks: learning to produce both POS
%tags and phrase labels given words as input, learning to produce
%phrase labels given words as input, learning to produce phrase labels
%given words and POS tags as input and learning to produce phrase
%labels given only POS tags as input. In each of these tasks the phrase
%labels indicate all the phrases the current word is in, in contrast to
%the CoNLL tasks which only identify one type of phrase for the word to
%be in.
%
%\noindent The phrases to be detected are:%
%
%\begin{itemize}
%\item adverb phrases, 
%\item minimal adjectival phrases consisting of adjectives and possible
%modifiers, %
%
%\item adjectival phrases consisting of more than one minimal
%adjectival phrase with a delimiter or conjunction in between them, 
%
%\item noun phrases that include the head noun and modifiers to its
%left, 
%\item prepositional phrases, 
%\item noun phrases that include a prepositional phrase as modifier, 
%\item verb clusters consisting of a continuous verb group belonging to
%the same verb phrase without other intervening phrases, 
%\item infinitive phrase including an infinitive verb together with
%infinitive particle and which may contain adverb phrases and/or verbal
%particles, and
%\item numeral expressions. 
%\end{itemize}
%
%\noindent The corpus used was second version of the Stockholm-Umea corpus
%\citep{stock-umea} annotated with PAROLE tags and parsed with the SPARK
%parser and a context-free grammar developed by Megyesi herself.
%
%Initially, The taggers were trained on 200K words from the corpus and
%tested on a test set of 117,536 words. The best performance, in terms
%of classification accuracy, was achieved by FNTBL (94.84\%) when using
%POS tags only as input to learn phrase labels. All taggers performed
%at their best on this task. Using lexical information as input gave
%worse performance. However when lexical information was used, MXPOST
%performed better than the other taggers.  This is most probably due to
%the reduced number of types to learn and the reduced number of unknown
%tokens.

Experimental results showed that, again, when using POS taggers as the basis
of shallow parsers, careful
consideration needs to be given to how the task is to be encoded
(choice of features).  Unlike other studies, the author found that
ignoring lexical information improved performance for all her
systems. It is unclear whether this is due to linguistic differences
between English and Swedish, or else due to the fact that some of her
POS taggers were built with English in mind.

The shallow parsers were then trained on varying amounts of training data for
each task. Unsurprisingly, performance improved with the amount of
training data in each case.  However, no single shallow parser was
uniformly superior to the others.

 %Where lexical information is used as
%input, TNT achieved the better results on small data sets whilst
%MXPOST gave best performance on the larger data sets. However where
%only POS information is used as input, FNTBL performed better. 
%
%Ideally the f-score, or precision and recall values would have been
%used to measure performance, however the number of classes each tagger
%has to learn to distinguish between varied from 400 to 3100 depending
%on the task and amount of training data used making it impractical to
%give precision and recall values.  
%
%These results again illustrate the importance of feature selection and
%task representation in machine learning and shallow parsing, this time
%in a somewhat different domain, namely shallow parsing of Swedish
%rather than English text. The general observation that the best
%performance was achieved using POS tags only as input illustrates that
%the taggers used operate at their best with a small number of input
%types and output classes and low levels of unknown tokens. The
%differences in performance between the taggers was attributed in part
%to different window sizes used in each case and different strategies
%for analysing morphology when unknown words were encountered. 
%
%An interesting question is whether, if the words were represented in a
%manner that encodes information about how the words are used in the
%corpus e.g. vectors encoding the co-occurrence statistics of the words
%as in \citep{zavrel}, the same results regarding the use of lexical
%information would be observed.

\subsection{Learning Rules and their exceptions}

\cite{dejean02} presented a top-down rule induction system, called
ALLiS, for learning linguistic structures. The initial system is
enhanced with additional mechanisms to deal with noisy data. The
author identifies two types of difficulties -- significant noise in
the data and the presence of linguistically motivated
exceptions. Since linguistically motivated exceptions occur, they
cannot be treated as noise. To address these problems, two mechanisms
are introduced. The first is a refinement algorithm that learns
exceptions for each rule that is learned. The second incorporates
linguistically motivated prior knowledge to improve the efficiency and
accuracy of the system.

The experimental results clearly demonstrate significant improvement
with the introduction of the two mechanisms. The refinement mechanism is
based on the assumption that there is some regularity to the errors in
the data and thus, by systematically searching for exceptions, the rule
induction system is improved. With the use of prior knowledge, the
context of only one element need be taken into account and the search
space is reduced, resulting in a significant reduction in learning time. In
comparison with the well-known transformation-based learning (TBL)
system of~\cite{tbl}, ALLiS needs fewer rules and overcomes a number of
classification errors produced by TBL.

The incorporation of linguistically motivated prior knowledge in a
learning-based system is an interesting addition, and as pointed out in
the paper, the question arises whether such background information would
be useful in other systems. In any case, it is clear that additional
mechanisms are necessary to deal with the noise and exceptions present
in natural language data for tasks such as shallow parsing.


\subsection{Shallow Parsing using Noisy and Non-Stationary Training Material}

\cite{osborne} considered an  issue that has gone largely unaddressed in
the shallow parsing literature, namely what happens when the training
set is either noisy, or else drawn from a different distribution to
the testing material.

This paper took a range of shallow parsers (including both single
model parsers and ensemble parsers) and trained them using various types
of artificially noisy material. 
%All the shallow parsers were 
%robust, with performance degrading only gradually as the noise rates
%increase. No single parser performs best in all situations and various
%parser specific extensions are shown to improve the results.
In a second set of experiments, the issue of whether naturally
occurring disfluencies have more impact on performance than a change
in the distribution of the training material was investigated. It was
found that the changes in the distribution are more important.

The author drew various conclusions from this work. Shallow parsers
are robust and only large quantities of noise will significantly
impair performance. Should one wish to improve performance, then simple
parser-specific extensions can help. No single technique worked best
with all types of noise, with different kinds of noise favouring
different parsers. 
%The author suggests in the light of this that when
%dealing with a new problem with unknown noise the best strategy will
%probably be to select a range of techniques, optimise them on the
%basis of a held out set and decide which to use on the basis of the
%results. 
Regarding the results on changes in the distribution of training data,
the clear lesson is that if one wishes to improve the performance of
shallow parsers on a particular task, it is better to annotate more
examples from the target distribution than to use additional training
material from other distributions.

%The findings of this paper are of obvious interest to the shallow
%parsing community in demonstrating that noisy material can be handled
%well by commonly used shallow parsing techniques and demonstrating
%that the distributions of the training and testing data are more
%important in determining performance than naturally occurring noise
%levels.

One surprise in this paper is that the parsers employing system
combination, although generally the best performers in the literature,
were not always the best at dealing with noise. Clearly, ensemble
learning is not always a sure-fire strategy.

%When noise was added
%in the form of adding pauses to the sentences being processed, the
%best performer in dealing with the noise was MBL, despite the ensemble
%parsers performing best without any noise. In order for ensemble
%learners to improve performance over the individual techniques
%employed, the errors made by each individual have to be uncorrelated,
%and the author suggests that the errors made by each technique in this
%case are correlated.

\section{Conclusions}

%Various themes emerged when looking over
%these recent papers on learning shallow parsing:
In summary, a few points can be made:
\begin{itemize}
\item Feature selection, as in machine learning in general, is an
important consideration for machine learning of shallow parsers. Some
learning approaches only work well when the features have been
carefully selected and weighted, whilst others can cope with large
numbers of irrelevant features. The Winnow and MBL papers both clearly
illustrated these considerations.
\item A recent trend in the literature is for performance to be
potentially improved by training several classifiers on the task and
combining their results to produce a final result. This can be done in
various ways such as using various (weighted) voting methods and using
stacked classifiers. This however is not guaranteed to produce the
best results as Osborne's paper above illustrates.
\item The majority of the systems are probabilistic, with the obvious
exception of MBL. Few shallow parsers reported in the literature are,
for example, based upon Inductive Logic Programming or neural
networks. It seems that the reason for this is the need for
scalability.
\item All parsers assumed labelled input.  Clearly this limits
performance, as only a small amount of labelled training material
exists.  \cite{zhang02} did use other knowledge sources, in addition
to the training set.
\item Shallow parsers are noise-tolerant, and only massive quantities
of noise will significantly undermine performance.
\item Not all shallow parsers used generative models (as might be
expected from the nature of the task).  Discriminative models (those
which attempt to maximise the difference between alternative labels,
but not necessarily model the distribution of annotated sentences) are
also employed.  However, the exact link between these two classes of
models has yet to be demonstrated.
% comment from miles:  the next item seems to be subsumed by feature
% selection. 
%
%Generally speaking
%this has been found to improve performance, but it does require the
%errors made by the individual classifiers to be uncorrelated in order
%to work.
%\item {\bf Representation of the task}. How the task is presented to a
%machine learning system is also crucial as it determines whether for
%example the algorithm uses only local context (e.g. the current and
%previous/next N words) or information from a whole sentence in trying
%to learn the task. Also the input and output representations employed
%will also impact on performance. E.g. using non-orthogonal versus
%orthogonal vectors as inputs/outputs for a neural network can make a
%big difference to performance since with orthogonal vectors there is
%no interference between different inputs. Such interference can both
%hinder learning (if similar inputs need to be treated differently by
%the network) or enhance it (if similar inputs need to be treated
%similarly).
\end{itemize}

Research in shallow parsing is clearly ongoing.  We hope that more
machine learning researchers will take up the gauntlet and include
shallow parsing as an additional, real-world domain with which to
evaluate machine learning systems.

\section*{Acknowledgements}
 The editors wish to thank the following reviewers for their valuable help in
 producing this special issue:
\begin{quotation}
\noindent Richard K. Belew, Thorsten Brants, Eric Brill, Mary Califf, Claire
Cardie, Rafael Carrasco, Alexander Clark, Steve Clark, Daniel Gildea,
Hans Van Halteren, Jamie Henderson, Colin De La Higuera, Yuval Krymolowski, Marshall Mayberry, Grace Ngai, Adwait Ratnaparkhi, Erik F. Tjong Kim Sang, Jimi Shanahan, Cindi Thompson, Chris Watkins, Ton Weijters and Maria Wolters.
\end{quotation}
All of the editors were involved, at some time or other, with the EU
TMR project {\it Learning Computational Grammars}. The homepage for the
project is at http://lcg-www.uia.ac.be/.  We wish to thank John
Nerbonne for leading the project, Erik F. Tjong Kim Sang for maintaining
the excellent shallow parsing website and finally the editorial staff
of the JMLR for supporting this special issue.
%{\bf jmlr people}.
\bibliography{hammerton02a}
\end{document}
