\documentclass[twoside,11pt]{article}

\usepackage{amsmath}
\usepackage{epsfig}

\usepackage{jmlr2e}


% Definitions of handy macros can go here

\newcommand{\dataset}{{\cal D}}
\newcommand{\fracpartial}[2]{\frac{\partial #1}{\partial  #2}}

% Heading arguments are {volume}{year}{pages}{submitted}{published}{authors}

\jmlrheading{2}{2002}{595-613}{9/01}{3/02}{Antonio Molina and Ferran Pla}
\ShortHeadings{Shallow Parsing using Specialized HMMs}{Molina and Pla}
\firstpageno{595}

%\begin{list}

\title{Shallow Parsing using Specialized HMMs}

\author{\name Antonio Molina \email amolina@dsic.upv.es\\
%\AND
\name Ferran Pla \email fpla@dsic.upv.es\\
\addr Departament de Sistemes Inform\`{a}tics i Computaci\'{o}\\
Universitat Polit\`{e}cnica de Val\`{e}ncia\\
Cam\'{\i} de Vera s/n, 46020 Val\`encia (Spain)}

\editor{James Hammerton, Miles Osborne, Susan Armstrong and Walter Daelemans}
\begin{document}
\maketitle

\begin{abstract}

We present a unified technique to solve different shallow parsing tasks as a
tagging problem using an approach based on Hidden Markov Models (HMMs). This technique consists of the
incorporation of the relevant information for each task into the models. To do
this, the training corpus is transformed to take into account this
information. In this way, no change is necessary for either the training or
tagging process, so it allows for the use of a standard HMM approach. Taking
into account this information, we
construct a Specialized HMM which gives more complete contextual models. 
We have tested our system on chunking and clause identification tasks using
different specialization criteria. The results obtained are in line with the
results reported for most of the relevant state-of-the-art
approaches.   
    

\end{abstract}

\begin{keywords}
Shallow Parsing, Text Chunking, Clause Identification, Statistical Language Modeling, Specialized HMMs.
\end{keywords}

\section{Introduction}

Shallow parsing has become an interesting alternative to full parsing.
The main goal of a shallow parser is to divide a text into segments which
correspond to certain syntactic units. Although the detailed information from
a full parse is lost, shallow parsing can be done on non-restricted texts in 
an efficient and reliable way. In addition, partial syntactical information
can help to solve many natural language processing tasks, such as information
extraction, text summarization, machine translation and spoken language 
understanding.

Shallow parsing involves several different tasks, such as text chunking, noun
phrase chunking or clause identification. {\it Text chunking} consists of
dividing an input text into non-overlapping segments. These segments are
non-recursive, that is, they cannot include other segments and  are usually
called {\it chunks} as  defined by \cite{abney91}. Noun phrase chunking ({\it
  NP chunking}) is a part of the text chunking task, which consists of
detecting only noun phrase chunks. 
The aim of the {\it Clause identification}
task is to detect the start and the end boundaries of each clause (sequence of
words that contains a subject and a predicate) in a sentence. 
For example, the
sentence {\it ``You will start to see shows where viewers program the
  program''} would be chunked as follows: 

\begin{itemize}
\item[] (NP You) (VP will start to see) (NP shows) (ADVP where) (NP viewers)\\
  (VP program) (NP the program) .\footnote{In these examples, NP is Noun
  Phrase, VP is Verb Phrase, ADVP is Adverbial Phrase and S is a clause.}

\end{itemize}
The clauses in the sentence would be:

\begin{itemize}
\item[](S You will start to see shows (S where (S viewers program the program
  )) .)
\end{itemize}

Chunks and clause information in a sentence can also be represented by means
of tags. In \cite{sang00b}, there are several equivalent chunk tag sets for
representing chunking. The {\it IOB2} set, which was previously used by
\cite{ratnaparkhi98a}, uses three kinds of tags: B-X for the first word of a
chunk of type X; I-X for a non-initial word in an X chunk; O for a word
outside of any chunk. For clause identification, each word can be tagged with the
corresponding brackets if the word starts and/or ends a clause, or with a null
tag if the word is not the start or the end of a clause. The above example can
be represented using this notation as follows:

\begin{small}
\begin{center}
\begin{tabular}{lll}
You & B-NP & (S*\\
will & B-VP & *\\
start & I-VP& *\\
to &I-VP &*\\
see &I-VP& *\\
shows &B-NP& *\\
where &B-ADVP& (S*\\
viewers & B-NP &(S*\\
program & B-VP &*\\
the &B-NP &*\\
program &I-NP &*S)S)\\
. &O &*S)\\
\end{tabular}
\end{center}
\end{small}


Earlier approaches to solving this problem consisted of parsers
based on grammars of hand-coded rules. \cite{abney96b} developed the
incremental partial parser {\it CASS} based on finite state methods for
detecting chunks and clauses. \cite{mokhtar97} also built an incremental
architecture of finite-state transducers that identifies chunks and detects
subjects and objects. \cite{voutilainen93} used a different formalism
(constraint grammars) to detect NPs.

 
Different learning methods have been applied to
shallow parsing in the literature: Transformation-based Learning, Memory-based Learning,
Hidden Markov Models, Maximum Entropy, Support Vector Machines, etc. The first
works focused mainly on NP detection. \cite{ramshaw95} used
Transformation-based Learning and put forward a standard training and testing
data set, which has since been used to compare other approaches. Memory-based
learning was used by \cite{daelemans99} and \cite{argamon98}. The main
reference for text chunking is the shared task for
CoNLL-2000\footnote{Chunking shared task is available at
  {\tt http://lcg-www.uia.ac.be/conll2000/chunking/}}
\citep{sang00d}.  In Section \ref{comparison}, we will briefly review the
different approaches to text chunking presented in this shared task.


Learning approaches for clause identification have recently been
developed. \cite{orasan00} applied memory-based learning techniques and
corrected the output by applying some rules. In the shared task for
CoNLL-2001\footnote{Clause identification shared task is
  available at  {\tt http://lcg-www.uia.ac.be/conll2001/clauses/}}
\citep{tksdj2001}, other approaches were presented (Hidden Markov Models,
Memory-based Learning, Boosting, etc.).

In this paper, we present a unified technique to construct Specialized HMMs to be used for solving shallow parsing tasks. First, in Section
2, we formalize shallow parsing as a tagging problem and define the method for
the specialization of the models. In Section 3, we present the results of the
application of our approach to text chunking. We test different specialization
criteria under the same experimental conditions as the CoNLL-2000 shared
task. We achieved state-of-the-art performance, as we show in Section 4. We
also apply this technique to solve the clause identification problem in Section
5. Finally, we present some concluding remarks.    


\section{Shallow parsing as HMM-based tagging}\label{sectLex}

We consider shallow parsing to be a tagging problem.
From the statistical point of view, tagging can be solved as a
maximization problem. 

Let $\mathcal{O}$ be a set of output tags and $\mathcal{I}$ the input
vocabulary of the application.  
Given an input sentence $I=i_1, \ldots, i_T$, where $i_j\in \mathcal{I}:
\forall j$, the process consists of finding the sequence of states of maximum
probability on the model. That is, the sequence of output tags,  $O=o_1,
\ldots, o_T$, where $o_j\in \mathcal{O}: \forall j$. This process can be
formalized as follows:
\begin{align}
\widehat{O} & =\arg\max_{O} P(O|I)\nonumber\\
& =\arg\max_{O}\ \left(  \frac{ P(O)\cdot P(I|O)}{P(I)}\right)\text{; }
O\in\mathcal{O}^{T}
\label{eq1}
\end{align}

Because $P(I)$ does not depend on $O$, it can be dropped from the
 maximization. Taking into account the Markov assumptions, the problem is
 reduced to solving the following equation (for a second-order HMM):


\begin{equation}
\arg\max_{o_1 \ldots o_T}\left(
{\displaystyle\prod\limits_{j:1 \ldots T}}
P(o_{j}|o_{j-1}, o_{j-2})\cdot P(i_{j}|o_{j})\right) \label{eq2}%
\end{equation}

The parameters of equation~\ref{eq2} can be represented as a second-order HMM
whose states correspond to a tag pair.
Contextual probabilities, $P(o_{j}|o_{j-1},o_{j-2})$, represent the transition
probabilities between states and $P(i_{j}|o_{j})$ represents
the output probabilities.
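
As a sketch of the intermediate step between equations~\ref{eq1} and~\ref{eq2}, the factor $P(I)$ is constant with respect to $O$, and the two remaining factors are approximated under the second-order Markov assumption and the assumption that each input symbol depends only on its own tag. The boundary tags $o_{0}$ and $o_{-1}$ are assumed here to be fixed start-of-sentence markers, a convention that is not stated explicitly above:
\begin{align*}
\widehat{O} & =\arg\max_{O}\ P(O)\cdot P(I|O)\\
P(O) & \approx \prod_{j=1}^{T} P(o_{j}|o_{j-1},o_{j-2})\\
P(I|O) & \approx \prod_{j=1}^{T} P(i_{j}|o_{j})
\end{align*}
Substituting the two approximations into the first line gives equation~\ref{eq2}.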

This formalism has been widely used to efficiently solve part-of-speech (POS)
tagging \citep[e.g.,][]{church88, merialdo94, Brants:00a}. In
POS tagging, the input vocabulary is composed of words and the output tags are
POS or morphosyntactic tags. The segmentation produced by several
shallow parsing tasks, such as text chunking or clause identification, can be
represented as a sequence of tags, as we mentioned above. Therefore, these
problems can also be solved in a way similar to POS tagging. 
However, to successfully solve each task, it is necessary to deal
 with certain peculiarities.

On the one hand, one must decide which available input information is
  really relevant to the task. POS tagging considers only {\it words} in the
  input. In contrast, chunking can consider {\it words} and {\it POS tags},
  and clause identification can take into account {\it words}, {\it POS tags} and 
{\it chunk tags}. In this case, if all this input information
is considered, the input vocabulary of the application could become very
large, and the model would be poorly estimated.

On the other hand, the output tag set could be too generic to produce accurate
models. The contextual model can be enriched by considering a more
fine-grained output tag set by
adding some kind of information to the output tags. For instance, in 
the chunking task we have enriched the chunk tags by adding to them  POS
information  and selected words as we will show below. This aspect has also
been tackled in POS tagging \citep{kim99, sang-zoo00, pla2001b}, by lexicalizing the models, that is, by incorporating words 
into the contextual model.

In this work, we propose a simple technique that permits us to encode the
available information into the model, 
without changing the learning or the tagging processes. 
This method
consists of modifying the original training data set in order to consider only
the relevant input information and to extend the output tags with additional
information.  

This transformation is the result of applying a {\it specialization} function
$f_{s}$ on the original training set $\mathcal{T}$ to produce a new one
$\widetilde{\mathcal{T}}$, that is:
\[
f_{s}:\mathcal{T}\subset(\mathcal{I}\times {\mathcal O})^{\ast}\rightarrow\widetilde{\mathcal{T}}\subset(\widetilde{\mathcal{I}}\times\widetilde{\mathcal{O}})^{\ast} 
\]
This function transforms every training tuple $ \langle
i_j,o_j \rangle$ 
to a new tuple 
$   \langle \widetilde{i}_j,\widetilde{o}_j
\rangle $, and thus the original input and
output sets to the new sets
$\widetilde{\mathcal{I}}$ and $\widetilde{\mathcal{O}}$, by concatenating the
selected information. 
This function has to be experimentally defined for each
task, as we will discuss in Section \ref{chunking}.

\begin{figure}[t]
\begin{small}
\begin{center}
\begin{tabular}{llllllll}

\multicolumn{4}{c}{$\mathcal{T}$} & $\stackrel{f_s}{\longrightarrow}$ &
\multicolumn{3}{c}{$\widetilde{\mathcal{T}}$}\\
\cline{1-4}\cline{6-8}
~&~&~&~&~&~&~&~\\
\multicolumn{2}{c}{$ I $}& & \multicolumn{1}{c}{$ O $} & &\multicolumn{1}{c}{$\widetilde{I}$} & &
\multicolumn{1}{c}{$\widetilde{O}$}\\\cline{1-2}\cline{4-4}\cline{6-6}\cline{8-8}
& & & & & & &\\ 
You     &PRP& &B-NP & &PRP & &\multicolumn{1}{r}{PRP$\cdot$B-NP}\\
will    &MD & &B-VP & &MD & &\multicolumn{1}{r}{MD$\cdot$B-VP}\\
start   &VB & &I-VP & &VB& &\multicolumn{1}{r}{ VB$\cdot$I-VP}\\
to      &TO & &I-VP & &TO& & \multicolumn{1}{r}{TO$\cdot$I-VP}\\
see     &VB & &I-VP & &VB& & \multicolumn{1}{r}{VB$\cdot$I-VP}\\
shows   &NNS& &B-NP & &NNS& &\multicolumn{1}{r}{ NNS$\cdot$B-NP}\\
where   &WRB& &B-ADVP &  &where$\cdot$WRB& &\multicolumn{1}{r}{ where$\cdot$WRB$\cdot$B-ADVP}\\
viewers &NNS& &B-NP & & NNS& &\multicolumn{1}{r}{ NNS$\cdot$B-NP}\\
program &VBP& &B-VP & & VBP& & \multicolumn{1}{r}{VBP$\cdot$B-VP}\\
the     &DT & &B-NP & & DT& &\multicolumn{1}{r}{ DT$\cdot$B-NP}\\
program &NN & &I-NP & & NN& & \multicolumn{1}{r}{NN$\cdot$I-NP}\\
.       &  .& &O & &.& & \multicolumn{1}{r}{.$\cdot$O}\\\\\hline
\end{tabular}
\end{center}
\end{small}
\caption{Example of the result of applying specialization on a sentence.} \label{fig-example-esp}
\end{figure}

Figure \ref{fig-example-esp} shows an example of the application of this
function on a sample of the training set used in the chunking task. In this
example, we have considered POS tags and certain selected words as relevant
input information. The output tags have also been enriched with this
information. For example, the tuple  $\langle$You$\cdot$PRP, B-NP$\rangle$ is
transformed to the new tuple $\langle$PRP,~PRP$\cdot$B-NP$\rangle$,
considering only POS information. On the other hand, the tuple
$\langle$where$\cdot$WRB,~B-ADVP$\rangle$, considering also lexical
information, is transformed to the new tuple
$\langle$where$\cdot$WRB,~where$\cdot$WRB$\cdot$B-ADVP$\rangle$.

From this new training set {$\widetilde{\mathcal{T}}$}, we can learn the Specialized HMM by maximum likelihood in the usual way. 
The tagging process is carried out by Dynamic Programming Decoding using the
Viterbi algorithm. This decoding process is not modified; it is only necessary to apply
the decisions made in the specialization process,
that is, to take only the relevant information as input
and to map the sequence of output tags (which belong to $\widetilde{\mathcal{O}}$)
back to the original output tags (which belong to $\mathcal{O}$). This mapping is direct.
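
One way to write this mapping explicitly is sketched below. The symbol $g$ is introduced here only for illustration, and the sketch assumes specialized tags of the form shown in Figure~\ref{fig-example-esp}, where $w$, $p$ and $ch$ stand for a word, a POS tag and an original output tag. The mapping is a projection $g:\widetilde{\mathcal{O}}\rightarrow\mathcal{O}$ that keeps the last component of a specialized tag and discards the concatenated information:
\[
g(p\cdot ch)=ch,\qquad g(w\cdot p\cdot ch)=ch
\]
For the sentence in Figure~\ref{fig-example-esp}, this gives $g(\textrm{PRP}\cdot\textrm{B-NP})=\textrm{B-NP}$ and $g(\textrm{where}\cdot\textrm{WRB}\cdot\textrm{B-ADVP})=\textrm{B-ADVP}$.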


\section{Chunking evaluation}\label{chunking}

We present a set of experiments in order to evaluate the  chunking approach
proposed in this work. 
We focus on the different specialization criteria considered to
construct the specialized HMM.
As we mentioned above, one of the advantages of our approach is that no change
is needed for either the training or the decoding processes
carried out when specialized HMMs are used. To confirm this, all the
experimental work was conducted using the TnT\footnote{TnT is available at {\tt
    http://www.coli.uni-sb.de/thorsten/tnt}} tagger developed by
\cite{Brants:00a} without making any modification to its source code.

TnT is a very efficient statistical POS tagger based on HMMs. 
To deal with data sparseness, it uses linear interpolation as a smoothing
technique to estimate the model. To handle unknown words, it uses a
probabilistic method based on the analysis of the suffix of the words.\footnote{In our case, the suffix method is not the most suitable for handling
  unknown words because the input vocabulary is reduced to POS tags or
  concatenations of POS tags and words. In all the experiments performed, the whole input vocabulary had been seen in the training set. Therefore, we did not
  consider it necessary to study this problem.} All the following
experiments were done with TnT's default options using second-order HMMs.
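
As a sketch of the smoothing involved, \cite{Brants:00a} describes the contextual estimate used by TnT as a linear interpolation of unigram, bigram and trigram maximum likelihood estimates. In the notation of equation~\ref{eq2} (the weights $\lambda_k$ and the hat notation for maximum likelihood estimates are introduced here for illustration), this reads:
\[
P(o_{j}|o_{j-1},o_{j-2})=\lambda_{1}\,\widehat{P}(o_{j})+\lambda_{2}\,\widehat{P}(o_{j}|o_{j-1})+\lambda_{3}\,\widehat{P}(o_{j}|o_{j-1},o_{j-2})
\]
where $\lambda_{1}+\lambda_{2}+\lambda_{3}=1$ and the weights are estimated from the training data by deleted interpolation.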


We used the data defined in the shared task of CoNLL-2000.
The characteristics of this task were described by \cite{sang00d}. It used the
same  Wall Street Journal corpus sections defined by \cite{ramshaw95}, that is, 
sections 15-18 for training, and section 20 for testing. The set of chunks
(NP, VP, ADVP, ADJP, PP, SBAR, CONJP, PRT, INTJ, LST, UCP)  was
derived from the full parsing taking into account certain assumptions and
simplifications.\footnote{The script to derive the chunks is available at {\tt http://ilk.kub.nl/\symbol{126}sabine/chunklink/}}
The POS tagging was obtained using the Brill tagger \citep{brill95a} without correcting the
tagger output.
We also compared our results to those obtained by the other approaches that
participated in the shared task (see Section \ref{comparison}).

As we stated in Section \ref{sectLex}, two decisions must be made in order to define a specialization function
on the training set: which
information from the available input is relevant and how to refine the output tags.
  

These decisions were tested experimentally. We used a development set
(which was different from the test set) in order to select this information. 
We divided the original training set into two partitions: 90\% for training and 10\% for tuning (development
set). To do that, out of every ten consecutive sentences of the original training set, we used the first nine for
training and the tenth for tuning. We tested different combinations of input and output information on the
development set and we selected those that improved the results obtained
with the non-specialized training data set.
 
The baseline system, which we called {\bf BASIC}, considered training tuples
such as
$\langle p_i,ch_i \rangle$, that is, only POS tags ($p_i$) were taken into account as
input vocabulary, and no changes were made in the chunk tag set ($ch_i$).
This criterion gave a very poor performance, because the output tag set was too generic to construct accurate models. Therefore, we defined a 
specialization function $f_s$ in order to extend the output vocabulary as
follows:
\[
f_{s}(\langle w_{i} \cdot p_{i},ch_{i}\rangle)=\langle p_{i},p_{i} \cdot  ch_{i}\rangle
\]
This criterion, denoted as {\bf SP}, produces training tuples $\langle p_i,p_i \cdot ch_i \rangle$ in which only POS tags are considered as input and the output tags are enriched
with the POS tag associated with the input word ($w_i$).
  
Finally, the {\bf SP} criterion was tested by adding lexical information in
both the input and the output. The problem was that adding all the words to the
input and/or the output produced very large models and no improvements were
observed. For this reason, we tested a selective lexicalization of the model. That is,
only a set of certain relevant words (${\mathcal W}_s$) were considered in the contextual language
model. In this respect, we defined the following specialization function:
\[
f_{s-Lex}(\langle w_{i} \cdot p_{i},ch_{i}\rangle)=
\begin{cases}
\langle w_i \cdot p_i,\ w_i \cdot p_i \cdot ch_i \rangle & \text{if $w_i \in {\mathcal W}_s$}\\
\langle p_i,\ p_i \cdot ch_i \rangle & \text{if $w_i \notin {\mathcal W}_s$}
\end{cases}
\]
The main problem of this lexicalization is to define the set of words
$\mathcal{W}_s$ to be considered. We defined some criteria in order to
automatically  extract the relevant words that improved the performance on the
development set. These criteria are summarized below.

\begin{itemize}
\item {\bf Lex-WCC} selects the words from the training set that belong
to closed categories.\footnote{The closed categories considered are: {\it CC, DT, MD, POS, PP\$, RP, TO, WDT, WP\$, EX, IN, PDT, PRP, WP, WRB.}} 
\item {\bf Lex-WHF} selects the words whose frequency in the training set
was higher than a certain {\it threshold}. In order to determine which threshold maximized the performance of
the model (that is, the best set of words to specialize the model), we tuned
it on the development partition with word sets of different sizes.
The best performance was obtained by selecting the words whose frequency was
higher than 100. 
\item {\bf Lex-WTE} selects the words whose chunk tagging error rate was
  higher than a
  certain {\it threshold}. These words were extracted from the output provided
  by the tagger that uses the {\bf SP} model. The best {\it threshold}
  obtained  corresponds to the words whose  error frequency was higher than 2 in the development set.
\item {\bf Lex-WCH} selects the words that belong to certain chunks such as 
SBAR, PP and VP with high frequency in the training set.
\item {\bf Lex-COM} selects the words corresponding to a combination of {\bf
    Lex-WTE} and {\bf Lex-WCH} criteria.  
\end{itemize}

\begin{table}[h]
\begin{center}
\begin{tabular}
{|l|c|c|c|r|r|}\cline{1-6}
\multicolumn{1}{|l|}{specialization criteria} & precision & recall &
F$_{\beta=1}$  & $ |\widetilde{\mathcal{O}}|$ & $|{\mathcal W}_s|$\\\hline\hline
 {\bf BASIC}  & 84.16\% & 84.52\% & 84.34 & 22 & 0 \\\hline
 {\bf SP}  & 90.44\% & 89.96\% & 90.20 & 317 & 0\\\hline
 {\bf SP + Lex-WCC}  & 91.99\% & 91.56\% & 91.77 & 830 & 154\\\hline
 {\bf SP + Lex-WHF}  & 91.87\% & 92.14\% & 92.00 & 1,086 & 144\\\hline
 {\bf SP + Lex-WTE}  & 92.22\% & 92.00\% & 92.11 & 592 & 38\\\hline
 {\bf SP + Lex-WCH}   & 92.03\% & 92.25\% & 92.14 & 1,305 & 217\\\hline
 {\bf SP + Lex-COM} & 92.10\% & 92.35\% & 92.23 & 1,341 & 225\\\hline

\end{tabular}
\end{center}
\caption{Overall chunking results on the development data set using Specialized HMMs with different specialization criteria.}
\label{tab_EspCriteria_dev}
\end{table}

Table~\ref{tab_EspCriteria_dev} shows the results of the tuning process on 
the development data set measured in terms of precision, recall and F$_{\beta}$
rate. In addition, it shows the size of the output tag set
($|\widetilde{\mathcal{O}}|$) and the size of the selected word set
($|{\mathcal W}_s|$). It can be observed that all the
specializations considered outperformed the {\bf BASIC} model. 
Although each lexicalization criterion used a different set of selected words,
with sizes that ranged between 38 and 225 words, the {\bf SP~+~Lex-WHF}, {\bf
  SP~+~Lex-WTE} and  {\bf SP~+~Lex-WCH} criteria achieved similar results, while the {\bf
  SP~+~Lex-WCC} criterion achieved somewhat worse results. 
We obtained the best performance with the combination criterion {\bf
  SP~+~Lex-COM}, with an F$_{\beta}$ improvement of 9.4\% with respect to the {\bf BASIC} model. The
improvement of the {\bf SP} model with respect to the {\bf BASIC} model was
about 7\% and the use of lexicalization criteria increased F$_{\beta}$ by about
2\% with respect to the {\bf SP} model.
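
For reference, the F$_{\beta}$ rate is the standard combination of precision and recall,
\[
F_{\beta}=\frac{(\beta^{2}+1)\cdot \mathit{precision}\cdot \mathit{recall}}{\beta^{2}\cdot \mathit{precision}+\mathit{recall}},
\]
which for $\beta=1$ is their harmonic mean. As a check against Table~\ref{tab_EspCriteria_dev}, the {\bf BASIC} row gives $2\cdot 84.16\cdot 84.52/(84.16+84.52)\approx 84.34$. The percentages quoted above are consistent with relative improvements in F$_{\beta}$ computed from the same table:
\begin{align*}
\frac{92.23-84.34}{84.34} & \approx 9.4\% & & \text{({\bf SP~+~Lex-COM} with respect to {\bf BASIC})}\\
\frac{90.20-84.34}{84.34} & \approx 6.9\% & & \text{({\bf SP} with respect to {\bf BASIC})}\\
\frac{92.23-90.20}{90.20} & \approx 2.3\% & & \text{({\bf SP~+~Lex-COM} with respect to {\bf SP})}
\end{align*}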

We think that these statistical criteria can be improved
by means of a linguistic study to determine which words are really
relevant to this disambiguation problem. The experiments conducted have given
us some clues about this, because we have observed that the effect of some
words on the overall result is not significant. For example, the {\bf
 SP~+~Lex-WTE} criterion, which provides only a small set
 of 38 specialized words, performed better than some criteria that use larger word sets.


\begin{table}[h]
\begin{center}
\begin{tabular}
{|l|c|c|c|r|r|}\cline{1-6}
\multicolumn{1}{|l|}{specialization criteria} &  precision &  recall &
F$_{\beta=1}$  & $ |\widetilde{\mathcal{O}}|$ & $|{\mathcal W}_s|$\\\hline\hline
 {\bf BASIC}  & 84.31\% & 84.35\% & 84.33 & 22 & 0 \\\hline
 {\bf SP}  & 89.58\% & 89.55\% & 89.57 & 320 & 0\\\hline
 {\bf SP + Lex-WCC}  & 91.50\% & 91.51\% & 91.51 & 846 & 154\\\hline
 {\bf SP + Lex-WHF}  & 91.30\% & 91.76\% & 91.53 & 1,105 & 144\\\hline
 {\bf SP + Lex-WTE}  & 91.65\% & 91.82\% & 91.73 & 601 & 38\\\hline
 {\bf SP + Lex-WCH}   & 91.74\% & 92.12\% & 91.93 & 1,339 & 217\\\hline
 {\bf SP + Lex-COM} & 91.96\% & 92.41\% & 92.19 & 1,381 & 225\\\hline

\end{tabular}
\end{center}
\caption{Overall chunking results on the shared task using Specialized HMMs with different specialization criteria.}
\label{tab_EspCriteria}
\end{table}  

\begin{table}[h]
\begin{center}
\begin{tabular}
{|l|c|c|c|}\cline{1-4}
\multicolumn{1}{|l|} {chunk} & precision & recall & F$_{\beta=1}$\\\hline\hline
             ADJP & 71.19\% & 67.12\% & 69.10\\
             ADVP & 79.84\% & 79.10\% & 79.47\\
            CONJP & 38.46\% & 55.56\% & 45.45\\
             INTJ & 50.00\% & 50.00\% & 50.00\\
               NP & 92.30\% & 92.68\% & 92.49\\
               PP & 96.58\% & 97.40\% & 96.99\\
              PRT & 71.43\% & 75.47\% & 73.39\\
             SBAR & 85.50\% & 84.86\% & 85.18\\
               VP & 91.73\% & 92.81\% & 92.26\\\hline
              all & 91.96\% & 92.41\% & 92.19\\\hline
\end{tabular}
\end{center}
\caption{Chunking results on the test set of the shared task using Specialized HMMs with the 
  {\bf SP~+~Lex-COM} criterion.}
\label{tab_shared_task}
\end{table}

\begin{figure}
[h]
\begin{center}
\includegraphics[
width=12.0cm
]
{supergraficon.eps}
\caption{Improvement on F$_\beta$ rate for certain chunks on the shared task using Specialized HMMs.}
\label{f1}
\end{center}
\end{figure}  

\begin{figure}
[h]
\begin{center}
\includegraphics[
width=9cm
]
{FerFiguraTab.eps}
\caption{Evolution of the F$_{\beta}$ rate using training sets of different
  size.}
\label{figa1}
\end{center}
\end{figure}


Once the system was tested on the development data set, 
 new models were learnt using the original training data set (sections 15-18)
with the best specialization parameters obtained in the tuning process. Next,
the system was tested on a new unseen data set (section
 20). Table~\ref{tab_EspCriteria} shows that the system had a similar
 behaviour for both development and test data sets. The best performance was
 also obtained by using the {\bf SP~+~Lex-COM} criterion, which suggests that
 a model that includes these specialization parameters  could be successfully applied to other unseen data. 
 
The results for precision, recall and  F$_{\beta}$ rate for all the chunks
considered using the {\bf SP~+~Lex-COM} criterion are
summarized in Table~\ref{tab_shared_task}. These results outperformed the
F$_{\beta}$ rate of {\bf BASIC} and {\bf SP} for each chunk. The details of
this improvement are shown in Figure \ref{f1}. The highest improvement was
achieved for SBAR and PRT chunks. This is because the set of selected words
that we considered included words that usually appear in these chunks.

Because specialization increases the number of parameters of the models, these models should be better estimated if the training data set is larger. To confirm this, we
conducted an experiment increasing the size of the training data set. We chose
training data from sections 00 to 19 of the WSJ corpus; the test data set was again
section 20 and we used the same word set as in the shared task ($|{\mathcal W}_s|=225$), obtained using the {\bf SP~+~Lex-COM} 
criterion, to specialize the model.   
Figure \ref{figa1} shows that  F$_{\beta}$ improves as the size of the training
set increases, achieving an F$_{\beta}$ rate of 93.25 with a training data set
size of about 950,000 words and 1,960 output tags.
\begin{table}[h]
\begin{center}
\begin{tabular}
{|l|c|c|c|}\cline{1-4}
\multicolumn{1}{|l|}{chunk} & precision & recall & F$_{\beta=1}$\\\hline\hline


             ADJP & 78.54\% & 71.00\% & 74.58\\
             ADVP & 81.85\% & 79.68\% & 80.75\\
            CONJP & 40.00\% & 66.67\% & 50.00\\
             INTJ & 50.00\% & 50.00\% & 50.00\\
               NP & 93.52\% & 93.43\% & 93.48\\
               PP & 96.77\% & 97.73\% & 97.25\\
              PRT & 75.00\% & 82.08\% & 78.38\\
             SBAR & 88.70\% & 88.04\% & 88.37\\
               VP & 93.33\% & 93.73\% & 93.53\\\hline
              all & 93.25\% & 93.24\% & 93.25\\\hline

\end{tabular}
\end{center}
\caption{Chunking results on the test set of the shared task with the large training data set, using Specialized HMMs with the {\bf SP~+~Lex-COM} criterion.}
\label{tab_shared_task_PLUS}
\end{table}

Table \ref{tab_shared_task_PLUS} summarizes the
results for each chunk using this large training data set. It
can be observed
that all the F$_{\beta}$ rates, except that of INTJ (which remained at 50.00), improved on the results achieved with the small
training data set. Although the  overall F$_{\beta}$ improvement is only about 1\%, the best improvements were achieved for those chunks that include selected words (6.8\% for PRT, 7.9\% for ADJP and 3.8\% for SBAR).   

Finally, we would like to note that the efficiency of the system is not
reduced even though $|\widetilde{\mathcal{O}}|$ increases when specialized models
are used. For the model learnt with the {\bf SP~+~Lex-COM} criterion, the
tagging speed remains around 30,000 words/second on a 500 MHz
Pentium. Although we have not made a comparative study of the efficiency of
other approaches, we think that this speed would be difficult for other
systems to surpass. 

\section{Comparison with other chunking approaches}\label{comparison}

We compared our results with those of systems that have reported results
for the same training and test data used in the chunking shared task of
CoNLL-2000. Each of these systems uses a specific learning method and
different kinds of information. In the following, we briefly review each of
these approaches. Finally, we compare the different kinds of information used for the different learning methods. Other comparisons, such
as an efficiency comparison, cannot be made because the authors did not usually report the relevant figures. Basically, these systems can be divided into the following groups:
rule-based systems, memory-based systems, statistical systems and combined systems.


\subsection*{Rule-based systems} 
 
The ALLiS system \citep{dejean00} is based on theory refinement. It attempts to improve a previously learnt grammar using ``contextualization'' and ``lexicalization'' operators. The method only takes context into account if the confidence value for a certain POS tag is under a certain threshold. It calculates the ratio between the number of occurrences of a POS tag in a chunk and the number of occurrences of this POS tag in the training corpora. If this ratio is higher than a certain threshold, then the corresponding chunk tag is assigned. If not, it takes into account left and right context (``contextualization'') and the current word (``lexicalization'').
  
The system presented by \citet*{johansson00} picks the most likely chunk tag for a given context, assuming that a larger context (if it has been seen in the training data) overrides the label proposed by a smaller context. It obtains the best results for 5-context, that is, a context of the current POS tag, the two left POS tags and the two right POS tags.
 
\subsection*{Memory-based systems}  

\citet*{veenstra00} studied how the memory-based learning algorithm,
    implemented in TiMBL software, performs with different
    settings. Memory-based learning consists of storing the instances seen
    during learning in memory along with the corresponding categories. A new instance can be classified in a category by computing the distance between the new instance and the stored instances. The best results are obtained using the IB1-IG algorithm \citep{daelemans97} and applying the modified value difference metric to POS features. 

  
\subsection*{Statistical systems} 

\citet*{osborne00} used Ratnaparkhi's Maximum Entropy-based tagger
    \citep{ratnaparkhi96} to perform chunking. To do that, the input to the
    tagger is redefined to be a concatenation (``configuration'') of the
    different contexts that are useful for chunking. 
 It performs chunking in two steps. First, it guesses a default chunk tag by applying a
    model which has been learnt with configurations which consist of the current POS tags and words. Second, it produces the definitive
    chunk tags by applying a model learnt with configurations which also take into account the chunk tags previously guessed. Little improvement is achieved by incorporating suffix and
    prefix information into the configurations.
 

The approach of \citet*{koeling00} also builds a ME model which takes into
account several individual features 
and complex features combining POS tags and chunk tags.

In our previous work \citep{pla2000c}, we used a two-level first-order HMM that performed tagging and chunking at the same time. The model was also refined by lexicalization to improve its performance.


  \citet*{zhou00} incorporated contextual information into a bigram model by
    means of defining structured tags as input. These tags are composed of the
    current word (if it belongs to a certain category), the current POS tag, the previous POS tag, the descriptor of the phrase category and a structural relation tag that indicates if two adjacent words have the same parent. The model is refined by an error-learning technique that keeps only the words whose error rate decreases when they are incorporated into the model. Finally, memory-based sequence learning is applied to incorporate chunk pattern probabilities, achieving a slight improvement.


\subsection*{Combined systems}
These systems combine the output of different classifiers in order to improve the chunking performance.  


  \citet*{kudo00} combined several Support Vector Machine (SVM) classifiers. SVMs
    are well suited to two-class
    pattern recognition problems: they are binary classifiers that
    decide whether an instance (a token) belongs to a class or not. Because chunking is a multi-class task, pairwise
    classification is used: a classifier is trained for each pair of
    different chunk tags (for $K$ chunk tags, $K(K-1)/2$ classifiers have to be
    learnt). The results of all the classifiers are combined by a dynamic
    programming algorithm. Later, \citet*{kudo2001} introduced a weighted voting technique that improved the results on the same training and test data.
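
As a small worked illustration of the pairwise scheme, consider a purely hypothetical tag set of $K=3$ chunk tags (B, I and O; this is an assumption made for the example, not the tag set of the cited work). One binary classifier is trained for each unordered pair of tags:
\[
\binom{K}{2}=\frac{K(K-1)}{2}=\frac{3\cdot 2}{2}=3 ,
\]
namely B-vs-I, B-vs-O and I-vs-O. The number of classifiers grows quadratically with the size of the tag set: $K=10$ tags would already require $45$ classifiers.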
 

  By majority voting, \citet*{sang00c} combined the results provided by five
    different memory-based learning classifiers (one classifier for each
    different chunk representation, that is, IOB1, IOB2, IOE1, IOE2 and
    C+O). In all cases, the combined system outperformed the individual
    systems. The system performs the task in two steps. First, it identifies
    chunk boundaries and second, it assigns the chunk tag. In the first step it considers POS and words as features and, in the
    second step, it adds the context of chunk tags guessed in the first phase.  

 
  \citet*{vanhalteren00} presented another combined system that uses a more sophisticated
    combining technique called Weighted Probability Distribution Voting
    (WPDV). It combines the output of five classifiers: a memory-based
    learning classifier and four different WPDV classifiers. Because the combined
    output can contain systematic errors, these are also corrected using a WPDV model for each kind of error.


Recently, other systems based on the Winnow algorithm have been applied to chunking on the same data set. These systems learn different classifiers. Each predicts the start or the end of a kind of chunk. The output of these classifiers is combined in order to chunk the sentence satisfying some constraints such as non-overlapping constraints. The SNoW architecture, which is based on the Winnow algorithm, is used by \citet*{li2001}. The algorithm was modified by \citet*{zhan01} to guarantee its convergence for linearly non-separable data and was successfully applied to text chunking.   
   
\begin{table}[t]
\begin{center}
{\small   
\begin{tabular}{|l|l|c|c|c|c|c|c|c|c|}\hline
 System &  Method & $w$ & $w_{left}$ & $w_{right}$ & $p$ & $p_{left}$ & $p_{right}$ & $c_{left}$ & $F_{\beta}$ \\\hline\hline  
[KM01] & SVM(Comb) & x & 2 & 2 & x & 2 & 2 & 2 & 93.91 \\\hline
[ZDJ01] & Winnow & x & 2 & 2 & x & 2 & 2 & 2 & 93.51 \\\hline
[Hal00] & WPDV(Comb)& x & 1-5 & 1-5 & x & 3-5 & 3 & 2 & 93.32 \\\hline
[LR01]& Winnow & & & & & & & & 93.02 \\\hline
[TKS00]& MBL(Comb) & x & 4 & 4 & x & 4 & 4 & & 92.50 \\\hline
SP+Lex-COM & HMM & x & 2 & & x & 2 & & 2 & 92.19 \\\hline
[ZST00]& HMM + MBL& x & 1 & & x & 1 & & 1 & 92.12 \\\hline
[Dej00]& Rule-based & x & & & x & 1 & 1 & & 92.09 \\\hline
[Koe00]& ME & x & 1 & 1 & x & 3 & 2 & 3 & 91.97 \\\hline
[Osb00]& ME & x & 2 & 2 & x & 2 & 2 & 2 & 91.94 \\\hline
[VB00]& MBL & x & 5 & 3 & x & 5 & 3 & &91.54 \\\hline
[PMP00]& HMM & x & 1 & & x & 1 & & 1 &90.14 \\\hline
[Joh00]& Rule-based& & & & x & 0-3 & 0-3 & &87.23 \\\hline
\end{tabular}
}
\caption{$F_{\beta}$ results of the shallow parsing systems reviewed and a comparison of the
  information that each of them uses in the learning process.
} \label{tabla-rasgos}
\end{center}
\end{table}

\subsection*{Comparison}

In Table \ref{tabla-rasgos} we have summarized the features that each
approach takes into account and the $F_{\beta}$ result reported. The table indicates whether a
  model uses some of the following features: the current word ($w$), current
  POS tag ($p$), the words to the left ($w_{left}$) and to the right
  ($w_{right}$), the POS tags to the left ($p_{left}$) and to the right
  ($p_{right}$), and the chunk tags to the left ($c_{left}$). In addition,
  \cite{osborne00} considers prefix and suffix word information, and the
  current chunk; \cite{koeling00} also incorporates complex features by
  concatenating individual features; \cite{zhou00} include structural
  relations between words and the descriptor of the phrase category, and take
  into account only certain words; \cite{sang00c} also considers a context of
  two left and two right chunk tags guessed in a first level; and \cite{zhan01} also include second-order features, such as POS-POS, chunk-chunk, word-POS and chunk-POS. \cite{li2001}
  do not report feature information.

It can be
seen that combined systems perform better than the individual systems (only
the Winnow-based systems outperform some of the combined systems). There are six systems that
produce very similar results (between 91.5\% and 92.2\%). The results of our
system (Specialized HMMs) are slightly better than those of the other individual
classifiers in this group. Among the individual systems, only the Winnow-based ones perform better than Specialized HMMs. One
conclusion that we can draw from Table~\ref{tabla-rasgos} is that HMM approaches can perform
as well as or better than most other individual systems while encoding less information. This shows
the importance not only of feature selection but also of the algorithm itself. Arguably, the dynamic programming decoding algorithm used by HMM systems to solve the maximization equation implicitly takes into account information from the whole sentence.
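
The following is a minimal sketch of the kind of dynamic programming decoding mentioned above (commonly known as the Viterbi algorithm), simplified to a first-order (bigram) HMM for brevity. The function names, the three chunk-like states and all probabilities are hypothetical and chosen only for illustration; they are not the models of the system described here.

```python
import math


def viterbi(observations, states, log_init, log_trans, log_emit):
    """Most probable state sequence of a first-order HMM (log-space Viterbi)."""
    if not observations:
        return []
    # best[s]: log-probability of the best path that ends in state s so far
    best = {s: log_init[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Best predecessor of s; this max is the dynamic programming step
            prev = max(states, key=lambda r: best[r] + log_trans[r][s])
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Recover the path by following back-pointers from the best final state
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def to_log(table):
    """Convert a (possibly nested) table of probabilities to log-probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Hypothetical toy model: states are chunk tags, observations are POS tags.
STATES = ["B", "I", "O"]
INIT = to_log({"B": 0.6, "I": 0.05, "O": 0.35})
TRANS = to_log({
    "B": {"B": 0.1, "I": 0.6, "O": 0.3},
    "I": {"B": 0.1, "I": 0.4, "O": 0.5},
    "O": {"B": 0.5, "I": 0.05, "O": 0.45},
})
EMIT = to_log({
    "B": {"DT": 0.7, "NN": 0.2, "VB": 0.1},
    "I": {"DT": 0.05, "NN": 0.85, "VB": 0.1},
    "O": {"DT": 0.05, "NN": 0.05, "VB": 0.9},
})

print(viterbi(["DT", "NN", "VB", "DT", "NN"], STATES, INIT, TRANS, EMIT))
# prints ['B', 'I', 'O', 'B', 'I']
```

Because every position keeps the best-scoring path into each state, the tag finally chosen for the first token depends on evidence from all later tokens, which is the sense in which the decoder uses information from the whole sentence.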

We have found two approaches that use a technique similar to the one we
describe in this paper: the \citet*{osborne00} Maximum Entropy approach and the HMM used by
\citet*{zhou00}. Like Specialized HMMs, both approaches consider structural
tags or concatenations as input to the system. The differences lie in the
underlying model used and the information which is taken into account in the
input. Both approaches need to encode more information than Specialized HMMs,
but they do not achieve better results: \citeauthor{osborne00} achieved a
similar result ($F_{\beta}$=91.94), but \citeauthor{zhou00} obtained
lower results ($F_{\beta}$=89.57) when error correcting and memory-based
techniques were not applied. Moreover,
\citeauthor{osborne00} needed to learn two models that have to be applied
sequentially: a first model that proposes an initial chunk tag, and a second
one that takes into account the information provided by the first one.

Note that the performance of our preliminary system~\citep{pla2000c}
was considerably lower because it used bigram models instead of trigram models.
In addition, it took only words as input in order to perform tagging and
chunking at the same time, which decreased the performance, as we reported previously \citep{pla2000a}.


\section {Clause identification}

Clause identification can be carried out in a way similar to text chunking, using the  specialization technique previously presented. As in text chunking, the success of the method lies in the definition of the specialization function, that is, in an appropriate selection of the input information and the output tag set.

All the following experiments were conducted under the conditions defined for the
clause-splitting shared task of CoNLL-2001 \citep{tksdj2001}. In this shared
task, clause detection was divided into three parts: clause start-boundary
detection, clause end-boundary detection and embedded clause detection. Here,
we only report results on the third part of the task (the remaining results
are given by \citet{molina2001}). This shared task defined sections
15--18 of the WSJ corpus as the training set, section 20 as the development set and section 21 as the test set.

Clause-splitting can be seen as a step after chunking. Therefore, the available input information consists of words, POS tags and chunk tags, and the output consists of the clause tags. We performed clause-splitting in two phases: first, we
found the best segmentation into clauses for the input sentence using a Specialized HMM; second, we
corrected some balancing inconsistencies observed in the output by applying some rules.

In this case, we considered that the relevant input information for clause-detection consisted of POS tags and chunk tags. In order to produce more accurate models, the output tags were enriched with the POS tag. To avoid incorrectly
balanced clauses in the output, we also added  the number corresponding to the
depth level of the clause in the sentence to the output tags. Thus, the
specialization function $f_s$ was defined as follows:
\[
f_{s}(\langle w_{i} \cdot p_{i} \cdot ch_{i},s_{i}\rangle)=\langle p_{i} \cdot ch_{i},p_{i} \cdot  s^{'}_{i}\rangle
\]
where $s_{i}$ is a clause tag and $s^{'}_{i}$ is the enumerated clause tag (each bracket in the clause tag is enumerated with the depth level of the corresponding clause). The effect of the application of this function ({\bf SP} criterion) on a sentence can be seen in Figure \ref{fig-clause}.
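
The transformation defined by $f_s$ can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function names, the ASCII separator \texttt{|} (standing in for the concatenation dot) and the shortened example sentence are hypothetical, and the format of enumerated closing brackets follows the pattern \texttt{*S3)S2)} used in Figure~\ref{fig-clause}.

```python
SEP = "|"  # stands in for the concatenation dot of the paper's notation


def enumerate_clause_tags(clause_tags):
    """Number every clause bracket with the depth level of its clause."""
    depth = 0
    enumerated = []
    for tag in clause_tags:
        opens, closes = tag.count("(S"), tag.count("S)")
        parts = []
        for _ in range(opens):
            depth += 1
            parts.append(f"(S{depth}")
        parts.append("*")
        if opens == 0 and closes == 0:
            parts.append(str(depth))  # token inside a clause at this depth
        for _ in range(closes):
            parts.append(f"S{depth})")
            depth -= 1
        enumerated.append("".join(parts))
    return enumerated


def specialize(sentence):
    """Map (word, pos, chunk, clause_tag) tuples to (input, output) symbols."""
    numbered = enumerate_clause_tags([clause for _, _, _, clause in sentence])
    return [(f"{pos}{SEP}{chunk}", f"{pos}{SEP}{tag}")
            for (_, pos, chunk, _), tag in zip(sentence, numbered)]


# Hypothetical shortened sentence in the (word, POS, chunk, clause) format.
sentence = [
    ("You", "PRP", "B-NP", "(S*"), ("will", "MD", "B-VP", "*"),
    ("see", "VB", "I-VP", "*"), ("where", "WRB", "B-ADVP", "(S*"),
    ("viewers", "NNS", "B-NP", "(S*"), ("program", "VBP", "B-VP", "*S)S)"),
    (".", ".", "O", "*S)"),
]
for pair in specialize(sentence):
    print(pair)
```

Note that the word itself is dropped from the input symbol, so a standard tagger trained on the transformed pairs sees only the POS tag and chunk tag of each token.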


\begin{figure}[t]
\begin{center}
{\small
\begin{tabular}{lllllllll}

\multicolumn{5}{c}{$\mathcal{T}$} & $\stackrel{f_s}{\longrightarrow}$ &
\multicolumn{3}{c}{$\widetilde{\mathcal{T}}$}\\
\cline{1-5}\cline{7-9}
~&~&~&~&~&~&~&~&~\\


\multicolumn{3}{c} {$ I$}& & \multicolumn{1}{c}{$ O$} & &
\multicolumn{1}{c}{$\widetilde {I}$}
& &\multicolumn{1}{c}{$\widetilde{O}$} \\
\cline{1-3}\cline{5-5}\cline{7-7}\cline{9-9}
~&~&~&~&~&~&~&~&~\\
You &PRP &B-NP& &\multicolumn{1}{r}{(S*} & & PRP$\cdot$B-NP & &\multicolumn{1}{r}{PRP$\cdot$(S1*}\\
will &MD &B-VP& &\multicolumn{1}{r}{*} & &MD$\cdot$B-VP & &\multicolumn{1}{r}{MD$\cdot$*1}\\
start &VB &I-VP& &\multicolumn{1}{r}{*}&  &VB$\cdot$I-VP & &\multicolumn{1}{r}{VB$\cdot$*1}\\
to &TO &I-VP& &\multicolumn{1}{r}{*} & &TO$\cdot$I-VP & &\multicolumn{1}{r}{TO$\cdot$*1}\\
see &VB &I-VP& &\multicolumn{1}{r}{*} & &VB$\cdot$I-VP & &\multicolumn{1}{r}{VB$\cdot$*1}\\
shows &NNS &B-NP& &\multicolumn{1}{r}{*} & &NNS$\cdot$B-NP & &\multicolumn{1}{r}{NNS$\cdot$*1}\\
where &WRB &B-ADVP& &\multicolumn{1}{r}{(S*} & &WRB$\cdot$B-ADVP& &\multicolumn{1}{r}{WRB$\cdot$(S2*}\\
viewers &NNS &B-NP& &\multicolumn{1}{r}{(S*} & &NNS$\cdot$B-NP& &\multicolumn{1}{r}{NNS$\cdot$(S3*}\\
program &VBP &B-VP& &\multicolumn{1}{r}{*} & &VBP$\cdot$B-VP& &\multicolumn{1}{r}{VBP$\cdot$*3}\\
the &DT &B-NP & &\multicolumn{1}{r}{*} & &DT$\cdot$B-NP&  &\multicolumn{1}{r}{DT$\cdot$*3}\\
program &NN &I-NP& &\multicolumn{1}{r}{*S)S)} & & NN$\cdot$I-NP& &\multicolumn{1}{r}{NN$\cdot$*S3)S2)}\\
. &. &O& &\multicolumn{1}{r}{*S)} & &.$\cdot$O & &\multicolumn{1}{r}{.$\cdot$*S1)}\\\\\hline
\end{tabular}
}
\end{center}
\caption{Example of the result of applying specialization on a sample of the training set used in clause-splitting.} \label{fig-clause}
\end{figure}

Because the models have to be smoothed to guarantee complete coverage of
the language, the correct balancing of the output is not assured.
Therefore, we applied the following correcting rules to repair the inconsistencies in the output:
\begin{enumerate}
\item If the clause segmentation presents more {\it start} than {\it end}
  boundaries, we add the {\it end} boundaries that are needed to the last word
  in the sentence (just before the dot).
\item If the clause segmentation presents more {\it end} than {\it start}
  boundaries, we add the {\it start} boundaries that are needed to the first word in the sentence.
\item If the sentence does not start with a {\it start} boundary or does not
  finish with an {\it end} boundary, we add these {\it start} and {\it end}
  tags.
\end{enumerate}
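
The three rules above can be sketched as a small post-processing function. This is a minimal illustration, not the implementation used in the experiments: the function name \texttt{repair\_balance}, the representation of clause tags as strings such as \texttt{(S*} and \texttt{*S)}, and the order of application (rule 3 first, so that the counts compared by rules 1 and 2 already include the sentence-level brackets) are all assumptions.

```python
def repair_balance(words, tags):
    """Return clause tags with equal numbers of start and end boundaries."""
    tags = list(tags)
    if not tags:
        return tags
    # Rule 3: the sentence itself must open and close a clause.
    if not tags[0].startswith("(S"):
        tags[0] = "(S" + tags[0]
    if not tags[-1].endswith("S)"):
        tags[-1] += "S)"
    starts = sum(tag.count("(S") for tag in tags)
    ends = sum(tag.count("S)") for tag in tags)
    if starts > ends:
        # Rule 1: close the open clauses on the last word before the final dot.
        has_final_dot = words[-1] == "." and len(tags) > 1
        last = len(tags) - 2 if has_final_dot else len(tags) - 1
        tags[last] += "S)" * (starts - ends)
    elif ends > starts:
        # Rule 2: open the missing clauses on the first word.
        tags[0] = "(S" * (ends - starts) + tags[0]
    return tags


words = ["He", "said", "it", "rained", "."]
tags = ["(S*", "*", "(S*", "*", "*S)"]  # one clause is never closed
print(repair_balance(words, tags))
# prints ['(S*', '*', '(S*', '*S)', '*S)']
```

The sketch only equalizes the number of start and end boundaries; like the rules themselves, it does not check that the repaired brackets are properly nested.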

An alternative solution would be to incorporate these rules into the model, but this would require modifying the learning, decoding and smoothing processes, which is beyond the scope of this preliminary approach to clause-detection.

Finally, we also tested the different specialization criteria, obtaining
slight improvements. The best results were achieved by incorporating some of
the most frequent words into the
model ({\bf SP+Lex-WHF} criterion). These results, which can be seen in
Table \ref{tab-rdo-claus}, are in line with those presented in the
clause-splitting shared task of CoNLL-2001 by other systems
\citep{tksdj2001}. Our system achieved a performance which was slightly better than
that of other systems based on boosting algorithms, neural networks, symbolic methods or
memory-based learning. The best system used AdaBoost learning combined with
decision trees. In addition, its set of features was adapted to the task
following linguistic criteria.


\begin{table}[t]
\begin{center}
\begin{tabular}{|l|c|c|c|}\cline{1-4}
\multicolumn{1}{|l|}{system}
                 & precision & recall  & F$_{\beta=1}$ \\\hline\hline
                 \citeauthor{carreras2001conll}  &    84.82\%  &   73.28\%  &   78.63 \\\hline
                 {\bf SP+Lex-WHF} & 70.89\%  & 65.57\% & 68.12 \\\hline
                  \citeauthor{tks2001conll} &   76.91\%  &   60.61\%  &   67.79 \\\hline
                 {\bf SP}       & 69.62\%    & 64.17\% & 66.79 \\\hline
                 \citeauthor{patrick2001conll}   &   73.75\%  &   60.00\%  &   66.17  \\\hline
                 \citeauthor{dejean2001conll}  &   72.56\%  &   54.55\%  &   62.77   \\\hline
                 \citeauthor{hammerton2001conll}  &   55.81\%  &   45.99\%  &   50.42 \\\hline  
                 
                \end{tabular}
                \end{center}
\caption{Clause-splitting results of the different systems presented in the
                 shared task of CoNLL-2001.}
\label{tab-rdo-claus}
\end{table}
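
As a reading aid for Table~\ref{tab-rdo-claus}, recall that $F_{\beta=1}$ is the harmonic mean of precision $P$ and recall $R$. Recomputing it from the rounded figures reported for {\bf SP+Lex-WHF} gives
\[
F_{\beta=1}=\frac{2PR}{P+R}=\frac{2\cdot 70.89\cdot 65.57}{70.89+65.57}\approx 68.13 ,
\]
which agrees with the tabulated value of 68.12 up to rounding; the small difference presumably arises because the tabulated value was computed from unrounded precision and recall.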


\section{Concluding remarks}

In this work, we have presented a technique that allows us to tackle different
natural language disambiguation tasks as tagging problems. In particular, we
have addressed the
shallow parsing and the clause identification problems. Using this technique,
the relevant information for each task can be determined. Thus, a specific task can be performed using a standard HMM-based tagger without modifying the learning and testing processes.

The results reported here show that the HMM approach performs in line with
other approaches that use more sophisticated learning methods when an
appropriate definition of the input and output vocabularies is
provided. Moreover, this approach maintains the efficiency of the system
throughout both the learning and the testing phases.

The specialization methods proposed are independent of the corpus and the
language used. The lexicalization criteria presented provide sets of very common words, such as words that belong to closed categories or words that appear frequently in the corpus. These selected words can also be expected to appear in other English corpora and, therefore,
the chunking or clause identification problem could be successfully solved for them using
this technique. Moreover, the criteria presented here are independent of the
language. This has been verified in previous work on the POS tagging
problem for English and Spanish corpora \citep{pla2001b,
  pla2001a}. Unfortunately, these aspects could not be tested for chunking and clause-detection due to the unavailability of other annotated corpora.

We think the method presented here can be improved in two aspects: the
selection of the features that have to be included in the input and output
vocabularies for each disambiguation task, and the selection of the words that
really improve the performance of the system. To do this, it would be
necessary to take into account not only statistical criteria, but linguistic
criteria as well.

Finally, because this technique does not require changing the
learning and tagging processes, we think that applying it
with other taggers based on different paradigms could be of interest.
 

\acks{
We would like to thank the reviewers for their helpful comments.
This work has been supported
      by the Spanish research projects CICYT TIC2000--0664--C02--01 and TIC2000--1599--C01--01.}


\vskip 0.2in

\begin{thebibliography}{42}
\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
\expandafter\ifx\csname url\endcsname\relax
  \def\url#1{{\tt #1}}\fi

\bibitem[Abney(1991)]{abney91}
S.~Abney.
\newblock {\em {Parsing by Chunks}}.
\newblock {R. Berwick, S. Abney and C. Tenny (eds.) Principle--based Parsing }.
  {Kluwer Academic Publishers}, {Dordrecht}, 1991.

\bibitem[Abney(1996)]{abney96b}
S.~Abney.
\newblock {Partial Parsing via Finite-State Cascades}.
\newblock In {\em {Proceedings of the ESSLLI'96 Robust Parsing Workshop}},
  {Prague, Czech Republic}, 1996.

\bibitem[A{\"\i}t-Mokhtar and Chanod(1997)]{mokhtar97}
S.~A{\"\i}t-Mokhtar and J.P. Chanod.
\newblock {Incremental Finite-State Parsing}.
\newblock In {\em {Proceedings of the 5th Conference on Applied Natural
  Language Processing}}, {Washington D.C., USA}, 1997.

\bibitem[Argamon et~al.(1998)Argamon, Dagan, and Krymolowski]{argamon98}
S.~Argamon, I.~Dagan, and Y.~Krymolowski.
\newblock {A Memory--based Approach to Learning Shallow Natural Language
  Patterns}.
\newblock In {\em {Proceedings of the joint 17th International Conference on
  Computational Linguistics and 36th Annual Meeting of the Association for
  Computational Linguistics, COLING-ACL}}, pages {67--73}, {Montr\'eal,
  Canada}, 1998.

\bibitem[Brants(2000)]{Brants:00a}
Thorsten Brants.
\newblock {TnT} -- a statistical part-of-speech tagger.
\newblock In {\em Proceedings of the Sixth Applied Natural Language Processing
  ({ANLP}-2000)}, Seattle, WA, 2000.

\bibitem[Brill(1995)]{brill95a}
E.~Brill.
\newblock {Transformation--based Error--driven Learning and Natural Language
  Processing: A Case Study in Part--of--speech Tagging}.
\newblock {\em {Computational Linguistics}}, 21\penalty0 (4):\penalty0
  {543--565}, 1995.

\bibitem[Carreras and M{\`a}rquez(2001)]{carreras2001conll}
Xavier Carreras and Lu{\'\i}s M{\`a}rquez.
\newblock Boosting trees for clause splitting.
\newblock In Walter Daelemans and R{\'e}mi Zajac, editors, {\em Proceedings of
  CoNLL-2001}, pages 73--75. Toulouse, France, 2001.

\bibitem[Church(1988)]{church88}
K.~W. Church.
\newblock {A Stochastic Parts Program and Noun Phrase Parser for Unrestricted
  Text}.
\newblock In {\em {Proceedings of the 1st Conference on Applied Natural
  Language Processing, ANLP}}, pages {136--143}. {ACL}, 1988.

\bibitem[Daelemans et~al.(1999)Daelemans, Buchholz, and Veenstra]{daelemans99}
W.~Daelemans, S.~Buchholz, and J.~Veenstra.
\newblock {Memory-Based Shallow Parsing}.
\newblock In {\em {Proceedings of EMNLP/VLC-99}}, pages {239--246}, {University
  of Maryland, USA}, {June} 1999.

\bibitem[Daelemans et~al.(1997)Daelemans, Van~den Bosch, and
  Weijters]{daelemans97}
W.~Daelemans, Antal Van~den Bosch, and T.~Weijters.
\newblock {\em {IGTree: Using Trees for Compression and Classification in Lazy
  Learning Algorithms}}.
\newblock {D. Aha (ed.), Artificial Intelligence Review 11, Special issue on
  Lazy Learning}. {Kluwer Academic Publishers}, 1997.

\bibitem[D\'{e}jean(2000)]{dejean00}
Herv\'{e} D\'{e}jean.
\newblock {Learning Syntactic Structures with XML}.
\newblock In {\em {Proceedings of CoNLL-2000 and LLL-2000}}, {Lisbon,
  Portugal}, {September} 2000.

\bibitem[D{\'e}jean(2001)]{dejean2001conll}
Herv{\'e} D{\'e}jean.
\newblock Using {ALLiS} for clausing.
\newblock In Walter Daelemans and R{\'e}mi Zajac, editors, {\em Proceedings of
  CoNLL-2001}, pages 64--66. Toulouse, France, 2001.

\bibitem[Hammerton(2001)]{hammerton2001conll}
James Hammerton.
\newblock Clause identification with long short-term memory.
\newblock In Walter Daelemans and R{\'e}mi Zajac, editors, {\em Proceedings of
  CoNLL-2001}, pages 61--63. Toulouse, France, 2001.

\bibitem[Johansson(2000)]{johansson00}
Christer Johansson.
\newblock {A Context Sensitive Maximum Likelihood Approach to Chunking}.
\newblock In {\em {Proceedings of CoNLL-2000 and LLL-2000}}, {Lisbon,
  Portugal}, {September} 2000.

\bibitem[Kim et~al.(1999)Kim, Lee, and Rim]{kim99}
J.D. Kim, S.Z. Lee, and H.C. Rim.
\newblock {HMM Specialization with Selective Lexicalization}.
\newblock In {\em {Proceedings of the joint SIGDAT Conference on Empirical
  Methods in Natural Language Processing and Very Large Corpora
  (EMNLP-VLC-99)}}, 1999.

\bibitem[Koeling(2000)]{koeling00}
Rob Koeling.
\newblock {Chunking with Maximum Entropy Models}.
\newblock In {\em {Proceedings of CoNLL-2000 and LLL-2000}}, {Lisbon,
  Portugal}, {September} 2000.

\bibitem[Kudo and Matsumoto(2000)]{kudo00}
Taku Kudo and Yuji Matsumoto.
\newblock {Use of Support Vector Learning for Chunk Identification}.
\newblock In {\em {Proceedings of CoNLL-2000 and LLL-2000}}, {Lisbon,
  Portugal}, {September} 2000.

\bibitem[Kudo and Matsumoto(2001)]{kudo2001}
Taku Kudo and Yuji Matsumoto.
\newblock {Chunking with Support Vector Machines}.
\newblock In {\em {Proceedings of NAACL 2001}}, {Pittsburgh, USA}, 2001.
  {Morgan Kaufmann Publishers}.
\newblock
  {http://cactus.aist-nara.ac.jp/\symbol{126}taku-ku/publications/naacl2001.ps%
}.

\bibitem[Lee et~al.(2000)Lee, Tsujii, and Rim]{sang-zoo00}
Sang-Zoo Lee, Jun'ichi Tsujii, and Hae-Chang Rim.
\newblock {Lexicalized Hidden Markov Models for Part-of-Speech Tagging}.
\newblock In {\em {Proceedings of 18th International Conference on
  Computational Linguistics}}, {Saarbr{\"u}cken, Germany}, {August} 2000.

\bibitem[Li and Roth(2001)]{li2001}
Xin Li and Dan Roth.
\newblock {Exploring Evidence for Shallow Parsing}.
\newblock In {\em {Proceedings of the 5th Conference on Computational Natural
  Language Learning (CoNLL-2001)}}, {Toulouse, France}, {July} 2001.

\bibitem[Merialdo(1994)]{merialdo94}
B.~Merialdo.
\newblock {Tagging English Text with a Probabilistic Model}.
\newblock {\em {Computational Linguistics}}, 20\penalty0 (2):\penalty0
  {155--171}, 1994.

\bibitem[Molina and Pla(2001)]{molina2001}
Antonio Molina and Ferran Pla.
\newblock {Clause detection using HMM}.
\newblock In {\em {Proceedings of the 5th Conference on Computational Natural
  Language Learning (CoNLL-2001)}}, {Toulouse, France}, {July} 2001.

\bibitem[Orasan(2000)]{orasan00}
Constantin Orasan.
\newblock {A hybrid method for clause splitting in unrestricted English texts}.
\newblock In {\em {Proceedings of ACIDCA'2000}}, {Monastir, Tunisia}, 2000.

\bibitem[Osborne(2000)]{osborne00}
Miles Osborne.
\newblock {Shallow Parsing as Part-of-Speech Tagging}.
\newblock In {\em {Proceedings of CoNLL-2000 and LLL-2000}}, {Lisbon,
  Portugal}, {September} 2000.

\bibitem[Patrick and Goyal(2001)]{patrick2001conll}
Jon~D. Patrick and Ishaan Goyal.
\newblock Boosted decision graphs for {NLP} learning tasks.
\newblock In Walter Daelemans and R{\'e}mi Zajac, editors, {\em Proceedings of
  CoNLL-2001}, pages 58--60. Toulouse, France, 2001.

\bibitem[Pla and Molina(2001)]{pla2001b}
Ferran Pla and Antonio Molina.
\newblock {Part-of-Speech Tagging with Lexicalized HMM}.
\newblock In {\em {Proceedings of the International Conference on Recent Advances
  in Natural Language Processing (RANLP2001)}}, {Tzigov Chark, Bulgaria},
  {September} 2001.

\bibitem[Pla et~al.(2000{\natexlab{a}})Pla, Molina, and Prieto]{pla2000a}
Ferran Pla, Antonio Molina, and Natividad Prieto.
\newblock { Tagging and Chunking with Bigrams}.
\newblock In {\em {Proceedings of the COLING--2000}}, {Saarbr\"ucken, Germany},
  {August} 2000{\natexlab{a}}.

\bibitem[Pla et~al.(2000{\natexlab{b}})Pla, Molina, and Prieto]{pla2000c}
Ferran Pla, Antonio Molina, and Natividad Prieto.
\newblock {Improving Chunking by means of Lexical-Contextual Information in
  Statistical Language Models}.
\newblock In {\em {Proceedings of CoNLL--2000}}, {Lisbon, Portugal},
  {September} 2000{\natexlab{b}}.

\bibitem[Pla et~al.(2001)Pla, Molina, and Prieto]{pla2001a}
Ferran Pla, Antonio Molina, and Natividad Prieto.
\newblock {Evaluaci\'on de un etiquetador morfosint\'actico basado en bigramas
  especializados para el castellano}.
\newblock {\em {Revista para el Procesamiento del Lenguaje Natural}}, 2001.

\bibitem[Ramshaw and Marcus(1995)]{ramshaw95}
L.~Ramshaw and M.~Marcus.
\newblock {Text Chunking Using Transformation-Based Learning}.
\newblock In {\em {Proceedings of third Workshop on Very Large Corpora}}, pages
  {82--94}, {June} 1995.
\newblock {ftp://ftp.cis.upenn.edu/pub/chunker/wvlcbook.ps.gz}.

\bibitem[Ratnaparkhi(1996)]{ratnaparkhi96}
A.~Ratnaparkhi.
\newblock {A Maximum Entropy Part--of--speech Tagger}.
\newblock In {\em {Proceedings of the 1st Conference on Empirical Methods in
  Natural Language Processing, EMNLP}}, 1996.

\bibitem[Ratnaparkhi(1998)]{ratnaparkhi98a}
A.~Ratnaparkhi.
\newblock {\em {Maximum Entropy Models for Natural Language Ambiguity
  Resolution}}.
\newblock {PhD thesis}, {University of Pennsylvania}, 1998.
\newblock {http://www.cis.upenn.edu/\symbol{126}adwait}.

\bibitem[Tjong Kim~Sang(2000)]{sang00c}
Erik~F. Tjong Kim~Sang.
\newblock {Text Chunking by System Combination}.
\newblock In {\em {Proceedings of CoNLL-2000 and LLL-2000}}, {Lisbon,
  Portugal}, {September} 2000.

\bibitem[Tjong Kim~Sang(2001)]{tks2001conll}
Erik~F. Tjong Kim~Sang.
\newblock Memory-based clause identification.
\newblock In Walter Daelemans and R{\'e}mi Zajac, editors, {\em Proceedings of
  CoNLL-2001}, pages 67--69. Toulouse, France, 2001.

\bibitem[Tjong Kim~Sang and Buchholz(2000)]{sang00d}
Erik~F. Tjong Kim~Sang and Sabine Buchholz.
\newblock {Introduction to the CoNLL-2000 Shared Task: Chunking}.
\newblock In {\em {Proceedings of CoNLL-2000 and LLL-2000}}, {Lisbon,
  Portugal}, {September} 2000.

\bibitem[Tjong Kim~Sang et~al.(2000)Tjong Kim~Sang, Daelemans, D{\'e}jean, Koeling,
  Krymolowski, Punyakanok, and Roth]{sang00b}
Erik~F. Tjong Kim~Sang, Walter Daelemans, Herv{\'e} D{\'e}jean, Rob Koeling, Yuval
  Krymolowski, Vasin Punyakanok, and Dan Roth.
\newblock {Applying System Combination to Base Noun Phrase Identification}.
\newblock In {\em {Proceedings of 18th International Conference on
  Computational Linguistics COLING'2000}}, pages {857--863}, {Saarbr{\"u}cken,
  Germany}, {August} 2000. {Morgan Kaufmann Publishers}.
\newblock {http://lcg-www.uia.ac.be/\symbol{126}erikt/papers/coling2000.ps}.

\bibitem[Tjong Kim~Sang and D{\'e}jean(2001)]{tksdj2001}
Erik~F. Tjong Kim~Sang and Herv{\'e} D{\'e}jean.
\newblock {Introduction to the CoNLL-2001 shared task: Clause identification}.
\newblock In {\em {Proceedings of the 5th Conference on Computational Natural
  Language Learning (CoNLL-2001)}}, {Toulouse, France}, {July} 2001.

\bibitem[Van~Halteren(2000)]{vanhalteren00}
Hans Van~Halteren.
\newblock {Chunking with WPDV Models}.
\newblock In {\em {Proceedings of CoNLL-2000 and LLL-2000}}, {Lisbon,
  Portugal}, {September} 2000.

\bibitem[Veenstra and Van~den Bosch(2000)]{veenstra00}
Jorn Veenstra and Antal Van~den Bosch.
\newblock {Single-Classifier Memory-Based Phrase Chunking}.
\newblock In {\em {Proceedings of CoNLL-2000 and LLL-2000}}, {Lisbon,
  Portugal}, {September} 2000.

\bibitem[Voutilainen(1993)]{voutilainen93}
Atro Voutilainen.
\newblock {NPTool, a Detector of English Noun Phrases}.
\newblock In {\em {Proceedings of the Workshop on Very Large Corpora}}. {ACL},
  {June} 1993.

\bibitem[Zhang et~al.(2001)Zhang, Damerau, and Johnson]{zhan01}
Tong Zhang, Fred Damerau, and David Johnson.
\newblock {Text chunking using regularized Winnow}.
\newblock In {\em {Proceedings of the Joint EACL-ACL Meeting (ACL2001)}},
  {Toulouse, France}, {July} 2001.

\bibitem[Zhou et~al.(2000)Zhou, Su, and Tey]{zhou00}
GuoDong Zhou, Jian Su, and TongGuan Tey.
\newblock {Hybrid Text Chunking}.
\newblock In {\em {Proceedings of CoNLL-2000 and LLL-2000}}, {Lisbon,
  Portugal}, {September} 2000.

\end{thebibliography}


\end{document}