\documentclass[twoside,11pt]{article}
\usepackage{jmlr2e}
\usepackage{latexsym}
\usepackage{psfig}

\newcommand{\ruleskip}{\bigskip\hrule\bigskip}

\newcommand{\R}{{\sf R\hspace*{-0.9ex}\rule{0.15ex}%
       {1.5ex}\hspace*{0.9ex}}}
\newcommand{\N}{{\sf N\hspace*{-1.0ex}\rule{0.15ex}%
       {1.3ex}\hspace*{1.0ex}}}
\newcommand{\Q}{{\sf Q\hspace*{-1.1ex}\rule{0.15ex}%
       {1.5ex}\hspace*{1.1ex}}}
\newcommand{\C}{{\sf C\hspace*{-0.9ex}\rule{0.15ex}%
       {1.3ex}\hspace*{0.9ex}}}

%\newcommand{\bitem}{\begin{itemize}\begin{verydenselist}}
%\newcommand{\eitem}{\end{verydenselist}\end{itemize}}
\newcommand{\bitem}{\begin{itemize}\denselist}
\newcommand{\eitem}{\end{itemize}}
\newcommand{\denselist}{\itemsep 0pt\partopsep 0pt}

\newcommand{\Real}{\makebox[.50ex][l]{I}\makebox{R}}
\newcommand{\Image}{Im \,}
\newcommand{\be}{\begin{equation}}
\newcommand{\ee}{\end{equation}}
\newcommand{\bx}{\vc{x}}
\newcommand{\bw}{\vc{w}}
\newcommand{\hatP}{\hat{P}}
\newcommand{\hatp}{\hat{p}}
\newcommand{\dimension}{d}
\newcommand{\setD}{\mathcal{D}}
\newcommand{\setF}{\mathcal{F}}
\newcommand{\setH}{\mathcal{H}}
\newcommand{\setP}{\mathcal{P}}
\newcommand{\setV}{\mathcal{V}}
\newcommand{\setW}{\mathcal{W}}
\newcommand{\setX}{\mathcal{X}}

\newcommand{\hmax}{h^*}

\newcommand{\generalscore}[2]{\mbox{\textit{error}}({#1}\ :\ {#2})}
\newcommand{\score}[1] {\mbox{\textit{error}}_\sigma(#1)}
\newcommand{\pointscore}[1]{\mbox{\textit{score}}_\sigma(\vc{#1})}
\newcommand{\diff}[2]{\lambda_#2^#1}
\newcommand{\vc}[1]{\mathbf{#1}}
\newcommand{\proofend}[0]{\begin{flushright}$\Box$ \end{flushright}}
\newcommand{\expo}[1]{e^{#1}}
\newcommand{\margin}[1]{\mbox{\textit{margin}}(#1)}
\newcommand{\cdist}[1]{\mbox{\textit{cdist}}(#1)}
\newcommand{\wrt}[1]{\, d{#1}}
\newcommand{\innerprod}[2]{#1 \cdot #2}
\newcommand{\modulus}[1]{\|#1\|}
\newcommand{\area}[1]{\mbox{\textit{Area}}(#1)}

\newcommand{\Newsgroups}{\mathsf{Newsgroups}}
\newcommand{\Reuters}{\mathsf{Reuters}}

\newcommand{\Hybrid}{\mathsf{Hybrid}}
\newcommand{\MaxMin}{\mathsf{MaxMin}}
\newcommand{\MaxRatio}{\mathsf{Ratio}}
\newcommand{\Random}{\mathsf{Random}}
\newcommand{\BalancedRandom}{\mathsf{BalancedRandom}}
\newcommand{\Simple}{\mathsf{Simple}}

\newcommand{\commentout}[1]{}
\newcommand{\ignore}[1]{}
%\newtheorem{lemma}{Lemma}[section]
%\newtheorem{corollary}[lemma]{Corollary}
%\newtheorem{theorem}[lemma]{Theorem}
%\newtheorem{definition}[lemma]{Definition}
%\newtheorem{proposition}[lemma]{Proposition}
%\newtheorem{conjecture}[lemma]{Conjecture}
%\newtheorem{condition}[lemma]{Condition}
\newtheorem{fig}[figure]{Figure}

\newenvironment{verydenselist}{\begin{list}{--}%
{\setlength{\itemsep}{0ex}
\setlength{\topsep}{0ex}
\setlength{\parsep}{0pt}
\setlength{\itemindent}{0pt}
\setlength{\leftmargin}{1em}
\setlength{\partopsep}{0pt}}}%
{\end{list}}

\jmlrheading{}{2001}{}{10/01}{11/01}{Simon Tong and Daphne Koller}
\ShortHeadings{SVM Active Learning with Applications to Text Classification}{Tong and Koller}
\firstpageno{45}

\begin{document}

\title{Support Vector Machine Active Learning \\ with Applications
to Text Classification} 
\author{\name Simon Tong \email simon.tong@cs.stanford.edu \\
\name Daphne Koller \email koller@cs.stanford.edu \\
\addr Computer Science Department \\ 
Stanford University \\
Stanford CA 94305-9010, USA
}

\editor{Leslie Pack Kaelbling}

\maketitle

\begin{abstract}%
Support vector machines have met with significant success in numerous
real-world learning tasks.  However, like most machine learning
algorithms, they are generally applied using a randomly selected
training set classified in advance.  In many settings, we also have
the option of using {\em pool-based active learning}. Instead of using
a randomly selected training set, the learner has access to a pool of
unlabeled instances and can request the labels for some number of
them. We introduce a new algorithm for performing active learning
with support vector machines, i.e., an algorithm for choosing which
instances to request next. We provide a theoretical motivation for the
algorithm using the notion of a {\em version space}.  We present
experimental results showing that employing our active learning method
can significantly reduce the need for labeled training instances in
both the standard inductive and transductive settings.
\end{abstract}

\begin{keywords}
Active Learning, Selective Sampling, Support Vector
Machines, Classification, Relevance Feedback
\end{keywords}

\section{Introduction} \label{Introduction}
 In many supervised learning tasks, labeling instances to create a
training set is time-consuming and costly; thus, finding ways to
minimize the number of labeled instances is beneficial. Usually, the
training set is chosen to be a random sampling of instances. However,
in many cases {\em active learning} can be employed.  Here, the
learner can actively choose the training data. It is hoped that
allowing the learner this extra flexibility will reduce the learner's
need for large quantities of labeled data.

{\em Pool-based} active learning for classification was introduced by 
\citet{Lewis:SIGIR94}.
The learner has access to a pool of unlabeled data and can request
the true class label for a certain number of instances in the pool. In
many domains this is a reasonable approach since a large quantity of
unlabeled data is readily available. The main issue with active
learning is finding a way to choose good requests or {\em queries}
from the pool.

Examples of situations in which pool-based active learning 
can be employed are:

\begin{itemize}
\item{\bf Web searching.} A Web-based company wishes to search the
web for particular types of pages (e.g., pages containing lists of
journal publications). It employs a number of people to hand-label
some web pages so as to create a training set for an automatic
classifier that will eventually be used to classify the rest of the
web. Since human expertise is a limited resource, the company wishes to
reduce the number of pages the employees have to label. Rather than
labeling pages randomly drawn from the web, the computer requests
targeted pages that it believes will be most informative to label.

\item{\bf Email filtering.} The user wishes to create a personalized
automatic junk email filter. In the learning phase the automatic
learner has access to the user's past email files. It interactively
brings up past email and asks the user whether the displayed email
is junk mail or not. Based on the user's answer it brings up another
email and queries the user. The process is repeated some number of
times and the result is an email filter tailored to that specific
person.

\item {\bf Relevance feedback.} The user wishes to sort through a
database or website for items (images, articles, etc.) that are of
personal interest---an ``I'll know it when I see it'' type of
search. The computer displays an item and the user tells the
learner whether the item is interesting or not. Based on the user's
answer, the learner brings up another item from the database. After
some number of queries the learner then returns a number of items
in the database that it believes will be of interest to the user.
\end{itemize}

The first two examples involve {\em induction}. The goal is to create
a classifier that works well on unseen future instances. The third
example is an example of {\em transduction}\citep{Vapnik:1998}. The
learner's performance is assessed on the remaining instances in the
database rather than a totally independent test set.

We present a new algorithm that performs pool-based active
learning with support vector machines (SVMs). We provide theoretical
motivations for our approach to choosing the queries, together with
experimental results showing that active learning with SVMs can
significantly reduce the need for labeled training instances.

We shall use text classification as a running example throughout this
paper. This is the task of determining to which pre-defined topic a
given text document belongs. Text classification has an important role
to play, especially with the recent explosion of readily available
text data. There have been many approaches to achieve this
goal~\citep{Rocchio,Dumais+al:CIKM98,Sebastiani:2001}. Furthermore, it is also a domain
in which SVMs have shown notable
success~\citep{Joachims:ECML97,Dumais+al:CIKM98} and it is of interest
to see whether active learning can offer further improvement over this
already highly effective method.

The remainder of the paper is structured as follows.
Section~\ref{SVMs} discusses the use of SVMs both in terms of
induction and transduction. Section~\ref{Version_Space} then
introduces the notion of a {\em version space} and
Section~\ref{Active} provides theoretical motivation for three methods
for performing active learning with SVMs. In Section~\ref{Experiments}
we present experimental results for two real-world text domains that
indicate that active learning can significantly reduce the need for
labeled instances in practice. We conclude in
Section~\ref{Conclusions} with some discussion of the potential
significance of our results and some directions for future work.

\ignore{
\section{Text Classification} \label{Text_Classification}
Rather than working directly with the raw text, learners
typically work with features that are extracted from the
document. Usually, the ordering of the words within each document is
ignored and the features are chosen to be particular words.

Sometimes some preprocessing of the documents is done. Common words on
a stop list (such as ``to'', ``it'', ``and'') are ignored since they
provide little discriminative information. Also, words in the
documents are stemmed so that, for example, ``acquire'', ``acquiring'',
``acquired'' all get mapped to the same stem~\citep{Porter}. One other
form of preprocessing is similar to stop list removal, but more
extreme. One can perform feature selection and remove all words that
are not ``informative'' with respect to the particular set of
pre-defined topics~\citep{Yang+Pedersen:ICML97}. We will not be
considering this last form of preprocessing.

Given a set of $n$ documents a typical representation for documents is
via {\em TFIDF} weighting~\citep{Salton+Buckley}. Each document is
represented by a fixed length vector $\vc{x}_i$ of dimension
$\dimension$. Each one of the $\dimension$ features, $w_j$,
corresponds to a particular word (for example $w_1$ may correspond to
the word ``dog''). The vocabulary of $d$ words is often chosen to be
the words occurring in the entire set of (preprocessed)
documents. Given a document, we construct the value for the j-th
component of its corresponding vector $\vc{x}_i$ as follows: let
$TF(w_j)$ be the number of times the word $w_j$ occurs in the
document. Let $IDF(w_j) = log(n/N_j)$ where $N_j$ is the number of
documents that contain the word $w_j$. Then give the j-th component of
$\vc{x}_i$ a value of $TF(w_j).IDF(w_j)$.

Intuitively, $w_j$ is given a large value for a particular document if
that word occurs many times in the document and very rarely in the
other documents.
}

\section{Support Vector Machines} \label{SVMs}
Support vector machines~\citep{Vapnik:1982} have strong theoretical
foundations and excellent empirical successes. They have been applied
to tasks such as handwritten digit recognition, object recognition,
and text classification.

\subsection{SVMs for Induction}
\begin{figure*}
\centerline{
\begin{tabular}{cc}
\psfig{figure=Figures/svm2.eps,height=2.8in,angle=90} &
\psfig{figure=Figures/tsvm.eps,height=2.8in,angle=90} \\
(a) & (b)
\end{tabular}}
\vspace{-1ex}
\caption{(a) A simple linear support vector machine. (b) A SVM (dotted line) and a transductive SVM (solid line). Solid circles represent unlabeled instances.}
\label{fig:max_margin}
\label{fig:tsvm}
\end{figure*}

We shall consider SVMs in the binary classification setting. We are
given training data $\{ \vc{x}_1 \ldots \vc{x}_n \}$ that are vectors
in some space $\setX \subseteq \R^\dimension$. We are also given their
labels $\{ y_1 \ldots y_n \}$ where $y_i \in \{ -1,1 \}$. In their
simplest form, SVMs are hyperplanes that separate the training data by
a maximal margin (see Fig.~\ref{fig:max_margin}a) . All vectors
lying on one side of the hyperplane are labeled as $-1$, and all
vectors lying on the other side are labeled as 1. The training
instances that lie closest to the hyperplane are called {\em support
vectors}. More generally, SVMs allow one to project the original
training data in space $\setX$ to a higher dimensional feature space
$\setF$ via a Mercer kernel operator $K$. In other words, we consider
the set of classifiers of the form:

\be
\label{eq:kernel_classifier}
f(\vc{x}) = \left( \sum_{i=1}^n \alpha_i K(\vc{x}_i, \vc{x}) \right).
\ee
When $K$ satisfies Mercer's condition~\citep{Burges:1998} we can write:
$K(\vc{u}, \vc{v}) = \innerprod{\Phi(\vc{u})}{\Phi(\vc{v})}$ where
$\Phi:\setX \rightarrow \setF$ and ``$\innerprod{}{}$'' denotes an inner
product. We can then rewrite $f$ as:
\be
f(\vc{x}) = \innerprod{\vc{w}}{\Phi(\vc{x})}, 
\mbox{ where } \vc{w} = \sum_{i=1}^n \alpha_i \Phi(\vc{x}_i).
\label{eq:kernel_hyperplane}
\ee

Thus, by using $K$ we are implicitly projecting the training data
into a different (often higher dimensional) feature space $\setF$.
The SVM then computes the $\alpha_i$s that correspond to the maximal
margin hyperplane in $\setF$. By choosing different kernel functions
we can implicitly project the training data from $\setX$ into spaces
$\setF$ for which hyperplanes in $\setF$ correspond to more complex
decision boundaries in the original space $\setX$. 

Two commonly used kernels are the polynomial kernel given by
$K(\vc{u}, \vc{v}) = (\innerprod{\vc{u}}{\vc{v}} + 1)^p$ which induces
polynomial boundaries of degree $p$ in the original space
$\setX$\footnote{We have not introduced a bias weight in
Eq.~(\ref{eq:kernel_hyperplane}). Thus, the simple Euclidean inner
product will produce hyperplanes that pass through the
origin. However, a polynomial kernel of degree one induces
hyperplanes that do not need to pass through the origin.}  and the
radial basis function kernel $K(\vc{u}, \vc{v}) = (e^{-\gamma
\innerprod{(\vc{u}-\vc{v})}{(\vc{u}-\vc{v})}})$ which induces
boundaries by placing weighted Gaussians upon key training
instances. For the majority of this paper we will assume that the
modulus of the training data feature vectors are constant, i.e., for
all training instances $\vc{x}_i$, $\modulus{\Phi(\vc{x}_i)} =
\lambda$ for some fixed $\lambda$. The quantity
$\modulus{\Phi(\vc{x}_i)}$ is always constant for radial basis
function kernels, and so the assumption has no effect for this
kernel. For $\modulus{\Phi(\vc{x}_i)}$ to be constant with the
polynomial kernels we require that $\modulus{\vc{x}_i}$ be
constant. It is possible to relax this constraint on $\Phi(\vc{x}_i)$
and we shall discuss this at the end of Section~\ref{Active}.

\subsection{SVMs for Transduction}

The previous subsection worked within the framework of {\em
induction}. There was a labeled training set of data and the task was
to create a classifier that would have good performance on {\em
unseen} test data.  In addition to regular induction, SVMs can also be
used for {\em transduction}. Here we are first given a set of both
labeled and unlabeled data. The learning task is to assign labels to
the unlabeled data as accurately as possible.  SVMs can perform
transduction by finding the hyperplane that maximizes the margin
relative to both the labeled and unlabeled data. See
Figure~\ref{fig:tsvm}b for an example. Recently, {\em transductive
SVMs} (TSVMs) have been used for text
classification~\citep{Joachims:ICML99}, attaining some improvements in
precision/recall breakeven performance over regular inductive SVMs.

\section{Version Space} \label{Version_Space}
\begin{figure*}[t]
\centerline{
\begin{tabular}{cc}
\psfig{figure=Figures/vspacehq1.eps,height=2.8in} &
\psfig{figure=Figures/vspacehq2.eps,height=2.8in} \\
(a) & (b)
\end{tabular}}
\caption{(a) Version space duality. The surface of the hypersphere
represents unit weight vectors. Each of the two hyperplanes
corresponds to a labeled training instance. Each hyperplane restricts
the area on the hypersphere in which consistent hypotheses can
lie. Here, the version space is the surface segment of the hypersphere
closest to the camera. (b) An SVM classifier in a version space. The
dark embedded sphere is the largest radius sphere whose center lies in
the version space and whose surface does not intersect with the
hyperplanes. The center of the embedded sphere corresponds to the SVM,
its radius is proportional to the margin of the SVM in $\setF$, and the
training points corresponding to the hyperplanes that it touches are
the support vectors.}
\label{fig:duality}
\end{figure*}

Given a set of labeled training data and a Mercer kernel $K$, there is
a set of hyperplanes that separate the data in the induced feature
space $\setF$. We call this set of consistent hypotheses the {\em
version space}~\citep{Mitchell:AI82}. In other words, hypothesis $f$ is in
version space if for every training instance $\vc{x}_i$ with label
$y_i$ we have that $f(\vc{x}_i) > 0$ if $y_i = 1$ and $f(\vc{x}_i) <
0$ if $y_i = -1$. More formally:

\begin{definition} 
\label{def:version_space}
Our set of possible hypotheses is given as:
\[
\setH = \left \{ f \mid f(\vc{x}) = \frac{\innerprod{\vc{w}}{\Phi(\vc{x})}}{\modulus{\vc{w}}} \mbox{ where $\vc{w} \in \setW$ } \right \},
\]
where our {\em parameter space} $\setW$ is simply equal to $\setF$.
The {\em version space}, $\setV$ is then defined as:
\[
\setV = \{ f \in \setH \mid \forall i \in \{1 \ldots n \} \ \  y_i f(\vc{x}_i) > 0 \}.
\]
Notice that since $\setH$ is a set of hyperplanes, there is a bijection
between unit vectors $\vc{w}$ and hypotheses $f$ in $\setH$. Thus we will
redefine $\setV$ as:
\[
\setV = \{ \vc{w} \in \setW \mid \modulus{\vc{w}} = 1, \
y_i(\innerprod{\vc{w}}{\Phi(\vc{x}_i)}) > 0, i=1 \ldots n \}.
\]
\end{definition} 

Note that a version space only exists if the {\em training} data are
linearly separable in the feature space. Thus, we require linear
separability of the training data in the feature space.  This
restriction is much less harsh than it might at first seem. First, the
feature space often has a very high dimension and so in many cases it
results in the data set being linearly separable. Second, as noted by
Shawe-Taylor and Cristianini~(\citeyear{ShaweTaylor:COLT99}), it is
possible to modify any kernel so that the data in the new induced
feature space is linearly separable%
\footnote{This is done by redefining for all training instances
$\vc{x}_i$: $K(\vc{x}_i,\vc{x}_i) \leftarrow K(\vc{x}_i,\vc{x}_i) +
\nu$ where $\nu$ is a positive regularization constant. This
essentially achieves the same effect as the soft margin error
function~\citep{Cortes+Vapnik:ML95} commonly used in SVMs.  It permits
the training data to be linearly non-separable in the original feature
space.}.

There exists a duality between the feature space $\setF$ and the
parameter space $\setW$~\citep{Vapnik:1998,Herbrich+al} which we shall
take advantage of in the next section: points in $\setF$ correspond to
hyperplanes in $\setW$ and {\em vice versa}.

By definition, points in $\setW$ correspond to hyperplanes in
$\setF$. The intuition behind the converse is that observing a
training instance $\vc{x}_i$ in the feature space restricts the set of
separating hyperplanes to ones that classify $\vc{x}_i$ correctly. In
fact, we can show that the set of allowable points $\vc{w}$ in $\setW$
is restricted to lie on one side of a hyperplane in $\setW$.  More
formally, to show that points in $\setF$ correspond to hyperplanes in
$\setW$, suppose we are given a new training instance $\vc{x}_i$ with
label $y_i$. Then any separating hyperplane must satisfy
$y_i(\innerprod{\vc{w}}{\Phi(\vc{x}_i)}) > 0$. Now, instead of viewing
$\vc{w}$ as the normal vector of a hyperplane in $\setF$, think of
$\Phi(\vc{x}_i)$ as being the normal vector of a hyperplane in
$\setW$. Thus $y_i(\innerprod{\vc{w}}{\Phi(\vc{x}_i)}) > 0$ defines a
half space in $\setW$.  Furthermore
$\innerprod{\vc{w}}{\Phi(\vc{x}_i}) = 0$ defines a hyperplane in
$\setW$ that acts as one of the boundaries to version space
$\setV$. Notice that the version space is a connected region on the
surface of a hypersphere in parameter space. See
Figure~\ref{fig:duality}a for an example.

SVMs find the hyperplane that maximizes the margin in the feature space
$\setF$. One way to pose this optimization task is as follows:
\begin{eqnarray*}
\mbox{maximize}_{\vc{w} \in \setF} & \ \ \min_{i} \{y_i(\innerprod{\vc{w}}{\Phi(\vc{x}_i}))\} \\
\mbox{subject to: } & \modulus{\vc{w}} = 1 \\
& y_i (\innerprod{\vc{w}}{\Phi(\vc{x}_i)}) > 0 \ \ i = 1 \ldots n.
\end{eqnarray*}
By having the conditions $\modulus{\vc{w}} = 1$ and $y_i
(\innerprod{\vc{w}}{\Phi(\vc{x}_i)}) > 0$ we cause the solution to lie
in the version space.  Now, we can view the above problem as finding the
point $\vc{w}$ in the version space that maximizes the distance: $\min_{i}
\{y_i(\innerprod{\vc{w}}{\Phi(\vc{x}_i)})\}$. From the duality between
feature and parameter space, and since $\modulus{\Phi(\vc{x}_i)} =
\lambda$ , each ${\Phi(\vc{x}_i)}/{\lambda}$ is a unit
normal vector of a hyperplane in parameter space. Because of the
constraints $y_i(\innerprod{\vc{w}}{\Phi(\vc{x}_i)}) > 0 \ \ i = 1
\ldots n$ each of these hyperplanes delimit the version space. The expression
$y_i (\innerprod{\vc{w}}{\Phi(\vc{x}_i)})$ can be regarded as:
\[
\lambda \times \mbox{ the distance between the point } \vc{w}
\mbox{ and the hyperplane with normal vector } \Phi(\vc{x}_i).
\]

Thus, we want to find the point $\vc{w}^*$ in the version space that
maximizes the minimum distance to any of the delineating
hyperplanes. That is, SVMs find the center of the largest radius
hypersphere whose center can be placed in the version space and whose
surface does not intersect with the hyperplanes corresponding to the
labeled instances, as in Figure~\ref{fig:duality}b.

The normals of the hyperplanes that are touched by the maximal radius
hypersphere are the $\Phi(\vc{x}_i)$ for which the distance
$y_i (\innerprod{\vc{w}^*}{\Phi(\vc{x}_i)})$ is minimal. Now, taking the
original rather than the dual view, and regarding $\vc{w}^*$ as the unit
normal vector of the SVM and $\Phi(\vc{x}_i)$ as points in feature
space, we see that the hyperplanes that are touched by the maximal
radius hypersphere correspond to the support vectors (i.e., the
labeled points that are closest to the SVM hyperplane boundary).

The radius of the sphere is the distance from the center of the sphere
to one of the touching hyperplanes and is given by $y_i
(\innerprod{\vc{w}^*}{{\Phi(\vc{x}_i)}/{\lambda}})$ where
$\Phi(\vc{x}_i)$ is a support vector.  Now, viewing $\vc{w}^*$ as a
unit normal vector of the SVM and $\Phi(\vc{x}_i)$ as points in
feature space, we have that the distance $y_i
(\innerprod{\vc{w}^*}{{\Phi(\vc{x}_i)}}/{\lambda})$ is:
%{\footnotesize
\[
\frac{1}{\lambda} \times \mbox{ the distance between support vector } 
\Phi(\vc{x}_i) \mbox{ and the hyperplane with normal vector }\vc{w},
\]
%}
which is the margin of the SVM divided by $\lambda$. Thus, the radius
of the sphere is proportional to the margin of the SVM.

\section{Active Learning} \label{Active}
In pool-based active learning we have a pool of unlabeled
instances. It is assumed that the instances $\vc{x}$ are independently
and identically distributed according to some underlying distribution
 $F(\vc{x})$ and the labels are distributed according to
some conditional distribution $P(y \mid \vc{x})$.

Given an unlabeled pool $U$, an {\em active learner} $\ell$ has three
components: $(f,q,X)$. The first component is a classifier, $f:\setX
\rightarrow \{-1,1\}$, trained on the current set of labeled data $X$
(and possibly unlabeled instances in $U$ too). The second component
$q(X)$ is the querying function that, given a current labeled set $X$,
decides which instance in $U$ to query next. The active learner can
return a classifier $f$ after each query ({\em online learning}) or
after some fixed number of queries.

The main difference between an active learner and a passive learner is
the querying component $q$. This brings us to the issue of how to
choose the next unlabeled instance to query. Similar to
\citet{Seung+al:COLT92}, we use an approach that queries points so as
to attempt to reduce the size of the version space as much as
possible. We take a myopic approach that greedily chooses the next
query based on this criterion. We also note that myopia is a standard
approximation used in sequential decision making
problems~\cite{Horvitz+Rutledge:91,Latombe:1991,Heckerman+al:94}.  We
need two more definitions before we can proceed:

\begin{definition}
$\area{\setV}$ is the surface area that the version space $\setV$ occupies
on the hypersphere $\modulus{\vc{w}}=1$.
\end{definition}

\begin{definition}
Given an active learner $\ell$, let $\setV_i$ denote the version space
of $\ell$ after $i$ queries have been made. Now, given the $(i+1)$th
query $\vc{x}_{i+1}$, define:
\begin{eqnarray*}
\setV_i^{-} & = & \setV_i \cap \{ \vc{w} \in \setW \mid
-(\innerprod{\vc{w}}{\Phi(\vc{x}_{i+1})}) > 0 \}, \\
\setV_i^{+} & = & \setV_i \cap \{ \vc{w} \in \setW \mid
+(\innerprod{\vc{w}}{\Phi(\vc{x}_{i+1})}) > 0 \}.
\end{eqnarray*}
So $\setV_i^{-}$ and $\setV_i^{+}$ denote the resulting version spaces when
the next query $\vc{x}_{i+1}$ is labeled as $-1$ and $1$
respectively. 
\end{definition}

We wish to reduce the version space as fast as possible. Intuitively,
one good way of doing this is to choose a query that halves the
version space. The follow lemma says that, for any given number of
queries, the learner that chooses successive queries that halves
the version spaces is the learner that minimizes the maximum
expected size of the version space, where the maximum is taken over
all conditional distributions of $y$ given $\vc{x}$:
\pagebreak
\begin{lemma}
\label{lemma:ideal_learning}
Suppose we have an input space $\setX$, finite dimensional feature
space $\setF$ (induced via a kernel $K$), and parameter space
$\setW$. Suppose active learner $\ell^*$ always queries instances whose
corresponding hyperplanes in parameter space $\setW$ halves the area
of the current version space. Let $\ell$ be any other active
learner. Denote the version spaces of $\ell^*$ and $\ell$ after $i$
queries as $\setV^*_{i}$ and $\setV_{i}$ respectively. Let $\setP$
denote the set of all conditional distributions of $y$ given
$\vc{x}$. Then,
\[\forall i \in \N^+ \ 
\sup_{P \in \setP} E_P[\area{\setV_i^*}] \leq \sup_{P \in \setP}
E_P[\area{\setV_i}],
\]
with strict inequality whenever there exists a query $j \in \{1 \ldots
 i \}$ by $\ell$ that does not halve version space $V_{j-1}$.
\end{lemma}

{\em Proof.} The proof is straightforward. The learner, $\ell^*$
always chooses to query instances that halve the version space. Thus
$\area{\setV_{i+1}^*} = \frac{1}{2} \area{\setV_{i}^*}$ no matter what
the labeling of the query points are. Let $r$ denote the dimension of
feature space $\setF$. Then $r$ is also the dimension of the parameter
space $\setW$. Let $S_r$ denote the surface area of the unit
hypersphere of dimension $r$. Then, under any conditional distribution
$P$, $\area{\setV_{i}^*} = {S_r}/{2^i}$.

Now, suppose $\ell$ does not always query an instance that halves the
area of the version space. Then after some number, $k$, of queries
$\ell$ first chooses to query a point $\vc{x}_{k+1}$ that does not
halve the current version space $\setV_{k}$. Let $y_{k+1} \in
\{-1,1\}$ correspond to the labeling of $\vc{x}_{k+1}$ that will cause
the larger half of the version space to be chosen. 

Without loss of generality assume $\area{\setV_k^{-}} >
\area{\setV_k^{+}}$ and so $y_{k+1} = -1$. Note that
$\area{\setV_k^{-}} + \area{\setV_k^{+}} = {S_r}/2^{k}$,
so we have that $\area{\setV_k^{-}} > {S_r}/{2^{k+1}}$. 

Now consider the conditional distribution $P_0$:
\[
P_0(-1 \mid \vc{x}) = \left\{ \begin{array}{ll}
			\frac{1}{2} & \mbox{if $\vc{x} \neq \vc{x}_{k+1}$} \\
			1 & \mbox{if $\vc{x} = \vc{x}_{k+1}$}
			\end{array} \right. .
\]
Then under this distribution, $\forall i > k$,
\[
E_{P_0}[\area{\setV_i}] = \frac{1}{{2^{i-k-1}}}\area{\setV_k^{-}} > \frac{S_r}{{2^{i}}}.
\]
Hence, $\forall i > k$,
\[\sup_{P \in \setP} E_P[\area{\setV_i^*}] > \sup_{P \in \setP}
E_P[\area{\setV_i}].
\]
\proofend

Now, suppose $\bw^* \in \setW$ is the unit parameter vector
corresponding to the SVM that we would have obtained had we known the
actual labels of {\em all} of the data in the pool. We know that
$\bw^*$ must lie in each of the version spaces $\setV_1 \supset
\setV_2 \supset \setV_3 \ldots$, where $\setV_i$ denotes the version
space after $i$ queries. Thus, by shrinking the size of the version
space as much as possible with each query, we are reducing as fast as
possible the space in which $\bw^*$ can lie. Hence, the SVM that we
learn from our limited number of queries will lie close to $\bw^*$.

If one is willing to assume that there is a hypothesis lying within
$\setH$ that generates the data and that the generating hypothesis is
deterministic and that the data are noise free, then strong
generalization performance properties of an algorithm that halves
version space can also be shown~\citep{Freund+al:ML97}. For example one
can show that the generalization error decreases exponentially with
the number of queries.

This discussion provides motivation for an approach where we query
instances that split the current version space into two equal parts as
much as possible. Given an unlabeled instance $\vc{x}$ from the pool,
it is not practical to explicitly compute the sizes of the new version
spaces $\setV^{-}$ and $\setV^{+}$ (i.e., the version spaces obtained
when $\vc{x}$ is labeled as $-1$ and $+1$ respectively).
We next present three ways of approximating this procedure.

\ignore{
\begin{figure}
\psfig{figure=Figures/simple1.stanford.eps,height=2.2in,angle=270}
\caption{$\Simple$ Margin Method. 
In this stylized picture we have
flattened out the surface of the unit weight vector hypersphere that
appears in~\ref{fig:duality}a. The white area is the version space
$\setV_i$, which is bounded by solid hyperplanes corresponding to
labeled instances. The five dotted lines represent unlabeled instances
in the pool. The circle represents the largest radius hypersphere that
an fit in the version space. Note that the edges of the circle do not
touch the solid hyperplanes---just as the dark sphere
in~\ref{fig:duality}b does not meet the hyperplanes on the surface
of the larger hypersphere (they meet somewhere under the surface).
Instance $\vc{b}$ is the closest to the SVM $\vc{w}_i$ and so we will
choose to query $\vc{b}$.
}
\label{fig:simple}
\end{figure}
}

\begin{figure*}[t]
\centerline{
\begin{tabular}{cc}
\psfig{figure=Figures/simple1.stanford.eps,height=1.8in} &
\psfig{figure=Figures/simple2.stanford.eps,height=1.8in} \\
(a) & (b)
\end{tabular}}
\caption{(a) $\Simple$ Margin will query $\vc{b}$. (b) $\Simple$ Margin will query $\vc{a}$.}
\label{fig:simple}
\label{fig:simple2}
\end{figure*}

\begin{figure*}[t]
\centerline{
\begin{tabular}{cc}
\psfig{figure=Figures/maxmin.stanford.eps,height=1.8in} &
\psfig{figure=Figures/ratio.stanford.eps,height=1.8in} \\
(a) & (b)
\end{tabular}}
\caption{(a) $\MaxMin$ Margin will
query $\vc{b}$. The two SVMs with margins $m^{-}$ and $m^{+}$
for $\vc{b}$ are shown. (b) $\MaxRatio$ Margin will query $\vc{e}$. The two
SVMs with margins $m^{-}$ and $m^{+}$ for $\vc{e}$ are shown.
}
\label{fig:maxMargin}
\label{fig:maxRatio}
\end{figure*}

\begin{itemize}
\item {\bf Simple Margin.} Recall from section~\ref{Version_Space}
that, given some data $\{ \vc{x}_1 \ldots \vc{x}_i \}$ and labels $\{
y_1 \ldots y_i \}$, the SVM unit vector $\vc{w}_i$ obtained from this
data is the center of the largest hypersphere that can fit inside the
current version space $\setV_i$. The position of $\vc{w}_i$ in the
version space $\setV_i$ clearly depends on the shape of the region
$\setV_i$, however it is often approximately in the center of the
version space. Now, we can test each of the unlabeled instances
$\vc{x}$ in the pool to see how close their corresponding hyperplanes
in $\setW$ come to the centrally placed $\vc{w}_i$. The closer a
hyperplane in $\setW$ is to the point $\vc{w}_i$, the more centrally
it is placed in the version space, and the more it bisects the version
space. Thus we can pick the unlabeled instance in the pool whose
hyperplane in $\setW$ comes closest to the vector $\vc{w}_i$. For each
unlabeled instance $\vc{x}$, the shortest distance between its
hyperplane in $\setW$ and the vector $\vc{w}_i$ is simply the distance
between the feature vector $\Phi(\vc{x})$ and the hyperplane
$\vc{w}_i$ in $\setF$---which is easily computed by
$|\innerprod{\vc{w}_i}{\Phi(\vc{x})}|$.  This results in the natural
rule: learn an SVM on the existing labeled data and choose as the next
instance to query the instance that comes closest to the hyperplane in
$\setF$.

Figure~\ref{fig:simple}a presents an illustration. In the stylized
picture we have flattened out the surface of the unit weight vector
hypersphere that appears in Figure~\ref{fig:duality}a. The white area is
version space $\setV_i$ which is bounded by solid lines corresponding
to labeled instances. The five dotted lines represent unlabeled
instances in the pool. The circle represents the largest radius
hypersphere that can fit in the version space. Note that the edges of the
circle do not touch the solid lines---just as the dark sphere
in~\ref{fig:duality}b does not meet the hyperplanes on the surface
of the larger hypersphere (they meet somewhere under the surface).
The instance $\vc{b}$ is closest to the SVM $\vc{w}_i$ and so we will
choose to query $\vc{b}$.

\item {\bf MaxMin Margin.} The $\Simple$ Margin method can be a rather
rough approximation. It relies on the assumption that the version
space is fairly symmetric and that $\vc{w}_i$ is centrally placed. It
has been demonstrated, both in theory and practice, that these
assumptions can fail significantly~\citep{Herbrich+al}. Indeed, if we
are not careful we may actually query an instance whose hyperplane
does not even intersect the version space. The $\MaxMin$ approximation
is designed to overcome these problems to some degree. Given some data
$\{ \vc{x}_1 \ldots \vc{x}_i \}$ and labels $\{y_1 \ldots y_i \}$, the
SVM unit vector $\vc{w}_i$ is the center of the largest hypersphere
that can fit inside the current version space $\setV_i$ and the radius
$m_i$ of the hypersphere is proportional\footnote{To ease notation,
without loss of generality we shall assume the the constant of
proportionality is 1, i.e., the radius is equal to the margin.}  to
the size of the margin of $\vc{w}_i$. We can use the radius $m_i$ as
an indication of the size of the version
space~\citep{Vapnik:1998}. Suppose we have a candidate unlabeled
instance $\vc{x}$ in the pool. We can estimate the relative size of
the resulting version space $\setV^{-}$ by labeling $\vc{x}$ as $-1$,
finding the SVM obtained from adding $\vc{x}$ to our labeled training
data and looking at the size of its margin $m^{-}$. We can perform a
similar calculation for $\setV^{+}$ by relabeling $\vc{x}$ as class
$+1$ and finding the resulting SVM to obtain margin $m^{+}$.

Since we want an equal split of the version space, we wish
$\area{\setV^{-}}$ and $\area{\setV^{+}}$ to be similar. Now, consider 
$\min (\area{\setV^{-}},\area{\setV^{+}})$. It will be
small if $\area{\setV^{-}}$ and $\area{\setV^{+}}$ are very
different. Thus we will consider $\min (m^{-}, m^{+})$ as an
approximation and we will choose to query the $\vc{x}$ for which this
quantity is largest. Hence, the $\MaxMin$ query algorithm is as
follows: for each unlabeled instance $\vc{x}$ compute the margins
$m^{-}$ and $m^{+}$ of the SVMs obtained when we label $\vc{x}$ as
$-1$ and $+1$ respectively; then choose to query the unlabeled instance
for which the quantity $\min(m^{-}, m^{+})$ is greatest.

Figures~\ref{fig:simple2}b and~\ref{fig:maxMargin}a show
an example comparing the $\Simple$ Margin and $\MaxMin$ Margin
methods.

\item {\bf Ratio Margin.} This method is similar in spirit to the
$\MaxMin$ Margin method. We use $m^{-}$ and $m^{+}$ as indications of
the sizes of $\setV^{-}$ and $\setV^{+}$. However, we shall try to
take into account the fact that the current version space $\setV_i$
may be quite elongated and for some $\vc{x}$ in the pool {\em both}
$m^{-}$ and $m^{+}$ may be small simply because of the shape of
version space. Thus we will instead look at the {\em relative} sizes
of $m^{-}$ and $m^{+}$ and choose to query the $\vc{x}$ for which
$\min (\frac{m^{-}}{m^{+}}, \frac{m^{+}}{m^{-}})$ is largest (see
Figure~\ref{fig:maxRatio}b).
\end{itemize}

The above three methods are approximations to the querying component
that always halves version space. After performing some number of
queries we then return a classifier by learning a SVM with the 
labeled instances.

The margin can be used as an indication of the version space size
irrespective of whether the feature vectors have constant
modulus. Thus the explanation for the $\MaxMin$ and $\MaxRatio$
methods still holds even without the constraint on the modulus of the
training feature vectors. The $\Simple$ method can still be used when
the training feature vectors do not have constant modulus, but the
motivating explanation no longer holds since the maximal margin
hyperplane can no longer be viewed as the center of the largest
allowable sphere. However, for the $\Simple$ method, alternative
motivations have recently been proposed by \citet{Campbell+al:2000}
that do not require the constraint on the modulus.

For inductive learning, after
performing some number of queries we then return a classifier by
learning a SVM with the labeled instances. For transductive learning,
after querying some number of instances we then return a classifier by
learning a transductive SVM with the labeled {\em and} unlabeled instances.

\section{Experiments} \label{Experiments}
For our empirical evaluation of the above methods we used two
real-world text classification domains: the
$\Reuters$-21578 data set and the $\Newsgroups$ data set.

\subsection{Reuters Data Collection Experiments}
The $\Reuters$-21578 data set%
\footnote{Obtained from www.research.att.com/\symbol{126}lewis.} is a
commonly used collection of newswire stories categorized into
hand-labeled topics.  Each news story has been hand-labeled with some
number of topic labels such as ``corn'', ``wheat'' and ``corporate
acquisitions''. Note that some of the topics overlap and so some
articles belong to more than one category.  We used the 12902 articles
from the ``ModApte'' split of the data\footnote{The $\Reuters$-21578
collection comes with a set of predefined training and test set
splits. The commonly used``ModApte'' split filters out duplicate
articles and those without a labeled topic, and then uses earlier
articles as the training set and later articles as the test set.} and,
to stay comparable with previous studies, we considered the top ten
most frequently occurring topics.  We learned ten different binary
classifiers, one to distinguish each topic.  Each document was
represented as a stemmed, TFIDF-weighted word frequency vector.%
\footnote{We used Rainbow~\citep{Libbow} for text processing.} Each
vector had unit modulus. A stop list of common words was used and
words occurring in fewer than three documents were also ignored. Using
this representation, the document vectors had about 10000 dimensions.

\begin{figure*}[t]
\centerline{
\begin{tabular}{cc}
\hspace{-0.25in}
\psfig{figure=Graphs/reuter.1000.acc.zoomed.eps,height=2.8in,angle=270} &
\psfig{figure=Graphs/reuter.1000.be.zoomed.eps,height=2.8in,angle=270} \\
(a) & (b)
\end{tabular}
}
\caption{
(a) Average test set accuracy over the ten most frequently
occurring topics when using a pool size of 1000.
(b) Average test set precision/recall breakeven point over the
ten most frequently occurring topics when using a pool size of 1000.
}
\label{fig:reuter1000acc}
\label{fig:reuter1000be}
\end{figure*}

\begin{table}[t]
\centerline{
%\footnotesize{
\begin{tabular}{l|c|c|c|c}
Topic & $\Simple$ & $\MaxMin$ & $\MaxRatio$ & Equivalent \\ 
&	&	&	& $\Random$ size \\ \hline
Earn & $86.39 \pm 1.65$ & $87.75 \pm 1.40$ & $90.24 \pm 2.31$ & $34$ \\
Acq & $77.04 \pm 1.17$ & $77.08 \pm 2.00$ & $80.42 \pm 1.50$ & $>100$ \\
Money-fx & $93.82 \pm 0.35$ & $\mathbf{94.80 \pm 0.14}$ & $\mathbf{94.83 \pm 0.13}$ & $50$ \\
Grain & $95.53 \pm 0.09$ & $95.29 \pm 0.38$ & $95.55 \pm 1.22$ & $13$ \\
Crude & $95.26 \pm 0.38$ & $95.26 \pm 0.15$ & $95.35 \pm 0.21$ & $>100$ \\
Trade & $96.31 \pm 0.28$ & $96.64 \pm 0.10$ & $96.60 \pm 0.15$ & $>100$ \\
Interest & $96.15 \pm 0.21$ & $96.55 \pm 0.09$ & $96.43 \pm 0.09$ & $>100$ \\
Ship & $97.75 \pm 0.11$ & $97.81 \pm 0.09$ & $97.66 \pm 0.12$ & $>100$ \\
Wheat & $98.10 \pm 0.24$ & $98.48 \pm 0.09$ & $98.13 \pm 0.20$ & $>100$ \\
Corn & $98.31 \pm 0.19$ & $98.56 \pm 0.05$ & $98.30 \pm 0.19$ & $15$
\end{tabular}
%}
}
\caption{Average test set accuracy over the top ten most frequently
occurring topics (most frequent topic first) when trained with ten
labeled documents. Boldface indicates statistical significance.}
\label{fig:reuter10equiv.acc}
\end{table}

\begin{table}[t]
\centerline{
%\footnotesize{
\begin{tabular}{l|c|c|c|c}
Topic & $\Simple$ & $\MaxMin$ & $\MaxRatio$ & Equivalent \\
&	&	&	& $\Random$ size \\ \hline
Earn & $86.05 \pm 0.61$ & $\mathbf{89.03 \pm 0.53}$ & $\mathbf{88.95 \pm 0.74}$ & $12$ \\
Acq & $54.14 \pm 1.31$ & $56.43 \pm 1.40$ & $57.25 \pm 1.61$ & $12$ \\
Money-fx & $35.62 \pm 2.34$ & $38.83 \pm 2.78$ &$38.27 \pm 2.44$ & $52$ \\
Grain & $50.25 \pm 2.72$ & $\mathbf{58.19 \pm 2.04}$ & $\mathbf{60.34 \pm 1.61}$ & $51$ \\
Crude & $58.22 \pm 3.15$ & $55.52 \pm 2.42$ & $58.41 \pm 2.39$ & $55$ \\
Trade & $50.71 \pm 2.61$ & $48.78 \pm 2.61$ & $50.57 \pm 1.95$ & $85$ \\
Interest & $40.61 \pm 2.42$ & $45.95 \pm 2.61$ & $43.71 \pm 2.07$ & $60$ \\
Ship & $53.93 \pm 2.63$ & $52.73 \pm 2.95$ & $53.75 \pm 2.85$ & $>100$ \\
Wheat & $64.13 \pm 2.10$ & $66.71 \pm 1.65$ & $66.57 \pm 1.37$ & $>100$ \\
Corn & $49.52 \pm 2.12$ & $48.04 \pm 2.01$ & $46.25 \pm 2.18$ & $>100$
\end{tabular}
%}
}
\caption{Average test set precision/recall breakeven point over the
top ten most frequently occurring topics (most frequent topic first)
when trained with ten labeled documents. Boldface indicates statistical significance.}
\label{fig:reuter10equiv.be}
\end{table}

We first compared the three querying methods in the inductive learning
setting. Our test set consisted of the 3299 documents present in the
``ModApte'' test set. 

For each of the ten topics we performed the following steps. We created a
pool of unlabeled data by sampling 1000 documents from the remaining
data and removing their labels. We then randomly selected two
documents in the pool to give as the initial labeled training set. One
document was about the desired topic, and the other document was not
about the topic. Thus we gave each learner 998 unlabeled documents and
2 labeled documents. After a fixed number of queries we asked each
learner to return a classifier (an SVM with a polynomial kernel of
degree one%
\footnote{For SVM and transductive SVM
learning we used SVMlight~\cite{JoachimsSVMlight:1999}.}
learned on the labeled training documents). We then tested the
classifier on the independent test set.

The above procedure was repeated thirty times for each topic and the
results were averaged. We considered the $\Simple$ Margin, $\MaxMin$
Margin and $\MaxRatio$ Margin querying methods as well as a $\Random$
Sample method. The $\Random$ Sample method simply randomly chooses the
next query point from the unlabeled pool. This last method reflects
what happens in the regular passive learning setting---the
training set is a random sampling of the data.

To measure performance we used two metrics: test set classification
error and, to stay compatible with previous Reuters corpus results,
the {\em precision/recall breakeven point} \citep{Joachims:ECML97}.
{\em Precision} is the percentage of documents a classifier labels as
``relevant'' that are really relevant. {\em Recall} is the percentage
of relevant documents that are labeled as ``relevant'' by the
classifier. By altering the decision threshold on the SVM we can trade
precision for recall and can obtain a precision/recall curve for the
test set. The precision/recall breakeven point is a one number summary
of this graph: it is the point at which precision equals recall.

Figures~\ref{fig:reuter1000acc}a and~\ref{fig:reuter1000be}b
present the average test set accuracy and precision/recall breakeven
points over the ten topics as we vary the number of queries
permitted. The horizontal line is the performance level achieved when
the SVM is trained on all 1000 labeled documents comprising the
pool. Over the $\Reuters$ corpus, the three active learning methods
perform almost identically with little notable difference to
distinguish between them. Each method also appreciably outperforms
random sampling. Tables~\ref{fig:reuter10equiv.acc} and
\ref{fig:reuter10equiv.be} show the test set accuracy and breakeven
performance of the active methods after they have asked for just eight
labeled instances (so, together with the initial two random instances,
they have seen ten labeled instances). They demonstrate that the three
active methods perform similarly on this $\Reuters$ data set after
eight queries, with the $\MaxMin$ and $\MaxRatio$ showing a very
slight edge in performance. The last columns in each table are of more
interest. They show approximately how many instances would be needed
if we were to use $\Random$ to achieve the same level of performance
as the $\MaxRatio$ active learning method. In this instance, passive
learning on average requires over six times as much data to achieve
comparable levels of performance as the active learning methods. The
tables indicate that active learning provides more benefit with the
infrequent classes, particularly when measuring performance by the
precision/recall breakeven point. This last observation has also been
noted before in previous empirical tests~\citep{McCallum+Nigam:ICML98}.


\begin{figure*}[t]
\centerline{
\begin{tabular}{cc}
\hspace{-0.25in}
\psfig{figure=Graphs/reuter.balanced.acc.eps,height=2.8in,angle=270} &
\psfig{figure=Graphs/reuter.balanced.be.zoomed.eps,height=2.8in,angle=270} \\
(a) & (b)
\end{tabular}
}
\caption{
(a) Average test set accuracy over the ten most frequently
occurring topics when using a pool size of 1000.
(b) Average test set precision/recall breakeven point over the
ten most frequently occurring topics when using a pool size of 1000.
}
\label{fig:reuter_balanced_acc}
\label{fig:reuter_balanced_be}
\end{figure*}

\begin{figure*}
\centerline{
\begin{tabular}{cc}
\psfig{figure=Graphs/reuter.500.acc.eps,height=2.8in,angle=270} &
\psfig{figure=Graphs/reuter.500.be.eps,height=2.8in,angle=270} \\
(a) & (b)
\end{tabular}
}
\caption{
(a) Average test set accuracy over the ten most
frequently occurring topics when using a pool sizes of 500 and 1000.
(b) Average breakeven point over the ten most
frequently occurring topics when using a pool sizes of 500 and 1000.
}
\label{fig:reuter500_1000acc}
\label{fig:reuter500_1000be}
\end{figure*}

We noticed that approximately half of the queries that the active
learning methods asked tended to turn out to be positively labeled,
regardless of the true overall proportion of positive instances in the
domain. We investigated whether the gains that the active learning
methods had over regular $\Random$ sampling were due to this biased
sampling. We created a new querying method called $\BalancedRandom$
which would randomly sample an equal number of positive and negative
instances from the pool. Obviously in practice the ability to randomly
sample an equal number of positive and negative instances without
having to label an entire pool of instances first may or may not be
reasonable depending upon the domain in
question. Figures~\ref{fig:reuter_balanced_acc}a
and~\ref{fig:reuter_balanced_be}b show the average accuracy and
breakeven point of the $\BalancedRandom$ method compared with the
$\MaxRatio$ active method and regular $\Random$ method on the
$\Reuters$ dataset with a pool of 1000 unlabled instances. The
$\MaxRatio$ and $\Random$ curves are the same as those shown in
Figures~\ref{fig:reuter1000acc}a and~\ref{fig:reuter1000be}b. The
$\MaxMin$ and $\Simple$ curves are omitted to ease legibility. The
$\BalancedRandom$ method has a much better precision/recall breakeven
performance than the regular $\Random$ method, although it is still
matched and then outperformed by the active method. For classification
accuracy, the $\BalancedRandom$ method initially has extremely poor
performance (less than 50\% which is even worse than pure random
guessing) and is always consistently and significantly outperformed by
the active method. This indicates that the performance gains of the
active methods are not merely due to their ability to bias the class
of the instances they queries. The active methods are choosing special
targeted instances and approximately half of these instances happen
to have positive labels.

Figures~\ref{fig:reuter500_1000acc}a
and~\ref{fig:reuter500_1000be}b show the average accuracy and
breakeven point of the $\MaxRatio$ method with two different pool
sizes. Clearly the $\Random$ sampling method's performance will not be
affected by the pool size. However, the graphs indicate that
increasing the pool of unlabeled data will improve both the accuracy
and breakeven performance of active learning. This is quite intuitive
since a good active method should be able to take advantage of a
larger pool of potential queries and ask more targeted questions.

\begin{figure*}[t]
\centerline{
%\begin{tabular}{cc}
%\hspace{-0.25in}
\psfig{figure=Graphs/reuter.trans.be.zoomed.eps,height=3.0in,angle=270} %&
%\hspace{-0.43in}
%\psfig{figure=Graphs/reuter.trans.acc.eps,height=1.88in,angle=270} \\
%(a) & (b)
%\end{tabular}
}
\caption{
%(a)
Average pool set precision/recall breakeven point over the
ten most frequently occurring topics when using a pool size of 1000.
%(b) Average pool set accuracy over the
%ten most frequently occurring topics when using a pool size of 1000.
}
\label{fig:reutertrans}
%\label{fig:reutertrans_acc}
\end{figure*}

\ignore{
\begin{figure*}
\centerline{
\psfig{figure=Graphs/crude.pr.curve.eps,height=2.8in,angle=270}
}
\caption{
A typical precision/recall curve using transductive and inductive SVMs after 20 queries.
}
\label{fig:pr_curve}
\end{figure*}
}

We also investigated active learning in a transductive setting. Here
we queried the points as usual except now each method ($\Simple$ and
$\Random$) returned a transductive SVM trained on both the labeled and
remaining unlabeled data in the pool. As described
by~\citet{Joachims:ECML97} the breakeven point for a TSVM was computed
by gradually altering the number of unlabeled instances that we wished
the TSVM to label as positive. This invovles re-learning the TSVM
multiple times and was computationally intensive. Since our setting
was transduction, the performance of each classifier was measured on
the pool of data rather than a separate test set.  This reflects the
relevance feedback transductive inference example presented in the
introduction.

Figure~\ref{fig:reutertrans} shows that using a TSVM provides a
slight advantage over a regular SVM in both querying methods
($\Random$ and $\Simple$) when comparing breakeven points. However,
the graph also shows that active learning provides notably more
benefit than transduction---indeed using a TSVM with a $\Random$
querying method needs over 100 queries to achieve the same breakeven
performance as a regular SVM with a $\Simple$ method that has only
seen 20 labeled instances.

\subsection{Newsgroups Experiments}
Our second data collection was K. Lang's $\Newsgroups$
collection~\cite{Lang95}. We used the five $\mathtt{comp.*}$ groups,
discarding the Usenet headers and subject lines. We processed the text
documents exactly as before, resulting in vectors of about 10000
dimensions.

\begin{figure*}[t]
\centerline{
\begin{tabular}{cc}
\hspace{-0.25in}
\psfig{figure=Graphs/ng.all.acc.zoomed.eps,height=2.8in,angle=270} &
\psfig{figure=Graphs/ng.ibm.acc.zoomed.eps,height=2.8in,angle=270} \\
(a) & (b)
\end{tabular}
}
\caption{ (a) Average test set accuracy over the five
$\mathtt{comp.*}$ topics when using a pool size of 500.  (b) Average
test set accuracy for $\mathtt{comp.sys.ibm.pc.hardware}$ with a 500
pool size.}
\label{fig:ng.all.acc}
\label{fig:ng.ibm.acc}
\end{figure*}

We placed half of the 5000 documents aside to use as an independent
test set, and repeatedly, randomly chose a pool of 500 documents from
the remaining instances. We performed twenty runs for each of the five
topics and averaged the results. We used test set accuracy to measure
performance. Figure~\ref{fig:ng.all.acc}a contains the learning
curve (averaged over all of the results for the five $\mathtt{comp.*}$
topics) for the three active learning methods and $\Random$
sampling. Again, the horizontal line indicates the performance
of an SVM that has been trained on the entire pool. There is no
appreciable difference between the $\MaxMin$ and $\MaxRatio$ methods
but, in two of the five newsgroups
($\mathtt{comp.sys.ibm.pc.hardware}$ and
$\mathtt{comp.os.ms\mbox{-}windows.misc}$) the $\Simple$ active learning
method performs notably worse than the $\MaxMin$ and $\MaxRatio$
methods. Figure~\ref{fig:ng.ibm.acc}b shows the average learning
curve for the $\mathtt{comp.sys.ibm.pc.hardware}$ topic. In around ten
to fifteen per cent of the runs for both of the two newsgroups
the $\Simple$ method was misled and performed extremely poorly (for
instance, achieving only 25\% accuracy even with fifty training
instances, which is worse than just randomly guessing a label!). This
indicates that the $\Simple$ querying method may be more unstable than
the other two methods.

One reason for this could be that the $\Simple$ method tends not to
explore the feature space as aggressively as the other active methods,
and can end up ignoring entire clusters of unlabeled instances. In
Figure~\ref{fig:cluster}a, the $\Simple$ method takes several queries
before it even considers an instance in the unlabeled cluster while
both the $\MaxMin$ and $\MaxRatio$ query a point in the unlabeled
cluster immediately.

\begin{figure*}[t]
\centerline{
\begin{tabular}{cc}
\psfig{figure=Figures/cluster.eps,height=2.6in,angle=270} &
\psfig{figure=Graphs/ng.ibmmisc.acc.zoomed.eps,height=2.8in,angle=270} \\
(a) & (b)
\end{tabular}
}
\caption{
(a) A simple example of querying unlabeled clusters.
(b) Macro-average test set accuracy for
$\mathtt{comp.os.ms\mbox{-}windows.misc}$ and
$\mathtt{comp.sys.ibm.pc.hardware}$ where $\Hybrid$ uses the $\MaxRatio$
method for the first ten queries and $\Simple$ for the rest. 
}
\label{fig:hybrid}
\label{fig:cluster}
\end{figure*}

While $\MaxMin$ and $\MaxRatio$ appear more stable they are much more
computationally intensive.  With a large pool of $s$ instances, they
require about $2s$ SVMs to be learned for each query. Most of the
computational cost is incurred when the number of queries that have
already been asked is large. The reason is that the cost of training
an SVM grows polynomially with the size of the labeled training set
and so now training each SVM is costly (taking over 20 seconds to
generate the 50th query on a Sun Ultra 60 450Mhz workstation with a
pool of 500 documents).  However, when the quantity of labeled data is
small, even with a large pool size, $\MaxMin$ and $\MaxRatio$ are
fairly fast (taking a few seconds per query) since now training each
SVM is fairly cheap. Interestingly, it is in the first ten queries
that the $\Simple$ seems to suffer the most through its lack of
aggressive exploration. This motivates a $\Hybrid$ method. We can use
$\MaxMin$ or $\MaxRatio$ for the first few queries and then use the
$\Simple$ method for the rest. Experiments with the $\Hybrid$ method
show that it maintains the stability of the $\MaxMin$ and $\MaxRatio$
methods while allowing the scalability of the $\Simple$
method. Figure~\ref{fig:hybrid}b compares the $\Hybrid$ method with
the $\MaxRatio$ and $\Simple$ methods on the two newsgroups for which
the $\Simple$ method performed poorly. The test set accuracy of the
$\Hybrid$ method is virtually identical to that of the $\MaxRatio$
method while the $\Hybrid$ method's run time was about the same as the
$\Simple$ method, as indicated by Table~\ref{table:runtimes}.

\begin{table}[t]
\centerline{
\begin{tabular}{l|c|c|c|c}
Query 	& $\Simple$ 	& $\MaxMin$ 	& $\MaxRatio$ 	& $\Hybrid$ \\ \hline
1	& 0.008 	& 3.7		& 3.7		& 3.7 \\
5	& 0.018		& 4.1		& 5.2		& 5.2 \\
10	& 0.025		& 12.5		& 8.5		& 8.5 \\
20	& 0.045		& 13.6		& 19.9		& 0.045 \\
30	& 0.068		& 22.5		& 23.9		& 0.073 \\
50	& 0.110		& 23.2		& 23.3  	& 0.115 \\
100	& 0.188		& 42.8		& 43.2		& 0.2
\end{tabular}
}
\caption{Typical run times in seconds for the Active methods on the $\Newsgroups$ dataset}
\label{table:runtimes}
\end{table}


\section{Related Work} \label{Related}

\begin{figure*}[t]
\centerline{
\begin{tabular}{cc}
\psfig{figure=Graphs/vs.mccallum.eps,height=2.8in,angle=270} &
\psfig{figure=Graphs/vs.liere.eps,height=2.8in,angle=270} \\
(a) & (b)
\end{tabular}
}
\caption{ 
(a) Average breakeven point performance over the Corn, Trade and Acq
$\Reuters$-21578 categories.
(b) Average test set accuracy over the top ten
$\Reuters$-21578 categories.
}
\label{fig:vs.mccallum}
\label{fig:vs.liere}
\end{figure*}

There have been several studies of active learning for classification.
The Query by Committee algorithm~\citep{Seung+al:COLT92,Freund+al:ML97}
uses a prior distribution over hypotheses. This general algorithm has
been applied in domains and with classifiers for which specifying and
sampling from a prior distribution is natural. They have been used
with probabilistic models~\citep{Dagan+Engelson:ICML95} and
specifically with the Naive Bayes model for text classification in a
Bayesian learning setting~\citep{McCallum+Nigam:ICML98}.  The Naive
Bayes classifier provides an interpretable model and principled ways
to incorporate prior knowledge and data with missing values. However,
it typically does not perform as well as discriminative methods such
as SVMs, particularly in the text classification
domain~\citep{Joachims:ECML97,Dumais+al:CIKM98}.

We re-created
McCallum and Nigam's~(1998\nocite{McCallum+Nigam:ICML98}) experimental
setup on the $\Reuters$-21578 corpus and compared the reported results
from their algorithm (which we shall call the {\em MN-}algorithm
hereafter) with ours.
In line with their experimental setup, queries were asked five at a
time, and this was achieved by picking the five instances closest to
the current hyperplane.
Figure~\ref{fig:vs.mccallum}a compares McCallum and Nigam's reported
results with ours.
The graph indicates that the Active SVM performance is significantly
better than that of the {\em MN-}algorithm.

An alternative committee approach to query by committee was explored
by Liere and Tadepalli~(1997, 2000). Although their algorithm ({\em
LT-}algorithm hereafter) lacks the theoretical justifications of the
Query by Committee algorithm, they successfully used their committee
based active learning method with Winnow classifiers in the text
categorization domain.  Figure~\ref{fig:vs.liere}b was produced by
emulating their experimental setup on the $\Reuters$-21578 data set
and it compares their reported results with ours.  Their algorithm
does not require a positive and negative instance to seed their
classifier.  Rather than seeding our Active SVM with a positive and
negative instance (which would give the Active SVM an unfair
advantage) the Active SVM randomly sampled 150 documents for its first
150 queries. This process virtually guaranteed that the training set
contained at least one positive instance. The Active SVM then
proceeded to query instances actively using the $\Simple$ method.
Despite the very naive initialization policy for the Active SVM, the
graph shows that the Active SVM accuracy is significantly better than
that of the {\em LT-}algorithm.

\citet{Lewis:SIGIR94} introduced
uncertainty sampling and applied it to a text domain using logistic
regression and, in a companion paper, using decision
trees~\citep{Lewis:ICML94}. The $\Simple$ querying method for SVM
active learning is essentially the same as their uncertainty sampling
method (choose the instance that our current classifier is most
uncertain about), however they provided substantially less
justification as to why the algorithm should be effective. They also
noted that the performance of the uncertainty sampling method can be
variable, performing quite poorly on occasions.

Two other studies~\citep{Campbell+al:2000,Schohn+Cohn:2000}
independently developed our $\Simple$ method for active learning with
support vector machines and provided different formal
analyses. Campbell, Cristianini and Smola extend their analysis for
the $\Simple$ method to cover the use of soft margin
SVMs~\citep{Cortes+Vapnik:ML95} with linearly non-separable
data. Schohn and Cohn note interesting behaviors of the active
learning curves in the presence of outliers.

\section{Conclusions and Future Work} \label{Conclusions}
We have introduced a new algorithm for performing active learning with
SVMs. 
By taking advantage of the duality between parameter space and feature
space, we arrived at three algorithms that attempt to reduce version
space as much as possible at each query. We have shown empirically
that these techniques can provide considerable gains in both the
inductive and transductive settings---in some cases shrinking the
need for labeled instances by over an order of magnitude, and
in almost all cases reaching the performance achievable on the entire
pool having seen only a fraction of the data.  Furthermore, larger
pools of unlabeled data improve the quality of the resulting
classifier.  

Of the three main methods presented, the $\Simple$ method is
computationally the fastest. However, the $\Simple$ method seems
to be a rougher and more unstable approximation, as we witnessed when
it performed poorly on two of the five Newsgroup topics. If asking
each query is expensive relative to computing time then using either
the $\MaxMin$ or $\MaxRatio$ may be preferable. However, if the cost
of asking each query is relatively cheap and more emphasis is placed
upon fast feedback then the $\Simple$ method may be more suitable. In
either case, we have shown that the use of these methods for learning
can substantially outperform standard passive learning. Furthermore,
experiments with the $\Hybrid$ method indicate that it is possible to
combine the benefits of the $\MaxRatio$ and $\Simple$ methods.

The work presented here leads us to many directions of interest.
Several studies have noted that gains in computational speed can be
obtained at the expense of generalization performance by querying
multiple instances at a
time~\citep{Lewis:SIGIR94,McCallum+Nigam:ICML98}. Viewing SVMs in terms
of the version space gives an insight as to where the approximations
are being made, and this may provide a guide as to which multiple
instances are better to query. For instance, it is suboptimal to query
two instances whose version space hyperplanes are fairly parallel to
each other. So, with the $\Simple$ method, instead of blindly choosing
to query the two instances that are the closest to the current SVM, it
may be better to query two instances that are close to the current SVM
and whose hyperplanes in the version space are fairly
perpendicular. Similar tradeoffs can be made for the $\MaxRatio$ and
$\MaxMin$ methods.
 
{\em Bayes Point Machines}~\citep{Herbrich+al} approximately find the
center of mass of the version space. Using the $\Simple$ method with
this point rather than the SVM point in the version space may produce an
improvement in performance and stability. The use of Monte Carlo
methods to estimate version space areas may also give improvements.

One way of viewing the strategy of always choosing to halve the
version space is that we have essentially placed a uniform
distribution over the current space of consistent hypotheses and we
wish to reduce the expected size of the version space as fast as
possible. Rather than maintaining a uniform distribution over
consistent hypotheses, it is plausible that the addition of prior
knowledge over our hypotheses space may allow us to modify our query
algorithm and provided us with an even better strategy. Furthermore,
the PAC-Bayesian framework
introduced by \citet{McAllester:COLT99}
considers the effect of prior knowledge on generalization bounds and
this approach may lead to theoretical guarantees for the modified
querying algorithms.

Finally, the $\MaxRatio$ and $\MaxMin$ methods are computationally
expensive since they have to step through each of the unlabeled data
instances and learn an SVM for each possible labeling. However, the
temporarily modified data sets will only differ by one instance from
the original labeled data set and so one can envisage learning an SVM
on the original data set and then computing the ``incremental''
updates to obtain the new SVMs~\citep{Gert:NIPS00} for each of the
possible labelings of each of the unlabeled instances. Thus, one would
hopefully obtain a much more efficient implementation of the
$\MaxRatio$ and $\MaxMin$ methods and hence allow these active
learning algorithms to scale up to larger problems.

\acks{
This work was supported by DARPA's {\em Information Assurance\/} program
under subcontract to SRI International, and by ARO grant DAAH04-96-1-0341
under the MURI program ``Integrated Approach to Intelligent Systems''.}

\bibliography{simon}

\end{document}