\section{Introduction}

In the core of the SVM training problem lies a convex optimization problem
which scales with the training set size rather than the feature space 
dimension \citep{BGV92, Vap95}.
While this is usually considered to be a desired quality, since it
circumvents the well known ``curse of dimensionality'', in large scale
problems (which are so common in real world application such as speech,
document classification, OCR, etc.) it may actually raise a new concern:
Although the training problem is, in principle, solvable, in practice it is
intractable by the conventional optimization techniques; e.g., 
each iteration of a general interior point method (IPM) scales
cubically with the size of the training set. Several approaches
to handle this problem have been suggested in the past few years which
basically build a solution by solving a sequence of small scale subproblems.
To mention a few: stochastic gradient ascent
algorithms such as the Kernel-Adatron \citep{FCC98} 
and the SMO \citep{Platt99}, 
which sequentially update one or two
(resp.) Lagrange multipliers at every training step, and active set
methods, such as Chunking \citep{BGV92}, Decomposition \citep{OFG97} 
and Shrinking \citep{Joachims99},
which gradually build the
(hopefully) small set of active constraints by feeding a generic optimizer
(usually an interior point algorithm) with small scale subproblems.

Our current effort is concentrated on the numerical rank of the kernel matrix as a
source for further enhancement of the performance of the IPM. 
To this end we make the following three key observations:

\begin{enumerate}
\item If the kernel matrix has {\em low rank}, then it is possible to design 
an efficient interior point  method that scales {\em linearly} with the
size of data set.

\item If the kernel matrix has {\em full rank} (or almost so), 
it may  be possible  to approximate it by a  low rank positive semidefinite
 matrix (which in turn can be used to feed the optimizer).

\item By using an approximation matrix instead of the original kernel matrix we obtain
 a perturbed optimization problem, which requires less computational effort to solve.
In general, the optimal value of the objective function of the perturbed problem
is different than the optimal value of the original problem. However,
the size of the difference can be explicitely controlled by 
the quality of the approximation.
\end{enumerate}

We assume that the positive semidefinite kernel matrix, $Q_{n\times n}$, is
totally dense and too large to handle, however its rank, $k$, is significantly
smaller than $n$ (``significantly'' may be defined according to the
context). This means that $Q$ can be represented as $Q=VV^{T}$, where $V$ is
a matrix with $k$ columns and $n$ rows. The most expensive step at every
iteration of an IPM  (applied to an SVM problem) is inverting or factorizing a matrix of the form $D+Q$,
where $D$ is a diagonal (positive) matrix and $Q$ is as described above. In
general, this operation  requires  $O(n^{2})$ storage space and takes
 $O(n^{3})$ arithmetic operations, but by handling the linear
algebra in a special way we can reduce the complexity to $O(nk^{2})$ and the
storage to $O(nk)$.

There are two possible approaches that achieve the same complexity. The
first approach is based on the Sherman-Morrison-Woodbury formula for a low
rank update of an inverse of a matrix. 
The Sherman-Morrison-Woodbury formula has been used widely in the context 
of interior point methods for linear
programming \citep{ChoiMonmaShanno,Marxen}. In that context,
however, the method often runs into numerical difficulties. 
The second approach is based on product form Cholesky factorization and is
described in Section \ref{sec:Low-rank_updates}. 
This method is somewhat more expensive than the first: its
workload is about twice and the storage requirement is three times the storage
needed for the first method. 
On the other hand, the second method was shown to be
numerically stable (experimentally and theoretically)  in the
context of interior point methods for LP and quadratically constrained 
quadratic problems \citep{GS1}. Some examples that we present 
in Section \ref{sec:Low-rank_updates}
suggest that the first method may also have numerical 
difficulties in the context of our QP problems. The second method works
well numerically for all problems that we tried.

In many  applications that we considered, a low rank representation   $Q=VV^{T}$ 
may not be available a priori (even if it exists). It may also happen that even 
though the rank of $Q$ is close to (or equal to) $n$, $Q$ can be 
approximated by a low rank symmetric positive semidefinite matrix $\tQ$. 
In this case one can apply incomplete Cholesky factorization method with 
symmetric pivoting to obtain the desired representation of $Q$ or its approximation.
This procedure requires $O(nk^2)$ operations, 
where $k$ is the number of nonzero pivots in the procedure, 
which is the same as the rank of matrix $Q$ or of its low rank approximation. 
The storage requirements reduces to $O(nk)$.
It is  possible to trace  the quality of the approximation
and apply an appropriate stopping criteria without an extra effort.
This and the overall complexity makes this method compare favorably with 
other recently suggested approximation techniques \citep{SS00, WS01}.

Finally, the last observation leads us to derive a bound on the quality of 
the solution obtained by solving the perturbed problem
using a low rank approximation to $Q$. 
We show that if the approximation error $Q-\tQ$ is bounded (in some sense)
by $\e$, then the difference between the optimal objective function values 
 of the original problem and the perturbed problems
is bounded by $lc^2\e$, where $c$ is the penalty (cost) parameter
of the training error and $l$ is the number of active points in the training 
set of the perturbed problem.

The rest of this paper is organized as follows: 
In the next section we present the convex QP that we are interested in solving,
and the interior point method that we use. 
In Section \ref{sec:Low-rank_updates}
we describe the Sherman-Morrison-Woodbury method and 
the product form Cholesky factorization method, as means for low rank updates. 
Section \ref{sec:Approximating_the_Kernel_Matrix} is devoted to 
issues related to approximating the kernel matrix by a 
low-rank positive semidefinite matrix. We present the algorithm to find such
an approximation and a theorem  in which we derive the above mentioned bound.
Experiments, which demonstrate the utility of the suggested method, 
are presented in Section \ref{sec:Experiments}. We wrap things up 
in Section \ref{sec:Conclusions} with some 
concluding remarks and leading directions for further study.

\subsection{Notation}
In general we denote scalars and vectors in lower case letters
(we make an explicit distinction when it is not clear from the context), and
we use upper case letters to denote matrices. We use subscript indices 
to indicate  elements of a vector or a matrix and we use superscript
to label a whole matrix of a vector with an index.

The description in this paper follows the traditional notation used
in the optimization literature, which is slightly
different from the one used in most SVM papers. The main difference
is in the identification of the primal and dual problems (which switch roles)
and the convention used to denote the unknowns with $x,y$, while in the SVM
literature $x$ and $y$  are used to denote the input vectors and their labels, resp. 
Hence, the SVM training problem with $1$-norm soft margin is
\begin{eqnarray}
{\rm min_{\xi,w,b}}&& \frac{1}{2} \left<w,w \right> + c\sum_{i=1}^n \xi_i \nonumber\\
\hspace{-1.3in}(SVM)\hspace{0.8in}\qquad {\rm s.t.}&& 
a_i \left( \left< w,v_i \right> + b \right) \geq 1-\xi_i, \quad i=1,\ldots,n \nonumber\\
\hspace{0.8in}&& \xi_i \geq 0, \qquad i=1,\ldots,n \nonumber
\end{eqnarray}
where labeled examples are pairs of the form $(v,a)$ such that $v$ is a
finite or infinite dimension feature vector, 
$a \in \left\{ -1,+1 \right\}$ is a label, and $\left< \cdot,\cdot \right>$ 
is an inner product.
We often use $Q$ as a shorthand for the labeled kernel matrix, i.e.
\[
Q_{i,j} = a_i a_j K\left( v_i, v_j \right)
\]
where $K\left( \cdot,\cdot \right)$ is a 
(Mercer) kernel function representing dot-products in the feature (RKHS) space.

We will use an equivalent form of the above problem 
(with somewhat different notation) which we will refer to as the
{\sl dual} problem (see problem ($D$) in the next section). The dual
of the ($SVM$) problem is thus called the {\sl primal} problem
(see problem ($P$) in the next section).