
\section{Experiments}
\label{sec:Experiments}
In this section we compare performances of the suggested method to 
state-of-the-art SVM training procedures, 
construct a simple toy example to demonstrate the numerical instability 
of the Sherman-Morrison-Woodbury update, 
and examined the impact of approximating the SVM solution from both
the optimization problem and the classification problem perspectives.

\subsection{Cholesky Product Form QP vs. SMO}
\begin{figure}[tbh]
\begin{minipage}{16cm}
\epsfxsize = 12cm 
\centerline{\epsfbox{CPFQPvsDpSMO.eps}}
\end{minipage}
\hfill 
\caption{ Cholesky Product Form QP vs. SMO}
\label{CPFQPvisSMO}

\end{figure}

We've modeled 150 different Speaker Id binary problems\footnote{
This is part of a much larger multi-class Speaker ID system \citep*{FNG01a}.},
in the following way: 
For each speaker we trained a mixture of 4
Gaussians, 25 dimensions\footnote{%
13 dimension cepstrum, augmented with first derivatives (`delta')
and finally discarding the energy coefficient.}, 
diagonal covariance. We then applied Fisher kernel
methodology \citep{JH99}, which resulted in transforming the original 
data vectors to a 204 dimensional space. 
We further applied linear normalization transformation to
each dimension to balance the transformed vectors. This caused the
transformed vectors to be fully dense. The size of the training set in a
typical problem ranges from 6500 to 7000 vectors, roughly equally divided
to positive and negative example. We chose a dot-product kernel 
and compared our technique with the SMO algorithm to find a maximal margin 
linear separator. Our choice
was motivated by former
evidences which support the claim that most of
the work in separating the positive and negative points is carried out
through the transformation represented by the Fisher kernel.
In the current setting the choice of a dot-product kernel served 
another purpose - 
to obtain results of the SMO at the peek of its performance. 
To this end we used a special designed variant of SMO which takes advantage 
of the fact
that the kernel operations are just dot-products
\citep[for further discussion see][]{Platt99}. 
The comparison in CPU time (depicted at Fig. 
\ref{CPFQPvisSMO}) clearly favors our method. Note also that we have gained
roughly a factor of 1100 in computational complexity and a factor of 
33 in the storage requirements over traditional IPMs. 



\subsection{Cholesky Product Form QP vs. SVM$^{light}$}
\label{Sec_CPFQPvsSVMlight}
%\begin{figure}[tbh]
\begin{figure}[hbt!]
\begin{minipage}{16cm}
\epsfxsize = 12cm 
\centerline{\epsfbox{CPFQPvsSVMlight.abalone.cpu.eps}}
\end{minipage}
\hfill 
\caption{ Cholesky Product Form QP vs. SVM$^{light}$}
\label{CPFQPvsSVMlight}

\end{figure}

We next turn to compare performances with Joachims's SVM$^{light}$ package,
which is an active set method applying the ``shrinking'' approach to select 
the subproblem to be solved at each iteration \citep{Joachims99}. As the QP
subproblem solver we used {\em loqo}, which is Smola's implementation of an 
IPM solver provided with the SVM$^{light}$ package. We examined performances 
on a moderate size problem, the Abalone dataset from the UCI Repository 
\citep{BM98}.
Since, at this point, we were not interested
in evaluating generalization performances, we used the whole set (of size 4177)
for training. The gender encoding  (male/female/infant) was mapped into
\{(1,0,0),(0,1,0),(0,0,1)\}. Then data was rescaled to lie in the [-1,1]
interval. 
Similarly to the SMO experiment, we restricted ourselves to dot-product
kernel, while controlling the difficulty of the problem with the choice
of the penalty (cost) term  $c$. We examined performances for 3 levels
of difficulty, while gradually increasing the allowed size of the 
subproblems the SVM$^{light}$ algorithm. Note that there's no similar
tuning to be done for our method since the whole QP problem fits into memory.
The results are depicted at Figure \ref{CPFQPvsSVMlight}. 
This figure highlights the fact that SVM$^{light}$ is quite sensitive 
to the choice of the subproblem (chunk) size, as well as
the difficulty of the overall problem, 
while our method (denoted ``CholQP'' it the figure) is  stable 
for  all levels of difficulty. 



\subsection{Example of failure of the Sherman-Morrison-Woodbury update}
%\begin{figure}[tbh]
\begin{figure}[hbt]
\centerline{
\begin{minipage}{10cm}
\epsfxsize = 10cm 
\centerline{\epsfbox{smw_ex.eps}}
\end{minipage}
}
\hfill 
\caption{ Example for the failure of SMW update}
\label{smw_ex}

\end{figure}

We constructed a small example in the spirit of the one described in 
Section \ref{sec:Low-rank_updates} to demonstrate possible numerical instability
of the Sherman-Morrison-Woodbury update. Figure \ref{smw_ex} illustrates
the data in the example. There are two crucial features: the set of
active support vectors is redundant (so the problem is degenerate) and
the support vectors are scaled by a large number ($10^4$).
On this problem an interior point method using SMW update failed to
converge after achieving only 2 digits of accuracy. The same IPM with
the product form Cholesky update converged to 8 digits of accuracy.

To demonstrate that such failures can happen in practice we applied 
the Sherman-Morrison-Woodbury version of the code to an
approximate problem arising from Abalone data set using polynomial kernel
(as described in the next subsection). The algorithm stalled after 
achieving only 5 digits of accuracy, whereas the product form Cholesky
version converged to 12 digits of accuracy.


\subsection{Incomplete Cholesky Factorization (ICF)}

\begin{figure}[htb]
\centerline{
\begin{minipage}{15cm}
\epsfxsize = 15cm 
\centerline{\epsfbox{BM_W.eps}}
\end{minipage}
}
\hfill 
\caption{ Incomplete Cholesky Factorization for Poly Kernel 
(input dim = 10, poly degree = 5) on the Abalone data set.
The optimization problem perspective:
optimal value of the objective function
and norm of the separating hyperplane.
X axis is the rank (note that the feature space dim $\approx 3000$).}
%\caption{ Incomplete Cholesky Factorization for Poly Kernel 
%(input dim = 10, poly degree = 5) on the Abalone data set -
%{\bf the optimization problem perspective:}
%Optimal value of the objective function
%and norm of the separating hyperplane.
%X axis is the rank (note that the feature space dim $\approx 3000$).}
\label{BM_W_Obj}
\end{figure}


\begin{figure}[htb]
\centerline{
\begin{minipage}{15cm}
\epsfxsize = 15cm 
\centerline{\epsfbox{BM_Err.eps}}
\end{minipage}
}
\hfill 
\caption{ Incomplete Cholesky Factorization for Poly Kernel 
(input dim = 10, poly degree = 5) on the Abalone data set.
The classification problem perspective: training and testing errors. 
X axis is the rank (note that the feature space dim $\approx 3000$).}
%\caption{ Incomplete Cholesky Factorization for Poly Kernel 
%(input dim = 10, poly degree = 5) on the Abalone data set - 
%{\bf the classification problem perspective:} Training and testing errors. 
%X axis is the rank (note that the feature space dim $\approx 3000$).}
\label{BM_Err}
\end{figure}

To demonstrate the utility of the proposed ICF method and the impact of its
approximation on the SVM solution, 
we conducted an experiment similar to the one described at 
\citep{SS00} on the Abalone data set\footnote{
The data set was preprocessed as described in Section 
\ref{Sec_CPFQPvsSVMlight},
and we used the first 3000 points for training 
and the remaining 1177 points for testing.} using
polynomial kernels,
%\footnote{
%The feature space dimension is 
%$\left( \begin{array}{c}m+d\\d \end{array} \right)$, 
%where $m$ is the input space dimension.}
i.e. $k(x,y)=(\left<x,y \right>+const)^d$. 
Thus we actually used the ``kernelized'' version
of the factorization algorithm 
(cf. Section \ref{sec:Approximating_the_Kernel_Matrix}). 
We examined the
resulted factorization from two stand points: Figure \ref{BM_W_Obj}
demonstrates the impact of the low-rank approximation 
on the solution of the optimization problem 
as a function of the rank, i.e. the optimal value of the objective 
function\footnote{This is actually a ``negative'' plot of the optimal values
(due to our definition of the objective function) 
to ease the comparison with similar plots published in the SVM literature.}
and the norm of the separating hyperplane.
Figure \ref{BM_Err} shows the impact of the low-rank approximation 
from the classification performances perspective, i.e. 
training and testing errors. 
Both figures are scaled with a graph which measures the quality of the 
approximation in Frobenius norm
(the square root of the 
sum of squares of the matrix elements)
with respect to the original kernel matrix.










