
\section{Approximating the Kernel Matrix}
\label{sec:Approximating_the_Kernel_Matrix}

It is often the case that the exact low-rank representation  of the 
kernel matrix $Q=VV\t$ is not given, or even does not exist,
i.e. the rank of $Q$ is large. 
In these cases we may hope that $Q$ can at least be approximated\footnote{
Indeed in the context of SVM it was already been noted 
that typically $Q$ has rapidly decaying eigenvalues \citep{WSS98, SS00}
although there exist kernels for which their eigenspectrum is flat 
\citep{OSS00}.}
by a low rank matrix $\tilde Q$.
However, finding a good approximation
while keeping its  rank low is not trivial. 
In this section we address the following problem: given a symmetric positive semidefinite matrix, $Q$,  construct a matrix $\tilde Q$ such that
its rank $k$ and the bound on the error $\dQ=Q-\tilde Q$ are 
acceptably small. 

Recently, \citet{SS00} suggested a method for 
approximating the data set in the feature space 
by a set in a low-dimensional subspace.
The authors suggested to select (via random sampling) 
a small subset of data points to form the basis of 
the approximating subspace\footnote{
Also see \citet{WS01} for another 
random sampling technique for low rank approximation.}
. 
All other data points are then approximated by linear combinations
of the elements of the basis. The basis is build iteratively, each new
candidate  element  is chosen by a greedy method to reduce the
bound on the approximation error $\dQ=Q-\tQ$ as much as possible.
The authors use the value ${\rm tr}(\dQ)$ as the bound on the norm
of $\dQ$, and hence, as a stopping criteria: the algorithm stops when  
${\rm tr}(\dQ)$ is below some tolerance, say $\e _{tol}$. 

Let us assume that $k$ points are already in the basis.
To compute the reduction of the approximation bound for each
of the remaining $n-k$ points extra $O(nk(n-k))$ operations are required.
This is clearly too expensive. The authors  suggest at each iteration
to randomly select $N$ candidates for the basis and apply their greedy 
approach only to those candidates. Thus, the overall complexity is
$O(Nnk^2)$. 

Here we present a way of constructing  $\tQ$
by directly approximating the Cholesky factorization of $Q$.
The proposed algorithm has complexity $O(nk^2)$ and takes the
full advantage of the greedy approach for the best
reduction in the approximation bound ${\rm tr}(\dQ)$.

Any positive definite matrix $Q$ can be represented by its Cholesky factorization  $Q=GG\t$, where
$G$ is a lower triangular matrix\footnote{
This is the traditional definition of Cholesky factorization.
It relates to the definition presented at Section \ref{sec:Low-rank_updates}, 
$Q=L\Lambda L\t$, by setting $G=L\Lambda^{1/2}$.
}
\citep[see][ Ch.~4]{GolubVanLoan}. 
If $Q$ is positive semidefinite and singular, 
then it is still possible to compute 
an ``incomplete'' Cholesky factorization $GG\t$, where some columns of $G$ 
are zero. Such procedure requires $O(k^2n)$ operations, where $k$ is the number
of nonzero pivots in the procedure, which is the same as the rank
of matrix $Q$. This procedure can also be applied to {\sl almost} singular 
matrices by skipping  pivots that are below a certain threshold 
\citep[see][]{GolubVanLoan}. 
If the eigenvalues of $Q$ are distinctly
separated into a group of very small eigenvalues
and relatively large eigenvalues then by choosing an appropriate value
of the threshold such procedure will produce both: very small numerical 
error and a good approximation \cite[see][]{Wright2}. 
If, however, the eigenvalues of $Q$
have a more complicated structure (varying from very small to relatively
large), then a symmetric permutation of rows and columns of $Q$ may be 
necessary during the 
Cholesky factorization procedure to stabilize the computations and guarantee
a good approximation. Symmetric permutations (sometimes referred to as 
``symmetric pivoting'', cf. 4.2.9  of \citealt{GolubVanLoan}) are also 
crucial in case when one is trying to find the best possible approximation 
while keeping the rank of the approximating matrix below a prescribed bound.
Figure \ref{fig:CFSP} present an algorithm which computes an approximation 
$\tQ=GG\t$ of $Q$ using incomplete Cholesky factorization with 
symmetric permutations. Note that this procedure generates the 
{\em Cholesky triangle} $G$ column by column, while keeping in memory
only the diagonal elements  of $Q$. All other entries of $Q$ are needed
only once and thus can be computed on demand. Hence it is easy to derive a 
``kernelized'' version of the suggested procedure, assuming an access to
the input vectors and the kernel function.
\begin{figure}[htb]
\centerline{
\fbox{\begin{minipage}{29pc}
\begin{eqnarray*}
&&{\bf for\ } i=1:n \\
&& \quad {\bf for\ } j=i:n \\
&& \quad \quad G_{jj}:=Q_{jj}\\
&& \quad \quad {\bf for\ } k=1:i-1 \\ 
&& \quad \quad \quad G_{j j}:=G_{jj}-G_{jk}G_{jk}\\
&& \quad \quad {\bf end\ } \\
&& \quad {\bf end\ } \\
&& \quad {\bf if } \sum_{j=i}^n G_{jj} > \e_{tol}\\
&& \quad  \quad{\bf find} j^*:\ G_{j^* j^*}=\max_{j=i:n}G_{jj}\\
&& \quad  \quad G_{i:n,i}:=Q_{j^*:n,j^*} \\
&& \quad  \quad G_{i, 1:i} \leftrightarrow G_{j^*, 1:i} \\
&& \quad  \quad {\bf for\ } j=1:i-1 \\
&& \quad \quad  \quad G_{i+1:n, i}:=G_{i+1:n,i}-G_{i+1:n,j}G_{i,j}\\
&& \quad \quad {\bf end\ } \\
&& \quad \quad G_{ii}:=\sqrt{G_{ii}}\\
&& \quad \quad G_{i+1:n, i}:=G_{i+1:n, i}/G_{ii} \\
&& \quad {\bf else\ } \\
&& \quad \quad k:=i-1 \\
&& \quad \quad {\bf Stop} \\
&& \quad {\bf endif}\\
&& {\bf end}
\end{eqnarray*}
\end{minipage}}
}
\caption{Column-wise Cholesky Factorization with Symmetric Pivoting}
\label{fig:CFSP}
\end{figure}
%
%\vskip0.5cm

The computational complexity of this procedure is $O(nk^2)$, since
the only extra work is in computing  diagonal elements $G_{jj}$, which
requires $O(nk)$ operations  at each iteration. The storage requirement
is $O(nk)$. Symmetric permutations provide a greedy way of building
the approximation of $Q$. After $i$ iterations of the Cholesky factorization
procedure 
the matrix $G^i=G_{1:n, 1:i}$ is such that $G^i(G^i)\t=\tQ^i$, where
$\tQ^i$ is an approximation of $Q$ subject to a symmetric permutation
of rows and columns. For simplicity, let us assume that the rows and
 columns of $Q$ are ordered  ``correctly'' and no permutation is 
necessary
(this assumption does not affect our analysis).
Let $\dQ^i=Q-\tQ^i$. Let $G^i_{1:i}$ be the matrix composed of the first 
$i$ rows of $G^i$ and $G^i_{i+1:n}$ in the matrix composed of the last 
$n-i$ columns. Then
$$
\dQ^i=\left [ \begin{array}{cc} 0 & 0 \\ 0 & Q_{i+1:n,i+1:n}-
G^i_{i+1:n}(G^i_{1:i})^{-1}Q_{1:i,i+1:n}\end{array} \right ].
$$
From the properties of Cholesky
factorization, $\dQ^i$ is positive semidefinite and 
it is easy to see that after updating $G_{jj}$ for all $j=i+1,\ldots, n$
at the $i$-th  iteration,  
$\sum_{j=i}^n G_{jj}={\rm tr}\,\dQ^i$. Thus, when the above
algorithm stops (after  $k$ iteration), we have 
${\rm tr}\,\dQ \leq \e_{tol}$.

\subsection{Bound on the error in the optimal objective value}
\label{subsec:bound}



Consider the perturbed optimization problem $(\tilde P)$, which is constructed
by replacing the kernel matrix $Q$ by a low-rank approximation $\tQ$, i.e.
\begin{eqnarray}
{\rm min_{x}}&& \frac{1}{2}x\t \tQ x - e\t x \nonumber\\
\hspace{-1.3in}(\tilde P)\hspace{0.8in}\qquad {\rm s.t.}&&a\t x=0, \nonumber\\
&&0\leq x \leq c.\nonumber
\end{eqnarray}
A natural question to ask is: 
How close is the solution of the perturbed problem
to the solution of the original problem. 

Let $x^*$ denote an optimal solution of the original problem ($P$) and let
 $\tilde x^*$ denote an optimal solution of the perturbed problem ($\tilde P$).
Also let $f$
denote  the objective function of ($P$), and $\tilde f$ be the objective 
function  of $\tilde P$. We would like to estimate 
$|f(x^*)-\tilde f(\tilde x^*)|$. The feasible sets of $P$ and $\tilde P$ are 
the same, hence a feasible solution of one problem is feasible for the other.

From optimality of $\tilde x^*$ $\tilde f(\tilde x^*)\leq \tilde f(x^*)=
f(x^*)+\frac{1}{2}(x^*)\t\dQ x^*$. We know that $\dQ=Q-\tQ$ is positive 
semidefinite, thus $\tilde f(\tilde x^*)\leq \tilde f(x^*)\leq f(x^*)$;
 i.e. optimal objective value of the
perturbed problem is always smaller than the optimal objective value 
of the original problem\footnote{Recall that our definition of the objective function is the negative
of the typical objective function definition used in SVM literature. Thus, for
the more traditional objective function its optimal  value for the 
original problem is always {\em smaller} than the optimal value for the
perturbed problem.}.

On the other hand,  optimality of $x^*$ $f(x^*)\leq f(\tilde x^*)$, 
hence $f(x^*)-\tilde 
f(\tilde x^*)\leq f(\tilde x^*)-\tilde f(\tilde x^*)=\frac{1}{2} (\tilde x^*)\t\dQ 
\tilde x^*\leq \frac{1}{2}\lambda_1(\dQ)||\tilde x^*||^2\leq \frac{1}{2}
{\rm tr}\,(Q-\tQ)||\tilde x^*||^2$. 
 Let $l$ be the number of positive components of $\tilde x^*$. From the
feasibility constraints $0\leq x\leq c$ these components are bounded from 
above by $c$. Let $\dQ_l$ be the principal sub-matrix of $\dQ$ that 
corresponds to
positive components of $\tilde x^*$.  The above yields the bound 
$$
0\leq f(x^*)-\tilde f(\tilde x^*)\leq \frac{1}{2}x(0)\t \dQ x(0)\leq
 \frac{c^2l\e}{2}.
$$
Hence we have the following theorem:



\begin{theorem}\label{bound}
If $(Q-\tQ)$ is positive semidefinite and  ${\rm tr}\,(Q-\tQ)\leq \e$ 
then the optimal objective value of  the original problem is larger than
the optimal objective value of  the perturbed problem and their
difference is bounded by $c^2l\e/2$, 
where $l$ is the number of active constraints (support vectors)
in the perturbed problem.
\end{theorem} 


The above error bound does not require approximation matrix 
$\tQ$ to be obtained by Cholesky factorization with permutations. Theorem
\ref{bound} holds for any $\tQ$ such that $Q-\tQ$ is positive
semidefinite and it's trace is bounded by $\e$. If $Q-\tQ$ is not 
positive semidefinite, then it is still possible to provide a bound
on the change in the optimal value function through a
bound on the norm of $\dQ$ (see \citealt{FS2} for more details).

