
\section{Low-rank updates}
\label{sec:Low-rank_updates}
There are two methods which efficiently solve the system $(D+VV\t)u=w$
by taking into consideration the special form of the matrix
of the system (i.e., the fact that it is a diagonal matrix plus
a low-rank symmetric positive semidefinite matrix).
Both of these
methods require $O(nk)$ storage and $O(nk^2)$ arithmetic operations, 
where $k$ is the
number of columns in $V$. Thus, if $k<<n$ then the two methods provide 
significant savings in terms of workload and storage.

The first method is somewhat cheaper than the second. Its workload
is about half of that of the second method and the storage requirement is
one third of the storage of the second method. 
On the other hand, the first method may experience numerical instability.

\subsection{Sherman-Morrison-Woodbury update}

We  consider a  classical  method for a low-rank update of 
an inverse of a matrix. This update is usually called the 
Sherman-Morrison-Woodbury (SMW) update or formula:
$$
(D+VV\t)^{-1}=D^{-1}-D^{-1}V(I+V\t D^{-1}V)^{-1}V\t D^{-1}.
$$
Since we would like to avoid storing the matrix $D+VV\t$, we also would like
to avoid storing its inverse. We can apply the above formula to
solve the system $(D+VV\t)u=w$ as follows: 
$$
u=D^{-1}w-D^{-1}V(I+V\t D^{-1}V)^{-1}V\t D^{-1}w. 
$$
Hence, compute 
\begin{eqnarray}
&& z:=D^{-1}w \nonumber\\
&& t:\ (I+V\t D^{-1}V)t=V\t z\nonumber \\
&& u:=z-D^{-1}Vt. \nonumber 
\end{eqnarray}

The system in the second step is $k \times k$ symmetric positive
definite and can be solved by computing  Cholesky factorization
of $(I+V\t D^{-1}V)$. The work required to compute such factorization
is $O(k^3)$, which is small if $k$ is small with respect to $n$.
Computing the matrix of that system takes $nk^2/2+O(nk)$ multiplications
(and the same number of additions). The overall work, in terms
of the number of multiplications, required to
solve $(D+VV\t)u=w$ is $nk^2/2$ plus smaller order terms.

The Sherman-Morrison-Woodbury has been used widely in
the context of interior point methods for linear programming,
\citep{ChoiMonmaShanno, Marxen}.
In that context, however, the method often runs into numerical 
difficulties. There are few modifications of the method
 that have been proposed to deal with the numerical instability
\citep{Andersen,SW}. However, these modifications do not
always help, and they increase the computational cost.

It is not easy to point out  rigorously  the situations
for which the SMW
formula is numerically unstable.
The intuition suggests that it happens when matrix $D+VV\t$ or
just $D$ is not well conditioned. 
Consider the following small example:
\begin{example}\label{exmp:SMW}
Let
$$
D=\left [\begin{array}{cc} \epsilon^2  &  \\  & 1 \end{array} \right ]
\quad {\rm and} \quad V=\left [\begin{array}{r}1\\-1\end{array}\right ].
$$
Then assuming that $\epsilon ^2$ is smaller than the relative machine 
precision - i.e., $1
\pm\epsilon^2$ is computed as $1$ - we have that 

\begin{equation}
D +VV\t\approx \left [\begin{array}{rr}1 &  -1\\  -1& 2\end{array}\right ]. \nonumber
\end{equation}
Now consider solving $(D+VV\t)u=w$, where $w=\left ( \begin{array}{c} w_1 \\ w_2 \end{array} \right )$ and $w_1$ and $w_2$ are $\Theta(1)$. Clearly
\begin{equation}
u\approx \left( \begin{array}{c}{2w_1+w_2}
\\w_1+w_2
\end{array}\right ). \label{solut}
\end{equation}

Let us apply the Sherman-Morrison-Woodbury  formula to this example.  
$$
z=\left (\begin{array}{c}{w_1}\over{\epsilon^2} \\ w_2
\end{array}\right )
$$
$$
(V\t D^{-1} V+I)t=
(2+{ {1}\over{\epsilon^2}})t \quad {\rm and} \quad V\t z={{w_1}\over{\epsilon^2}}-w_2.
$$
Because of round-off errors $V\t D^{-1} V+I$ is computed as 
$1/\epsilon^2$ and $V\t z$ as  $w_1/\epsilon^2$.
Hence $t=w_1$,
$$
u=z-D^{-1}Vt=\left (\begin{array}{c}
{{w_1}\over{\epsilon^2}}-{{w_1}\over{\epsilon^2}} \\ w_2+w_1 
\end{array}\right )=\left (\begin{array}{c}
0\\ w_2 +
w_1\end{array}\right ).
$$
Clearly, the computed solution is not even close to the approximately 
correct solution (\ref{solut}).
%\vskip0.5cm
%
\end{example}
A tuation similar to the above example happens often in linear programming.
For our QP problems such a situation is less natural, but still possible.
The diagonal elements of $D$ are $s_i/x_i+\xi _i/(c-x_i)$. By complementarity,
as $\mu$ approaches zero, either $s_i\to 0$ and $x_i \to x_i^*>0$ or
$x_i\to 0$ and $s_i \to s_i^*>0$ (analogously, either $\xi_i\to 0$ and $x_i \to x_i^* < c$ or
$x_i\to c$ and $\xi_i \to \xi_i^*>0$). Moreover, whenever a variable
converges to zero, it converges with the same rate as the parameter $\mu$.
Thus, whenever $s_i$ and $\xi_i$ both converge to zero (which happens
for the points that are in the active optimal set), then $D_i$ converges
to zero with the same rate as $\mu$. Otherwise, $D_i$ converges to
infinity with the same rate as $1/\mu$. If $\mu=10^{-8}$, then
some elements of $D$ are of the order $10^{-8}$ and the others are of the
order $10^8$. If we scale $D$ and $Q$ by $10^{-8}$ then $D$ looks 
similar to the diagonal matrix in the example; i.e., some of the elements are
of the order of $1$ and some are of the order of $10^{-16}$
(which often is the order of  the machine precisions).
To have matrix $Q$ and $V$ consistent with the example, we need the 
elements of scaled matrix $10^{-8}Q$ to be of the order of $1$.
This will happen if the elements of $V$, e.g.  the data points, 
 are of the order $10^4$. 
In section \ref{sec:Experiments}
we briefly discuss some examples of practical SVM problems 
for which our implementation
of the interior point method with the SMW update failed to converge.



In the next subsection we present an alternative method, which avoids many
of the numerical difficulties of the SMW update.



\subsection{Product-Form Cholesky Factorization}\label{pfcf}
An alternative efficient method for solving the
system $(D+VV\t)u=w$ is based on a low rank
update of the Cholesky factorization of the matrix,
rather than using low rank updates to the inverse of the matrix. 
This is the reason for good numerical properties of this approach.
We will use a, so called, product form Cholesky factorization
of $D+VV\t$.

Here a Cholesky factorization of a symmetric positive semidefinite matrix 
$M\in \R^{n\times n}$ is defined as $M=L\Lambda L\t$, where $L$ is a 
lower triangular matrix with
ones on the diagonal, and $\Lambda$ is a nonnegative diagonal matrix.
A product form Cholesky factorization, in general, is of the form 
$\lt ^1 \lt ^2 \ldots \lt ^k \dt ^k (\lt^k)\t \ldots (\lt^2)\t (\lt^1)\t$, where
each $\tilde L^i$ is a lower triangular matrix and $\dt^k$ is a nonnegative
diagonal matrix. 

To solve a system of linear equations $Mu=w$, one has to solve the following
sequence of linear systems: $\tilde L^1 u^1=w$, 
$\tilde L^2 u^2=u^1$, $\ldots$ , $\tilde L^k u^k=u^{k-1}$, $(\tilde L^k)\t 
u^{k+1}=(\Lambda^k)^{-1}u^{k}$, $\ldots$ , $(\tilde L^1)\t u=u^{2k-1}$. 
Each such system has a lower triangular or an upper triangular matrix,
and, in a  general dense case, solving each such system takes $O(n^2)$ 
operations. However, if each $\tilde L^i$ has a special structure
(such as we will see below) then solving each system takes only $O(n)$ 
operations.

Let us assume that we have a Cholesky factorization of a matrix $M$:
$M=L\Lambda L\t$. We would like to obtain a product form factorization
of $M+vv\t$, where $v\in \R ^n$ is a vector (hence, $vv\t$ is a rank-one
$n\times n$ matrix):
$$
L\Lambda L\t +vv\t= L(\Lambda +pp\t)L\t=
L\lt \tilde \Lambda \tlt L \t,
$$
where $p$ is the solution of the equations $Lp=v$
and $\lt \tilde \Lambda \tlt$ is the Cholesky factorization of 
$\Lambda +pp\t$. 

$\tilde L$ has a special form
\begin{equation}
\lt=\left [ \begin{array}{cccccc} 1 & & & & & \\
            p_2\beta _1 & 1 & & & & \\
            p_3\beta _1 &  p_3 \beta_2 & 1 & & &\\
            \vdots & \vdots & \vdots & \ddots & & \\
            p_{n-1}\beta _1 & p_{n-1}\beta_2 & p_{n-2}\beta _3& \cdots & 1 &\\
            p_n\beta _1 & p_n \beta _2 & p_n \beta _3 & \cdots & p_n\beta_{n-1}
            & 1 \end{array} \right ], \label{ltild}
\end{equation}
where $\beta\in \R ^n$ and $\tilde \Lambda={\rm diag}\{\tilde\lambda_1, \ldots,
\tilde\lambda _n\}$ can be computed from the following recurrence relations:
\begin{eqnarray}
&& t_0:=1, \nonumber\\
&& {\rm for \ } j=1,2,\ldots,n \nonumber \\
&& \hskip0.5cm t_j:=t_{j-1} +p_j^2 /\lambda_j , \nonumber \\
&& \hskip0.5cm \tilde \lambda_j :=\lambda_j t_j/t_{j-1}, \nonumber\\
&& \hskip0.5cm \beta _j :=p_j/(\lambda_jt_j). \nonumber 
\end{eqnarray}
Various versions of this method 
for updating Cholesky factorization 
by a rank-one term were  proposed by 
\citet{Bennet}, \citet{FP} and  \citet{GMS}.

If we allow some  $\lambda _j$ to be zero,  then it is easy
to verify that the above recurrence relations 
become:
\begin{figure}[htb]
\centerline{
\fbox{\begin{minipage}{29pc}
\begin{eqnarray}
&& t_0:=1, \nonumber\\
&& {\bf for \ } j=1,2,\ldots,n \nonumber \\
&& \hskip0.5cm {\bf if \ } t_{j-1}\neq \infty \nonumber \\
&& \hskip1cm {\bf if \ } \lambda_j\neq 0 \nonumber \\
&& \hskip1.5cm t_j:=t_{j-1} +p_j^2 /\lambda_j , \nonumber \\
&& \hskip1.5cm \tilde \lambda_j :=\lambda_j t_j/t_{j-1}, \nonumber\\
&& \hskip1.5cm \beta _j :=p_j/(\lambda_jt_j). \nonumber \\
&& \hskip1cm{\bf else \ }  \nonumber \\
&& \hskip1.5cm {\bf if\ } p_j\neq 0\nonumber \\
&& \hskip2cm t_j:=\infty \nonumber \\
&& \hskip2cm \tilde \lambda_j :=p_j^2/t_{j-1}, \nonumber \\
&& \hskip2cm \beta _j :=1/p_j. \nonumber \\
&& \hskip1.5cm{\bf else } \nonumber \\
&& \hskip2cm t _j :=t_{j-1} \nonumber \\
&& \hskip2cm \tilde \lambda _j :=0 \nonumber \\
&& \hskip2cm \beta _j :=0 \nonumber \\
&& \hskip0.5cm{ \bf else \ } \nonumber \\
&& \hskip1cm t_j:=\infty \nonumber \\
&& \hskip1cm \tilde \lambda_j :=\lambda _j, \nonumber\\
&& \hskip1cm \beta _j :=0. \nonumber
\end{eqnarray}
\end{minipage}}
}
\caption{Rank-One update for Product-Form Cholesky Factorization}
\label{fig:RankOnePFCF}
\end{figure}

It is often important for the purpose of numerical stability
to consider $\lambda_j$ as equal to zero, even if it is not
precisely zero, but is very small (relatively to the other 
elements of $\Lambda$). The above recursions can be used
(with slight modifications) in the approximate sense as well.

Now assume that we  have a factorization $L\Lambda L\t$ and we 
want to obtain a product form factorization
of $L\Lambda L\t+v^1(v^1)\t+v^2(v^2)\t+\ldots+v^k(v^k)\t$.
By repeating the above procedure $k$ times, once for each rank-one term 
$v^i(v^i)\t$, we can obtain the product from Cholesky factorization 
$L \lt ^1 \lt ^2 \ldots, \lt ^k \dt ^k (\lt^k)\t \ldots  (\lt^2)\t (\lt^1)\t  
\tl$
from the following:

\vskip0.5cm

\noindent {\bf Procedure 1.}
\begin{description}

\item[1.] Set $\dt ^0:=\Lambda$,

\item[2.] For $i=1,\ldots, k$

\begin{description}
\item[(i)] Solve $Lq^{0,i}=v^i$ for $q^{0,i}$;
\item[(ii)] For $j=1,\ldots, i-1$, solve $\lt ^j q^{j,i}=q^{j-1,i}$ for $q^{j,i}$;
\item[(iii)] Set $p:=q^{i-1,i}$; compute $\beta$ and $\dt $ by recurrence
relations in Fig. \ref{fig:RankOnePFCF};
\item[(iv)] Store $p^i:=p$ and $\beta ^i:=\beta$; replace $\dt ^{i-1}$ by $\dt ^i$.
\end{description}
\end{description}

\vskip0.5cm

In the case of our QP problems, we need to compute product form
Cholesky factorization of a matrix $M=D + VV\t$, where $D$ is a diagonal matrix
in $\R^{n\times n}$, $V$ is a rectangular matrix in $\R^{n\times k}$ 
(assuming $k<<n$), 
and $VV\t=\sum_{i=1}^k v^i(v^i)\t$, where $v^i$ is the $i$-th column
of $V$. Thus, we can apply the above Procedure 1 by setting
initial $\Lambda$ to equal $D$ and  $L$ to $I_n$, an $n\times n$ identity 
matrix.

To store information about the  product form Cholesky factorization of 
$(D+VV\t)$ we need to store the matrix $V$, the diagonal elements of 
$\Lambda^k$, $k$ vectors
$p^i$ and $k$ vectors $\beta^i$, $i=1,\ldots k$.
Overall, our memory requirement for the factorization is $3kn$ plus smaller
order terms.

To compute the vectors $p^i$ and $\beta^i$, $i=1, \ldots, k$, we need to solve
$(k-1)(k-2)/2$ systems of the form $\tilde Lq=r$ (by $\tilde L$, 
without an index, we mean any matrix of the form of Eq. (\ref{ltild})).
Due to the special structure of $\tilde L$ such solution can be
obtained by the following procedure:

\vskip0.5cm
\noindent {\bf Procedure 2.}

Given $p$, $\beta $ and $r$;

\begin{description}
\item[1.] Set $q:=r$ and  $\sigma:=q_1\beta _1$;
\item[2.] For $j=2,\ldots, n$
\begin{description} 
 \item Set $q_j:=q_j-p_j\sigma$,
 \item  ${\sigma}:={\sigma}+q_j\beta _j$
\end{description}
\end{description}

\vskip0.5cm

This procedure requires $2n$ multiplications 
(and the same number of additions). Recursions
in Fig. \ref{fig:RankOnePFCF}  require $O(n)$ operations for each $i=1,\ldots, k$.
Hence, overall, the computation of the product from Cholesky factorization
of $D+VV\t$ requires $k^2n$ ( plus smaller order term) multiplications.

Clearly, solving the system of equations of the form $\tilde L\t q=r$,
and performing matrix multiplication $q=\tilde L r$ or $q=\tilde L\t r$ 
each can be done by a procedure which takes into account the special structure
of $\tilde L$, similarly to Procedure 2, and  
requires $O(n)$ operations. Thus, once the factorization of $D+VV\t$ is 
computed,
solving $(D+VV\t)u=w$  requires only $O(kn)$ operations. The overall 
complexity in terms of the number of multiplications is then 
$k^2n$ plus smaller order term. This is about twice as much as the complexity
of solving this system using the SMW update. However, if $k<<n$ this
complexity is still a very significant improvement over $O(n^3)$ 
operations required to  solve $(D+VV\t)u=w$ directly.

The product form Cholesky factorization method is numerically more stable
than the SMW method both in theory and in practice. First, it is easy to
check that it produces a correct result when applied to the
small example in the previous subsection.
Second, in Section \ref{sec:Experiments}
we show some examples of QP for which the implementation of 
an interior point method using the product form Cholesky factorization 
works well, while the implementation of the same interior point method
with SMW update fails numerically. 
Finally, in \citet{GS1} it is
shown that the product form Cholesky factorization is numerically 
stable when used in the context of interior point method for linear programming
(under the assumption that the initial data is well conditioned).
This result is supported by extensive computational 
evidence. The case of QP for SVM is somewhat different from
the case of linear programming. The proof of numerical stability 
does not apply directly, and extending it is tedious and beyond 
interest of this paper;
however, similar intuition applies. Computational experiments show that the product form 
Cholesky factorization method avoids numerical difficulties even when the initial data
is not well conditioned.









