\section{An Interior Point Method}
\label{sec:IPM}

We consider the following convex quadratic programming problem.

\begin{eqnarray}
{\rm min_{x}}&& \frac{1}{2}x\t Qx - e\t x \nonumber\\
\hspace{-1.3in}(P)\hspace{0.8in}\qquad {\rm s.t.}&&a\t x=0, \nonumber\\
&&0\leq x \leq c,\nonumber
\end{eqnarray}
where $x\in \R ^n$ is the vector of primal variables,  $e$ is the vector of all $1$'s 
of length $n$ and $a$ is a vector of labels, $1$'s
and $-1$'s.  $c$ is the penalty parameter associated with the training error.
In general it may be desirable to use different penalty parameters for 
different data points. 
In this case $c$ should be treated as an $n$-dimensional vector.
However, 
for the sake of simplicity we assume that $c$ is a 
positive scalar. 
All our analysis and results extend easily to the more general case.
Similarly, one might want to solve a problem with a slightly more general  objective function
$\frac{1}{2}x\t Qx +q\t x$, where $q$ is some $n$-dimensional vector
(e.g., a small scaled optimization problem arising within 
a chunking-like meta algorithms). 
Our analysis and results extend easily to this case as well. 

The dual to the above problem is 
\begin{eqnarray}
{\rm max_{y,s}}&& -\frac{1}{2}x\t Qx - c\sum_{i=1}^n \xi_i \nonumber\\
\hspace{-1.3in}(D)\hspace{0.8in}\qquad {\rm s.t.}&& -Qx+ a y +s-\xi =-e,\nonumber\\
&&s\geq 0,\  \xi \geq 0,  \nonumber
\end{eqnarray}
Here $y$ is the scalar dual variable (it is equal to the negative bias) 
and $s$ and $\xi$
are the $n$-dimensional vectors of dual variables. 


Both ($P$) and ($D$) have finite optimal solutions and 
any optimal primal-dual pair of solutions  satisfies
the Karush-Kuhn-Tucker (KKT) necessary and sufficient optimality conditions:

\begin{eqnarray}
&&Xs=0,\nonumber\\
&&(C-X)\xi=0\nonumber\\
&&a\t x=0, \nonumber\\
&&-Qx+ a y +s-\xi =-e,\nonumber\\
&&0\leq x\leq c,\ s\geq 0,\ \xi \geq 0. \nonumber
\end{eqnarray}
By $X$ ($S$, $C-X$) we denote the diagonal matrix whose diagonal entries are the
elements of the vector $x$ ($s$, $c-x$).
For optimality conditions of QP  and details of applying
IPMs to QP \citep[see][]{Wright} and references therein.

The perturbed KKT optimality conditions  
\begin{eqnarray}
&&Xs=\sigma \mu e,\nonumber\\
&&(C-X)\xi=\sigma \mu e\nonumber\\
\hspace{-.6in}(CP _\mu)\hspace{.7in}&&a\t x=0, \nonumber\\
&& -Qx+ a y +s-\xi=-e,\nonumber\\
&&0\leq x\leq c,\ s\geq 0,\ \xi \geq 0 \nonumber
\end{eqnarray}
have a unique solution for any positive  values of the parameter $\mu$ and $\sigma$. 
The trajectory of solutions to the above system of equations for all
values of $\mu\in (0, \infty)$ with some fixed positive  value of $\sigma$ 
is called the {\sl central path}. The central path converges to an optimal solution. 
A path-following primal-dual interior point method 
tries to follow the central path towards the optimal solution
by iteratively approximating solutions on the path by applying the 
Newton method to the above system of nonlinear equations, while reducing
the value of parameter $\mu$ on each iteration.

Any typical interior point method solves a linearization of the above
system of nonlinear equations. 
We use the so-called Mehrotra predictor-corrector algorithm \citep{Mehrotra}, 
which is considered to be one of the most efficient in practice, however
the ideas presented in Section \ref{sec:Low-rank_updates} on reducing the 
per-iteration complexity apply to any other interior point method. 
At every iteration there are two basic steps - the predictor step and 
the corrector step. 
The idea behind this algorithm is to start with ``predicting'' 
the best reduction in the duality gap,
by evaluating a step directly towards the optimality. Such step, however,
is not taken, since it may ruin the ``centrality'' of the iterates. 
A corrector step is computed instead.
The corrector step enforces the centrality and also takes into
account the approximate curvature of the central path predicted by
the predictor step. 

Predictor step is a Newton step towards the solution of the  system 
with $\sigma =0$ (which is the same as the system of the KKT conditions). 
Then, using this step, a
value for $\sigma$ is chosen and a better step 
(corrector step) towards solution
of the above system with this value of $\sigma$ is taken.

For the predictor step we  compute $(\Delta x, \Delta y, \Delta s, \Delta \xi)$,
 satisfying
$$
\left [ \begin{array}{cccc} -Q &a & I & -I \\
a\t & 0 & 0 & 0\\ S & 0 & X  & 0 \\
-\Xi & 0 & 0 & (C-X)\end{array} \right ] 
\left [ \begin{array}{c}  \Delta x \\
\Delta y \\ \Delta s \\ \Delta \xi\end{array} \right ]=
\left [ \begin{array}{l} r_c \\
r_b\\ -Xs \\ -(C-X)\xi \end{array} \right ]
$$
where $r_b=-a\t x$, $r_c=-e+Qx-a y -s +\xi$. If the current solution
is primal and dual feasible, then $r_b$ and $r_c$ 
are  zero (within the numerical accuracy). 

By eliminating $\Delta s$, $\Delta \xi$ and consequently  $\Delta x$ from the 
system we obtain an expression for $\Delta y$: $a\t(Q+D)^{-1}a\Delta y = r$, 
where $D=SX^{-1}+\Xi (C-X)^{-1}$ and $r$ is a right hand side that depends on $(Q+D)^{-1}$ and
other parameters of the system. Solving for $\Delta y$ we substitute it back 
to find $\Delta x$, $\Delta s$ and $\Delta \xi$. 
The maximum possible step length is determined by computing $\alpha\leq 1$ 
such that
$0 \leq x+\alpha \Delta x\leq c$, $0\leq s+\alpha\Delta s$ and $0\leq \xi +\alpha\Delta \xi$.

To compute the corrector step, 
we set $dx=\Delta x$, $ds=\Delta s$ and $d\xi=\Delta \xi$ and $\hat \mu=(x+dx)\t(s+ds)+(c-x-dx)\t
(\xi+d\xi)$ (from the predictor step) and compute 
new $(\Delta x, \Delta y, \Delta s, \Delta \xi)$, satisfying
$$
\left [ \begin{array}{cccc} -Q &a & I & -I \\
a\t & 0 & 0 & 0\\ S & 0 & X  & 0 \\
-\Xi & 0 & 0 & (C-X)\end{array} \right ] 
\left [ \begin{array}{c}  \Delta x \\
\Delta y \\ \Delta s \\ \Delta \xi\end{array} \right ]=
\left [ \begin{array}{l} r_c \\
r_b\\\sigma \mu -dXds -Xs \\ \sigma \mu + dXd\xi-(C-X)\xi \end{array} \right ]
$$
where $\sigma$ is defined by
$$
\sigma=\left ( \frac{\hat \mu}{\mu}\right ) ^3.
$$
We solve this system similarly to the system for the predictor step, 
compute the largest possible step (but actually
take 99$\%$ of that step to avoid hitting the boundaries of the 
feasible region or else the new solution would not be in the interior), 
and update the current solution.

The stopping criteria is formed by checking the duality gap 
and the primal and dual feasibility against a predetermined tolerance.
If these conditions are still not met, we compute a new value for
the parameter $\mu$ and repeat the iteration.
This algorithm converges to an optimal solutions from any strictly interior
starting point (a point that lies inside all inequality constraints).
In theory, it takes $O(n\ln (1/\e))$ 
iterations to converge, where $\e$ is the
relative accuracy. However, in practice, an interior point method rarely takes
more than $50$ iterations, 
regardless of the problem's size.

The most time consuming  operation on every step is obtaining a vector
of the form $u=(Q+D)^{-1}w$ (e.g., during the computation of the predictor step
such operation is done twice - once during computation of the right 
hand side vector $r$ and once during computation of $a\t (Q+D)^{-1}a$). 
Since $Q+D$ is a dense $n\times n$, symmetric 
positive definite 
matrix, then in general obtaining the vector $u$ requires $O(n^3)$ operations.
Typically this can be done by either computing the inverse or 
factorizing\footnote{
Note that  since $D$ and $Q$ are the same for
both the predictor and the corrector steps, we need to compute
the inverse or factorize $D+Q$ only once per every iteration.}
$Q+D$.

The main point of the next section is to show that 
if $Q$ can be represented as $Q=VV\t$, where $V$ is 
$n\times k$ matrix (where $k<<n$), then vector $u$ can be obtained in
$O(k^2n)$ operations.