\documentclass[twoside,11pt]{article}

\usepackage{jmlr2e}


\jmlrheading{2}{2001}{97-123}{3/01}{12/01}{Roman Rosipal and Leonard J Trejo}
\ShortHeadings{Kernel Partial Least Squares Regression in RKHS}{Rosipal and Trejo}
\firstpageno{97}


\begin{document}


\title{Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space}
\author{\name Roman Rosipal\footnotemark 
\email rrosipal@mail.arc.nasa.gov \\
\addr Applied Computational Intelligence Research Unit \\
%School of ICT \\
%Information \\
%and Communication Technologies \\
University of Paisley \\
Paisley PA1 2BE, Scotland \\
{\rm and} \\
Laboratory of  Neural Networks \\
Institute of Measurement Science, SAS \\
%D{\'u}bravsk{\'a} cesta 9 \\
Bratislava 842 19, Slovak Republic \\
\AND
\name  Leonard J. Trejo \email  ltrejo@mail.arc.nasa.gov \\
\addr Computational Sciences Division \\
NASA Ames Research Center \\
Moffett Field, CA 94035-1000
}

\def\thefootnote{\fnsymbol{footnote}}
\footnotetext[1]{Current address: NASA Ames Research Center, Mail Stop 269-3, Moffett Field, CA 94035-1000}

\editor{Nello Cristianini, John Shawe-Taylor and Bob Williamson}

\maketitle

\def\thefootnote{\arabic{footnote}}


\begin{abstract}
A family of regularized least squares regression models  in a 
Reproducing Kernel Hilbert
Space is extended by the kernel partial least squares (PLS) regression model.
Similar to principal components regression (PCR),  PLS is a method 
based on the projection of input (explanatory) variables to the latent variables 
(components). However, in contrast to PCR,
PLS creates the components by modeling the relationship between input  and
output variables while maintaining  most of the information in the 
input variables. PLS 
is useful in situations where the number of explanatory variables
exceeds the number of observations and/or a high level of multicollinearity
among those variables is assumed. Motivated by this fact we will 
provide a kernel PLS algorithm for construction of nonlinear regression 
models in possibly high-dimensional feature spaces.

We give the theoretical description of the kernel PLS algorithm
and we experimentally compare the algorithm with the existing kernel 
PCR and kernel ridge regression techniques.  We will demonstrate that on the data 
sets employed kernel PLS achieves the same results  as kernel
PCR but uses significantly fewer, qualitatively different components.

\end{abstract}


\section{Introduction}
In this paper we will focus our attention on least squares
regression models in a Reproducing Kernel Hilbert
Space (RKHS). The models are derived based on a straightforward connection
between a RKHS and the corresponding feature space representation where
the input data are mapped. In our previous work 
\citep{RosGirTre00,RosGirTreCic01,RosTreCic00}
we proposed the kernel principal components regression (PCR) 
technique and we also made theoretical
and experimental comparison to kernel ridge regression (RR) 
\citep{SauGamVov98,CriTay00}.
In this work we extend the family with a new nonlinear kernel partial 
least squares
(PLS) regression method.

Classical PCR, PLS and RR  techniques  are well known shrinkage estimators
designed to deal with multicollinearity 
\citep[see, e.g.,][]{FraFri93,MonPec92,Jol86}.
The multicollinearity or near-linear dependence of regressors is a 
serious problem
which can dramatically influence the effectiveness of a regression model.
Multicollinearity results in large variances and covariances for the 
least squares
estimators of the regression coefficients. Multicollinearity  can also
produce estimates of the regression coefficients that are too large 
in absolute value.
Thus the values and signs of estimated regression coefficients may 
change considerably
given different data samples.
This effect can lead to a  regression model which fits
the training data reasonably well, but generalizes poorly to new data 
\citep{MonPec92}. This fact is in a very close relation to the 
argument stressed in
  \citep{SmoSchMul98}, where the authors have shown that choosing the 
{\it flattest}
linear regression function\footnote{The {\it flatness} is defined in 
the sense of penalizing
high values of the regression coefficients estimate.}  in a feature space
can,  based on the smoothing properties of the selected kernel 
function,  lead to a
smooth nonlinear function in the input space.

The PLS method \citep{Wol75,WolRuhWolDun84}  has been a  popular 
regression techniques
in its domain of origin---Chemometrics. The method is similar to 
PCR where principal components
determined solely from explanatory variables creates orthogonal, i.e. 
uncorrelated,
input variables in a regression model. In contrast, PLS creates 
orthogonal components by using
the existing correlations between explanatory variables and 
corresponding outputs while also
keeping most of the variance of explanatory variables. PLS has proven 
to be useful in situations when the  number of  observed variables ($N$) is 
significantly greater than the
number of observations ($n$) and high multicollinearity among the 
variables exists.
This situation when $N \gg n$ is common in chemometrics and gave rise 
to the modification of
classical principal component analysis (PCA) and linear PLS methods 
to their kernel variants
\citep{WuMasJon97,RanLinGelWol94,Lew95}. %LinGelWol93,LinGelWol94,Lew95}.
However, rather than assuming a nonlinear transformation into a
feature space of arbitrary dimensionality the authors attempted to
reduce computational complexity in the input space. Motivated by these 
works we propose a more general
nonlinear kernel PLS algorithm.\footnote{In the following, where it is 
clear, we will not
stress this nonlinear essence of the
proposed kernel PLS regression model and will use the kernel PLS notation.}

There exist several nonlinear versions of the PLS model. In 
\citep{Fra90,Fra94} approaches
based on fitting the nonlinear input-output dependency by providing 
the extracted components as inputs 
%to neural networks %additive nonlinear regression models based on 
% artificial neural networks (ANN) or   
to smoothers and spline-based additive nonlinear regression models  
were  proposed.  Another nonlinear PLS model
\citep{Mal95,MalTamMah97} is based on  relatively complicated 
artificial neural network  modeling of nonlinear PCA and consequent  nonlinear PLS.
 From that point our approach differs in the sense that the original
input data are nonlinearly mapped to
a feature space ${\cal F}$ where a linear PLS model is created. Good
generalization properties of the corresponding nonlinear PLS model 
are then achieved by
appropriate estimation of regression coefficients in  ${\cal F}$ and 
by the selection of an
appropriate kernel function. Moreover, utilizing the kernel function 
corresponding to
the canonical dot product in ${\cal F}$ allows us to avoid the nonlinear 
optimization involved in
the above approaches. In fact only linear algebra as simple as in a 
linear PLS regression is
required.

In Section 2 a basic definition of a RKHS and formulation of the 
Representer theorem is given.
Section 3 describes the ``classical''  PLS algorithm.
In Section 4 the kernel PLS method is given. Some of the
properties of kernel PLS are shown using a simple example.
Kernel PCR  and kernel RR
are briefly described in Section 5.
Section 6 describes the used model selection techniques.
The results are given in Section 7.
Section 8 provides a short discussion and concludes the paper.


\section{RKHS and Representer Theorem}
\label{rt}
The common aim of support vector machines, regularization networks,
Gaussian processes and spline methods \citep{Vap98,Gir98,Wil98,Wah90,CriTay00}
is to address the poor generalization properties of existing 
nonlinear regression techniques.
To overcome this problem a regularized formulation of regression is
considered as a variational problem in a RKHS $\cal{H}$
\begin{equation}
\label{rf}
  \min_{f \in {\cal H}}
  R_{reg}(f)=\frac {1}{n} \sum_{i=1}^{n} V(y_i,f({\bf x}_i))  + \xi
  \|f\|_{\cal H}^2 \; .
\end{equation}
We assume a training set of
regressors $\{{\bf x}_i\}_{i=1}^n $ to be a subset of a compact set
  ${\cal X} \subset R^N $ and  $\{ y_i \}_{i=1}^n \in R$ to be a set 
of corresponding outputs.
The solution to the problem (\ref{rf}) was given by
\citet{KimWah71,Wah99} and is known as the
\begin{description}
\item [Representer theorem {\rm (simple case)}:] Let the loss function $V(y_i, f)$ be 
a functional of $f$ which depends on $f$ only pointwise, that is, 
through $\{f({\bf x}_i)\}_{i=1}^{n}$---the values of $f$ at the data points.       
%convex in $\tau$ for each $y_i$. 
Then any solution to the problem: find $f \in {\cal H}$ to 
minimize  (\ref{rf})
has a  representation of the form
\begin{equation}
\label{gf}
    f({\bf x})=\sum_{i=1}^n c_i K({\bf x}_i, {\bf x}) \; ,
\end{equation}
where $\{c_i\}_{i=1}^n \in  R$.
\end{description}
In this formulation $\xi$ is a positive number (regularization coefficient)
  to control the tradeoff between approximating properties and the 
smoothness of $f$.
%  We assume a training set of
%regressors $\{{\bf x}_i\}_{i=1}^n $ to be a subset of a compact set
%  ${\cal X} \subseteq R^N $ and  $\{ y_i \}_{i=1}^n \in R$ to be a set 
%of corresponding outputs.
  $\|f\|_{\cal H}^2$ is a norm (sometimes called ``stabilizer'' in the 
  regularization networks domain) in a RKHS  $\cal H$ %\cite{Aro50}
  defined by the
positive definite kernel $K({\bf x},{\bf y})$; i.e. a symmetric function of two
variables satisfying the
Mercer theorem conditions \citep{Mer09,CriTay00}.
The fact that for any such positive definite kernel
there exists a unique RKHS is well established by the {\it 
Moore-Aronszjan theorem} \citep{Aro50}.
The form
   $K({\bf x},{\bf y})$ has the following {\it reproducing property}
  $$
f({\bf y})= \langle f({\bf x}), K({\bf x}, {\bf y}) \rangle_{\cal H} 
\;\;\;\;  \forall f \in {\cal H} \; ,
  $$
where $\langle . , . \rangle_{\cal H}$ is the scalar product in ${\cal H}$. The
function $K$ is called {\it a reproducing kernel} for~${\cal H}$.

It follows from Mercer's theorem  that each positive definite kernel
$K({\bf x}, {\bf y})$ defined on a compact domain ${\cal X} \times {\cal X}$ can by written in the form
\begin{equation}
\label{kd}
K({\bf x}, {\bf y})=\sum_{i=1}^M \lambda_i \phi_i({\bf x})\phi_i({\bf 
y}) \;\;\;\; M \le \infty \; ,
\end{equation}
where $\{\phi_i(.)\}_{i=1}^M$ are the eigenfunctions of the integral operator
${\bf \Gamma}_K : L_2({\cal X}) \rightarrow L_2({\cal X})$
$$
({\bf \Gamma}_K f)({\bf x})=\int_X K({\bf x},{\bf y})f({\bf y})d{\bf 
y} \;\;\;\; \forall f \in L_2({\cal X})
$$
and $\{\lambda_i >  0 \}_{i=1}^M$ are the corresponding positive eigenvalues.
The sequence  $\{\phi_i(.)\}_{i=1}^M$ creates an orthonormal basis of 
${\cal H}$ and we can express any
function $f \in {\cal H}$ as $f({\bf x})=\sum_{i=1}^{M} a_i 
\phi_i({\bf x})$ for some  $a_i \in R$.
This allows us to define a scalar product in ${\cal H}$:
$$
\langle f({\bf x}), h({\bf x}) \rangle_{\cal H} =
\langle \sum_{i=1}^{M}{a_i}\phi_i({\bf x}), 
\sum_{i=1}^{M}{b_i}\phi_i({\bf x}) \rangle_{\cal H} \equiv
\sum_{i=1}^{M}\frac{a_i b_i}{\lambda_i}
$$
and the norm $\|f\|_{\cal H}^2=\sum_{i=1}^{M}\frac{a_i^2}{\lambda_i}.$

Rewriting  (\ref{kd}) in the form
%\begin{*equation}
%\label{kf}
$$
K({\bf x}, {\bf y})=\sum_{i=1}^M \sqrt{\lambda_i} \phi_i({\bf x}) 
\sqrt {\lambda_i} \phi_i({\bf y}) =
(\Phi({\bf x}).\Phi({\bf y})) =\Phi({\bf x})^T \Phi({\bf y}) \; ,
$$
%\end{equation}
it becomes clear that any kernel $K({\bf x},{\bf y})$ also
corresponds to a canonical (Euclidean) dot product in a possibly high-dimensional 
space ${\cal F}$
where the input data are mapped by
$$
%\begin{equation}
%\label{f_map}
\begin{array}{ll}
\Phi:  &  {\cal X } \rightarrow {\cal F} \\
      & {\bf x} \rightarrow
          (\sqrt{\lambda_1}\phi_1({\bf x}),\sqrt{\lambda_2}\phi_2({\bf 
x}), \dots, \sqrt{\lambda_M}\phi_M({\bf x})) \;.
\end{array}
%\end{equation}
$$
  The space ${\cal F}$
  is usually denoted as a {\it feature space} and
  $\{\{\sqrt{\lambda_i}\phi_i({\bf x})\}_{i=1}^M, {\bf x } \in {\cal 
X}\}$ as {\it feature mappings}.
The number of  basis functions $\phi_i(.)$
also defines  the dimensionality of ${\cal F}$.
  It is worth noting, that
we can also construct a RKHS and a corresponding feature space by 
choosing a sequence of linearly
independent functions (not necessary orthogonal) $\{ \psi_i({\bf 
x})\}_{i=1}^M$ and
positive numbers $\alpha_i$ to define a series (in the case of 
$M=\infty$ absolutely and
uniformly convergent)
$ K({\bf x}, {\bf y})=\sum_{i=1}^M \alpha_i \psi_i({\bf x})\psi_i({\bf y}).$
This also gives the connection between the RKHS and  Gaussian 
processes  where the $K$ is
assumed to represent the correlation function of a zero-mean Gaussian process
evaluated at points ${\bf x}$ and ${\bf y}$ \citep{Wah90}.

Until now, we assumed that $K$ is a positive definite kernel. However, the
above results can be extended even for the case when $K$ is a positive
semidefinite. Is such a case a RKHS ${\cal H}$ contains a subspace of functions
$f$ with a zero norm $\|f\|_{\cal H}^2$ (the null
space). \citet{KimWah71}
showed that in such a case the solution of (\ref{rf}) leads to a more general
form of the Representer theorem:
$$
%\begin{equation}
%\label{gf1}
   f({\bf x})=\sum_{i=1}^n c_i K({\bf
   x}_i, {\bf x}) + \sum_{j=1}^{l} b_j \upsilon_j({\bf x}) \; ,
%\end{equation}
$$
where the functions $\{\upsilon_j(.)\}_{j=1}^l$ span the null
space of ${\cal H}$ and the coefficients $\{c_i\}_{i=1}^n$, 
$\{b_j\}_{j=1}^{l}$ are again given by
the data. In  this paper we will consider only  the case when
$l=1$ and $\upsilon_1({\bf x})=const \;\; \forall \, {\bf x}$.

\section{Partial Least Squares Regression}

PLS regression is a technique for  modeling  a linear relationship between a
set of output variables (responses) $\{{\bf y}_i\}_{i=1}^{n} \in R^L$
and a set of input variables (regressors) $\{{\bf x}_i\}_{i=1}^{n} \in R^N$.
In the first step, PLS creates uncorrelated latent variables which 
are linear combinations
of the original regressors.
The basic point of the procedure is that the weights
used to determine these linear combinations of the original
regressors are proportional to the covariance among input and output
variables \citep{Hel88}.
A least squares regression is then performed on the subset of
extracted latent variables. This leads to a biased but lower variance 
estimate of the
regression coefficients comparing to the Ordinary Least Squares (OLS) 
regression.


In the following {\bf X} will represent the $(n \times N)$ matrix of $n$ 
inputs  and ${\bf Y}$
will stand for the $(n \times L)$ matrix of the corresponding  
$L$-dimensional responses.
Further we assume centered input and output variables;
i.e. the columns of {\bf X} and {\bf Y} are zero mean.

There exist several different modifications 
\citep[see][]{MarNae89,Man87,Hel88,Jon93}
of the basic algorithm for PLS regression originally developed by 
\citet{Wol75}. In its
basic form a special case of the nonlinear iterative partial least 
squares (NIPALS)
algorithm \citep{Wol66} is used. NIPALS is a robust procedure for solving 
singular value decomposition  problems and is closely related to the power
method \citep{GolLoa96}. After an initial random estimate of the 
latent vector ${\bf t}$ the
following two steps are repeated until convergence of ${\bf t}$ and
the loadings vector ${\bf p}$:
\begin{enumerate}
\item ${\bf p}={\bf X}^T{\bf t}$
\item ${\bf t}={\bf X}{\bf p}, \,   {\bf t} \leftarrow {\bf t}/\|{\bf t}\| \; .$
\end{enumerate}
  After the extraction of ${\bf t}$ and ${\bf p}$
vectors the matrix ${\bf X}$ is deflated by ${\bf t}$
$${\bf X} \leftarrow {\bf X}-{\bf t}{\bf t}^T{\bf X}$$
and by repeating the whole procedure we may extract a new pair of
vectors ${\bf t}$ and ${\bf p}$ which are by construction orthogonal
to the previous one. It is worth noting that in the case that $N < n$ 
the normalization of the
$N$-dimensional vector ${\bf p}$ after the first step is computationally
advantageous in comparison to the normalization of the $n$-dimensional
vector ${\bf t}$. However, the normalization of  ${\bf t}$ allows
us to adapt the NIPALS algorithm to extract  the latent
vectors from the kernel matrices ${\bf XX}^T$ \citep{Lew95}:
\begin{enumerate}
\item ${\bf p}={\bf XX}^T{\bf t}$
\item ${\bf t}={\bf p}, \,   {\bf t} \leftarrow {\bf t}/\|{\bf t}\| \; .$
\end{enumerate}
The deflation of the ${\bf XX}^T$  matrix is given by
$${\bf XX}^T \leftarrow ({\bf X}-{\bf t}{\bf t}^T{\bf X})({\bf 
X}-{\bf t}{\bf t}^T{\bf X})^T \; .$$

\citet{WolRuhWolDun84} applied the NIPALS algorithm to the PLS
regression  with the aim to sequentially extract the latent vectors
${\bf t}, {\bf u}$ and weight vectors ${\bf w},{\bf c}$ from ${\bf X }$
and ${\bf Y}$  matrices in decreasing order of their corresponding
singular values.
What follows is a modification of the ``classical''  NIPALS-PLS algorithm in
the sense that normalization of 
the latent  vectors ${\bf t},{\bf u}$ rather than normalization of the vectors of weights
${\bf w},{\bf c}$ is used \citep{Lew95}:
\begin{enumerate}
\item randomly initialize ${\bf u} $
\item ${\bf w}={\bf X}^T{\bf u}$
\item ${\bf t}={\bf X}{\bf w}, \,  {\bf t} \leftarrow {\bf t}/\|{\bf t}\|$
\item ${\bf c}={\bf Y}^T{\bf t}$
\item ${\bf u}={\bf Y}{\bf c}, \,  {\bf u} \leftarrow {\bf u}/\|{\bf u}\|$
\item {\rm repeat steps 2. -- 5. until convergence}
\item {\rm deflate {\bf X},{\bf Y} matrices}: ${\bf X} \leftarrow 
{\bf X}-{\bf t}{\bf t}^T{\bf X}, \;
{\bf Y} \leftarrow {\bf Y} - {\bf t}{\bf t}^T{\bf Y} \; .$
\end{enumerate}
The PLS regression is an iterative process; i.e. after extraction of one
component the algorithm starts again using the deflated matrices
${\bf X}$ and {\bf Y} computed in step 7. Thus we can achieve the 
sequence of the models
up to the point when the rank of {\bf X} is reached.
However, in practice the cross-validation technique is usually used to avoid
underfitting or overfitting caused by the use of too small or too 
large dimensional  models.
After the extraction of the $p$ components we can create the $(n \times 
p)$ matrices {\bf T},
{\bf U}, the $(N \times p)$ matrix ${\bf W }$ and the $(L \times p)$ matrix 
${\bf C}$  consisting of the
columns created by the vectors $\{{\bf t}_i\}_{i=1}^p$,
$\{{\bf u}_i\}_{i=1}^p$, $\{{\bf w}_i\}_{i=1}^p$ and $\{{\bf 
c}_i\}_{i=1}^p$, respectively,
extracted during the individual iterations.

The PLS regression model
can be written in matrix form as \citep{Man87,RanLinGelWol94}
$${\bf Y}={\bf XB} + {\bf F} \; , $$
where ${\bf B}$ is an $(N \times L)$ matrix of the of the regression 
coefficients
and ${\bf F}$ is an $(n \times L)$ matrix of residuals.
This equation is identical to that used in other regression models;
multiple linear regression, ridge regression and   principal components
regression, however, in contrast to these models the  matrix ${\bf B}$
has the form \citep{Man87,RanLinGelWol94}
\begin{equation}
\label{bb}
{\bf B}={\bf W}({\bf P}^T{\bf W})^{-1}{\bf C}^T \; ,
\end{equation}
where ${\bf P}$ is the $(N \times p)$ matrix consisting of loadings vectors
$\{{\bf p}_i={\bf X}^T{\bf t}_i/({\bf t}_i^T{\bf t}_i)\}_{i=1}^p$.
Due to the fact that ${\bf p}_i^T{\bf w}_j = 0 \; {\rm for } \; i > 
j$ and in general
${\bf p}_i^T{\bf w}_j \neq  0 \; {\rm for } \; i < j$ \citep{Hos88} the matrix 
${\bf P}^T{\bf W}$ is upper
triangular and thus invertible.
Moreover, using the fact that ${\bf t}_i^T{\bf t}_j=0 \; {\rm for } 
\; i \neq j $  and
   ${\bf t}_i^T{\bf u}_j=0 \; {\rm for } \; j > i $
   \citet{RanLinGelWol94} derived the
  following equalities$\,$\footnote{In our case ${\bf T}^T{\bf T}$ is the
 $p$-dimensional identity matrix. This is simply a consequence of normalization of the 
  individual latent vectors $\{{\bf t}_i\}_{i=1}^p$.}
  \begin{equation}
\label{ww}
{\bf W }={\bf X}^T{\bf U}
\end{equation}
\begin{equation}
{\bf P}={\bf X}^T{\bf T}({\bf T}^T{\bf T})^{-1}
\end{equation}
\begin{equation}
\label{cc}
{\bf C}={\bf Y}^T{\bf T}({\bf T}^T{\bf T})^{-1} \; .
\end{equation}
Substituting (\ref{ww}--\ref{cc}) into (\ref{bb}) and using the
orthogonality of the ${\bf T}$ matrix columns
we can write the matrix {\bf B} in the following form
\begin{equation}
\label{BB}
{\bf B}={\bf X}^T{\bf U}({\bf T}^T{\bf XX}^T{\bf U})^{-1}{\bf T}^T 
{\bf Y}\; .
\end{equation}
It is worth noting that different scalings of
the individual latent 
%and loadings 
vectors  $\{{\bf t}_i\}_{i=1}^p$ and
$\{{\bf u}_i\}_{i=1}^p$
%, respectively, 
do not influence this estimate
of the matrix ${\bf B}$.

\section{Kernel Partial Least Squares Regression in RKHS }
\label{skpls}
Assume a nonlinear transformation of the input variables $\{{\bf 
x}_i\}_{i=1}^n$
into a  feature space ${\cal F}$; i.e. mapping
$\Phi: {\bf x}_i \in R^N \rightarrow \Phi({\bf x}_i) \in {\cal F}$. 
Our goal is to
construct a linear PLS regression model in ${\cal F}$. Effectively it 
means that
we can obtain a nonlinear regression model in the space of the
original input variables.
Denote by ${\bf \Phi}$  an $(n \times M)$ matrix of regressors whose
$i$-th row is the vector $\Phi({\bf x}_i)$. Depending on the nonlinear
transformation $\Phi(.)$ the feature space can by high-dimensional, even
infinite dimensional when the Gaussian kernel function is used. However, in
practice we are working only with $n$ observations and we have to restrict
ourself to finding the solution of the linear regression problem in the span of
the points $\{\Phi({\bf x}_i)\}_{i=1}^{n}$. This situation is 
analogous to the case
when the input data matrix ${\bf X }$ has more columns than rows; 
i.e. we are dealing with
more variables than measured objects. This motivated 
\citet{RanLinGelWol94} to introduced
the (input space) kernel PLS algorithm to speed up the computation of
the components for a linear PLS model.
The idea is to compute the  components from the ($n \times n$)
${\bf XX}^T$ matrix rather than  the ($N \times N$) ${\bf X}^T{\bf X}$ 
matrix when $n \ll N$. The same
approach can be also used for the computation of the principal components
\citep[see][]{WuMasJon97} or the nonlinear version \citep{SchSmoMul98}.

Now, motivated by the theory of RKHS described in Section \ref{rt} we 
derive the algorithm for
the (nonlinear) kernel PLS model. From the previous section we can see that
by the connection of 2. and 3. step  and by using the ${\bf \Phi}$ 
matrix of mapped
input data we can modify the NIPALS-PLS algorithm into the form
\begin{enumerate}
\item randomly initialize ${\bf u} $
\item ${\bf t}={\bf \Phi\Phi}^T{\bf u}, \,  {\bf t} \leftarrow {\bf 
t}/\|{\bf t}\|$
\item ${\bf c}={\bf Y}^T{\bf t}$
\item ${\bf u}={\bf Y}{\bf c}, \,  {\bf u} \leftarrow {\bf u}/\|{\bf u}\|$
\item {\rm repeat steps 2. -- 5. until convergence}
\item {\rm deflate ${\bf \Phi\Phi}^T$,{\bf Y} matrices}:
${\bf \Phi\Phi}^T \leftarrow ({\bf \Phi}-{\bf t}{\bf t}^T{\bf 
\Phi})({\bf \Phi}-{\bf t}{\bf t}^T{\bf \Phi})^T,$ $\; {\bf Y} \leftarrow {\bf Y} - {\bf t}{\bf t}^T{\bf Y} \; .$ 
%\\
%  \hspace*{4.3cm} ${\bf Y} \leftarrow {\bf Y} - {\bf t}{\bf t}^T{\bf Y}$
\end{enumerate}
Applying the so-called ``kernel trick''; i.e. the fact that
$\Phi({\bf x}_i)^T\Phi({\bf x}_j)=K({\bf x}_i,{\bf x}_j)$,  we can 
see that  ${\bf \Phi\Phi}^T$
represents the $(n \times n)$ {\it kernel Gram matrix} {\bf K} of the 
cross dot products between all
mapped input data points  $\{\Phi({\bf x}_i)\}_{i=1}^{n}$. Thus, 
instead of an explicit nonlinear mapping,
the kernel function can be used.
The deflation of the ${\bf \Phi\Phi}^T={\bf K}$ matrix after 
extraction of the ${\bf t}$ component
is now given by
\begin{equation}
\label{def}
{\bf K} \leftarrow ({\bf I}-{\bf t}{\bf t}^T){\bf K}({\bf I}-{\bf 
t}{\bf t}^T)={\bf K}-{\bf t}{\bf t}^T{\bf K}-
{\bf K}{\bf t}{\bf t}^T+{\bf t}{\bf t}^T{\bf K}{\bf t}{\bf t}^T \; ,
\end{equation}
where ${\bf I}$ is an $n$-dimensional identity matrix.
We would like to point out that a similar kernel PLS algorithm can be
also derived by the nonlinear modification of the (linear) kernel PLS
algorithm described in \citep{RanLinGelWol94}. This modification leads
to the extraction of the ${\bf t},{\bf u}$ components from the ${\bf
  KYY}^T$ and ${\bf YY}^T$ matrices, however this approach can be more 
fruitful when
the multivariate kernel PLS model is assumed ($L > 1 $) as compared to the 
NIPALS-PLS algorithm described above.

Similarly we can see that the matrix of the regression coefficients 
${\bf B}$ (\ref{BB}) will have the form
\begin{equation}
\label{BK}
{\bf B}={\bf \Phi}^T{\bf U}({\bf T}^T{\bf K}{\bf U})^{-1}{\bf T}^T {\bf Y}
\end{equation}
and to make prediction on training data we can write
\begin{equation}
\label{YTR}
{\bf \hat Y}={\bf \Phi}{\bf B}={\bf K}{\bf U}({\bf T}^T{\bf K}{\bf 
U})^{-1}{\bf T}^T {\bf Y} = {\bf T}{\bf T}^T {\bf Y}\; ,
\end{equation}
where the last equality follows from the fact that the matrix of the 
components ${\bf T}$  may be
expressed as  ${\bf T}={\bf \Phi R}$ where
${\bf R}= {\bf \Phi}^T{\bf U}({\bf T}^T{\bf K}{\bf U})^{-1}$ 
\citep{Jon93,Hel88}.
It is important to stress that during the
iterative process of the estimation of the components $\{{\bf 
t}_i\}_{i=1}^p$ we made the
deflation of the ${\bf K}$ matrix after each step. Effectively it 
means that ${\bf T } \neq  {\bf KU}$.
Thus, for predictions made on testing points $\{{\bf 
x}_i\}_{i=n+1}^{n+n_{t}}$ the matrix of regression
coefficients (\ref{BK}) have to be used; i.e.
\begin{equation}
\label{YTS}
{\bf \hat Y}_t={\bf \Phi}_t{\bf B}={\bf K}_t{\bf U}({\bf T}^T{\bf 
K}{\bf U})^{-1}{\bf T}^T {\bf Y}\; ,
\end{equation}
where ${\bf \Phi}_t$ is the matrix of the mapped testing points
and consequently  ${\bf K}_t$ is the $(n_t \times n)$ ``test'' matrix
whose elements are ${\rm K}_{ij}=K({\bf x}_i,{\bf x}_j)$ where 
$\{{\bf x}_i\}_{i=n+1}^{n+n_{t}}$ and
$\{{\bf x}_j\}_{j=1}^n$ are the testing and training points, respectively.

At the beginning of the previous section  we assumed a centralized 
PLS regression problem.
To centralize the mapped data in a feature space ${\cal F}$ we can 
simply applied the following
procedures \citep{SchSmoMul98,WuMasJon97a}
\begin{equation}
\label{cK}
{\bf K}=({\bf I} - \frac{1}{n}{\bf 1}_n{\bf 1}_n^T){\bf K}({\bf I} - 
\frac{1}{n}{\bf 1}_n{\bf 1}_n^T)
\end{equation}
\begin{equation}
\label{cKt}
{\bf K}_t=({\bf K}_t - \frac{1}{n}{\bf 1}_{n_t}{\bf 1}_n^T{\bf 
K})({\bf I} - \frac{1}{n}{\bf 1}_n{\bf 1}_n^T) \; ,
\end{equation}
where {\bf I } is again an $n$-dimensional identity matrix and ${\bf 
1}_n$, ${\bf 1}_{n_t}$
represent the vectors whose elements are ones, with length $n$ and 
$n_t$, respectively.

In conclusion we would like to make several remarks about the interpretation
of the kernel PLS model. For simplicity we will consider the univariate
kernel PLS regression case ($L=1$) and we denote the $(n \times 1)$ vector
${\bf d}={\bf U}({\bf T}^T{\bf K}{\bf U})^{-1}{\bf T}^T {\bf Y}$.
Now we can represent the solution of the kernel PLS regression as
$$
f({\bf x},{\bf d})=\sum_{i=1}^n d_i K({\bf x},{\bf x}_i) \; ,
$$
which agrees with the solution of the regularized formulation of 
regression (\ref{gf}) given by
the Representer theorem in Section \ref{rt}. Using equation 
(\ref{YTR}) we may also
interpret the kernel PLS model as a linear regression model of the form
\citep[for more detailed interpretation of linear PLS models we refer 
the reader to][]{Gar94,Hos88}
$$
f({\bf x},{\bf c})= c_1 t_1({\bf x}) + c_2 t_2({\bf x}) + \dots + c_p 
t_p({\bf x})=
  {\bf c}^T{\bf t}({\bf x}) \; ,
$$
where the $\{t_i({\bf x})\}_{i=1}^{p}$ are the projections of the data point
${\bf x}$ onto the extracted $p$ components and  ${\bf c}$ is the 
vector of weights given by (\ref{cc}).


\subsection{Example}
\label{sex}
In this section we would like to demonstrate some of the 
properties of the  kernel PLS
when applied to approximating the 
%regression on approximation  
$sinc(x)$ function
defined as
$$
f({x})=sinc(x)=\frac{sin|x|}{|x|} \; .
$$
We generated 100 uniformly spaced samples in the range $[-10,10]$ and 
computed the corresponding
values of the $sinc(.)$ function which were subsequently centralized. 
We used the Gaussian kernel function
with width equal to 1.
In Figure \ref{sinc_proj} the first 3 principal components (left) 
computed from the centralized Gram matrix {\bf K}
are compared with the components extracted by the proposed kernel PLS 
algorithm (right). We can see the qualitative
difference between the principal components and components extracted 
by kernel PLS where also
the correlations  between the input and output data are used.
\begin{figure}[]
   \centerline{\psfig{figure=sinc_proj.ps,width=155mm,height=60mm,angle=0}}
   \caption{\label{sinc_proj}\small  First three principal components (PC) 
extracted by kernel PCA (left) and
   components (C) extracted by kernel PLS (right). The curves represents 
the (principal)  component values.
   {\it Sinc} function is shown as a solid line.}
\end{figure}

In the next step we added the white Gaussian noise with  standard deviation 0.2
to the outputs. This corresponds to the ratio between the standard 
deviation of noise
and signal equal to 56\%.
We generated an additional 80 uniformly spaced testing samples from 
the same range $[-10,10]$,
however not identical to the previous ones. Using these data points 
we computed noise-free outputs.
The results obtained on training and testing parts of the data are 
depicted in Figure
\ref{sinc_n2_tr_ts}. We can see that although the kernel PLS method
fits more precisely noisy training data by the appropriate selection 
of the components we can
avoid the overfitting effect and achieve the same performance on 
the testing set as with the
kernel PCR method. The number of components in the kernel PLS case is 
significantly smaller.
Using the number of components on which minimum test error occurred, 
in Figure~\ref{sinc_noise_tr},
we plotted  approximating functions computed on noisy training
data. We achieved qualitatively similar results in the case when the
Gaussian noise with standard deviation equal to 0.05 and 0.1 was added to
the outputs.
\begin{figure}[]
   \centerline{\psfig{figure=sinc_n2_tr_ts.ps,width=155mm,height=60mm,angle=0}}
   \caption{\label{sinc_n2_tr_ts}\small $Sinc$ function approximation. 
Dependence of the training and
   testing error of kernel PLS (solid line) and kernel PCR (dashed line) 
on the number of extracted components.
   Gaussian noise with standard deviation equal to  0.2 was added to the 
training data set outputs. Error is
   evaluated in terms of mean squared error (MSE).}
\end{figure}
\begin{figure}[]
   \centerline{\psfig{figure=sinc_n2_tr.ps,width=148mm,height=80mm,angle=0}}
   \caption{\label{sinc_noise_tr}\small Comparison of kernel PLS (solid 
line) and kernel PCR (dash-dotted line) on noisy
   $sinc$ function approximation. The number of used (principal) 
components was selected based on
   plots  in Figure  \ref{sinc_n2_tr_ts} (right) and was 1  and 10 in 
kernel PLS and kernel PCR, respectively.
   Gaussian noise with standard deviation equal to 0.2 was added to the
   data set outputs (dots). The true function is shown dashed.}
   \end{figure}


\section{Kernel PCR and  Kernel RR in RKHS}
In this section, we briefly describe and compare another two 
kernel-based regularized
least squares models; kernel PCR and kernel RR. However, we start with
the kernel PCA method for the extraction of nonlinear principal components
used in the kernel PCR model.

\subsection{Kernel PCA}
The PCA problem in a high-dimensional feature space $\cal{F}$ can be
formulated as
the diagonalization of an $n$-sample estimate of the covariance matrix
$$
%\begin{equation}
%\label{ckpca}
   {\bf\hat C}= \frac{1}{n}\sum_{i=1}^n\Phi({\bf x}_i)\Phi({\bf 
x}_i)^T = \frac{1}{n}{\bf \Phi}^T {\bf \Phi} \;,
%\end{equation}
$$
\noindent where ${\bf \Phi}$ now denotes the  $(n \times M)$ matrix
of the  centered nonlinear mappings of the
input variables $\{{\bf x}_i\}_{i=1}^n \in R^N$.
The diagonalization represents a transformation of the original data 
to new coordinates
defined by orthogonal eigenvectors ${\bf u}$.
We have to find eigenvalues $\lambda \ge 0$ and
  non-zero eigenvectors ${\bf u} \in \cal{F}$ satisfying the eigenvalue
equation
$$
  \lambda {\bf u}= {\bf\hat  C} {\bf u} \;.
$$
\noindent
When $n \ll M$,  similar to linear kernel PCA \citep[see, e.g.,][]{WuMasJon97}, 
we may transform this eigenvalue problem
to the problem of diagonalization of the $(n \times n)$ matrix   ${\bf 
\Phi} {\bf \Phi}^T={\bf K}$; i.e. solving the 
eigenvalue problem as described in \citep{SchSmoMul98,Sir87}
\begin{equation}
\label{evp}
{\bf K}{\bf \tilde u} =  n \lambda {\bf \tilde u}=  {\tilde  \lambda} 
{\bf \tilde u} \;.
\end{equation}
\noindent
The eigenvectors $\{{\bf u}^k\}_{k=1}^n$ are then given by
$$
{\bf u}^k=(n \lambda_k)^{-1/2}{\bf \Phi}^T {\bf \tilde u}^k = {\tilde 
\lambda_k}^{-1/2}{\bf  \Phi}^T {\bf \tilde u}^k \; ,
$$
\noindent where ${\bf \tilde u}^k$ is the $k$-th principal component
extracted by solving (\ref{evp}) and $n \lambda_k = {\tilde 
\lambda_k}$ the corresponding eigenvalue.
Finally, we can compute the projection of $\Phi({\bf x})$ onto the $k$-th
nonlinear principal component by
\begin{equation}
\label{com}
\beta_k({\bf x}) = \Phi({\bf x})^T {\bf u}^k ={\tilde 
\lambda_k}^{-1/2}\sum_{i=1}^n
{\tilde u}_i^k K({\bf x}_i,{\bf x}) \; .
\end{equation}
Re-writing this projection into the matrix form
we can write for the projection of training data points  $\{{\bf 
x}_i\}_{i=1}^{n}$
\begin{equation}
\label{tP}
{\bf P}={\bf \Phi}{\bf \Phi}^T{\bf \tilde U}{\bf \Lambda}^{-1/2}={\bf
   K}{\bf \tilde U}{\bf \tilde \Lambda}^{-1/2}=
{\bf \tilde U}{\bf \tilde \Lambda}^{1/2} \; ,
\end{equation}
where the columns of ${\bf \tilde U}$ are created by the eigenvectors
$\{{\bf \tilde u}^i\}_{i=1}^{n}$
and ${\bf \tilde \Lambda}$ is a diagonal matrix $diag({\tilde
   \lambda_1}, {\tilde \lambda_2}, \dots , {\tilde \lambda_n})$.
Similarly  for the projection of testing points $\{{\bf 
x}_i\}_{i=n+1}^{n+n_t}$ we have
$$
{\bf P}_t={\bf \Phi}_t{\bf \Phi}^T{\bf \tilde U}{\bf \tilde \Lambda}^{-1/2}=
{\bf K}_t{\bf \tilde U}{\bf \tilde \Lambda}^{-1/2} \;.
$$
Note that the assumption of the centralized nonlinear mappings is again 
transformed to the
``centralization'' of the ${\bf K}$ and ${\bf K}_t$  matrices given by 
(\ref{cK}) and (\ref{cKt}),
respectively.


\subsection{Kernel PCR}
Consider the standard regression model in feature space  ${\cal F}$
\begin{equation}
\label{gm}
{\bf y}= {\bf \Phi} {\bf w} + \mbox{\boldmath$\epsilon$} \; ,
\end{equation}
where ${\bf y}$ is a vector of $n$ observations of the dependent variable,
${\bf \Phi}$ now represents an $(n \times M)$ matrix of zero-mean 
regressors $\{\Phi({\bf x}_i)\}_{i=1}^n$,
${\bf w}$ is a vector  of regression coefficients and
$\mbox{\boldmath$\epsilon$}$ is the vector of error
terms whose elements have equal variance $\sigma^2$, and are 
independent of each other.
  ${\bf \Phi}^T{\bf \Phi}$ is proportional to  the sample
covariance matrix and kernel PCA can be performed to extract
its $M$ eigenvalues $\{\tilde \lambda_i\}_{i=1}^M$  and
corresponding eigenvectors $\{{\bf u}^i\}_{i=1}^M$\footnote{For the 
moment, we are theoretically assuming
that $n > M$.  Otherwise we have to deal with a singular case ($ n 
\le  M $) allowing us to extract only up to
$n-1$ eigenvectors corresponding to non-zero eigenvalues.} 
(\ref{evp}). Having the eigensystem
$\{{\tilde  \lambda}_i,{\bf u}^i\}_{i=1}^{M}$ the
spectral decomposition \citep{Jol86} of ${\bf \Phi}^T{\bf \Phi}$ has the form
$$
{\bf \Phi}^T{\bf \Phi}=\sum_{i=1}^{M}{\tilde  \lambda}_i {\bf u}^i 
({\bf u}^i)^T \;.
$$
The projection of $\Phi({\bf x})$ onto the $k$-th
nonlinear principal component is given by (\ref{com}).
By projection of all original regressors onto the principal components
we can rewrite (\ref{gm}) as
\begin{equation}
\label{tgm}
{\bf y}={\bf B}{\bf v}  + \mbox{\boldmath$\epsilon$} \;,
\end{equation}
where ${\bf B}={\bf \Phi U}$ is now an  $( n \times M)$ matrix of 
transformed regressors and
${\bf U}$ is an $( M \times M)$ matrix whose $k$-th column is the eigenvector
${\bf u}^k$. The columns of the matrix ${\bf B}$ are now orthogonal and  the
least squares estimate of the coefficients ${\bf v}$ becomes
$$
%\begin{equation}
%\label{ce}
\hat{\bf v}=({\bf B}^T{\bf B})^{-1}{\bf B}^T{\bf y}=
{\mbox{\boldmath$\tilde \Lambda$}}^{-1}
{\bf B}^T{\bf y} \;,
%\end{equation}
$$
where
  $\mbox{\boldmath$\tilde  \Lambda$}={\rm diag}({\tilde
  \lambda}_1,{\tilde  \lambda}_2, \dots , {\tilde  \lambda}_M)$.
It is worth noting that PCR, as well as other biased regression 
techniques, is not
invariant to the relative scaling of the original regressors 
\citep{FraFri93}. However, similar to
OLS regression,  the solution of (\ref{tgm}) 
does not depend on a possibly
different scaling in individual eigendirections used in the kernel 
PCA transformation.
Further, the results obtained using  all principal components for the 
projection of
the original regressor variables for (\ref{tgm}) is equivalent to that
obtained by least squares using the original regressors.
  In fact we can express the estimate ${\bf \hat  w}$ of the original model
(\ref{gm}) as
$$
{\bf \hat  w}={\bf U}{\bf\hat v}=
{\bf U}({\bf B}^T{\bf B})^{-1}{\bf B}^T{\bf y}=
({\bf \Phi}^T{\bf \Phi})^{-1}{\bf \Phi}^T{\bf y }
=\sum_{i=1}^M{\tilde \lambda}_i^{-1}{\bf u}^i({\bf u}^i)^T{\bf \Phi}^T{\bf y}
$$
and
its corresponding variance-covariance matrix \citep{Jol86} as
\begin{equation}
\label{c1}
cov({\bf \hat w})=\sigma^2{\bf U}({\bf B}^T{\bf B})^{-1}{\bf U}^T
=\sigma^2{\bf U}
\mbox{\boldmath$\tilde \Lambda$}^{-1}{\bf U}^T=
\sigma^2\sum_{i=1}^M{\tilde \lambda}_i^{-1}{\bf u}^i({\bf u}^i)^T \; ,
\end{equation}
where we used the fact that ${\bf y} \sim {\cal N}({\bf \Phi 
w},\sigma^2{\bf I})$.
Similarly to PLS, to avoid the problem of multicollinearity,  PCR uses only
some of the principal components. It is clear from (\ref{c1}) that 
the influence  of
small eigenvalues can significantly increase the overall variance
of the estimate.  PCR  simply deletes the principal components 
corresponding to small values of
the eigenvalues ${\tilde \lambda}_i$.
The penalty we have to pay for the decrease in variance of the regression
coefficient estimate  is bias in  the final estimate. However, if 
multicollinearity
is a serious  problem the introduced bias can have a less 
significant effect than a high variance estimate. If the elements of
${\bf v}$ corresponding to deleted regressors are
zero, an unbiased estimate is achieved \citep{Jol86}.

Using  the first $p$ nonlinear principal components (\ref{com})
to create a linear model based on orthogonal regressors in feature 
space ${\cal F}$
we can  formulate the kernel PCR  model as \citep{RosGirTre00,RosGirTreCic01}
$$
%\begin{equation}
%\label{gs}
f({\bf x},{\bf a})=\sum_{k=1}^{p} v_k\beta_k({\bf x}) + b =
\sum_{k=1}^{p}v_k\sum_{i=1}^{n}{\tilde \lambda}_k^{-1/2}{\tilde 
u}_i^k K({\bf x}_i,{\bf x})=
\sum_{i=1}^n a_i K({\bf x}_i,{\bf x}) + b \; ,
%\end{equation}
$$
where $\{a_i=\sum_{k=1}^p v_k {\tilde \lambda}_k^{-1/2}{\tilde 
u}_i^k\}_{i=1}^n$ and $b$ is a bias term.
Similar to kernel PLS and kernel RR (see below)
we can assume a centralized regression model leading to a zero bias term $b$.

We have shown that by removing the principal components whose
variances are very small we can eliminate large variances of the 
estimate due to
multicollinearities. However, if the orthogonal regressors 
corresponding to those principal components
have a large correlation with the dependent variable $y$ such 
deletion is undesirable
\citep[experimentally  demonstrated  in][]{RosTreCic00}.
There are several different strategies for selecting the appropriate 
orthogonal regressors for
the final model \citep[see][and references therein]{Jol86,Jol82}.
In Section \ref{smod_sel} we discuss approaches used in our experiments.



\subsection{Kernel Ridge Regression}
Kernel RR  is another technique to deal with multicollinearity by
assuming  the linear regression model (\ref{gm}) whose solution is 
now achieved by
minimizing
\begin{equation}
\label{RR}
   R_{rr}({\bf w})=
\sum_{i=1}^{n}(y_i-f({\bf x}_i,{\bf w}))^2 +
\xi \|{\bf w}\|^2 \;,
\end{equation}
\noindent where  $f({\bf x},{\bf w})= {\bf w}^T\Phi({\bf x})$ and  $\xi$
is a regularization coefficient.
  The least squares estimate of
${\bf w}$ is
$$
{\bf \hat w}=({\bf \Phi}^T{\bf \Phi} + \xi{\bf I})^{-1}{\bf 
\Phi}^T{\bf y} \; ,
$$
which is  biased but has lower variance compared to an OLS estimate.
To make the connection to the kernel PCR case we express the estimate 
${\bf \hat w}$ in
the eigensystem $\{{\tilde \lambda}_i,{\bf u}^i\}_{i=1}^{M}$
$$
{\bf \hat w}=\sum_{i=1}^M({\tilde \lambda}_i+\xi)^{-1}{\bf u}^i({\bf 
u}^i)^T{\bf \Phi}^T{\bf y}
$$
and corresponding variance-covariance matrix as \citep{Jol86}
$$
cov({\bf \hat w})=\sigma^2\sum_{i=1}^M{\tilde \lambda}_i({\tilde 
\lambda}_i+\xi)^{-2}{\bf u}^i({\bf u}^i)^T \;.
$$
We can see, that in contrast to kernel PCR (\ref{c1}), the  variance
reduction in kernel RR  is achieved by giving less weight to small 
eigenvalue principal components
via the factor $\xi$.

In practice we usually do not know  the explicit mapping $\Phi(.)$
or its computation in the high-dimensional feature space $\cal F$ may 
be numerically intractable.
Using the dual representation of the linear RR model,
  \citet{SauGamVov98} derived a formula for estimation of the weights ${\bf w}$
for the linear RR model in a feature space ${\cal F}$; i.e. 
(nonlinear) kernel RR.
Again, using the fact that $K({\bf x},{\bf y})= \Phi({\bf x})^T\Phi({\bf y}) $
we can express the final kernel RR estimate of (\ref{RR}) in the dot 
product form
\citep{SauGamVov98,CriTay00}
$$
%\begin{equation}
%\label{krr}
f({\bf x})={\bf c}^T{\bf k}  = {\bf y}^T({\bf K}+\xi I)^{-1}{\bf k}\;  ,
%\end{equation}
$$
where ${\bf K}$ is again an $(n \times n)$ Gram matrix
and ${\bf k}$  is the vector of
dot products of a new mapped input  example $\Phi({\bf x})$ and the 
vectors of the training
set; ${\rm k}_i=(\Phi({\bf x}_i).\Phi({\bf x}))$.
It is worth noting that the same solution to the RR problem in the 
feature space
$\cal F$ can also be derived based on the dual representation of the 
regularization networks
minimizing the regularized risk functional (\ref{rf}) using the quadratic loss function
$V(y_i,f({\bf x}_i))=(y_i- f({\bf x}_i))^2$
\citep{GirJonPog93,GirJonPog95,Hay99} or  through the techniques 
derived from Gaussian processes \citep{Wil98,CriTay00}.


In this paper we assume centralized kernel RR \citep{RosTreCic00}; 
i.e. we assume the sample mean of the mapped
data $\Phi({\bf x}_i)$ and outputs  $y_i$ to be zero. The 
centralization of the individual
mapped data points is again accomplished by the ``centralization'' of 
${\bf K}$ and ${\bf k}$
given by the equations (\ref{cK}) and (\ref{cKt}), respectively.

\section{Model Selection}
\label{smod_sel}

To determine  unknown parameters in all  regression models,  cross 
validation (CV)
techniques were used \citep{Sto74}. While in kernel RR,  a 
regularization coefficient and
parameters of the kernel function have to be estimated, in kernel PLS 
and kernel PCR
it is mainly the problem of appropriate selection of (principal) components.

For a comparison of models using particular  values of estimated 
parameters, the prediction error sum of
squares (PRESS) statistic was used.
$$
{\rm PRESS} = \sum_{i=1}^{n}(y_i-f({\bf x}_i))^2 \; ,
$$
where $f({\bf x}_i)$ represents the prediction of the measured response $y_i$.
PRESS was summed over all CV subsets.


\subsection{Kernel PLS}
In kernel PLS the number of
components gradually increases until the model reaches some optimal dimension.
For example we can use CV to determine the adequacy of the individual components to enter
the final model \citep{Wol78} or use CV for the comparison of 
whole models of
certain dimensionality $1,2, \dots  , p$.  In our study we used the 
second approach
and the validity of individual models was compared in terms of  PRESS.

\subsection{Kernel PCR}
The situation is  more difficult in the case of kernel PCR than in kernel PLS because 
principal components are extracted solely based on the description of the
input space without using any existing correlations with the outputs.
The influence of individual principal components regressors can be consequently
measured by the $t$-test for regression coefficients \citep{MonPec92}.
By assuming a centralized regression model (\ref{tgm}) for which the 
design matrix ${\bf B}$
satisfies ${\bf B}^T{\bf B}={\bf I}$,  we can write for the $t^2$ statistic
of $k$-th regressor $t^2_k \equiv (\mbox{\boldmath$\beta$}_k^T {\bf Y})^2$,
where  $\mbox{\boldmath$\beta$}_k$ represents the $(n \times 1)$ 
vector of the projections
of input data onto the $k$-th principal component. The condition 
${\bf B}^T{\bf B}={\bf I}$  simply
means sphering  of the projected data which can be achieved on the training data
by taking ${\bf P}={\bf \tilde U}$ in equation (\ref{tP}).

There are several different situations that can occur in PCR. First, the
principal directions with large eigenvalues and significant values of $t^2$
should always be used in the final model. The
principal directions with high eigenvalues and insignificant values of $t^2$
should also be included in the final model due to the fact that a significant
amount of variability of the input data can be lost. The  principal 
directions with low eigenvalues
and insignificant values of $t^2$ should always be deleted. The most 
difficult problems
arise when some of the directions with small eigenvalues have a 
significant contribution
to prediction. This situation on two data sets used was already 
demonstrated in \citep{RosTreCic00,RosGirTreCic01}.
Moreover, in Figure \ref{ev_t2_r4_3} we also give one of the examples 
we observed on the used data sets.
Comparing the left and right graphs we can see that some of the small 
eigenvalues principal components
may have relatively high prediction properties. In contrast we can 
see that $t^2$ values of
some high-eigenvalue principal components indicate their low 
contribution to the overall
prediction abilities of the regression model. For further discussion 
on the topic of principal components
selection  we refer the  reader to \citep{Jol86,StoBro90}.
\begin{figure}[]
   \centerline{\psfig{figure=ev_t2_r4_3.ps,width=150mm,height=60mm,angle=0}}
   \caption{\label{ev_t2_r4_3}\small Example of the estimated ordered 
eigenvalues (left) and
   $t^2$ statistic  (right) of the regressors
created by the  projection onto corresponding  principal components. 
Vertical dashed lines indicate a
number of principal components describing  95\% of the overall 
variance of the data. One of the
training data partitions of subject D from the regression problem 
described in Section \ref{serp} was used.}
\end{figure}

First, we would like to stress that as a consequence of the
orthogonality of regressors the individual single variable models 
have an independent
contribution to the overall regression model. This significantly simplifies
the selection of individual regressors and in our study we decided on 
the following model
selection strategy. We were iteratively increasing
the number of large eigenvalue principal components entering the 
model without considering their values of the $t^2$ statistic. The criterion 
employed was the amount of
described variance. The rest of the principal components were 
ordered based on the $t^2$ statistic. As with  kernel PLS, CV was used to compare
the whole models of particular dimension. However, in contrast to 
kernel PLS the PRESS
statistics were used to select the final model over all possible 
arrangements of the
final models; i.e. for a different, fixed number of principal 
components with large
eigenvalues entering the final model.


\section{Results}
The present work was carried out with two types of kernels. Gaussian kernels
$K({\bf x},{\bf y})=e^{-(\frac{\|{\bf x} -{\bf y}\|^2}{d})}$, where
$d$ determines the width of the Gaussian function. The Gaussian kernel
possesses good smoothness properties (suppression of the higher 
frequency components)
and in the case we do not have a priori knowledge about the 
regression problem we would
prefer a smooth estimate \citep{GirJonPog93,SmoSchMul98}. The 
polynomial kernels
$K({\bf x},{\bf y})=(({\bf x}.{\bf y}) +1)^a $
of different orders $a$  were also employed. In the case of $a=1$ we used
the  $K({\bf x},{\bf y})=({\bf x}.{\bf y} )={\bf x}^T{\bf y}$ kernel leading to the
construction of linear regression models.
The results were evaluated in terms of
$R^2$ (coefficient of determination) \citep{MonPec92} or in terms of
normalized root mean squared error (NRMSE).

\subsection{Corn  Data}
This data set consists of 80 samples of corn measured on 3 different 
near-infra-red spectrometers
(in our study spectra from instrument m5 were used) and is electronically available 
at {\tt http://www.eigenvector.com/Data/Data\_sets.html}. 
The wavelength range is 1100-2498nm at 2 nm intervals ($N=700$ channels).
The moisture, oil, protein and starch values represent four output
variables ($L=4$). As the
first principal component described 99\% of the
overall variance  this  indicated high multicollinearity among the 
input variables.
We have used the spectra to form the matrix of regressors {\bf X}, 
however, instead of modeling
the real responses we generated four different outputs as follows
\begin{eqnarray}
\label{gen_res}
\begin{array}{ll}
y_1=\exp(\frac{{\bf x}^T{\bf x}}{2m}) & y_2=\exp(\frac{{\bf x}^T{\bf 
A}^{-1}{\bf x}}{2m_1}) \\
y_3=(\frac{{\bf x}^T{\bf x}}{m})^3\exp(\frac{{\bf x}^T{\bf x}}{2m})& 
y_4=0.3y_1+0.25y_2-0.7y_3 \; . \\
\end{array}
\end{eqnarray}
{\bf A} is a symmetric matrix with off-diagonal elements set to 0.8 
and diagonal elements set to 1.0.
$m$ and $ m_1$ are averages of $\{{\bf x}_i^T{\bf x}_i\}_{i=1}^{700}$  and
  $\{{\bf x}_i^T{\bf A}^{-1}{\bf x}_i\}_{i=1}^{700}$,  respectively.
  The first 60 samples were used to create a training data set and the 
remaining 20 samples created a testing
  set.  To the outputs computed on training data  we added white noise 
with normal distribution
  and with different levels corresponding to ratios of the standard 
deviation of the noise and the clean
  output variables. We denote this ratios $n/s$. For each noise level 
  25 different training sets were generated.  

  Leave-one-out cross validation (LOO) applied on the training 
partition was used to select
desired parameters for individual kernel regression models. In the 
case of kernel PCR
the first principal component was always included into the regression 
models and only the first
30 principal components covering almost the whole data variance were 
investigated.

Both multivariate and univariate regression models were studied.
Multivariate kernel PLS tries to find components that are good 
predictors for all response variables. In
the final model the same components are used for the prediction of 
individual responses.
These components are determined based on the PRESS statistic computed 
as the summation of squared errors
over all response variables.
Multivariate kernel PLS might be advantageous especially when
 predicted response variables are  highly multicollinear.
We observed that both approaches lead to the same
results on data sets with a  smaller noise level, however, for noise 
levels equal to $n/s=60\%$ and $n/s=90\%$
multivariate kernel PLS provides superior  predictions on the test set.
Multivariate RR approach was  discussed by Frank and Friedman 
\citep{FraFri93}. It was shown that if the original response
variables are correlated it might be profitable to assume
separate RR or PCR regression models on decorrelated outputs rather 
than on the original responses.
However, by applying this method we did not observe an improvement on 
test set predictions.

The goodness of prediction of the univariate kernel RR, kernel PCR 
and kernel PLS regression models
on the test set are  summarized in Table \ref{corn}. We may observe comparable
performance of all three kernel regression techniques employed. 
However, the results
achieved with multivariate kernel PLS regression were  superior to the
kernel PCA and kernel RR models in the case of a higher noise level
  $n/s=90\%$. In this case multivariate kernel PLS resulted in  $R^2$ values
equal to $0.95$ and $0.93$ for the prediction of $y_1$ and $y_2$ 
response variables on the test set.
Increasing the noise level leads to the selection of a smaller number
of (principal) components. Although we are losing the possibility of 
finer approximation
of the function those components, especially in the case of kernel 
PLS, may still give relatively
good performance even in the situation where there is a  higher noise 
level. However, we have to note that
this is due to the  relatively simple nonlinear dependencies that we 
constructed by (\ref{gen_res}).
In the nonlinear case, on average,  univariate kernel PLS uses less 
than 80\% of the components used by kernel PCR.
However, in the case of linear regression (i.e. using the polynomial kernel
  $K({\bf x},{\bf y})=( {\bf x}. {\bf y} )$) the number of used 
components is approximately
  the same.

We would like to note that using the original response variables
(moisture, oil, protein, starch) we did not  achieve significantly
better results using polynomial kernels of higher orders compared to 
linear regression.

{\small
\begin{table}[h]
\begin{center}
\begin{tabular}{l|llll|llll|llll}
Noise   &  \multicolumn{4}{c|}{\bf kernel PLS } &
  \multicolumn{4}{c}{\bf kernel PCR }
& \multicolumn{4}{c}{\bf kernel RR} \\
% &   &   &   &   &   &  & &  & & \\
\cline{2-5} \cline{6-9} \cline{10-13}
level  & $y_1$  & $y_2$  & $y_3$  & $y_4$  & $y_1$  & $y_2$  & $y_3$ 
& $y_4$  & $y_1$  & $y_2$  & $y_3$  & $y_4$  \\
\hline 
\multicolumn{13}{c}{}   \\
\multicolumn{13}{c}{$K({\bf x},{\bf y})=( {\bf x}. {\bf y} )$}   \\
\hline
15\%    & .96       & .95       & .76     & .75
         & .95       & .94       & .76       & .72
         & .96       & .95       & .76     & .74 \\
30\%    & .94 & .93       & .74       & .70
        & .92       & .93       & .75       & .73
            & .94  & .94       &  .74 &  .71 \\
60\%   & .86       & .87       & .75 &  .73
       & .81       & .89       & .67       & .71
           & .88       & .89       & .74       & .73 \\
90\%  & .77       &  .75      &  .70       & .73
       & .69       & .77       & .62       & .60
           & .78      & .81       & .69       & .73 \\
\hline
\multicolumn{13}{c}{}  \\
\multicolumn{13}{c}{$K({\bf x},{\bf y})=(( {\bf x}. {\bf y} ) +1)^2$}   \\
  \hline 
15\%    & .98       & .99       & .96     & .97
         & .98       & .97       & .93       & .94
             & .98       & .98       & .96      & .97 \\
30\%  & .96 & .96       & .91       & .94
       & .96       & .94       & .87       & .88
           & .97  & .97       &  .91 &  .94 \\
60\%  & .87       & .89       & .77 &  .85
       & .85       & .88       & .71       & .76
           & .90       & .92       & .84       & .85 \\
90\%  & .73       &  .77      &  .70       & .79
       & .70       & .78       & .55       & .70
           & .78      & .84       & .80       & .82 \\

\hline 
\multicolumn{13}{c}{}  \\
\multicolumn{13}{c}{$K({\bf x},{\bf y})=(( {\bf x}. {\bf y} ) +1)^3$}   \\
  \hline
15\%    & .99       & .99       & .98     & .98
         & .99       & .99       & .96       & .97
             & .99       & .99       & .97       & .98 \\
30\%    & .98 & .97       & .94       & .95
         & .97      & .97       & .89       & .91
             & .98  & .97       &  .95 &  .96 \\
60\%     & .93       & .89       & .88 &  .88
       & .88       & .88       & .77       & .82
           & .93       & .90       & .90       & .91 \\
90\%  & .85       &  .77      &  .85       & .85
       & .72       & .73       & .61       & .79
           & .83      & .77       & .86       & .86 \\
\hline
\end{tabular}
\caption{\label{corn}\small  Goodness of prediction  in $R^2$ terms.
The values correspond to an average of 25 different simulations.
Noise level is  represented as the ratio between the standard 
deviation of the added
Gaussian noise and the standard deviation of the generated response
variables $y_1,y_2,y_3,y_4$ (\ref{gen_res}).}
\end{center}
\end{table}
}

\subsection{Polymer  Data}
This data set  is taken from a polymer test plant \citep{polymer}.
There are 10 input variables ($N=10$), measurements of controlled variables in
a polymer processing plant (temperatures, feed rates, etc.),  and 4
output variables ($L=4$) are measures of the output of that plant. It 
is claimed
  that this  data set is particularly good for testing
the robustness of nonlinear modeling methods to irregularly spaced data.

We first took 41 samples as training and the remaining 20 samples as 
testing data.
LOO applied on the training partition was used to select the
desired parameters for individual kernel regression models. Polynomial kernels
of different orders were used. The indication of high 
multicollinearity (the ratio between the first
and fourth eigenvalues of the covariance matrix of
the response variables equals 628) among the response variables 
suggests that assuming a
multivariate regression approach may by profitable in this case.

Table \ref{polymer} compares the goodness of prediction of
the multivariate kernel PLS on the test set with the kernel RR and kernel PCR
regression models used on decorrelated outputs. The results applying 
univariate regression models on the same
data set are summarized in Table \ref{polymer1}.

We can see that all three regression models achieved comparable 
results when the best predictions
on individual response variables are compared.
We may see that univariate kernel PLS provides better prediction on the
responses $y_1$ and $y_2$ while its multivariate modification seems 
to be better for the
prediction of $y_3$ and $y_4$.
Univariate kernel RR and kernel PCR are better on the prediction of 
$y_2$, $y_3$ and $y_4$, while decorrelation
of the outputs may improve the prediction of $y_1$. However, we have 
to say that these results may also
depend on the model selection criterion used. Further experiments 
using different data sets and model selection
criteria will have to be employed to provide a more valuable comparison.
Results also suggest that on this data set
kernel PLS provides the most consistent results over the range of 
different orders of the polynomial kernel.
Simulations with Gaussian kernels of different widths did not lead to 
better performance.

{\small
\begin{table}[h]
\begin{center}
\begin{tabular}{l|llll|llll|llll}
Order  &  \multicolumn{4}{c|}{\bf kernel PLS } &
  \multicolumn{4}{c|}{\bf kernel PCR }
& \multicolumn{4}{c}{\bf kernel RR} \\
% &   &   &   &   &   &  & &  & & \\
\cline{2-5} \cline{6-9} \cline{10-13}
$a$ & $y_1$  & $y_2$  & $y_3$  & $y_4$  & $y_1$  & $y_2$  & $y_3$  & 
$y_4$  & $y_1$  & $y_2$  & $y_3$  & $y_4$  \\
\hline
1     & .43       & .06       & .73       & .77
       & .23       & {\bf .68} & {\bf .83}  & .79
           & .36       & .21   & .79       & .79 \\
2     & {\bf .90} & .30      & .89       & .88
       &  0$^*$    & .0$^*$  & .63       & .62
           & .0$^*$     & .0$^*$  & .68       & {\bf .86}  \\
3     & .84       & .57      & {\bf .90} & {\bf .90}
       & {\bf .89} & .33       & .60       & .68
           & {\bf .86}   & .0$^*$    & .58       & .54 \\
4     & .79       & {\bf .69} & .87       & .87
       & .84       & 0$^*$   & .74          & .74
           & .79       & .30       & {\bf .85}       & .83 \\
5     & .72       & .65       & .65       & .57
       & .80       & .62       & .76       & {\bf .81}
           & .70       & {\bf .37}   & .73       & .71 \\

\hline

\end{tabular}
\caption{\label{polymer}\small  Goodness of prediction of the 
$y_1,y_2,y_3,y_4$
in  terms of $R^2$. Bold numbers indicate the best achieved 
prediction on individual response variables. Polynomial
kernels of different orders $a$ were used. 0$^*$ represents a case 
when averaged performance of the model
is worse than assuming a linear model equal to the mean of a response 
variable.}
\end{center}
\end{table}
}

{\small
\begin{table}[h]
\begin{center}
\begin{tabular}{l|llll|llll|llll}
Order  &  \multicolumn{4}{c|}{\bf kernel PLS } &
  \multicolumn{4}{c|}{\bf kernel PCR }
& \multicolumn{4}{c}{\bf kernel RR} \\
% &   &   &   &   &   &  & &  & & \\
\cline{2-5} \cline{6-9} \cline{10-13}
$a$ & $y_1$  & $y_2$  & $y_3$  & $y_4$  & $y_1$  & $y_2$  & $y_3$  & 
$y_4$  & $y_1$  & $y_2$  & $y_3$  & $y_4$  \\
\hline
1     & .21       & .0$^*$    & .71       & .78
       & .39       & .0$^*$    & .77       & .78
           & .28       & .0$^*$    & .76       & .76 \\
2     & .90       & .04       & {\bf .85} & {\bf .87}
       & .82       & .41       & .84       & .84
           &{\bf .79}  & .45       & {\bf .90} & {\bf .90} \\
3     & .91       & .59       & .85       & .86
       & {\bf .87} & .63      & .55       & .81
           & .52       & .0$^*$    & .67       & .22 \\
4     & {\bf .93} & .75       & .68       & .66
       & .81       & .58       & {\bf .88} & {\bf .88}
           & .68       & .71       & .78       & .75 \\
5     & .85       & {\bf.79}  & .58       & .81
       & .81       & {\bf .79} &  .83     & .84
           & .60       & {\bf .74} & .66       & .66 \\

\hline

\end{tabular}
\caption{\label{polymer1}\small  Goodness of prediction of the 
$y_1,y_2,y_3,y_4$
in  terms of $R^2$. Bold numbers indicate the best achieved 
prediction on individual response variables. Polynomial
kernels of different orders $a$ were used. 0$^*$ represents a case 
when averaged performance of the model
is worse than assuming a linear model equal to the mean of a response 
variable.}
\end{center}
\end{table}
}

\subsection{Chaotic Mackey-Glass Time-Series}
The chaotic Mackey-Glass time-series is defined by the differential equation
$$  \frac{ds(t)}{dt}=-b_1s(t)+b_2\frac{s(t-\tau)}{1+s(t-\tau)^{10}} $$
with $b_1=0.1$, $b_2=0.2$.
The data were generated with
$\tau = 17$  and using a  second-order Runge-Kutta method with a step size 0.1.
Training data is from t=200 to t=3200 while test data is in the range 
t= 5000 to 5500.
To this generated time-series we added  white noise with normal 
distribution and with different
levels corresponding to ratios of the standard deviation of the noise 
and the clean
Mackey-Glass time-series.

The training data partitions were constructed by moving a  ``sliding window''
over the 3000 training samples in steps of 250 samples.
This window had a  size of 500 samples.
The validation set was then created by using the following 250
data points. This created ten partitions of size 500/250 
(training/validation) samples.

All regression models were trained to predict
the value sampled $85$ steps ahead ($L=1$) from inputs at time
$t,t-6,t-12,t-18$ ($N=4$).

The Gaussian kernel was used. We estimated the variance of the overall
clean  training set and based on this estimate ${\hat \sigma}^2\doteq 0.05$
the CV technique was used to find the optimal width $d$  
from the range $\langle 0.01,0.2 \rangle$ using the step size $0.01$.
A fixed test set of size 500 data points was used
in all experiments. The performance of the regression models to make 
predictions on ``clean''
test set of 500 data points  was evaluated in terms of NRMSE.

The results achieved using the individual regression models are 
summarized in Table \ref{mg_comp}.
We can see that there are no significant differences among the 
methods employed. However, comparing
kernel PLS and kernel PCR we can observe a significant reduction in 
the number of components used 
in the case of kernel PLS regression. In some cases kernel PLS uses 
less than 10\% of the number of 
components used by kernel PCR.

Increasing the value of $d$ leads to a faster decay of the eigenvalues
\citep[see, e.g.,][]{WilSmoSch98}
and to the potential loss of the fine structure due to a 
smaller number of nonlinear principal components describing the  same percentage of
all the data variance. Increasing levels of the noise has  the  tendency to
increase the optimal value for the $d$ parameter which coincides with 
the intuitive
assumption about smearing out the local structure \citep[for the 
discussion on this topic see][]{RosGirTreCic01}.
In contrast small values of $d$ will lead to ``memorizing'' of the 
training data structure.
Thus, in Figure \ref{mg_n05_st250_500}
we also compared the results on the noisy time
series ($n/s=22\%$) and their dependence on the width $d$ of the 
Gaussian kernel. Similar behavior was observed
for data with $n/s=11\%$.
We may observe a smaller range of the $d$ values on which kernel PLS 
and kernel PCR
achieves the optimal results on the testing set compared  to kernel 
RR. However, the results also
suggest a smaller variance in the case of latent variable projection 
methods; i.e. kernel PLS and
kernel PCR.

{\small
\begin{table}[h]
\begin{center}
\begin{tabular}{l|cc|cc|c}
  Noise   &  \multicolumn{2}{c|}{\bf   kernel PLS } &
  \multicolumn{2}{c|}{\bf kernel PCR }
& \multicolumn{1}{c}{\bf kernel RR} \\
level &  \multicolumn{1}{c}{\em NRMSE} &  \multicolumn{1}{c|}{\# of C}
&  \multicolumn{1}{c}{\em NRMSE} &  \multicolumn{1}{c|}{\# of PC}
&  \multicolumn{1}{c}{\em NRMSE}     \\
\hline
{\em n/s=0.0\%}  &  0.048  & 155   &  0.046     &  383    &  0.044     \\
                &  (0.031)  & (38)   &  (0.030)  &  (78)    &  (0.027)  \\



{\em n/s=11\%}    &  0.322   &  7       &  0.327      &  79    & 
0.321       \\
                   &  (0.030) &  (2)     &   (0.030)   & (35)   & 
(0.041)      \\


{\em n/s=22\%}  & 0.455    & 6   & 0.462   & 48  &  0.451     \\
                & (0.021)  & (2) & (0.031) & (24) &  (0.029)   \\
\hline
\end{tabular}
\caption{\label{mg_comp}\small  The comparison of the approximation 
errors (NRMSE) of prediction,
  the number of used components (C) and principal components (PC). 
The values represent an average of 10
simulations. The corresponding standard deviations are presented in parentheses.}
\end{center}
\end{table}
}

%\vspace*{-0.4cm}
\begin{figure}[h]
\centerline{\psfig{figure=mg_n05_st250_500.ps,width=150mm,height=130mm,angle=0}}
%\vspace*{-0.25cm}
   \caption{\label{mg_n05_st250_500}\small Comparison of the results 
achieved on the noisy Mackey-Glass ({\em n/s=22\%})
   time series with the kernel PLS (top), kernel PCR (middle) 
and kernel RR (bottom) methods.
   Ten different training sets of size 500 data points were used. The 
performance
   for different widths ($d$) of the Gaussian kernel
is compared in normalized root mean squared error (NRMSE) terms.
The error bars represent the standard deviation on results computed from ten different runs.}
\end{figure}


\subsection{Human Signal Detection Performance Monitoring}
\label{serp}
In this study eight male Navy technicians experienced in the operation
of display systems performed a signal detection task.
Event related potentials (ERP) and performance data from an earlier study
\citep{TreShe99, TreKraArn95, KosRosKonTre97} were used. The performance of
the subjects was measured in terms of PF1 measure ($L=1$) based on 
their  accuracy, confidence
and reaction time to detect relevant stimuli. For details on the 
experimental setting see \citep{RosGirTreCic01}.

The results achieved on individual subjects in our former studies 
informed our choice of
the Gaussian kernel. For each individual subject we split the data 
into 10 different 55\% and 45\%
training and testing partitions.
Eleven-fold CV to estimate desired parameters was applied on each 
training partition. After  CV  a final
model was tested on an independent testing partition.
This was repeated 10 times for each training and testing data pair.
The validity of the models was measured in terms of $R^2$.

Table \ref{tabAllS}
summarizes the results achieved on eight subjects (A to H),  using
kernel PLS,  kernel PCR and  kernel RR  methods. As with results reported on
the Mackey-Glass time series prediction we can not observe any 
significant differences among
the kernel regression models.  The number of components used in the case of
kernel PLS is on average 10 times lower when compared to kernel PCR.

{
\begin{table}[h]
\begin{center}
\begin{tabular}{l||ll|ll|l}
  Subject & \multicolumn{2}{c|}{\bf kernel PLS} &
  \multicolumn{2}{c|}{\bf kernel PCR} & \multicolumn{1}{c}{\bf kernel RR}  \\
  &  $R^2$  & \# of C & $R^2$ &  \# of PC & $R^2$  \\

  %&  A  & B  & C  & D  & E & F& G & H    \\
\hline
A         & 0.841      & 27.9   & 0.841   & 373.1 & 0.841    \\
(891 ERP)      & (0.025)    & (16.3) & (0.023)  &  (30.9)  & (0.025)  \\
\hline
B           & 0.883   &    33.9   &  0.882  &   224.4  &  0.883    \\
(592 ERP)   & (0.027) &    (12.3) &  (0.026)  & (32.2)  & (0.026)   \\
\hline
C            &  0.741  &    15.5   & 0.740    & 134.8  & 0.747    \\
(417 ERP)   & (0.060) &    (4.1)  & (0.044)  & (17.9) & (0.044)   \\
\hline
D            &  0.870  &    24.8  & 0.870    & 241.2 &   0.874  \\
(702 ERP)   & (0.011) &   (6.0)  & (0.010) &  (50.3) &  (0.010)  \\
\hline
E            &  0.942 &   42.4   & 0.941 &  274.4  & 0.943 \\
(734 ERP)   & (0.006) &   (20.1) & (0.006)&  (42.2) & (0.007)  \\
\hline
F             & 0.884   &  24.6    & 0.875      &  186.6 & 0.886  \\
(614 ERP)   & (0.023) &  (9.9)  & (0.025)&  (61.4) & (0.024)  \\
\hline
G            &  0.895   &  23.6   & 0.893  &   323.3 & 0.895   \\
(868 ERP)  &  (0.018) &  (13.5) & (0.018)  & (53.2) & (0.017)  \\
\hline
H           & 0.827   &    19.7  & 0.825 &  280.4 & 0.827   \\
(776 ERP) & (0.022)    & (7.0) & (0.022) & (49.0)& (0.022)  \\
%\hline
%\hline
\end{tabular}
%\vspace{-0.2cm}
\caption{\label{tabAllS}\small  The comparison of $R^2$ and the number of used
components (C) and principal components (PC), respectively,  for
subjects A to H.  The values represent an average of 10
different simulations and the corresponding standard deviations are 
presented in parentheses.}
\end{center}
\end{table}
}



\section{Discussion and Conclusions}

On several different regression tasks we compared the proposed kernel 
PLS method with
  kernel PCR and  kernel RR
techniques. We show  that kernel PLS provides the same results 
as  kernel PCR
and  kernel RR. However, in comparison to kernel PCR, the kernel PLS 
method uses a much  smaller number of qualitatively different components.
As demonstrated in Section \ref{sex}, using the existing correlations 
between outputs and mapped inputs, kernel PLS may provide components 
which follow more closely the investigated nonlinear function.  
However, as with  kernel PCA, the interpretation of these components in the input 
space may be difficult.     
    

There exists a large body of literature comparing standard  OLS
regression with PLS, PCR and RR \citep[see, e.g.,][]{FraFri93,StoBro90}. Assuming a
construction of regularized linear regression models in a RKHS we can make some
conclusions by using the analogy with the reported observations.
First, in the situation where high multicollinearity among regressors 
exists OLS leads
to the unbiased but high variance  estimate of regression 
coefficients. PLS, PCR and RR
are designed to shrink the solution to the regression from the areas of the low
data spread resulting in biased but lower variance estimates.
Second, there exist real world 
regression problems where the number of observed variables $N$ significantly exceeds 
 the number of samples (observations) $n$---a situation quite   
common in chemometrics. 
%Second,
%there exist real world
%regression problems where the number of observations
%$n$  significantly exceeds  the number of observed variables $N$---a situation quite
%common in chemometrics. 
Moreover, we may usually also observe that the real rank of
the matrix of regressors is significantly lower than $n$ and  $N$. 
The projection of the original
regressors to the ``real'' latent variables is the main advantage of 
methods such as PLS or PCR.
This is also similar to the situation where the input variables are 
corrupted by a certain amount
of noise (the situation with noisy Mackey-Glass time series and ERP 
data sets). By the projection
of original data to the components
with higher eigenvalues we may usually discard the noise component 
contained in the original data
\citep[assuming kernel PCR as discussed in][]{RosGirTreCic01}.
We hypothesize that both situations are also quite common when a 
kernel-type of regression
is used. Usually we nonlinearly transform the original data to the 
high-dimensional space
whose dimension $M$ is in many cases much  higher than the 
number of observations $M \gg n $.
Although, assuming high-dimensional feature spaces, we cannot 
diagnose multicollinearity
by the inspection of the sample covariance matrix, the indication of 
strong  multicollinearity
may by detected from a large ratio between  maximal and minimal 
eigenvalues of the covariance
matrix. The eigenvalues can by estimated by the kernel PCA method.
Values greater than 100 usually indicate strong multicollinearity  among
regressors \citep{MonPec92}. In many cases, in our data sets we 
observed that the ratio between the
larger and smaller eigenvalues exceeded  1000,  indicating severe
multicollinearity.\footnote{We investigated the situation by using the
Gaussian kernel with different width parameters and polynomial 
kernels of different orders.}

The proposed kernel PLS uses the NIPALS procedure to iteratively
estimate the desired components. We have already pointed out that the
NIPALS algorithm is very similar to the power method and as with 
this method  was found to be very robust for solving
eigenvalue-eigenvector related problems where dominant eigenvectors 
are calculated
one at a time.  The rate of the convergence of both algorithms is
given by the ratio of two largest eigenvalues \citep{Mal95}.
%In Section \ref{skpls} described  NIPALS-PLS  procedure scales as 
%${\cal O}(n^2)$,  however, after
%the extraction of the individual components
%the deflation procedure (\ref{def}) has complexity  ${\cal O}(n^3)$. 
Both the NIPALS-PLS  procedure described in Section \ref{skpls} 
and the deflation procedure (\ref{def}) used after
the extraction of the individual components
scale as   ${\cal O}(n^2)$. 
The need to repeat these procedures increases the computational costs 
in direct proportion to the number of desired components.
However, on the employed  data sets we
have demonstrated that the best results are achieved with the number
of components $p \ll n $. Moreover, investigating 
the curves of the PRESS statistics computed on validation sets we
observed that we usually do not need to
investigate a wide spectrum of components as a rather strong
increase of PRESS occurs after extracting
the optimal number of components. Assuming $n > L$, the
procedure (\ref{YTR}) for the estimation of the desired  training data 
set  output  values scales as
${\cal O}(pnL)$.  Using the procedure (\ref{YTS}) to make the prediction computed 
on a single test data point, the complexity scales as ${\cal  O}(pn^2 + p^3)$, where the second term 
associated with the inversion of the $(p \times p)$ matrix may be neglected in the case that 
$p \ll n$.   
%In the case $n_t \le n$ this will be the same for the
%predictions computed on testing data set otherwise the procedure
%(\ref{YTS}) scale as ${\cal  O}(pnn_t)$.
Moreover, in this procedure we do not even need to invert
the whole $(n \times n )$ Gram matrix. In fact using the kernel PLS 
method on large scale regression
tasks we may avoid storing the whole Gram matrix ${\bf K}$. 
Re-computation of its
elements may,  however,  significantly slow down the whole  algorithm.


\acks{We would like to thank Bogdan Gabrys  for helpful discussions 
on model selection techniques.
  ERP data were obtained under a grant from the US Navy Office of 
Naval Research
(PE60115N), monitored by Joel Davis and Harold Hawkins.
R. Rosipal is funded by a research
grant for the project ``Objective Measures of Depth of Anaesthesia'';
University of Paisley and
Glasgow Western Infirmary NHS trust, and is  partially supported by
Slovak Grant Agency for Science (grant No. 2/1136/21).
L.J. Trejo is supported by the U.S. National Aeronautics and Space
Administration, Aerospace Operations Systems Program, and by 
the Intelligent Systems Program.}


\addcontentsline{toc}{section}{References}

\bibliographystyle{plain}
\bibliography{jmlr}

\end{document}

