\documentclass[twoside,11pt]{article}

% Any additional packages needed should be included after jmlr2e.
% Note that jmlr2e.sty includes epsfig, amssymb, natbib and graphicx,
% and defines many common macros, such as 'proof' and 'example'.
%
% It also sets the bibliographystyle to plainnat; for more information on
% natbib citation styles, see the natbib documentation, a copy of which
% is archived at http://www.jmlr.org/format/natbib.pdf

\usepackage{jmlr2e}
\usepackage{amsmath}  

% Definitions of handy macros can go here

\newcommand{\dataset}{{\cal D}}
\newcommand{\fracpartial}[2]{\frac{\partial #1}{\partial  #2}}

% Heading arguments are {volume}{year}{pages}{submitted}{published}{authors}

\jmlrheading{2}{2001}{299-312}{3/01}{12/01}{Marc G. Genton}
\ShortHeadings{Classes of Kernels for Machine Learning}{Genton}
\firstpageno{299}

\begin{document}

\title{Classes of Kernels for Machine Learning: \\ A Statistics Perspective}

\author{\name Marc G. Genton \email genton@stat.ncsu.edu \\
       \addr Department of Statistics\\
       North Carolina State University\\
       Raleigh, NC 27695-8203, USA
       }

\editor{Nello Cristianini, John Shawe-Taylor, Robert Williamson}

\maketitle

\begin{abstract}%   <- trailing '%' for backward compatibility of .sty file
In this paper, we present classes of kernels for machine learning from a statistics perspective. Indeed, kernels are positive definite functions and thus also covariances. After discussing key properties of kernels, as well as a new formula to construct kernels, we present several important classes of kernels: anisotropic stationary kernels, isotropic stationary kernels, compactly supported kernels, locally stationary kernels, nonstationary kernels, and separable nonstationary kernels. Compactly supported kernels and separable nonstationary kernels are of prime interest because they provide a computational reduction for kernel-based methods. We describe the spectral representation of the various classes of kernels and conclude with a discussion on the characterization of nonlinear maps that reduce nonstationary kernels to either stationarity or local stationarity.
\end{abstract}

\begin{keywords}
  Anisotropic, Compactly Supported, Covariance, Isotropic, Locally Stationary, Nonstationary, Reducible, Separable, Stationary
\end{keywords}

\section{Introduction}

Recently, the use of kernels in learning systems has received considerable attention. The main reason is that kernels allow to map the data into a high dimensional feature space in order to increase the computational power of linear machines \citep[see for example][]{vapnik:95,vapnik:98,crisha:00}. Thus, it is a way of extending linear hypotheses to nonlinear ones, and this step can be performed implicitly. Support vector machines, kernel principal component analysis, kernel Gram-Schmidt, Bayes point machines, Gaussian processes, are just some of the algorithms that make crucial use of kernels for problems of classification, regression, density estimation, and clustering. In this paper, we present classes of kernels for machine learning from a statistics perspective. We discuss simple methods to design kernels in each of those classes and describe the algebra associated with kernels.
         
The kinds of kernel $K$ we will be interested in are such that for all examples $\bf x$ and $\bf z$ in an input space $X\subset {\Bbb{R}^d}$:
\begin{equation*}
K({\bf x},{\bf z})=\langle \phi({\bf x}), \phi({\bf z}) \rangle,
\end{equation*}
where $\phi$ is a nonlinear (or sometimes linear) map from the input space $X$ to the feature space $\cal F$, and $\langle \cdot,\cdot \rangle$ is an inner product. Note that kernels can be defined on more general input spaces $X$, see for instance \citet{aron:50}. In practice,
the kernel $K$ is usually defined directly, thus implicitly defining the map $\phi$ and the feature space $\cal F$. It is therefore important to be able to design new kernels. Clearly, from the symmetry of the inner product, a kernel must be symmetric:
\begin{equation*}
K({\bf x},{\bf z})=K({\bf z},{\bf x}),
\end{equation*}
and also satisfy the Cauchy-Schwartz inequality:
\begin{equation*}
K^2({\bf x},{\bf z})\leq K({\bf x},{\bf x}) K({\bf z},{\bf z}).
\end{equation*}
However, this is not sufficient to guarantee the existence of a feature space. \citet{mercer:09} showed that a necessary and sufficient condition for a symmetric function $K({\bf x},{\bf z})$ to be a kernel is that it be positive definite. This means that for any set of examples ${\bf x}_1,\ldots,{\bf x}_l$ and any set of real numbers $\lambda_1,\ldots,\lambda_l$, the function $K$ must satisfy:
\begin{equation} \label{posdef}
\sum_{i=1}^l \sum_{j=1}^l \lambda_i \lambda_j K({\bf x}_i,{\bf x}_j)\geq 0.
\end{equation}
Symmetric positive definite functions are called covariances in the statistics literature. Hence kernels are essentially covariances, and we propose a statistics perspective on the design of kernels. It is simple to create new kernels from existing kernels because positive definite functions have a pleasant algebra, and we list some of their main properties below. First, if $K_1$, $K_2$ are two kernels, and $a_1$, $a_2$ are two positive real numbers, then:
\begin{equation}\label{p1}
K({\bf x},{\bf z})=a_1 K_1({\bf x},{\bf z})+a_2 K_2({\bf x},{\bf z}),
\end{equation}
is a kernel. This result implies that the family of kernels is a convex cone. The multiplication of two kernels $K_1$ and $K_2$ yields a kernel:
\begin{equation}\label{p2}
K({\bf x},{\bf z})=K_1({\bf x},{\bf z}) K_2({\bf x},{\bf z}).
\end{equation}
Properties (\ref{p1}) and (\ref{p2}) imply that any polynomial with positive coefficients, $pol^+(x)=\{\sum_{i=1}^n \alpha_i x^i | n\in {\Bbb{N}}, \,\, \alpha_1,\ldots,\alpha_n \in {\Bbb{R}^+}\}$, evaluated at a kernel $K_1$, yields a kernel:
\begin{equation}\label{p3}
K({\bf x},{\bf z})=pol^+(K_1({\bf x},{\bf z})).
\end{equation}
In particular, we have that:
\begin{equation}\label{p4}
K({\bf x},{\bf z})=\exp(K_1({\bf x},{\bf z})),
\end{equation}
is a kernel by taking the limit of the series expansion of the exponential function. Next, if $g$ is a real-valued function on $X$, then
\begin{equation}\label{p5}
K({\bf x},{\bf z})=g({\bf x}) g({\bf z}),
\end{equation}
is a kernel. If $\psi$ is an $\Bbb{R}^p$-valued function on $X$ and $K_3$ is a kernel on $\Bbb{R}^p \times \Bbb{R}^p$, then:
\begin{equation}\label{p6}
K({\bf x},{\bf z})=K_3(\psi({\bf x}),\psi({\bf z})),
\end{equation}
is also a kernel. Finally, if $A$ is a positive definite matrix of size $d \times d$, then:
\begin{equation}\label{p7}
K({\bf x},{\bf z})={\bf x}^TA{\bf z},
\end{equation}
is a kernel. The results (\ref{p1})-(\ref{p7}) can easily be derived from (\ref{posdef}), see also \citet{crisha:00}. The following property can be used to construct kernels and seems not to be known in the machine learning literature. Let $h$ be a real-valued function on $X$, positive, with minimum at 0 (that is, $h$ is a variance function). Then:
\begin{equation}\label{p8}
K({\bf x},{\bf z})=\frac{1}{4} \Big[ h({\bf x}+{\bf z})-h({\bf x}-{\bf z})\Big],
\end{equation}
is a kernel. The justification of (\ref{p8}) comes from the following identity for two random variables $Y_1$ and $Y_2$: Covariance($Y_1$,$Y_2$)=[Variance($Y_1+Y_2)-$Variance($Y_1-Y_2$)]/4. For instance, consider the function $h({\bf x})={\bf x}^T{\bf x}$. From (\ref{p8}), we obtain the kernel:
\begin{equation*}
K({\bf x},{\bf z})=\frac{1}{4} \Big[ ({\bf x}+{\bf z})^T({\bf x}+{\bf z})-({\bf x}-{\bf z})^T({\bf x}-{\bf z})\Big]={\bf x}^T {\bf z}.
\end{equation*}

The remainder of the paper is set up as follows. In Section 2, 3, and 4, we discuss respectively the class of stationary, locally stationary, and nonstationary kernels. Of particular interest are the classes of compactly supported kernels and separable nonstationary kernels because they reduce the computational burden of kernel-based methods. For each class of kernels, we present their spectral representation and show how it can be used to design many new kernels. Section 5 addresses the reducibility of nonstationary kernels to stationarity or local stationarity, and we conclude the paper in Section 6.
  
\bigskip  

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
\section{Stationary Kernels}   
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%   

A stationary kernel is one which is translation invariant:
\begin{equation*}
K({\bf x},{\bf z})=K_S({\bf x}-{\bf z}),
\end{equation*}
that is, it depends only on the lag vector separating the two examples ${\bf x}$ and ${\bf z}$, but not on the examples themselves. Such a kernel is sometimes referred to as anisotropic stationary kernel, in order to emphasize the dependence on both the direction and the length of the lag vector. The assumption of stationarity has been extensively used in time series \citep[see for example][]{broda:91} and spatial statistics \citep[see for example][]{cressie:93} because it allows for inference on $K$ based on all pairs of examples separated by the same lag vector. Many stationary kernels can be constructed from their spectral representation derived by \citet{boch:55}. He proved that a stationary kernel $K_S(\bf {x-z})$ is positive  definite in ${\Bbb{R}^d}$ if and only if it has the form: 
\begin{equation}
K_S({\bf x-z})=\int_{{\Bbb{R}}^d}  \cos\big({\boldsymbol \omega}^{T}({\bf x-z})\big)F(d{\boldsymbol \omega}),   
\label{bochn}   
\end{equation}   
where $F$ is a positive finite measure. The quantity $F/K_S({\bf 0})$ is called the spectral distribution function. Note that (\ref{bochn}) is simply the Fourier transform of $F$. \citet{creshuang:99} and \citet{gneit:01} use (\ref{bochn}) to derive nonseparable space-time stationary kernels, see also \citet{christa:00} for illustrative examples.

When a stationary kernel depends only on the norm of the lag vector between two examples, and not on the direction, then the kernel is said to be isotropic (or homogeneous), and is thus only a function of distance:  
\begin{equation*}
K({\bf x},{\bf z})=K_I(\|{\bf x}-{\bf z}\|).
\end{equation*}
The spectral representation of isotropic stationary kernels has been derived from Bochner's theorem \citep{boch:55} by \citet{yaglom:57}:  
\begin{equation}\label{yaglom}   
K_I(\|{\bf x}-{\bf z}\|)=\int_{0}^{\infty} \Omega_{d} \big(\omega\|{\bf x}-{\bf z}\|\big)F(d\omega),      
\end{equation}     
where 
\begin{equation*} 
\Omega_{d} (x) = \left(\frac{2}{x}\right)^{(d-2)/2}\Gamma \left(\frac{d}{2}\right)J_{(d-2)/2} (x),
\end{equation*} 
form a basis for functions in ${\Bbb R}^{d}$. 
Here $F$ is any nondecreasing bounded function, $\Gamma (d/2)$ is   
the gamma function, and $J_{v}$ is the Bessel function of the first kind of order $v$.  Some familiar examples of $\Omega_{d}$ are    
$\Omega_{1} (x)= \cos (x)$, $\Omega_{2} (x)= J_{0} (x)$,   
and $\Omega_{3} (x)= \sin (x)/x$. Here again, by choosing a nondecreasing bounded function $F$ (or its derivative $f$), we can derive the corresponding kernel from (\ref{yaglom}). For instance in $\Bbb{R}^1$, with the spectral density $f(\omega)=(1-\cos (\omega))/(\pi \omega^2)$, we derive the triangular kernel:
\begin{eqnarray*}
K_I(x-z)&=&\int_{0}^{\infty} \cos(\omega|x-z|) \frac{1-\cos(\omega)}{\pi \omega^2} d\omega \\
&=&\frac{1}{2}(1-|x-z|)^+, 
\end{eqnarray*}
where $(x)^+=\max(x,0)$ (see Figure 1). Note that an isotropic stationary kernel obtained with $\Omega_d$ is positive definite in ${\Bbb{R}^d}$ and in lower dimensions, but not necessarily in higher dimensions. For example, the kernel $K_I(x-z)=(1-|x-z|)^+/2$ is positive definite in ${\Bbb{R}^1}$ but not in ${\Bbb{R}^2}$, see Cressie (1993, p.84) for a counterexample. 
\begin{figure}[!h] \label{pollution}
\centerline{\psfig{figure=fig2.ps,height=5cm}\quad \psfig{figure=fig1.ps,height=5cm}} 
\caption{The spectral density $f(\omega)=(1-\cos (\omega))/(\pi \omega^2)$ (left) and its corresponding isotropic stationary kernel $K_I(x-z)=(1-|x-z|)^+/2$ (right).} 
\end{figure} 
It is interesting to remark from (\ref{yaglom}) that an isotropic stationary kernel has a lower bound \citep{stein:99}:
\begin{equation*}
K_I(\|{\bf x}-{\bf z}\|)/K_I(0)\geq \inf_{x\geq 0}\Omega_{d}(x),
\end{equation*}
thus yielding:
\begin{eqnarray*}
K_I(\|{\bf x}-{\bf z}\|)/K_I(0) &\ge &-1  \hspace{1.45cm} \mbox{in } \Bbb{R}^1 \\
K_I(\|{\bf x}-{\bf z}\|)/K_I(0) &\ge &-0.403  \qquad \mbox{in } \Bbb{R}^2\\
K_I(\|{\bf x}-{\bf z}\|)/K_I(0) &\ge &-0.218  \qquad \mbox{in } \Bbb{R}^3\\
K_I(\|{\bf x}-{\bf z}\|)/K_I(0) &\ge &0 \hspace{1.75cm} \mbox{in } \Bbb{R}^\infty.
\end{eqnarray*}
The isotropic stationary kernels must fall off more quickly as the dimension $d$ increases, as might be expected by examining the basis functions $\Omega_d$. Those in ${\Bbb R}^\infty$ have 
the greatest restrictions placed on
them.  Isotropic stationary kernels that are positive definite in ${\Bbb R}^d$ form a nested family of subspaces. When $d \rightarrow \infty$ the basis $\Omega_d(x)$ goes to $\exp(-x^2)$. \citet{schoenberg:38} proved that if $\beta_{d}$ is the class of positive
definite functions of the form given by \citet{boch:55}, then the
classes for all $d$ have the property:
\begin{equation*}
\beta_{1} \supset \beta_{2} \supset \cdots \supset \beta_{d} \supset
\cdots \supset \beta_{\infty},
\end{equation*}
\noindent
so that as $d$ is increased, the space of available functions is reduced.
Only functions with the basis $\exp(-x^2)$ are 
contained in all the classes. The positive
definite requirement imposes a smoothness condition 
on the basis as the dimension $d$ is increased. Several criteria to check the positive definiteness of stationary kernels can be found in \citet{christa:84}. Further isotropic stationary kernels defined with non-Euclidean norms have recently been discussed by \citet{chripapa:00}.
\begin{figure}[!h] \label{pollution}
\centerline{\psfig{figure=fig3.ps,height=5cm}\quad \psfig{figure=fig4.ps,height=5cm}} 
\vspace{-.3cm}
\centerline{(a)\hspace{7.4cm} (b)}
\vspace{.5cm}
\centerline{\psfig{figure=fig5.ps,height=5cm}\quad \psfig{figure=fig6.ps,height=5cm}}
 \vspace{-.3cm}
\centerline{(c)\hspace{7.4cm} (d)}
\vspace{.5cm}
\centerline{\psfig{figure=fig7.ps,height=5cm}\quad \psfig{figure=fig8.ps,height=5cm}} 
\vspace{-.3cm}
\centerline{(e)\hspace{7.4cm} (f)}
\vspace{.5cm}
\caption{Some isotropic stationary kernels: (a) circular; (b) spherical; (c) rational quadratic; (d) exponential; (e) Gaussian; (f) wave.} 
\end{figure} 


From the spectral representation (\ref{yaglom}), we can construct many isotropic stationary kernels.
Some of the most commonly used are depicted in Figure 2. They are defined by the equations listed in Table 1, where $\theta>0$ is a parameter.
\begin{table}[h!]
\label{kernels}
\begin{center}
\vspace{.3cm}
\begin{tabular}{|l|c|}
\hline
Name of kernel & $K_I(\|{\bf x}-{\bf z}\|)/K_I(0)$ \\ 
\hline
\hline
(a) {\bf Circular} & \\
positive definite in $\Bbb{R}^2$ & $\frac{2}{\pi} \arccos\Big( \frac{\|{\bf x}-{\bf z}\|}{\theta}\Big)-\frac{2}{\pi} \frac{\|{\bf x}-{\bf z}\|}{\theta} \sqrt{1-\big( \frac{\|{\bf x}-{\bf z}\|}{\theta} \big)^2}$ if $\|{\bf x}-{\bf z}\|<\theta$ \\
 & zero otherwise \\
 \hline
(b) {\bf Spherical} & \\
positive definite in $\Bbb{R}^3$ & $1-\frac{3}{2}  \frac{\|{\bf x}-{\bf z}\|}{\theta}+\frac{1}{2} \Big(\frac{\|{\bf x}-{\bf z}\|}{\theta}\Big)^3$ if $\|{\bf x}-{\bf z}\|<\theta$ \\
 & zero otherwise \\
 \hline
(c) {\bf Rational quadratic} & \\
positive definite in $\Bbb{R}^d$ & $1-\frac{\|{\bf x}-{\bf z}\|^2}{\|{\bf x}-{\bf z}\|^2+\theta}$ \\
 & \\
 \hline
(d) {\bf Exponential} & \\
positive definite in $\Bbb{R}^d$ & $\exp \Big(-\frac{\|{\bf x}-{\bf z}\|}{\theta}\Big)$\\
 & \\
 \hline
(e) {\bf Gaussian} & \\
positive definite in $\Bbb{R}^d$ & $\exp \Big(-\frac{\|{\bf x}-{\bf z}\|^2}{\theta}\Big)$ \\
 & \\
 \hline
(f) {\bf Wave} & \\
positive definite in $\Bbb{R}^3$ & $\frac{\theta}{\|{\bf x}-{\bf z}\|} \sin \Big(\frac{\|{\bf x}-{\bf z}\|}{\theta}\Big)$ \\
 & \\
 \hline 
\end{tabular}
\caption{Some commonly used isotropic stationary kernels.}
\end{center}
\end{table}
As an illustration, the exponential kernel (d) is obtained from the spectral representation (\ref{yaglom}) with the spectral density:
\begin{equation*}
f(\omega)=\frac{1}{\frac{\pi}{\theta}+\pi \theta \omega^2},
\end{equation*}
whereas the Gaussian kernel (e) is obtained with the spectral density:
\begin{equation*}
f(\omega)=\frac{\sqrt{\theta}}{2\sqrt{\pi}} \exp \left( -\frac{\theta \omega^2}{4}\right).
\end{equation*}
Note also that the circular and spherical kernels have compact support. They have a linear behavior
at the origin, which is also true for the exponential kernel. The rational quadratic, Gaussian, and wave kernels have a parabolic behavior at the origin. This indicates a different degree of smoothness. Finally, the Mat\'ern kernel \citep{matern:60} has recently received considerable attention, because it allows to control the smoothness with a parameter $\nu$. The Mat\'ern kernel is defined by:
\begin{equation}\label{matern}
K_I(\|{\bf x}-{\bf z}\|)/K_I(0)=\frac{1}{2^{\nu-1}\Gamma(\nu)}\left( \frac{2\sqrt{\nu}\|{\bf x}-{\bf z}\|}{\theta}\right)^{\nu} H_{\nu}\left( \frac{2\sqrt{\nu}\|{\bf x}-{\bf z}\|}{\theta}\right),
\end{equation}
where $\Gamma$ is the Gamma function and $H_{\nu}$ is the modified Bessel function of the second kind of order $\nu$. Note that the Mat\'ern kernel reduces to the exponential kernel for $\nu=0.5$ and to the Gaussian kernel for $\nu \rightarrow \infty$. Therefore, the Mat\'ern kernel includes a large class of kernels and will prove very useful for applications because of this flexibility.



Compactly supported kernels are kernels that vanish whenever the distance between two examples $\bf x$ and $\bf z$ is larger than a certain cut-off distance, often called the range. For instance, the spherical kernel (b) is a compactly supported kernel since $K_I(\|{\bf x}-{\bf z}\|)=0$ when $\|{\bf x}-{\bf z}\|\geq \theta$. This might prove a crucial advantage for certain applications dealing with massive data sets, because the corresponding Gram matrix $G$, whose $ij$-th element is $G_{ij}=K({\bf x}_i,{\bf x}_j)$, will be sparse. Then, linear systems involving the matrix $G$ can be solved very efficiently using sparse linear algebra techniques, see for example \citet{gilb:92}. As an illustrative example in ${\Bbb{R}}^2$, consider 1,000 examples, uniformly distributed in the unit square. Suppose that a spherical kernel (b) is used with a range of $\theta=0.2$. The corresponding Gram matrix contains 1,000,000 entries, of which only 109,740 are not equal to zero, and is represented in the left panel of Figure 3 (black dots represent nonzero entries). 
\begin{figure}[!h] \label{gramsparse}
\centerline{\psfig{figure=spyg.ps,height=7cm}\quad \psfig{figure=spygcut.ps,height=7cm}} 
\caption{The Gram matrix for 1,000 examples uniformly distributed in the unit square, based on a spherical kernel with range $\theta=0.2$: initial (left panel); after reordering (right panel).} 
\end{figure} 
The entries of the Gram matrix can be reordered, for instance with a sparse reverse Cuthill-McKee algorithm (see Gilbert et al., 1992), in order to have the nonzero elements closer to the diagonal. The result is displayed in the right panel of Figure 3. The reordered Gram matrix has now a bandwidth of only 252 instead of 1,000 for the initial matrix, and important computational savings can be obtained. Of course, if the spherical and the circular kernels would be the only compactly supported kernels available, this technique would be limited. Fortunately, large classes of compactly supported kernels can be constructed, see for example \citet{gneit:02} and references therein. A compactly supported kernel of Mat\'ern type can be obtained by multiplying the kernel (\ref{matern}) by the kernel:
\begin{equation*}
\max\left\{ \left(1-\frac{\|{\bf x}-{\bf z}\|}{\tilde \theta}\right)^{\tilde \nu},0\right\},
\end{equation*}
where $\tilde \theta>0$ and ${\tilde \nu}\geq (d+1)/2$, in order to insure positive definiteness. This product is a kernel by the property (\ref{p2}). Beware that it is not possible to simply ``cut-off'' a kernel in order to obtain a compactly supported one, because the result will not be positive definite in general.



\bigskip  

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%   
\section{Locally Stationary Kernels}   
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%   
 
A simple departure from the stationary kernels discussed in the previous section
is provided by locally stationary kernels \citep{silverman:57,silverman:59}:
\begin{equation}\label{locsta}
K({\bf x},{\bf z})=K_1\Big(\frac{{\bf x}+{\bf z}}{2}\Big) K_2({\bf x}-{\bf z}),
\end{equation}
where $K_1$ is a nonnegative function and $K_2$ is a stationary kernel. Note that if $K_1$ is a positive constant, then (\ref{locsta}) reduces to a stationary kernel. Thus, the class of locally stationary kernels has the desirable property of including stationary kernels as a special case.
Because the product of $K_1$ and $K_2$ is defined only up to a multiplicative positive constant, we further impose that $K_2({\bf 0})=1$. The variable $({\bf x}+{\bf z})/2$ has been chosen because of its suggestive meaning of the average or centroid of the examples $\bf x$ and $\bf z$. The variance is determined by:
\begin{equation} \label{vari}   
K({\bf x},{\bf x})=K_1({\bf x}) K_2({\bf 0})=K_1({\bf x}),
\end{equation} 
thus justifying the name of power schedule for $K_1({\bf x})$, which describes the global structure. On the other hand, $K_2({\bf x}-{\bf z})$ is invariant under shifts and thus describes the local structure. It can be obtained by considering:
\begin{equation} \label{antidiag}   
K({\bf x}/2,-{\bf x}/2)=K_1({\bf 0}) K_2({\bf x}).
\end{equation}
Equations (\ref{vari}) and (\ref{antidiag}) imply that the kernel $K({\bf x},{\bf z})$ defined by (\ref{locsta}) is completely determined by its values on the diagonal ${\bf x}={\bf z}$ and antidiagonal ${\bf x}=-{\bf z}$, for:
\begin{equation} \label{repres}   
K({\bf x},{\bf z})=\frac{K(({\bf x}+{\bf z})/2,({\bf x}+{\bf z})/2) K(({\bf x}-{\bf z})/2,-({\bf x}-{\bf z})/2)}{K({\bf 0},{\bf 0})}.
\end{equation}
Thus, we see that $K_1$ is invariant with respect to shifts parallel to the antidiagonal, whereas $K_2$ is invariant with respect to shifts parallel to the diagonal. These properties allow to find moment estimators of both $K_1$ and $K_2$ from a single realization of data, although the kernel is not stationary. 

We already mentioned that stationary kernels are locally stationary. Another special class of locally stationary kernels is defined by kernels of the form:
\begin{equation} \label{expconv}   
K({\bf x},{\bf z})=K_1({\bf x}+{\bf z}),
\end{equation}
the so-called exponentially convex kernels \citep{loeve:46,loeve:48}. From (\ref{repres}), we see immediately that $K_1({\bf x}+{\bf z})\geq 0$. Actually, as noted by Lo\`eve, any two-sided Laplace transform of a nonnegative function is an exponentially convex kernel. A large class of locally stationary kernels can therefore be constructed by multiplying an exponentially convex kernel by a stationary kernel, since the product of two kernels is a kernel by the property (\ref{p2}). However, the following example is a locally stationary kernel in $\Bbb{R}^1$ which is not the product of two kernels:
\begin{equation} \label{locex} 
\exp\Big[-a(x^2+z^2)\Big]=\exp\Big[ -2a((x+z)/2)^2\Big] \, \exp\Big[ -a (x-z)^2/2\Big],\,\, a>0,
\end{equation}
since the first factor in the right side is a positive function without being a kernel, and the second factor is a kernel. Finally, with the positive definite Delta kernel $\delta({\bf x}-{\bf z})$, which is equal to 1 if ${\bf x}={\bf z}$ and 0 otherwise, the product:
\begin{equation*} 
K({\bf x},{\bf z})=K_1\Big(\frac{{\bf x}+{\bf z}}{2}\Big) \delta({\bf x}-{\bf z}),
\end{equation*}
is a locally stationary kernel, often called a locally stationary white noise.

The spectral representation of locally stationary kernels has remarkable properties. Indeed, it can be written as (Silverman, 1957):
\begin{equation*} 
K({\bf x},{\bf z})=\int_{{\Bbb{R}}^d}\int_{{\Bbb{R}}^d}   \cos\big({\boldsymbol \omega}_1^{T}{\bf x}-{\boldsymbol \omega}_2^{T}{\bf z}\big)f_1\Big(\frac{{\boldsymbol \omega}_1+{\boldsymbol \omega}_2}{2}\Big) f_2({\boldsymbol \omega}_1-{\boldsymbol \omega}_2)d{\boldsymbol \omega}_1d{\boldsymbol \omega}_2,
\end{equation*}
i.e. the spectral density $f_1\Big(\frac{{\boldsymbol \omega}_1+{\boldsymbol \omega}_2}{2}\Big) f_2({\boldsymbol \omega}_1-{\boldsymbol \omega}_2)$ is also a locally stationary kernel, and:
\begin{eqnarray*}
K_1({\bf u})&=&\int_{{\Bbb{R}}^d} \cos({\boldsymbol \omega}^{T}{\bf u}) f_2({\boldsymbol \omega})d{\boldsymbol \omega}, \\
K_2({\bf v})&=&\int_{{\Bbb{R}}^d} \cos({\boldsymbol \omega}^{T}{\bf v}) f_1({\boldsymbol \omega})d{\boldsymbol \omega},
\end{eqnarray*}
i.e. $K_1, f_2$ and $K_2, f_1$ are Fourier transform pairs.
For instance, to the locally stationary kernel (\ref{locex}) corresponds the spectral density:
\begin{equation*} 
f_1\Big(\frac{{ \omega}_1+{ \omega}_2}{2}\Big) f_2({ \omega}_1-{ \omega}_2)=\frac{1}{4\pi a} \exp\Big[ -\frac{1}{2a}((\omega_1+\omega_2)/2)^2\Big] \, \exp\Big[ -\frac{1}{8a} (\omega_1-\omega_2)^2/2\Big],
\end{equation*}
which is immediately seen to be locally stationary since, except for a positive factor, it is of the form (\ref{locex}), with $a$ replaced by $1/(4a)$. Thus, we can design many locally stationary kernels with the help of their spectral representation. In particular, we can obtain a very rich family of locally stationary kernels by multiplying a Mat\'ern kernel (\ref{matern}) by an exponentially convex kernel (\ref{expconv}). The resulting product is still a kernel by the property (\ref{p2}).
\bigskip  

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%   
\section{Nonstationary Kernels}   
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%   
 
The most general class of kernels is the one of nonstationary kernels, which depend explicitly on the two examples $\bf x$ and $\bf z$:
 \begin{equation*}
K({\bf x},{\bf z}).
\end{equation*}
For example, the polynomial kernel of degree $p$:
 \begin{equation*}
K({\bf x},{\bf z})=({\bf x}^T {\bf z})^p,
\end{equation*}
is a nonstationary kernel.
The spectral representation of nonstationary kernels is very general.
A nonstationary kernel $K({\bf x},{\bf z})$ is positive definite in ${\Bbb{R}^d}$ if and only if it has the form \citep{yaglom:87}: 
\begin{equation}\label{nonspec}
K({\bf x},{\bf z})=\int_{{\Bbb{R}}^d}\int_{{\Bbb{R}}^d}   \cos\big({\boldsymbol \omega}_1^{T}{\bf x}-{\boldsymbol \omega}_2^{T}{\bf z}\big)F(d{\boldsymbol \omega}_1,d{\boldsymbol \omega}_2),     
\end{equation}   
where $F$ is a positive bounded symmetric measure. When the function $F({\boldsymbol \omega}_1,{\boldsymbol \omega}_2)$ is concentrated on the diagonal ${\boldsymbol \omega}_1={\boldsymbol \omega}_2$, then (\ref{nonspec}) reduces to the spectral representation (\ref{bochn}) of stationary kernels. Here again, many nonstationary kernels can be constructed with (\ref{nonspec}). Of interest are nonstationary kernels obtained from (\ref{nonspec}) with ${\boldsymbol \omega}_1={\boldsymbol \omega}_2$ but with a spectral density that is not integrable in a neighborhood around the origin. Such kernels are referred to as generalized kernels \citep{matheron:73}. For instance, the Brownian motion generalized kernel corresponds to a spectral density $f({\boldsymbol \omega})=1/\| {\boldsymbol \omega} \|^2$ \citep{mandel:68}.

A particular family of nonstationary kernels is the one of separable nonstationary kernels: 
\begin{equation*}
K({\bf x},{\bf z})=K_1({\bf x})K_2({\bf z}),
\end{equation*}
where $K_1$ and $K_2$ are stationary kernels evaluated at the examples $\bf x$ and $\bf z$ respectively. The resulting product is a kernel by the property (\ref{p2}) in Section 1. Separable nonstationary kernels possess the property that their Gram matrix $G$, whose $ij$-th element is $G_{ij}=K({\bf x}_i,{\bf x}_j)$, can be written as a tensor product \citep[also called Kronecker product, see][]{graham:81} of two vectors defined by $K_1$ and $K_2$ respectively. This is especially useful to reduce computational burden when dealing with massive data sets. For instance, consider a set of $l$ examples ${\bf x}_1,\ldots,{\bf x}_l$. The memory requirements fot the computation of the Gram matrix is then reduced from $l^2$ to $2l$ since it suffices to evaluate the vectors ${\bf a}=(K_1({\bf x}_1),\ldots,K_1({\bf x}_l))^T$ and ${\bf b}=(K_2({\bf x}_1),\ldots,K_2({\bf x}_l))^T$. We then have $G={\bf a} {\bf b}^T$. Such a computational reduction can be of crucial importance for certain applications involving very large training sets.

\bigskip  

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%   
\section{Reducible Kernels}   
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%   
 
In this section, we discuss the characterization of nonlinear maps that reduce nonstationary kernels to either stationarity or local stationarity. The main idea is to find a new feature space where stationarity \citep[see][]{sampson:92} or local stationarity \citep[see][]{genper:01} can be achieved. We say that a nonstationary kernel $K({\bf x},{\bf z})$ is stationary reducible if there exist a bijective deformation $\boldsymbol \Phi$ such that:
\begin{equation}\label{stared}
K({\bf x},{\bf z})=K_S^*({\boldsymbol \Phi}({\bf x})-{\boldsymbol \Phi}({\bf z})),
\end{equation}
where $K_S^*$ is a stationary kernel. For example in ${\Bbb{R}}^2$, the nonstationary kernel defined by:
\begin{equation}\label{brownsheet}
K({\bf x},{\bf z})=\frac{\|{\bf x}\|+\|{\bf z}\|-\|{\bf z}-{\bf x}\|}{2\sqrt{\|{\bf x}\| \|{\bf z}\|}},
\end{equation}
is stationary reducible with the deformation:
\begin{equation*}
{\boldsymbol \Phi}(x_1,x_2)=\Big(\ln\big(\sqrt{x_1^2+x_2^2}\big), \arctan(x_2/x_1)\Big)^T,
\end{equation*}
yielding the stationary kernel:
\begin{equation} \label{kuu}
K_S^*(u_1,u_2)=\cosh(u_1/2)-\sqrt{(\cosh(u_1/2)-\cos(u_2))/2}.
\end{equation}
Effectively, it is straightforward to check with some algebra that (\ref{kuu}) evaluated at:
\begin{equation*}
{\boldsymbol \Phi}({\bf x})-{\boldsymbol \Phi}({\bf z})=\left(  \ln\left( \frac{\|{\bf x}\|}{\|{\bf z}\|}\right),\arctan(x_2/x_1)-\arctan(z_2/z_1)\right)^T,
\end{equation*}
yields the kernel (\ref{brownsheet}).
\citet{perrin:99,perrin:00} characterize such deformations ${\boldsymbol \Phi}$. Specifically, if ${\boldsymbol \Phi}$ and its inverse are differentiable in ${\Bbb{R}^d}$, and $K({\bf x},{\bf z})$ is continuously differentiable for ${\bf x}\neq {\bf y}$, then $K$ satisfies (\ref{stared}) if and only if:
\begin{equation} \label{charact}
D_{\bf x} K({\bf x},{\bf z}) Q_{{\boldsymbol \Phi}}^{-1}({\bf x})+D_{\bf z} K({\bf x},{\bf z}) Q_{{\boldsymbol \Phi}}^{-1}({\bf z})={\bf 0},\,\, {\bf x}\neq{\bf y},
\end{equation}
where $Q_{{\boldsymbol \Phi}}$ is the Jacobian of ${\boldsymbol \Phi}$ and $D_{\bf x}$ denotes the partial derivatives operator with respect to $\bf x$. It can easily be checked that the kernel (\ref{brownsheet}) satisfies the above equation (\ref{charact}). Unfortunately, not all nonstationary kernels can be reduced to stationarity through a deformation ${\boldsymbol \Phi}$. Consider for instance the kernel in ${\Bbb{R}^1}$:
\begin{equation}\label{example1}
K(x,z)=\exp(2-x^6-z^6),
\end{equation}
which is positive definite as can be seen from (\ref{p5}). It is obvious that $K(x,z)$ does not satisfy Equation (\ref{charact}) and thus is not stationary reducible. This is the motivation of Genton and Perrin (2001) to extend the model (\ref{stared}) to locally stationary kernels. We say that a nonstationary kernel $K$ is locally stationary reducible   
if there exists a bijective deformation ${\boldsymbol \Phi}$ such that:   
\begin{equation} \label{locstared}   
K({\bf x},{\bf z})=K_1\Big(\frac{{\boldsymbol \Phi}({\bf x})+{\boldsymbol \Phi}({\bf z})}{2}\Big) K_2\big({\boldsymbol \Phi}({\bf x})-{\boldsymbol \Phi}({\bf z})\big),   
\end{equation}   
where $K_1$ is a nonnegative function and $K_2$ is a stationary kernel. Note that if $K_1$ is a positive constant, then Equation (\ref{locstared}) reduces to the model (\ref{stared}). Genton and Perrin (2001) characterize such transformations ${\boldsymbol \Phi}$. For instance, the nonstationary kernel (\ref{example1}) can be reduced to a locally stationary kernel with the transformation:
\begin{equation}   \label{transf}
{\boldsymbol \Phi}(x)=\frac{x^3}{3}-\frac{1}{3},  
\end{equation} 
yielding:
\begin{eqnarray}\label{k1k2}
K_1(u)&=&\exp \left( -18u^2-12u \right)\\
K_2(v)&=&\exp\left( -\frac{9}{2}v^2 \right).
\end{eqnarray}
Here again, it can easily be checked from (\ref{k1k2}), (28), and (\ref{transf}) that:
\begin{equation*}   
K_1\Big(\frac{{\boldsymbol \Phi}({x})+{\boldsymbol \Phi}({z})}{2}\Big) K_2\big({\boldsymbol \Phi}({x})-{\boldsymbol \Phi}({z})\big)=\exp(2-x^6-z^6).  
\end{equation*}  
Of course, it is possible to construct nonstationary kernels that are neither stationary reducible nor locally stationary reducible. Actually, the familiar class of polynomial kernels of degree $p$, $K({\bf x},{\bf z})=({\bf x}^T {\bf z})^p$, cannot be reduced to stationarity or local stationarity with a bijective transformation ${\boldsymbol \Phi}$. Further research is needed to characterize such kernels.

\bigskip  
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%   
\section{Conclusion}   
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%   
   
In this paper, we have described several classes of kernels that can be used for machine learning: stationary (anisotropic/isotropic/compactly supported), locally stationary, nonstationary and separable nonstationary kernels.
Each class has its own particular properties and spectral representation. The latter allows for the design of many new kernels in each class. We have not addressed the question of which class is best suited for a given problem, but we hope that further research will emerge from this paper. It is indeed important to find adequate classes of kernels for classification, regression, density estimation, and clustering. Note that kernels from the classes presented in this paper can be combined indefinitely by using the properties (\ref{p1})-(\ref{p8}). This should prove useful to researchers designing new kernels and algorithms for machine learning. In particular, the reducibility of nonstationary kernels to simpler kernels which are stationary or locally stationary suggests interesting applications. For instance, locally stationary kernels are in fact separable kernels in a new coordinate system defined by $({\bf x}+{\bf z})/2$ and ${\bf x}-{\bf z}$, and as already mentioned, provide computational advantages when dealing with massive data sets.
  




% Acknowledgements should go at the end, before appendices and references

\acks{I would like to acknowledge support for this project
from U.S. Army TACOM Research, Development and Engineering Center under the auspices of the U.S. Army Research Office Scientific Services Program administered by Battelle (Delivery Order 634, Contract No. DAAH04-96-C-0086, TCN 00-131). I would like to thank David Gorsich from U.S. Army TACOM, Olivier Perrin, as well as the Editors and two anonymous reviewers, for their comments that improved the manuscript.}


\vskip 0.2in
\bibliography{genton.kernel}





   



\end{document}
