\documentclass{article}

\usepackage{jmlr2e}



%% Probabilities and expectations

\newcommand{\p}[1]{\mathbb{P}\left[#1 \right]}

\newcommand{\e}[1]{\mathbb{E}\left[#1 \right]}

\newcommand{\pp}[2]{\mathbb{P}_{#1}\left[#2 \right]}

\newcommand{\ee}[2]{\mathbb{E}_{#1}\left[#2 \right]}

\newcommand{\Pz}[1]{\mathbb{P}_z\left[#1 \right]}

\newcommand{\PS}[1]{\mathbb{P}_S\left[#1 \right]}

\newcommand{\PT}[1]{\mathbb{P}_T\left[#1 \right]}

\newcommand{\Ez}[1]{\mathbb{E}_z\left[#1 \right]}

\newcommand{\ESz}[2]{\mathbb{E}_{S,#1}\left[#2 \right]}

\newcommand{\ES}[1]{\mathbb{E}_S\left[#1 \right]}

\newcommand{\ip}[1]{\left\langle #1 \right\rangle}

\newcommand{\note}[1]{{\bf{Note:#1}}\newline}



\newcommand{\X}{\mathcal{X}}

\newcommand{\Y}{\mathcal{Y}}

\newcommand{\Z}{\mathcal{Z}}

\newcommand{\F}{\mathcal{F}}

\newcommand{\h}{\mathcal{H}}

\newcommand{\D}{\mathcal{D}}

\newcommand{\vz}{\vec{0}}

% Variations

\newcommand{\ci}{^i} % change i-th component

\newcommand{\cj}{^j} % change j-th component

\newcommand{\cij}{^i,j} % change i-th and j-th components

\newcommand{\ri}{^{\backslash i}} % remove i-th component

\newcommand{\rj}{^{\backslash j}} % remove j-th component

\newcommand{\rij}{^{\backslash i,j}} % remove i-th and j-th components

\newcommand{\rr}[1]{^{\backslash #1}}

\newcommand{\cc}[1]{^{#1}}

% Samples

\newcommand{\Ss}{S}

\newcommand{\Sci}{{S\ci}}

\newcommand{\Scj}{{S\cj}}

\newcommand{\Scij}{{S\cij}}

\newcommand{\Sri}{{S\ri}}

\newcommand{\Srj}{{S\rj}}

\newcommand{\Srij}{{S\rij}}

\newcommand{\So}{{S^{\backslash 1}}}

\newcommand{\St}{{S^{\backslash 2}}}

% Application of the algorithm to the samples

\newcommand{\AS}{A_\Ss}

\newcommand{\ASci}{A_{\Sci}}

\newcommand{\AScj}{A_{\Scj}}

\newcommand{\AScij}{A_{\Scij}}

\newcommand{\ASri}{A_{\Sri}}

\newcommand{\ASrj}{A_{\Srj}}

\newcommand{\ASrij}{A_{\Srij}}

\newcommand{\ASo}{A_{\So}}

\newcommand{\ASt}{A_{\St}}

% Functions

\newcommand{\f}{f}

\newcommand{\fci}{f\ci}

\newcommand{\fri}{f\ri}

% Errors

% true error

\newcommand{\R}{R}

\newcommand{\Rci}{R\ci}

\newcommand{\Rri}{R\ri}

\newcommand{\Rg}{R^\gamma}

% empirical error

%\newcommand{\Res}{\hat{R}}
\newcommand{\Res}{{R}_{emp}}

%\newcommand{\Reci}{\hat{R}\ci}

\newcommand{\Reci}{{R}_{emp}\ci}

%\newcommand{\Reri}{\hat{R}\ri}

\newcommand{\Reri}{{R}_{emp}\ri}

%\newcommand{\Reg}{\hat{R}^\gamma}

\newcommand{\Reg}{{R}_{emp}^\gamma}

% leave-one-out error

%\newcommand{\Rl}{\bar{R}}

\newcommand{\Rl}{{R}_{loo}}

%\newcommand{\Rlci}{\bar{R}\ci}

\newcommand{\Rlci}{{R}_{loo}\ci}

%\newcommand{\Rlri}{\bar{R}\ri}

\newcommand{\Rlri}{{R}_{loo}\ri}

%\newcommand{\Rlg}{\bar{R}^\gamma}

\newcommand{\Rlg}{{R}_{loo}^\gamma}

% regularized empirical error

\newcommand{\Rr}{R_r}

\newcommand{\Rrci}{R_r\ci}

\newcommand{\Rrri}{R_r\ri}



\jmlrheading{2}{2002}{499-526}{7/01}{3/02}{Olivier Bousquet and
Andr\'e Elisseeff}

\ShortHeadings{Stability and Generalization}{Bousquet and
Elisseeff}


\firstpageno{499}



\title{Stability and Generalization}



\editor{Dana Ron}



%\editor{Leslie Pack Kaelbling}



\begin{document}



\author{\name Olivier~Bousquet \email bousquet@cmapx.polytechnique.fr\\
\addr CMAP, Ecole Polytechnique\\
F-91128 Palaiseau, FRANCE\\
\AND \name Andr\'e~Elisseeff
 \email andre.elisseeff@biowulf.com\\
\addr BIOwulf Technologies\\
305 Broadway,\\
New-York, NY 10007}


\maketitle


\begin{abstract}%

We define notions of stability for learning algorithms
 and show
how to use these notions to derive generalization error bounds
based on the empirical error and the leave-one-out error. The
methods we use can be applied in the regression framework as well
as in the classification one when the classifier is obtained by
thresholding a real-valued function. We study the stability
properties of large classes of learning algorithms such as
regularization based algorithms. In particular we focus on Hilbert
space regularization and Kullback-Leibler regularization. We
demonstrate how to apply the results to SVM for regression and
classification.

\end{abstract}

%\begin{keywords}
%Algorithmic Stability, Generalization bounds, Concentration
%inequalities.
%\end{keywords}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% SECTION

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}

A key issue in the design of efficient Machine Learning systems is
the estimation of the accuracy of learning algorithms. Among the
several approaches that have been proposed to this problem, one of
the most prominent is based on the theory of uniform convergence
of empirical quantities to their mean (see e.g. Vapnik, 1982).
This theory provides ways to estimate the {\em risk} (or
generalization error) of a learning system based on an empirical
measurement of its accuracy and a measure of its complexity, such
as the Vapnik-Chervonenkis (VC) dimension or the fat-shattering
dimension (see e.g. Alon et al., 1997).

\iffalse
 Another inconvenient
that is often invoked to explain the looseness of certain bounds
on generalization error is that the theory does not take into
account explicitly the nature of the learning algorithm. Whatever
the system is learnt, only the structure of the model, the set of
function from which the algorithm pick its outcome is important.
One could think then that adding some knowledge about the learning
method into the theory would give a better understanding of what
generalization is. A solution has been proposed by \cite{Sha_al96}
and concerns linear models. They prove a bound on the
generalization error of a linear model in terms of the $\ell_2$
norm of its parameters. Then, they define a framework where a
"good" algorithm would be one that minimize this norm. This result
is interesting but its application has been done only for linear
systems. \fi

 We explore here a different approach which is based
on {\em sensitivity analysis}. Sensitivity analysis aims at
determining how much the variation of the input can influence the
output of a system.\footnote{For a qualitative discussion about
sensitivity analysis with links to other resources see e.g. {\tt
http://sensitivity-analysis.jrc.cec.eu.int/}} It has been applied
to many areas such as statistics and mathematical programming. In
the latter domain, it is often referred to as perturbation
analysis (see Bonnans and Shapiro, 1996, for a survey). The
motivation for such an analysis is to design robust systems that
will not be affected by noise corrupting the inputs.

 In this paper, the
objects of interest are learning algorithms. They take as input a
learning set made of instance-label pairs and output a function
that maps instances to the corresponding labels. The sensitivity
in that case is thus related to changes of the outcome of the
algorithm when the learning set is changed. There are two sources
of randomness such algorithms have to cope with: the first one
comes from the sampling mechanism used to generate the learning
set and the second one is due to noise in the measurements (on the
instance and/or label). In contrast to standard approaches to
sensitivity analysis, we mainly focus on the sampling randomness
and we thus are interested in how changes in the composition of
the learning set influence the function produced by the algorithm.
The outcome of such an approach is a principled way of getting
bounds on the difference between empirical and generalization
error. These bounds are obtained using powerful statistical tools
known as concentration inequalities. The latter are the
mathematical device corresponding to the following statement, from
\cite{Talagrand96}:

\begin{quote}
A random variable that depends (in a ``smooth way'') on the
influence of many independent variables (but not too much on any
of them) is essentially constant.
\end{quote}

The expression ``essentially constant'' actually means that the
random variable will have, with high probability, a value close to
its expected value. We will apply these inequalities to the random
variable we are interested in, that is, the difference between an
empirical measure of error and the true generalization error. We
will see that this variable has either a zero expectation or it
has a nice property: the condition under which it concentrates
around its expectation implies that its expectation is close to
zero. That means that if we impose conditions on the learning
system such that the difference between the empirical error and
the true generalization error is roughly constant, then this
constant is zero. This observation and the existence of
concentration inequalities will allow us to state exponential
bounds on the generalization error of a stable learning system.


\iffalse

Learning algorithms are randomized algorithms where randomness
comes from the sampling of the learning examples. By defining a
stable learning system, the goal is actually to see how to put
more and more determinism into the algorithm such that its
behavior is more controlled. The stability approach we propose has
thus another goal than classical studies: rather than comparing
one learning system to the best one, it just focus on how to
control the true risk from an empirical measurement and this, only
for the outcome of a specific learning algorithm.



Robustness can also be found in some other area, where its
interpretation is then slightly different. Consider concentration
inequalities. The latter will be discussed more precisely in the
paper, but let us informally discuss one of the most famous. Mc
Diarmid's inequality has been introduced at the end of the 80's.
It shows that a random variable $X$ computed as a function of
i.i.d. random variables $Y_i$, $i=1,..,m$ concentrates around its
expectation. By concentration, it is meant that the difference
between the mean and the random value is small with high
probability. As a consequence, two different sets of $Y_i$ will
lead roughly to the same value of $X$. Then, as an extreme
interpretation, $X$ can be thought of as a constant value. Mc
Diarmid's theorem requires that the value of $X$ changes only a
little when one of the $Y_i$ is removed or changed. It typically
corresponds to a robustness criterion. The less change in $X$, the
more concentrated it will be around its mean. Under the same
hypothesis as Mc Diarmid's theorem, one can also show that the
variance of $X$ is small and depends on the way $X$ changes with
respect to changes in the $Y_i$'s. A random variable with a small

variance compared to its mean value can then be almost interpreted

as a constant value equals to its mean. Once again, robustness

removes the effect of randomness in the sense that different

sampling lead to roughly the same value.\fi



The outline of the paper is as follows: after reviewing previous
work in the area of stability analysis of learning algorithms, we
introduce three notions of stability (Section \ref{sec:stab}) and
derive bounds on the generalization error of stable learning
systems (Section \ref{sec:bounds}). In Section \ref{sec:reg}, we
show that many existing algorithms such as SVM for classification
and regression, ridge regression or variants of maximum relative
entropy discrimination do satisfy the stability requirements. For
each of these algorithms, it is then possible to derive original
bounds which have many attractive properties.



\subsection*{Previous work}

It has long been known that when trying to estimate an unknown
function from data, one needs to find a tradeoff between bias and
variance.\footnote{We deliberately do not provide a precise
definition of bias and variance and resort to common intuition
about these notions. In broad terms, the bias is the best error
that can be achieved and the variance is the difference between
the typical error and the best error.} Indeed, on one hand, it is
natural to use the largest model in order to be able to
approximate any function, while on the other hand, if the model is
too large, then the estimation of the best function in the model
will be harder given a restricted amount of data. Several ideas
have been proposed to fight against this phenomenon. One of them
is to perform estimation in several models of increasing size and
then to choose the best estimator based on a complexity penalty
(e.g. Structural Risk Minimization). This allows to control the
complexity while allowing to use a large model. This technique is
somewhat related to regularization procedures that we will study
in greater detail in subsequent sections. Another idea is to use
statistical procedures to reduce the variance without altering the
bias. One such technique is the bagging approach of
\cite{Breiman96a} which consists in averaging several estimators
built from random subsamples of the data.



Although it is generally accepted that having a low variance (or a
high stability in our terminology) is a desirable property for a
learning algorithm, there are few quantitative results relating
the generalization error to the stability of the algorithm with
respect to changes in the training set. The first such results
were obtained by Devroye, Rogers and Wagner in the seventies (see
Rogers and Wagner, 1978, Devroye and Wagner, 1979a,b).
\cite{RogWag78} first showed that the variance of the
leave-one-out error can be upper bounded by what \cite{KeaRon99}
later called {\em hypothesis stability}. This quantity measures
how much the function learned by the algorithm will change when
one point in the training set is removed. The main distinctive
feature of their approach is that, unlike VC-theory based
approaches where the only property of the algorithm that matters
is the size of the space to be searched, it focuses on how the
algorithm searches the space. This explains why it has been
successfully applied to the $k$-Nearest Neighbors algorithm
($k$-NN) whose search space is known to have an infinite
VC-dimension. Indeed, results from VC-theory would not be of any
help in that case since they are meaningful when the learning
algorithm performs minimization of the empirical error in the full
function space. However, the $k$-NN algorithm is very stable
because of its 'locality'. This allowed Rogers and Wagner to get
an upper bound on the difference between the leave-one-out error
and the generalization error of such a classifier. These results
were later extended to obtain bounds on the generalization error
of k-local rules in \cite{DevWag79a}, and of potential rules in
\cite{DevWag79b}.



In the early nineties, concentration inequalities became popular
in the probabilistic analysis of algorithms, due to the work of
\cite{McD89} and started to be used as tools to derive
generalization bounds for learning algorithms by \cite{Devroye91}.
Building on this technique, \cite{LugPaw94} obtained new bounds
for the $k$-NN, kernel rules and histogram rules. These bounds
used ``smoothed estimates'' of the error which estimate the
posterior probability of error instead of simply counting the
errors. This smoothing is very much related to the use of
real-valued classifiers and we will see that it is at the heart of
the applicability of stability analysis to classification
algorithms. A comprehensive account of the application of
McDiarmid's inequality to obtain bounds for the leave-one-out
error or the smoothed error of local classifiers can be found in
\cite{Dev_al96}.



Independently from this theoretical analysis, practical methods
have been developed to deal with instability of learning
algorithms. In particular, \cite{Breiman96a,Breiman96b} introduced
the Bagging technique which is presented as a method to combine
single classifiers in such a way that the variance of the overall
combination is decreased. However, there is no theoretical
guarantee that this variance reduction will bring an improvement
on the generalization error.


Finally, a more recent work has shown an interesting connection
between stability and VC-theory. \cite{KeaRon99} derived what they
called sanity-check bounds.  In particular, they proved that an
algorithm having a search space of finite VC-dimension, is stable
in the sense that its stability (in a sense to be defined later)
is bounded by its VC-dimension. Thus using the stability as a
complexity measure does not give worse bounds than using the
VC-dimension.



The work presented here follows and extends the stability approach
of \cite{LugPaw94} in that we derive exponential upper bounds on
the generalization error based on notions of stability. It is
based on earlier results presented in \cite{Bou_Eli01}. We
consider both the leave-one-out error and the empirical error as
possible estimates of the generalization error. We prove stability
bounds for a large class of algorithms which includes the Support
Vector Machines, both in the regression and in the classification
cases. Also we generalize some earlier results from Devroye and
Wagner.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% SECTION

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Preliminaries}

We first introduce some notation and then the main tools we will
use to derive inequalities.

\subsection{Notations}

$\X$ and $\Y\subset \mathbb{R}$ being respectively an input and an output
space, we consider a training set
\[
S=\{z_1=(x_1,y_1),..,z_m=(x_m,y_m)\}\,,
\]

of size $m$ in $\Z=\X\times\Y$ drawn i.i.d. from an unknown
distribution $D$. A learning algorithm is a function $A$ from
$\Z^m$ into $\F \subset \Y^\X$ which maps a learning set $S$ onto
a function $\AS$ from $\X$ to $\Y$. To avoid complex notation, we
consider only deterministic algorithms. It is also assumed that
the algorithm $A$ is symmetric with respect to $S$, {\em i.e.} it
does not depend on the order of the elements in the training set.
Furthermore, we assume that all functions are measurable and all
sets are countable which does not limit the interest of the
results presented here.



Given a training set $S$ of size $m$, we will build, for all
$i=1.\ldots,m$, modified training sets as follows:
\begin{itemize}
\item By {\em removing} the $i$-th element

\[
\Sri = \{z_1,\ldots,z_{i-1},z_{i+1},\ldots,z_m\}\,.
\]

\item By {\em replacing} the $i$-th element

\[
\Sci = \{z_1,\ldots,z_{i-1},z_i',z_{i+1},\ldots,z_m\}\,.
\]
where the replacement example $z_i'$ is assumed to be drawn from
$D$ and is independent from $S$.

\end{itemize}



Unless they are clear from context, the random variables over
which we take probabilities and expectation will be specified in
subscript. We thus introduce the notation $\PS{.}$ and $\ES{.}$ to
denote respectively the probability and the expectation with
respect to the random draw of the sample $S$ of size $m$ (drawn
according to $D^m$). Similarly, $\Pz{.}$ and $\Ez{.}$ will denote
the probability and expectation when $z$ is sampled according to
$D$.


In order to measure the accuracy of the predictions of the
algorithm, we will use a {\em cost function} $c:\Y\times\Y
\rightarrow \mathbb{R}^+$. The {\em loss} of an hypothesis $f$
with respect to an example $z=(x,y)$ is then defined as
\[
\ell(f,z) = c(f(x),y)\,.
\]
We will consider several measures of the performance of an
algorithm. The main quantity we are interested in is the {\em
risk} or {\em generalization error}. This is a random variable
depending on the training set $S$ and it is defined as
\[
\R(A,S)= \Ez{\ell(\AS,z)}\,.
\]
Unfortunately, $R$ cannot be computed since $D$ is unknown. We
thus have to estimate it from the available data $S$. We will
consider several estimators for this quantity.


The simplest estimator is the so-called {\em empirical error}
(also known as {\em resubstitution estimate}) defined as

\[
\Res(A,S) = \frac{1}{m}\sum_{i=1}^m \ell(\AS, z_i)\,.
\]

Another classical estimator is the {\em leave-one-out error} (also
known as {\em deleted estimate}) defined as
\[
\Rl(A,S) = \frac{1}{m}\sum_{i=1}^m \ell(\ASri,z_i)\,.
\]

When the algorithm is clear from context, we will simply write
$R(S)$, $\Res(S)$ and $\Rl(S)$. We will often simplify further the
notations when the training sample is clear from context. In
particular, we will use the following shorthand notations $\R
\equiv \R(A,\Ss)$, $\Res \equiv \Res(A,\Ss)$, and $\Rl \equiv
\Rl(A,\Ss)$.

\subsection{Main Tools}

The study we describe here intends to bound the difference between
empirical and generalization error for specific algorithms. For
any $\epsilon > 0$ our goal is to bound the term
\begin{equation}\label{sta:10}
\PS{\left|\Res(A,\Ss) - \R(A,\Ss)\right| > \epsilon}\,,
\end{equation}
which differs from what is usually studied in learning theory
\begin{equation}\label{eq:vc}
\PS{\sup_{f\in\F}\left| \Res(f) - \R(f)\right| > \epsilon}\,.
\end{equation}

Indeed, we do not want to have a bound that holds uniformly over
the whole space of possible functions since we are interested in
algorithms that may not explore it. Moreover we may not even have
a way to describe this space and assess its size. This explains
why we want to focus on (\ref{sta:10}).



Our approach is based on inequalities that relate moments of
multi-dimensional random functions to their first order finite
differences. The first one is due to \cite{Steele86} and provides
bounds for the variance. The second one is a version of Azuma's
inequality due to \cite{McD89} and provides exponential bounds but
its assumptions are more restrictive.


\begin{theorem}[Steele, 1986]\label{th:es}
Let $S$ and $\Sci$ defined as above, let $F:\Z^m\rightarrow\R$ be
any measurable function, then
%denoting by $\ES{F}$ the expectation
%of $F$ over $\Z^m$, we have\footnote{Note that we write
%$\ES{F(\Sci)}$ for $\ES{F(p_i(S))}$ where $p_i$ is the projection
%over all components but the $i^{th}$ one.}
\[
\ES{\left(F(S)- \ES{F(S)}\right)^2} \leq \frac{1}{2}\sum_{i=1}^m
\ESz{z_i'}{\left(F(S)-F(\Sci)\right)^2}
\]
\end{theorem}

\begin{theorem}[McDiarmid, 1989]\label{th:mcd}
Let $S$ and $\Sci$ defined as above, let $F:\Z^m\rightarrow\R$ be
any measurable function for which there exists constants $c_i$
($i=1,\ldots,m$) such that
\[
\sup_{S\in \Z^m, z_i'\in \Z} \left|F(S)-F(\Sci)\right| \leq c_i\,,
\]
then
\[
\PS{F(S)-\ES{F(S)}\geq \epsilon}\leq e^{-2\epsilon^2/\sum_{i=1}^n
c_i^2}\,.
\]
\end{theorem}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% SECTION

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Defining the Stability of a Learning Algorithm}\label{sec:stab}
There are many ways to define and quantify the stability of a
learning algorithm. The natural way of making such a definition is
to start from the goal: we want to get bounds on the
generalization error of specific learning algorithm and we want
these bounds to be tight when the algorithm satisfies the
stability criterion.



As one may expect, the more restrictive a stability criterion is,
the tighter the corresponding bound will be.



In the learning model we consider, the randomness comes from the
sampling of the training set. We will thus consider stability with
respect to changes in the training set. Moreover, we need an easy
to check criterion so that we will consider only restricted
changes such as the removal or the replacement of one single
example in the training set.



Although not explicitly mentioned in their work, the first such
notion was used by \cite{DevWag79a} in order to get bounds on the
variance of the error of {\em local} learning algorithms. Later,
\cite{KeaRon99} stated it as a definition and gave it a name. We
give here a slightly modified version of Kearns and Ron's
definition that suits our needs.

\begin{definition}[Hypothesis Stability]
An algorithm $A$ has {\em hypothesis stability} $\beta$ with
respect to the loss function $\ell$ if the following holds
\begin{equation}\label{eq:hstab}
\forall i\in \{1,\ldots,m\},\; \ESz{z}{|\ell(\AS,z) -
\ell(\ASri,z)|} \leq \beta\,.
\end{equation}
\end{definition}
Note that this is the $L_1$ norm with respect to $D$, so that we
can rewrite the above as
\[
\ES{\|\ell(\AS,.)-\ell(\ASri,.)\|_1}\leq \beta
\]

We will also use a variant of the above definition in which
instead of measuring the average change, we measure the change at
one of the training points.

\begin{definition}[Pointwise Hypothesis Stability]
An algorithm $A$ has {\em pointwise hypothesis stability} $\beta$
with respect to the loss function $\ell$ if the following holds
\begin{equation}\label{eq:hstab2}
\forall i\in \{1,\ldots,m\},\; \ES{|\ell(\AS,z_i) -
\ell(\ASri,z_i)|} \leq \beta\,.
\end{equation}
\end{definition}
Another, weaker notion of stability was introduced by Kearns and
Ron. It consists of measuring the change in the expected error of
the algorithm instead of the average pointwise change.

\begin{definition}[Error Stability]
An algorithm $A$ has {\em error stability} $\beta$  with respect
to the loss function $\ell$ if the following holds
\begin{equation}\label{eq:estab}
\forall S \in \Z^{m},\;\forall i\in \{1,\ldots,m\},\;
|\Ez{\ell(\AS,z)} - \Ez{\ell(\ASri,z)}| \leq \beta\,,
\end{equation}
which can also be written
\begin{equation}
\forall S \in \Z^{m},\;\forall i\in \{1,\ldots,m\},\; |\R(S) -
\Rri(S)| \leq \beta\,.
\end{equation}

\end{definition}

Finally, we introduce a stronger notion of stability which will
allow to get tight bounds. Moreover we will show that it can be
applied to large classes of algorithms.

\begin{definition}[Uniform Stability]
An algorithm $A$ has {\em uniform stability} $\beta$ with respect
to the loss function $\ell$ if the following holds
\begin{equation}
\label{eq:ustab} \forall S \in \Z^{m},\;\forall i\in
\{1,\ldots,m\},\; \|\ell(\AS,.) - \ell(\ASri,.)\|_\infty \leq
\beta\,.
\end{equation}
\end{definition}
Notice that (\ref{eq:hstab}) implies (\ref{eq:estab}) and
(\ref{eq:ustab}) implies (\ref{eq:hstab}) so that uniform
stability is the strongest notion.



Considered as a function of $m$, the term $\beta$ will sometimes
be denoted by $\beta_m$.  We will say that an algorithm is {\em
stable} when the value of $\beta_m$ decreases as $\frac{1}{m}$. An
algorithm with uniform stability $\beta$ has also the following
property:

\[
\forall S,\;\forall z_i',\;|\ell(\AS,z)-\ell(\ASci,z)|\leq
|\ell(\AS,z)-\ell(\ASri,z))|+|\ell(\ASci,z)-\ell(\ASri,z)| \leq
2\beta\,.
\]

In other words, stability with respect to the exclusion of one
point implies stability with respect to changes of one point.

We will assume further that as a function of the sample size, the
stability is non-increasing. This will be the case in all our
examples. This assumption is not restrictive since its only
purpose is to simplify the statement of the theorems (we will
always upper bound $\beta_{m-1}$ by $\beta_m$).



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% SECTION

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Generalization Bounds for Stable Learning Algorithms}\label{sec:bounds}

We start this section by introducing a useful lemma about the bias
of the estimators we study.

\begin{lemma}\label{le:bias}
For any symmetric learning algorithm $A$, we have $\forall i \in
\{1,..,m\}$:
\[
\ES{\R(A,\Ss)-\Res(A,\Ss)} =
\ESz{z_i'}{\ell(\AS,z_i')-\ell(\ASci,z_i')}\,,
\]
and
\[
\ES{\R(A,\Sri)-\Rl(A,\Ss)} = 0\,,
\]
and
\[
\ES{\R(A,\Ss)-\Rl(A,\Ss)} = \ESz{z}{\ell(\AS,z)-\ell(\ASri,z)}\,,
\]
\end{lemma}

\begin{proof}
For the first equality, we just need to compute the expectation of
$\Res(A,\Ss)$. We have
\[
\ES{\Res(\Ss)} = \frac{1}{m}\sum_{j=1}^m \ES{\ell(\AS,z_j)} =
\frac{1}{m}\sum_{j=1}^m \ESz{z_i'}{\ell(\AS,z_j)}\,,
\]
and renaming $z_j$ as $z_i'$ we get, $\forall i \in \{1,..,m\}$
\[
\ES{\Res(\Ss)} = \ESz{z_i'}{\ell(\ASci,z_i')}\,,
\]
by the i.i.d. and the symmetry assumptions. This proves the first
equality. Similarly we have
\[
\ES{\Rl(\Ss)} = \frac{1}{m}\sum_{i=1}^m \ES{\ell(\ASri,z_i)} =
\frac{1}{m}\sum_{i=1}^m \ESz{z}{\ell(\ASri,z)}\,,
\]
from which we deduce the second and third equalities.
\end{proof}

\begin{remark}
We notice from the above lemma, comparing the first and last
equalities, that the empirical error and the leave-one-out error
differ from the true error in a similar way. It is usually
accepted that the empirical error is very much optimistically
biased while the leave-one-out error is almost unbiased (due to
the second equation of the lemma). However, we will see that for
the particular algorithms we have in mind (which display high
stability), the two estimators are very close to each other. The
similarity of the bounds we will derive for both estimators will
be striking. This can be explained intuitively by the fact that we
are considering algorithms that do not directly minimize the
empirical error but rather a regularized version of it, so that
the bias in the empirical error will be reduced.
\end{remark}



\subsection{Polynomial Bounds with Hypothesis Stability}
In this section we generalize a lemma from \cite{DevWag79b}. Their
approach consists in bounding the second order moment of the
estimators with the hypothesis stability of the algorithm. For
this purpose, one could simply use Theorem \ref{th:es}. However
this theorem gives a bound on the variance and we need here the
second order moment of the difference between the error
(leave-one-out or empirical) and the generalization error. It
turns out that a direct study of this quantity leads to better
constants than the use of Theorem \ref{th:es}.

\begin{lemma}\label{le:dev}
For any learning algorithm $A$ and loss function $\ell$ such that
$0\leq c(y,y')\leq M$ we have for any $i,j\in\{1,\ldots,m\}$,
$i\neq j$ for the empirical error,
\begin{equation}\label{eq:dev1}
\ES{(\R-\Res)^2} \leq \frac{M^2}{2m} +
3M\ESz{z_i'}{|\ell(\AS,z_i)-\ell(\ASci,z_i)|}\,,
\end{equation}
and
\begin{eqnarray*}
\ES{(\R-\Res)^2} &\leq& \frac{M^2}{2m} +
M\ESz{z_i',z}{|\ell(\AS,z)-\ell(\ASci,z)|}\\&& +
M\ESz{z_i'}{|\ell(\AS,z_j)-\ell(\ASci,z_j)|}\\&& +
M\ESz{z_i'}{|\ell(\AS,z_i)-\ell(\ASci,z_i)|}\,,
\end{eqnarray*}
and for the leave-one-out error,
\begin{equation}\label{eq:dev3}
\ES{(\R-\Rl)^2} \leq \frac{M^2}{2m} +
3M\ESz{z}{|\ell(\AS,z)-\ell(\ASri,z)|}\,,
\end{equation}
and
\begin{equation}\label{eq:dev4}
\ES{(\R-\Rl)^2} \leq \frac{M^2}{2m} + 2
M\ESz{z_i',z}{|\ell(\AS,z)-\ell(\ASci,z)|+
|\ell(\AS,z)-\ell(\ASri,z)|}\,.
\end{equation}
\end{lemma}

The proof of this lemma is given in the appendix.

\begin{remark}
Notice that Devroye and Wagner's work focused on the leave-one-out
estimator and on classification. We extend it to regression and to
the empirical estimator, which they treated with the following
easy-to-prove inequality
\[
\ES{(\R-\Res)^2}\leq 2\ES{(\R-\Rl)^2} +
2M\ES{|\ell(\AS,z_i)-\ell(\ASri,z_i)|}\,,
\]
which gives a similar result but with worse constants.
\end{remark}
Let's try to explain the various quantities that appear in the
upper bounds of the above lemma. We notice that the term
$\frac{M^2}{2m}$ is always present and it cannot be avoided even
for a very stable algorithm and somehow corresponds to the bias of
the estimator. In Inequality (\ref{eq:dev1}), the expectation in
the right-hand side corresponds to the following situation:
starting from training set $S$ we measure the error at point
$z_i\in S$, then we replace $z_i\in S$ by $z_i'$ and we again
measure the error at $z_i$ which is no longer in the training set.
Then, in the second inequality of Lemma \ref{le:dev} several different
quantities appear. They all correspond to comparing the algorithm trained on
$S$ and on $\Sci$ (where $z_i$ is replaced by $z_i'$) but the
comparison point differs: it is either $z$, a point which is not
part of the training set, or $z_j$, a point of the training set
different from $z_i$ or finally $z_i$.


For the leave-one-out error, in (\ref{eq:dev3}) we consider the
average difference in error when trained on $S$ and on $\Sri$
(where $z_i$ has been removed) and in (\ref{eq:dev4}), the first
expectation in the right hand side corresponds to the average
difference in error when one point is changed while the second one
is the average difference in error when one point is removed.

All these quantities capture a certain aspect of the stability of
the algorithm. In order to use the lemma, we need to bound them
for specific algorithms. Instead of using all these different
quantities, we will rather focus on the few notions of stability
we introduced and see how they are related. We will see later how
they can be computed (or upper bounded) in particular cases.


Now that we have a bound on the expected squared deviation of the
estimator to the true error, the next step is to use Chebyshev's
inequality in order to get a bound which holds with high
probability on the deviation.

\begin{theorem}
For any learning algorithm $A$ with hypothesis stability $\beta_1$
and pointwise hypothesis stability $\beta_2$ with respect to a
loss function $\ell$ such that $0\leq c(y,y')\leq M$, we have with
probability $1-\delta$,
\[
\R(A,\Ss) \leq \Res(A,\Ss) + \sqrt{\frac{M^2 +
12Mm\beta_2}{2m\delta}}\,,
\]
and
\[
\R(A,\Ss) \leq \Rl(A,\Ss) + \sqrt{\frac{M^2 + 6Mm\beta_1}{2m\delta}}\,.
\]
\end{theorem}

\begin{proof}
First, notice that for all $S$ and all $z$,
\[
|\ell(\AS,z) - \ell(\ASci,z)|\leq |\ell(\AS,z) -
\ell(\ASri,z)|+|\ell(\ASri,z) - \ell(\ASci,z)|\,,
\]
so that we get
\[
\ESz{z_i'}{|\ell(\AS,z_i)-\ell(\ASci,z_i)|}\leq
\ES{|\ell(\AS,z_i)-\ell(\ASri,z_i)|}+
\ESz{z_i'}{|\ell(\ASri,z_i)-\ell(\ASci,z_i)|}\leq 2\beta_2\,.
\]
We thus get by (\ref{eq:dev1})
\[
\ES{(\R-\Res)^2} \leq \frac{M^2}{2m} + 6M\beta_2\,.
\]
Also, we have by (\ref{eq:dev4})
\[
\ES{(\R-\Rl)^2} \leq \frac{M^2}{2m} + 3M\beta_1\,.
\]
Now, recall that Chebyshev's inequality gives for a random
variable $X$
\[
\p{X\geq\epsilon}\leq \frac{\e{X^2}}{\epsilon^2}\,,
\]
which in turn gives that for all $\delta>0$, with probability at
least $1-\delta$,
\[
X\leq \sqrt{\frac{\e{X^2}}{\delta}}\,.
\]
Applying this to $\R-\Res$ and $\R-\Rl$ respectively give the result.
\end{proof}

As pointed out earlier, there is a striking similarity between the
above bounds which seems to support the fact that for a stable
algorithm, the two estimators that we are considering have a
closely related behavior.

In the next section we will see how to use the exponential
inequality of Theorem \ref{th:mcd} to get better bounds.

\subsection{Exponential Bounds with Uniform Stability}
\cite{DevWag79a} first proved exponential bounds for $k$-local
algorithms. However, the question of whether their technique can
be extended to more general classes of algorithms is a topic for
further research.

In \cite{Dev_al96} another, more general technique is introduced
which relies on concentration inequalities. Inspired by this
approach, we will derive exponential bounds for algorithms based
on their uniform stability.

We will study separately the regression and the classification
cases for reasons that will be made clear.



\subsubsection{Regression Case}
A stable algorithm has the property that removing one element in
its learning set does not change much of its outcome. As a
consequence, the difference between empirical and generalization
error, if thought as a random variable, should have a small
variance. If its expectation is small, stable algorithms should
then be good candidates for their empirical error to be close to
their generalization error.
 This assertion is formulated in the
following theorem:

\begin{theorem}\label{th:fund}
Let $A$ be an algorithm with uniform stability $\beta$ with
respect to a loss function $\ell$ such that $0\leq \ell(\AS,z)
\leq M$, for all $z\in \Z$ and all sets $S$. Then, for any $m\geq
1$, and any $\delta\in(0,1)$, the following bounds hold
(separately) with probability at least $1-\delta$ over the random
draw of the sample $S$,
\begin{equation}\label{eq:remp}
\R\leq \Res + 2\beta + \left(4m\beta + M\right)\sqrt{\frac{\ln
1/\delta}{2m}}\,,
\end{equation}
and
\begin{equation}\label{eq:loo}
\R \leq \Rl + \beta + \left(4m\beta + M\right)\sqrt{\frac{\ln
1/\delta}{2m}}\,.
\end{equation}
\end{theorem}

\begin{remark}
This theorem gives tight bounds when the stability $\beta$ scales
as $1/m$. We will prove that this is the case for several known
algorithms in later sections.
\end{remark}

\begin{proof}
Let's prove that the conditions of Theorem \ref{th:mcd} are
verified by the random variables of interest. First we study how
these variables change when one training example is removed. We
have
\begin{equation}\label{eq:rri}
|\R-\Rri| \leq \Ez{\left|\ell(\AS,z)-\ell(\ASri,z)\right|} \leq
\beta\,,
\end{equation}
and
\begin{eqnarray*}
|\Res-\Reri| &\leq& \frac{1}{m}\sum_{j\neq i}\left|\ell(\AS,z_j) -
\ell(\ASri,z_j)\right|+\frac{1}{m}\left|\ell(\AS,z_i)\right|\\
&\leq& \beta+\frac{M}{m}\,.
\end{eqnarray*}
Then we upper bound the variation when one training example is
changed:
\[
|\R-\Rci|\leq |\R-\Rri|+|\Rri-\Rci|\leq 2\beta\,.
\]
Similarly we can write
\[
|\Res-\Reci|\leq |\Res-\Reri|+|\Reri-\Reci| \leq 2\beta
+2\frac{M}{m}\,.
\]
however, a closer look reveals that the second factor of $2$ is
not needed. Indeed, we have
\begin{eqnarray*}
|\Res-\Reci| &\leq& \frac{1}{m}\sum_{j\neq i}\left|\ell(\AS,z_j) -
\ell(\ASci,z_j)\right|+\frac{1}{m}\left|\ell(\AS,z_i) -
\ell(\ASci,z_i')\right|\\
&\leq& \frac{1}{m}\sum_{j\neq i}\left|\ell(\AS,z_j) -
\ell(\ASri,z_j)\right|+\frac{1}{m}\sum_{j\neq
i}\left|\ell(\ASri,z_j) -
\ell(\ASci,z_j)\right|\\
&&+\frac{1}{m}\left|\ell(\AS,z_i) -
\ell(\ASci,z_i')\right|\\
&\leq& 2\beta+\frac{M}{m}\,.
\end{eqnarray*}

Thus the random variable  $\R-\Res$ satisfies the conditions of
Theorem \ref{th:mcd} with $c_i=4\beta+\frac{M}{m}$.

It thus remains to bound the expectation of this random variable
which can be done using Lemma \ref{le:bias} and the
$\beta$-stability property:
\begin{eqnarray*}
\ES{\R-\Res} &\leq& \ESz{z_i'}{|\ell(\AS,z_i')-\ell(\ASci,z_i')|}\\
&\leq& \ESz{z_i'}{|\ell(\ASci,z_i')-\ell(\ASri,z_i')|} +
\ESz{z_i'}{|\ell(\ASri,z_i')-\ell(\AS,z_i')|}\\
&\leq& 2\beta\,.
\end{eqnarray*}
Which yields
\[
\PS{\R-\Res > \epsilon + 2\beta_m} \leq
\exp\left(-\frac{2m\epsilon^2}{(4m\beta_m + M)^2}\right)\,.
\]
Thus, setting the right hand side to $\delta$, we obtain that with
probability at least $1-\delta$,
\[
\R\leq \Res + 2\beta_m + (4m\beta_m +
M)\sqrt{\frac{\ln1/\delta}{2m}}\,,
\]
and thus
\[
\R\leq \Res +
2\beta_m\left(1+\sqrt{2m\ln1/\delta}\right)+M\sqrt{\frac{\ln1/\delta}{2m}}\,,
\]
which gives,
%, using the fact that $\delta\leq e^{-1}$, gives
(\ref{eq:remp})

For the leave-one-out error, we proceed similarly. We have
\begin{eqnarray*}
|\Rl-\Rlri| &\leq& \frac{1}{m}\sum_{j\neq i}\left|\ell(\ASrj,z_j)
-
\ell(\ASrij,z_j)\right|+\frac{1}{m}\left|\ell(\ASri,z_i)\right|\\
&\leq& \beta_{m-1}+\frac{M}{m}\,,
\end{eqnarray*}
and also
\[
|\Rl-\Rlci| \leq 2\beta_{m-1}+\frac{M}{m} \leq
2\beta_m+\frac{M}{m}\,.
\]
So that Theorem \ref{th:mcd} can be applied to $\R-\Rl$ with
$c_i=4\beta_m+\frac{M}{m}$. Then we use Lemma \ref{le:bias} along
with (\ref{eq:rri}) to deduce
\[
\PS{\R - \Rl > \epsilon + \beta_m} \leq
\exp\left(-\frac{2m\epsilon^2}{\left(m(4\beta_m) +
M\right)^2}\right)\,,
\]
which gives (\ref{eq:loo}) by setting the right hand side to
$\delta$ and using $\delta\leq e^{-1}$.
\end{proof}

Once again, we notice that the bounds for the empirical error and
for the leave-one-out error are very similar. As we will see in
later sections, this clearly indicates that our method is not at
all suited to the analysis of algorithms which simply perform the
minimization of the empirical error (which are not stable in the
sense defined above).

\subsubsection{Classification Case}\label{sec:classif_naive}
In this section we consider the case where $\Y=\{-1,1\}$ and the
algorithm $A$ returns a function $\AS$ that maps instances in $\X$
to labels in $\{-1,1\}$. The cost function is then simply
\[
c(\AS(x),y) = {\bf 1}_{\{y\AS(x)\leq 0\}}\,.
\]
Thus we see that because of the discrete nature of the cost
function, the Uniform Stability of an algorithm with respect to
such a cost function can only be $\beta=0$ or $\beta=1$. In the
first case, it means that the algorithm is always returning the
same function. In the second case there is no hope of obtaining
interesting bounds since we saw that we need
$\beta=O(\frac{1}{m})$ for our bounds to give interesting results.

%%%%%%%%%%%%%%%%%%%%%5 ICICICICICICI


We thus have to proceed in a different way. One possible approach
is to modify our error estimates so that they become ``smoother''
and have higher stability. The idea to smooth error estimators to
decrease their variance is not new and it has even been used in
conjunction with McDiarmid's inequality by \cite{LugPaw94} in
order to derive error bounds for certain algorithms. Lugosi and
Pawlak studied algorithms which produce estimates for the
distributions $P(X|Y=-1)$ and $P(X|Y=+1)$ and defined analogues of
the resubstitution and leave-one-out estimates of the error suited
to these algorithms.



Here we will take a related, though slightly different route.
Indeed, we will consider algorithm having a real-valued output.
However, we do not require this output to correspond to a
posterior probability but it should simply have the correct sign.
That is, the label predicted by such an algorithm is the sign of
its real-valued output. Of course, a good algorithm will produce
outputs whose absolute value somehow represents the confidence it
has in the prediction.



In order to apply the results obtained so far to this setting, we
need to introduce some definitions.

\begin{definition}
A {\em real-valued classification algorithm} $A$ is a learning
algorithm that maps training sets $S$ to functions $\AS:
\X\rightarrow \mathbb{R}$ such that the label predicted on an
instance $x$ is the sign of $\AS(x)$.
\end{definition}

This class of algorithm includes for instance the classifiers
produced by SVM or by ensemble methods such as boosting.



Notice that the cost function defined above extends to the case
where the first argument is a real number and have the desired
properties: it is zero when the algorithm does predict the right
label and 1 otherwise.



\begin{definition}[Classification Stability]
A real-valued classification algorithm $A$ has {\em classification
stability} $\beta$ if the following holds
\begin{equation}
\label{eq:cstab} \forall S \in \Z^{m},\;\forall i\in
\{1,\ldots,m\},\;\|\AS(.)-\ASri(.)\|_{\infty}  \leq \beta\,.
\end{equation}
\end{definition}



We introduce a modified cost function:
\[
c_\gamma(y,y') = \left\{\begin{array}{ll}
1 & \mbox{ for } yy'\leq 0\\
1-yy'/\gamma & \mbox{ for } 0\leq yy'\leq \gamma\\
0 & \mbox{ for } yy'\geq \gamma
\end{array}\right.
\]
and we denote
\[
\ell_\gamma(f,z) = c_\gamma(f(x),y)\,.
\]
Accordingly, we define the following error estimates
\[
\Reg(A,\Ss)=\frac{1}{m}\sum_{i=1}^m \ell_\gamma(\AS,z_i)\,,
\]
 and similarly,
\[
\Rlg(A,\Ss) = \frac{1}{m}\sum_{i=1}^m \ell_\gamma(\ASri,z_i)\,.
\]
The loss $\ell_\gamma$ will count an error each time the function
$f$ gives an output close to zero, the closeness being controlled
by $\gamma$.

\begin{lemma}
A real-valued classification algorithm $A$ with classification
stability $\beta$ has uniform stability $\beta/\gamma$ with
respect to the loss function $\ell_\gamma$.
\end{lemma}

\begin{proof}
It is easy to see that $c_\gamma$ is $1/\gamma$-Lipschitz with
respect to its first argument and so does $\ell_\gamma$ by
definition. Thus we have for all $i$, all training set $S$, and
all $z$,
\[
|l_\gamma(\AS,z)-l_\gamma(\ASri,z)|=|c_\gamma(\AS(x),y)-c_\gamma(\ASri(x),y)|
\leq \frac{1}{\gamma}|\AS(x)-\ASri(x)| \leq \beta/\gamma\,.
\]
\end{proof}
We can thus apply Theorem \ref{th:fund} with the loss function
$\ell_\gamma$ and get the following theorem.
\begin{theorem}\label{th:class}
Let $A$ be a real-valued classification algorithm with stability
$\beta$. Then, for all $\gamma>0$, any $m\geq 1$, and any
$\delta\in (0,1)$, with probability at least $1-\delta$ over the
random draw of the sample $S$,
\begin{equation}\label{eq:cremp}
\R\leq \Reg + 2\frac{\beta}{\gamma} + \left(4m\frac{\beta}{\gamma}
+ 1\right)\sqrt{\frac{\ln 1/\delta}{2m}}\,,
\end{equation}
and with probability at least $1-\delta$ over the random draw of
the sample $S$,
\begin{equation}
\R \leq \Rlg + \frac{\beta}{\gamma} + \left(4m\frac{\beta}{\gamma}
+ 1\right)\sqrt{\frac{\ln 1/\delta}{2m}}\,.
\end{equation}
\end{theorem}
\begin{proof}
We apply Theorem \ref{th:fund} to $A$ with the loss function
$\ell_\gamma$ which is bounded by $M=1$ and for which the
algorithm is $\beta/\gamma$-stable. Moreover, we use the fact that
$R(\AS) \leq R^\gamma =\Ez{l_\gamma(\AS,z)}$.
\end{proof}


In order to make this result more practically useful, we need a
statement that would hold uniformly for all values $\gamma$. The
same techniques as in \cite{Bar96} lead to the following result:
\begin{theorem}\label{th:class2}
Let $A$ be a real-valued classification algorithm with stability
$\beta$ and $B$ be some real number. Then, for any $m\geq 1$, and
any $\delta\in (0,1)$, with probability at least $1-\delta$ over
the random draw of the sample $S$,
\begin{equation}
\forall \gamma\in (0,B],\; \R\leq \Reg + 2\frac{e\beta}{\gamma} +
\left(\frac{4me\beta}{\gamma} + 1\right) \sqrt{\frac{1}{2m}}
\left(\sqrt{\ln1/\delta} + \sqrt{2\ln\ln\frac{eB}{\gamma}}\right)
\,,
\end{equation}
and
\begin{equation}
\forall \gamma\in (0,B],\; \R\leq \Rlg + \frac{e\beta}{\gamma}+
\left(\frac{4me\beta}{\gamma} + 1\right) \sqrt{\frac{1}{2m}}
\left(\sqrt{\ln1/\delta} + \sqrt{2\ln\ln\frac{eB}{\gamma}}\right)
\,,
\end{equation}
\end{theorem}
We defer the proof of this theorem to the appendix.


We can thus apply Theorem \ref{th:class2} with a value of $\gamma$
which is optimized after having seen the data.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% SECTION

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Stable Learning Algorithms}\label{sec:reg}
As seen in previous sections, our approach allowed to derive
bounds on the generalization error from the empirical and
leave-one-out errors which depend on the stability of the
algorithm. However, we noticed that the bounds we obtain for the
two estimators are very similar. This readily implies that the
method is suited to the study of algorithms for which the
empirical error is close to the leave-one-out error. There is thus
no hope to get good bounds for algorithms which simply minimize
the empirical error since their empirical error will be very much
optimistically biased compared to their leave-one-out error.



This means that, in order to be stable in the sense defined above,
a learning algorithm has to significantly depart from an empirical
risk minimizer. It thus has to accept a significant number of
training errors (which should however not be larger that the noise
level). In order to generalize, these extra training errors will
thus be compensated by a decrease of the complexity of the learned
function.


In some sense, this is exactly what regularization-based algorithm
do: they minimize an objective function which is the sum of an
empirical error term and a regularizing term which penalizes the
complexity of the solution. This explains why our approach is
particularly well suited for the analysis of such algorithms.



\subsection{Previous Results for $k$-Local Rules}
As an illustration of the various notions of stability, we will
first study the case of $k$-Local Rules for which a large number
of results were obtained.

A $k$-Local Rule is a classification algorithm that determines the
label of an instance $x$ based on the $k$ closest instances in the
training set. The simplest example of such a rule is the
$k$-Nearest Neighbors ($k$-NN) algorithm which computes the label
by a majority vote among the labels of the $k$ nearest instances
in the training set. Such an algorithm can be studied as a
$\{0,1\}$-valued classifier or as a $[0,1]$-valued classifier if
we take into account the result of the vote.

We will consider the real-valued version of the $k$-NN classifier
and give a result about its stability with respect to different
loss functions.
\begin{enumerate}
\item With respect to the $\{0,1\}$-loss function, the $k$-NN classifier has
hypothesis stability
\[
\beta \leq \frac{4}{m}\sqrt{\frac{k}{2\pi}}\,.
\]
This was proven in \cite{DevWag79a}. We will not reproduce the
proof which is quite technical but notice that a symmetry argument
readily gives
\[
\p{\AS(z)\neq \ASri(z)}\leq \frac{k}{m}\,.
\]
\item With respect to the absolute loss function ($c(y,y')=|y-y'|$), the
$k$-NN classifier has only a trivial uniform stability which is
the bound on the values of $y$.
\end{enumerate}
The polynomial bound that can be obtained from hypothesis stability
suggests that $k$ should be small if one wants a good bound. This is somehow
counter-intuitive since the decision seems more robust to noise when many
points are involved in the vote. There exist exponential bounds on the
leave-one-out estimate of $k$-NN for the $\{0,1\}$-loss obtained by
\cite{DevWag79a} and for the smoothed error estimate (i.e. with
respect to the absolute loss) obtained by \cite{LugPaw94}, and
these bounds do not depend on the parameter $k$ (due to a more
careful application of McDiarmid's inequality suited to the
algorithm). We may then wonder in that case whether the polynomial
bounds are interesting compared to exponential ones since the
latter are sharper and are closer to intuitive interpretation.
Despite this example, we believe that in general polynomial bounds
could give relevant hints about which feature of the learning
algorithm leads to good generalization.

In the remainder, we will consider several algorithms that have
not been studied from a stability perspective and we will focus on
their uniform stability only, which turns out to be quite good.
Obtaining results directly for their hypothesis stability remains
an open problem.

\subsection{Stability of Regularization Algorithms}
Uniform stability may appear as a strict condition. Actually, we
will see in this section that many existing learning methods
exhibit a uniform stability which is controlled by the
regularization parameter and can thus be very small.

\subsubsection{Stability for General Regularizers}
Recall that $\ell(f,z) = c(f(x),y)$. We assume in this section
that $\F$ is a convex subset of a linear space.
\begin{definition}
A loss function $\ell$ defined on $\F\times \Y$ is
$\sigma$-admissible with respect to $\F$ if the associated cost
function $c$ is convex with respect to its first argument and the
following condition holds
\[
\forall y_1,y_2\in \D,\;\forall y'\in\Y, |c(y_1,y')-c(y_2,y')|\leq
\sigma|y_1-y_2|\,,
\]
where $\D = \{y: \exists f\in\F, \exists x\in\X, f(x)=y\}$ is the
domain of the first argument of $c$.
\end{definition}
Thus in the case of the quadratic loss for example, this condition
is verified if $\Y$ is bounded and $\F$ is totally bounded, that is there
exists $M<\infty$ such that
\[
\forall f\in\F,\, \|f\|_{\infty}\leq M\,\mbox{ and }\,\forall
y\in\Y,\,|y|\leq M\,.
\]
We introduce the objective function that the algorithm will
minimize: let $N:\F \rightarrow \R_+$ be a function on $\F$,
\begin{equation}\label{eq:Rra}
\Rr(g) := \frac{1}{m}\sum_{j=1}^m \ell(g,z_j) + \lambda N(g)\,,
\end{equation}
and a modified version (based on a truncated training set),
\begin{equation}\label{eq:Rrria}
\Rrri(g) := \frac{1}{m}\sum_{j\neq i} \ell(g,z_j) + \lambda
N(g)\,.
\end{equation}
Depending on the algorithm $N$ will take different forms. To
derive stability bounds, we need some general results about the
minimizers of (\ref{eq:Rra}) and (\ref{eq:Rrria}).

\begin{lemma}\label{le:main}
Let $\ell$ be $\sigma$-admissible with respect to $\F$, and $N$ a
functional defined on $\F$ such that for all training sets $S$,
$\Rr$ and $\Rrri$ have a minimum (not necessarily unique) in $\F$.
Let $\f$ denote a minimizer in $\F$ of $\Rr$, and for
$i=1,\ldots,m$, let $\fri$ denote a minimizer in $\F$ of $\Rrri$.
We have for any $t\in [0,1]$,
\begin{equation}
\label{eq:main} N(\f) - N(\f+t\Delta f) + N(\fri) - N(\fri-t\Delta
f) \leq \frac{t\sigma}{\lambda m}|\Delta f(x_i)|\,,
\end{equation}
where $\Delta f = \fri - \f$.
\end{lemma}
\begin{proof}

Let us introduce the notation
\[
\Reri(f) := \frac{1}{m}\sum_{j\neq i} \ell(f,z_j)\,.
\]
Recall that a convex function $g$ verifies:
\[
\forall x,y,\,\forall t\in [0,1]\ \ \  g(x+t(y-x))-g(x)\leq
t(g(y)-g(x))\,.
\]
Since $c$ is convex, $\Reri$ is convex too and thus, $\forall t
\in [0,1]$
\[
\Reri(\f + t\Delta f) - \Reri(\f) \leq t(\Reri(\fri)-\Reri(\f))\,.
\]
We can also get (switching the role of $\f$ and $\fri$):
\[
\Reri(\fri - t\Delta f) - \Reri(\fri) \leq
t(\Reri(\f)-\Reri(\fri))\,.
\]
Summing the two preceding inequalities yields
\begin{equation}
\label{eq:Remp} \Reri(\f + t\Delta f) - \Reri(\f) + \Reri(\fri -
t\Delta f) - \Reri(\fri) \leq 0\,.
\end{equation}
Now, by assumption we have
\begin{eqnarray}
\Rr(\f)-\Rr(\f+t\Delta f) & \leq  & 0 \label{eq:opt1}\\
\Rrri(\fri)-\Rrri(\fri-t\Delta f) & \leq & 0 \label{eq:opt2}\,,
\end{eqnarray}
so that, summing the two previous inequalities and using
(\ref{eq:Remp}), we get
\[
c(\f(x_i),y_i)-c((\f+t\Delta f)(x_i),y_i)+ m\lambda \left(N(\f) -
N(\f+t\Delta f) +N(\fri) - N(\fri-t\Delta f)\right) \leq 0\,,
\]
and thus, by the $\sigma$-admissibility condition, we get
\[
N(\f) - N(\f+t\Delta f) +N(\fri) - N(\fri-t\Delta f) \leq
\frac{t\sigma}{\lambda m}|\Delta f (x_i)|\,.
\]
\end{proof}
In the above lemma, there is no assumption about the space $\F$
(apart from being a convex linear space) and the regularizer $N$ apart from
the existence of minima for $\Rr$ and $\Rrri$. However, most of the
practical regularization-based algorithms work with a space $\F$ that is a
vector space and with a convex regularizer. We will thus refine
our previous result in this particular setting. In order to do
this, we need some standard definitions about convex functions
which we deferred to Appendix C where most of the material can be
found in \cite{Rockafellar70} and in \cite{Gordon99}.
\begin{lemma}\label{le:N}
Under the conditions of Lemma \ref{le:main}, when $\F$ is a vector
space and $N$ is a proper closed convex function from $\F$ to
$\R\cup\{-\infty,+\infty\}$, we have
\[
d_N(\f,\fri)+d_N(\fri,\f) \leq \frac{1}{\lambda m}
\left(\ell(\fri,z_i) - \ell(\f,z_i) -
d_{\ell(.,z_i)}(\fri,\f)\right) \leq \frac{\sigma}{\lambda m}
|\Delta f (x_i)|\,,
\]
when $N$ and $\ell$ are differentiable\iffalse and
\[
\bar{d}_N(\f,\vz)+\bar{d}_N(\fri,\vz) \leq \frac{1}{\lambda m}
\left(\ell(\fri,z_i) - \ell(\f,z_i) -
\bar{d}_{\ell(.,z_i)}(\fri,\vz)\right) \leq \frac{\sigma}{\lambda
m} |\Delta f (x_i)|\,,
\]
otherwise\fi.
\end{lemma}
\begin{proof}
We start with the differentiable case and work with regular
divergences. By definition of $\f$ and $\fri$, we have, using
(\ref{eq:div}),
\[
d_{\Rr}(\fri,\f) + d_{\Rrri}(\f,\fri) = \Rr(\fri)-\Rr(\f) +
\Rrri(\f) - \Rrri(\fri) = \frac{1}{m}\ell(\fri,z_i) -
\frac{1}{m}\ell(\f,z_i)\,.
\]
Moreover, by the nonnegativity of divergences, we have
\[
d_{\Reri}(\f,\fri) + d_{\Reri}(\fri,\f)\geq 0\,,
\]
which, with the previous equality and the fact that
$d_{A+B}=d_A+d_B$, gives
\[
\lambda d_{N}(\f,\fri) + \lambda d_{N}(\fri,\f) \leq
\frac{1}{m}\left(\ell(\fri,z_i) - \ell(\f,z_i) -
d_{\ell(.,z_i)}(\fri,\f)\right)\,,
\]
and we obtain the first part of the result. For the second part,
we notice that
\[
\ell(\fri,z_i) - \ell(\f,z_i) - d_{\ell(.,z_i)}(\fri,\f) \leq
\ell(\fri,z_i)-\ell(\f,z_i)\,,
\]
by the nonnegativity of the divergence and thus
\[
\ell(\fri,z_i) - \ell(\f,z_i) - d_{\ell(.,z_i)}(\fri,\f)\leq
\sigma|\fri(x_i)-\f(x_i)|\,,
\]
by the $\sigma$-admissibility condition.
\iffalse
The proof for the
generalized divergences follows the same lines, we write
respectively
\[
\bar{d}_{\Rr}(\fri,\vz) + \bar{d}_{\Rrri}(\f,\vz) =
\Rr(\fri)-\Rr(\f) + \Rrri(\f) - \Rrri(\fri) =
\frac{1}{m}\ell(\fri,z_i) - \frac{1}{m}\ell(\f,z_i)\,,
\]
and
\[
\bar{d}_{\Reri}(\f,\vz) + \bar{d}_{\Reri}(\fri,\vz)\geq 0\,,
\]
yielding
\[
\lambda \bar{d}_{N}(\f,\vz) + \lambda \bar{d}_{N}(\fri,\vz) \leq
\frac{1}{m}\left(\ell(\fri,z_i) - \ell(\f,z_i) -
\bar{d}_{\ell(.,z_i)}(\fri,\vz)\right)\,,
\]
and we obtain the first part of the result. For the second part,
we notice that
\[
\ell(\fri,z_i) - \ell(\f,z_i) - \bar{d}_{\ell(.,z_i)}(\fri,\vz)
\leq \ell(\fri,z_i)-\ell(\f,z_i)\,,
\]
by the nonnegativity of the divergence and thus
\[
\ell(\fri,z_i) - \ell(\f,z_i) -
\bar{d}_{\ell(.,z_i)}(\fri,\vz)\leq \sigma|\fri(x_i)-\f(x_i)|\,.
\]
\fi
\end{proof}


The results in this section can be used to derive bounds on the
stability of many learning algorithms. Each procedure that can be
interpreted as the minimization of a regularized functional can be
analyzed with these lemmas. The only thing that will change from
one procedure to another is the regularizer $N$ and the cost
function $c$. In the following, we show how to apply these
theorems to different learning algorithms.

\iffalse
\subsubsection{General Method for Obtaining Stability Bounds}
We consider here cases where Lemma \ref{le:N} applies since all
the algorithms that we will deal with fit in this setting.

Recall that our goal is to bound
\[
\|\ell(\f,.)-\ell(\fri,.)\|_\infty\,.
\]
When $\ell$ is $\sigma$-admissible, it is thus enough to bound
\[
\|\f-\fri\|_\infty\,.
\]
In order to do so, we will use Lemma \ref{le:N} and find upper and
lower bounds in terms of an appropriate norm $\|.\|$ which upper
bounds the $L_\infty$ norm. Typically, we will get results of this
kind
\[
\|\f-\fri\|^2 \leq d_N(\f,\fri) + d_N(\fri,\f) \leq
\frac{\sigma}{\lambda m} |(\f-\fri)(x_i)| \leq \frac{c
\sigma}{\lambda m}\|\f-\fri\|\,,
\]
for some constant $c$, from which we will deduce
\[
\|\f-\fri\|\leq \frac{c \sigma}{\lambda m}\,,
\]
and then
\[
\|\ell(\f,.)-\ell(\fri,.)\|_\infty \leq \sigma\|\f-\fri\|_\infty
\leq \sigma \|\f-\fri\| \leq \frac{c \sigma^2}{\lambda m}\,.
\]
\fi
\subsubsection{Application to Regularization in Hilbert Spaces}
Many algorithms such as Support Vector Machines (SVM) or classical
regularization networks introduced by \cite{Pog_Gir90} perform the
minimization of a regularized objective function where the
regularizer is a norm in a reproducing kernel Hilbert space
(RKHS):
\[
N(f) = \|f\|^2_k\,,
\]
where $k$ refers to the kernel (see e.g. Wahba, 2000, or Evgeniou
{et al.}, 1999, for definitions). The fundamental property of a
RKHS $\F$ is the so-called reproducing property which writes

\[
\forall f\in \F,\ \forall x\in\X,\ f(x)=\ip{f,k(x,.)}\,.
\]
In particular this gives by Cauchy-Schwarz inequality
\begin{equation}\label{eq:rep}
\forall f\in \F,\ \forall x\in\X,\ |f(x)|\leq \|f\|_k \sqrt{k(x,x)}\,.
\end{equation}
We now state a result about the uniform stability of RKHS
learning.
\begin{theorem}\label{th:rkhs}
Let $\F$ be a reproducing kernel Hilbert space with kernel $k$
such that $\forall x\in\X,\; k(x,x)\leq \kappa^2<\infty$. Let $\ell$
be $\sigma$-admissible with respect to $\F$. The learning
algorithm $A$ defined by
\begin{equation}\label{eq:reg}
\AS = \arg\min_{g\in\F} \frac{1}{m}\sum_{i=1}^m \ell(g,z_i) +
\lambda \|g\|_k^2\,,
\end{equation}
has uniform stability $\beta$ with respect to $\ell$ with
\[
\beta \leq \frac{\sigma^2 \kappa^2}{2\lambda m}\,.
\]
\end{theorem}
\begin{proof}
We use the proof technique described in previous section. It can
be easily checked that when $N(.)=\|.\|_k^2$ we have
\[
d_N(g,g') = \|g-g'\|_k^2\,.
\]
Thus, Lemma \ref{le:main} gives
\[
2\|\Delta f\|_k^2 \leq \frac{\sigma}{\lambda m} |\Delta f(x_i)|\,.
\]
Using (\ref{eq:rep}), we get
\[
|\Delta f(x_i)| \leq \|\Delta f\|_k \sqrt{k(x_i,x_i)} \leq \kappa\|\Delta
f\|_k\,,
\]
so that
\[
\|\Delta f\|_k \leq \frac{\kappa \sigma}{2\lambda m}\,.
\]
Now we have, by the $\sigma$-admissibility of $\ell$
\[
|\ell(\f,z)-\ell(\fri,z)|\leq \sigma|\f(x)-\fri(x)| =
\sigma|\Delta f(x)|\,,
\]
which, using (\ref{eq:rep}) again, gives the result.
\end{proof}
We are now one step away from being able to apply Theorem
\ref{th:fund}. The only thing that we need is to bound the loss
function. Indeed, the $\sigma$-admissibility condition does not
ensure the boundedness. However, since we are in a RKHS, we can
use the following simple lemma which ensures that if we have an a
priori bound on the target values $y$, then the boundedness
condition is satisfied.
\begin{lemma}\label{le:bound}
Let $A$ be the algorithm of Theorem \ref{th:rkhs} where $\ell$ is
a loss function associated to a convex cost function $c(.,.)$. We
denote by $B(.)$ a positive non-decreasing real-valued function
such that for all $y\in\D$.
\[
\forall y'\in\Y,\, c(y,y')\leq B(y)
\]
For any training set $S$, we have
\[
\|f\|_k^2 \leq \frac{B(0)}{\lambda}\,,
\]
and also
\[
\forall z\in\Z,\, 0\leq\ell(\AS, z)\leq
B\left(\kappa\sqrt{\frac{B(0)}{\lambda}}\right)\,.
\]
Moreover, $\ell$ is $\sigma$-admissible where $\sigma$ can be
taken as
\[
\sigma = \sup_{y'\in \Y} \sup_{|y|\leq
B\left(\kappa\sqrt{\frac{B(0)}{\lambda}}\right)}
\left|\frac{\partial c}{\partial y}(y,y')\right|\,.
\]
\end{lemma}
\begin{proof}
We have for $f=\AS$,
\[
\Rr(f)\leq \Rr(\vz) = \frac{1}{m}\sum_{i=1}^m \ell(\vz,z_i) \leq
B(0)\,,
\]
and also $\Rr(f)\geq \lambda\|f\|_k^2$ which gives the first
inequality. The second inequality follows from (\ref{eq:rep}). The
last one is a consequence of the definition of
$\sigma$-admissibility.
\end{proof}



\begin{example}[Stability of bounded SVM regression]
Assume $k$ is a bounded kernel, that is $k(x,x)\leq \kappa^2$ and
$\Y=[0,B]$. Consider the loss function
\[
\ell(f,z) = |f(x)-y|_{\epsilon} = \left\{\begin{array}{ll}
0 &\mbox{ if } |f(x)-y|\leq \epsilon\\
|f(x)-y|-\epsilon & \mbox{ otherwise}
\end{array}\right.
\]
This function is $1$-admissible and we can state $B(y)= B$. The
SVM algorithm for regression with a kernel $k$ can be defined as
\[
\AS= \arg\min_{g\in\F} \frac{1}{m}\sum_{i=1}^m \ell(g,z_i) +
\lambda \|g\|_k^2\,,
\]
and we thus get the following stability bound
\[
\beta \leq \frac{\kappa^2}{2\lambda m}\,.
\]
Moreover, by Lemma \ref{le:bound} we have
\[
\forall z\in \Z,\, 0\leq \ell(\AS,z)\leq
\kappa\sqrt{\frac{B}{\lambda}}\,
\]
Plugging the above into Theorem \ref{th:fund} gives the following
bound
\[
\R\leq \Res + \frac{\kappa^2}{\lambda m} +
\left(\frac{2\kappa^2}{\lambda} +
\kappa\sqrt{\frac{B}{\lambda}}\right)\sqrt{\frac{\ln1/\delta}{2m}}\,.
\]
Note that we consider here SVM without the bias $b$, which is
strictly speaking different from the true definition of SVM. The
question whether $b$ can be included in such a setting remains
open.
\end{example}


\begin{example}[Stability of soft margin SVM classification]
We have $\Y=\{-1,1\}$. We consider the following loss function
\[
\ell(f,z) = (1-y f(x))_+ = \left\{\begin{array}{ll} 1-y f(x) &
\mbox{ if } 1-y f(x) \geq 0 \\ 0 & \mbox{ otherwise}
\end{array}\right.
\]
which is $1$-admissible. From Lemma \ref{le:main}, we deduce that
the real-valued classification obtained by the SVM optimization
procedure has classification stability $\beta$ with
\[
\beta \leq \frac{\kappa^2}{2\lambda m}\,.
\]
We use Theorem \ref{th:class} with $\gamma = 1$ and thus get
\[
\R\leq \Res^1 + \frac{\kappa^2}{\lambda m} +
\left(1+\frac{2\kappa^2}{\lambda}\right)\sqrt{\frac{\ln
1/\delta}{2m}}\,,
\]
where $\Res^1$ is the clipped error. It can be seen that
$\Res^1\leq \frac{1}{m}\sum_{i=1}^m \ell(f,z_i) =
\frac{1}{m}\sum_{i=1}^m \xi_i$, where the $\xi$ are the Lagrange
multipliers that appear in the dual formulation of the soft-margin
SVM.

Note that the same remark as in the previous example holds here:
there is no bias $b$ in the definition of the SVM.
\end{example}


\bigskip



\begin{example}[Stability of Regularized Least Squares Regression]\label{ex:rr}
Again we will consider the bounded case $\Y=[0,B]$. The
regularized least squares regression algorithm is defined by
\[
\AS= \arg\min_{g\in\F} \frac{1}{m}\sum_{i=1}^m \ell(g,z_i) +
\lambda \|g\|_k^2\,,
\]
where $\ell(f,z) = (f(x)-y)^2$. We can state $B(y)= B^2$ so that
$\ell$ is $2B$-admissible by Lemma \ref{le:bound}. Also we have
\[
\forall z\in \Z,\, 0\leq \ell(\AS,z)\leq
\kappa\sqrt{\frac{B}{\lambda}}\,.
\]
The stability bound for this algorithm is thus
\[
\beta\leq\frac{2\kappa^2B^2}{\lambda m}
\]
so that we have the generalization error bound
\[
\R\leq \Res + \frac{4\kappa^2 B^2}{\lambda m} +
\left(\frac{8\kappa^2 B^2}{\lambda} +
2B\right)\sqrt{\frac{\ln1/\delta}{2m}}\,.
\]
\end{example}


\subsubsection{Regularization by the Relative Entropy}
In this section we consider algorithms that build a mixture or a
weighted combination of base hypotheses.



Let's consider a set ${\cal H}$ of functions $h:\X\rightarrow \Y$
parameterized by some parameter $\theta$:
\[
{\cal H}=\{h_\theta: \theta\in \Theta\}\,.
\]
This set is the base class from which the learning algorithm will
form mixtures by averaging the predictions of base hypotheses.
More precisely, we assume that $\Theta$ is a measurable space
where a reference measure is defined. The output of our algorithm
is a mixture of element from $\Theta$, in other words, it is a
probability distribution over $\Theta$. We will thus choose $\F$
as the set of all such probability distributions (dominated by the
reference measure), defined by their density with respect to the
reference measure.


Once an element $f\in\F$ is chosen by the algorithm, the
predictions are computed as follows
\[
\hat{y}(x) = \int_\Theta h_\theta(x) f(\theta) d\theta\,,
\]
which means that the prediction produced by the algorithm is
indeed a weighted combination of the predictions of the base
hypotheses, weighted by the density $f$. In Bayesian terms, $\AS$
would be a posterior on $\Theta$ computed from the observation of
$S$ and $\hat{y}(x)$ is the corresponding Bayes prediction.


By some abuse of notation, we will denote by $\AS$ both the
element $f\in\F$ that is used by the algorithm to weigh the base
hypotheses (which can be considered as a function
$\Theta\rightarrow \R$) and the prediction function $x\in\X\mapsto
\hat{y}(x)$.


Now we need to define a loss function on $\F\times \Z$. This can
be done by extending a loss function $r$ defined on $\h\times \Z$
with associated cost function $s$ ($r(h,z)=s(h(x),y)$). There are
two ways of deriving a loss function on $\F$. We can simply use
$s$ to compute the discrepancy between the predicted and true
labels
\begin{equation}\label{eq:l1}
\ell(g,z) = s(\hat{y}(x),y)\,,
\end{equation}
or we can average the loss over $\Theta$,
\begin{equation}\label{eq:l2}
\ell(g,z) = \int_\Theta r(h_\theta,z) g(\theta) d\theta\,.
\end{equation}
The first loss is the one used when one is doing Bayesian
averaging of hypotheses. The second loss corresponds to the
expected loss of a randomized algorithm that would sample
$h\in{\cal H}$ according to the posterior $\AS$ to perform the
predictions.


In the remainder, we will focus on the second type of loss since
it is easier to analyze. Note however, that this loss will be used
only to define a regularization algorithm and that the loss that
is used to measure its error may be different.



Our goal is to choose the posterior $f$ via the minimization of a
regularized objective function. We choose some fixed density $f_0$
and define the regularizer as
\[
N(g) = K(g,f_0) = \int_\Theta
g(\theta)\ln\frac{g(\theta)}{f_0(\theta)}d\theta\,,
\]
$K$ being the Kullback-Leibler divergence or the relative entropy.
In Bayesian terms, $f_0$ would be our {\em prior}. Now, the goal
is to minimize the following objective function
\[
\Rr(g) = \frac{1}{m}\sum_{i=1}^m \ell(g,z) + \lambda K(g,f_0)\,,
\]
where $\ell$ is given by (\ref{eq:l2}). We can interpret the
minimization of this objective function as the computation of the
Maximum A Posteriori (MAP) estimate.

Let's analyze this algorithm. We will assume that we know a bound
$M$ on the loss $r(h_\theta,z)$. First, notice that $\ell$ is
linear in $g$ and is thus convex and $M$-Lipschitz with respect to
the $L_1$ norm
\[
\left|\ell(g,z) - \ell(g',z) \right| \leq M \int_\Theta |g(\theta)
- g'(\theta)|d\theta\,.
\]
Thus $\ell$ is $M$-admissible with respect to $\F$.

We can now state the following result on the uniform stability of
the algorithm defined above.
\begin{theorem}\label{th:med}
Let $\F$ defined as above and let $r$ be any loss function defined
on $\h\times \Z$, bounded by $M$. Let $f_0$ be a fixed member of
$\F$. When $\ell$ is defined by (\ref{eq:l2}), the learning
algorithm $A$ defined by

\begin{equation}
\AS = \arg\min_{g\in\F} \frac{1}{m}\sum_{i=1}^m \ell(g,z_i) +
\lambda K(g,f_0)\,,
\end{equation}
has uniform stability $\beta$ with respect to $\ell$ with
\[
\beta \leq \frac{M^2}{\lambda m}\,.
\]
\end{theorem}
\begin{proof}
Recall the following property of the relative entropy (see e.g.
Cover and Thomas 1991), for any $g,g'$,
\[
\frac{1}{2} \left(\int_\Theta |g(\theta)-g'(\theta)|d\theta
\right)^2 \leq K(g,g')\,.
\]
Moreover, the Bregman divergence associated to the relative
entropy to $f_0$ is
\[
d_{K(.,f_0)}(g,g')=K(g,g')\,.
\]
We saw that $\ell$ is $M$-admissible thus, by Lemma \ref{le:N} we
get
\[
\left(\int_\Theta |\f(\theta)-\fri(\theta)|d\theta \right)^2 \leq
\frac{M}{\lambda m}\int_\Theta |\f(\theta)-\fri(\theta)|d\theta\,,
\]
hence
\[
\int_\Theta |\f(\theta)-\fri(\theta)|d\theta \leq \frac{M}{\lambda
m}\,,
\]
and thus, using again the $M$-admissibility of $\ell$, we get for
all $z\in\Z$,
\[
|\ell(\f,z) - \ell(\fri,z)| \leq \frac{M^2}{\lambda m}\,,
\]
which concludes the proof.
\end{proof}
Now, let's consider the case of classification where
$\Y=\{-1,1\}$. If we use base hypotheses $h_\theta$ that return
values in $\{-1,1\}$, it is easy to see from the proof of the
above theorem that algorithm $A$ has classification stability
$\beta\leq \frac{M}{\lambda m}$. Indeed, we have
\[
|\AS(x)-\ASri(x)|=\left|\int_\Theta
h_\theta(x)(\AS(\theta)-\ASri(\theta))d\theta \right|\leq
\int_\Theta |\AS(\theta)-\ASri(\theta)|d\theta\leq
\frac{M}{\lambda m}\,,
\]
where the last inequality is derived in the proof of Theorem
\ref{th:med}.

\begin{example}[Maximum Entropy Discrimination]
\cite{Jaa_al99} introduce the Minimum Relative Entropy (MRE)
algorithm which is a real-valued classifier obtained by minimizing
\[
\Rr(g) = \frac{1}{m}\sum_{i=1}^m \ell(g,z) + \lambda K(g,f_0)\,,
\]
where the base class has two parameters ${\cal H} =
\{h_{\theta,\gamma}:\theta\in\Theta, \gamma\in\R\}$ (with
$h_{\theta,\gamma}=h_\theta$) and the loss is defined by
\[
\ell(g,z) = \left(\int_{\Theta,\R} (\gamma - y
h_{\theta}(x))g(\theta)d\theta d\gamma\right)_+\,. \]
 If we have a
bound $B$ on the quantity $\gamma - y h_{\theta}(x)$, we see that
this loss function is $B$-admissible and thus by Theorem
\ref{th:med} (and the remark about the classification stability)
we deduce that the MRE algorithm has classification stability
$\beta$ bounded by
\[
\beta \leq \frac{B}{\lambda m}
\]
\end{example}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% SECTION

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Discussion}

For regularization algorithms, we obtained bounds on the uniform
stability of the order of $\beta=O(\frac{1}{\lambda m})$. Plugging
this result into our main theorem, we obtained bounds on the
generalization error of the following type
\[
\R\leq \Res + O\left(\frac{1}{\lambda \sqrt{m}}\right)\,,
\]
so that we obtain non trivial results only if we can guarantee
that $\lambda>>\frac{1}{\sqrt{m}}$. This is likely to depend on
the noise in the data and no theoretical results exist that
guarantee that $\lambda$ does not decrease too fast when $m$ is
increased.


However, it should be possible to refine our results which used
sometimes quite crude bounds. It seems reasonable that a bound
like
\[
\R\leq \Res + O\left(\frac{1}{\sqrt{\lambda m}}\right)\,,
\]
could be possible to obtain. This remains an open problem.


In order to better understand the distinctive feature of our
bounds, we can compare them to bounds from Structural Risk
Minimization (SRM) for example on the SVM algorithm. The SVM
algorithm can be presented using the two equivalent formulations
\[
\min_{f\in\F} \frac{1}{m}\sum_{i=1}^m (1-y_i f(x_i))_+ + \lambda
\|f\|^2\,,
\]
or
\[
\min_{f\in\F} \frac{1}{m}\sum_{i=1}^m (1-y_i f(x_i))_+ \mbox{ with
} \|f\|^2\leq \nu\,,
\]
The equivalence of those two problems comes from the fact that for
any $\lambda$, there exists a $\nu$ such that the solution of the
two problems are the same.


The SRM principle consists in solving the second problem for
several values of $\nu$ and then choosing the value that minimizes
a bound that depends on the VC-dimension of the set
$\{f:\|f\|^2\leq \nu\}$. However, this quantity is usually not
easy to compute and only loose upper bounds can be found.
Moreover, since minimization under a constraint on the norm is not
easy to perform, one typically performs the first minimization for
a particular value of $\lambda$ (chosen by cross-validation) and
then uses SRM bounds with $\nu=\|f\|^2$. This requires the SRM
bounds to hold uniformly for all values of $\nu$.


This approach has led to bound which were quite predictive of the
behavior but that were quantitatively very loose.


In contrast, our approach directly focuses on the actual
minimization that is performed (the first one) and does not
require the computation of a complexity measure. Indeed, the
complexity is implicitly evaluated by the actual parameter
$\lambda$.

\section{Conclusion}
We explored the possibility of obtaining generalization bounds for
specific algorithms from stability properties. We introduced
several notions of stability and obtained corresponding
generalization bounds with either the empirical error or the
leave-one-out error. Our main result is an exponential bound for
algorithms that have good uniform stability. We then proved that
regularization algorithms have such a property and that their
stability is controlled by the regularization parameter $\lambda$.
This allowed us to obtained bounds on the generalization error of
Support Vector Machines both in the classification and in the
regression framework that do not depend on the implicit
VC-dimension but rather depend explicitly on the tradeoff
parameter $C$.

Further directions of research include the question of obtaining
better bounds via uniform stability and the use of less
restrictive notions of stability. Of great practical interest would be to
design algorithms that maximize their own stability.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% SECTION

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section*{Acknowledgements}

The authors wish to thank Ryan Rifkin and Ralf Herbrich for
fruitful comments that helped improved the readability and Alex
Smola, G\'abor Lugosi, St\'{e}phane Boucheron and Sayan Mukherjee
for stimulating discussions.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% SECTION

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section*{Appendix A. Proof of Lemma \ref{le:dev}}
Let's start with a generalized version of a lemma from
\cite{RogWag78}.
\begin{lemma}\label{le:rogwag}
For any learning algorithm $A$, any $i,j\in \{1,\ldots,m\}$ such
that $i\neq j$, we have
\begin{eqnarray*}
\ES{(\R-\Res)^2} &\leq& \ESz{z,z'}{\ell(\AS,z)\ell(\AS,z')} -
2\ESz{z}{\ell(\AS,z)\ell(\AS,z_i)}\\
&& + \ES{\ell(\ASri,z_i)\ell(\ASrj,z_j)} +
\frac{M}{m}\ES{\ell(\AS,z_i)}\\&&-
\frac{1}{m}\ES{\ell(\AS,z_i)\ell(\AS,z_j)}\,,
\end{eqnarray*}
and
\begin{eqnarray*}
\ES{(\R-\Rl)^2} &\leq& \ESz{z,z'}{\ell(\AS,z)\ell(\AS,z')} -
2\ESz{z}{\ell(\AS,z)\ell(\ASri,z_i)}\\
&& + \ES{\ell(\ASri,z_i)\ell(\ASrj,z_j)} +
\frac{M}{m}\ES{\Rri}\\&& -
\frac{1}{m}\ES{\ell(\ASri,z_i)\ell(\ASrj,z_j)}\,,
\end{eqnarray*}
\end{lemma}
\begin{proof}
We have
\begin{eqnarray*}
\ES{\R^2} &=& \ES{\Ez{\ell(\AS,z)}^2}\\
&=& \ES{\Ez{\ell(\AS,z)} \ee{z'}{\ell(\AS,z')}}\\
&=& \ES{\ee{z,z'}{\ell(\AS,z) \ell(\AS,z')}}\,,
\end{eqnarray*}
and also
\begin{eqnarray*}
\ES{\R\Res} &=& \ES{\R \frac{1}{m}\sum_{i=1}^m \ell(\AS,z_i)}\\
&=& \frac{1}{m} \sum_{i=1}^m \ES{\R \ell(\AS,z_i)}\\
&=& \frac{1}{m} \sum_{i=1}^m \ESz{z}{\ell(\AS,z) \ell(\AS,z_i)}\\
&=& \ESz{z}{\ell(\AS,z)\ell(\AS,z_i)}\,,
\end{eqnarray*}
and also
\begin{eqnarray*}
\ES{\R\Rl} &=& \ES{\R \frac{1}{m}\sum_{i=1}^m \ell(\ASri,z_i)}\\
&=& \frac{1}{m} \sum_{i=1}^m \ES{\R \ell(\ASri,z_i)}\\
&=& \frac{1}{m} \sum_{i=1}^m \ESz{z}{\ell(\AS,z) \ell(\ASri,z_i)}\\
&=& \ESz{z}{\ell(\AS,z)\ell(\ASri,z_i)}\,,
\end{eqnarray*}
for any fixed $i$ by symmetry. Also we have
\begin{eqnarray*}
\ES{\Res^2} &=& \frac{1}{m^2} \sum_{i=1}^m \ES{\ell(\AS,z_i)^2} +
\frac{1}{m^2}\sum_{i\neq j} \ES{\ell(\AS,z_i)\ell(\AS,z_j)}\\
&\leq& \frac{M}{m}\ES{\frac{1}{m}\sum_{i=1}^m \ell(\AS,z_i)} +
\frac{m-1}{m}\ES{\ell(\AS,z_i)\ell(\AS,z_j)}\\
&=& \frac{M}{m}\ES{\ell(\AS,z_i)} +
\frac{m-1}{m}\ES{\ell(\AS,z_i)\ell(\AS,z_j)}\,,
\end{eqnarray*}
and
\begin{eqnarray*}
\ES{\Rl^2} &=& \frac{1}{m^2} \sum_{i=1}^m \ES{\ell(\ASri,z_i)^2} +
\frac{1}{m^2}\sum_{i\neq j} \ES{\ell(\ASri,z_i)\ell(\ASrj,z_j)}\\
&\leq& \frac{M}{m}\ES{\frac{1}{m}\sum_{i=1}^m \ell(\ASri,z_i)} +
\frac{m-1}{m}\ES{\ell(\ASri,z_i)\ell(\ASrj,z_j)}\\
&=& \frac{M}{m}\ES{\Rri} +
\frac{m-1}{m}\ES{\ell(\ASri,z_i)\ell(\ASrj,z_j)}\,.
\end{eqnarray*}
which concludes the proof.
\end{proof}
Now let's prove Lemma \ref{le:dev}. We will use several times the
fact that the random variables are i.i.d. and we can thus
interchange them without modifying the expectation (it is just a
matter of renaming them). We introduce the notation $T=\Srij$ and
we will denote by $A_{T,z,z'}$ the result of training on the set
$T\cup{z,z'}$.


Let's first formulate the first inequality of Lemma
(\ref{le:rogwag}) as
\begin{eqnarray*}
\ES{(\R-\Res)^2} &\leq& \frac{1}{m}\ES{\ell(\AS,z_i)\left(M -
\ell(\AS,z_j)\right)}\\
&& + \ESz{z,z'}{\ell(\AS,z)\ell(\AS,z') -
\ell(\AS,z)\ell(\AS,z_i)}\\
&& + \ESz{z,z'}{\ell(\AS,z_i)\ell(\AS,z_j) -
\ell(\AS,z)\ell(\AS,z_i)}\\
&& = I_1 + I_2 + I_3\,.
\end{eqnarray*}
Using Schwarz's inequality we have
\begin{eqnarray*}
\ES{\ell(\AS,z_i)\left(M - \ell(\AS,z_j)\right)}^2 &\leq&
\ES{\ell(\AS,z_i)^2}\ES{(M - \ell(\AS,z_j))^2}\\
&\leq& M^2 \ES{\ell(\AS,z_i)}\ES{M - \ell(\AS,z_j)}\\
&=& M^2 \ES{\ell(\AS,z_i)}(M-\ES{\ell(\AS,z_i)})\\
&\leq& \frac{M^4}{4}\,,
\end{eqnarray*}
so that we conclude
\[
I_1\leq \frac{M^2}{2m}\,.
\]
Now we rewrite $I_2$ as
\begin{eqnarray*}
\lefteqn{\ESz{z,z'}{\ell(A_{T,z_i,z_j},z)\ell(A_{T,z_i,z_j},z')
- \ell(A_{T,z_i,z_j},z)\ell(A_{T,z_i,z_j},z_i)}}\\
&& = \ESz{z,z'}{\ell(A_{T,z_i,z_j},z)\ell(A_{T,z_i,z_j},z')
- \ell(A_{T,z_j,z'},z)\ell(A_{T,z_j,z'},z')}\\
&& \mbox{(renaming $z_i$ as $z'$ in the second term)}\\
&& = \ESz{z,z'}{(\ell(A_{T,z_i,z_j},z) -
\ell(A_{T,z,z_j},z))\ell(A_{T,z_i,z_j},z')}\\&&+
\ESz{z,z'}{(\ell(A_{T,z,z_j},z) -
\ell(A_{T,z_j,z'},z))\ell(A_{T,z_i,z_j},z')}\\
&& \:\:\: + \ESz{z,z'}{(\ell(A_{T,z_i,z_j},z') -
\ell(A_{T,z_j,z'},z'))\ell(A_{T,z_j,z'},z)}\,.\\
\end{eqnarray*}
Next we rewrite $I_3$ as
\begin{eqnarray*}
\lefteqn{\ESz{z,z'}{\ell(A_{T,z_i,z_j},z_i)\ell(A_{T,z_i,z_j},z_j)
- \ell(A_{T,z_i,z_j},z)\ell(A_{T,z_i,z_j},z_i)}}\\
&&=\ESz{z,z'}{\ell(A_{T,z,z'},z)\ell(A_{T,z,z'},z')
- \ell(A_{T,z_i,z_j},z)\ell(A_{T,z_i,z_j},z_i)}\\
&& \mbox{(renaming $z_j$ as $z'$ and $z_i$ as $z$ in the first term)}\\
&&=\ESz{z,z'}{\ell(A_{T,z,z'},z)\ell(A_{T,z,z'},z')
- \ell(A_{T,z',z_i},z)\ell(A_{T,z',z_i},z')}\\
&& \mbox{(exchanging $z_i$ and $z_j$, then renaming $z_j$ as $z'$
in the
second term)}\\
&& = \ESz{z,z'}{(\ell(A_{T,z,z'},z') -
\ell(A_{T,z,z_i},z'))\ell(A_{T,z,z'},z)}\\&&
+ \ESz{z,z'}{(\ell(A_{T,z,z'},z) -
\ell(A_{T,z_i,z'},z))\ell(A_{T,z,z_i},z')}\\
&& \:\:\: + \ESz{z,z'}{(\ell(A_{T,z,z_i},z') -
\ell(A_{T,z',z_i},z'))\ell(A_{T,z_i,z'},z)}\\
&& = \ESz{z,z'}{(\ell(A_{T,z_j,z'},z') -
\ell(A_{T,z_j,z_i},z'))\ell(A_{T,z_j,z'},z_j)}\\&&+
\ESz{z,z'}{(\ell(A_{T,z,z_j},z) -
\ell(A_{T,z_i,z_j},z))\ell(A_{T,z,z_i},z_j)}\\
&& \:\:\: + \ESz{z,z'}{(\ell(A_{T,z',z_j},z) -
\ell(A_{T,z,z_j},z))\ell(A_{T,z_j,z},z')}\,,
\end{eqnarray*}
where in the last line we replaced $z$ by $z_j$ in the first term
and $z'$ by $z_j$ in the second term and we exchanged $z$ and $z'$
and also $z_i$ and $z_j$ in the last term.



Summing $I_2$ and $I_3$ we obtain
\begin{eqnarray*}
I_2 + I_3 &=& \ESz{z,z'}{(\ell(A_{T,z_i,z_j},z) -
\ell(A_{T,z,z_j},z)) (\ell(A_{T,z_i,z_j},z') -  \ell(A_{T,z,z_i},z_j))}\\
&& + \ESz{z,z'}{(\ell(A_{T,z,z_j},z) - \ell(A_{T,z_j,z'},z))
(\ell(A_{T,z_i,z_j},z') - \ell(A_{T,z_j,z},z'))}\\
&& + \ESz{z,z'}{(\ell(A_{T,z_i,z_j},z') -
\ell(A_{T,z_j,z'},z')) (\ell(A_{T,z_j,z'},z) - \ell(A_{T,z_j,z'},z_j))}\\
&\leq& 3M\ESz{z}{|\ell(A_{T,z_i,z_j},z)-\ell(A_{T,z,z_j},z)|}\\
&=& 3M\ESz{z_i'}{|\ell(\AS,z_i)-\ell(\ASci,z_i)|}\,,
\end{eqnarray*}
Which proves the first part of the bound.

For the second part, we use the same technique and slightly vary
the algebra. We rewrite $I_2$ as
\begin{eqnarray*}
\lefteqn{\ESz{z,z'}{\ell(A_{T,z_i,z_j},z)\ell(A_{T,z_i,z_j},z')
- \ell(A_{T,z_i,z_j},z)\ell(A_{T,z_i,z_j},z_i)}}\\
&& = \ESz{z,z'}{\ell(A_{T,z_i,z_j},z)\ell(A_{T,z_i,z_j},z')
- \ell(A_{T,z,z_j},z')\ell(A_{T,z,z_j},z)}\\
&& \mbox{(renaming $z_i$ as $z$ and $z$ as $z'$ in the second term)}\\
&& = \ESz{z,z'}{(\ell(A_{T,z_i,z_j},z') -
\ell(A_{T,z,z_j},z'))\ell(A_{T,z_i,z_j},z)}\\&& +
\ESz{z,z'}{(\ell(A_{T,z_i,z_j},z) -
\ell(A_{T,z,z_j},z))\ell(A_{T,z,z_j},z')}\,.
\end{eqnarray*}
Next we rewrite $I_3$ as
\begin{eqnarray*}
\lefteqn{\ESz{z,z'}{\ell(A_{T,z_i,z_j},z_i)\ell(A_{T,z_i,z_j},z_j)
- \ell(A_{T,z_i,z_j},z)\ell(A_{T,z_i,z_j},z_i)}}\\
&& = \ESz{z,z'}{\ell(A_{T,z_i,z},z_i)\ell(A_{T,z_i,z},z)
- \ell(A_{T,z_i,z_j},z)\ell(A_{T,z_i,z_j},z_i)}\\
&& \mbox{(renaming $z_j$ as $z$ in the first term)}\\
&& = \ESz{z,z'}{(\ell(A_{T,z_i,z},z_i) -
\ell(A_{T,z_i,z_j},z_i))\ell(A_{T,z_i,z},z)}\\&& +
\ESz{z,z'}{(\ell(A_{T,z_i,z},z) -
\ell(A_{T,z_i,z_j},z))\ell(A_{T,z_i,z_j},z_i)}\\
&& = \ESz{z,z'}{(\ell(A_{T,z_i,z},z_i) -
\ell(A_{T,z_i,z_j},z_i))\ell(A_{T,z_i,z},z)}\\&& +
\ESz{z,z'}{(\ell(A_{T,z_j,z},z) -
\ell(A_{T,z_i,z_j},z))\ell(A_{T,z_i,z_j},z_j)}\\
&& \mbox{(exchanging $z_i$ and $z_j$ in the second term)}\,.
\end{eqnarray*}
Summing $I_2$ and $I_3$ we obtain
\begin{eqnarray*}
I_2 + I_3 &=& \ESz{z,z'}{(\ell(A_{T,z_i,z_j},z') -
\ell(A_{T,z,z_j},z'))\ell(A_{T,z_i,z_j},z)}\\
&& + \ESz{z,z'}{(\ell(A_{T,z_i,z},z_i) -
\ell(A_{T,z_i,z_j},z_i))\ell(A_{T,z_i,z},z)}\\
&& + \ESz{z,z'}{(\ell(A_{T,z_j,z},z) -
\ell(A_{T,z_i,z_j},z)) (\ell(A_{T,z,z_j},z') - \ell(A_{T,z_i,z_j},z_j))}\\
&\leq& M\ESz{z_i',z}{|\ell(\AS,z)-\ell(\ASci,z)|} +
M\ESz{z_i'}{|\ell(\AS,z_j)-\ell(\ASci,z_j)|}\\&& +
M\ESz{z_i'}{|\ell(\AS,z_i)-\ell(\ASci,z_i)|}\,.
\end{eqnarray*}


The above concludes the proof of the bound for the empirical
error.

We now turn to the leave-one-out error. The bound can be obtain in
a similar way. Actually, we notice that if we rewrite the
derivation for the empirical error, we simply have to remove from
the training set the point at which the loss is computed. That is,
we simply have to replace all the quantities of the form
$\ell(A_{T,z,z'},z)$ by $\ell(A_{T,z'},z)$. It is easy to see that
the above results are modified in a way that gives the correct
bound for the leave-one-out error.

\section*{Appendix B. Proof of Theorem \ref{th:class2}}
First we rewrite Inequality (\ref{eq:cremp}) in Theorem
\ref{th:class} as
\[
\PS{\R-\Reg > 2\frac{\beta}{\gamma} +
\epsilon\left(\frac{4m\beta}{\gamma} + 1\right)
\sqrt{\frac{1}{2m}}} \leq e^{-\epsilon^2}\,.
\]
We introduce the following quantity
\[
u(\epsilon,\gamma) = 2\frac{\beta}{\gamma} +
\epsilon\left(\frac{4m\beta}{\gamma} + 1\right)
\sqrt{\frac{1}{2m}}\,,
\]
and rewrite the above bound as
\[
\PS{\R-\Reg > u(\epsilon,\gamma)}\leq e^{-\epsilon^2}\,.
\]
We define a sequence $(\gamma_k)_{k\geq 0}$ of real numbers such
that
\[
\gamma_k = Be^{-k}\,.
\]
We define $\epsilon_k = t + \sqrt{2\ln k}$.

Now, we use the union bound to get a statement that holds for all
values in the sequence $(\gamma_k)_{k\geq 1}$:
\begin{eqnarray*}
\PS{\exists k \geq 1,\ \R - \Res^{\gamma_k} >
u(\epsilon_k,\gamma_k)}
& \leq & \sum_{k\geq 1} \PS{\R - \Res^{\gamma_k} > u(\epsilon_k,\gamma_k)}\\
& \leq & \sum_{k\geq 1} e^{-\epsilon_k^2}\\
& \leq & \sum_{k\geq 1} \frac{1}{k^2}e^{-t^2} \leq 2e^{-t^2}\,.
\end{eqnarray*}

For a given $\gamma \in (0,B]$, consider the unique value $k\geq
1$ such that $\gamma_k \leq \gamma \leq \gamma_{k-1}$. We thus
have $\gamma_k \leq \gamma \leq e\gamma_k$.

The following inequalities follow from the definition of
$\gamma_k$
\[
\frac{1}{\gamma_k} \leq \frac{e}{\gamma}\,,
\]
\[
\Res^{\gamma_k} \leq \Reg\,,
\]
\[
\sqrt{2\ln k} = \sqrt{2\ln\ln\frac{B}{\gamma_k}}\leq
\sqrt{2\ln\ln\frac{eB}{\gamma}}\equiv \alpha\,,
\]
so that we have
\[
u(t+\sqrt{2\ln k}, \gamma_k) \leq 2\frac{e\beta}{\gamma} + \left(t
+ \alpha\right)\left(\frac{4me\beta}{\gamma} + 1\right)
\sqrt{\frac{1}{2m}} \equiv v(\gamma,t)\,.
\]

We thus get the following implication
\[
\R-\Reg > v(\gamma,t) \Rightarrow \R-\Res^{\gamma_k} >
u(t+\sqrt{2\ln k}, \gamma_k)\,.
\]
This reasoning thus proves that
\[
\PS{\exists \gamma\in (0,B],\; \R-\Reg > v(\gamma,t)} \leq
\PS{\exists k\geq 0,\; \R-\Res^{\gamma_k} > u(t+\sqrt{2\ln k},
\gamma_k)}\,,
\]
and thus
\[
\PS{\exists \gamma\in (0,B],\; \R-\Reg > v(\gamma,t)} \leq
2e^{-t^2}\,,
\]
which can be written as
\[
\PS{\exists \gamma\in (0,B],\; \R-\Reg > 2\frac{e\beta}{\gamma} +
(t+\alpha)\left(\frac{4me\beta}{\gamma} + 1\right)
\sqrt{\frac{1}{2m}}} \leq 2e^{-t^2}\,,
\]
and gives with probability $1-\delta$
\[
\forall \gamma\in (0,B],\; \R\leq \Reg + 2\frac{e\beta}{\gamma} +
\left(\sqrt{\ln1/\delta} +
\sqrt{2\ln\ln\frac{eB}{\gamma}}\right)\left(\frac{4me\beta}{\gamma}
+ 1\right) \sqrt{\frac{1}{2m}}\,,
\]
which gives the first inequality. The second inequality can be
proven in the same way.


\section*{Appendix C. Convexity}
For more details see \cite{Gordon99} or \cite{Rockafellar70}. A
{\em convex} function $F$ is any function from a vector space $\F$
to $\R\cup\{-\infty,+\infty\}$ which satisfies
\[
\lambda F(g) + (1-\lambda) F(g') \geq F(\lambda g +
(1-\lambda)g')\,,
\]
for all $g,g'\in\F$ and $\lambda\in[0,1]$. A {\em proper} convex
function is one that is always greater than $-\infty$ and not
uniformly $+\infty$. The domain of $F$ is the set of points where
$F$ is finite. A convex function is {\em closed} if its epigraph
$\{(\f,y): y\geq F(\f)\}$ is closed. The {\em subgradient} of a
convex function at a point $g$, written $\partial F(g)$ is the set
of vectors $a$ such that
\[
F(g')\geq F(g) + \left<g'-g,a\right>\,,
\]
for all $g'$.


Convex functions are continuous on the interior of their domain
and differentiable on the interior of their domain except on a set
of measure zero. For a convex function $F$ we define the {\em
dual} of $F$, noted $F^*$ by
\[
F^*(a) = \sup_{g} \left<a,g\right> - F(g)\,.
\]
Denoting by $\nabla F(g')$ a subgradient of $F$ in $g'$ (i.e. a
member of $\partial F(g')$), we can define the Bregman divergence
associated to $F$ of $g$ to $g'$ by
\[
d_F(g,g') = F(g)-F(g')-\left<g-g',\nabla F(g')\right>\,.
\]
When $F$ is everywhere differentiable, this is well defined (since
the subgradient is unique) and nonnegative (by the definition of
the subgradient). Otherwise, we can define the generalized
divergence as
\[
\bar{d}_F(g,a) = F(g)+F^*(a)-\left<g,a\right>\,,
\]
where $a\in \F^*$. Notice that this divergence is also
nonnegative. Moreover, the fact that $f$ is a minimum of $F$ in
$\F$ is equivalent to
\[
\vz\in \partial F(f)\,,
\]
which, with the following relationship
\[
a\in \partial F(g)\Rightarrow F(g) + F^*(a) = \left<g,a\right>\,,
\]
gives
\[
F(f) + F^*(\vz) = 0\,,
\]
when $f$ is a minimum of $F$ in $\F$.


When $F$ is everywhere differentiable, it is easy to get
\begin{equation}\label{eq:div}
\forall g\in\F,\, d_F(g,f) = F(g)-F(f)\,,
\end{equation}
otherwise, using generalized divergences, we have
\begin{equation}\label{eq:div2}
\forall g\in\F,\, \bar{d}_F(g,\vz) = F(g)-F(f)\,.
\end{equation}

\renewcommand{\baselinestretch}{0.1}

\begin{thebibliography}{22}

\iffalse
 \expandafter\ifx\csname
natexlab\endcsname\relax\def\natexlab#1{#1}\fi

\expandafter\ifx\csname url\endcsname\relax
  \def\url#1{{\tt #1}}\fi
\fi

%\renewcommand{\baselinestretch}{2}
\linespread{.95} \small \normalsize

\bibitem[Alon et~al.(1997) Alon, Ben-David, Cesa-Bianchi, and
  Haussler]{AlonEtAl}
N.~Alon, S.~Ben-David, N.~Cesa-Bianchi, and D.~Haussler.
\newblock Scale-sensitive dimensions, uniform convergence, and learnability.
\newblock {\em Journal of the ACM}, 44\penalty0 (4):\penalty0 615--631, 1997.
\bibitem[Bartlett(1996)]{Bar96}
P.~Bartlett
\newblock For valid generalization, the size of the weights is more important than the size of the network
\newblock {\em Advances in Neural Information Processing Systems}, 1996.
\bibitem[Bonnans and Shapiro(1996)]{BonSch96}
J.F.~Bonnans and A.~Shapiro.
\newblock Optimization problems with perturbation, a guided tour.
\newblock Technical Report 2872, INRIA, April 1996.
\bibitem[Bousquet and Elisseeff(2001)]{Bou_Eli01}
O.~Bousquet and A.~Elisseeff.
\newblock Algorithmic stability and generalization performance.
\newblock In {\em Neural Information Processing Systems 14}, 2001.
\bibitem[Breiman(1996{\natexlab{a}})]{Breiman96a}
L.~Breiman.
\newblock Bagging predictors.
\newblock {\em Machine Learning}, 24:\penalty0 123--140, 1996{\natexlab{a}}.
\bibitem[Breiman(1996{\natexlab{b}})]{Breiman96b}
L.~Breiman.
\newblock Heuristics of instability and stabilization in model selection.
\newblock {\em Annals of Statistics}, 24\penalty0 (6):\penalty0 2350--2383,
  1996{\natexlab{b}}.
\bibitem[Cover and Thomas(1991)]{CovTho91}
T.M. Cover and J.A. Thomas.
\newblock {\em Elements of Information Theory}.
\newblock John Wiley, 1991.
\bibitem[Devroye(1991)]{Devroye91}
L.~Devroye.
\newblock Exponential inequalities in nonparametric estimation.
\newblock In {\em Nonparametric Functional Estimation and Related Topics},
  pages 31--44. Kluwer Academic Publishers, 1991.
\bibitem[Devroye et~al.(1996)Devroye, Gy\"orfi, and Lugosi]{Dev_al96}
L.~Devroye, L.~Gy\"orfi, and G.~Lugosi.
\newblock {\em A Probabilistic Theory of Pattern Recognition}.
\newblock Springer Verlag, 1996.
\bibitem[Devroye and Wagner(1979{\natexlab{a}})]{DevWag79a}
L.~Devroye and T.~Wagner.
\newblock Distribution-free inequalities for the deleted and holdout error
  estimates.
\newblock {\em IEEE Transactions on Information Theory}, 25\penalty0
  (2):\penalty0 202--207, 1979{\natexlab{a}}.
\bibitem[Devroye and Wagner(1979{\natexlab{b}})]{DevWag79b}
L.~Devroye and T.~Wagner.
\newblock Distribution-free performance bounds for potential function rules.
\newblock {\em IEEE Transactions on Information Theory}, 25\penalty0
  (5):\penalty0 601--604, 1979{\natexlab{b}}.
\bibitem[Evgeniou et~al.(1999)]{Evg_al99}
\newblock {T.~Evgeniou and M.~Pontil and T.~Poggio}.
\newblock {A unified framework for Regularization Networks and Support
Vector Machines}. {A.I. Memo 1654}, Massachusetts Institute of
Technology, Artificial Intelligence Laboratory, December
1999.
\bibitem[Gordon(1999)]{Gordon99} G.~Gordon.
\newblock {\em Approximate Solutions to Markov Decision Processes}.
\newblock PhD thesis, Carnegie Mellon University, 1999.
\bibitem[Jaakola et~al.(1999)Jaakola, Meila, and Jebara]{Jaa_al99}
T.~Jaakola, M.~Meila, and T.~Jebara.
\newblock Maximum entropy discrimination.
\newblock In {\em Neural Information Processing Systems 12}, 1999.
\bibitem[Kearns and Ron(1999)]{KeaRon99}
M.~Kearns and D.~Ron.
\newblock Algorithmic stability and sanity-check bounds for leave-one-out
  cross-validation.
\newblock {\em Neural Computation}, 11\penalty0 (6):\penalty0 1427--1453, 1999.
\bibitem[Lugosi and Pawlak(1994)]{LugPaw94}
G.~Lugosi and M.~Pawlak.
\newblock On the posterior-probability estimate of the error of nonparametric
  classification rules.
\newblock {\em IEEE Transactions on Information Theory}, 40\penalty0
  (2):\penalty0 475--481, 1994.
\bibitem[McDiarmid(1989)]{McD89}
C.~McDiarmid.
\newblock On the method of bounded differences.
\newblock In {\em Surveys in Combinatorics}, pages 148--188. Cambridge
  University Press, Cambridge, 1989.
\bibitem[Poggio and Girosi(1990)]{Pog_Gir90}
T.~Poggio and F.~Girosi.
\newblock Regularization algorithms for
learning that are equivalent to multilayer networks.
\newblock In {\em Science}, 247\penalty0 (2):\penalty0 978--982, 1990.
\bibitem[Rockafellar(1970)]{Rockafellar70}
R.T. Rockafellar.
\newblock {\em Convex Analysis}.
\newblock Princeton University Press, Princeton, NJ, 1970.
\bibitem[Rogers and Wagner(1978)]{RogWag78}
W.~Rogers and T.~Wagner.
\newblock A finite sample distribution-free performance bound for local
  discrimination rules.
\newblock {\em Annals of Statistics}, 6\penalty0 (3):\penalty0 506--514, 1978.
\bibitem[Steele(1986)]{Steele86}
J.M. Steele.
\newblock An {E}fron-{S}tein inequality for nonsymmetric statistics.
\newblock {\em Annals of Statistics}, 14:\penalty0 753--758, 1986.
\bibitem[Talagrand(1996)]{Talagrand96}
M.~Talagrand.
\newblock A new look at independence.
\newblock {\em Annals of Probability}, 24:\penalty0 1--34, 1996.
\bibitem[Vapnik(1982)]{Vap_82}
V.N. Vapnik.
\newblock {\em Estimation of Dependences Based on Empirical Data}.
\newblock Springer-Verlag, 1982.
\bibitem[Wahba(2000)]{Wahba00}
G.~Wahba.
\newblock An introduction to model building with reproducing kernel hilbert
  spaces.
\newblock Technical Report Statistics Department TR 1020, University of
  Wisconsin, Madison, 2000.
\end{thebibliography}

\end{document}
