\documentclass[twoside,11pt]{article}

\usepackage{jmlr2e}
\usepackage{amsmath}
%\usepackage{latexsym}
%\usepackage{amsthm}




\newcommand{\zbar}{\overline{z}}
\def\ie{i.e.}
\def\pt#1#2{{\partial #1\over \partial #2}}
\def\ddt#1{{d {#1}\over dt}}
\def\nin{\noindent}
\def\be{\begin{equation}}
\def\ee{\end{equation}}
\newcommand{\fsubc}{f_{\rm{c}}}
\newcommand{\subc}{_{\rm{c}}}
\newcommand{\orb}{{\vec 0}_{Z\times \reals }}
\newcommand{\orz}{{\vec 0}_{Z}}
\newcommand{\pf}{\noindent{\bf Proof.}\ }
\newcommand{\del}{\partial}
\newcommand{\Qed}{\begin{flushright} $\Box$\ \ \ \ \ \
                  \end{flushright}}
\newcommand{\complex}{{\Bbb C}}
\newcommand{\reals}{{\Bbb R}}
\newcommand{\ctn}{\complex^{2n}}
\newcommand{\rtn}{\reals^{2n}}
\newcommand{\rn}{\reals^{n}}
\newcommand{\rtm}{\reals^{2m}}
\newcommand{\frakd}{{\frak d}}
\newcommand{\frakb}{{\frak b}}
\newcommand{\frakf}{{\frak f}}
\newcommand{\frakg}{{\frak g}}
\newcommand{\frakh}{{\frak h}}
\newcommand{\frakl}{{\frak l}}
\newcommand{\frakp}{{\frak p}}
\newcommand{\frakq}{{\frak q}}
\newcommand{\fraks}{{\frak s}}
\newcommand{\bstar}{{\frak b}^{*}}
\newcommand{\gstar}{{\frak g}^{*}}
\newcommand{\hstar}{{\frak h}^{*}}
\newcommand{\lstar}{{\frak l}^{*}}
\newcommand{\sstar}{{\frak s}^{*}}
\newcommand{\starh}{*_{\hbar}}
\newcommand{\torus}{{\Bbb T}}
\newcommand{\integers}{{\Bbb Z}}
\newcommand{\boldf}{\mbox{\boldmath $f$}}
\newcommand{\boldpi}{\mbox{\boldmath $\pi$}}
\newcommand{\inner}{\backl}
\newcommand{\grad}{\nabla}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\quarter}{\frac{1}{4}}
\newcommand{\cinf}{C^{\infty}}
\newcommand{\canoncoords}{q_{1},\dots,q_{n},p_{1},\dots ,p_{n}}
\newcommand{\canonbracket}{\sum _{i=1}^{n}
[\frac{\del f}{\del q_{i}}\frac{\del g}{\del p_{i}}-
\frac{\del g}{\del q_{i}}\frac{\del f}{\del p_{i}}]}
\newcommand{\symplecticform}{\sum_{i=1}^{n}dq_{i}\wedge dp_{i}}
\newcommand{\poissend}[1]{#1\times\overline{#1}}
\newcommand{\oppositepoissend}[1]{\overline{#1}\times #1}
\newcommand{\threetimes}[1]{#1 \times \overline{#1}\times\overline{#1}}
\newcommand{\fourtimes}[1]{#1\times\overline{#1}\times\overline{#1}\times
#1}
\newcommand{\inverse}{^{-1}}
\newcommand{\boldc}{{\bf c}}
\newcommand{\boldi}{{\bf i}}
\newcommand{\boldj}{{\bf j}}
\newcommand{\boldk}{{\bf k}}
\newcommand{\boldx}{{\bf x}}
\newcommand{\boldy}{{\bf y}}
\newcommand{\D}{{\bf D}}
\newcommand{\F}{{\bf F}}
\newcommand{\tr}{\trace}
\newcommand{\trace}{\mbox{\rm{Tr}}}
\newcommand{\area}{\mbox{\rm{Area}}}
\newcommand{\backl}{\mathbin{\vrule width1.5ex height.4pt\vrule height1.5ex}}
\newcommand{\comment}{\tt}
\newcommand{\cala}{{\cal A}}
\newcommand{\calb}{\frakb}
\newcommand{\calc}{{\cal C}}
\newcommand{\cald}{{\cal D}}
\newcommand{\cale}{{\cal E}}
\newcommand{\calf}{{\cal F}}
\newcommand{\calg}{{\cal G}}
\newcommand{\calh}{{\cal H}}
\newcommand{\cali}{{\cal I}}
\newcommand{\calj}{{\cal J}}
\newcommand{\calk}{{\cal K}}
\newcommand{\call}{{\cal L}}
\newcommand{\calm}{{\cal M}}
\newcommand{\caln}{{\cal N}}
\newcommand{\calo}{{\cal O}}
\newcommand{\calp}{{\cal P}}
\newcommand{\calz}{{\cal Z}}

\newcommand{\defequal}{\stackrel{\mbox {\small {def}}}{=}}
\newcommand{\smalcirc}{\mbox{\small $\circ $}}



\jmlrheading{2}{2002}{335-358}{6/01}{2/02}{Adam Cannon, J. Mark Ettinger, Don Hush and Clint Scovel}
\ShortHeadings{Machine learning with data dependent hypothesis classes}{Cannon, Ettinger, Hush and Scovel}
\firstpageno{335}

\begin{document}

\title{Machine Learning with Data Dependent Hypothesis Classes}

\author{\name Adam Cannon \email cannon@cs.columbia.edu \\
       \addr Department of Computer Science\\
       Columbia University\\
       New York, NY 10027, USA
       \AND
       \name J. Mark Ettinger  \email ettinger@lanl.gov \\
       \addr Nonproliferation and International Security Group, NIS-8\\
       Los Alamos National Laboratory\\
       Los Alamos, NM 87545, USA
       \AND
       \name Don Hush \email dhush@lanl.gov\\
       \name Clint Scovel \email jcs@lanl.gov\\
       \addr Modeling, Algorithms, and Informatics Group, CCS-3\\
       Los Alamos National Laboratory\\
       Los Alamos, NM 87545, USA}
       
\editor{Peter Bartlett}
\maketitle

\begin{abstract}%
We extend the VC theory of statistical learning to
data dependent spaces of classifiers.
This theory can be viewed as a decomposition
of classifier design into two components;
the first component is a restriction to a data dependent
hypothesis class
and the second is empirical risk minimization
within that class.
We define a measure of complexity for
data dependent hypothesis classes and
provide data dependent versions of
bounds on error deviance and estimation error.
We also provide
a structural risk minimization procedure
over data dependent hierarchies and prove consistency.
We use this theory to provide a framework for
studying the trade-offs between performance and
computational complexity in classifier design.
As a consequence we obtain
a new family of classifiers with dimension independent
performance bounds and efficient learning procedures.
\end{abstract}


\begin{keywords}
Computational Learning Theory, Empirical Process Theory, Classification, Shatter Coefficient, Structural Risk Minimization
\end{keywords}


\section{Introduction}

Vapnik motivated his development of support vector machines
as a kind of structural risk minimization.
However, the corresponding class sequence is data dependent and so Vapnik's
theory of structural risk minimization does not apply.  To resolve this
issue, Shawe-Taylor et al. (1998) have developed a theory of structural
risk minimization over data dependent hierarchies which gives performance
guarantees for support vector machines (see also Shawe-Taylor \& Cristianini, 2000).
In addition, the work on data dependent
complexity regularization of Buescher and Kumar (1996),
the posterior bounds of Freund (1998), the conditional bounds of
Devroye (1988), and the work on
data dependent penalties of Koltchinskii et al. (2000), Koltchinskii (2001), Boucheron et al. (2000), and Bartlett et al. (2000) have a similar goal.

Devroye (1988) developed performance bounds for
data dependent hypothesis classes in a similar spirit to those presented here.
However Devroye's approach provides conditional bounds whereas
the approach taken here
more closely resembles the VC framework developed by Vapnik.
Recently Gat (1999) proved a performance bound for the perceptron that extended
the VC theorem to the case of data dependent classes of
classifiers. This bound uses a counting complexity instead of a shattering
complexity for the data dependent class.
In this paper we show that Gat's result can be extended in terms of a 
shatter coefficient for data dependent classes. 
We build on this result to construct a new framework for learning
with data dependent hypothesis classes.
Although the framework of Shawe-Taylor et al.
may be more general, its exact relationship to the framework
developed here is not clear.
At a minimum the two frameworks appear 
to differ in the ease with which they can be
applied to particular learning paradigms.
For example, our framework has not yet provided
performance guarantees for support vector machines.
However, it has facilitated the discovery of
new families of classifiers that
possess dimension independent performance guarantees
for empirical risk minimization.



We contrast our framework with the VC framework.
In the VC framework
the generalization error can be decomposed into two components,
the approximation error $A$ which quantifies the lack of optimality
introduced by our choice of hypothesis class,
and the estimation error $E$ which quantifies the 
lack of optimality
of empirical error minimization due to finite sample size.
For classes with finite VC dimension the
celebrated VC theorem provides a
distribution independent bound on $E$
that goes to zero as the number of training samples $n$ goes to infinity.
Control on the approximation error is provided through a
structural risk minimization (SRM)
learning procedure that is proved to be consistent
for an infinite sequence of classes with finite VC dimension.

In our framework classifier design is 
decomposed into two components;
the first component is the restriction to the data dependent
hypothesis class
and the second is empirical risk minimization
within that class.
We define data dependent versions of approximation error $A_D$ and
estimation $E_D$ error in the obvious way.
We provide a VC-like theorem that gives bounds on $E_D$
in terms of a shatter coefficient for data dependent classes.
Based on Vapnik's construction we also provide
a structural risk minimization procedure
over data dependent hierarchies and prove consistency.
When the data dependent hypothesis class is obtained by
restricting a traditional class through a data dependency rule
the data dependent approximation error $A_D$
splits into
\[
A_D = A + E_{DD}
\]
where $A$ is the approximation error for the
traditional class and $E_{DD}$ is the {\em data dependency error}.
Consequently the analysis of approximation error
is similar to that for traditional classifiers,
except for the data dependency error term.
We have just begun the study of this random variable.
For example we show that when the data
dependency is symmetric in its dependence on
the $n$-sample the infinite sample limit
of this random variable is a constant.
At present we do not know 
general conditions on the data dependency
that allow this constant to
be computed, and in particular when it is zero.
However, for the structural risk minimization
over multi-sphere classifiers
described in Section \ref{sect:srm_multi-sphere}
this constant is zero.
In this case we have also shown that 
the infinite sample limit of the data
dependent approximation error is zero.

We also wish to address the computational
requirements of the learning procedure.
Despite the performance guarantees
provided by the VC theory,
empirical error minimization is computationally
intractable for nearly all nontrivial hypothesis classes.
One of the strengths of our framework is that it allows us to
explore trade-offs between generalization error
and computational requirements
through the choice of data dependency.
For example, it allows us
to modify a traditional classifier
to obtain a family of classifiers
through various data dependencies,
and to quantify the performance and computational requirements
over this family
with a greater degree of flexibility than we have seen before.
Among the families considered here are classifiers with
dimension independent performance bounds
that may benefit from
kernel mappings to high dimensions like those used in
support vector machines.

This paper is organized as follows.
We prove
uniform convergence of empirical processes over data dependent classes
as an extension of Gat's theorem.
Specific data dependent classes are
introduced and their computational and structural complexity analyzed.
A structural risk minimization procedure is described and a
consistency theorem proven.
The structural risk minimization procedure is demonstrated on
a specific family of spherical classifiers and conditions
for consistency with the Bayes error are provided.
Finally we describe a framework for
analyzing the trade-offs between computation and performance
as a function of the data dependency
and apply it to the family of spherical classifiers.


 
\section{Uniform convergence of empirical processes over data dependent classes}
Let $Z$ denote a metric space with its Borel $\sigma$-algebra and a Borel
probability measure
$\mu$. We let $Z_{n}$ denote the $n$-fold product of $Z$ with
itself with product $\sigma$-algebra and let ${\cal P}_{Z_{n}}=\mu^{n}$
 denote the $n$-fold
product of the measure $\mu$. 
We denote $n$ independent
samples $\{z_{n}(i),i=1,..,n\}$ as $z_{n}=(z_{n}(1),..,z_{n}(n)) \in Z_{n}$.
In this paper we consider functions $f:Z
\rightarrow \{0,1\}$.
We define
$\calf_{n}$ to be a class of such functions, where members of this class are
determined through application of a data dependency rule to $n$-samples. For
example, in Section \ref{sect:multi-sphere} we consider functions that
dichotomize $\Re^{d}$ with a sphere where the data dependency rule forces the
sphere to be centered at one of the $n$ data points. Therefore, $\calf_{n}$ is
a class of binary functions on $(Z_{n},Z)$.
We define
a data dependent class
  $\calf =\{ \calf_{n}\}$ to be a collection of
classes $\calf_{n}$.

We now assume some structure concerning the classes' data
dependency and the classes' description as a parametric family.
 We denote the class $\calf$ restricted to the $n$-sample $z_{n}$
by 
$\calf_{z_{n}}$. We assume we can describe this class in parametric form through
some
parameter space $Y_{n}$, not necessarily a product space,
 such that each function is $f_{y_{n},z_{n}}(z)$ for some
choice of $y_{n} \in Y_{n}$ and each choice of $y_{n}$ gives a function in the
class. Then the whole data dependent class on $n$ samples can be described by a single
function $\Xi_{n}(y_{n},z_{n},z)=f_{y_{n},z_{n}}(z)$.  
 Recall that  
a Polish space is a complete separable metric space
and a Suslin space is a Borel measurable image of a Polish space.
We say that the data dependent class ${\cal F}_{n}$ is image admissible Suslin 
if there exists a Suslin parameter space $Y_{n}$ and a $z_{n}$ parameterized family
of maps
$T_{z_{n}}:Y_{n}\rightarrow {\cal
F}_{z_{n}}$ such that
the evaluation function 
\[ \Xi_{n}; Y_{n} \times Z_{n} \times Z  \rightarrow \{0,1\}\]
defined by
$\Xi_{n}(y_{n},z_{n},z)= (T_{z_{n}}(y_{n}))(z)$
is jointly measurable in $(y_{n},z_{n},z)$ with the product $\sigma$-algebra of the Borel
sets of $Y_{n}$ and $Z_{n+1}$.  
The
inclusion of measurability considerations for the results in this section 
goes through in much the
same way as for the $VC$ theorem as presented by Dudley (1999).
On the other hand in the proof of Theorem \ref{const} in Section \ref{symmetric}
 measurability is
crucial and so we consider it
explicitly then.

There is a one to one mapping between functions $f:Z \rightarrow \{0,1\}$ and
their
indicator sets $I(f)=\{z: f(z)=1\}$. Throughout this paper we make this
identification
regularly without comment. In particular we identify $\mu(f)=\mu(I(f))$.
 
 
\begin{definition} For $n \leq m$ define ${\cal N}_{n}(z_{m},\calf)$ to be the
number of distinct dichotomies of the $m$ points $z_{m}$
generated as the functions vary over the union of the  data
dependent classes determined
 by all subsets of $z_{m}$ containing $n$ points. That is,
let $I_{f}= \{z: f(z)=1\}$ denote the set where the function is equal to one.
Then ${\cal N}_{n}(z_{m},\calf)$ is the number of different sets in
\[ \left\{\{z_{m}(1),...,z_{m}(m)\} \cap I_{f}: f \in \calf_{w_{n}}, w_{n} \subset
z_{m}\right\}\] 
where $w_{n} \subset z_{m}$ means that $\{w_{n}(i),i=1,..,n\} \subseteq
 \{z_{m}(j), j=1,..,m\}$.  
The shatter coefficient for a data dependent class $\calf$ is defined as
\[S_{n/m}(\calf)=\sup_{z_{m}}{{\cal N}_{n}(z_{m},\calf)}.\]
\end{definition}
We begin by citing Gat's (1999) observation that
 Vapnik's basic lemma regarding ghost samples still applies for data dependent
classes.
Let $z_{n}$ denote the $n$-sample and $z_{n_{2}}$
the ghost sample. Denote
$z_{m}=(z_{n},z_{n_{2}})$ with $m=n+n_{2}$. We write the empirical means
\[ \hat{\mu}(f)= \hat{\mu}_{1}(f)=\frac{1}{n} \sum_{i=1}^{n}{f(z_{n}(i))}\]
and
\[ \hat{\mu}_{2}(f)=\frac{1}{m-n} \sum_{i=1}^{m-n}{f(z_{n_{2}}(i))}.\]


\begin{lemma}(Gat)
\label{basiclemma}

\[{\cal{P}}_{Z_{n}}\left( \sup_{ f \in
\calf_{z_{n}}}{\bigl|\mu(f)-\hat{\mu}_{1}(f)\bigr|}>\epsilon \right)
 \leq
2{\cal{P}}_{Z_{m}} \left( \sup_{ f \in
\calf_{z_{n}}}{\bigl|\hat{\mu}_{2}(f)-\hat{\mu}_{1}(f)\bigr|}>\epsilon-\frac{1}{m-n} \right)
\]
\end{lemma}
\begin{proof}
Just follow Vapnik's proof (Vapnik, 1998, page 132) and observe that
the relevant $n$ is $n_{2}$ and the data dependency makes no difference.
\end{proof}


Now we state and prove our main theorem for data dependent classes.

\begin{theorem}
\label{main}
For any $m > n$,
\[{\cal{P}}_{Z_{n}}\left( \sup_{ f \in
\calf_{z_{n}}}\bigl|\mu(f)-\hat{\mu}_{1}(f)\bigr|>\epsilon \right) \leq
2S_{n/m}(\calf)e^{2\epsilon}e^{-(\frac{1}{n}+\frac{1}{m-n})^{-1}\epsilon^{2}
}.\]

\end{theorem}
\begin{proof}
Gat (1999) proved the version of this theorem that counted the number of functions
in $ \{  \calf_{w_{n}},
w_{n} \subseteq z_{m}\}$ when $m=2n$. We show this proof
can generate bounds in terms of the shatter coefficient for the
data dependent class.
Consider
\[{\cal{P}}_{Z_{m}}\left(\sup_{ f \in
\calf_{z_{n}}}{\bigl|\hat{\mu}_{2}(f)-\hat{\mu}_{1}(f)\bigr|}>\acute{\epsilon}\right)\]
where $\acute{\epsilon}=\epsilon-\frac{1}{m-n}$. Let $\Sigma$ denote the set of
permutations of the integers $(1,2,...,m)$ and use the same designation for the
set of permutations induced on the $m$ sample $z_{m}$.
Since ${\cal{P}}_{Z_{m}}$ is invariant under $\Sigma$, for any permutation $\sigma$
\[{\cal{P}}_{Z_{m}}\left(\sup_{ f \in
\calf_{z_{n}}}\bigl|\hat{\mu}_{2}(f)-\hat{\mu}_{1}(f)\bigr|>\acute{\epsilon}\right)
={\cal{P}}_{Z_{m}}\left(\sup_{ f \in
\calf_{\sigma(z_{m})_{1}}}\bigl|\hat{\mu}_{\sigma(z_{m})_{2}}(f)-
\hat{\mu}_{\sigma(z_{m})_{1}}(f)\bigr|
>\acute{\epsilon}\right)\]
where
$\sigma(z_{m})=(\sigma(z_{m})_{1},\sigma(z_{m})_{2})$ and $\sigma(z_{m})_{1}$ is
of length $n$.

Place the uniform probability distribution on $\Sigma$. It then follows that
\[{\cal{P}}_{Z_{m}}\left(\sup_{ f \in
\calf_{z_{n}}}\bigl|\hat{\mu}_{2}(f)-\hat{\mu}_{1}(f)\bigr|>\acute{\epsilon}\right)
={\cal{P}}_{Z_{m},\Sigma}\left(\sup_{ f \in
\calf_{\sigma(z_{m})_{1}}}\bigl|\hat{\mu}_{\sigma(z_{m})_{2}}(f)-\hat{\mu}_
{\sigma(z_{m})_{1}}(f)\bigr|>
\acute{\epsilon}\right)\]
\begin{equation}
\label{eq1}
= \int{\left({\cal{P}}_{\Sigma|Z_{m}}\Bigl(\sup_{ f \in
\calf_{\sigma(z_{m})_{1}}}\bigl|\hat{\mu}_{\sigma(z_{m})_{2}}(f)-
\hat{\mu}_{\sigma(z_{m})_{1}}(f) \bigr|>
\acute{\epsilon} \ \big| \ Z_{m}\Bigr)
\right)d{\cal{P}}_{Z_{m}}}
\end{equation}
but for fixed $Z_{m}$
\[{\cal{P}}_{\Sigma|Z_{m}}\left(\sup_{ f \in
\calf_{\sigma(z_{m})_{1}}}{\bigl|\hat{\mu}_{\sigma(z_{m})_{2}}(f)-
\hat{\mu}_{\sigma(z_{m})_{1}}(f)\bigr|}>
\acute{\epsilon} \ \big| \ Z_{m}\right)
 \leq\]
\begin{equation}
\label{eq2}
 S_{n/m}(\calf)\sup_{ f \in \bigcup_{\sigma \in \Sigma} \calf_{\sigma(z_{m})_{1}}}
{{\cal{P}}_{\Sigma|Z_{m}}\left(
\bigl|\hat{\mu}_{\sigma(z_{m})_{2}}(f)-\hat{\mu}_{\sigma(z_{m})_{1}}(f)\bigr|>
\acute{\epsilon} \ \big| \ Z_{m} \right).}
\end{equation}
To bound 
\[{\cal{P}}_{\Sigma|Z_{m}}\left(
\bigl|\hat{\mu}_{\sigma(z_{m})_{2}}(f)-\hat{\mu}_{\sigma(z_{m})_{1}}(f)\bigr|>\acute{\epsilon}
\ \big| \ Z_{m}\right)
\]
we map to the hypergeometric distribution
in the following fashion.  For $f$ fixed, let $l$ be
the number of indices such that $f(z_{m}(i))=1$. For fixed $\sigma$ let $k$
denote the number of indices $i=1,..,n$ such that
$f(\sigma(z_{m})_{1}(i))=1$.
For each combination of $k$ indices chosen from
$l$ and $n-k$ chosen from $m-l$, there are $n!(m-n)!$
permutations. The number of combinations which obtain $k$
indices from the $l$ and $n-k$ from the $m-l$ is ${l \choose
k}{m-l \choose n-k}$ so the number of corresponding
permutations is ${l \choose k}{m-l \choose
n-k}{n!(m-n)!}$. Consequently, the fraction of
permutations with $k$ ones in the first $n$ indices and the
remaining $l-k$ ones in the remaining $m-n$ indices is
\[\frac{{l \choose k}{m-l \choose
n-k}{n!(m-n)!}}{m!}= \frac{{l \choose k}{m-l \choose
n-k}}{{m \choose n}}\]
and so is distributed like the hypergeometric random variable
$K(n,m,l)$ of how many defectives are selected
when $n$
items are randomly selected without replacement from a population
of $m$ items of which $l$ are defective.
For
each of these permutations
\[\hat{\mu}_{\sigma(z_{m})_{1}}(f)-\hat{\mu}_{\sigma(z_{m})_{2}}(f)=
\frac{k}{n}-\frac{l-k}
{m-n}=
\frac{m}{n(m-n)}\Bigl(k-\frac{ln}{m}\Bigr).\]
Consequently,
\[{\cal{P}}_{\Sigma|Z_{m}} \Bigl(
|\hat{\mu}_{\sigma(z_{m})_{2}}(f)-\hat{\mu}_{\sigma(z_{m})_{1}}(f)|>\acute{\epsilon}
\ \big| \ Z_{m}\Bigr)
= {\cal{P}}_{K(n,m,l)}\left(\Bigl|k-\frac{ln}{m}\Bigr| > \frac{n(m-n)}{m}\acute{\epsilon}\right).\]
Applying Serfling's result (Serfling, 1974) on concentration of sampling
without replacement to
the hypergeometric distribution
\[  {\cal{P}}_{K(n,m,l)}\left(\Bigl|\frac{k}{n}-\frac{l}{m}\Bigr| >  t\right) \leq
e^{-\frac{2nm}{m-n+1}t^{2}}\]
with $t=\frac{m-n}{m}\acute{\epsilon}$
we obtain
\[{\cal{P}}_{\Sigma|Z_{m}}\Bigl(
|\hat{\mu}_{\sigma(z_{m})_{2}}(f)-\hat{\mu}_{\sigma(z_{m})_{1}}(f)|>\acute{\epsilon}
\ \big| \ Z_{m}\Bigr) \leq e^{-2\frac{n}{m}\frac{(m-n)^{2}}{m-n+1}\acute{\epsilon}^{2}}.\]
We use the crude lower bound $\frac{1}{m-n+1}\geq
\frac{1}{2}\frac{1}{m-n}$, losing almost a factor of $2$ in exponent, to obtain
 the simpler formula
\[{\cal{P}}_{\Sigma|Z_{m}}\Bigl(
|\hat{\mu}_{\sigma(z_{m})_{2}}(f)-\hat{\mu}_{\sigma(z_{m})_{1}}(f)|>\acute{\epsilon}
\ \big| \ Z_{m}\Bigr) \leq e^{-(\frac{1}{n}+\frac{1}{m-n})^{-1}\acute{\epsilon}^{2}}.\]


Consequently,
 from Equation \ref{eq2} we
obtain
\begin{equation}
\label{bound}
{\cal{P}}_{\Sigma|Z_{m}}\left(\sup_{ f \in
\calf_{\sigma(z_{m})_{1}}}
\bigl|\hat{\mu}_{\sigma(z_{m})_{2}}(f)-\hat{\mu}_{\sigma(z_{m})_{1}}(f)\bigr|>
\acute{\epsilon}\right) \leq S_{n/m}(\calf)
e^{-(\frac{1}{n}+\frac{1}{m-n})^{-1}\acute{\epsilon}^{2}}
\end{equation}
Substituting $\acute{\epsilon}=\epsilon-\frac{1}{m-n}$ we bound
\[-\Bigl(\frac{1}{n}+\frac{1}{m-n}\Bigr)^{-1}\acute{\epsilon}^{2} \leq 2\epsilon
-\Bigl(\frac{1}{n}+\frac{1}{m-n}\Bigr)^{-1}\epsilon^{2}\]
so that
combining with the bound from Lemma \ref{basiclemma} and Equation \ref{eq1},
we obtain
\[{\cal{P}}_{Z_{n}}\left(\sup_{ f \in
\calf_{z_{n}}}
\bigl|{\mu}(f)-\hat{\mu}(f)\bigr|>
\epsilon \right) \leq 2S_{n/m}(\calf)e^{2\epsilon}e^{-(\frac{1}{n}+\frac{1}{m-n})
^{-1}\epsilon^{2}}\]
and the proof is finished.



\end{proof}

Before we proceed, let us mention that the data dependent version of Vapnik and
Chervonenkis' lemma inducing bounds on estimation error from bounds on
error deviance
(as stated by Devroye et al., 1996) also holds. The proof
is the same as the data independent version.

\begin{lemma}
Let 
$\hat{f}(z_{n})= \arg\min_{f \in \calf_{z_{n}}}{\hat{\mu}(f)}$ denote the empirical mean
minimizer in the
data dependent class $\calf_{z_{n}}$ and let $\mu^{*}(z_{n})=\inf_{f \in
\calf_{z_{n}}}{\mu(f)}$ be the data dependent
optimum. Then
\[ \mu\bigl(\hat{f}(z_{n})\bigr)-\mu^{*}\bigl(z_{n}\bigr) 
\leq 2\sup_{f \in \calf_{z_{n}}}{\bigl|\hat{\mu}(f)-\mu(f)\bigr|}\]

\end{lemma}



\section{Classification}

We now consider classification.
Let $Z=(X,Y)$ with $Y=\{0,1\}$ where $X$ is a metric space with Borel
$\sigma$-algebra inducing a $\sigma$-algebra on $Z$ in the obvious way.
We assume additionally that 
$X$ is Polish so that $Z$ is also Polish and so
Suslin.
Again we consider independent samples so all $n$-sample probability measures will
be the product measures. 
Let the Borel probability measure be decomposed as 
\[ \mu(\acute{f})=\mu_{1}\bigl(\acute{f}(\cdot,1)\bigr)p(1)+\mu_{0}\bigl(\acute{f}(\cdot,0)\bigr)p(0)\]
for two Borel probability measures $\mu_{1}$ and $\mu_{0}$ where we use the
notation $p(1)=p(y=1)$ and $p(0)=p(y=0)$ for the
probability on $Y=\{0,1\}$.
We require that each
function correspond to a classifier hypothesis.
That is, each function $\acute{f}:Z \rightarrow \{0,1\}$ satisfies
\[ \acute{f}(x,y)=I_{\{f(x)=y\}}\]
for some classifier $f:X \rightarrow \{0,1\}$.
Equivalently $\acute{f}$ must satisfy
\[ \acute{f}(x,1)+\acute{f}(x,0)=1\] 
and so is determined by some classifier
$f:X \rightarrow \{0,1\}$ by  $f(x)=\acute{f}(x,1)$.
For any class ${\calc}$
of classifiers $f:X \rightarrow \{0,1\}$ let $\acute{\calc}$
be the class of
functions defined by $\acute{f}(x,1)=f(x)$ and $\acute{f}(x,0)=1-f(x)$.
The decomposition of the measure combined with the constraints for classifier
hypotheses determines the formula for generalization error

\begin{equation}
\label{error}
e(f)= \mu(1-\acute{f})=
\bigl(1-\mu_{1}(f) \bigr)p(1)+\mu_{0}(f)p(0).
\end{equation}

Now we define
\[e^{*}(z_{n})=  \inf_{f \in \calf_{z_{n}}} e(f)\]
to be the optimal generalization error in the data dependent class. 
Also define
\[e^{*}(z_{\infty})=\limsup_{n \rightarrow \infty}e^{*}(z_{n})\] to be its
infinite sample limit.
Let ${\cal B}$ denote the Borel measurable sets. Define
\[ e_{\cal B}= \inf_{f \in {\cal B}}{e(f)}\]
to be the Bayes error.

\section{Data Dependent Classes for Classification}

In this section we introduce and analyze
some specific data dependent hypothesis
classes for classification.
The first is a class we call {\em simple linear classifiers},
and the second is a family of classes
that we refer to collectively as {\em multi-sphere classifiers}
(see Cannon and Cowen, 2000; Marchette and
Priebe, 2001; Priebe, DeVinney, and Marchette, 2001 
for related work).
In each case we
derive bounds on their shatter coefficients and
establish hardness results for
the computational complexity of
empirical risk minimization.
Throughout this section we let 
$Z=(X,Y)$ with $X$ Polish so that $Z$ is also Polish and so
Suslin. 
Also, our data dependent classes $\acute{\calc}$ are
defined through function classes $\calc$ of functions $X \rightarrow \{ 0,1 \}$
as described in the previous section.
With this definition the shatter coefficient for $\acute{\calc}$
is the same as for $\calc$.

\subsection{Simple linear classifiers
\label{sect:slc}}

Here $X=\Re^{d}$ with metric determined
by the usual inner product.
Let the data dependent class $\calc_{x_n}$
be the subset of linear classifiers whose orientations
are determined by the pairwise differences between samples, that is
\[
\calc_{x_n} = \left\{
f,~ 1-f: f(x) =
H \left( (x_n (i) - x_n (j)) \cdot x + b \right) , i,j \le n,
b \in \Re
\right\}
\]
where $H$ is the heaviside function.
The shatter coefficient is determined as follows.
Recall that in the determination of the shatter
coefficient we need to include a ghost sample.

Consider linear classifiers whose orientations are determined by a single sample
pair $(x(i), x(j))$.
The class of sets
$\{ x: (x(i)-x(j)) \cdot x + b < 0 \}$
is ordered by subset relation
as we vary the threshold $b$
and so has VC dimension equal to one (Devroye et al., 1996),
and so the number of subsets of
the $2n$ points is at most $2(2n+1)$.
For any sample of size $2n$ we find at most
${2n \choose 2} = n(2n-1)$ unique sample pairs,
so we can bound $S_{2n/2n}(\calc)$ by $8n^3-2n$. Since each function in $\calc_{x_n}$
is in $\calc_{x_{2n}}$, ~
 $S_{n/2n}(\calc) \leq S_{2n/2n}(\calc)  \le 8n^3 -
2n$.
Note that for this simple class the shatter coefficient
is independent of dimension.
This class also admits tractable learning algorithms
in that empirical risk minimization over $\calc_{x_n}$
can be solved in polynomial time.
To see this consider the run time of the brute force algorithm
which examines all $O(n^2)$ sample pairs and
determines the optimal threshold for each.
Since it takes $O(nd + n \log n )$
to determine the threshold for each pair the
overall run time is $O(n^3(d+\log n ))$.


\subsection{Multi-sphere classifiers
\label{sect:multi-sphere}}

For the computational considerations we let $X=\Re^{d}$ with metric determined
by the usual inner product.
Let $z_{n_1}$ be the ordered subset of $z_n$ with $y = 1$ where the
ordering is induced from $z_n$,
and the random variable
$n_1$ denotes its size.
We define the {\em single sphere} data dependent class as
\[
\calc_{z_n} = \left\{
f: f(x) = H \left(r- \| x - x_{n_1} (i) \| \right) , i \le n_1 ,
r \in \Re_{+}
\right\}
\]
That is, $x$ is assigned the label 1 if it falls in
a closed ball $B(x_{n_1} (i),r) \subseteq X$ of radius $r$ centered at one of
the
data samples with label $y=1$, and 0 otherwise.
Determination of the shatter coefficient parallels the procedure above.
The class of sets $\{ B(x,r), r \geq 0\}$,  is ordered by subset
relation and so has $VC$ dimension
equal to one, and so the number of subsets
of the $2n-1$ remaining points (which include the ghost sample) is
at most $2n$. Now, for any sample of size $2n$ we have at most $2n$ unique
centers, so we can bound $S_{2n/2n}(\calc)$ by $(2n)^2$. Since each function in
$\calc_{z_n}$
is in $\calc_{z_{2n}}$, ~
 $S_{n/2n}(\calc) \leq S_{2n/2n}(\calc)  
 \le (2n)^2$.
Note that once again the shatter coefficient
is independent of dimension.
In addition, empirical risk minimization over $\calc_{z_n}$
can be solved in polynomial time since the
brute force algorithm, which examines all $2n$ centers and
determines the optimal radius for each, has an overall
run time of $O(n^2(d+\log n))$.

Now consider an extension of $\calc_{z_n}$ to multiple spheres.
We define
\[{{\cal C}^{q}_{z_{n}}}= 
\left\{f:f =\bigcup_{i=1}^{q} B\left(x_{q}(i),r_{i}\right), ~ x_{q} \subseteq x_{n_1} \right\} \]
which is the
set of functions corresponding to the union of $q$ closed
balls of $q$ possibly different radii centered at $q$ of the
data points $x_{n_1}$, where $q \le n$.
To bound the shatter coefficient for this class recall that
the number of dichotomies induced by a single ball placed at one of the
data points is at most $2n$. Then we rewrite
\[ {{\cal C}^{q}_{z_{n}}}= \bigcup_{x_{q} \subseteq x_{n_1}}
\bigsqcup_{i=1}^{q}B\left(x_{q}(i),r_{i}\right)\]
where $\bigcup$ denotes the union of classes of sets and $\bigsqcup$ denotes
the operation of forming unions of sets to form new classes. 
As shown by Devroye et al. (1996, page 219), the shatter coefficient of
$A_{1} \bigcup A_{2}$ is bounded by the sum of the shatter coefficients of 
$A_{1}$ and $A_{2}$ and the shatter coefficient of $A_{1} \bigsqcup A_{2}$
is bounded by the product of the shatter coefficients of $A_{1}$ and
$A_{2}$. Consequently,
for $q$ balls placed at $q$ points 
the shattering of $2n$ points
is bounded by $(2n)^{q}$ and for the ${2n \choose q}$
choices of subsets of size
$q$ from $2n$ points the shatter coefficient
for $q$ balls on $2n$
points is at most ${2n \choose q}(2n)^{q} \leq (2n)^{2q}$.
Therefore
$S_{n/2n}({\calc}^{q}) \leq (2n)^{2q}$.
Note that the shatter coefficient remains independent of dimension.

We now discuss a family of multi-sphere classifiers which are obtained
by varying the data dependency.
Specifically we consider variants of the $q$-sphere
class where we vary our specification of the $q$ centers
and their radii.
The classes we consider include all combinations where
\begin{enumerate}
\item
the $q$ centers may be any subset of $x_{n_1}$, or
\item
the $q$ centers are a pre-determined subset of $x_{n_1}$,
\end{enumerate}
and
\begin{enumerate}
\item
a different radius is chosen for each center, or
\item
a single radius (same for all centers) must be chosen, or
\item
the radii are pre-determined (i.e. fixed ahead of time).
\end{enumerate}


The simplest case is where both the index set of center points and
the radii are pre-determined.
This class has only one function and so its shatter
coefficient is 1 and its computational requirements are trivial.

The next level of sophistication is
when $q=1$. We have already considered the case where
both the center sample index and its radius are variable.
The tables below give bounds on the shatter coefficient and
computational requirements for empirical risk minimization
for all combinations with $q=1$.

\begin{center}
\begin{tabular}{|c|cc|}\hline
\multicolumn{3}{|c|}{{\bf Shatter Coefficient Bounds} $q=1$} \\ \hline \hline
\multicolumn{1}{|c|}{} & \multicolumn{2}{c|}{{\em Radius Dependency}} \\
{\em Center Index Dependency} & Fixed & Variable \\ \hline
Fixed & 1 & $2n$ \\
Variable & $2n$ & $(2n)^2$ \\ \hline
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{|c|cc|}\hline
\multicolumn{3}{|c|}{{\bf Computation Bounds} $q=1$} \\ \hline \hline
\multicolumn{1}{|c|}{} & \multicolumn{2}{c|}{{\em Radius Dependency}} \\
{\em Center Index Dependency} & Fixed & Variable \\ \hline
Fixed & $O(1)$ & $O(dn \log n)$ \\
Variable & $O(d n^2 )$ & $O(n^2 (d+ \log n ))$ \\ \hline
\end{tabular}
\end{center}

The next variant we consider is one where the $q>1$
center sample indices are fixed and we
allow a single variable radius
(the same for each ball).
It is easy to show that
the shatter coefficient for this variation is bounded by $2n$,
and the empirical error can be minimized with a brute force
algorithm that runs in $O(nq(d+ \log n ))$ time.

All remaining variants allow either the $q>1$
center sample indices to vary or the $q>1$ radii
to vary.
The table below provides a summary of shatter coefficient bounds
for these variants.

\begin{center}
\begin{tabular}{|c|cc|}\hline
\multicolumn{3}{|c|}{{\bf Shatter Coefficient Bounds} $q>1$} \\ \hline \hline
\multicolumn{1}{|c|}{} & \multicolumn{2}{c|}{{\em Radius Dependency}} \\
{\em Center Index Dependency} & Fixed & Variable \\ \hline
Fixed & 1 & $(2n)^q$ \\
Variable & $(2n)^q$ & $(2n)^{2q}$ \\ \hline
\end{tabular}
\end{center}

We now show that any variant that requires $q>1$ to be determined
algorithmically is NP-Hard. 
The formal definitions for these problems are stated as prize collecting
versions of the class cover problem (Cannon \& Cowen, 200).

\begin{definition}[PRIZE COLLECTING CLASS COVER: PC-CC]
Given an $n$-sample $x_n = (x_{n_0} , x_{n_1})$,
$x_n (i) \in X$,
a distance function $d: X \times X \rightarrow \Re^+$,
and a positive integer $q \le n_1$.
Determine a subset $x_q \subseteq x_{n_1}$ of size $q$
and a set of radii $r_q$, $r_q (i) \in \Re^+$, one for each sample in $x_q$
that minimizes the error
\[
e ( x_q ) = |  x_{n_1} \cap \bar{x}_c | + | x_{n_0} \cap x_c |
\]
where $x_c$ is the set of points covered by the $q$ balls,
\[
x_c =
\{ x_n (i) : \exists x_{n_1} (j) \in x_q \mbox{ such that }  d(x_n (i) , x_{n_1}
(j) ) \le r_{n_1} (j) \}
\]
and $\bar{x}_c = x_n \setminus x_c$.
\end{definition}

\begin{definition}[SINGLE RADIUS CLASS COVER: SRCC] This problem is the same as PC-CLASS COVER except that
every sample in $x_q$ is forced to use the same radius,
so only a single radius value must be determined.

\end{definition}

\begin{definition}[FIXED RADIUS CLASS COVER: FRCC]
This problem is the same as PC-CLASS COVER except that
the radii are fixed, i.e.
a fixed set of radii $r_{n_1}$,
one for each sample in $x_{n_1}$,
is provided as part of the problem instance,
and we must determine only $x_q$.
\end{definition}

\begin{definition}[FIXED CENTER CLASS COVER: FCCC]
This problem is the same as PC-CLASS COVER except that
the centers are fixed, i.e.
$x_q$ is provided as part of the problem instance, 
and we must determine only $r_q$.
\end{definition}

The last variant presented involves fixing $q>1$ centers and algorithmically
determining the corresponding radii. We do not present a hardness proof for this
version but conjecture that even this simple variant is hard for $q>1$.
The following theorem gives hardness results for the first three variants.

\begin{theorem}
PC-CC, SRCC and FRCC are NP-Hard.
\label{thm:cc_hard}
\end{theorem}

\begin{proof}
Our proofs are by reduction from $K$-CENTER
(Hochbaum, 1997,  Section 9.4.1).

\begin{definition}[K-CENTER: KC]
Given a weighted graph $G = (V,E)$ with $|V| = n$, and a positive integer
$k \le n$.
Let $\{ w_{i,j} \}$ represent the set of shortest path distances
between vertices $v_i$ and $v_j$ in $V$.
Find a subset of vertices $S$ of size $k$
so that the greatest distance among all vertices in $V$ to a nearest
vertex in $S$ is minimized. More formally,
\[
\min_{S \subset V, |S|=k} \{\max_{v_i \in V \backslash S} \min_{v_j \in S}
w_{i,j}\}.
\]
\end{definition}

We begin with the reduction to PC-CC, and then show how the same
reduction can be used for both SRCC and FRCC as well.

Notice that the value of the objective function will always be one of
$\binom{n}{2}$ possible shortest path distances.
We will generate at most $\binom{n}{2}$ instances of PC-CC.
First, order the $\binom{n}{2}$ distances $w_{i,j}$. Let $d_{(i)}$ be the
$i^{th}$
smallest distance and let $\epsilon$ be half of the smallest nontrivial pairwise
difference between the distances,
\[
\epsilon = \min_{d_{(i)} \neq d_{(j)}} \frac{1}{2}|d_{(i)} - d_{(j)}|
\]
Now we generate a copy of the weighted graph $G(V,E)$ and color
every vertex blue. 
Add $n$ red vertices, each with an edge to a corresponding
blue vertex that has weight
$d_{(\binom{n}{2})} + \epsilon$.
Note that each
blue vertex now has a red vertex whose shortest path distance is
$d_{(\binom{n}{2})} + \epsilon$ and no
red vertices closer.
The shortest path distances for all vertices in this modified graph
satisfy the properties of
a metric and can thus be embedded as points in a Euclidean space
of finite dimension in polynomial time (Young, 1938).
These points serve as inputs to the PC-CC problem where
we make a one to one correspondence between the blue (red) vertices
and the points in $x_{n_1}$ ($x_{n_0}$).
Now set $q=k$ and solve the PC-CC problem.
Observe that this version of PC-CC is trivially
solved with zero error. 

We repeat the process now using $d_{(\binom{n}{2}-1)}$. The red vertices are
now ``one step'' closer. This time we have no guarantee of existence of a
zero error solution. If the error
is zero, then we repeat the process bringing the red nodes another step closer.
We continue until we reach a PC-CC solution with nonzero error.
As soon as this occurs the blue vertices corresponding to the
points $x_q$ of the most recent zero error
solution provide to an optimal solution to the original K-CENTER instance.
Indeed if this were not the case then any superior solution to the K-CENTER
instance would lead to a zero error solution for the most current
version of PC-CC. If it turns out that all $\binom{n}{2}$
versions of PC-CC have zero error solutions then at least
$n-k$ distances must be the same and the latest (final) solution is optimal
for the $k$-center problem.

Hence if we could solve PC-CC
in polynomial time, $T_{pc-cc}(n_0 + n_1)$,
then we could solve K-CENTER on $n$ vertices
in $O(n^2 \cdot T_s (n) \cdot T_e (n) \cdot T_{pc-cc}(n))$ time,
where $T_s (n)$ is the time required to compute the
shortest path distances for the modified graph,
and $T_e (n)$ is the time required to embed the modified
graph into a Euclidean space.

The same reduction can be used to prove hardness for SRCC by noting
that at each step, as we move the red vertices closer,
the set of optimal solutions (e.g. points and their radii)
always includes an equal radius solution.
For example at the step where $d_{(i)}$ is used to
determine the new edge weights for the modified graph, if there is
a zero error solution then there is a zero error
solution where the radii for all $q$ blue points
are set to $d_{(i)}$ (at this radius each
of the $q$ blue points covers as many other blue points
as possible without introducing errors).
So the reduction works when we substitute SRCC
for PC-CC at each step.

The reduction also works when we substitute
FRCC for PC-CC. This follows directly from
the observation that if there is
a zero error solution, then there is a zero error
solution whose radii are all $d_{(i)}$.
This means that we can simply set
all the radii in $r_{n_1}$ to $d_{(i)}$ and solve
FRCC (instead of PC-CC).

\end{proof}


In summary, we have provided bounds on both
the structural and computational complexity
for a family of spherical classifiers.
However, we have not provided a procedure for
selecting the best member from this family.
This is the topic of the next section.
\section{Structural Risk Minimization}

In classification it often not clear what hypothesis space to choose. 
Vapnik (1998) addressed this problem by considering a
sequence of hypothesis spaces large
enough to contain a good classifier. The
structural risk minimization procedure (Vapnik, 1998) is a method designed to find
this good classifier even though classifier complexity of the sequence is
large.
We now present a structural risk minimization procedure and prove convergence
for a sequence of data dependent classifier spaces. 
 Although utilized in
the construction of classifiers, we present the analysis in terms of empirical
process theory. We consider a sequence
$\calf= \{ \calf^{q},q=1,....\}$ of data dependent classes $\calf^{q}$. 
The
symbol $\calf$  used here is the same as we used for the data dependent class
before. However, now the data dependent classes in the sequence are denoted
$\calf^{q}$ and $\calf$ is the sequence so there should be no confusion. We
say that $\calf$ is image admissible Suslin if each $\calf^{q}$ is. For
simplicity of presentation we set $m=2n$.  Let 
$S_{q,n}$ denote bounds on the shatter coefficients
$S_{n/2n}(\calf^{q})$.
Our method is similar to that presented by Vapnik
(1998) (also see Devroye, Gy\"{o}rfi, and
Lugosi, 1996). Given a training set ${z}_n$ we
select a function $\tilde{f}^{q}_{z_{n}}$ from every class
$\calf^{q}_{z_{n}}$ which minimizes empirical mean over the
class. From these we select the function $f^*_{z_{n}}$ that
minimizes over $q$ the complexity penalized function:
\[\tilde{\mu}(\tilde{f}^{q}_{z_{n}}) = \hat{\mu}(\tilde{f}^{q}_{z_{n}}) + r(q,n) \]
where $\hat{\mu}$ is the empirical mean and $r(q,n)$ is the penalty
term
\[r(q,n) =  \sqrt{2\frac{\log{en}}{n\log{n}} \log(S_{q,n})}\]

Let \[\mu^{*}_{q}(z_{n})=  \inf_{f \in \calf^{q}_{z_{n}}} \mu(f) \] define
the
function $\mu^{*}_{q,n}$ and
\[\mu^{*}(z_{\infty})=\inf_{q}\limsup_{n \rightarrow \infty}\mu^{*}_{q}(z_{n})\]
define the
function $\mu^{*}_{\infty}$.
If we let the intermediate
\[\mu^{*}_{q}(z_{\infty})=\limsup_{n \rightarrow \infty}\mu^{*}_{q}(z_{n})\]
define the
function $\mu^{*}_{q}$, then
\[\mu^{*}(z_{\infty})=\inf_{q}\mu^{*}_{q}(z_{\infty}).\]
$\mu^{*}_{q,n}$ can be extended to a function with the same name on $Z_{\infty}$
in the obvious way. In general we make this extension without notice.
When we apply this theory to classification we make use of Equation \ref{error} for
the generalization error
\[e(f)= \mu(1-\acute{f})=
p(1)+\mu_{0}(f)p(0)-\mu_{1}(f)p(1)\]
and the corresponding definitions for
$e^{*}_{q,n}$,$e^{*}_{\infty}$, $e^{*}_{q}$.


Let ${\cal SE}$ denote the class of subexponential functions on the positive
integers. That is $g \in {\cal SE}$ if and only if
\[ \sum_{n > 0} g(n)e^{-n\alpha} < \infty\]
for all $\alpha > 0$. This class includes the polynomials, polynomials with
logarithms and even such functions as $e^{\frac{n}{\log{n+1}}}$.
We use the bound from Theorem \ref{main} with $m=2n$
\[{\cal{P}}_{Z_{n}}\Bigl( \sup_{ f \in \calf^{q}_{z_{n}}}\bigl|\mu(f)-\hat{\mu}(f)\bigr|>\epsilon \Bigr) \leq
2S_{q,n}e^{2\epsilon}e^{-\frac{n\epsilon^{2}}{2}}.\]


The proof of the 
the following structural risk minimization theorem closely resembles a theorem presented by Devroye et al. (1996, Theorem 18.2). However, because of the data dependencies, the conditions of the
theorem are somewhat different than usual. Let ${\it wp1}$  
 denote the terminology ``with probability $1$''. 
\begin{theorem}
\label{structuralrisk}
Suppose that $\calf$ is image admissible Suslin and
 there
exists bounds $S_{q,n}$ for the shatter coefficients  such that
$\sum_{q}S_{q,n}^{-\frac{1}{
\log{n}}} \in
{\cal SE}$ and $S_{q,n} \in {\cal SE}$ for every $q$.
Let $f^*_{z_n}$ be determined by the structural risk minimization procedure. Then
$\lim \sup_{n \rightarrow \infty}\mu(f^*_{z_{n}}) \leq \mu^*_{\infty} 
~ {\it wp1}$.
\end{theorem}

\subsection{Structural assumptions on $\calf$}

This structural risk theorem was very general and assumed only that $\calf$
was image admissible Suslin, but assumed nothing about the relationship between
$\calf^{q_{1}}_{n_{1}}$ and $\calf^{q_{2}}_{n_{2}}$. 
Although stronger limit theorems can be
made under structural assumptions about $\calf$ we analyze this question
at present only in
terms of 
$\mu^{*}_{\infty}$. 

\subsubsection{$z$-increasing $\calf$}

\begin{definition}
We say that $\calf$ is $z$-increasing if
$\calf^{q}_{z_{n}} \subseteq \calf^{q}_{z_{\acute{n}}}$
 when $z_{n} \subseteq z_{\acute{n}}$
for each $q$.
\end{definition}
If $\calf$ is $z$-increasing,
 then
$\limsup_{n \rightarrow \infty}\mu^{*}_{q}(z_{n})= \inf_{n}\mu^{*}_{q}(z_{n})$
and so
\[\mu^{*}(z_{\infty})= \inf_{q} \inf_{n}\mu^{*}_{q}(z_{n})=
\inf_{q} \inf_{n}\inf_{f \in \calf^{q}_{z_{n}}} \mu(f)\]
but since generally $\inf_{w}\inf_{v}=\inf_{w,v}$ we obtain that
\[\mu^{*}(z_{\infty})=\inf_{f \in \calf}\mu(f)\]
and since $\mu(f^*_{z_{n}}) \geq \inf_{f \in
\calf}\mu(f)=\mu^{*}(z_{\infty})$ we obtain the following much stronger
theorem.

\begin{theorem}
\label{structuralrisk2}
Suppose the data dependent sequence $\calf$ is $z$-increasing and image admissible
Suslin and there
exists bounds $S_{q,n}$ for the shatter coefficients such that
$\sum_{q}S_{q,n}^{-\frac{1}{
\log{n}}} \in
{\cal SE}$ and $S_{q,n} \in {\cal SE}$ for every $q$. 
Let $f^*_{z_n}$ be determined by the structural risk minimization procedure. Then
$\mu(f^*_{z_{n}}) \underset{n \rightarrow
\infty}{\longrightarrow} \inf_{f \in \calf} \mu(f) ~ {\it wp1}$.
\end{theorem}

An interesting thing also happens to the shatter coefficients.
\begin{lemma}
\label{increasing}
Suppose the data dependent class ${\cal G}$ is $z$-increasing. Then
\[{\cal N}_{n}(z_{m},{\cal G}) \leq {\cal N}_{\acute{n}}(z_{m},{\cal G})\] 
and 
\[S_{n/m}({\cal G}) \leq S_{\acute{n}/m}({\cal G})\]
for all $n \leq \acute{n} \leq m$.
In particular, 
\[S_{n/m}({\cal G}) \leq S_{m/m}({\cal G}).\]
\end{lemma}
\begin{proof}
The $z$-increasing assumption  ${\cal G}_{z_{n}} \subseteq {\cal G}_{z_{\acute{n}}}$
when $z_{n} \subseteq z_{\acute{n}}$ implies that
\[{\cal N}_{n}(z_{m},{\cal G})=\Bigl|\{\{z_{m}(1),...,z_{m}(m)\} \cap I_{f}: f \in
{\cal G}_{w_{n}}, w_{n} \subset
z_{m}\}\Bigr|\]
\[ \leq \Big|\{\{z_{m}(1),...,z_{m}(m)\} \cap I_{f}: f \in
{\cal G}_{z_{\acute{n}}}, z_{\acute{n}} \subset z_{m} \}\Bigr|=
{\cal N}_{\acute{n}}(z_{m},{\cal G})\] 
and taking the supremum over $z_{m}$ finishes the proof.
\end{proof}

We define the data dependent $VC$ dimension of order $m$ as
\[ VC_m({\cal G})=\max \{n: \exists z_{n} \subseteq z_{m}: {\cal G}_{z_{m}}
\mbox{shatters}~ z_{n}\}.\]
In words  $VC_m({\cal G})$ is the size of the largest subset of some $m$ points
which is shattered by the data dependent class on those $m$ points.
The Sauer lemma (Devroye et al., 1996) can be directly applied. In particular
 \[S_{n/m}({\cal G}) \leq S_{m/m}({\cal G}) \leq m^{VC_m({\cal G})}+1.\]




\subsubsection{Symmetric $\calf$}
\label{symmetric}
We say that $\calf$ is $z$-symmetric if
\[\calf^{q}_{z_{n}}=\calf^{q}_{\sigma(z_{n}
)}
\]
for every permutation $\sigma$ of the $n$ points. In this section we will show that
when $\calf$ is $z$-symmetric,
${\mu}^{*}_{\infty}$  is measurable and
constant ${\it wp1}$. As we mentioned in the introduction, such measurability
considerations are appropriate through most of this work, but feel these
technicalities would obscure the presentation. However, the following result depends
so critically on measurability assumptions that we include these
technicalities here.
We now assume that $\calf$ is $z$-symmetric.
Before we state the next theorem we need to introduce some terminology.
A universally measurable set in a measurable space is a set $M$ which is measurable for
the completion of every measure.
 In particular, for each measure $\nu$ their exists measurable sets 
\[ A \subset M \subset B\]
such that $\nu(A)=\nu(B)$.  Consequently, we can and will be a little sloppy and say
$\nu(M)=\nu(A)=\nu(B)$ for the universally measurable set $M$. A universally measurable
function between measurable spaces is one such that the preimage of a measurable set
is a universally measurable set. For a specific measure, we can consider universally
 measurable sets as
measurable for the completion of the measure (Ash, 1972). Indeed not only do they
form a
$\sigma$-algebra, but it is elementary to show that if $\{M_{n}\}$ is a 
countable collection of
universally measurable sets such that
\[ A_{n} \subset M_{n} \subset B_{n}\]
with $\nu(A_{n})=\nu(B_{n})$ for each $n$, then
$\bigcup_{n}M_{n}$ and $\bigcap_{n}M_{n}$ are both universally measurable, 
\[ \bigcup_{n}A_{n} \subseteq \bigcup_{n}M_{n} \subseteq \bigcup_{n}B_{n},\]
\[ \bigcap_{n}A_{n} \subseteq \bigcap_{n}M_{n} \subseteq \bigcap_{n}B_{n},\]
$\nu( \bigcup_{n}A_{n})=\nu( \bigcup_{n}B_{n})$, and $\nu(\bigcap_{n}A_{n})=
\nu(\bigcap_{n}B_{n})$.




\begin{theorem}
\label{const}
Suppose that  $Z$ is a Suslin space with Borel measure
 $\mu$. Suppose that
the data dependent sequence $\calf$ is $z$-symmetric and image admissible
Suslin. 
Then
$\mu^{*}_{\infty}$ is universally measurable and 
$\mu^{*}_{\infty}=$ constant ${\it wp1}$.
\end{theorem}
\begin{proof}
Since  $\calf$ is image admissible Suslin, for fixed $q$,
$\Xi^{q}_{n}$ is measurable. Since $\Xi^{q}_{n}$ is positive Fubini's theorem tells us that
\[ \mu\left(f(T_{q}(y_{n})_{z_{n}})\right)=
\mu\left(\Xi^{q}(y_{n},z_{n},\cdot)\right) \]
is jointly measurable.
By the theorem of Sainte-Beuve as presented by Dudley (1999)
\[\mu^{*}_{q}(z_{n})= \inf_{f \in \calf_{z_{n}}}{\mu(f)}= \inf_{y_{n} \in Y_{n}}
{\mu\left(f(T_{q}(y_{n})_{z_{n}})\right)}\]
is universally measurable on $Z_{n}$ and extends naturally to a universally
measurable function on $Z_{\infty}$.  We work on $Z_{\infty}$ for the rest of this
proof.
In particular, for every $c$ the set $\{ \mu^{*}_{q,n}\geq c\}$
is universally measurable. Consequently, there exists measurable sets
\[ A_{q,n} \subset \{ \mu^{*}_{q,n}\geq c\} \subset B_{q,n} \]
such that ${\cal P}_{Z_{\infty}}(A_{q,n})={\cal P}_{Z_{\infty}}(B_{q,n})$.
Since $\calf$
is $z$-symmetric ${\{\mu^{*}_{q,n}\geq c\}}$ is invariant under permutation of the
first $n$ positions in $Z_{\infty}$. If
we let $S^{+}_{n}$ denote the upper symmetric envelope of a set (the union over all
permutations of the first $n$ position indices) and $S^{-}_{n}$ the
 lower symmetric envelope
it is clear from the fact that universally measurable sets behave like measurable
sets with respect to unions and intersections that 
$ {\cal A}_{q,n}=S^{+}_{n}(A_{q,n})$ and $ {\cal B}_{q,n}=S^{-}(B_{q,n})$
 are measurable, symmetric in there first $n$ positions, and
\[A_{q,n} \subseteq {\cal A}_{q,n} \subseteq \{ \mu^{*}_{q,n}\geq c\} \subseteq {\cal
B}_{q,n} \subseteq B_{q,n}.\] Consequently
${\cal P}_{Z_{\infty}}({\cal A}_{q,n})={\cal P}_{Z_{\infty}}({\cal B}_{q,n})$.
Now consider
\[ \mu^{*}(z_{\infty})=\inf_{q}\limsup_{n \rightarrow \infty}\mu^{*}_{q}(z_{n})=
\inf_{q}\inf_{N}\sup_{n \geq N}\mu^{*}_{q}(z_{n})\] so that
\[ \{ \mu^{*}_{\infty} \geq c\}= \bigcap_{q}\bigcap_{N}\bigcup_{n \geq N}\{
\mu^{*}_{q,n} \geq c\}.\] It is clearly universally measurable and
if we define ${\cal A}_{\infty}= \bigcap_{q}\bigcap_{N}\bigcup_{n \geq N}{\cal
A}_{q,n}$ and ${\cal B}_{\infty}= \bigcap_{q}\bigcap_{N}\bigcup_{n \geq N}{\cal
B}_{q,n}$,
\[ {\cal A}_{\infty}
 \subseteq \{ \mu^{*}_{\infty} \geq c\}
\subseteq  {\cal B}_{\infty}\]
and
\[{\cal P}_{Z_{\infty}}({\cal A}_{\infty})={\cal P}_{Z_{\infty}}({\cal B}_{\infty}).\]
Since ${\cal A}_{q,n}$ is symmetric in its first $n$ components,
$\bigcup_{n \geq N}{\cal A}_{q,n}$ is symmetric in the first $N$. Consequently,
$\bigcap_{N}\bigcup_{n \geq N} {\cal A}_{q,n}$ is symmetric for any finite
permutation and the same is true for ${\cal A}_{\infty}$.
This argument works just as well for 
${\cal B}_{\infty}$. Therefore both  
 ${\cal A}_{\infty}$ and ${\cal B}_{\infty}$ are
measurable and symmetric.
By the Hewitt-Savage Zero-One law (Shiryaev, 1980)
\[{\cal P}_{Z_{\infty}}({\cal A}_{\infty})={\cal P}_{Z_{\infty}}({\cal
B}_{\infty})=0 ~ \mbox{or} ~1\]
and the proof is finished.
\end{proof}
Consequently, when $\calf$ is both $z$-symmetric and $z$-increasing we obtain
that $\mu(f^{*}_{z_{n}})$ converges to $ \mu^{*}_{\infty}=\inf_{f \in \calf}
 \mu(f)$ $ ~{\it
wp1}$ and
that  $ \mu^{*}_{\infty} =\inf_{f \in \calf} \mu(f)= \mbox{constant} ~{\it wp1}$. 

Note that in the course of proving this theorem we have shown that for a fixed
symmetric image admissible Suslin data
dependent class,
the limit 
$\mu^{*}_{\infty} =\limsup_{n \rightarrow \infty}\mu^{*}(z_{n})$ 
is $\mbox{constant} ~{\it wp1}$. 


\subsection{Structural risk minimization for multi-sphere classifiers
\label{sect:srm_multi-sphere}}

When the structural risk minimization theorem is applied it would be very nice to be
able to characterize $\mu^{*}_{\infty}$. For example, when is
$\mu^{*}_{\infty}$ the minimum value over the
unconstrained sequence ${\cal UF}= \bigcup_{q,n,z_{n} \in Z_{n}}\calf^{q}
_{z_{n}} ~{\it wp1}$? We can provide no general facts in this
regard at present but in
the this section we show that this is true for multi-sphere classifiers. 
We first show that 
$e^{*}_{\infty}$ is the Bayes optimal and then show we can apply the Structural Risk
Minimization Theorem \ref{structuralrisk2}.  The analysis below works for closed or
open balls, balls all
of the same radii, and when the complements of the union of the balls
 is included in the
class of functions.

\begin{theorem}
\label{classcover}
Consider the data dependent multi-sphere classifiers on a Polish space $X$ with Borel
measure $\mu$ on $Z=(X,\{0,1\})$. Let $e_{\cal B}$ denote the Bayes error. 
Let the structural risk minimization procedure be applied with $S_{q,n}=(2n)^{2q}$
to 
produce the classifier $f^*_{z_{n}}$.
Then $e(f^*_{z_{n}}) \underset{n \rightarrow
\infty}{\longrightarrow} e_{\cal B}~ {\it wp1}$. 
  
\end{theorem}

\begin{proof}


To prove this theorem we have to do two things. The first is to show that $e^{*}_{\infty}$ is
universally measurable and
that $e^{*}_{\infty}=e_{\cal B} ~{\it wp1}$.
The second is to show that we can apply the
Structural Risk Minimization Theorem \ref{structuralrisk2}. We now proceed with the
first.


\begin{lemma}
\label{bayes}
Consider a Borel measurable random variable $Z=(X,\{0,1\})$ with $X$ Polish and Borel
measure $\mu$.
Then for the data dependent multi-sphere
classifiers,  $e^{*}_{\infty}$ is universally measurable and  
$e^{*}_{\infty}=e_{\cal B} ~{\it wp1}$.
\end{lemma}

\begin{proof}
We now show that for the multi-sphere classifiers that the 
evaluation function $\Xi^{q}$ is jointly measurable for each $q$.
The data dependent classes $\acute{{\cal C}}$ of functions on $Z$ are  
the data dependent classes $ \bigcup_{q \leq n}
\acute{{\cal C}^{q}} $
where ${\cal C}^{q}_{x_{n}}$ is the set of functions on $X$ which are one
on the union of a set of $q$ balls placed on the $x$ component of the data samples.
 It is clear that $\acute{\cal C}$ is both 
$z$-symmetric and $z$-increasing.
Consider the class ${\cal B}_{x_{q}}$ for a
fixed set of $q$ points $x_{q}$. We define the parameter space of radii
 $W_{q}=\Re_{+}^{q}$ and the
parameterization
that $T_{x_{q}}(w_{q})(z)= \max_{1 \leq i \leq q}{I_{|x-x_{q}(i)|\leq w_{q}(i)}}$
where $I$ is the indicator function.
For each $i$ the set $I_{|x-x_{q}(i)|\leq w_{q}(i)}$ is closed and so
measurable in $W_{q},Z_{q},X$.  Consequently the evaluation function
\[\Xi^{q}(w_{q},z_{q},X)=T_{x_{q}}(w_{q})(z)= \max_{1 \leq i \leq q}{I_{|x-x_{q}(i)|\leq
w_{q}(i)}}\]
is jointly measurable. 
Since the measurable sets are closed under complement,
the evaluation function for the class corresponding to the functions on $Z$ is measurable on
$W_{q},Z^{q},Z$.
Consequently, the conditions of
Theorem \ref{const} are satisfied and so $e^{*}_{\infty}$ is universally measurable and 
$e^{*}_{\infty}$ is constant ${\it wp1}$. What is left is to show that this 
constant is
$e_{\cal B}$.

To this end note that 
\[ e_{\cal B} \leq
 e^{*}_{q}(z_{n})\]
so that
\[e_{\cal B} \leq e^{*}(z_{\infty}).\]
Now we prove that 
\[e^{*}_{\infty} \leq e_{\cal B}\]
${\it wp1}$ which will then finish the proof.
The idea is as follows. We keep an eye on the generalization error (Equation
\ref{error})
$e(f)= 
p(1)+\mu_{0}(f)p(0)-\mu_{1}(f)p(1)$ and show that for any $f$ we can use a data
dependent placement of balls to form a classifier $f^{*}$ such that
$\mu_{0}(f^{*})$ is never much larger than
$\mu_{0}(f)$ while at the same time show that for large enough $n$ with high probability $\mu_{1}(f^{*})$ is
not much smaller than $\mu_{1}(f)$. For the first part we utilize the continuity of
the measure of a decreasing family of measurable sets and for the second the fact that
the measure of a measurable set on a Polish space can be approximated from below by
the measure of compact measurable sets. This compactness allows us to show that with
high probability the randomly placed set of balls will have $\mu_{1}$ measure not too
much smaller than that of $f$. Consequently from the generalization error formula we
see that with high probability we can control the increase in generalization error
 over the generalization error of any measurable set through
a random placement of balls. We now carry out the details.

Let $\epsilon_{1} > 0$. Then there exists a measurable $f_{\epsilon_{1}}$ such that
\begin{equation}
\label{b1}
 e(f_{\epsilon_{1}}) \leq e_{\cal B} +\epsilon_{1}.
\end{equation}
Since $X$ is Polish it follows (see e.g. Ash, 1972) that
$f_{\epsilon_{1}}$ can be approximated from below by compact sets. That is, given any
$\epsilon_{2}>0$ there is a compact $h_{\epsilon_{2}} \leq f_{\epsilon_{1}}$ such that
\begin{equation}
\label{b3}
\mu_{1}(f_{\epsilon_{1}})-\epsilon_{2} \leq \mu_{1}(h_{\epsilon_{2}}) \leq
 \mu_{1}(f_{\epsilon_{1}}).
\end{equation}
Since it is compact, for any covering of
$h_{\epsilon_{2}}$ of open balls
 $B^{\epsilon_{3}}_{i}$ of radius $\epsilon_{3}$ there is a finite subcovering. From
this subcovering choose the positive measure subcovering such that $\mu_{1}(B^{\epsilon_{3}}_{i}) > 0$ for
each ball in the subcover. Note that the positive measure subcover is not really a
covering. However  $\mu_{1}(\bigcup B^{\epsilon_{3}}_{i})\geq
\mu_{1}(h_{\epsilon_{2}})$ for the positive measure subcover. Consider the classifier $f^{2\epsilon_{3}}_{z_{n}}
\in {\cal F}_{z_{n}}$
which places closed balls  $\bar{B}^{2\epsilon_{3}}$ of radius
$2\epsilon_{3}$ at each of the data points which satisfy $h_{\epsilon_{2}}(x)=1$. The
triangle inequality implies 
that \[\bar{B}^{2\epsilon_{3}}(x) \supset B^{\epsilon_{3}}\]
whenever $ x \in B^{\epsilon_{3}}$ so that 
if the
sample $z_{n}$ has at least one point in every ball $B^{\epsilon_{3}}_{i}$ of the positive
measure subcover, then $f^{2\epsilon_{3}}_{z_{n}} \geq h_{\epsilon_{2}}$
and so 
\begin{equation}
\label{b4}
 \mu_{1}(f^{2\epsilon_{3}}_{z_{n}}) \geq \mu_{1}(h_{\epsilon_{2}}).
\end{equation}
From
 Equation \ref{b3} we see that for $ \epsilon_{2} <
  \mu_{1}(f_{\epsilon_{1}})$, $\mu_{1}(h_{\epsilon_{2}})> 0$ and 
since the positive measure subcover is finite we know then for large enough $n$
 with high probability there
will be at least one point in every ball. That is, given an $\epsilon_{4} >0$ 
there exists an $M(\epsilon_{4})$ such that
\[ {\cal P}_{Z_{\infty}}\Bigl( \mu_{1}(f^{2\epsilon_{3}}_{z_{n}}) \geq
\mu_{1}(h_{\epsilon_{2}}) \Bigr) \geq  1-\epsilon_{4}\] 
for all $n \geq M(\epsilon_{4})$. 
Now 
$f^{2\epsilon_{3}}_{z_{n}} \leq N_{2\epsilon_{3}}(h_{\epsilon_{2}})$ where
$N_{\epsilon}(A)$ is the set of all points at distance less that or equal to
$\epsilon$ from the set $A$.   
Consequently, $\mu_{0}(f^{2\epsilon_{3}}_{z_{n}}) \leq
\mu_{0}(N_{2\epsilon_{3}}(h_{\epsilon_{2}}))$ and 
so
$\mu_{0}(f^{2\epsilon_{3}}_{z_{n}}) \leq \mu_{0}(N_{2\epsilon_{3}}(f_{\epsilon_{1}})).$
Since in addition, $\mu_{1}(h_{\epsilon_{2}}) \geq
\mu_{1}(f_{\epsilon_{1}})-\epsilon_{2}$, if we put this together in the generalization
formula
 we obtain
 \begin{multline*}
\lefteqn{{\cal P}_{Z_{\infty}}\Bigl(e(f^{2\epsilon_{3}}_{z_{n}}) \leq
e(f_{\epsilon_{1}})+p(0)\Bigl(\mu_{0}(N_{2\epsilon_{3}}(h_{\epsilon_{2}}))-
\mu_{0}(h_{\epsilon_{2}})\Bigr)} \\
+p(0)\Bigl(\mu_{0}(h_{\epsilon_{2}})-\mu_{0}(f_{\epsilon_{1}})\Bigr) +
p(1)(\epsilon_{2})\Bigr)
 \geq 1-\epsilon_{4}.
 \end{multline*} 
Since $e(f_{\epsilon_{1}}) \leq e_{\cal B} +\epsilon_{1}$ and $h_{\epsilon_{2}} \leq
f_{\epsilon_{1}}$ we further obtain
\[ {\cal P}_{Z_{\infty}}\Bigl(e(f^{2\epsilon_{3}}_{z_{n}}) \leq
e_{\cal B}+ \epsilon_{1} + p(0)\Bigl(
\mu_{0}(N_{2\epsilon_{3}}(h_{\epsilon_{2}}))-
\mu_{0}(h_{\epsilon_{2}})\Bigr) +p(1)(\epsilon_{2})\Bigr) \geq
1-\epsilon_{4}.\]

Since $e^{*}(z_{n}) \leq e(f^{2\epsilon_{3}}_{z_{n}})$ we obtain
\[ {\cal P}_{Z_{\infty}}\Bigl(e^{*}_{n} \leq
e_{\cal B}+ \epsilon_{1} + p(0)\Bigl(
\mu_{0}(N_{2\epsilon_{3}}(h_{\epsilon_{2}}))-
\mu_{0}(h_{\epsilon_{2}})\Bigr) +p(1)(\epsilon_{2})\Bigr) \geq
1-\epsilon_{4}\] where, as before, since we know that $e_{n}$ is universally
measurable the probability statement is for the probability of the two measurable sets
that trap the event.
The proof of Theorem \ref{const} showed that the limit of a decreasing sequence of universally
measurable sets was universally measurable and that the probability of the limit is the
limit of the 
sequence of probabilities. Consequently,
\[ {\cal P}_{Z_{\infty}}\Bigl(e^{*}_{\infty} \leq
e_{\cal B}+ \epsilon_{1} + p(0)\Bigl(
\mu_{0}(N_{2\epsilon_{3}}(h_{\epsilon_{2}}))-
\mu_{0}(h_{\epsilon_{2}})\Bigr) +p(1)(\epsilon_{2})\Bigr) =
1.\]
Since a metric space is Hausdorff, the compact set $h_{\epsilon_{2}}$ is closed
and so the $\mu_{0}(N_{2\epsilon_{3}}(h_{\epsilon_{2}}))$ converges to
$\mu_{0}(h_{\epsilon_{2}})$ as $\epsilon_{3}$ goes to zero. Consequently we can choose
$\epsilon_{3}$ small enough so that
\[ {\cal P}_{Z_{\infty}}\Bigl(e^{*}_{\infty} \leq
e_{\cal B}+ \epsilon_{1} + 
\epsilon_{2}\Bigr) =
1.\]
Letting $\epsilon_{2}$ and  $\epsilon_{1}$ go to zero, we obtain
\[ {\cal P}_{Z_{\infty}}\Bigl(e^{*}_{\infty} \leq
e_{\cal B} 
\Bigr) =
1\] 
and the proof is finished.
\end{proof}

We have already shown $\acute{\cal C}$ to be image admissible Suslin  
and have observed that it is also both $z$-symmetric and $z$-increasing.  
From Theorems \ref{structuralrisk2},\ref{const}, and Lemma \ref{bayes} all that is 
left is to show that the bound on the shatter coefficient $S_{q,n}$
satisfies the conditions of Theorem
\ref{structuralrisk2}. 
In Section \ref{sect:multi-sphere}
we showed that $S_{n/2n}(\acute{{\cal C}}^{q}) \leq (2n)^{2q}$.
Thus, a direct calculation with $S_{q,n} = (2n)^{2q}$ shows
that the conditions of Theorem \ref{structuralrisk2} are satisfied.
We note that these shattering bounds are dimension independent, whereas if the
balls could be located anywhere this would not be the case.


\end{proof}
 


\section{A framework for classification}

The empirical process results in the preceding sections
form the basis of a new framework to
analyze the performance of statistical
learning paradigms.
In this framework classifier design is 
decomposed into two components;
the first component is the restriction to the data dependent
hypothesis class
and the second is empirical risk minimization
within that class.
Analysis within this framework involves the
the study of the decomposition induced
on both error and computation.
There are several goals we might hope to accomplish with this framework:
tighter performance bounds for existing learning paradigms,
simpler (or more elegant) proofs, greater flexibility in the analysis, and
the discovery of new mechanisms
for controlling error and/or computation (which may
lead to new learning paradigms).
In this section we discuss
progress along these fronts using the
VC framework as a benchmark for comparison.
Specifically, we make comparisons between error bounds for traditional
classes $C$, and their data dependent counterparts $\calc$,
where the data dependency is introduced in different ways.
A natural way of defining a data dependent class
is to impose a constraint on a traditional class $C$ 
to obtain $\calc_{z_n}$.
If this is not the case we define $C = \cup_{z_n} \calc_{z_n}$
as the traditional hypothesis class.

We now examine the structure of the induced error decomposition.
It is common to break the generalization error into two components,
{\em approximation error} and {\em estimation error},
which are defined as follows.
The approximation error for $C$ is
\[
A = \inf_{f \in C } e(f) - e_{\cal B} 
\]
and the estimation error is
\[
E = e(\hat{f}) - \inf_{f \in C } e(f)
\]
where $\hat{f} \in C$ minimizes the empirical error.

The approximation error for the data dependent class $\calc$ is
\[
A_D = \inf_{f \in \calc_{z_n} } e(f) - e_{\cal B}
\]
and the estimation error is
\[
E_D = e (\tilde{f} ) - \inf_{f \in \calc_{z_n} } e(f),
\]
where $\tilde{f} \in \calc_{z_n}$ minimizes the empirical error.
The data dependent approximation error $A_D$
splits into
\[
A_D = A + E_{DD}
\]
where the data dependency error is
\[
E_{DD} = \inf_{f \in \calc_{z_n} } e(f) - \inf_{f \in C } e(f).
\]
Consequently we see the increase in approximation error
due to data dependency.
The data dependent estimation error splits into
\[
E_D = E - E_{DD} + e(\tilde{f} ) - e(\hat{f})
\]
giving
\[
A_D + E_D = A + E + e(\tilde{f} ) - e(\hat{f}).
\]
The right hand side shows the possible benefits
of data dependent learning
which we analyze through the terms on the left hand side.

Analysis of the
data dependent approximation error
is similar to that for traditional classifiers,
except for the data dependency error term.
Our treatment of this random variable has just begun.
We have shown that its infinite sample limit is constant
when the data dependency is symmetric in its dependence on
the $n$-sample.
Although we have provided no general
conditions on the data dependency
that allow computation of this constant,
we have shown that it is zero when we employ
the structural risk minimization
over multi-sphere classifiers
as described in Section \ref{sect:srm_multi-sphere}.
In this case we have also shown that 
the infinite sample limit of the data
dependent approximation error is zero.

In both the traditional and data dependent theory
estimation error is controlled in terms of the shatter
coefficient.
While the traditional theory uses the Sauer lemma to provide a
simple bound for the shatter coefficient in terms of
the VC dimension (Devroye et al., 1996) we have discovered no
such simplification in the data dependent case.
Indeed, we have not yet found a way to bound the
shatter coefficient in terms of general characteristics
of the data dependent class.
In fact we suspect incorporation of the expectation
process in the shatter bounds will become important.
Such an approach leads to problems in the field
of probabilistic geometry (Ambartzumian, 1990).
However, there are few distribution independent results in this
field, the most notable exception being the result
of Rogers (1978) concerning the probability that the convex
hulls of two samples do not intersect.
Although relevant,
we have been unable to extend it to more than two dimensions.

We argue that the change in shatter coefficient
due to data dependency may not always be significant.
Let $C$ be a finite VC class and let the corresponding
data dependent class be the subset of $C$
that achieves the minimum empirical error on the $n$-sample,
\[
\calc_{z_n} = \{ \hat{f}: \hat{f} = \arg \min_{f \in C}  \hat{e} (f) \}
\]
Since $\calc_{z_n} \subseteq C$ it is clear that
$S_{n/2n} (\calc)$ is less than or equal to the
shatter coefficient $S(n, C)$ of $C$, but
the question is whether $S_{n/2n} (\calc )$ is
sufficiently smaller (uniformly) to
guarantee a reduction in estimation error.
The existence of
lower bounds on the generalization error
for empirical error minimization (Devroye et al., 1996; Vapnik, 1998)
that closely match the upper bounds with which we wish to compare
suggest not.

We now discuss examples where
the improvement is significant.
We begin by considering the sphere classifiers where the multi-sphere class is
restricted to
a single sphere. The traditional class $C$,
whose center and radius are unrestricted,
can be represented as a generalized linear classifier
with VC dimension $\le d+2$ (Cover, 1965) (where $d$ is the dimension of
the sample space) and so has
shatter coefficient
$S(n, C) \le n^{d+2}$.
Forcing the sphere to be centered at a data point represents a clear
restriction on the function class.
Note that traditional VC theory accounts for the reduction in complexity
that results from placing the center at any fixed location. Indeed, in this
case $S(n, C) \le n$, but the center location must be chosen before
we see the data. On the other hand, a data dependent framework like the
one in this paper
is required if we wish to account for the reduction in complexity
that results from centering the sphere at a data point.
We have shown that $S_{n/2n} (\calc) \le (2n)^2$
which demonstrates that the complexity reduction is manifested in a
shattering bound that is independent of dimension.
While this class restriction may give tighter control on
the estimation error,
it is likely to increase the approximation error.
However, the dimension independent nature of the
estimation error allows us to
employ methods for reducing approximation error
that were not previously available to us.
For example we can map to a higher dimensional space
and implement the single sphere classifier there
(in much the same way we do with support vector machines).
This can be accomplished in a computationally efficient way through
the use of a kernel to compute distances (Sch{\"o}lkopf, 2001).
Finally we note that 
forcing the sphere to be centered at a data sample
gives rise to a significant reduction in the computational
requirements for learning.
Computing the optimal position and radius of a single ball
is NP-Hard (Johnson \& Preparata, 1978),
but determining the optimal center sample and its radius
can be accomplished in (low order) polynomial time
(e.g. the brute force algorithm runs in $O(dn^2 \log n)$ time).
In summary, the data dependent framework has allowed us
to quantify the reduction in class complexity obtained by forcing
the ball to be centered at a data point, provided dimension
independent error bounds, and consequently
led us to consider a
new learning paradigm that is computationally tractable.
Similar conclusions can be drawn for the set of simple linear classifiers
introduced in Section \ref{sect:slc}.

It is possible that the significant decrease in estimation error
in these examples is countered by a large increase in
approximation error. However,
we can reduce the
approximation error substantially
by using multiple balls (or linear classifiers).
For example we consider variations of the $q$-sphere classifiers
introduced in Section \ref{sect:multi-sphere}.
Theorem \ref{classcover} tells us that under very general
conditions we can drive
the approximation (and estimation) error to zero asymptotically (${\it wp1}$).
In practice, where $q$ and $n$ are finite, since the
estimation error bounds remain independent of dimension,
we can combine the technique of mapping to a higher
dimension through a kernel with multiple
balls to help reduce approximation error.
While this is an attractive idea,
computational requirements may limit its utility.
Theorem \ref{thm:cc_hard}
states that the problem of finding the best $q$ out of $n$
balls and their radii is NP-Hard \footnote{Note that since
there are at most ${n \choose q} = O(n^q )$ choices for the
center points, and for each of these no more than $n^q$ choices for
the radii, the problem is polynomial for fixed $q$.
Even so, it is not practical for $q$ greater than about 3.}.  
Consequently it may be beneficial to consider variations of the data
dependent class $\calc_{z_n}^q$ that allow us to trade approximation
and/or estimation error for computational resources.

For example suppose we restrict $\calc_{z_n}^q$ by either fixing
the radii or fixing the subset of the samples where the balls are centered.
These restrictions lead to a natural reduction in the shatter
coefficient for the corresponding data dependent class as shown is Section
\ref{sect:multi-sphere}.
Note however, that when we
fix the subset where the balls are centered
symmetry is violated and
the infinite sample limit of the data dependent optimal generalization
error may
not be constant (see the proof of Theorem \ref{const}).
While these variations may remain NP-Hard(see Section \ref{sect:multi-sphere})
it is possible that they
admit a
polynomial time approximation that is acceptable
for use in practice. They may also be solved
by an algorithm whose expected (or typical)
run time is polynomial.

Finally, we obtain a computationally tractable variant of this
problem by fixing the $q$ samples where the balls are centered
and employ a single radius
(the same for each ball) that is chosen to minimize
the empirical error.

In summary, 
we have introduced a new framework which,
although incomplete,
has demonstrated its utility
by allowing us to more thoroughly quantify
the trade-offs between performance and
computational complexity, and
has lead to the discovery of
new families of classifiers with dimension independent
performance bounds and efficient learning procedures.


\acks{We would like to thank Leonid Gurvits for a very helpful conversation.
Clint Scovel gratefully acknowledges support from the Los Alamos DOE
Program in Applied Mathematics. Adam Cannon gratefully acknowledges support
from Los Alamos National Laboratory.}



\bibliography{cannon02a}
\nocite{ambartzumian}
\nocite{ash}
\nocite{ba-bo-lu}
\nocite{bo-lu-ma}
\nocite{bu-ku}
\nocite{ca-co}
\nocite{Cortes95}
\nocite{cover}
\nocite{Crist00}
\nocite{de-gy-lu}
\nocite{devroye}
\nocite{dudley}
\nocite{freund}
\nocite{gat}
\nocite{Hoc97a}
\nocite{hu-sc}
\nocite{Joh78a}
\nocite{koltchinskii}
\nocite{ko-ab-ar-do-pa}
\nocite{ma-pr}
\nocite{pr-de-ma}
\nocite{rogers}
\nocite{Sch01}
\nocite{serfling}
\nocite{sh-ba-wi-an}
\nocite{Sha00a}
\nocite{shiryaev}
\nocite{Vap98a}
\nocite{You38a}

\end{document}




