\documentclass[twoside,11pt]{article}

% Any additional packages needed should be included after jmlr2e.
% Note that jmlr2e.sty includes epsfig, amssymb, natbib and graphicx,
% and defines many common macros, such as 'proof' and 'example'.
%
% It also sets the bibliographystyle to plainnat; for more information on
% natbib citation styles, see the natbib documentation, a copy of which
% is archived at http://www.jmlr.org/format/natbib.pdf

\usepackage{jmlr2e}


\def\be{\begin{equation}}
\def\ee{\end{equation}}
\def\a{{\bf a}}
\def\x{{\bf x}}               

%mathematical symbols:
\def\vbar{\mathchoice{\vrule height6.3ptdepth-.5ptwidth.8pt\kern-.8pt}
   {\vrule height6.3ptdepth-.5ptwidth.8pt\kern-.8pt}
   {\vrule height4.1ptdepth-.35ptwidth.6pt\kern-.6pt}
   {\vrule height3.1ptdepth-.25ptwidth.5pt\kern-.5pt}}
\def\fudge{\mathchoice{}{}{\mkern.5mu}{\mkern.8mu}}
\def\bbc#1#2{{\rm \mkern#2mu\vbar\mkern-#2mu#1}}
\def\bbb#1{{\rm I\mkern-3.5mu #1}}
\def\bba#1#2{{\rm #1\mkern-#2mu\fudge #1}}
\def\bb#1{{\count4=`#1 \advance\count4by-64 \ifcase\count4\or\bba A{11.5}\or
   \bbb B\or\bbc C{5}\or\bbb D\or\bbb E\or\bbb F \or\bbc G{5}\or\bbb H\or
   \bbb I\or\bbc J{3}\or\bbb K\or\bbb L \or\bbb M\or\bbb N\or\bbc O{5} \or
   \bbb P\or\bbc Q{5}\or\bbb R\or\bbc S{4.2}\or\bba T{10.5}\or\bbc U{5}\or
   \bbb P\or\bbc Q{5}\or\bbb R\or\bba S{8}\or\bba T{10.5}\or\bbc U{5}\or
   \bba V{12}\or\bba W{16.5}\or\bba X{11}\or\bba Y{11.7}\or\bba Z{7.5}\fi}}
                                      
\def\Q{{\bb Q}}                         %rationals
\def\N{{\bb N}}                         %naturals
\def\R{{\bb R}}                         %real%
\def\Z{{\bb Z}}                         %integers
\def\B{{\bb B}}                         %in {0,1}*
                                                         
\jmlrheading{2}{2001}{125-137}{3/04}{12/01}{Ben-Hur, Horn, Siegelmann and Vapnik}\ShortHeadings{Support Vector Clustering }{Ben-Hur, Horn, Siegelmann and Vapnik}
\firstpageno{125}
 
\begin{document}

\title{Support Vector Clustering}
\author{\name
Asa Ben-Hur \email asa@barnhilltechnologies.com\\
%Faculty of IE and Management\\
%Technion, Haifa 32000, Israel\\ \\
\addr BIOwulf Technologies \\
2030 Addison st. suite 102, Berkeley, CA 94704, USA \\
\AND
\name David Horn \email horn@post.tau.ac.il \\ 
\addr School of Physics and Astronomy \\
Raymond and Beverly Sackler Faculty of Exact Sciences \\
Tel Aviv University, Tel Aviv 69978, Israel\\ 
\AND
\name Hava T. Siegelmann \email hava@mit.edu\\
\addr Lab for Information and Decision Systems  \\
   MIT  Cambridge, MA  02139, USA\\ 
\AND
\name Vladimir Vapnik \email vlad@research.att.com\\
\addr AT\&T Labs Research \\
100 Schultz Dr., Red Bank, NJ 07701, USA\\
}

%\begin{document}
\editor{Nello Critianini, John Shawe-Taylor and Bob Williamson}
\maketitle


\begin{abstract}%
We present a novel clustering method
using the approach of support vector machines.
Data points are mapped by means of a Gaussian kernel
to a high dimensional feature space, where we search for the minimal
enclosing sphere.
This sphere, when mapped back to data space, can separate
into several components, each enclosing a separate cluster of points.
We present a simple algorithm for identifying these clusters.
The width of the Gaussian kernel controls the scale at which the data
is probed while the soft margin constant helps coping with outliers and overlapping clusters.
The structure of a dataset is explored by varying the two
parameters, maintaining a minimal number of support vectors
to assure smooth cluster boundaries.
We demonstrate the performance of our algorithm on several datasets.
\end{abstract}

\begin{keywords}
Clustering, Support Vectors Machines, Gaussian Kernel
\end{keywords}
\section{Introduction}

Clustering algorithms group data points according to various criteria
,
as discussed by
\cite{jain,fukunaga,dhs}.
Clustering may proceed according to some parametric model, as
in the k-means algorithm of \cite{kmeans1}, or by
grouping points according to some distance or
similarity measure as in hierarchical clustering algorithms.
Other approaches include graph theoretic methods, such as \cite{shamir-gen}, 
physically motivated algorithms, as in \cite{blatt}, 
and algorithms based on density estimation as in \cite{roberts} and \cite{fukunaga}.
In this paper we propose a non-parametric clustering algorithm based
on the support vector approach of \cite{Vapnik95}.
In \cite{sch-sup,sch-sup1,tax-duin} a support vector
algorithm was used to characterize the support
of a high dimensional distribution.
As a by-product of the algorithm one can compute a set of contours 
which enclose the data points.
These contours were interpreted by us as cluster boundaries in \cite{vclust}.
Here we discuss in detail a method
which allows for a systematic search for clustering solutions
without making assumptions on their number or shape, first
introduced in \cite{nips00}.

In our Support Vector Clustering (SVC) algorithm data points
are mapped from data space to a high dimensional feature space using
a Gaussian kernel.
In feature space we look for the smallest sphere that encloses
the image of the data.
This sphere is mapped back to data space, where it forms 
a set of contours which enclose the data points.
These contours are interpreted as cluster boundaries. Points enclosed by each
separate contour are associated with the same cluster.
As the width parameter of the Gaussian kernel 
is decreased, the number of disconnected contours in data space increases,
leading to an increasing number of clusters.
Since the contours can be interpreted as delineating the support of the
underlying probability distribution, our algorithm can be viewed as
one identifying valleys in this probability distribution.

SVC can deal with outliers by employing
a soft margin constant that allows the sphere in feature space
not to enclose all points.
For large values of this parameter, we can
also deal with overlapping clusters.
In this range our algorithm is similar to
the scale space clustering method of \cite{roberts} that
is based on a Parzen window estimate of the probability density with a
Gaussian kernel function.

In the next Section we define the SVC algorithm. In Section 3 it is applied
to problems with and without outliers. We first describe a problem
without outliers to illustrate the type of clustering boundaries and
clustering solutions that are obtained by varying the scale of the Gaussian
kernel. Then we proceed to discuss problems that necessitate invoking
outliers in order to obtain smooth clustering boundaries. These problems include
two standard benchmark examples.

\section{The SVC Algorithm}

\subsection{Cluster Boundaries}

Following \cite{sch-sup} and \cite{tax-duin} we formulate a support vector description of
a data set, that is used as the basis of our clustering algorithm.
Let $\{{\bf x}_i\} \subseteq \chi$ be a data set
of $N$ points, with $\chi \subseteq \R^d$,
the data space.
Using a nonlinear transformation 
$\Phi$ from $\chi$ to some
high dimensional feature-space, 
we look for the smallest enclosing sphere of radius $R$.
This is described by
the constraints:
$$
||\Phi({\bf x}_j) - {\bf a}||^2 \leq R^2 ~~ \forall j\;,
$$
where $||\cdot||$ is the Euclidean norm and ${\bf a}$ is the
center of the sphere. 
Soft constraints are incorporated by adding slack variables $\xi_j$:
\be \label{outliers}
||\Phi(\x_j)-\a||^2 \le R^2 + \xi_j
\ee
with $\xi_j \geq 0$.
To solve this problem we introduce the Lagrangian
\be \label{lagrangian}
L=R^2 -\sum_j (R^2 +\xi_j-||\Phi(\x_j)-\a||^2)\beta_j 
 - \sum \xi_j \mu_j + C \sum \xi_j\;,
\ee
where $\beta_j \ge 0$ and $\mu_j \ge 0$ are Lagrange multipliers,
$C$ is a constant,
and $C \sum \xi_j$ is a penalty term.
Setting to zero the derivative of $L$ with respect to 
$R$, ${\bf a}$ and $\xi_j$, respectively, leads to
\begin{eqnarray} \label{sumbeta}
\sum_j \beta_j =1  \\
\label{eqn-a}
\a=\sum_j \beta_j \Phi(\x_j)  \\
\label{upperbound}
\beta_j = C - \mu_j \, .
\end{eqnarray}
The KKT complementarity conditions of \cite{fletcher} result in
\begin{eqnarray}
\label{kkt1}
\xi_j \mu_j = 0 ,\\
\label{kkt2}
(R^2 +\xi_j-||\Phi(\x_j)-\a||^2)\beta_j = 0.
\end{eqnarray}
It follows from Eq. (7) that the image of a point ${\bf x}_i$ with $\xi_i > 0$
and $\beta_i >0$ lies outside the feature-space sphere.
Eq. (\ref{kkt1}) states that such a point has
$\mu_i=0$, hence we conclude from
Eq. (\ref{upperbound}) that $\beta_i = C$. This will be called
a {\em bounded support vector} or BSV.
A point ${\bf x}_i$ with $\xi_i=0$ is mapped to the inside or to the surface
of the feature space sphere.
If its $0<\beta_i <C$ then
Eq. (\ref{kkt2}) implies that its image $\Phi({\bf x}_i)$
lies on the surface of the feature space sphere.
Such a point
will be referred to as a {\em support vector} or SV.
SVs lie on cluster boundaries, BSVs lie outside the boundaries, and all other
points lie inside them.
Note that when $C \geq 1$ no BSVs exist because of
the constraint (3).

Using these relations
we may eliminate the variables $R$, $\a$ and $\mu_j$, turning the Lagrangian
into the Wolfe dual form that is a function of the variables $\beta_j$:
\be
W=\sum_j \Phi(\x_j)^2 \beta_j -
\sum_{i,j} \beta_i \beta_j \Phi(\x_i) \cdot \Phi(\x_j).
\ee
Since the variables $\mu_j$ don't appear in the Lagrangian they may be
replaced with the constraints:
\be \label{boundbeta}
0 \le \beta_j \le C , \,\, j=1,\ldots, N.
\ee
We follow the SV method and represent the
dot products  
$\Phi(\x_i) \cdot \Phi(\x_j)$ by an appropriate Mercer kernel $K(\x_i,\x_j)$.
Throughout this paper we use the Gaussian kernel
\be
K(\x_i,\x_j)=e^{-q||\x_i-\x_j||^2} \;,
\ee
with width parameter $q$.
As noted in \cite{tax-duin}, polynomial kernels
do not yield tight contours
representations of a cluster.
The Lagrangian $W$ is now written as:
\be
W=\sum_j K(\x_j,\x_j) \beta_j -\sum_{i,j} \beta_i \beta_j K(\x_i, \x_j).
\ee

At each point ${\bf x}$ we define the distance of its image in feature space
from the center of the sphere:
\begin{equation}
R^2({\bf x}) = ||\Phi({\bf x}) - {\bf a}||^2 \;.
\end{equation}
In view of (\ref{eqn-a}) and the definition of the kernel we have:
\begin{equation} \label{radius}
R^2({\bf x}) = 
K({\bf x},{\bf x}) - 
2 \sum_{j} \beta_j K({\bf x}_j,{\bf x}) + 
\sum_{i,j} \beta_i \beta_j K({\bf x}_i,{\bf x}_j) \,.
\end{equation}
The radius of the sphere is: 
\begin{equation}\label{R}
R = \{ R({\bf x}_i) ~|~{\bf x}_i \mbox{~is a support vector~}\} \;.
\end{equation}
The contours that enclose the points in data space are defined by the set
\begin{equation}\label{contour}
\{ {\bf x} ~|~R({\bf x}) = R \} \;.
\end{equation}
They are interpreted by us as forming cluster boundaries
(see Figures 1 and 3).
In view of equation (\ref{R}),
SVs lie on cluster boundaries, BSVs are outside, and all other points
lie inside the clusters.

\subsection{Cluster Assignment}

The cluster description algorithm does 
not differentiate between points that belong to different clusters.
To do so, we use a geometric approach 
involving ${\bf R}(\x)$, based on the
following observation:
given a pair of data points that belong to different components (clusters),
any path that connects them 
must exit from the sphere in feature space.
Therefore, such a path contains a segment of points ${\bf y}$ such that
$R({\bf y}) > R$.
This leads to the definition of the
adjacency matrix
$A_{ij}$ between  pairs of points ${\bf x}_i$ and ${\bf x}_j$
whose images lie in or on the sphere in feature space:
\begin{equation}
A_{ij} = \left\{
\begin{array}{ll}
1 & \mbox{if, for all ${\bf y}$ on the line segment connecting ${\bf x}_i$}
\, \mbox{and~} {\bf x}_j, ~R({\bf y}) \leq R \\
0 & \mbox{otherwise}. \\
\end{array} \right.
\end{equation}
Clusters are now defined as the connected components of the
graph induced by $A$.
Checking the line segment is implemented
by sampling a number of points
(20 points were used in our numerical experiments).

BSVs are unclassified by this procedure since their feature space images lie
outside the enclosing sphere.
One may decide either to leave them unclassified,
or to assign them to the cluster that they are closest to, as we will do
in the examples studied below.

\section{Examples}
The shape of the enclosing contours in data space is governed
by two parameters: $q$, the scale parameter of the Gaussian kernel, and $C$,
the soft margin constant.
In the examples studied in this section we will demonstrate the
effects of these two parameters. 


\subsection{Example without BSVs}

\begin{figure}[t]
\epsfxsize=13cm
\centerline{\epsffile[65 190 560 610]{four.ps}}
\label{four}
\caption{Clustering of a data
set containing 183 points using SVC with \ensuremath{C=1}.
Support vectors are designated by small circles, and cluster assignments
are represented by different grey scales of the data points.
 (a): \ensuremath{q=1} (b): \ensuremath{q=20} (c): \ensuremath{q=24} 
(d): \ensuremath{q=48}. 
}
\end{figure}

\begin{figure}[t]
\epsfxsize=10cm
\centerline{\epsffile[65 190 560 610]{sv_q.ps}}
\label{sv-plot}
\caption{
Number of SVs as a function of \ensuremath{q} for the data of Figure 1.
Contour splitting points are denoted by vertical lines.
}
\end{figure}

We begin with a data set in which the separation into clusters can be
achieved without invoking outliers, i.e. $C=1$.
Figure 1 demonstrates that as the scale parameter
of the Gaussian kernel, 
$q$, is increased,
the shape of the boundary in data-space varies:
with increasing $q$ the boundary fits more tightly the data, and
at several $q$ values the 
enclosing contour
splits, forming an increasing number of components (clusters). 
Figure 1a has the smoothest cluster boundary, defined by six SVs.
With increasing $q$, the number of support vectors $n_{sv}$ increases.
This is demonstrated in Figure 2 where we plot $n_{sv}$ as a function of $q$
for the data considered in Figure 1.

\subsection{Example with  BSVs}

In real data, clusters are usually not as well separated as in Figure 1.
Thus, in order to observe splitting of contours, we must allow
for BSVs.
The number of outliers is controlled by the 
parameter $C$. From
the constraints (\ref{sumbeta},\ref{boundbeta}) it follows that
\begin{equation}\label{outlier_number}
n_{bsv} < 1/C \,,
\end{equation}
where $n_{bsv}$ is the number of BSVs.
Thus $1/(NC)$ is an upper bound on the fraction of BSVs, and it is more 
natural to work with the parameter
\begin{equation}
 p = \frac{1}{NC} \,.
\end{equation}
Asymptotically (for large $N$), the fraction of outliers tends to $p$
,
as noted in
\cite{sch-sup}.

When distinct clusters are present, but some outliers (e.g. due to noise)
prevent contour separation,
it is very useful to employ BSVs.
This is demonstrated in Figure 3a: without BSVs contour separation does
not occur for the two outer rings for any value of $q$.
When some BSVs are present, the clusters are separated easily
(Figure 3b).
The difference between data that are contour-separable without 
BSVs and data that require use of BSVs
is illustrated schematically in Figure 4.
A small overlap between the two probability distributions that generate
the data is enough to prevent separation if there are no BSVs.
\begin{figure}[t]
\epsfysize=7cm
\epsfxsize=15cm
\centerline{\epsffile[73 202 560 610]{ringnormsubplots.eps}}
\label{ringnorm}
\caption{Clustering with and without BSVs.
The inner cluster is composed of 50 points generated from a Gaussian distribution.
The two concentric rings contain 150/300 points, generated from
a uniform angular distribution
and radial Gaussian distribution.
(a) The rings cannot be distinguished when \ensuremath{C=1}. 
Shown here is \ensuremath{q=3.5},
the lowest \ensuremath{q} value that leads to separation of the inner cluster.
(b) Outliers allow easy clustering. The parameters are
 \ensuremath{p=0.3} and \ensuremath{q=1.0}. 
}
\end{figure}

\begin{figure}[t]
\epsfysize=6cm
\epsfxsize=13cm
\centerline{\epsffile[70 390 540 585]{overlap1.eps}}
\label{overlap1}
\caption{Clusters with overlapping density functions require
the introduction of BSVs.
}
\end{figure}

In the spirit of the examples displayed in Figures 1 and 3 we propose
to use SVC iteratively: 
Starting with a low value of $q$ where there is a single cluster,
and increasing it, to observe the formation of an increasing number
of clusters, as the Gaussian kernel describes the data with
increasing precision.
If, however, the  number of SVs is excessive,
i.e. a large fraction of the data turns into
SVs (Figure 3a), or a number of singleton clusters form, one should
increase $p$ to allow these points to turn into outliers, thus
facilitating contour separation (Figure 3b).
As $p$ is increased not only does
the number of BSVs increase, but their influence on the shape of the
cluster contour decreases, as shown in \cite{vclust}.
The number of support vectors depends on both $q$ and $p$.
For fixed $q$, as $p$ is increased, the number of SVs
decreases since some of them turn into BSVs
and the contours become smoother
(see Figure 3).
 
                         
\section{Strongly Overlapping Clusters}

Our algorithm may also be useful in cases where clusters strongly
overlap, however a different
interpretation of the results is required. We propose to use
in such a case a high BSV
regime, and reinterpret the sphere in feature space as representing
cluster cores, rather than the envelope of all data.

Note that equation (\ref{contour}) for the reflection of the sphere
in data space can be expressed as
\begin{equation}
\{ {\bf x} ~|~ \sum_{i} \beta_i K({\bf x}_i, {\bf x}) = \rho \} \,,
\end{equation}
where $\rho$ is determined by the value of this sum on the support
vectors.
The set of points enclosed by the contour is:
\begin{equation}
\{ {\bf x} ~|~ \sum_{i} \beta_i K({\bf x}_i, {\bf x}) > \rho \} \,.
\end{equation}
In the extreme case when almost all data points are BSVs 
($p \rightarrow 1$),
the sum in this expression,
 \begin{equation}
 P_{svc} = \sum_{i} \beta_i K({\bf x}_i, {\bf x})
\end{equation}
 is approximately equal to
\begin{equation}\label{density}
 P_w = \frac{1}{N}
\sum_i K({\bf x}_i, {\bf x}) \,.
\end{equation}
This last expression is recognized as a Parzen window estimate 
of the density function (up to a normalization factor, 
if the kernel is not appropriately normalized), see \cite{dhs}.
In this high BSV regime, we expect the contour in data space to
enclose a small number of points
which lie near the maximum of the Parzen-estimated density.
In other words, the contour specifies the {\em core} of the
probability distribution. This is schematically represented in Figure 5.

\begin{figure}[t]
\epsfysize=6cm
\epsfxsize=11cm
\centerline{\epsffile[70 210 540 585]{overlap2.eps}}
\label{overlap2}
\caption{In the case of significant overlap between clusters 
the algorithm identifies clusters
according to dense cores, or maxima of the underlying probability
distribution.
}
\end{figure}

In this regime our algorithm is closely related to the scale-space algorithm
proposed by \cite{roberts}.
He defines cluster centers as maxima of the Parzen window estimator
$P_w({\bf x})$.
The Gaussian kernel plays an important role in his analysis:
it is the only kernel
for which the number of maxima (hence the number of clusters)
is a monotonically non-decreasing function of $q$.
This is the counterpart of contour splitting in SVC.
\begin{figure}[t]
\epsfxsize=15cm
\centerline{\epsffile[64 404 542 600]{newcrabs.eps}}
\label{figure-crabs}
\caption{
Ripley's crab data displayed on a plot of their
2nd and 3rd principal components:
(a) Topographic map of \ensuremath{P_{svc}({\bf x})} and SVC cluster assignments.
Cluster core boundaries are denoted by bold contours;
parameters were \ensuremath{q=4.8, p=0.7}.
(b) The Parzen window topographic map \ensuremath{P_{w}({\bf x})}
for the same $q$ value,
and the data represented by the original classification
given by \cite{ripley}.
}
\end{figure}
As an example we study the crab data set of \cite{ripley} in Figure 6. 
We plot the topographic maps of $P_w$ and $P_{svc}$ in the high BSV
regime. The two maps are very similar. In Figure 6a we present the SVC
clustering assignment. Figure 6b shows the original classification
superimposed on the topographic map of $P_w$.
In the scale space clustering approach it is difficult to identify
the bottom right cluster,
since there is only a small region that attracts points to this
local maximum.
We propose to first identify the contours that form cluster cores,
the dark contours in Figure 6a, and then associate points (including BSVs)
to clusters according to their distances from cluster cores.

The computational advantage of SVC over Roberts' method is that,
instead of solving a problem with
many local maxima, we identify core boundaries by an SV method
with a global optimal solution.
The conceptual advantage of our method is that we define a region,
rather than just a peak, as the core of the cluster.


\subsection{The Iris Data}

We ran SVC on the iris data set of \cite{iris}, which is a standard benchmark
in the pattern recognition literature, and
can be obtained from \cite{UCI}.
The data set contains 150 instances each composed of four
measurements of an iris flower.
There are three types of flowers, represented by 50 instances each.
Clustering of this data in the space of its first two principal components
is depicted in Figure 7 (data was centered prior to extraction of
principal components).
One of the clusters is linearly separable from the other two
by a clear gap in the probability distribution.
The remaining two clusters have significant overlap, and were separated
at $q=6\,\, p=0.6$.
However, at these values of the parameters, the third cluster split
into two (see Figure 7).
When these two clusters are considered together, the result is 2 
misclassifications.
Adding the third principal component we obtained the three clusters
at 
$q=7.0\,\,p=0.70$, with four misclassifications.
With the fourth principal component the number of misclassifications
increased to 14 (using $q=9.0\,\,p=0.75$).
In addition, the number of support vectors increased with increasing
dimensionality (18 in 2 dimensions, 23 in 3 dimensions and 34
in 4 dimensions).
The improved performance in 2 or 3 dimensions can be attributed to the noise
reduction effect of PCA.
Our results compare favorably with other
non-parametric clustering algorithms:
the information theoretic approach of \cite{bottleneck-nips} leads to
5 misclassifications and
the SPC algorithm of \cite{blatt}, when applied to
the dataset in the original data-space, has 15 misclassifications.
For high dimensional datasets, e.g. the Isolet dataset which has 
617 dimensions, the problem was obtaining a support vector description:
the number of support vectors jumped from very few (one cluster) to all
data points being support vectors (every point in a separate cluster).
Using PCA to reduce the dimensionality produced data that clustered well.


\begin{figure}[t]
\epsfxsize=8cm
\centerline{\epsffile[65 190 560 610]{irisq6f06.eps}}
\label{figure-iris}
\caption{
Cluster boundaries of the iris data set analyzed in a two-dimensional space
spanned by the first two
principal components.
Parameters used are \ensuremath{q=6.0\,\, p=0.6}.
}
\end{figure}


\subsection{Varying $q$ and $p$}

We propose to use SVC as a ``divisive'' clustering algorithm, see \cite{jain}:
starting from a small value of $q$ and increasing it.
The initial value of $q$ may be chosen as
\begin{equation}
q=\frac{1}{\max_{i,j} ||{\bf x}_i - {\bf x}_j||^2} \,.
\end{equation}
At this scale all pairs of points produce a sizeable kernel value,
resulting in a single cluster.
At this value
no outliers are needed, hence we choose $C=1$.

As $q$ is increased we expect to find bifurcations of clusters.
Although this may look as hierarchical clustering, we have found
counterexamples when using BSVs. Thus strict hierarchy is not guaranteed,
unless the algorithm is applied separately to each cluster rather than
to the whole dataset. 
We do not pursue this choice here, in order to show how the cluster
structure is unraveled as $q$ is increased.
Starting out with $p=1/N$, or $C=1$, we do not allow for any outliers.
If, as $q$ is being increased, clusters of single or few points break off,
or cluster boundaries become very rough (as in Figure 3a), $p$ should be
increased in order to investigate what happens when BSVs are allowed.
In general, a good criterion seems to be the number of SVs: a low
number guarantees smooth boundaries. As $q$
increases this number increases, as in Figure 2. If the number of SVs
 is excessive,
$p$ should be increased, whereby many SVs may be
turned into BSVs, and smooth cluster (or core) boundaries emerge, as in
Figure 3b. In other words, we propose to systematically increase $q$
and $p$ along a direction that guarantees a minimal number of SVs. A
second criterion for good clustering solutions is
 the stability  of cluster
assignments over some range of the two parameters.

An important issue in the divisive approach is the decision when to stop
dividing the clusters. 
Many approaches to this problem exist, such as  \cite{milligan,psb02}
(and references therein).
However, we believe that
in our SV setting it is natural to use the number of support vectors
as an indication of a meaningful solution, as described above. Hence
 we should stop SVC when the fraction  of SVs exceeds
some threshold.


\section{Complexity}

The quadratic programming problem of
equation (\ref{lagrangian}) can be solved by the SMO algorithm
of \cite{SMO} which was proposed as an efficient tool
for SVM training in the supervised case.
Some minor modifications are required to adapt it to the
unsupervised training problem addressed here, see \cite{sch-sup}.
Benchmarks reported in \cite{SMO} show that this algorithm 
converges after approximately $O(N^2)$
kernel evaluations.
The complexity of the labeling part of 
the algorithm is $O((N - n_{bsv})^2 n_{sv} d)$,
so that the overall complexity is $O(N^{2}d)$ if the number of support
vectors is $O(1)$.
We use a heuristic to lower this estimate: we do not compute the whole
adjacency matrix, but only adjacencies with support vectors.
This gave the same results on the data sets we have tried, and lowers
the complexity to $O((N- n_{bsv}) n_{sv}^2)$.
We also note that the memory requirements of the SMO algorithm are low:
it can be implemented using $O(1)$
memory at the cost of a
decrease in efficiency.
This makes SVC useful even for very large datasets.


\section{Discussion}

We have proposed a novel clustering method, SVC, based on the
SVM formalism. 
Our method has no explicit bias of either the number, or the shape of clusters.
It has two parameters, allowing it to obtain various clustering solutions. 
The parameter $q$ of the Gaussian kernel determines the
scale at which the data is probed, and
as it is increased clusters begin to split.
The other parameter, $p$, is the soft margin constant that 
controls the number of outliers.
This parameter enables analyzing noisy data points and separating
between overlapping clusters.
This is in contrast with most clustering algorithms found in the literature,
that have no mechanism for dealing with noise or outliers.
However we note that for clustering instances with strongly overlapping
clusters  SVC can delineate only relatively small cluster cores. 
An alternative for overlapping clusters is to use a support vector
description for each cluster.
Preliminary results in this direction are found in \cite{vclust}.


A unique advantage of our algorithm is that it
can generate cluster boundaries of arbitrary shape, whereas other
algorithms that use a geometric representation are most often limited to
hyper-ellipsoids, see   \cite{jain}. 
In this respect SVC is reminiscent of the method of \cite{lipson} where
high order neurons define a high dimensional feature-space.
Our algorithm has a distinct advantage over the latter:
being based on a kernel method it avoids explicit calculations in the
high-dimensional feature space, and hence is more efficient.


In the high $p$ regime SVC becomes similar to the scale-space approach
that probes the cluster structure using a Gaussian Parzen window estimate 
of the probability density, where cluster centers are defined
by the local 
%minima 
maxima of the density. 
Our method has
the computational advantage of relying on the SVM quadratic optimization
that has one global
solution.

\vskip 0.2in
\bibliography{svc}
\end{document}