%\subsection{One class SVM}

\citet{Scholkopf1} suggested a method of adapting the 
%Sch\"{o}lkopf et al \citep{Scholkopf1}suggested a method of adapting the 
SVM methodology to the one-class classification problem.
Essentially,  after transforming the feature via a kernel,
they treat the origin as the only member of the second class.
Then using ``relaxation parameters" they separate the image of the 
one class from the origin.
%Another possible approach would be to work first in the feature space,
%and assume that not only the origin is in the second class, but all
%data points ``close enough" to the origin to be considered as noise
%or outliers and also classified as not in the first class.
Then the standard two-class SVM techniques are employed.

They framed the problem in the following way:

Suppose that a dataset has a probability distribution $P$ in the
feature space.  Find a ``simple" subset $S$ of the feature space such that
the probability that a test point from $P$ lies outside $S$ is bounded
by some a priori specified value.

Supposing that there is a dataset drawn from an underlying probability
distribution $P$, one needs to estimate a ``simple" subset $S$ of the input
space such that the probability that a test point from $P$ lies outside
of $S$ is bounded by some a prior specified $v\in(0,1)$ .
The solution for this problem is obtained by estimating a function $f$
which is positive on $S$ and negative on the complement  $\overline S$.
In other words, Sch\"{o}lkopf et al., developed an algorithm which
returns a function $f$ that takes
the value +1 in a ``small" region capturing most of the data vectors, and -1
elsewhere.


The algorithm can be summarized as mapping the data into
a feature space $H$ using an appropriate  kernel function,
and then trying to separate the mapped vectors
from the origin with maximum margin (see Figure \ref{fig-svmoneclass1}).

%\begin{eqnarray}
%\label{eq-svm-f}
\[ f(x) = \left\{ \begin{array}{ll}
     +1  & \mbox{if  $x\in S$} \\
     -1  & \mbox{if  $x\in \overline S$ }
   \end{array}\right. \]

%\end{eqnarray}

\begin{figure}
\caption{One-class SVM}

\label{fig-svmoneclass1}
%\centerline{\psfig{file=svm-oneclass-outliers.eps,height=5cm}}
\centerline{\psfig{file=oneclass.eps,height=7cm}}
\end{figure}

In our context,
let $x_1,x_2, \dots ,x_l$ be  training examples belonging  to one class $X$,
where $X$ is a compact subset of $R^N$.
Let $\Phi: X \longrightarrow H $ be
a kernel map which transforms the training examples to another space.
Then,
to separate the data set from the origin, one needs to solve the
following quadratic programming problem:

%\begin{eqnarray}
%\label{eq-}
 $$  min {1 \over 2} \|w\|^2 +\frac{1}{vl} \sum_{i=1}^l\xi_i -\rho \nonumber $$
  subject to 
 $$  (w \cdot \Phi(x_i) )  \ge \rho-\xi_i  \quad i=1 ,2 , ..., l \quad \xi_i \ge 0 $$
%\end{eqnarray}

\ \\

If $w$ and $\rho$ solve this problem, then the decision function
$$f(x)=sign((w \cdot \Phi(x))- \rho )$$
will be positive for most examples $x_i$ contained in the training set.


In our research we used the LIBSVM (version 2.0). This is an integrated
tool for support vector classification and regression which 
can handle one-class SVM using the Sholkopf etc algorithms. The LIBSVM
2.0 is available at {\it http://www.csie.ntu.edu.tw/\~{}cjlin/libsvm}.
We used the standard parameters of the algorithm.
%We implemented both of these methods, the first is called 
%``one-class SVM" in the sequel, while the second is called ``outlier-SVM".

For this ``one-class SVM", one has to choose the original 
frequency representation,  (e.g. the number of features),
the appropriate kernel, and for each kernel the appropriate parameters.
%For the ``outlier-SVM", one also has to choose the original frequency
%representation,  and how far from the origin a point can be 
%(in our case, in Hamming distance) before being classified as an 
%outlier.
%We investigated both algorithms in this setting under a variety of 
%choices.   Our experiments were more extensive for the one-class SVM
%methodology. 
We allowed, linear, sigmoid, polynomial and
radial basis  kernels; each chosen with the standard parameters.
For the original feature representation, we used binary, tf-idf and
Hadamard representations.
For the number of features we tried  10, 20, 40, 60 and 100 features for the 
one-class SVM.   

%For the outlier SVM we tried 10 and 20, for binary representations.
%(We did sample runs with other parameters, but since the results were
%clearly poor, did not complete the experiments.)


%The results are presented in tables xxxx- xxxx .

%1.  The best representation was clearly binary in all cases, both for
%one-class SVM and for outlier SVM.    
%2.  The best number of features for one class SVM was 10; 
%however it did depend on the kind of kernel used.   Moreover, the difference
%was dramatic.   Note in table ... the return for the radial basis kernel 
%was .519 overall with vector length 10; while for vector length 20, the
%results were a catastrophic .126.   On the other hand the polynomial
%kernel gave poor results (.238) with vector length 10, yet imporved to 
%.460 for polynomial kernel.     As far as we can tell from our results,
%the linear kernel seemed more stable.
%
%3.  The outlier SVM  in general performed worse than the one-class SVM.
%Except for the proper definition of outlier, its results seemed less
%sensitive to the choice of parameters.   The definition of outlier
%{\em was} crucial, however, note that this can be analyzed and performed
%automatically.
%
%


%They framed the problem in the following way:

%Suppose that a dataset has a probability distribution $P$ in the 
%feature space.  Find a ``simple" subset $S$ of feature space such that
%the probability that a test point from $P$ lies outside $S$ is bounded
%by some a priori specified value.

%Suppose that one have some dataset drawn from an underlying probability
%distribuation $P$, we need to estimate a ``simple" subset $S$ of input
%space such that the probability that a test point from $P$ lies outside
%of $S$ is bounded by some a prior specified $v\in(0,1)$ .

%The solution for this problem is obtained by estimating a function $f$
%which is positive on $S$ and negative on the complement  $\overline S$. 
%In other words, Sch\"{o}lkopf etc , developed an algorithm which 
%returns a function $f$ that takes
%the value +1 in a ``small" region capturing most of the data vectors, and -1
%elsewhere. 


%The algorithm can be summarized as maping the data into
%a feature space $H$ using an appropriate  kernel function, 
%and then trying to separate the mapped vectors
%from the origin with maximum margin.

%\begin{eqnarray}
%\label{eq-svm-f}
%\[ f(x) = \left\{ \begin{array}{ll}
%     +1  & \mbox{if  $x\in S$} \\
%     -1  & \mbox{if  $x\in \overline S$ }
%   \end{array}\right. \]

%%\end{eqnarray}

%\begin{figure}
%\caption{A svm for one class}
%\label{fig-svmoneclass1}
%\centerline{\psfig{file=svm-oneclass1.eps,height=5cm}}
%\end{figure}
%
%In our context, 
%let $x_1,x_2,..,x_l$ be  training examples belonging  to one class $X$, 
%where $X$ is a compact subset of $R^N$.
%Let $\Phi: X \longleftarrow H $ be
%a kernel map which transforms the training examples to another space. 
%
%Then,
%To separate the data set from the origin, one need to solve the
%following quadratic programming problem:
%
%\begin{eqnarray}
%\label{eq-}
% $$  min {1 \over 2} \|w\|^2 +\frac{1}{vl} \sum_{i=1}^l\xi_i -\rho \nonumber \\
%   (w \cdot \Phi(x_i) )  \ge \rho-\xi_i  \quad i=1 ,2 , ..., l \nonumber \\
%   \xi_i \ge 0 $$
%\end{eqnarray}

%If $w$ and $\rho$ solve this problem, then the decision function
%$$f(x)=sign((w \cdot \Phi(x))- \rho )$$
%will be positive for most examples $x_i$ contained in the training set.

%In our research we use the LIBSVM (version 2.0) , where this is an integrated
%tool for support vector classification and regression. This version
%can handle one-class SVM using the Sholkopf etc algorithms. The LIBSVM
%2.0 is available at {\it http://www.csie.ntu.edu.tw/~cjlin/libsvm}.