The basic idea is to work first in the feature space, and to assume not only
that the origin belongs to the second class (as in the previous section),
but also that all data points ``close enough''
to the origin are to be treated as noise or outliers.

If a vector has few non-zero entries, then the corresponding document
shares very few items with the chosen feature subset of the dictionary.
Intuitively, such a document will not serve well as a representative of the
class, so it is reasonable to treat its vector as an outlier.
(Note that while the idea of choosing the outliers as data points close to the
origin is a general one, the intuitive justification of this procedure
is specific to this application.)


\begin{figure}
\caption{Outlier SVM}

\label{fig-outlier}
%\centerline{\psfig{file=svm-oneclass-outliers.eps,height=5cm}}
\centerline{\psfig{file=outlier.eps,height=7cm}}
\end{figure}

Geometrically, using the Hamming distance means that all vectors lying on 
standard sub-spaces of small dimension (i.e. axes, faces, etc.) are treated 
as outliers (see Figure \ref{fig-outlier}).
%\marginpar{Go over with Malik.}

Hence, we decided to identify these outliers by counting the features of an
example with non-zero value; if this count is less than a threshold, then the
example is labeled as a negative example.
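
A minimal formalization of this rule may be helpful; the symbols below are
introduced here only for illustration.  Write $x=(x_1,\ldots,x_n)$ for the
feature vector of an example, $\|x\|_H$ for its Hamming distance from the
origin, and $t$ for the threshold:
\[
\|x\|_H = \left|\{\, i : x_i \neq 0 \,\}\right| , \qquad
y(x) = \left\{ \begin{array}{ll}
-1 & \mbox{if } \|x\|_H < t \\
+1 & \mbox{otherwise.}
\end{array} \right.
\]
For instance, with the hypothetical values $n=10$ and $t=3$, a document
containing only two of the ten chosen terms has $\|x\|_H = 2 < 3$ and is
labeled as an outlier.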

This raises the problem of how to choose an appropriate value for the
threshold.  We investigated this in two ways: (1) by experimentally
trying different global values of the threshold, and (2) by determining
individual thresholds for the different categories.  One should
note that, in principle, the threshold can be determined
automatically, e.g.\ by comparing results on a test set.
%\marginpar{ADD A DESCRIPTION BY MALIK}
After the threshold has been determined,
one continues with the standard two-class SVM.
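
One way to write this automatic choice down, as a sketch only: let $T$ be a
finite set of candidate thresholds, and let $P_c(t)$ denote some performance
measure of the two-class SVM for category $c$ when the outliers are defined
with threshold $t$, evaluated on held-out data.  The symbols $T$, $P_c$,
$t_c^{*}$ and $t^{*}$ are introduced here for illustration; the choice of
measure, and the use of a sum to aggregate over categories, are assumptions.
\[
t_c^{*} = \arg\max_{t \in T} P_c(t), \qquad
t^{*} = \arg\max_{t \in T} \sum_{c} P_c(t).
\]
The first expression corresponds to an individual threshold for each
category, the second to a single global threshold.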

For this ``outlier-SVM'', one also has to choose the original frequency
representation, and how close to the origin a point must be
(in our case, in Hamming distance) to be classified as an
outlier.


For the outlier-SVM we tried 10 and 20 features with binary representations.
Note that the features are category-specific, i.e.\ a different
feature set is used for each category.
(We did sample runs with other representations, e.g.\ Hadamard and tf-idf,
and with larger numbers of features, but since the results were clearly poor,
we did not complete the experiments.)

%\marginpar{\small Malik, can we say something here about the smallnumber of
%features.  A referee thought the small number of features was a big 
%drawback.Note this is also pertinent tothe previous section.}


In addition, we experimented with varying the Hamming distance used to
define the outlier threshold.  Note that this threshold can also
be chosen automatically (by comparing results on a test set after
training), so we also collected statistics over the data set in which
each category had its own threshold.
Table \ref{TABLE-cumulative}
%\ref{table-svm-split-10} and \ref{table-svm-split-20}
%\marginpar{put into one table}
lists the number of documents designated as outliers for each choice of
Hamming distance.

We allowed linear, sigmoid, polynomial, and
radial basis kernels, each chosen with the ``standard'' parameters for the
two-class SVM.
