In recent research \citep{ManevitzMalik1} we proposed a one-class neural network
classification scheme and compared it over the standard {\em Reuters}
data set with variants appropriate to the 
positive example case of several algorithms, using various choices of 
parameters and various methods of representing the details.  For extensive details,
see \citet{ManevitzMalik1}.

Here we will compare the two SVM one-class algorithms with the following
ones from that study:


\begin{enumerate}
\item{Prototype (Rocchio's) algorithm   

The Prototype algorithm is widely used in information retrieval
\citep[and others]{MD, MY,MMB,KL} .
%(\citep{MD}, \citep{MY},\citep{MMB},\citep{KL} and others).
This algorithm is used frequently because it is considered as a baseline
algorithm and it is simple to implement. For more details see \cite{TJO}.

The basic idea of the algorithm is to represent each document, $\bf e$, 
as a vector
in a vector space so that documents with similar content have similar vectors.
The value  $e_i$ of the $i^{th}$ key-word  is represented as the {\em tf-idf}
weight.

The Prototype algorithm learns the class model by combining document vectors
into a prototype vector . This vector is generated by adding the document
vectors of all documents in the class.   Classification is done by judging
the angular distance from the prototype vector.  In our experiments,
we used the $F_1$ measure to optimize the threshold.

}



\item{Nearest Neighbor  

This is a modification of the standard Nearest Neighbor algorithm \citep{Yang99}
appropriate for one-class learning.
This algorithm was original presented in detail by \citet{PDatta}; and the 
method of optimizing the parameters is explained by \citet{ManevitzMalik1}.
 }


\item{Naive Bayes

Traditional Naive Bayes calculates the probability of being in a class
given an example, which has specific values for the different attributes.
One calculates this by assuming the different attributes are independent,
applying Bayes' theorem and using the a priori probability of the different
classes.

%P. Datta  \citep{PDatta} showed how to
P. \citet{PDatta} showed how to
modify the algorithm for positive data only and
we follow his presentation.

$E$ is the class of training documents in a category.
We calculate $p(d|E)$ as the product of $p(w|E)$ for all keywords
that appear in the document $d$.     Each of the $p(w|E)$
is estimated independently using the formula:
$ p(w|E)  = { { n_w +1} \over {n + m }}$
where $n_w$ is the number of times word $w$ occurs in $E$,  $n$ is the
total number of words in $E$, and $m$ is the number of keywords.

The threshold is calculated as a fraction (optimized using the
$F_1$ measure) of  the minimum over all examples in $E$
of the value $p(d|E)$

}


\item{Compression Neural Network

The algorithm is based on a feed-forward three layer neural network
with a ``bottleneck".   That is, under the assumption that the
documents are represented in an $m$-dimensional space; we choose
a three level network with $m$ inputs, $m$ outputs and $k$ neurons on the
hidden level, where $k<m$.   Then the network is trained, under
standard back-propagation, to learn the identity function on the
sample examples (see Figure~\ref{bottleneck}). This design was first used by \citet{CMZ}
to produce a compression algorithm.  See also \citet{NCM} for another use
as a novelty detector.  


\begin{figure}
\caption{A Neural Network with Bottleneck}
\label{bottleneck}
\centerline{\psfig{file=compression.eps,height=5cm}}
%\centerline{\bf Figure 1.}
\end{figure}




The idea is that while the bottleneck prevents learning the full identity
function  on $m$-space; the identity on the small set of
examples is
in fact learnable.
Then the set of vectors for which the network
acts as the identity function is a sort of sub-space which is similar
to the trained set.
This avoids the ``saturation" problem of learning
from only positive examples.
Thus the filter is defined by applying the network
to a given vector; if the result is the identity, then the vector is
``interesting".
}

\end{enumerate}

By running the same experiments using the two SVM-one class algorithms,
we are in a position to make a comparative study.

The reader is referred to \citet{ManevitzMalik1} for detailed descriptions 
of each of 
the above algorithms and the various choices of parameters investigated.









