
We present our results in a series of tables listing the $F_1$, recall, and precision values
for different parameter settings.  The main summary is Table \ref{table-main-comp}, where results
from all the tables are gathered for comparison.
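
For reference, and assuming the standard definitions (the symbols $TP$, $FP$, and $FN$
are our notation here for the numbers of true positives, false positives, and
false negatives in a category), precision $P$, recall $R$, and $F_1$ are related by
\[
P = \frac{TP}{TP+FP}, \qquad R = \frac{TP}{TP+FN}, \qquad F_1 = \frac{2PR}{P+R},
\]
so that $F_1$ is high only when both precision and recall are high.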

\subsection{One-class SVM (Sch\"{o}lkopf algorithm)  Results}

The results for this algorithm were very sensitive to the choice of parameters.
With suitable choices, however, it can give the best results.

In particular:
\begin{itemize}
\item{This algorithm worked much better with the binary representation;
in fact, the results under all kernels were extremely poor with
the other three representations we tested (frequency, tf-idf,
and Hadamard).  Table \ref{table-svm-oneclass-hadamard} presents the Hadamard
results with 20 features;
the results with tf-idf and frequency were equally poor.}
\item{Using 10 features was best in general, but this
depended on the choice of kernel.   For the polynomial kernel,
20 features were dramatically better.   On the other hand,
for the radial basis kernel, 10 features were dramatically better than
20 features (see Table \ref{TABLE-oneclassSVMtentotwenty}).}
%(Compare Tables \ref{table-svm-oneclass-m10}  and \ref{table-svm-oneclass-m20}).}
%\marginpar{combine into one table}
\item{Increasing the number of features further  caused large decreases
in performance (see Table \ref{table-svm-oneclass-all}).}

Aside:  Note that, in our approach, the features are category-specific, unlike
in other studies such as that of \citet{TJ}.
This makes a direct comparison of dimensions difficult.
Roughly, the total number of features over all categories in our experiments
is about 100 (with twenty features per category).

%Aside: Note that these numbers are roughly consistent with [Joachim] \citep{JOACHIM}
%where 200 features were chosen for the Reuters data base.   This is 
%because \citep{JOACHIM} used the same vectors for all the categories,
%whereas we have chosen the features in a category based manner.   The total
%number of features we used over all is  ????.
%\marginpar{MAlik,  edit this and let me know the total number of different
%features.}


\item{We investigated the effect of removing the near-Hamming distance
examples from the data set before running this algorithm.  We did this
for vectors with 10 features.  Since the results were worse in all cases
than those of the original algorithm, we do not reproduce the
data tables here.}
\end{itemize}

In summary, the best choice of parameters for this algorithm was
10 features and the binary representation with the radial basis kernel.   However,
the sensitivity to changes in any of these parameters makes it difficult
to generalize to other applications.  The linear kernel, while
giving somewhat worse results, did not seem to be as sensitive.
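
For concreteness, the kernels named above are commonly written as follows.
This is only a sketch assuming the usual parameterization; the symbols
$\gamma$, $c$, and $d$ are our notation here, and their values are not
specified in this section.
\[
\begin{array}{ll}
\mbox{linear:} & K(x,y) = x \cdot y \\
\mbox{polynomial:} & K(x,y) = (\gamma\, x \cdot y + c)^{d} \\
\mbox{radial basis:} & K(x,y) = \exp(-\gamma \|x-y\|^{2})
\end{array}
\]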




%Parameters for Scholkopf:
%1.Scholkopf using 10 features is best for SVM.
%2.Scholkopf using Binary is best (for 20 features) (and Malik claims
%doesnt matter if 10 or 20 or whatever.)  Moreover, VERY BAD using
%Tf-idf or Hadamard.   Need explanation as to why so sensitive to 
%representation. 
%3. Scholkopf VERY sensitive to Kernel choice, even within BINARY.
%Moreover, changed between 20 and others for Radial basis is very 
%dramatic.    No explanation.   Malik is double checking the results.

%FINAL CHOICE:  Scholkopf with 10,or 20 but use linear probably, definitely
%with BINARY.
%\clearpage
\input{TABLESfourOutliers.tex}
\input{TABLE-oneclassSVMtentotwenty}
%\input {table-svm-oneclass-m10.tex }
%\input {table-svm-oneclass-m20.tex }

%\input{table-svm-sholkopf-outlier-linear-10b.tex}
%\input{table-svm-sholkopf-outlier-poly-10b.tex}
%\input{table-svm-sholkopf-outlier-radial-10b.tex}
%\input{table-svm-sholkopf-outlier-sigmoid-10b.tex}

\input {table-svm-oneclass-all.tex}

%%%%\input {table-svm-oneclass-m40.tex}
%%%%\input {table-svm-oneclass-m60.tex}
%%%%\input {table-svm-oneclass-m100.tex}

\input {table-svm-oneclass-hadamard.tex}

%\input {table-svm-oneclass-tfidf.tex}
%\input {table-svm-oneclass-freq.tex}




\subsection{Outlier-SVM Results}

The results were generally somewhat worse than the one-class SVM
results reported above, especially when viewed across the full range of categories.
Occasionally (e.g., for the polynomial kernel with
10 features), this method was superior.
Compare  Table \ref{TABLE-oneclassSVMtentotwenty}  %Table of polynomial kernel for oneclass
with Table \ref{Outlier-SVM-Different-Kernels}. %Table of polynomial kernel of outlier

For larger categories, the one-class SVM obtained somewhat better
results.   Under macro averaging (i.e., taking into account the
differing number of items in each category), the outlier-SVM gives
somewhat better results than the one-class SVM (see Table \ref{table-main-comp}).

%\marginpar{correct and enter table references. Also review last sentence.}

\begin{itemize}
\item{We tested this algorithm only with the binary representation.}
\item{10 features were preferable to 20 features.  
Compare Table \ref{table-svm-optimum} and the subtables in
Table \ref{Outlier-SVM-Different-Kernels}.}
%(Compare Table \ref{table-svm-optimum} and the subtables in
%tables \ref{table-svm-outlier-linear-10b}-\ref{table-svm-outlier-sigmoid-10b}.)}
%\marginpar{Combine tables}
\item{For 10 features, the best choice of Hamming distance for the outlier threshold
was 4 or 5.
According to Table \ref{TABLE-cumulative},
%Looking at Tables \ref{table-svm-split-10}  and \ref{table-svm-split-20} 
this means that
roughly a third of the data were designated as outliers.  This indicates that
it would be useful to find a better criterion for the original selection
of keywords.   This is somewhat difficult because, in our context,
only positive information is available.}

\end{itemize}
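
As an illustration of the threshold, and assuming that the Hamming distance
is measured from the origin (so that, for a binary vector, it equals the number
of non-zero entries) and that the inequality is non-strict, an example
$x \in \{0,1\}^{m}$ with $m$ features would be treated as an outlier when
\[
d_H(x, 0) = \sum_{j=1}^{m} x_j \le t ,
\]
where $t$ is the threshold; the symbols $d_H$, $m$, and $t$ are our notation here.
For instance, with $m = 10$ and $t = 4$, a document containing at most four of the
ten keywords of a category would be counted as an outlier.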

In summary, the best parameters for this algorithm were the binary representation,
10 features, and the linear kernel (which differed little
from the radial basis kernel).









%Parameters for Outliers
%1. Used best Hamming 7/20.  
%2. Polynomial poor; other kernels compable.

%Sholkopf seems superior to outliers for optimal choices.
%\input{TABLESfourOutliers.tex}

\input{table-cumulativesplits}
%\input {table-svm-split-10.tex}
%\input {table-svm-split-20.tex}

%\input {table-svm-outlier-20.tex}

\input {table-svm-optimum.tex}

%\clearpage

%\input{table-svm-outlier-linear-10a.tex}
%\input{table-svm-outlier-poly-10a.tex}
%\input{table-svm-outlier-radial-10a.tex}
%\input{table-svm-outlier-sigmoid-10a.tex}

%\clearpage
%%%%%%Here are the old tables 9 - 12
%\input{table-svm-outlier-linear-10b.tex}
%\input{table-svm-outlier-poly-10b.tex}
%\input{table-svm-outlier-radial-10b.tex}
%\input{table-svm-outlier-sigmoid-10b.tex}




\input{table-main-comp.tex}


\section{Comparisons and Conclusions }

In Table \ref{table-main-comp}, we list the best $F_1$ results from the one-class
SVM,
the outlier-SVM, and the four other algorithms from the work by \citet{ManevitzMalik1}.
%\marginpar{Malik, check that this is all accurate.}
The results for ten Reuters categories are presented.  We also list, in this table,
the ``macro'' averages, which take into account the different number of items in
each category.
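
To make the distinction explicit, and assuming that the weights are the category
sizes (the symbols $F_1^{(i)}$ and $n_i$, denoting the $F_1$ value and the number of
items of category $i$, are our notation here), the unweighted average and the
``macro'' average over the ten categories would be
\[
\bar{F}_{\mathrm{unweighted}} = \frac{1}{10} \sum_{i=1}^{10} F_1^{(i)}, \qquad
\bar{F}_{\mathrm{macro}} = \frac{\sum_{i=1}^{10} n_i F_1^{(i)}}{\sum_{i=1}^{10} n_i} ,
\]
so that the large categories dominate the latter.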



Looking over this table, and focusing on the unweighted average,
we see that the one-class SVM, as proposed
by Sch\"{o}lkopf et al., gives the best overall performance.   This is quite
clear with respect to all the other algorithms except the compression
neural network (NN) algorithm, which is comparable.  Sch\"{o}lkopf's proposal
has the usual advantages of SVMs; in particular, it is less
computationally intensive than neural networks.

%\marginpar{\small 
%Add a sentence like \it We also point out the one-class results
%are already superior to some weaker two-class algorithms like Naive Bayes,
%compare [REFERENCE]}

Under the ``macro'' averaging, the NN and the outlier-SVM were somewhat superior. This means
that, while the one-class SVM was more robust with regard to smaller categories,
the NN and the outlier-SVM achieved good results through their success in the larger categories.

On the other hand, the  one-class SVM  was very sensitive to the parameters
and choice of kernel.    The neural network method, in comparison,
seemed relatively stable over these parameters.    
(In our experiments, the linear kernel was, however, fairly stable
although its results were slightly worse than the neural network algorithm.)
Thus, given current
knowledge, i.e., until the choice of parameters is better understood,
it would seem that the neural network method is the preferred one.


%Parameters and Results for Nearest Neighbor:




%Parameters and Results for Rochio:


%Bayes-Parameters




%Parameters and Results for Neural Networks:



