\section{Document Representations, Data Set and Measurements}


\subsection{Document Representations}

We used the following four document representations in these experiments:
\begin{enumerate}
\item{binary representation}
\item{frequency representation}
\item{{\em tf-idf} representation}
\item{{\em Hadamard} representation}

\end{enumerate}

These are defined as follows.
First, list all the words in the training documents, after removing basic
``stop'' words and ``stemming'' (i.e., removing suffixes and prefixes
to avoid duplicate entries), sorted by their {\em document} frequency
(i.e., the number of documents in which a word appears).
Then choose the top $m$ such words according to this ``dictionary'' frequency;
we call these the ``keywords''.
(Experiments with different values of $m$ are described in the sequel.)

For the {\em binary} representation of a given document, choose the $m$-dimensional
binary vector whose $i^{th}$ entry is $1$ if the $i^{th}$ keyword
appears in the document and $0$ if it does not.

For the {\em frequency} representation, choose the $m$-dimensional real-valued
vector whose $i^{th}$ entry is the normalized frequency of appearance
of the $i^{th}$ keyword in the given document.
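
As a small illustration, take $m=4$ with the keywords
(\textit{oil}, \textit{price}, \textit{market}, \textit{wheat}) and a document
in which they appear $2$, $1$, $1$ and $0$ times, respectively.
These keywords and counts are hypothetical. The example also assumes that
``normalized'' means division by the total number of keyword occurrences in the
document, which is only one possible choice.
Under these assumptions the binary and frequency representations would be
$$ x_{\mathrm{binary}} = (1,\, 1,\, 1,\, 0), \qquad
   x_{\mathrm{frequency}} = \left(\frac{2}{4},\, \frac{1}{4},\, \frac{1}{4},\, 0\right)
                          = (0.5,\, 0.25,\, 0.25,\, 0). $$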

For the {\em tf-idf} (``term frequency, inverse document frequency'')
representation, choose the $m$-dimensional
real-valued vector whose $i^{th}$ entry is given by the formula
$$ \textrm{tf-idf}(\textit{keyword}) = \textit{frequency}(\textit{keyword}) \cdot
   \left[ \log \frac{n}{N(\textit{keyword})} + 1 \right], $$
where $n$ is the total number of words in the dictionary and $N$ is a function
giving the total number of documents in which the keyword appears.
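
For instance, take the purely hypothetical values
$\textit{frequency}(\textit{keyword}) = 0.5$, $n = 1000$ and
$N(\textit{keyword}) = 10$, and assume the natural logarithm (the base is not
fixed by the formula above). The entry would then be
$$ 0.5 \cdot \left[ \log \frac{1000}{10} + 1 \right]
   = 0.5 \cdot \left[ \log 100 + 1 \right]
   \approx 0.5 \cdot 5.605 \approx 2.80 . $$
For the same frequency, a keyword that appears in fewer documents (smaller $N$)
receives a larger weight.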


The {\em Hadamard product} representation was discovered experimentally;
it consists of the $m$-dimensional vector whose $i^{th}$ entry is the
product of the frequency of the $i^{th}$ keyword in the document and its
frequency over all documents in the training set.
See \citet{ManevitzMalik1} for further discussion of this representation.
In any case, it is clear that this transformation emphasizes differences
between large and small feature entries.
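
To illustrate with hypothetical numbers, suppose a document has keyword
frequencies $(0.5,\, 0.25,\, 0.25,\, 0)$ and the corresponding frequencies over
the training set are $(0.4,\, 0.1,\, 0.3,\, 0.2)$. How the training-set frequency
is normalized is an assumption of this example.
The entrywise product is
$$ (0.5 \cdot 0.4,\; 0.25 \cdot 0.1,\; 0.25 \cdot 0.3,\; 0 \cdot 0.2)
   = (0.2,\; 0.025,\; 0.075,\; 0), $$
so in this example the ratio between the largest and smallest nonzero entries
grows from $2$ in the frequency vector to $8$ in the product.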



\subsection{Data Set and Measurements}
To test the above ideas, we applied these filters to the
standard {\it Reuters} dataset \citep{RDL}, a preclassified collection of
short articles. It is one of the standard test-beds for
information retrieval algorithms \citep{DPHS}.

For each category, we trained on 25\% of the
positive data from the training set and then tested the filters on
the remaining 75\% of the data set.
%Table 1
Table~\ref{table-training-test-items}
shows the ten most frequent categories along with the number of training
and test examples in each.

\input{table-reuter-splite.tex} %table-training-test-items
 
We treated each of the 10 categories as a binary classification task
and evaluated the classifiers for each category separately.

For reporting the results, we used 
%two kinds of measures:
%(1) the number accepted versus the total number of documents
%\marginpar{where did we use the number accepted?}
the $F_1$ measure,  the recall and the precision values.

For text categorization, the effectiveness measures recall and precision
are defined as follows:


$$\textit{recall} = \frac{\textit{Number of items of category identified}}
                      {\textit{Number of category members in test set}}$$

$$\textit{precision} = \frac{\textit{Number of items of category identified}}
                      {\textit{Total items assigned to category}}$$
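
As a hypothetical example, suppose the test set contains $100$ members of a
category and the classifier assigns $80$ test items to that category, of which
$60$ are actual members. Then
$$ \textit{recall} = \frac{60}{100} = 0.6, \qquad
   \textit{precision} = \frac{60}{80} = 0.75 . $$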




% Van Rijsbergen \citep{Rij} defined the $F_1$-measure as a combination of 
Van Rijsbergen (1979) defined the $F_1$ measure as a combination of
recall ($R$) and precision ($P$)
with equal weight, in the following form:
$$F_1(R,P) = \frac{2RP}{R+P}.$$
(This implies that $F_1$, like $R$ and $P$, is bounded by $1$, and that higher
values under this measure indicate better results.)
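
For example, with the hypothetical values $R = 0.6$ and $P = 0.75$,
$$ F_1(0.6,\, 0.75) = \frac{2 \cdot 0.6 \cdot 0.75}{0.6 + 0.75}
                    = \frac{0.9}{1.35} \approx 0.667, $$
which lies between $R$ and $P$ and closer to the smaller of the two, as
expected of a harmonic mean.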
%\marginpar{Malik, this is OK?}


