\problem{Density Estimation using Mixtures of Gaussians}

In this problem, we will fit mixture of Gaussians models to traced
handwritten digits using the EM algorithm. Most of the code and
utilities that you will need will be provided.

{\tt EM.m} fits a mixture of $k$ Gaussian distributions to the
training data. The simplest way to call this function is {\tt
[params,loglikes] = EM(X,k)}, where {\tt params} contains the
estimated parameters (mixing proportions and the means and covariances
of each of the component Gaussian); {\tt loglikes} stores the log-data
likelihoods of the mixture models we produce in the course of the EM
iterations. The following functions are called inside {\tt EM.m}:

\begin{description}
\item[{\tt Estep}] Performs the {\em expectation} step: for each data
point, we calculate the posterior probability that it was drawn from 
each of the component Gaussians. 

\item[{\tt Mstep}] Performs the {\em maximization} step: finds the
maximum likelihood settings of the parameters assuming we already have
the posterior probabilities computed in the E-step. 

\item[{\tt mixtureLL}] Calculates the log-likelihood of the data given 
the current setting of the mixture model parameters (component
Gaussians as well as the mixing proportions). 
\end{description}

\begin{question}
Complete the function {\tt Mstep.m} in terms of updating the means and
the covariances of the Gaussian components (use the skeleton we have
provided).
\end{question}

Our implementation of the E-step ({\tt EstepX.m}) deviates slightly
from a ``straight-forward'' implementation. Modifications of this type
are necessary in practice to avoid numerical problems.

\begin{question}
By looking through {\tt EstepX.m}, {\tt mixtureLLX.m} and {\tt
evalmixtureX.m} can you suggest what the problem is that these
routines are trying to avoid?
\end{question}

In the following questions, we study a dataset of handwritten digits
given in the file {\tt pendigits.mat} (type {\tt load pendigits} to
load it into {\tt MATLAB}). The digits were drawn with a stylus input
device. The raw trajectories of the digits can be found in the files
{\tt pendigits-orig.(tra|tes|names)}. The trajectories were then
scaled to normalize them, and eight equi-distant points along the
trajectory were extracted from each digit sample. We will use versions
of the resulting 16-dimensional dataset (two dimensions for each of
the eight points). You can find more information in the file {\tt
pendigits.names}.

The provided routine {\tt showdigit.m} can be used to plot the digits:
\begin{verbatim}
>> showdigit(pentrain8(42,:))
>> showdigit(pentrain4(42,:))
\end{verbatim}
The green circle is the beginning of the trajectory. Each row of the
data matrix {\tt pentrain4} contains only four points, the means of
every two consecutive points.

In order to visualize our estimation results, we will simplify the
problem a bit by considering only the second point in {\tt pentrain4},
given by {\tt pentrain4(:,3:4)}. Moreover, we will restrict ourselves
to just digits three, four, and five. The resutling $2219
\times 2$ matrix {\tt second345} contains the relevant points.

\begin{question}
Use {\tt EM.m} to fit the distribution of points in {\tt second345} as
a mixture of six Gaussian distributions. Plot the resulting
distribution, and the resulting components, using the routine {\tt
plotmixture2D.m} that we have provided. You can call {\tt
plotmixture2D.m}, e.g., as follows: 
\begin{verbatim}
>> plotmixture2D(second345,'evalmixtureX',params)
\end{verbatim}
where {\tt params} specify the 6-component mixture model. The blue
lines in the contour plot correspond to each of the component
Gaussians in the mixture and the red lines provide the contours of the
6-component mixture model. At the center of each Gaussian component,
you also see the numerical value of the mixing proportion allocated to
that component.

Turn in the contour plot (no need to turn in the 3D plot).
\end{question}

The resulting log-likelihood should be a bit over 1000, and EM should
converge after a few hundred iterations.

\begin{question}
Rerun EM on the same data-set a few times, looking at the resulting
plots and the log-likelihoods. Explain why you do not always get
close to the same distribution, and why you sometimes do get the exact
same distribution (and not just a simmilar one).
\end{question}

Plot the log-likelihood at each EM iteration (this is returned as the
second output value of {\tt EM.m}). Notice (sometimes) the step-like
behavior of the log-likelihood: the log-likelihood is almost unchanged
throughout many iterations, and then increases dramatically after a
few additional iterations. You can use the history of mixture
parameters (returned as the third output value of {\tt EM.m}) to see
how the mixture distribution itself changes in relation to the
log-likelihood. It takes time to resolve how to best allocate the
mixture components, especially if (some of) the Gaussian components
are only slightly different from each other in the current model.

\subsubsection*{Initialization}

The EM algorithm describes how to {\em update} the mixture parameters
in order to improve the log-likelihood of the data. This search
process must start somewhere. An optional parameter to {\tt EM.m} can
be used to control the way the mixture parameters are initialized.

\begin{question}
The provided routine {\tt initSame.m} sets all mixture components to
have zero mean, unit (spherical) covariances, and equal mixing
proportions. Try using it to fit {\tt second345}. How many iterations
does it take for EM to converge? Why?  Describe what happens at each
iteration (you might want to set the {\tt histfreq} parameter to
one, and plot the mixture at each iteration).
\end{question}

\begin{question}
What would happen if the means and covariances were all equal, but the
initial mixing proportions were chosen randomly? If the means and
mixing proportions were equal but the initial covariance matrices were
chosen randomly?
\end{question}

The default initialization routine used by {\tt EM.m} is {\tt
initRandPoints.m}, which initializes the means to random data points,
and sets the covariance matrices to diagonal matrices with the same
overall variance parameter computed from the overall covariance of the
data.

\subsubsection*{Data anomalies and regularization} 

The actual data was given as integers in the range 0..100 in each
coordinate.  The matrix {\tt pentrain8}, and its derivatives {\tt
pentrain4} and {\tt second345} represent the original data scaled to
[0,1] with uniform noise added.

The matrix {\tt cleanpentrain8} contains the 0..100 integer valued
data. 

\begin{question}
Try estimating a 5-component mixture model to the first point {\tt
cleanpentrain8(:,1:2)}. You can also try the rest of the time points
{\tt cleanpentrain8(:,2t-1:2t)}. Why does EM fail? What happens ? 
Hint: plot the distribution using
\begin{verbatim}
>> plot(cleanpentrain8(:,1),cleanpentrain8(:,2),'.')
\end{verbatim}
\end{question}

\begin{question}
Would we be able to resolve the above problem by introducing
regularization to the M-step? Regularization would modify how we
update the covariance matrix of each of the component Gaussians in the
M-step. Let $\hat{n}_i$ is the number of ``soft'' counts assigned to
component $i$ in the E-step and $\hat{\Sigma}_i$ the corresponding
(unregularized) estimate of the covariance matrix. We can now imagine
having an additional set of $n'$ points with covariance $S$ (the prior
covariance matrix) also assigned to component $i$. These additional
points would change the covariance update according to
\[
\hat\Sigma'_i = (\frac{\hat{n}_i}{\hat{n}_i+n'})\,\hat{\Sigma}_i
+ (\frac{n'}{\hat{n}_i+n'})\,S
\]
where $\hat\Sigma'_i$ is the new regularized estimate of the
covariance matrix for component $i$. (We do not expect you to try this
out but answer the question based on what you think would happen).
\end{question}


\subsection*{Unsupervised Learning Using Mixture Models}

We continue to use the trajectories in an unsupervised manner, i.e.,
not paying attention to the accompanying labels. We can use the mixture
models to try to partition the set of input samples (trajectories)
into {\em clusters}. We will also assess how the emerging cluster
identities will correlate with the actual labels.

The first question we have to answer is how many Gaussian components
we should use in the mixture model. How could we decide the issue? We
could proceed similarly to the case of setting the kernel width
parameter in non-parametric density estimation, i.e., via
cross-validation. The problem here, however, is that estimating any
single mixture model with say 5 components already takes a fair amount
of computing time and repeating this process $n$ times ($n$ is the
number of training examples) does not appear very attractive. Instead,
we will use a simple approximate criterion or {\em Bayesian
information criterion} to find the appropriate number of
clusters. This involves adding a complexity penalty to the log-data
likelihoods. In other words, if $LL(k)$ is the log-likelihood that our
k-component mixture model assigns to the training data after training,
we find $k$ that maximizes
\[
score(k) = LL(k) - \frac{1}{2} D_k\log(n)
\]
where $D_k$ denotes the number of parameters in the k-component
mixture model. For a Gaussian mixture mode, $D_k =
k*(d+d*(d+1)/2)+k-1$, where $d$ is the dimensionality of the examples
(e.g., 2). Note that the penalty is a function of the number of
parameters and the number of training examples $n$.

\begin{question}
Use different numbers of mixture components with {\tt second345} and
select the ``optimum'' number of components. Use {\tt mixtureModSel.m}
to plot the scores for $k=1,\ldots,8$. Return the plot. Your choice of
$k$ may change if you re-estimate the mixture models (make sure you
understand why).
\end{question}

Note that there may be more clusters than than there are different
types of digits. Some of the digits may have sub-types or our
assumption that the clusters look like Gaussians may be invalid and we
need more than one Gaussian to account for any single real cluster in
the data.

Now, given the ``best'' mixture model, we would like to correlate the
cluster identities with the available labels. We can do this to find
out how well we could capture relevant structure in the data without
actually knowing anything about the labels.

\begin{question} The labels are provided in {\tt labels345}. Write a
simple matlab routine to get hard assignments of training examples to
clusters. You'll probably want to use {\tt mixtureScaledPiX} for this
purpose. Use {\tt correlate.m} to correlate the cluster identities
with the available labels {\tt labels345}. Return your MATLAB code and
the correlation matrix. Briefly discuss whether the approach was
successful, i.e., whether we could associate the clusters with
specific labels.
\end{question}


\subsection*{Using Mixture Models for Classification}

We will now try to use the Gaussian mixture models in a classification
setting. We can estimate class-conditional density models (mixture
models) separately for each class and classify any new test example on
the basis of which class-conditional model assigns the highest
likelihood to the example.

\begin{question}
Using the provided {\tt fitAllDigits.m}, fit a 3-component Gaussian
mixture model to each of the digits in {\tt pentrain8,trainlabels}.
\end{question}

\begin{question}
Complete the routine {\tt classify.m} that takes as input the examples
to be classified and a cell-array of mixture parameters (as returned
by {\tt fitAllDigits.m}) to return the maximum likelihood assignment of
labels to the examples. Provide the code for {\tt classify.m}.
\end{question}

\begin{question}
Classify the {\tt pentrain8} and {\tt pentest8}, and use {\tt
correlate.m} to compare the results to the true labelings in {\tt
trainlabels} and {\tt testlabels}. Turn in the two correlation
matrices. 
\end{question}

For comparison, you probably want to repeat the above for 1-component
Gaussian mixtures. You could also use the model selection routine we
have provided to find a different number of mixture components for
each class-conditional density. Increasing the number of components
per class would eventually force you to use regularized estimates (see
above).