\problem{Density Estimation using Mixtures of Gaussians}

In this problem, we will fit mixture of Gaussians models to traced
handwritten digits using the EM algorithm. Most of the code and
utilities that you will need will be provided.

{\tt EM.m} fits a mixture of $k$ Gaussian distributions to the
training data. The simplest way to call this function is {\tt
[params,loglikes] = EM(X,k)}, where {\tt params} contains the
estimated parameters (mixing proportions and the means and covariances
of each of the component Gaussian); {\tt loglikes} stores the log-data
likelihoods of the mixture models we produce in the course of the EM
iterations. The following functions are called inside {\tt EM.m}:

\begin{description}
\item[{\tt Estep}] Performs the {\em expectation} step: for each data
point, we calculate the posterior probability that it was drawn from 
each of the component Gaussians. 

\item[{\tt Mstep}] Performs the {\em maximization} step: finds the
maximum likelihood settings of the parameters assuming we already have
the posterior probabilities computed in the E-step. 

\item[{\tt mixtureLL}] Calculates the log-likelihood of the data given 
the current setting of the mixture model parameters (component
Gaussians as well as the mixing proportions). 
\end{description}

\begin{question}
Complete the function {\tt Mstep.m} in terms of updating the means and
the covariances of the Gaussian components (use the skeleton we have
provided).
\end{question}
\begin{answer}
\begin{verbatim}
function compParams = Mstep(X, p)
  [n,d] = size(X) ;
  k = size(p,2) ;
  compParams = [] ;
  for i=1:k
    reped_p = repmat(p(:,i),1,d) ;	% This can come in usefull
    ni = sum(p(:,i));			% Total 'weight' of ith component
    prior = ni/n;
    mu = sum( reped_p.*X )/ni; % weighted mean
    XminusMu = op(X,'-',mu);
    Sigma = (reped_p.*XminusMu)'*XminusMu/ni; % weighted covariance
    compParams = [compParams, struct('prior',prior,'mean',mu,'cov',Sigma)];
  end;    
\end{verbatim}
\end{answer}
\score{7 points}

Our implementation of the E-step ({\tt EstepX.m}) deviates slightly
from a ``straight-forward'' implementation. Modifications of this type
are necessary in practice to avoid numerical problems.

\begin{question}
By looking through {\tt EstepX.m}, {\tt mixtureLLX.m} and {\tt
evalmixtureX.m} can you suggest what the problem is that these
routines are trying to avoid?
\end{question}
\begin{answer}
The numeric values of the densities are extremely large. Multiplying
many of these probabilities quickly leads to numeric overflow.
\end{answer}
\score{1 point}

In the following questions, we study a dataset of handwritten digits
given in the file {\tt pendigits.mat} (type {\tt load pendigits} to
load it into {\tt MATLAB}). The digits were drawn with a stylus input
device. The raw trajectories of the digits can be found in the files
{\tt pendigits-orig.(tra|tes|names)}. The trajectories were then
scaled to normalize them, and eight equi-distant points along the
trajectory were extracted from each digit sample. We will use versions
of the resulting 16-dimensional dataset (two dimensions for each of
the eight points). You can find more information in the file {\tt
pendigits.names}.

The provided routine {\tt showdigit.m} can be used to plot the digits:
\begin{verbatim}
>> showdigit(pentrain8(42,:))
>> showdigit(pentrain4(42,:))
\end{verbatim}
The green circle is the beginning of the trajectory. Each row of the
data matrix {\tt pentrain4} contains only four points, the means of
every two consecutive points.

In order to visualize our estimation results, we will simplify the
problem a bit by considering only the second point in {\tt pentrain4},
given by {\tt pentrain4(:,3:4)}. Moreover, we will restrict ourselves
to just digits three, four, and five. The resulting $2219
\times 2$ matrix {\tt second345} contains the relevant points.

\begin{question}
Use {\tt EM.m} to fit the distribution of points in {\tt second345} as
a mixture of six Gaussian distributions. Plot the resulting
distribution, and the resulting components, using the routine {\tt
plotmixture2D.m} that we have provided. You can call {\tt
plotmixture2D.m}, e.g., as follows: 
\begin{verbatim}
>> plotmixture2D(second345,'evalmixtureX',params)
\end{verbatim}
where {\tt params} specify the 6-component mixture model. The blue
lines in the contour plot correspond to each of the component
Gaussians in the mixture and the red lines provide the contours of the
6-component mixture model. At the center of each Gaussian component,
you also see the numerical value of the mixing proportion allocated to
that component.

Turn in the contour plot (no need to turn in the 3D plot).
\end{question}
\begin{answer}
The resulting mixture might be (you might converge to a different mixture):
\end{answer}

\includegraphics[width=0.8\textwidth]{p1q3a}

\score{1 point}

It is also interesting to note the improvement in log likelihood over
the progress of EM. Note that we actually got extremely close to the
solution after about 150 iterations:

\includegraphics[width=0.5\textwidth]{p1q3b}

The resulting log-likelihood should be a bit over 1000, and EM should
converge after a few hundred iterations.

\begin{question}
Rerun EM on the same data-set a few times, looking at the resulting
plots and the log-likelihoods. Explain why you do not always get
close to the same distribution, and why you sometimes do get the exact
same distribution (and not just a similar one).
\end{question}
\begin{answer}
The EM algorithm converges to a local optimum. Initializing the
mixture differently might lead to convergence to a very different
local optimum. However, once we are close enough to a local optimum,
we will always end up in it, and so we get the same mixture with
different starting points.

In some way, the space of mixtures is partitioned to parts of the
space that all lead to the same local optimum. The is similar to
slopes of a mountain-- think of partitioning a mountain range according
to which peak you would get to if you just went up. If you started in
different places, you might end up on different peaks. But for each
peak, there is a large area which leads to it.
\end{answer}
\score{3 points}

Plot the log-likelihood at each EM iteration (this is returned as the
second output value of {\tt EM.m}). Notice (sometimes) the step-like
behavior of the log-likelihood: the log-likelihood is almost unchanged
throughout many iterations, and then increases dramatically after a
few additional iterations. You can use the history of mixture
parameters (returned as the third output value of {\tt EM.m}) to see
how the mixture distribution itself changes in relation to the
log-likelihood. It takes time to resolve how to best allocate the
mixture components, especially if (some of) the Gaussian components
are only slightly different from each other in the current model.

\subsubsection*{Initialization}

The EM algorithm describes how to {\em update} the mixture parameters
in order to improve the log-likelihood of the data. This search
process must start somewhere. An optional parameter to {\tt EM.m} can
be used to control the way the mixture parameters are initialized.

\begin{question}
The provided routine {\tt initSame.m} sets all mixture components to
have zero mean, unit (spherical) covariances, and equal mixing
proportions. Try using it to fit {\tt second345}. How many iterations
does it take for EM to converge? Why?  Describe what happens at each
iteration (you might want to set the {\tt histfreq} parameter to
one, and plot the mixture at each iteration).
\end{question}
\begin{answer}
It takes only two iterations. In the E step of the first iteration,
every point is soft-assigned with equal weights to all the components
(since they are all identical). In the M step, all components will be
updated in the same way, since they all have the same weighted set of
points associated with them. In the second iteration, the E step will
not change the assignments, and so the M step will not change the
parameters, and so we have converged to a stationary point.

Note that this stationary point is very different from the local
optima we usually converge to. This is an {\em unstable} stationary
point--- any slight perturbation of the parameters will cause us to
start getting farther away from it (if we run EM).
\end{answer}
\score{2 points}

\begin{question}
What would happen if the means and covariances were all equal, but the
initial mixing proportions were chosen randomly? If the means and
mixing proportions were equal but the initial covariance matrices were
chosen randomly?
\end{question}
\begin{answer}
If only the mixing proportions were different, we would still get the
same behavior. All the points will have weights equal to the mixing
proportions, and all the components will remain identical.

If the covariances were different, farther away points will get
assigned more heavily to wider components. The components would then
start changing to reflect that, and EM would proceed normally.
\end{answer}
\score{3 points}

The default initialization routine used by {\tt EM.m} is {\tt
initRandPoints.m}, which initializes the means to random data points,
and sets the covariance matrices to diagonal matrices with the same
overall variance parameter computed from the overall covariance of the
data.

\subsubsection*{Data anomalies and regularization} 

The actual data was given as integers in the range 0..100 in each
coordinate.  The matrix {\tt pentrain8}, and its derivatives {\tt
pentrain4} and {\tt second345} represent the original data scaled to
[0,1] with uniform noise added.

The matrix {\tt cleanpentrain8} contains the 0..100 integer valued
data. 

\begin{question}
Try estimating a 5-component mixture model to the first point {\tt
cleanpentrain8(:,1:2)}. You can also try the rest of the time points
{\tt cleanpentrain8(:,2t-1:2t)}. Why does EM fail? What happens ? 
Hint: plot the distribution using
\begin{verbatim}
>> plot(cleanpentrain8(:,1),cleanpentrain8(:,2),'.')
\end{verbatim}
\end{question}
\begin{answer}
EM fails since we one of the components quickly converges to a
degenerate Gaussian. Since about a fifth of the points have the first
coordinate equal to exactly zero, we try to fit these points with a
Gaussian with zero variance in the first coordinate. But the
2-dimensional density of such Gaussian is infinite (its covariance
matrix is not invertible).
\end{answer}
\score{3 points}

\begin{question}
Would we be able to resolve the above problem by introducing
regularization to the M-step? Regularization would modify how we
update the covariance matrix of each of the component Gaussians in the
M-step. Let $\hat{n}_i$ is the number of ``soft'' counts assigned to
component $i$ in the E-step and $\hat{\Sigma}_i$ the corresponding
(unregularized) estimate of the covariance matrix. We can now imagine
having an additional set of $n'$ points with covariance $S$ (the prior
covariance matrix) also assigned to component $i$. These additional
points would change the covariance update according to
\[
\hat\Sigma'_i = (\frac{\hat{n}_i}{\hat{n}_i+n'})\,\hat{\Sigma}_i
+ (\frac{n'}{\hat{n}_i+n'})\,S
\]
where $\hat\Sigma'_i$ is the new regularized estimate of the
covariance matrix for component $i$. (We do not expect you to try this
out but answer the question based on what you think would happen).
\end{question}
\begin{answer}
This would help, since even if $\hat{\Sigma}_i$ is degenerate, since
$S$ is not degenerate, the resulting $\hat{\Sigma}'_i$ will not be
degenerate. By using this regularization we are avoiding getting close
to problematic areas of the space of mixture parameters.
\end{answer}
\score{1 point}

\subsection*{Unsupervised Learning Using Mixture Models}

We continue to use the trajectories in an unsupervised manner, i.e.,
not paying attention to the accompanying labels. We can use the mixture
models to try to partition the set of input samples (trajectories)
into {\em clusters}. We will also assess how the emerging cluster
identities will correlate with the actual labels.

The first question we have to answer is how many Gaussian components
we should use in the mixture model. How could we decide the issue? We
could proceed similarly to the case of setting the kernel width
parameter in non-parametric density estimation, i.e., via
cross-validation. The problem here, however, is that estimating any
single mixture model with say 5 components already takes a fair amount
of computing time and repeating this process $n$ times ($n$ is the
number of training examples) does not appear very attractive. Instead,
we will use a simple approximate criterion or {\em Bayesian
information criterion} to find the appropriate number of
clusters. This involves adding a complexity penalty to the log-data
likelihoods. In other words, if $LL(k)$ is the log-likelihood that our
k-component mixture model assigns to the training data after training,
we find $k$ that maximizes
\[
score(k) = LL(k) - \frac{1}{2} D_k\log(n)
\]
where $D_k$ denotes the number of parameters in the k-component
mixture model. For a Gaussian mixture mode, $D_k =
k*(d+d*(d+1)/2)+k-1$, where $d$ is the dimensionality of the examples
(e.g., 2). Note that the penalty is a function of the number of
parameters and the number of training examples $n$.

\begin{question}
Use different numbers of mixture components with {\tt second345} and
select the ``optimum'' number of components. Use {\tt mixtureModSel.m}
to plot the scores for $k=1,\ldots,8$. Return the plot. Your choice of
$k$ may change if you re-estimate the mixture models (make sure you
understand why).
\end{question}
\includegraphics[width=0.7\textwidth]{p1q5}

\score{1 point}

Note that there may be more clusters than than there are different
types of digits. Some of the digits may have sub-types or our
assumption that the clusters look like Gaussians may be invalid and we
need more than one Gaussian to account for any single real cluster in
the data.

Now, given the ``best'' mixture model, we would like to correlate the
cluster identities with the available labels. We can do this to find
out how well we could capture relevant structure in the data without
actually knowing anything about the labels.

\begin{question} The labels are provided in {\tt labels345}. Write a
simple MATLAB routine to get hard assignments of training examples to
clusters. You'll probably want to use {\tt mixtureScaledPiX} for this
purpose. Use {\tt correlate.m} to correlate the cluster identities
with the available labels {\tt labels345}. Return your MATLAB code and
the correlation matrix. Briefly discuss whether the approach was
successful, i.e., whether we could associate the clusters with
specific labels.
\end{question}
\begin{answer}
Using the functions:
\begin{verbatim}
function c = hard(X, compParams)
  scaledpix = mixtureScaledPiX(X,compParams);
  [dummy,c] = max(scaledpix,[],2);
\end{verbatim}
We get:
\begin{verbatim}
>> c = hard(second345,mm);
>> correlate(c,labels345)
ans =
     0    15   389
    26   140   184
     0   378     2
     0   228     8
   284     3     2
   352    16     2
    57     0   133
\end{verbatim}

From the above correlation matrix, we observe that most components do
correspond to separate digits. Up to a fairly low number of
``errors'', the digit ``3'' was almost completely separated, while the
digits ``4'' and ``5'' have one component in common.

The main problem is that by just looking at the seven components, we
could not know that the data actually consists of a mixture of three
digits. We would have suspected there are seven distinct classes in
the data.
\end{answer}
\score{2 points}

\subsection*{Using Mixture Models for Classification}

We will now try to use the Gaussian mixture models in a classification
setting. We can estimate class-conditional density models (mixture
models) separately for each class and classify any new test example on
the basis of which class-conditional model assigns the highest
likelihood to the example.

\begin{question}
Using the provided {\tt fitAllDigits.m}, fit a 3-component Gaussian
mixture model to each of the digits in {\tt pentrain8,trainlabels}.
\end{question}
\begin{answer}
\begin{verbatim}
>> dcp = fitAllDigits(pentrain8, trainlabels,3);
\end{verbatim}
\end{answer}
\score{No score}

\begin{question}
Complete the routine {\tt classify.m} that takes as input the examples
to be classified and a cell-array of mixture parameters (as returned
by {\tt fitAllDigits.m}) to return the maximum likelihood assignment of
labels to the examples. Provide the code for {\tt classify.m}.
\end{question}
\begin{answer}
\begin{verbatim}
function c = classify(X, classes_compParams)
  m = length(classes_compParams);
  n = size(X,1);
  p = zeros(n,m);
  for i=1:m
    p(:,i) = evalmixtureX(X, classes_compParams{i});
  end
  [dummy, c] = max(p,[],2);  
\end{verbatim}
\end{answer}
\score{5 points}

\begin{question}
Classify the {\tt pentrain8} and {\tt pentest8}, and use {\tt
correlate.m} to compare the results to the true labeling in {\tt
trainlabels} and {\tt testlabels}. Turn in the two correlation
matrices. 
\end{question}
\begin{answer}
\begin{verbatim}
>> trainc = classify(pentrain8,dcp);
>> testc = classify(pentest8,dcp);
>> correlate(trainc,trainlabels)
ans =
     0   775     0     1     0     0     0     5     0     0
     0     2   779     2     0     0     0     0     0     0
     0     0     1   714     0     0     0     0     0     0
     0     0     0     0   780     0     3     0     0     0
     0     1     0     0     0   720     1     0     0     0
     0     0     0     0     0     0   715     0     0     0
     0     0     0     0     0     0     0   772     0     0
     2     0     0     1     0     0     1     1   719     0
     1     1     0     1     0     0     0     0     0   719
   777     0     0     0     0     0     0     0     0     0

>> correlate(testc,testlabels)
ans =
     0   357     4     3     0     6     3    11     0     2
     0     5   359     0     0     0     0     0     0     0
     0     0     0   330     0     2     0     0     0     0
     0     1     0     0   362     1     0     0     0     0
     0     0     0     0     0   289     0     0     0     0
     1     0     0     0     0     0   325     0     0     0
     0     0     1     0     0     0     0   330     0     2
    16     0     0     2     0     5     8     0   336     1
     0     1     0     1     2    32     0    23     0   331
   346     0     0     0     0     0     0     0     0     0
\end{verbatim}
\end{answer}
\score{1 point}

For comparison, you probably want to repeat the above for 1-component
Gaussian mixtures. You could also use the model selection routine we
have provided to find a different number of mixture components for
each class-conditional density. Increasing the number of components
per class would eventually force you to use regularized estimates (see
above).