\problem{Markov and Hidden Markov Models}

In this problem, we will use Markov and Hidden Markov models to
identify the language of written sentences. For simplicity our
representation of text will include only 27 symbols--- the 26 letters
of the Latin alphabet, and the space symbol. Any accented letter is
represented as a non-accented letter, none-Latin letters are converted
to their closest Latin letters, and punctuation is removed. This
representation naturally looses quite a bit of information compared to
the original ASCII text. This 'handicap' is in part intentional so
that the classification task would be a bit more challenging.

Most of the MATLAB code you will need here will be given. You will
find the following routines useful (here and perhaps in some of your
projects as well):
\begin{description}
\item[{\tt readlines.m}] Reads a named text file, returning a cell
array of the lines in the file. To get line {\tt i} of cell-array {\tt
lines} returned from, e.g., {\tt lines = readlines('cnn.eng')}, use
{\tt lines\{i\}}.

\item[{\tt text2stream.m}] Converts a string (a line of text) into a
row vector of numbers in the range $\{1,\ldots,27\}$, according to the
representation discussed above. So, for example, {\tt numberline =
text2stream(lines{1})} would convert the first line of text from {\tt
lines} into a row vector of numbers. The conversion of the full output
of {\tt readlines} would have to be done line by line.

\item[{\tt count.m}] Given text in a row vector representation
and a width $k$, the function computes the count of all $k$-grams in
the array. In other words, the function returns a k-dimensional array
representing the number of times each configuration of $k$ successive
letters occurs in the text. 

\item[{\tt totalcount.m}] This function allows you to compute
the accumulated counts from each of the lines of text returned
by {\tt readlines}. Use this function to find the training counts for
the different languages. 

\end{description}

\subsection*{Language identification using Markov models}

Here we will construct a language classifier by using Markov models as
class-conditional distributions. In other words, we will separately
train a Markov model to represent each of the chosen languages:
English, Spanish, Italian and German. The training data is given in
the files {\tt cnn.eng, cnn.spa, cnn.ita, cnn.ger}, which contain
several news articles (same articles in different languages), one
article per line.

We will first try a simple independent (zeroth-order Markov)
model. Under this model, each successive symbol in text is chosen
independently of other symbols. The language is in this case
identified based only on its letter frequencies.

\begin{question}
Write a function {\tt naiveLL(stream,count1)} which takes a 1-count
(frequency of letters returned by {\tt count.m}) and evaluates the
log-likelihood of the text stream (row vector of numbers) under the
independent (zeroth-order Markov) model.
\end{question}

Extract the total 1-counts from the language training sets described
above. Before proceeding, let's quickly check your function {\tt
naiveLL}. If you evaluate the log-likelihood of {\tt 'This is an
example sentence'} using the English 1-counts from {\tt cnn.eng},
you'll get -76.5690, while the Spanish log-likelihood of the same
sentence is -77.2706.

\begin{question}
Write a short function {\tt naiveC} which takes a stream, and several
1-counts corresponding to different languages, and finds the
maximum-likelihood language for the stream. You could assume, e.g.,
that the 1-counts are stored in an array, where each column
corresponds to a specific language. The format of the labels should be
in correspondence with the {\tt test\_labels} described below.
\end{question}

The files {\tt song.eng, song.spa, song.ita, song.ger} contain
additional text in the four languages. We will use these as the test
set. For your convenience, we have provided you with script {\tt
generate\_test.m} :  
\begin{verbatim}
test_sentences = [ readlines('song.eng') ; ...
		   readlines('song.ger') ; ...
		   readlines('song.spa') ; ...
		   readlines('song.ita') ] ;
test_labels = [ ones(17,1) ; ones(17,1)*2 ; ones(17,1)*3 ; ones(17,1)*4 ]
\end{verbatim}

In order to study the performance of the classifier as a function of
the length of test strings, we will classify all prefixes of the lines
in the test files. The provided routine {\tt testC.m} calculates the
success probability of the classification, for each prefix length,
over all the streams or strings in a given cell-array. You can
call this function, e.g., as follows
\begin{verbatim}
successprobs = testC(test_sentences,test_labels,'naiveC',count1s)}
\end{verbatim} 
where {\tt count1s} provides the array of training counts that your
function {\tt naiveC} should accept as an input.

\begin{question}
Plot the success probability as a function of the length of the
string. What is the approximate number of symbols that we need to
correctly assign new piece of text to one of the four languages?
\end{question}

In order to incorporate second order statistics, we will now move on
to modeling the languages with first-order Markov models.

\begin{question}
Write a function {\tt markovLL(stream,count2)} which returns the
log-likelihood of a stream under a first-order Markov model of the
language with the specified 2-count. For the initial probabilities,
you can use 1-counts calculated from the 2-counts.
\end{question}

Quick check: The English log-likelihood of {\tt 'This is an example
sentence'} is -63.0643, while its Spanish log-likelihood is -65.4878.
We are again assuming that you are using the training sets described
above to extract the 2-counts for the different languages.

\begin{question}
Write a corresponding function {\tt markovC.m} that classifies a
stream based on Markov models for various languages, specified by
their 2-counts.
\end{question}

\begin{question}
Try to classify the sentence {\tt 'Why is this an abnormal English
sentence'}. What is its likelihood under a Markov model for each of the
languages ? Which language does it get classified as ? Why does it not
get classified as English ?
\end{question}

When estimating discrete probability models, we often need to
regularize the parameter estimates to avoid concluding that any
configuration (e.g., two successive characters) have probability zero
unless such the configuration appeared in the limited training
set. Let $\{\theta_i\}$, where $\sum_{i=1}^m \theta_i = 1$, define a
discrete probability distribution over $m$ elements
$i\in\{1,\ldots,m\}$. For the purposes of estimation, we treat
$\theta_i$ as parameters. Given now a training set summarized in terms
of the number of occurrences of each element $i$, i.e., $\hat n_i$,
the maximum likelihood estimate of $\{\theta_i\}$ would be
\[
\hat\theta_i = \frac{\hat n_i}{\sum_{j=1}^m \hat n_j}
\]
This is zero for all elements $i$ that did not occur in the training
set, i.e., when $\hat n_i = 0$. To avoid this problem, we introduce a
prior distribution over the parameters $\{\theta_i\}$:
\[
P(\theta) = \frac{1}{Z}\,\prod_{i=1}^m \theta_i^{\alpha_i}
\]
where $Z$ is a normalization constant and $\alpha_i\geq 0$'s are known
as {\em pseudo-counts}. This prior, known as a {\em Dirichlet}
distribution, assigns the highest probability to
\[
\tilde\theta_i  = \frac{\alpha_i}{\sum_{j=1}^m \alpha_j}
\]
Thus $\alpha_i$'s behave as if they were additional counts from some
prior (imaginary) training set.

We can combine the prior with the maximum likelihood criterion by
maximizing instead the penalized log-likelihood of the data (expressed
here in terms of the training set counts $\hat n_i$):
\[
J(\theta) &=& \overbrace{\sum_{i=1}^m \hat n_i \log \theta_i}^{\mbox{log-probability of
data}} + 
\overbrace{\log P(\theta)}^{\mbox{log-prior}}
\\
&=& \sum_{i=1}^m \hat n_i \log \theta_i + \sum_{i=1}^m \alpha_i \log
\theta_i + \mbox{constant}
\\
&=& \sum_{i=1}^m (\hat n_i+\alpha_i) \log \theta_i + \mbox{constant}
\]
The maximizing $\{\theta_i\}$ is now 
\[
\hat\theta_i  = \frac{\hat n_i + \alpha_i}{\sum_{j=1}^m (\hat n_j + \alpha_j)}
\]
which will be non-zero whenever $\alpha_i>0$ for all
$i=1,\ldots,m$. Setting $\alpha_i = 1/m$ would correspond to having a
single prior observation distributed uniformly among the possible
elements $i\in\{1,\ldots,m\}$. Setting $\alpha_i = 1$, on the other
hand, would mean that we had $m$ prior observations, observing each
element $i$ exactly once.

\begin{question}
Add pseudocounts (one for each configuration) and reclassify the test
sentence. What are the likelihoods now. Which language does the
sentence get classified as ?
\end{question}

\begin{question}
Use {\tt testC.m} to test the performance of Markov-based
classification (with the corrected counts) on the test set. Plot the
correct classification probability as a function of the text
length. Compare the classification performance to that of {\tt
naiveC.m}. (Turn in both plots).
\end{question}


\subsection*{Hidden Markov Models}

We will now turn to a slightly more interesting problem of language
segmentation: given a mixed-language text, we would like to identify
the segments written in different languages. 

Consider the following simple approach: we will first find the
character statistics for each language by computing the 1-counts from
the available training sets as before. We will then go through the new
text character by character, classifying each character to the most
likely language (language whose independent model assigns the highest
likelihood to the character). In other words, we would use {\tt
naiveC} to classify each character in the new text. To incorporate
higher-order statistics, we could train a Markov model for each
language (as before), and again assign the characters in the text to
the most likely language.

\begin{question}
Why would we expect the resulting segmentation not to agree with the
true segmentation? What would the resulting segmentation look like?
What is the critical piece of information we are not using in this
approach ?
\end{question}

We would rather model the multi-lingual text with a hidden Markov
model. For simplicity, we will focus on segmenting text written in
only two languages.

\begin{question}
Suggest how a hidden Markov model can solve the problem you identified
above. Provide an annotated transition diagram (heavy lines for larger
transition probabilties) and desribe what the output probabilities
should be capturing. 
\end{question}

For some setting of the parameters, this HMM probably degenerates to
the independent model. Make sure you understand when this happens.

Now, load the example text from the file {\tt segment.dat}. The text,
mixed German and Spanish, is given in the variable {\tt gerspa} as
numbers in the range 1..27 (as before). You can use the provided
function {\tt stream2text.m} to view the numbers as letters.

The routine {\tt [hmm,ll] = esthmm(data,hmm)} uses EM to find the
maximum likelihood parameters of a hidden Markov model, for a given
output sequence(s). As discussed earlier, the EM algorithm has to
start its search with some parameter setting. The {\tt hmm} input
argument to {\tt esthmm} provides this initial parameter setting.

The routine {\tt hmm = newhmm(numstates,numout)} creates random HMM
parameters for an HMM with {\tt numstates} hidden states, and {\tt
numout} output symbols. An optional third argument controls the
``non-uniformity'' of the parameters.

{\bf Note:} The orientation of the matrices {\tt Po} and {\tt Pt} is
different from those in recitation: {\tt Pt(i,j)}$=
P(state_t=j|state_{t-1}=i)$ and {\tt Po(i,a)}$= P(O_t = a|state_t=i)$.

With two different initializations, you will probably find two
different HMMs. It is a good idea to run the EM algorithm multiple
times, starting from slightly different initial settings. Every such
run can potentially lead to a different result. Make sure you
understand how you would choose among the alternative solutions.

\begin{question}
Estimate a two-state hidden Markov model with two different types of
initial settings of the state transition probabilities. First, set the
transition probabilities close to uniform values. In the second
approach, make the ``self-transitions'' quite a bit more likely than
the transitions to different states. Which initialization leads to a
better model?
\end{question}

Now try to segment {\tt gerspa} into German and Spanish, using a
two-state hidden Markov model. You can then use the routine {\tt
viterbi(sequence,hmm)} to examine the most likely (maximum a
posteriori probability or MAP) state sequence. The correct
segmentation of {\tt gerspa} is given in {\tt gerspa\_lang}.

\begin{question}
Examine the difference between your segmentation and the correct one. 
Do the errors make sense? 
\end{question}

\iffalse
\begin{begin}
Under the model, what is the conditional probability of the MAP
state sequence, given the observed text ? Explain how you calculated
this probability, including any MATLAB code you used or modified.
\end{begin}
\fi

\begin{question}
Use the provided routine {\tt hmmposteriors(sequence,hmm)} to
calculate the per character posterior probabilities over the
states. These are the $\gamma_t(i)$ probabilities described in
lectures ($t$ gives the character position in text and $i$ specifies
the state). Plot these probabilities as a function of the character
position in the text sequence and turn in the plot. You might want to
re-scale the axis using {\tt axis[0 3000 0 1.1]}.
\end{question}

\begin{question}
Find the sequence of maximum a posteriori probability states. That is,
for each time point $t$, find the state $i$ with the maximum a
posteriori probability $P(state_t=i|{\text observed seq})$. Compare
this state sequence with the maximum a posteriori state sequence. Are
they different? 
\end{question}

\begin{question}
Give an example of a hidden Markov model, with two hidden states and
two output symbols, and a length two output sequence, such that the
MAP state sequence differs from the sequence of MAP states. Be sure to
specify all the parameters of the HMM (these need not be the maximum
likelihood parameters for this specific length two output
sequence).
\end{question}

In the above example, we assumed no knowledge about the languages. We
would like to improve the segmentation by incorporating some
additional information about the languages such as single-character
statistics.

\begin{question}
What parameters of the HMM still need to be estimated ? Modify the
routine {\tt esthmm.m} accordingly. (Turn in the modifications)
\end{question}

\begin{question} {\em (Optional)}
Use the single-character statistics you calculated from {\tt cnn.spa}
and {\tt cnn.ger} to segment the text. What is the maximum likelihood
estimate of the remaining parameters ? Compare the resulting most
likely state sequence to the one you found without this prior
information. Are they different ?
\end{question}

\begin{question} 
Hidden Markov models can also be useful for classification. Suggest
how you would use HMMs in place of the Markov models described above
to identify the language of any specific (single-language)
document. Specifically, list the routines we have provided that you
could make use of and describe which additional routines you would
need.
\end{question}

\begin{question}
{\em (Optional) } Use hidden Markov models, with a varying number of
states, for the language classification task investigated in the first 
part of this problem. Using the same training set and test set as
before, create a graph of the probability of correct classification as 
a function of the length of the text. Discuss how the results compare
to using naive classification and first-order Markov models, and how
changing the number of hidden states affects the results.
\end{question}
