\problem{Markov and Hidden Markov Models}

In this problem, we will use Markov and Hidden Markov models to
identify the language of written sentences. For simplicity our
representation of text will include only 27 symbols--- the 26 letters
of the Latin alphabet, and the space symbol. Any accented letter is
represented as a non-accented letter, none-Latin letters are converted
to their closest Latin letters, and punctuation is removed. This
representation naturally looses quite a bit of information compared to
the original ASCII text. This 'handicap' is in part intentional so
that the classification task would be a bit more challenging.

Most of the MATLAB code you will need here will be given. You will
find the following routines useful (here and perhaps in some of your
projects as well):
\begin{description}
\item[{\tt readlines.m}] Reads a named text file, returning a cell
array of the lines in the file. To get line {\tt i} of cell-array {\tt
lines} returned from, e.g., {\tt lines = readlines('cnn.eng')}, use
{\tt lines\{i\}}.

\item[{\tt text2stream.m}] Converts a string (a line of text) into a
row vector of numbers in the range $\{1,\ldots,27\}$, according to the
representation discussed above. So, for example, {\tt numberline =
text2stream(lines{1})} would convert the first line of text from {\tt
lines} into a row vector of numbers. The conversion of the full output
of {\tt readlines} would have to be done line by line.

\item[{\tt count.m}] Given text in a row vector representation
and a width $k$, the function computes the count of all $k$-grams in
the array. In other words, the function returns a k-dimensional array
representing the number of times each configuration of $k$ successive
letters occurs in the text. 

\item[{\tt totalcount.m}] This function allows you to compute
the accumulated counts from each of the lines of text returned
by {\tt readlines}. Use this function to find the training counts for
the different languages. 

\end{description}

\subsection*{Language identification using Markov models}

Here we will construct a language classifier by using Markov models as
class-conditional distributions. In other words, we will separately
train a Markov model to represent each of the chosen languages:
English, Spanish, Italian and German. The training data is given in
the files {\tt cnn.eng, cnn.spa, cnn.ita, cnn.ger}, which contain
several news articles (same articles in different languages), one
article per line.

We will first try a simple independent (zeroth-order Markov)
model. Under this model, each successive symbol in text is chosen
independently of other symbols. The language is in this case
identified based only on its letter frequencies.

\begin{question}
Write a function {\tt naiveLL(stream,count1)} which takes a 1-count
(frequency of letters returned by {\tt count.m}) and evaluates the
log-likelihood of the text stream (row vector of numbers) under the
independent (zeroth-order Markov) model.
\end{question}
\begin{answer}
\begin{verbatim}
function ll = naiveLL(stream,c1)
  lp = log(c1/sum(c1));
  ll = sum(lp(stream));
\end{verbatim}
\end{answer}
\score{5 points}

Extract the total 1-counts from the language training sets described
above. Before proceeding, let's quickly check your function {\tt
naiveLL}. If you evaluate the log-likelihood of {\tt 'This is an
example sentence'} using the English 1-counts from {\tt cnn.eng},
you'll get -76.5690, while the Spanish log-likelihood of the same
sentence is -77.2706.

\begin{question}
Write a short function {\tt naiveC} which takes a stream, and several
1-counts corresponding to different languages, and finds the
maximum-likelihood language for the stream. You could assume, e.g.,
that the 1-counts are stored in an array, where each column
corresponds to a specific language. The format of the labels should be
in correspondence with the {\tt test\_labels} described below.
\end{question}
\begin{answer}
\begin{verbatim}
function [c,ll] = naiveC(stream,counts1)
  for i=1:size(counts1,2)
    ll(i) = naiveLL(stream,counts1(:,i));
  end
  [dummy,c]=max(ll);
\end{verbatim}
\end{answer}
\score{3 points}

The files {\tt song.eng, song.spa, song.ita, song.ger} contain
additional text in the four languages. We will use these as the test
set. For your convenience, we have provided you with script {\tt
generate\_test.m} :  
\begin{verbatim}
test_sentences = [ readlines('song.eng') ; ...
		   readlines('song.ger') ; ...
		   readlines('song.spa') ; ...
		   readlines('song.ita') ] ;
test_labels = [ ones(17,1) ; ones(17,1)*2 ; ones(17,1)*3 ; ones(17,1)*4 ]
\end{verbatim}

In order to study the performance of the classifier as a function of
the length of test strings, we will classify all prefixes of the lines
in the test files. The provided routine {\tt testC.m} calculates the
success probability of the classification, for each prefix length,
over all the streams or strings in a given cell-array. You can
call this function, e.g., as follows
\begin{verbatim}
successprobs = testC(test_sentences,test_labels,'naiveC',count1s)}
\end{verbatim} 
where {\tt count1s} provides the array of training counts that your
function {\tt naiveC} should accept as an input.

\begin{question}
Plot the success probability as a function of the length of the
string. What is the approximate number of symbols that we need to
correctly assign new piece of text to one of the four languages?
\end{question}
\begin{answer}
We carry out the following operations:
\begin{verbatim}
>> eng = readlines('languages/cnn.eng');
>> ita = readlines('languages/cnn.ita');
>> ger = readlines('languages/cnn.ger');
>> spa = readlines('languages/cnn.spa');
>> counts1 = [totalcount(eng,1) , totalcount(ger,1) , totalcount(spa,1) , totalcount(ita,1)]
>> p = testC(test_sentences,test_labels,'naiveC',counts1);
>> plot(p)
\end{verbatim}
Resulting in the plot: 

\includegraphics[width=0.6\textwidth]{p2q3}

From the plot we see that approximately 100 characters are needed for
reliable language classification.
\end{answer}
\score{1 point}

In order to incorporate second order statistics, we will now move on
to modeling the languages with first-order Markov models.

\begin{question}
Write a function {\tt markovLL(stream,count2)} which returns the
log-likelihood of a stream under a first-order Markov model of the
language with the specified 2-count. For the initial probabilities,
you can use 1-counts calculated from the 2-counts.
\end{question}
\begin{answer}
\begin{verbatim}
function ll = markovLL(stream,c2)
  c1 = sum(c2,2);		% 1-count calculated from 2-count
  p0 = c1/sum(c1);		% Initial probs
  p1 = op(c2,'./',c1);		% Transition probs
  lp0 = log(p0);
  lp1 = log(p1);
  ll = lp0(stream(1));
  for i=2:length(stream)
    ll = ll+lp1(stream(i-1),stream(i));
  end  
\end{verbatim}
\end{answer}
\score{5 points}

Quick check: The English log-likelihood of {\tt 'This is an example
sentence'} is -63.0643, while its Spanish log-likelihood is -65.4878.
We are again assuming that you are using the training sets described
above to extract the 2-counts for the different languages.

\begin{question}
Write a corresponding function {\tt markovC.m} that classifies a
stream based on Markov models for various languages, specified by
their 2-counts.
\end{question}
\begin{answer}
\begin{verbatim}
function [c,ll] = markovC(stream,counts2,zamir)
  if nargin<3, zamir=0; end;
  if ischar(stream), stream=preproc(stream); end;
  for i=1:size(counts2,3)
    ll(i) = markovLL(stream,counts2(:,:,i)+zamir);
  end
  [dummy,c]=max(ll);  
\end{verbatim}
The above function can also incorporate a pseudo-count (as required
later on) and can deal with either streams or strings as inputs. These 
features were not required for a solution to be complete.
\end{answer}
\score{1 point}

\begin{question}
Try to classify the sentence {\tt 'Why is this an abnormal English
sentence'}. What is its likelihood under a Markov model for each of the
languages ? Which language does it get classified as ? Why does it not
get classified as English ?
\end{question}
\begin{answer}
After calculating the two-counts:
\begin{verbatim}
>> counts2 = cat(3,totalcount(eng,2) , totalcount(ger,2) , totalcount(spa,2) , totalcount(ita,2))
\end{verbatim}
We can calculate:
\begin{verbatim}
>> [c,ll] = markovC(text2stream('Why is his is an abnormal English sentence'),counts2)
c =     3
ll =       -Inf      -Inf -115.9894      -Inf
\end{verbatim}
(We also get a few warnings).

The sentence gets classified as Spanish with a log-likelihood of
-116. In all other languages, including English, it has infinitely
negative log-likelihood. This is because the letter pair ``{\tt bn}''
does not appear in the training text, and so no sentence with this
letter combination will be generated by the learned model. The
probability of generating the above sentence under out English model
is thus zero, and its log-likelihood is negative infinity.
\end{answer}
\score{3 points}

When estimating discrete probability models, we often need to
regularized the parameter estimates to avoid concluding that any
configuration (e.g., two successive characters) have probability zero
unless such the configuration appeared in the limited training
set. Let $\{\theta_i\}$, where $\sum_{i=1}^m \theta_i = 1$, define a
discrete probability distribution over $m$ elements
$i\in\{1,\ldots,m\}$. For the purposes of estimation, we treat
$\theta_i$ as parameters. Given now a training set summarized in terms
of the number of occurrences of each element $i$, i.e., $\hat n_i$,
the maximum likelihood estimate of $\{\theta_i\}$ would be
\[
\hat\theta_i = \frac{\hat n_i}{\sum_{j=1}^m \hat n_j}
\]
This is zero for all elements $i$ that did not occur in the training
set, i.e., when $\hat n_i = 0$. To avoid this problem, we introduce a
prior distribution over the parameters $\{\theta_i\}$:
\[
P(\theta) = \frac{1}{Z}\,\prod_{i=1}^m \theta_i^{\alpha_i}
\]
where $Z$ is a normalization constant and $\alpha_i\geq 0$'s are known
as {\em pseudo-counts}. This prior, known as a {\em Dirichlet}
distribution, assigns the highest probability to
\[
\tilde\theta_i  = \frac{\alpha_i}{\sum_{j=1}^m \alpha_j}
\]
Thus $\alpha_i$'s behave as if they were additional counts from some
prior (imaginary) training set.

We can combine the prior with the maximum likelihood criterion by
maximizing instead the penalized log-likelihood of the data (expressed
here in terms of the training set counts $\hat n_i$):
\[
J(\theta) &=& \overbrace{\sum_{i=1}^m \hat n_i \log \theta_i}^{\mbox{log-probability of
data}} + 
\overbrace{\log P(\theta)}^{\mbox{log-prior}}
\\
&=& \sum_{i=1}^m \hat n_i \log \theta_i + \sum_{i=1}^m \alpha_i \log
\theta_i + \mbox{constant}
\\
&=& \sum_{i=1}^m (\hat n_i+\alpha_i) \log \theta_i + \mbox{constant}
\]
The maximizing $\{\theta_i\}$ is now 
\[
\hat\theta_i  = \frac{\hat n_i + \alpha_i}{\sum_{j=1}^m (\hat n_j + \alpha_j)}
\]
which will be non-zero whenever $\alpha_i>0$ for all
$i=1,\ldots,m$. Setting $\alpha_i = 1/m$ would correspond to having a
single prior observation distributed uniformly among the possible
elements $i\in\{1,\ldots,m\}$. Setting $\alpha_i = 1$, on the other
hand, would mean that we had $m$ prior observations, observing each
element $i$ exactly once.

\begin{question}
Add pseudocounts (one for each configuration) and reclassify the test
sentence. What are the likelihoods now. Which language does the
sentence get classified as ?
\end{question}
\begin{answer}
We now have:
\begin{verbatim}
>> [c,ll] = markovC(text2stream('Why is this is an abnormal English sentence'),counts2,1)
c =     1
ll =  -106.1994 -123.9184 -114.7622 -128.4992
\end{verbatim}
And the sentences gets correctly classified as English.
\end{answer}
\score{2 points}

\begin{question}
Use {\tt testC.m} to test the performance of Markov-based
classification (with the corrected counts) on the test set. Plot the
correct classification probability as a function of the text
length. Compare the classification performance to that of {\tt
naiveC.m}. (Turn in both plots).
\end{question}
\begin{answer}
\begin{verbatim}
>> p = testC(test_sentences,test_labels,'markovC',counts2,1);
>> plot(p)
\end{verbatim}
\includegraphics[width=0.6\textwidth]{p2q8}

The performance using Markov models is much better-- accurate
performance is achieved with less than 40 characters.
\end{answer}
\score{1 point}

\subsection*{Hidden Markov Models}

We will now turn to a slightly more interesting problem of language
segmentation: given a mixed-language text, we would like to identify
the segments written in different languages. 

Consider the following simple approach: we will first find the
character statistics for each language by computing the 1-counts from
the available training sets as before. We will then go through the new
text character by character, classifying each character to the most
likely language (language whose independent model assigns the highest
likelihood to the character). In other words, we would use {\tt
naiveC} to classify each character in the new text. To incorporate
higher-order statistics, we could train a Markov model for each
language (as before), and again assign the characters in the text to
the most likely language.

\begin{question}
Why would we expect the resulting segmentation not to agree with the
true segmentation? What would the resulting segmentation look like?
What is the critical piece of information we are not using in this
approach ?
\end{question}
\begin{answer}
We would classify each letter according to the language in which it is
most frequent. And so, all 't's will get classified as English, while
all 'a' will get classified as Spanish. Letters in the same word would
get classified as different languages. This is of course unreasonable.

The crucial information we are not using is that adjacent characters
tend to come from the same language.
\end{answer}
\score{4 points}

We would rather model the multi-lingual text with a hidden Markov
model. For simplicity, we will focus on segmenting text written in
only two languages.

\begin{question}
Suggest how a hidden Markov model can solve the problem you identified
above. Provide an annotated transition diagram (heavy lines for larger
transition probabilities) and describe what the output probabilities
should be capturing. 
\end{question}
\begin{answer}
We can use the hidden variable to designate which language the
characters come from. The transition probabilities for this hidden
variable will encode the information that transitions between languages
are much less frequent then remaining in the same language. The output
probabilities given that we are in a specific language would
correspond to the single-character statistics of that language.
\end{answer}
\score{4 points}

For some setting of the parameters, this HMM probably degenerates to
the independent model. Make sure you understand when this happens.

Now, load the example text from the file {\tt segment.dat}. The text,
mixed German and Spanish, is given in the variable {\tt gerspa} as
numbers in the range 1..27 (as before). You can use the provided
function {\tt stream2text.m} to view the numbers as letters.

The routine {\tt [hmm,ll] = esthmm(data,hmm)} uses EM to find the
maximum likelihood parameters of a hidden Markov model, for a given
output sequence(s). As discussed earlier, the EM algorithm has to
start its search with some parameter setting. The {\tt hmm} input
argument to {\tt esthmm} provides this initial parameter setting.

The routine {\tt hmm = newhmm(numstates,numout)} creates random HMM
parameters for an HMM with {\tt numstates} hidden states, and {\tt
numout} output symbols. An optional third argument controls the
``non-uniformity'' of the parameters.

{\bf Note:} The orientation of the matrices {\tt Po} and {\tt Pt} is
different from those in recitation: {\tt Pt(i,j)}$=
P(state_t=j|state_{t-1}=i)$ and {\tt Po(i,a)}$= P(O_t = a|state_t=i)$.

With two different initializations, you will probably find two
different HMMs. It is a good idea to run the EM algorithm multiple
times, starting from slightly different initial settings. Every such
run can potentially lead to a different result. Make sure you
understand how you would choose among the alternative solutions.

\begin{question}
Estimate a two-state hidden Markov model with two different types of
initial settings of the state transition probabilities. First, set the
transition probabilities close to uniform values. In the second
approach, make the ``self-transitions'' quite a bit more likely than
the transitions to different states. Which initialization leads to a
better model?
\end{question}
\begin{answer}
We can compare the two estimated models by comparing their
likelihood. We are seeking a maximum likelihood model, and so we will
select the one with a higher likelihood. 
\begin{verbatim}
>> randhmm = newhmm(2,27);
>> [hmm1,ll1] = esthmm(gerspa,randhmm);
>> ll1(end)
ans =  -7.8976e+03
>> biasedhmm = newhmm(2,27);
>> biasedhmm.Pt = [0.9,0.1;0.1,0.9];
>> [hmm2,ll2] = esthmm(gerspa,biasedhmm);
>> ll2(end)
ans =  -7.8115e+03
\end{verbatim}
The second HMM has a higher log-likelihood, and thus will be
selected.

We can also look at the resulting HMM parameters and notice that the
second model better fits what we were looking for: the transition
probabilities are very biased, the initial probabilities are almost
deterministic, and the output probabilities resemble the language
statistics we are already familiar with from the first part of the
problem.
\end{answer}
\score{2 points}

Now try to segment {\tt gerspa} into German and Spanish, using a
two-state hidden Markov model. You can then use the routine {\tt
viterbi(sequence,hmm)} to examine the most likely (maximum a
posteriori probability or MAP) state sequence. The correct
segmentation of {\tt gerspa} is given in {\tt gerspa\_lang}.

\begin{question}
Examine the difference between your segmentation and the correct one. 
Do the errors make sense? 
\end{question}
\begin{answer}
There are sixty-four miss-labeled characters. Note that all of them
occur on the boundary between language transitions. All the
transitions were identified, but the exact location was sometimes
missed. This is very reasonable.
\end{answer}
\score{1 point}

\iffalse
\begin{begin}
Under the model, what is the conditional probability of the MAP
state sequence, given the observed text ? Explain how you calculated
this probability, including any MATLAB code you used or modified.
\end{begin}
\fi

\begin{question}
Use the provided routine {\tt hmmposteriors(sequence,hmm)} to
calculate the per character posterior probabilities over the
states. These are the $\gamma_t(i)$ probabilities described in
lectures ($t$ gives the character position in text and $i$ specifies
the state). Plot these probabilities as a function of the character
position in the text sequence and turn in the plot. You might want to
re-scale the axis using {\tt axis[0 3000 0 1.1]}.
\end{question}
\includegraphics[width=0.7\textwidth]{p2q13}
\score{1 point}

\begin{question}
Find the sequence of maximum a posteriori probability states. That is,
for each time point $t$, find the state $i$ with the maximum a
posteriori probability $P(state_t=i|{\text observed seq})$. Compare
this state sequence with the maximum a posteriori state sequence. Are
they different? 
\end{question}
\begin{answer}
They differ in one position (position 631).
\end{answer}
\score{2 point}

\begin{question}
Give an example of a hidden Markov model, with two hidden states and
two output symbols, and a length two output sequence, such that the
MAP state sequence differs from the sequence of MAP states. Be sure to
specify all the parameters of the HMM (these need not be the maximum
likelihood parameters for this specific length two output
sequence).
\end{question}
\begin{answer}
$$P(X_0 = a) = 0.6, P(X_0=b) = 0.4)$$
$$P(X_1 = a | X_0 = a) = 0.5, P(X_1 = b | X_0 =a) =0.5$$
$$P(X_1 = a | X_0 = b) = 0, P(X_1 = b | X_0 = b)=1$$.
All output probabilities are uniform. For any output sequence, the MAP
state sequence is $b,b$ with posterior probability 0.4. But the MAP
state of $X_0$ is $a$, with posterior probability 0.6 (each of the state
sequences $a,a$ and $a,b$ have posterior probability 0.3).
\end{answer}
\score{3 points}

In the above example, we assumed no knowledge about the languages. We
would like to improve the segmentation by incorporating some
additional information about the languages such as single-character
statistics.

\begin{question}
What parameters of the HMM still need to be estimated ? Modify the
routine {\tt esthmm.m} accordingly. (Turn in the modifications)
\end{question}
\begin{answer}
The output probabilities are now known and fixed. We only search for
the maximum likelihood initial and transition probabilities. We make
one change to {\tt esthmm.m}: delete the following line:
\begin{verbatim}
  hmm.Po = op(Nsx,'./',sum(Nsx,2));
\end{verbatim}
It would also be appropriate, though not necessary, to delete the
calculation of {\tt Nsx}, since it is now unused.
\end{answer}
\score{3 points}

\begin{question} {\em (Optional)}
Use the single-character statistics you calculated from {\tt cnn.spa}
and {\tt cnn.ger} to segment the text. What is the maximum likelihood
estimate of the remaining parameters ? Compare the resulting most
likely state sequence to the one you found without this prior
information. Are they different ?
\end{question}

\begin{question} 
Hidden Markov models can also be useful for classification. Suggest
how you would use HMMs in place of the Markov models described above
to identify the language of any specific (single-language)
document. Specifically, list the routines we have provided that you
could make use of and describe which additional routines you would
need.
\end{question}
\begin{answer}
For each language, we would train a HMM (using {\tt esthmm.m}) on the
corpus of texts for that languages. We would have to choose the number
of hidden states in each of those HMMs. We can set a fixed number of
hidden states (e.g. any number of hidden states above 27 would give us
a model with a training accuracy at least as good as a first order
Markov model, but good training accuracy might not lead to good
generalization). We can also use cross-validation to choose the number
of states for each language model.

After training the models, in order to classify a sentence, we would
compare the likelihood of the query sentence under each of the
models. This is given as the third output parameter of {\tt
hmmposteriors.m}. However, all the other calculations in {\tt
hmmposteriors.m} are unnecessary (and take significant time)-- it is
enough to run only the forward calculation as follows:
\begin{verbatim}
function logPx = hmmloglikelihood(x,hmm)
  loga = logalpha(x,hmm);
  logPx = logsumexp(loga(:,n));
\end{verbatim}
\end{answer}
\score{5 points}

\begin{question}
{\em (Optional) } Use hidden Markov models, with a varying number of
states, for the language classification task investigated in the first 
part of this problem. Using the same training set and test set as
before, create a graph of the probability of correct classification as 
a function of the length of the text. Discuss how the results compare
to using naive classification and first-order Markov models, and how
changing the number of hidden states affects the results.
\end{question}
