Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science

6.863J/9.611J Natural Language Processing, Spring, 2004


Laboratory 2, Components I and II:  

Statistics and Natural Language: Warmup on N-grams; Part of Speech Tagging




Handed Out: February 23                                        Due: March 05



Introduction: Goals of the Laboratory & Background Reading – The Red Pill or the Blue Pill?

In class we have observed that there are two current approaches to tackling natural language processing: a symbolic, rule-based approach; and a statistical approach.  What are the strengths and weaknesses of each?  How can we combine them?  How do they help in building practical systems?  That is the goal of this laboratory.  We shall examine two different aspects of statistical and symbolic natural language processing.  As a warmup exercise, in Part I we look at perhaps the earliest statistical language approach, N-grams (Markov, Andrei Andreevich, 1913, “Statistical investigation of the text of Eugene Onegin illustrating the dependence between samples in a chain,” Bulletin de l'Académie Impériale des Sciences de St.-Pétersbourg, Bd. 7; the first application to authorship: Morozov, A., 1915, “A new weapon for the objective investigation of old writings (linguistic spectra as a means for distinguishing plagiarisms from the truthful attributions of one or another author, and for determining their epoch),” Izvestija Akademii Nauk, St. Petersburg, section for Russian language, XX, 1-4).  Then, in Part II of this lab component we compare this approach to a rule-based approach for the task of part-of-speech tagging.  In the last component of the lab, handed out Wednesday, we shall turn to a more detailed comparison.

Through observing the performances of these taggers and finishing this laboratory you should develop a deeper understanding of the two different perspectives on NLP: the symbolic rule-based perspective and the statistical perspective.

General Background reading for this Laboratory – Please read first!

Part I: Basic probability theory; n-grams

If you are at all uncomfortable with the basic notions of probability theory, especially those required for understanding n-grams – conditional probability; the chain rule; Bayes’ rule; etc., I strongly urge you to read the first selection below.  I would also strongly urge everyone to take a look at the second item below. I leave the reading in your textbook and the lecture notes to your own conscience and/or diligence (Chapter 6 in the text is a very good summary, however!)

1.      An excellent review of probability theory as applicable to N-grams:  here.

2.      Same document, with a terrific summary intro to bigrams, here.

3.      Your textbook, Chapter 6.

Part II: Part of speech tagging

1.      Your textbook, Chapter 8.

2.     Eric Brill. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, Dec. 1995.

3.     Supplementary: if you want a thorough review of current methods and tests of part of speech taggers, see the recent technical report  [Schröder 2002].


OK, on to the actual laboratory itself.


Part I: Author, author!  (The red pill?)

Description: In this part, we apply basic counts to a corpus to explore the effectiveness of statistical language modeling via linear symbol strings, or N-grams.  We will examine the various factors that can influence such models: the size and choice of training corpus; the “n” in n-gram (the window size over which we define token or word correlations), and so forth, and then see how well N-grams ‘characterize’ an author – their first application, as noted above.

To turn in for Part I: You should do everything that is suggested in this part and answer all questions in writing on your web page report  – if a sentence ends with a question mark, it is a question you need to answer. (Well, you don’t have to answer the question about the red pill…yet.)

Credits: Part of this exercise was written by Philip Resnik of the University of Maryland, and augmented at the University of Pittsburgh.

A. Examining the Corpus material

  1. Add 6.863 as usual and cd to /mit/6.863/ngram/
  2. You will find 5 corpora in this directory, with filenames ending in .txt. We will first concentrate on genen.txt. Take a look at the file, using, e.g., less. This file contains an annotated version of the book of Genesis, King James version. It is a small corpus by current standards – somewhere on the order of 40,000 or 50,000 words. What words (unigrams) would you expect to have high frequency in this corpus? What bigrams do you think might be frequent?  Now take a look at the other three distinct corpora here: hound.txt and study.txt, two Sherlock Holmes stories; and austen.txt, Jane Austen’s novel Emma. 
    (You will not be graded for this part, so try to be honest with yourself and don't use the results from the next part.)

B. Computing Unigrams, Bigrams, and Trigrams

The perl program gramify in this directory computes unigrams, bigrams, trigrams, and so forth given an input text, and outputs these along with their frequency counts.  It is a simple wrapper based on the Text::Ngrams perl module and an associated program, also in this directory.  See Ngram.html and ngram.html for a complete description of ngram.  Here is a list of the options for gramify:
        gramify [--version] [--help] [--n=3] [--normalize] [--type=TYPE]
        [--orderby=ORD] [--onlyfirst=N] [input files]
Compute the n-gram frequencies and write tables to stdout.
--n=N           The default is 3-grams.
--normalize     Produce normalized frequencies (divided by the total
                number of n-grams of the same size).
--type=T        The default is character.  For more types see the
                Text::Ngrams module (i.e., Ngram.html).
--limit=N       Limit the number of distinct n-grams.
--help          Show this help.
--version       Show version.
--orderby=ARG   ARG can be: frequency or ngram.
--onlyfirst=N   Only the first N n-grams are printed for each n.
To generate from a language model:
 --gen="initial string"
The options can be shortened to their unique prefixes, and the two dashes
to one dash.  If no input files are given, STDIN is used.
  1. First let’s try out the program to generate language models (i.e., frequency counts) for the corpora here.  To generate a bigram table from genen.txt (please output the table to a file in your home directory), do the following.  Note that we use the convention that files ending in .lm denote ‘language models’.  Be patient – if a file is large (e.g., austen.txt) it may take a few minutes… especially on Athena.  In fact, austen.txt is too large to generate anything but n-grams of 3 or less – unless your afs quota is bigger than 30 MB. Get coffee, check your stocks, zephyr someone – perhaps the TA…

athena% gramify.pl --n=2 genen.txt > ~/genen.lm

  2. Note that we haven't defined yet what a word means. To keep things simple, you should assume that a word is a sequence of letters (a-zA-Z). You should treat all other characters as separators. Please note that words are treated case-sensitively (“And” and “and” are treated as two different words). Note that the Ngram module allows great flexibility here, but we haven’t really made use of this in the corresponding perl program (you can play with this if you would like).
  3. Examine the output for unigrams. Note that v (verse), c (chapter), id, and GEN are part of the markup in file genen.txt, for identifying verse boundaries. Other than those (which are a good example of why we need pre-processing to handle markup), are the high frequency words those that you would expect?
  4. Analogously, look at the bigram and trigram counts.  Save your bigram and trigram models because you’ll use them in the next section. Markup aside, are the high frequency bigrams and trigrams what you would expect?
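The counting that gramify performs can be sketched in a few lines of Python. This is only a sketch of the idea, not the actual program (which uses the Text::Ngrams Perl module); the example text is a stand-in for the corpus files:

```python
import re
from collections import Counter

def ngram_counts(text, n):
    # The lab's convention: a word is a run of letters (a-zA-Z),
    # everything else is a separator, and case is significant.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

text = "And God said, Let there be light: and there was light."
print(ngram_counts(text, 1)[("there",)])        # 2
print(ngram_counts(text, 2)[("there", "was")])  # 1
```

Dividing each count by the total number of n-grams of the same size gives the normalized frequencies that the --normalize option produces.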

C. Time for Fun

Using gramify.pl we can also generate sentences based on the n-gram model.  More specifically, given a language model and the beginning of the sentence, the program chooses as the next word the word yielding the highest n-gram count when composed with the previous words into an n-gram. Thus, it uses the previous word for bigrams and the previous two words for trigrams. It continues generating new words until it has generated a 15-word sentence or reaches a dead end (all n-gram counts are zero). The beginning of the sentence (one word for bigrams and two words for trigrams) is given as command line parameter(s).  If you want to change the sentence length, you can do that by supplying an integer argument to gramify.pl as --len=INT. For example, the following creates a 5-gram language model and then generates a 14-word sentence using it, starting with the word “remember”:

athena% gramify.pl --n=5 genen.txt > ~/genesis.lm

athena% gramify.pl --gen="remember" --len=14 --lm=~/genesis.lm

This will write the generated sentence to stdout.
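The greedy generation procedure just described can be rendered as a short Python sketch (a toy version for the bigram case, not the actual Perl code; the counts below are invented, loosely Genesis-flavored):

```python
from collections import Counter

def generate(bigram_counts, start, length=15):
    """Greedy generation: extend the sentence with the word whose
    bigram count with the previous word is highest; stop at the
    target length or at a dead end (all counts zero)."""
    sentence = list(start)
    while len(sentence) < length:
        prev = sentence[-1]
        followers = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == prev}
        if not followers:
            break  # dead end: no bigram continues from this word
        sentence.append(max(followers, key=followers.get))
    return sentence

counts = Counter({("he", "said"): 9, ("he", "went"): 3,
                  ("said", "unto"): 7, ("unto", "him"): 5})
print(" ".join(generate(counts, ["he"], length=4)))  # he said unto him
```

Note that the choice at each step is deterministic: the same start word always yields the same sentence, which is worth keeping in mind when you answer question (d) below.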


  1. Submit your generated output for the following inputs:

a)      Using a bigram language model, and the default sentence length, generate bigrams starting from: "he", "And", "father"

b)      Using a trigram language model: use the first word generated for each of the above bigram start words, and create starting two-word sequences for trigrams. Then see what the trigram generation will produce from these three bigram starting points. (Again use the default value for sentence length.)

c)      Is there any difference between the sentences generated by bigrams and trigrams? Which one of the models do you think will generate more reasonable sentences?

d)      Do you think that the sentence you generated has the highest probability among the sentences that have the same start given the n-gram model? Why? Or why not?  Explain your reasoning in a few sentences.

D. Corpus Impact

  1. Testing the effect of character markup and encoding: One thing you may have noticed about genen.txt is that data sparseness arises because uppercase and lowercase are distinct, e.g. “Door” is treated as a different word from “door.”  In the ngram directory you will find a lowercase version of genen.txt in the file genenlc.txt.  Create bigrams and trigrams for this new corpus, and repeat Part C (all except question d) for this new corpus. What, if anything, changes?
  2. Other authors: The corpora directory contains the Sherlock Holmes stories A Study in Scarlet (study.txt) and The Hound of the Baskervilles (hound.txt). Create bigrams and trigrams for study.txt. How do the n-gram models compare with the ones from the previous corpus?
  3. Compute the same statistics for the second Holmes story (hound.txt). Same author, same main character, same genre. How do the unigrams, bigrams, and trigrams compare between the two Holmes cases?
  4. Author identity: Run the gram program to compute a trigram language model for Jane Austen, using austen.txt.  Compare it to the Holmes language models and then answer the following question:  How well do you think that n-gram analysis could distinguish between the two authors of these texts?  Please justify your answer using your study of n-grams.  Please comment on the generality of this approach – how might it fail? Or should it always succeed?
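One simple way to quantify the comparison this question asks for is to treat each normalized n-gram table as a vector and measure the cosine similarity between authors' tables: same-author texts should score closer to 1.0 than different-author texts. A toy Python sketch of the idea, using unigrams over stand-in strings rather than the real corpora (the same computation applies to the .lm files you generated):

```python
import math
import re
from collections import Counter

def profile(text):
    # Normalized unigram frequencies over lowercased letter runs.
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def cosine(p, q):
    # Cosine of the angle between two frequency vectors, in [0, 1].
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

a = profile("it was the best of times it was the worst of times")
b = profile("it was a dark and stormy night")
print(round(cosine(a, b), 3))  # identical texts would score exactly 1.0
```

Whether such surface statistics track authorship, rather than genre or topic, is exactly the generality question posed above.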


Part II: Part of Speech Tagging (The blue pill?)   

Part of speech tagging is the task of assigning an appropriate part of speech or word category, or possibly even a higher-level tag corresponding to a text topic (e.g. Noun, Verb, Adjective; or ‘new’, ‘spam’), to each word (or larger unit) in a sentence. This is usually taken to be the first step in automatically processing language at the sentence level. Its output is crucial for tasks such as sentence parsing (assigning a structure and/or meaning to natural language input) and word sense disambiguation (distinguishing words with the same spelling and/or pronunciation). The idea is to save the parser the trouble of disambiguating each word category.  There are roughly two major approaches to this task:  rule-based approaches (Brill 1994) and statistical approaches (Jelinek 1985; DeMarcken 1990; Merialdo 1991; Cutting et al. 1992). Brill's tagger is one of the best rule-based tagging systems. It is a transformation-based error-driven model that learns a set of rules automatically based on a given corpus and then tags words following these rules. The statistical tagger you will use is in effect an n-gram: it is based on a hidden Markov model (HMM), which computes the probabilities of co-occurrence of words based on a given tagged corpus and then tags texts using these probabilities.

Brill's Rule-Based Tagger

The fundamental component of a tagger is a POS (part-of-speech) tagset, or list of all the word categories that will be used in the tagging process.  The pioneering Brown Corpus distinguishes 87 simple tags. Some subsequent projects have tended to elaborate the Brown Corpus tagset, though in fact much subsequent ‘corpus’ work (e.g., the Linguistic Data Consortium project) has actually trimmed down the tags from those in the Brown Corpus. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about 135 tags, the Lancaster UCREL group about 165 tags, and the London-Lund Corpus of Spoken English 197 tags. The rationale behind developing such large, richly articulated tagsets is to approach “the ideal of providing distinct codings for all classes of words having distinct grammatical behavior” – but in fact, this has not really been done, for reasons we will examine later.  Choosing an appropriate set of tags for an annotation system has direct influence on the accuracy and the usefulness of the tagging system. The larger the tag set, the lower the accuracy of the tagging system since the system has to be capable of making finer distinctions. Conversely, the smaller the tag set, the higher the accuracy. However, a very small tag set tends to make the tagging system less useful since it provides less information. So, there is a trade-off here. Another issue in tag-set design is the consistency of the tagging system. Words of the same meaning and same functions should be tagged with the same tags.

To have some idea about tag-set designing, you can look at the tags (Penn Treebank Part of Speech Tags) used in the annotation system of the Penn Treebank Project. This is a small tag-set (about 40 tags) compared to the tag-sets mentioned above.

A similar tag set is used by Brill's tagger, which you will have a chance to investigate shortly. There is a paper to justify the appropriateness of the tagset used in the Penn Treebank Project, Building a large annotated corpus of English: the Penn Treebank (Marcus, et al. , Computational Linguistics, 19, No. 2, 1993, pp. 313-330).

Brill's rule-based tagger begins processing a sentence by looking up each word in a lexicon.  The lexicon contains a list of the possible parts of speech that the word can be assigned. For example, the word rat can be assigned the tag NN (noun, singular or mass) and also the tag VB (verb, base form) since it can be used as a singular noun (The rat scurried across the floor) or as an infinitive verb (He wouldn't rat on us would he?). When Brill's tagger is assigning its initial tags to words in a sentence, it chooses the part of speech that is most common for each word under standard usage. Thus, rat would be assigned the tag, NN.

Alternatively, Brill's tagger can assign every word the tag NN (singular common noun) initially, unless it begins with a capital letter, in which case it is tagged NNP (singular proper noun).

Clearly, the initial tagging performed by Brill’s tagger will contain many errors. To correct these errors, the tagger applies tag-changing rules to the results of the initial stage. There are two kinds of rules, Contextual Rules and Lexical Rules. The Contextual Rules revise the tag of the particular word based on the surrounding words or on the tags of the surrounding words. They take the following forms:

Change tag a to tag b when:

1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.
7. The preceding (following) word is x.
8. The word two before (after) is x.
9. One of the two preceding (following) words is x.
10. The current word is x and the preceding (following) word is y.
11. The current word is x and the preceding (following) word is tagged z.
12. The current word is x.
13. The preceding (following) word is x and the preceding (following) tag is z.
14. The current word is x, the preceding (following) word is y and the preceding (following) tag is w.

where a, b, z and w are variables over the set of parts of speech, and x and y are variables over all words in the training corpus.

A training corpus is a collection of language data which functions as a sample of the language and is used to establish language models. A training corpus may contain manually-added information to facilitate the training process of the language model. In our case, the training corpus is an annotated (tagged) corpus. Of course, a training corpus may also consist of plain text without any added annotations. Put simply, a training corpus provides examples from which the computer system learns relevant knowledge (e.g. rules) for later use in automatic natural language processing.

Let's look at two examples of Contextual Rules. The first is

NN VB PREVTAG TO

This means: change tag NN to tag VB when the preceding word is tagged TO. For instance,

to/TO run/NN

would be changed to

to/TO run/VB

The second example is

VBP VB PREV1OR2OR3TAG MD

This means: change tag VBP (verb, non-3rd person singular present) to VB (verb, base form) when one of the three preceding words is tagged MD (modal verb).

The following are some of the abbreviations used in the representation of contextual rules:

1. PREV --- previous (preceding)
2. PREVTAG --- preceding word is tagged
3. PREV1OR2TAG --- one of the two preceding words is tagged
4. PREV1OR2OR3TAG --- one of the three preceding words is tagged
5. WDAND2AFT --- the current word is x and the word two after is y
6. PREV1OR2WD --- one of the two preceding words is …
7. NEXT1OR2TAG --- one of the two following words is tagged
8. NEXTTAG --- following word is tagged
9. NEXTWD --- following word is …
10. WDNEXTTAG --- the current word is x and the following word is tagged z
11. SURROUNDTAG --- the preceding word is tagged x and the following word is tagged y
12. PREVBIGRAM --- the two preceding words are tagged
13. CURWD --- the current word is   
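To make the mechanics concrete, here is a minimal Python sketch of how contextual rules of the PREVTAG and NEXTTAG varieties might be applied; the real tagger's rule-matching machinery is of course more elaborate, and the rule encoding here is our own illustration:

```python
def apply_contextual(tagged, rules):
    """tagged: list of (word, tag) pairs for one sentence.
    rules: list of (from_tag, to_tag, kind, arg) tuples; only the
    PREVTAG and NEXTTAG rule kinds are handled in this sketch."""
    for from_tag, to_tag, kind, arg in rules:
        for i, (word, tag) in enumerate(tagged):
            if tag != from_tag:
                continue
            # PREVTAG: preceding word is tagged arg
            if kind == "PREVTAG" and i > 0 and tagged[i - 1][1] == arg:
                tagged[i] = (word, to_tag)
            # NEXTTAG: following word is tagged arg
            elif kind == "NEXTTAG" and i + 1 < len(tagged) and tagged[i + 1][1] == arg:
                tagged[i] = (word, to_tag)
    return tagged

sentence = [("to", "TO"), ("run", "NN")]
rules = [("NN", "VB", "PREVTAG", "TO")]
print(apply_contextual(sentence, rules))  # [('to', 'TO'), ('run', 'VB')]
```

Note that rules are applied in order, so a later rule can revise the output of an earlier one, which is essential to the transformation-based approach.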

The other rules used by Brill's tagger are Lexical Rules. Lexical Rules are used to analyze words that the tagger does not have in its lexicon to see if it can make a reasonable guess as to their classification. The Lexical Rules take advantage of the internal structure or morphology of many words. For example, the word grapes can be analyzed as grape + s. The use of Lexical Rules greatly reduces the size of the lexicon which Brill's tagger needs to operate effectively.

The following are examples of the formats of Lexical Rules:

Change the tag of an unknown word (from X) to Y if:

1. Deleting the prefix (suffix) x, |x| <= 4, results in a word (x is any string of length 1 to 4).
2. The first (last) (1,2,3,4) characters of the word are x.
3. Adding the character string x as a prefix (suffix) results in a word (|x| <= 4).
4. The word w ever appears immediately to the left (right) of the current word.
5. Character z appears in the word.

Let's look at some examples.

Example 1.

NN s fhassuf 1 NNS


change the tag of an unknown word from NN to NNS if it has suffix -s

For instance, webpages/NN to webpages/NNS

Example 2.

NN . fchar CD


change the tag of an unknown word from NN to CD if it has character '.'

For instance, 3.5/NN to 3.5/CD

Example 3.

NN - fchar JJ


change the tag of an unknown word from NN to JJ if it has character '-'

For instance, man-made, rule-based, three-year-old, etc.

Example 4.

NN ed fhassuf 2 VBN


change the tag of an unknown word from NN to VBN if it has suffix -ed

For instance, wolfed/NN to wolfed/VBN

Example 5.

NN ing fhassuf 3 VBG


change the tag of an unknown word from NN to VBG if it has suffix -ing

For instance, tagging/NN to tagging/VBG

Example 6.

ly hassuf 2 RB


change the tag of an unknown word from ?? (any) to RB if it has suffix -ly

For instance, morphologically/?? to morphologically/RB

Example 7.

ly addsuf 2 JJ


change the tag of an unknown word from ?? to JJ if adding suffix -ly results in a word

For instance, sil/NN to sil/JJ

Example 8.

NN $ fgoodright CD


change the tag of an unknown word from NN to CD if the word $ can appear to the left

Example 9.

un deletepref 2 JJ


change the tag of an unknown word from ?? (any) to JJ if deleting the prefix un- results in a word

For instance, unkind/?? to unkind/JJ
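A minimal Python sketch of how a few of these lexical rules might be applied to an unknown word follows. This is an illustration only: the numeric suffix-length field of the real rule format is folded into the suffix string, and only two rule kinds are handled.

```python
def guess_tag(word, lexicon, rules):
    """Guess a tag for an unknown word: start from NN (or NNP if
    capitalized), then let lexical rules revise the guess.
    rules: (from_tag, string, kind, to_tag); '*' as from_tag means
    the rule applies regardless of the current tag.  Only the
    fhassuf (has-suffix) and addsuf (adding-a-suffix-yields-a-word)
    kinds appear in this sketch."""
    tag = "NNP" if word[0].isupper() else "NN"
    for from_tag, s, kind, to_tag in rules:
        if from_tag != "*" and tag != from_tag:
            continue
        if kind == "fhassuf" and word.endswith(s):
            tag = to_tag
        elif kind == "addsuf" and (word + s) in lexicon:
            tag = to_tag
    return tag

rules = [("NN", "ing", "fhassuf", "VBG"), ("NN", "s", "fhassuf", "NNS")]
print(guess_tag("webpages", set(), rules))  # NNS
print(guess_tag("tagging", set(), rules))   # VBG
```

As in the examples above, each rule both tests a morphological property and names the tag to move to, so the whole pass is just an ordered sequence of conditional retaggings.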

Statistical tagging

Now let’s look at statistically-based tagging.

Statistical taggers employ large quantities of statistical data, mysteriously combining hosts of numbers to produce an often-reasonable output without seeming to know anything about the rules of language. Let's try to eliminate some of the mystery surrounding them now.

We first consider the problem in its full generality. Let w1, ..., wT be a sequence of words. We want to find the sequence of lexical categories C1, ...CT that maximizes

PROB(C1,...,CT | w1,...,wT) (1)

which means the probability of a sequence of tags C1,...,CT given a sequence of words w1,...,wT. That is to say, given a sequence of words, we try to find the most likely tag sequence it can have.

Unfortunately, it would take far too much data to generate reasonable estimates for such probabilities, so direct methods cannot be applied. There are, however, reasonable approximation techniques that produce good results. To develop them, we must restate the problem using Bayes' rule, which says that this conditional probability equals

( PROB(C1, ..., CT) * PROB(w1, ..., wT | C1, ..., CT) ) / PROB(w1, ..., wT) (2)


By using some simplifying assumptions and approximations (you are referred to Allen's book for detailed explanations), we can reduce the problem to finding the sequence C1, ..., CT that maximizes the value of

product over i = 1 to T of ( PROB(Ci | Ci-1) * PROB(wi | Ci) ) (3)


The advantage of this new formula is that the probabilities involved can be readily estimated from a corpus of text labeled with parts of speech. In particular, given a database of text, the bigram probabilities can be estimated simply by counting the number of times each pair of categories occurs and comparing this to the individual category counts.

Recall that a bigram means a frequency count (probability estimate) over two words at a time; that is, the size of the relevant locality is two words.

The probability that a V follows an N would be estimated as follows:

PROB(Ci=V | Ci-1=N) ≈ Count(N at position i-1 and V at position i) / Count(N at position i-1) (4)


These probabilities are our familiar bigram probabilities. A bigram probability indicates the likelihood of a category given the knowledge that a particular other category preceded it.

The term PROB(wi|Ci) in formula (3) is called the lexical-generation probability. PROB(wi|Ci) can be estimated simply by counting the number of occurrences of each word by category. Note that the lexical-generation probability is the probability that a given category is realized by a specific word, not the probability that a given word falls in a specific category.
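Putting the pieces together, the maximization in formula (3) can be carried out efficiently by the Viterbi algorithm. The following Python sketch finds the best tag sequence; the probability estimates are invented for illustration, and '<s>' is a hypothetical start-of-sentence tag:

```python
def viterbi(words, tags, ptrans, plex, start="<s>"):
    """best[t] = (probability of the best tag path ending in t, that path).
    ptrans[(prev_tag, tag)] estimates PROB(Ci | Ci-1), the bigram
    probability; plex[(word, tag)] estimates the lexical-generation
    probability PROB(wi | Ci)."""
    best = {start: (1.0, [])}
    for w in words:
        nxt = {}
        for t in tags:
            lex = plex.get((w, t), 0.0)
            if lex == 0.0:
                continue  # this word is never generated by tag t
            # Extend the best path into t from every surviving previous tag.
            prob, path = max(
                ((p * ptrans.get((pt, t), 0.0) * lex, pth + [t])
                 for pt, (p, pth) in best.items()),
                key=lambda x: x[0])
            if prob > 0.0:
                nxt[t] = (prob, path)
        best = nxt
    return max(best.values(), key=lambda x: x[0])[1]

# Invented estimates: "run" is more often a noun, but after TO the
# transition probability favors the verb reading.
ptrans = {("<s>", "TO"): 1.0, ("TO", "VB"): 0.8, ("TO", "NN"): 0.2}
plex = {("to", "TO"): 1.0, ("run", "NN"): 0.6, ("run", "VB"): 0.4}
print(viterbi(["to", "run"], ["TO", "NN", "VB"], ptrans, plex))  # ['TO', 'VB']
```

Notice that the tagger picks VB for run even though NN is lexically more likely (0.6 vs. 0.4), because the transition term outweighs it, which is precisely how formula (3) combines context and word identity.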

Comparing taggers: warmup questions

Now we will do a brief exercise to compare the two sorts of systems, in preparation for more extensive analysis next time.  Instructions for running each of the taggers are available on a separate page.   Both of these taggers were trained on data from the Wall Street Journal corpus, as hand-tagged by the University of Pennsylvania – and roughly the same training text. (This is to get the bigram probability estimates.  You can change the training statistics, as the running-tagger document describes, if you wish – but this is a more ambitious, and computationally intensive, goal.) Note further that both taggers use the Penn Treebank Part of Speech Tags, as described earlier.


Run each of the taggers on the file magi.txt in the directory /mit/6.863/tagging/ (also available here), per the instructions. This is a complete version of O. Henry's famous short story “The Gift of the Magi.”   Please direct any output to your own directory.  (Do not use the file magi-raw.txt – this file does not break up or tokenize punctuation properly, and is meant just for you to read.)  Then please answer the following questions:


Question (a)   The word let’s occurs three times in the magi text. What does the Brill tagger give as its final tags for each of these three occurrences? Are they correct? If they are incorrect, point out which tags are incorrect, and explain why. Please include in your discussion a list of what initial tag is assigned to each occurrence, and which (if any) contextual rules are used by the Brill tagger to change this initial setting to its intermediate and then final values.  Note: You will have to use the program option -I so as to write the intermediate results to a file (see the description of running the taggers).  (The three occurrences: “Take yer hat off and let’s…”; “Let’s be happy. You don’t know...;”  “let’s put our Christmas presents away…”)

Question (b)   Repeat using the statistical tagger (except that you cannot display any intermediate output since this kind of tagger does not produce any) – we just want you to see what it does on let’s.  Please describe any errors that it makes, and any differences from the Brill tagger, and explain the differences, if any.

INCLUDE in your report: answers to Questions (a) and (b) above.

This concludes Components I and II of Laboratory 2.