Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
6.863J/9.611J Natural Language Processing, Spring, 2004
Handed Out: February 23 Due: March 05
In class we have observed that there are two current approaches to tackling natural language processing: a symbolic, rule-based approach; and a statistical approach. What are the strengths and weaknesses of each? How can we combine them? How do they help in building practical systems? That is the goal of this laboratory. We shall examine two different aspects of statistical and symbolic natural language processing. As a warmup exercise, in Part I we look at perhaps the earliest statistical language approach, N-grams (Markov, Andrei Andreevich, 1913, “Statistical investigation of the text of ‘Eugene Onegin’ illustrating the dependence between samples in chain,” Bulletin de l'Académie Impériale des Sciences de St.-Pétersbourg, Bd. 7; the first application to authorship: Morozov, N. A., 1915, “A new weapon for the objective investigation of old writings (linguistic spectra as a means for distinguishing plagiarisms from the truthful attestations of one or another author, and for the determination of their epoch),” Izvestija Akademii Nauk, St. Petersburg, section for Russian language, XX, 1-4). Then, in Part II of this lab component we compare this approach to a rule-based approach for the task of part-of-speech tagging. In the last component of the lab, handed out Wednesday, we shall turn to a more detailed comparison.
Through observing the performances of these taggers and finishing this laboratory you should develop a deeper understanding of the two different perspectives on NLP: the symbolic rule-based perspective and the statistical perspective.
General Background reading for this Laboratory – Please read first!
Part I: Basic probability theory; n-grams
If you are at all uncomfortable with the basic notions of probability theory, especially those required for understanding n-grams – conditional probability; the chain rule; Bayes’ rule; etc., I strongly urge you to read the first selection below. I would also strongly urge everyone to take a look at the second item below. I leave the reading in your textbook and the lecture notes to your own conscience and/or diligence (Chapter 6 in the text is a very good summary, however!)
1. An excellent review of probability theory as applicable to N-grams: here.
2. Same document, with a terrific summary intro to bigrams, here.
3. Your textbook, Chapter 6.
Part II: Part of speech tagging
1. Your textbook, Chapter 8.
2. Eric Brill. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, Dec. 1995.
3. Supplementary: if you want a thorough review of current methods and tests of part of speech taggers, see the recent technical report [Schröder 2002].
Description: In this part, we apply basic counts to a corpus to explore the effectiveness of statistical language modeling via linear symbol strings, or N-grams. We will examine the various factors that can influence such models: the size and choice of training corpus; the “n” in n-gram (the window size over which we define token or word correlations), and so forth, and then see how well N-grams ‘characterize’ an author – their first application, as noted above.
To turn in for Part I: You should do everything that is suggested in this part and answer all questions in writing on your web page report – if a sentence ends with a question mark, it is a question you need to answer. (Well, you don’t have to answer the question about the red pill…yet.)
Credits: Part of this exercise was written by Philip Resnik of the University of Maryland, and as augmented at the University of Pittsburgh.
The perl program gramify.pl in this directory computes unigrams, bigrams, trigrams, and so forth given an input text, and outputs these along with their frequency counts. It is a simple wrapper based on the perl module Ngram.pm and an associated program ngram.pl, also in this directory. See Ngram.html and ngram.html for a complete description of ngram. Here is a list of the options for gramify:
gramify [--version] [--help] [--n=3] [--normalize] [--type=TYPE]
[--orderby=ORD] [--onlyfirst=N] [input files]
Compute the ngram frequencies and write tables to stdout.
--n=N The default is 3-grams.
--normalize Produce normalized frequencies (divided by the total
number of n-grams of the same size)
--type=T The default is character. For more types see
the Text::Ngrams module (i.e., Ngram.html)
--limit=N Limit the number of distinct n-grams.
--help Show this help.
--version Show version.
--orderby=ARG ARG can be: frequency or ngram.
--onlyfirst=N Only first N ngrams are printed for each n.
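To make concrete what a tool like gramify computes, here is a minimal Python sketch (not the actual Perl code) of n-gram counting: slide a window of size n over the token stream, tally each window, and optionally normalize by the total count, as the --normalize option does.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def normalize(counts):
    """Divide each count by the total number of n-grams (cf. --normalize)."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

words = "the dog chased the cat and the dog barked".split()
bigrams = ngram_counts(words, 2)
print(bigrams[("the", "dog")])                              # 2
print(sorted(bigrams.items(), key=lambda kv: -kv[1])[0])    # (('the', 'dog'), 2)
```

The same function computes character n-grams (the default --type) if you pass a list of characters instead of a list of words.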
To generate from a language model:
The options can be shortened to their unique prefixes, and the two dashes to one dash. If no input files are given, input is read from standard input. For example, the following builds a bigram language model from genen.txt:
athena% gramify.pl --n=2 genen.txt > ~/genen.lm
Using gramify.pl we can also generate sentences based on the n-gram model. More specifically, given a language model and the beginning of a sentence, gramify.pl can choose as the next word the word yielding the highest n-gram count when composed with the previous words into an n-gram. Thus, it uses the previous word for bigrams and the previous two words for trigrams. It continues generating new words until it has generated a 15-word sentence or reaches a dead end (all n-gram counts are zero). The beginning of the sentence (one word for bigrams, two words for trigrams) is given as command line parameter(s). If you want to change the sentence length, you can do so by supplying an integer argument to gramify.pl as --len=INT. For example, the following creates a 5-gram language model and then generates a 14-word sentence using it, starting with the word “remember”:
athena% gramify.pl --n=5 genen.txt > ~/genesis.lm
athena% gramify.pl --gen="remember" --len=14 --lm=~/genesis.lm
This will write the generated sentence to stdout. Now answer the following questions:
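The greedy generation strategy described above can be sketched in a few lines of Python (an illustration of the idea, not gramify.pl's actual code): at each step, pick the continuation with the highest n-gram count given the previous n-1 words, stopping at the length limit or at a dead end.

```python
from collections import Counter

def generate(tokens, n, start, maxlen=15):
    """Greedy n-gram generation from a training token list and a seed."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    sent = list(start)                      # n-1 seed words
    while len(sent) < maxlen:
        context = tuple(sent[-(n - 1):])
        # all words that can follow this context, with their n-gram counts
        candidates = {g[-1]: c for g, c in counts.items() if g[:-1] == context}
        if not candidates:                  # dead end: all n-gram counts are zero
            break
        sent.append(max(candidates, key=candidates.get))
    return " ".join(sent)

corpus = "and god said let there be light and there was light".split()
print(generate(corpus, 2, ["and"], maxlen=6))
```

Note that greedy generation is deterministic: the same seed always yields the same sentence, which is worth keeping in mind when answering questions (c) and (d).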
a) Using a bigram language model and the default sentence length, generate sentences starting from: "he", "And", "father"
b) Using a trigram language model: use the first word generated for each of the above bigram start words, and create starting two-word sequences for trigrams. Then see what the trigram generation will produce from these three bigram starting points. (Again use the default value for sentence length.)
c) Is there any difference between the sentences generated by bigrams and trigrams? Which one of the models do you think will generate more reasonable sentences?
d) Do you think that the sentence you generated has the highest probability among the sentences that have the same start given the n-gram model? Why? Or why not? Explain your reasoning in a few sentences.
Part of speech tagging is the task of assigning an appropriate part of speech or word category, or possibly even a higher-level tag corresponding to a text topic (e.g. Noun, Verb, Adjective; or ‘new’, ‘spam’,…) to each word (or larger unit) in a sentence. This is usually taken to be the first step in automatically processing language at the sentence level. Its output is crucial for tasks such as sentence parsing (assigning a structure and/or meaning to natural language input) and word sense disambiguation (distinguishing words with the same spelling and/or pronunciation). The idea is to save the parser the trouble of disambiguating each word category. There are roughly two major approaches to this task: rule-based approaches (Brill 1994) and statistical approaches (Jelinek 1985; DeMarcken 1990; Merialdo 1991; Cutting et al. 1992). Brill's tagger is one of the best rule-based tagging systems. It is a transformation-based error-driven model that learns a set of rules automatically based on a given corpus and then tags words following these rules. The statistical tagger you will use is in effect an n-gram: it is based on a hidden Markov model (HMM), which computes the probabilities of co-occurrence of words based on a given tagged corpus and then tags texts using these probabilities.
Brill's Rule-Based Tagger
The fundamental component of a tagger is a POS (part-of-speech) tagset, or list of all the word categories that will be used in the tagging process. The pioneering Brown Corpus distinguishes 87 simple tags. Some subsequent projects have tended to elaborate the Brown Corpus tagset, though in fact much subsequent ‘corpus’ work (e.g., the Linguistic Data Consortium project) has actually trimmed down the tags from those in the Brown Corpus. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about 135 tags, the Lancaster UCREL group about 165 tags, and the London-Lund Corpus of Spoken English 197 tags. The rationale behind developing such large, richly articulated tagsets is to approach “the ideal of providing distinct codings for all classes of words having distinct grammatical behavior” – but in fact, this has not really been done, for reasons we will examine later. Choosing an appropriate set of tags for an annotation system has a direct influence on the accuracy and the usefulness of the tagging system. The larger the tag set, the lower the accuracy of the tagging system, since the system has to be capable of making finer distinctions. Conversely, the smaller the tag set, the higher the accuracy. However, a very small tag set tends to make the tagging system less useful since it provides less information. So, there is a trade-off here. Another issue in tag-set design is the consistency of the tagging system: words with the same meaning and the same function should be tagged with the same tags.
To have some idea about tag-set designing, you can look at the tags (Penn Treebank Part of Speech Tags) used in the annotation system of the Penn Treebank Project. This is a small tag-set (about 40 tags) compared to the tag-sets mentioned above.
A similar tag set is used by Brill's tagger, which you will have a chance to investigate shortly. There is a paper justifying the appropriateness of the tagset used in the Penn Treebank Project: Building a large annotated corpus of English: the Penn Treebank (Marcus et al., Computational Linguistics, 19(2), 1993, pp. 313-330).
Brill's rule-based tagger begins processing a sentence by looking up each word in a lexicon. The lexicon contains a list of the possible parts of speech that the word can be assigned. For example, the word rat can be assigned the tag NN (noun, singular or mass) and also the tag VB (verb, base form) since it can be used as a singular noun (The rat scurried across the floor) or as an infinitive verb (He wouldn't rat on us would he?). When Brill's tagger is assigning its initial tags to words in a sentence, it chooses the part of speech that is most common for each word under standard usage. Thus, rat would be assigned the tag, NN.
Alternatively, Brill's tagger can initially assign every word the tag NN (singular common noun), unless it begins with a capital letter, in which case it is tagged NNP (singular proper noun).
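The initial-tagging stage can be sketched as follows (a toy illustration, not Brill's actual code; the lexicon here is a made-up fragment): look each word up in the lexicon and take its most common tag, falling back to NN, or NNP for capitalized words.

```python
# Toy lexicon mapping each known word to its most common tag
# under standard usage (a hypothetical fragment for illustration).
lexicon = {"the": "DT", "rat": "NN", "scurried": "VBD", "across": "IN"}

def initial_tag(word):
    """Initial tag: most common lexicon tag, else NN / NNP for unknowns."""
    if word in lexicon:
        return lexicon[word]
    return "NNP" if word[0].isupper() else "NN"

print([(w, initial_tag(w)) for w in ["the", "rat", "scurried", "Smith", "quickly"]])
```

As the handout notes, this initial pass makes many errors (here, the unknown adverb "quickly" is tagged NN); the transformation rules described next exist to repair them.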
Clearly, the initial tagging performed by Brill’s tagger will contain many errors. To correct these errors, the tagger applies tag-changing rules to the results of the initial stage. There are two kinds of rules, Contextual Rules and Lexical Rules. The Contextual Rules revise the tag of the particular word based on the surrounding words or on the tags of the surrounding words. They take the following forms:
Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.
7. The preceding (following) word is x.
8. The word two before (after) is x.
9. One of the two preceding (following) words is x.
10. The current word is x and the preceding (following) word is y.
11. The current word is x and the preceding (following) word is tagged z.
12. The current word is x.
13. The preceding (following) word is x and the preceding (following) tag is z.
14. The current word is x, the preceding (following) word is y and the preceding (following) tag is w.
where a, b, z and w are variables over the set of parts of speech, and x and y are variables over all words in the training corpus.
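Applying a learned rule of form 1 ("the preceding word is tagged z") is mechanically simple; the following Python sketch (an illustration, not Brill's implementation) shows one such rule being applied to an initially tagged sentence.

```python
def apply_prevtag_rule(tagged, a, b, z):
    """Change tag a to tag b when the preceding word is tagged z.
    tagged: list of (word, tag) pairs."""
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == a and out[i - 1][1] == z:
            out[i] = (out[i][0], b)
    return out

# "rat" initially gets its most common tag NN; the rule NN VB PREVTAG TO
# repairs it to VB after the infinitival "to".
sent = [("he", "PRP"), ("wanted", "VBD"), ("to", "TO"), ("rat", "NN")]
print(apply_prevtag_rule(sent, "NN", "VB", "TO"))
```

In the real tagger, an ordered list of such learned rules is applied in sequence, each one rewriting the output of the previous one.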
A training corpus is a collection of language data which functions as a sample of the language and is used to build language models. A training corpus may contain manually added information to facilitate the training process of the language model; in our case, the training corpus is an annotated (tagged) corpus. A training corpus may also consist only of plain text, without any added annotations. Put simply, a training corpus provides the examples from which the computer system learns the relevant knowledge (e.g., rules) for later use in automatic natural language processing.
Let's look at two examples of Contextual Rules. The first is
NN VB PREVTAG TO
This means: change tag NN to tag VB when the preceding word is tagged TO. For instance,
to/TO rat/NN
would be changed to
to/TO rat/VB
The second example is
VBP VB PREV1OR2OR3TAG MD
This means: change tag VBP (verb, non-3rd person singular present) to VB (verb, base form) when one of the three preceding words is tagged MD (modal verb).
The following are some of the abbreviations used in the representation of contextual rules:
1. PREV --- previous
2. PREVTAG --- preceding word is tagged
3. PREV1OR2TAG --- one of the two preceding words is tagged
4. PREV1OR2OR3TAG --- one of the three preceding words is tagged
5. WDAND2AFT --- the current word is x and the word two after is y
6. PREV1OR2WD --- one of the two preceding words is …
7. NEXT1OR2TAG --- one of the two following words is tagged
8. NEXTTAG --- following word is tagged
9. NEXTWD --- following word is …
10. WDNEXTTAG --- the current word is x and the following word is tagged z
11. SURROUNDTAG --- the preceding word is tagged x and the following word is tagged y
12. PREVBIGRAM --- the two preceding words are tagged
13. CURWD --- the current word is …
The other rules used by Brill's tagger are Lexical Rules. Lexical Rules are used to analyze words that the tagger does not have in its lexicon to see if it can make a reasonable guess as to their classification. The Lexical Rules take advantage of the internal structure or morphology of many words. For example, the word grapes can be analyzed as grape + s. The use of Lexical Rules greatly reduces the size of the lexicon which Brill's tagger needs to operate effectively.
The following are examples of the formats of Lexical Rules:
Change the tag of an unknown word (from X) to Y if:
1. Deleting the prefix (suffix) x, |x| <= 4, results in a word (x is any string of length 1 to 4).
2. The first (last) (1,2,3,4) characters of the word are x.
3. Adding the character string x as a prefix (suffix) results in a word (|x| <= 4).
4. Word w ever appears immediately to the left (right) of the word.
5. Character z appears in the word.
Let's look at some examples.
NN s fhassuf 1 NNS
change the tag of an unknown word from NN to NNS if it has suffix -s
For instance, webpages/NN to webpages/NNS
NN . fchar CD
change the tag of an unknown word from NN to CD if it has character '.'
For instance, 3.5/NN to 3.5/CD
NN - fchar JJ
change the tag of an unknown word from NN to JJ if it has character '-'
For instance, man-made, rule-based, three-year-old, etc.
NN ed fhassuf 2 VBN
change the tag of an unknown word from NN to VBN if it has suffix -ed
For instance, wolfed/NN to wolfed/VBN
NN ing fhassuf 3 VBG
change the tag of an unknown word from NN to VBG if it has suffix -ing
For instance, tagging/NN to tagging/VBG
ly hassuf 2 RB
change the tag of an unknown word from ?? (any) to RB if it has suffix -ly
For instance, morphologically/?? to morphologically/RB
ly addsuf 2 JJ
change the tag of an unknown word from ?? to JJ if adding suffix -ly results in a word
For instance, sil/NN to sil/JJ
NN $ fgoodright CD
change the tag of an unknown word from NN to CD if the word $ can appear to the left
un deletepref 2 JJ
change the tag of an unknown word (any tag) to JJ if deleting the prefix un- results in a word
For instance, unhelpful/?? to unhelpful/JJ
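A suffix rule such as "NN ing fhassuf 3 VBG" is easy to mimic in code. The sketch below (an assumed illustration, not Brill's implementation) checks an unknown word's current tag and its ending, exactly as the fhassuf examples above describe.

```python
def fhassuf(word, tag, from_tag, suffix, to_tag):
    """Mimic a rule like 'NN ing fhassuf 3 VBG': change from_tag to to_tag
    if the unknown word ends with the given suffix; otherwise leave the tag."""
    if tag == from_tag and word.endswith(suffix):
        return to_tag
    return tag

print(fhassuf("tagging", "NN", "NN", "ing", "VBG"))   # VBG
print(fhassuf("morning", "NNP", "NN", "ing", "VBG"))  # NNP (tag does not match)
```

The other lexical rule types (fchar, addsuf, deletepref, fgoodright) follow the same pattern, each testing a different property of the word's spelling or immediate context.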
Now let’s look at statistically-based tagging.
Statistical taggers employ large quantities of statistical data, mysteriously combining hosts of numbers to produce an often-reasonable output without seeming to know anything about the rules of language. Let's try to eliminate some of the mystery surrounding them now.
We first consider the problem in its full generality. Let w1, ..., wT be a sequence of words. We want to find the sequence of lexical categories C1, ...CT that maximizes
PROB(C1,...,CT | w1,...,wT) (1)
which means the probability of a sequence of tags C1,...,CT given a sequence of words w1,...,wT. That is to say, given a sequence of words, we try to find the most likely tag sequence it can have.
Unfortunately, it would take far too much data to generate reasonable estimates for such probabilities, so direct methods cannot be applied. There are, however, reasonable approximation techniques that produce good results. To develop them, we must restate the problem using Bayes' rule, which says that this conditional probability equals
PROB(C1, ...CT) * PROB(w1, ...wT | C1, ...CT) ) / PROB(w1, ...wT) (2)
By using some simplifying methods and approximations (you are referred to Allen's book for detailed explanations), we can reduce the problem to finding the sequence C1, ...CT that maximizes the value of
PROD(i=1 to T) PROB(Ci | Ci-1) * PROB(wi | Ci) (3)
The advantage of this new formula is that the probabilities involved can be readily estimated from a corpus of text labeled with parts of speech. In particular, given a database of text, the bigram probabilities can be estimated simply by counting the number of times each pair of categories occurs and comparing this to the individual category counts.
Recall that a bigram means a frequency count (probability estimate) over two words at a time; that is, the relevant window size is two words.
The probability that a V follows an N would be estimated as follows:
PROB(Ci=V | Ci-1=N) ≈ Count(N at position i-1 and V at i) / Count(N at position i-1) (4)
These probabilities are our familiar bigram probabilities. A bigram probability indicates the likelihood of a category given the knowledge that a particular other category preceded it.
The term PROB(wi|Ci) in formula (3) is called the lexical-generation probability. PROB(wi|Ci) can be estimated simply by counting the number of occurrences of each word by category. Note that the lexical-generation probability is the probability that a given category is realized by a specific word, not the probability that a given word falls in a specific category.
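Both kinds of estimates in formulas (3) and (4) are just ratios of counts over a tagged corpus. The following Python sketch (a toy corpus and illustrative code, not the actual statistical tagger) makes the counting explicit.

```python
from collections import Counter

# Toy tagged corpus: a list of (word, tag) pairs.
tagged = [("the", "DT"), ("rat", "NN"), ("scurried", "VBD"),
          ("across", "IN"), ("the", "DT"), ("floor", "NN")]

tag_counts = Counter(tag for _, tag in tagged)
bigram_counts = Counter((tagged[i - 1][1], tagged[i][1])
                        for i in range(1, len(tagged)))
word_tag_counts = Counter(tagged)

def p_tag(curr, prev):
    """Tag-bigram estimate, formula (4): Count(prev, curr) / Count(prev)."""
    return bigram_counts[(prev, curr)] / tag_counts[prev]

def p_word(word, tag):
    """Lexical-generation estimate: Count(word tagged tag) / Count(tag)."""
    return word_tag_counts[(word, tag)] / tag_counts[tag]

print(p_tag("NN", "DT"))    # both DT's are followed by NN: 2/2 = 1.0
print(p_word("rat", "NN"))  # "rat" is one of the two NN tokens: 1/2 = 0.5
```

An HMM tagger then searches (typically with the Viterbi algorithm) for the tag sequence maximizing the product of these two kinds of probabilities over the sentence.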
Comparing taggers: warmup questions
Run each of the taggers, per the instructions, on the file magi.txt in the directory /mit/6.863/tagging/ and also available here. This is a complete version of O. Henry's famous short story “The Gift of the Magi.” Please direct any output to your own directory. (Do not use the file magi-raw.txt – this file does not break up or tokenize punctuation properly, and is meant just for you to read.) Then please answer the following questions:
Question (a) The word let’s occurs three times in the magi text. What does the Brill tagger give as its final tags for each of these three occurrences? Are they correct? If they are incorrect, point out which tags are incorrect, and explain why. Please include in your discussion a list of what initial tag is assigned to each occurrence, and which (if any) contextual rules are used by the Brill tagger to change this initial setting to its intermediate and then final values. Note: You will have to use the program option -I so as to write the intermediate results to a file (see the description of running the taggers). (The three occurrences: “Take yer hat off and let’s…”; “Let’s be happy. You don’t know...;” “let’s put our Christmas presents away…”)
Question (b) Repeat using the statistical tagger (except that you cannot display any intermediate output since this kind of tagger does not produce any) – we just want you to see what it does on let’s. Please describe any errors that it makes, and any differences from the Brill tagger, and explain the differences, if any.
INCLUDE in your report: answers to Questions (a) and (b) above.
This concludes Components I and II of Laboratory 2.