Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
6.863J/9.611J Natural Language Processing, Spring 2004
Handed Out: February 23 Due: March 05
In class we have observed that there are two current approaches to tackling natural language processing: a symbolic, rule-based approach; and a statistical approach. What are the strengths and weaknesses of each? How can we combine them? How do they help in building practical systems? That is the goal of this laboratory. We shall examine two different aspects of statistical and symbolic natural language processing. As a warmup exercise, in Part I we look at perhaps the earliest statistical language approach, N-grams (Markov, Andrei Andreevich, 1913, “Statistical investigation of the text of Eugene Onegin illustrating the dependence between samples in chain,” Bulletin de l'Académie Impériale des Sciences de St.-Pétersbourg, Bd. 7; the first application to authorship: Morozov, A., 1915, “A new weapon for the objective investigation of old writings (linguistic spectra as a means for distinguishing plagiarisms and truthful attributions of one author from another, and for determining their epoch),” Izvestija Akademii Nauk, St. Petersburg, section for Russian language, XX, 1-4). Then, in Part II of this lab component we compare this approach to a rule-based approach for the task of part-of-speech tagging. In the last component of the lab, handed out Wednesday, we shall turn to a more detailed comparison.
Through observing the performances of these taggers and finishing this laboratory you should develop a deeper understanding of the two different perspectives on NLP: the symbolic rule-based perspective and the statistical perspective.
General Background reading for this Laboratory – Please read first!
Part I: Basic probability theory; n-grams
If you are at all uncomfortable with the basic notions of probability theory, especially those required for understanding n-grams – conditional probability; the chain rule; Bayes’ rule; etc., I strongly urge you to read the first selection below. I would also strongly urge everyone to take a look at the second item below. I leave the reading in your textbook and the lecture notes to your own conscience and/or diligence (Chapter 6 in the text is a very good summary, however!)
1. An excellent review of probability theory as applicable to N-grams: here.
2. Same document, with a terrific summary intro to bigrams, here.
3. Your textbook, Chapter 6.
Part II: Part of speech tagging
1. Your textbook, Chapter 8.
2. Eric Brill, “Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging,” Computational Linguistics, Dec. 1995.
3. Supplementary: if you want a thorough review of current methods and tests of part of speech taggers, see the recent technical report [Schröder 2002].
Description: In this part, we apply basic counts to a corpus to explore the effectiveness of statistical language modeling via linear symbol strings, or N-grams. We will examine the various factors that can influence such models: the size and choice of training corpus; the “n” in n-gram (the window size over which we define token or word correlations), and so forth, and then see how well N-grams ‘characterize’ an author – their first application, as noted above.
To turn in for Part I: You should do everything that is suggested in this part and answer all questions in writing on your web page report – if a sentence ends with a question mark, it is a question you need to answer. (Well, you don’t have to answer the question about the red pill…yet.)
Credits: Part of this exercise was written by Philip Resnik of the University of Maryland, and augmented at the University of Pittsburgh.
The perl program gramify.pl in this directory computes unigrams, bigrams, trigrams, and so forth given an input text, and outputs these along with their frequency counts. It is a simple wrapper based on the perl module Ngram.pm and an associated program ngram.pl, also in this directory. See Ngram.html and ngram.html for a complete description of ngram. Here is a list of the options for gramify:
gramify [--version] [--help] [--n=3] [--normalize] [--type=TYPE]
[--orderby=ORD] [--onlyfirst=N] [input files]
Compute the ngram frequencies and write tables to the stdout.
Options:
--n=N The default is 3-grams.
--normalize Produce normalized frequencies (divided by the total
number of n-grams of the same size)
--type=T The default is character. For more types see
Text::Ngrams module (i.e., Ngram.html).
--limit=N Limit the number of distinct n-grams.
--help Show this help.
--version Show version.
--orderby=ARG ARG can be: frequency or ngram.
--onlyfirst=N Only first N ngrams are printed for each n.
To generate from a language model:
--lm=language_filename
--gen="initial string"
--len=INT
The options can be shortened to their unique prefixes, and the two dashes to one dash. If no files are given, input is read from STDIN. For example, the following builds a bigram language model from genen.txt:
athena% gramify.pl --n=2 genen.txt > ~/genen.lm
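For intuition, the frequency counting that gramify.pl performs (it is a wrapper around Perl's Text::Ngrams) can be sketched in Python. This is an illustrative toy, not the actual implementation:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def normalized(counts):
    """Mimic --normalize: divide each count by the total number of
    n-grams of the same size."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

tokens = "in the beginning God created the heaven and the earth".split()
bigrams = ngram_counts(tokens, 2)
print(bigrams[("the", "heaven")])   # 1
```

With --normalize, each raw count would instead be reported as a fraction of all n-grams of that size, as in the normalized() helper above.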
Using gramify.pl we can also generate sentences based on the n-gram model. More specifically, given a language model and the beginning of the sentence, the program gramify.pl can choose as the next word the word yielding the highest n-gram count when composed with the previous words into an n-gram. Thus, it uses the previous word for bigrams and the previous two words for trigrams. It continues generating new words until it has generated a 15-word sentence or reaches a dead end (all n-gram counts are zero). The beginning of the sentence (one word for bigrams and two words for trigrams) is given as command line parameter(s). If you want to change the sentence length, you can do so by supplying an integer argument to gramify.pl as --len=INT. For example, the following creates a 5-gram language model and then generates a 14-word sentence using it, starting with the word “remember”:
athena% gramify.pl --n=5 genen.txt > ~/genesis.lm
athena% gramify.pl --gen="remember" --len=14 --lm=~/genesis.lm
This will write the following to stdout:
remember_my_covenant_with_you_and_ye_shall_be_a_firmament_in_the_land_of_Canaan_and
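The greedy generation procedure just described can be sketched as follows. This is a simplified illustration using bigram counts only; the real program handles arbitrary n:

```python
from collections import Counter

def greedy_generate(bigram_counts, start, max_len=15):
    """Greedily extend a sentence: at each step pick the word with the
    highest bigram count following the previous word; stop at max_len
    words or when no bigram continues the sentence (a dead end)."""
    words = [start]
    while len(words) < max_len:
        candidates = {w2: c for (w1, w2), c in bigram_counts.items()
                      if w1 == words[-1]}
        if not candidates:
            break  # all n-gram counts are zero: dead end
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

# Made-up counts for illustration
counts = Counter({("he", "said"): 3, ("said", "unto"): 2,
                  ("unto", "him"): 2, ("he", "went"): 1})
print(greedy_generate(counts, "he"))  # he said unto him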
a) Using a bigram language model, and the default sentence length, generate bigrams starting from: "he", "And", "father"
b) Using a trigram language model: use the first word generated for each of the above bigram start words, and create starting two-word sequences for trigrams. Then see what the trigram generation will produce from these three bigram starting points. (Again use the default value for sentence length.)
c) Is there any difference between the sentences generated by bigrams and trigrams? Which one of the models do you think will generate more reasonable sentences?
d) Do you think that the sentence you generated has the highest probability among the sentences that have the same start given the n-gram model? Why? Or why not? Explain your reasoning in a few sentences.
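As an aid to thinking about question (d): under a bigram model, a sentence's probability is (roughly) the product of its successive bigram probabilities, so a word that wins the greedy choice at one step may still lead to a lower-probability continuation later. A minimal scoring sketch, with hypothetical probabilities:

```python
def bigram_sentence_prob(sentence, bigram_probs):
    """Probability of a sentence under a bigram model: the product of
    P(w_i | w_{i-1}) over successive word pairs (the first word's
    unigram probability is omitted for simplicity)."""
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigram_probs.get((w1, w2), 0.0)  # unseen bigram: probability 0
    return p

# Hypothetical normalized bigram probabilities
probs = {("he", "said"): 0.5, ("said", "unto"): 0.4, ("he", "went"): 0.3}
print(bigram_sentence_prob("he said unto", probs))  # 0.5 * 0.4 = 0.2
```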
Part of speech tagging is the task of assigning an appropriate part of speech or word category, or possibly even a higher-level tag corresponding to a text topic (e.g. Noun, Verb, Adjective; or ‘new’, ‘spam’,…) to each word (or larger unit) in a sentence. This is usually taken to be the first step in automatically processing language at the sentence level. Its output is crucial for tasks such as sentence parsing (assigning a structure and/or meaning to natural language input) and word sense disambiguation (distinguishing words with the same spelling and/or pronunciation). The idea is to save the parser the trouble of disambiguating each word category. There are roughly two major approaches to this task: rule-based approaches (Brill 1994) and statistical approaches (Jelinek 1985; DeMarcken 1990; Merialdo 1991; Cutting et al. 1992). Brill's tagger is one of the best rule-based tagging systems. It is a transformation-based error-driven model that learns a set of rules automatically based on a given corpus and then tags words following these rules. The statistical tagger you will use is in effect an n-gram: it is based on a hidden Markov model (HMM), which computes the probabilities of co-occurrence of words based on a given tagged corpus and then tags texts using these probabilities.
Brill's Rule-Based Tagger
The fundamental component of a tagger is a POS (part-of-speech)
tagset, or
list of all the word categories that will be used in the tagging
process. The pioneering Brown Corpus
distinguishes 87
simple tags. Some subsequent projects have tended to elaborate the
Brown Corpus
tagset, though in fact much subsequent ‘corpus’ work (e.g., the
Linguistic Data
Consortium project) has actually trimmed down the tags from those in
the Brown
Corpus. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about
135
tags, the Lancaster UCREL group about 165 tags, and the London-Lund
Corpus of
Spoken English 197 tags. The rationale behind developing such large,
richly
articulated tagsets is to approach “the ideal of providing distinct
codings for
all classes of words having distinct grammatical behavior” – but in
fact, this
has not really been done, for reasons we will examine later. Choosing an appropriate set of tags for an
annotation system has direct influence on the accuracy and the
usefulness of
the tagging system. The larger the tag set, the lower the accuracy of
the
tagging system since the system has to be capable of making finer
distinctions.
Conversely, the smaller the tag set, the higher the accuracy. However,
a very
small tag set tends to make the tagging system less useful since it
provides
less information. So, there is a trade-off here. Another issue in
tag-set
design is the consistency of the tagging system. Words of the same
meaning and
same functions should be tagged with the same tags.
To have some idea about tag-set designing, you can look at the
tags (Penn
Treebank Part of Speech
Tags) used in the annotation system of the Penn Treebank Project.
This is a
small tag-set (about 40 tags) compared to the tag-sets mentioned above.
A similar tag set is used by Brill's tagger, which you will have a
chance to
investigate shortly. There is a paper to justify the appropriateness of
the
tagset used in the Penn Treebank Project, Building a large
annotated corpus
of English: the Penn Treebank (Marcus et al., Computational Linguistics, 19(2), 1993, pp. 313-330).
Brill's rule-based tagger begins processing a sentence by looking up
each
word in a lexicon. The lexicon contains
a list of the possible parts of speech that the word can be assigned.
For
example, the word rat can be assigned the tag NN (noun,
singular or
mass) and also the tag VB (verb, base form) since it can be used as a
singular
noun (The rat scurried across the floor) or as an infinitive
verb (He
wouldn't rat on us would he?). When Brill's tagger is
assigning its
initial tags to words in a sentence, it chooses the part of speech that
is most
common for each word under standard usage. Thus, rat would be
assigned the
tag, NN.
Alternatively, Brill's tagger can assign every word the tag NN (singular common noun) initially unless the word begins with a capital letter, in which case it is tagged NNP (singular proper noun).
Clearly, the initial tagging performed by Brill’s tagger will
contain many
errors. To correct these errors, the tagger applies tag-changing rules
to the
results of the initial stage. There are two kinds of rules, Contextual
Rules and
Lexical
Rules. The Contextual
Rules revise the tag
of the particular word based on the surrounding words or on the tags of
the
surrounding words. They take the following forms:
Change tag a to tag b when:
1. The preceding (following) word is
tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is
tagged w.
6. The preceding (following) word is tagged z and the word two
before
(after) is tagged w.
7. The preceding (following) word is x.
8. The word two before (after) is x.
9. One of the two preceding (following) words is x.
10. The current word is x and the preceding (following) word is
y.
11. The current word is x and the preceding (following) word is
tagged z.
12. The current word is x.
13. The preceding (following) word is x and the preceding
(following)
tag is z.
14. The current word is x, the preceding (following) word is y
and the preceding (following) tag is w.
where a, b, z and w are variables over the set of
parts of
speech, and x and y are variables over all words in
the training
corpus.
A training corpus is a collection of language data which functions
as a
sample of the language and is used to establish language models. A
training
corpus may contain manually-added information to facilitate the
training
process of the language model. In our case, the training corpus is an
annotated
(tagged) corpus. A training corpus may also consist only of plain text without any added annotations. Put simply, a training corpus provides the examples from which the computer system learns the relevant knowledge (e.g. rules) for later use in automatic natural language processing.
Let's look at two examples of Contextual Rules. The first is
NN VB PREVTAG TO
This means: change tag NN to tag VB when the preceding word is tagged TO. For instance,
to/TO
run/NN
would be changed to
to/TO
run/VB
The second example is
VBP VB PREV1OR2OR3TAG MD
This means: change tag VBP (verb, non-3rd person singular present) to VB (verb, base form) when one of the three preceding words is tagged MD (modal verb).
The following are some of the abbreviations used in the
representation of
contextual rules:
1. PREV --- previous (preceding)
2. PREVTAG --- preceding word is tagged
3. PREV1OR2TAG --- one of the two preceding words is tagged
4. PREV1OR2OR3TAG --- one of the three preceding words is tagged
5. WDAND2AFT --- the current word is x and the word two after is y
6. PREV1OR2WD --- one of the two preceding words is …
7. NEXT1OR2TAG --- one of the two following words is tagged
8. NEXTTAG --- following word is tagged
9. NEXTWD --- following word is …
10. WDNEXTTAG --- the current word is x and the following word is
tagged z
11. SURROUNDTAG --- the preceding word is tagged x and the following
word is
tagged y
12. PREVBIGRAM --- the two preceding words are tagged
13. CURWD --- the current word is …
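To make the mechanics concrete, here is a toy Python sketch of how transformations such as PREVTAG and PREV1OR2OR3TAG might be applied to an initially tagged sentence. This is a simplification invented for illustration; Brill's actual tagger applies the learned rules in the order they were acquired:

```python
def apply_contextual_rules(tagged, rules):
    """tagged: list of (word, tag) pairs.
    rules: list of dicts like
      {"from": "NN", "to": "VB", "when": "PREVTAG", "arg": "TO"}.
    Each rule is swept left-to-right over the whole sentence, in order."""
    for rule in rules:
        for i, (word, tag) in enumerate(tagged):
            if tag != rule["from"]:
                continue
            if rule["when"] == "PREVTAG":
                # preceding word is tagged arg
                if i > 0 and tagged[i - 1][1] == rule["arg"]:
                    tagged[i] = (word, rule["to"])
            elif rule["when"] == "PREV1OR2OR3TAG":
                # one of the three preceding words is tagged arg
                prev_tags = [t for _, t in tagged[max(0, i - 3):i]]
                if rule["arg"] in prev_tags:
                    tagged[i] = (word, rule["to"])
    return tagged

sent = [("to", "TO"), ("run", "NN")]
rules = [{"from": "NN", "to": "VB", "when": "PREVTAG", "arg": "TO"}]
print(apply_contextual_rules(sent, rules))  # [('to', 'TO'), ('run', 'VB')]
```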
The other rules used by Brill's tagger are Lexical Rules. Lexical Rules are used
to analyze
words that the tagger does not have in its lexicon to see if it can
make a
reasonable guess as to their classification. The Lexical Rules take
advantage
of the internal structure or morphology of many words. For
example, the
word grapes can be analyzed as grape + s. The use of
Lexical
Rules greatly reduces the size of the lexicon which Brill's tagger needs to operate effectively.
The following are examples of the formats of Lexical Rules:
Change the tag of an unknown word (from X) to Y if:
1. Deleting the prefix (suffix) x, |x| <= 4, results in a word (x is any string of length 1 to 4).
2. The first (last) (1,2,3,4) characters of the word are x.
3. Adding the character string x as a prefix (suffix) results in a word (|x| <= 4).
4. Word w ever appears immediately to the left (right) of the
word.
5. Character z appears in the word.
Let's look at some examples.
Example 1.
NN s fhassuf 1 NNS
means: change the tag of an unknown word from NN to NNS if it has suffix -s.
For instance, webpages/NN to webpages/NNS
Example 2.
NN . fchar CD
means: change the tag of an unknown word from NN to CD if it contains the character '.'.
For instance, 3.5/NN to 3.5/CD
Example 3.
NN - fchar JJ
means: change the tag of an unknown word from NN to JJ if it contains the character '-'.
For instance, man-made, rule-based, three-year-old, etc.
Example 4.
NN ed fhassuf 2 VBN
means: change the tag of an unknown word from NN to VBN if it has suffix -ed.
For instance, wolfed/NN to wolfed/VBN
Example 5.
NN ing fhassuf 3 VBG
means: change the tag of an unknown word from NN to VBG if it has suffix -ing.
For instance, tagging/NN to tagging/VBG
Example 6.
ly hassuf 2 RB
means: change the tag of an unknown word from ?? (any) to RB if it has suffix -ly.
For instance, morphologically/?? to morphologically/RB
Example 7.
ly addsuf 2 JJ
means: change the tag of an unknown word from ?? to JJ if adding suffix -ly results in a word.
For instance, sil/NN to sil/JJ
Example 8.
NN $ fgoodright CD
means: change the tag of an unknown word from NN to CD if the word $ can appear to the left.
Example 9.
un deletepref 2 JJ
means: change the tag of an unknown word from ?? (any) to JJ if deleting the prefix un- results in a word.
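A toy sketch of how lexical rules such as fhassuf and addsuf might be applied to an unknown word. The dictionary encoding of the rules here is invented for illustration; Brill's tagger uses its own rule-file format:

```python
def apply_lexical_rules(word, tag, rules, lexicon):
    """Apply two lexical-rule types to an unknown word's current tag.
    'fhassuf': change tag from rule['from'] to rule['to'] if the word
    ends with the given suffix.
    'addsuf': change tag to rule['to'] if appending the suffix yields
    a word found in the lexicon."""
    for rule in rules:
        if rule["kind"] == "fhassuf":
            if tag == rule["from"] and word.endswith(rule["suffix"]):
                tag = rule["to"]
        elif rule["kind"] == "addsuf":
            if word + rule["suffix"] in lexicon:
                tag = rule["to"]
    return tag

rules = [{"kind": "fhassuf", "from": "NN", "suffix": "s", "to": "NNS"}]
print(apply_lexical_rules("webpages", "NN", rules, lexicon=set()))  # NNS
```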
Statistical tagging
Now let’s look at statistically-based tagging.
Statistical taggers employ large quantities of statistical data,
mysteriously combining hosts of numbers to produce an often-reasonable
output
without seeming to know anything about the rules of language. Let's try
to
eliminate some of the mystery surrounding them now.
We first consider the problem in its full generality. Let w1, ..., wT be a sequence of words. We want to find the sequence of lexical categories C1, ..., CT that maximizes

PROB(C1, ..., CT | w1, ..., wT)    (1)

that is, the probability of a sequence of tags C1, ..., CT given a sequence of words w1, ..., wT. In other words, given a sequence of words, we try to find the most likely tag sequence it can have.
Unfortunately, it would take far too much data to generate
reasonable
estimates for such probabilities, so direct methods cannot be applied.
There
are, however, reasonable approximation techniques that produce good
results. To
develop them, we must restate the problem using Bayes' rule, which says
that
this conditional probability equals
(PROB(C1, ..., CT) * PROB(w1, ..., wT | C1, ..., CT)) / PROB(w1, ..., wT)    (2)
By using some simplifying assumptions and approximations (you are referred to Allen's book for detailed explanations), we can reduce the problem to finding the sequence C1, ..., CT that maximizes the value of

PRODUCT over i = 1, ..., T of PROB(Ci | Ci-1) * PROB(wi | Ci)    (3)
The advantage of this new formula is that the probabilities involved
can be
readily estimated from a corpus of text labeled with parts of speech.
In
particular, given a database of text, the bigram probabilities can be
estimated
simply by counting the number of times each pair of categories occurs
and
comparing this to the individual category counts.
Recall that a bigram means a frequency count (probability estimate) over two words at a time; that is, the relevant window is two words wide.
The probability that a V follows an N would be estimated as follows:
PROB(Ci=V | Ci-1=N) ≈ Count(N at position i-1 and V at i) / Count(N at position i-1)    (4)
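This counting estimate can be computed directly from tagged training data, for example (tag names and data here are made up for illustration):

```python
from collections import Counter

def tag_bigram_probs(tag_sequences):
    """Estimate PROB(Ci | Ci-1) from tagged training data by counting,
    as in formula (4): Count(Ci-1 at i-1 and Ci at i) / Count(Ci-1).
    The denominator is the simple individual category count."""
    pair_counts = Counter()
    tag_counts = Counter()
    for tags in tag_sequences:
        tag_counts.update(tags)
        pair_counts.update(zip(tags, tags[1:]))
    return {(t1, t2): c / tag_counts[t1]
            for (t1, t2), c in pair_counts.items()}

# Two tiny "tagged sentences" as tag sequences
probs = tag_bigram_probs([["N", "V", "N"], ["N", "V"]])
print(probs[("N", "V")])  # 2/3: (N, V) occurs twice, N occurs 3 times
```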
These probabilities are our familiar bigram probabilities. A bigram
probability indicates the likelihood of a category given the knowledge
that a
particular other category preceded it.
The term PROB(wi|Ci) in formula (3) is called
the lexical-
generation probability. PROB(wi|Ci) can be
estimated
simply by counting the number of occurrences of each word by category.
Note
that the lexical-generation probability is the probability that a given
category is realized by a specific word, not the probability
that a
given word falls in a specific category.
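Putting the pieces together, the maximization in formula (3) is typically carried out with the Viterbi dynamic-programming algorithm rather than by enumerating all tag sequences. The following is a minimal sketch; the 'START' boundary tag and all the probability numbers are assumptions invented for illustration:

```python
def viterbi(words, tags, trans, lex):
    """Find the tag sequence maximizing
    PRODUCT over i of PROB(Ci | Ci-1) * PROB(wi | Ci)   (formula (3)).
    trans[(prev, cur)] = PROB(cur | prev);
    lex[(word, tag)] = PROB(word | tag) (the lexical-generation probability).
    'START' is a hypothetical sentence-boundary tag."""
    # Best score and path for each tag at the first word
    best = {t: (trans.get(("START", t), 0.0) * lex.get((words[0], t), 0.0), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # Best predecessor for tag t at this position
            p, path = max(((best[p0][0] * trans.get((p0, t), 0.0), best[p0][1])
                           for p0 in tags), key=lambda x: x[0])
            new[t] = (p * lex.get((w, t), 0.0), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

# Made-up probabilities for a two-tag toy model
trans = {("START", "N"): 0.8, ("START", "V"): 0.2,
         ("N", "V"): 0.7, ("N", "N"): 0.3, ("V", "N"): 0.6, ("V", "V"): 0.4}
lex = {("dogs", "N"): 0.4, ("dogs", "V"): 0.1,
       ("bark", "V"): 0.3, ("bark", "N"): 0.05}
print(viterbi(["dogs", "bark"], ["N", "V"], trans, lex))  # ['N', 'V']
```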
Comparing taggers: warmup questions
Run each of the taggers, per the instructions, on the file magi.txt in the directory /mit/6.863/tagging/, also available here. This is a complete version of O. Henry's famous short story “The Gift of the Magi.” Please direct any output to your own directory. (Do not use the file magi-raw.txt – this file does not break up or tokenize punctuation properly, and is meant just for you to read.) Then please answer the following questions:
Question (a)
The word let’s occurs three times in the magi text. What
does the
Brill tagger give as its final tags for each of these three
occurrences? Are
they correct? If they are incorrect, point out which tags are
incorrect, and
explain why. Please include in your discussion a list of what initial
tag is assigned to each occurrence, and which (if any) contextual rules
are
used by the Brill tagger to change this initial setting to its intermediate and then final values. Note: You will have to use the program option -I to write the intermediate results to a file (see the description of running the taggers). (The
three occurrences: “Take yer hat off and let’s…”; “Let’s be happy. You
don’t
know...;” “let’s put our Christmas presents away…”)
Question (b)
Repeat
using the statistical tagger (except that you cannot display any
intermediate
output since this kind of tagger does not produce any) – we just want
you to
see what it does on let’s. Please
describe any errors that it makes, and any differences from
the Brill tagger, and explain the differences, if any.
INCLUDE in your report: answers to Questions (a) and (b) above.
This concludes Components I and II of Laboratory 2.