Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
6.863J/9.611J Natural Language Processing, Spring 2004
Handed Out: February 26 Due: March 05
In class we have observed that there are two current approaches to tackling natural language processing: a symbolic, rule-based approach and a statistical approach. What are the strengths and weaknesses of each? How can we combine them? How do they help in building practical systems? Exploring these questions is the goal of this laboratory. We will carry out a deeper analysis of POS tagging on larger and different corpora. By observing the performance of these taggers and finishing this laboratory, you should develop a deeper understanding of the two different perspectives on NLP: the symbolic rule-based perspective and the statistical perspective.
General Background Reading for this Laboratory – Please read if you haven’t already!
Basic probability theory; n-grams
If you are at all uncomfortable with the basic notions of probability theory, especially those required for understanding n-grams – conditional probability; the chain rule; Bayes’ rule; etc., I strongly urge you to read the first selection below. I would also strongly urge everyone to take a look at the second item below. I leave the reading in your textbook and the lecture notes to your own conscience and/or diligence (Chapter 6 in the text is a very good summary, however!)
1. An excellent review of probability theory as applicable to N-grams: here.
2. Same document, with a terrific summary intro to bigrams, here.
3. Your textbook, Chapter 6.
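For quick reference, the notions named above can be written out as follows. The notation is chosen here for illustration (w_i for the i-th word of a sentence, C(·) for a count in a training corpus) and may differ from that used in the readings.

```latex
\begin{align*}
P(A \mid B) &= \frac{P(A, B)}{P(B)}
  && \text{(conditional probability)} \\
P(w_1, \dots, w_n) &= \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
  && \text{(chain rule)} \\
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)}
  && \text{(Bayes' rule)} \\
P(w_i \mid w_1, \dots, w_{i-1}) &\approx P(w_i \mid w_{i-1})
  \approx \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
  && \text{(bigram approximation, estimated from counts)}
\end{align*}
```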
Part of speech tagging
1. Your textbook, Chapter 8.
2. Eric Brill. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, Dec. 1995.
3. Supplementary: if you want a thorough review of current methods and tests of part of speech taggers, see the recent technical report [Schröder 2002].
We will now compare the Brill and HMM taggers on a much longer run of text. Run each of the taggers on the following texts from the Penn Treebank and compare their output to the "gold standard" tagged texts. In each of the lines below, the link to “Text n” (e.g., “Text 1”) is to a version of the text formatted with one sentence per line; this is easier to read, but you should not use it for the actual tagging experiments. The second column has the untagged versions of the same texts, tokenized and formatted for use by the taggers. These are the files you should use as input to the taggers, and they are all labeled *.raw. The third column, with the "Tagged" link, points to the human-tagged file from the Treebank, which is the “gold standard” – truth – to be used to evaluate the tagging. These corresponding tagged files are all labeled *.pos. Their tags are to be considered gospel. (You will notice that they are actually bracketed, or pre-parsed, as well, but we will ignore this information for now.) The texts are divided into the chunks you see for ease of handling – all the texts are bundled together in one file as well, but that might prove to be too large to look at (it is, however, useful for evaluating the taggers). These files are linked to the web page directory in the AI Lab, but they are also in the course locker under /mit/6.863/tagging/.
All the texts in one file: Untagged and Tagged.
Choose five tagging errors made by each tagger (i.e., 10 errors in total) and discuss the possible reasons for these errors.
In addition, as you may learn from looking at them, these texts are drawn from different sources. In particular, only the last three (Texts 11, 12, 13) are from the Wall Street Journal – the same material the taggers were trained on. Text 10 is particularly interesting, because it is from the so-called “Switchboard” data of actual phone conversations (more or less). Texts 1-9 are from written material other than the Wall Street Journal. Thus, we would expect performance to vary.
We have provided a very simple-minded Java program in the tagging directory to ease your work in finding the differences between the gold standard text and the computer-tagged text. It takes as input a correct, or ‘gold standard’, .pos file and a corresponding Brill- or HMM-tagged file, and then writes out an HTML page describing where the two differ (suitable for framing in your own home) – see the example output from this program here. The program can be invoked in the following way to create a web page like the example:
java ProcessTags <gold-standard .pos file> <tagged file> > <htmlfilename>.html
Please study the differences in tagging errors (if any) from genre to genre (naturally, we might expect the last three texts to give the taggers the fewest problems, since the taggers were trained on data similar to this).
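To make the comparison concrete, here is a minimal Python sketch of tallying disagreements between a gold-standard text and a tagger's output. It is not the ProcessTags program, and the function names are hypothetical. It assumes both inputs reduce to whitespace-separated word/TAG tokens with identical tokenization, and that tokens without a slash (such as the brackets in the .pos files) can simply be skipped; the real files may need more careful handling.

```python
from collections import Counter


def read_tagged(text):
    """Extract (word, tag) pairs from text made of word/TAG tokens.

    Assumption: tokens are whitespace-separated and the tag follows the
    last '/' in each token. Tokens without a '/' (e.g. bracket symbols)
    are skipped.
    """
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        if sep and word and tag:
            pairs.append((word, tag))
    return pairs


def tag_errors(gold_text, tagged_text):
    """Return (accuracy, Counter of (gold_tag, tagger_tag) mismatches)."""
    gold = read_tagged(gold_text)
    test = read_tagged(tagged_text)
    # The two word sequences must match for a token-by-token comparison.
    if [w for w, _ in gold] != [w for w, _ in test]:
        raise ValueError("word sequences differ; cannot align the two files")
    errors = Counter()
    for (_, gold_tag), (_, test_tag) in zip(gold, test):
        if gold_tag != test_tag:
            errors[(gold_tag, test_tag)] += 1
    accuracy = 1 - sum(errors.values()) / len(gold) if gold else 0.0
    return accuracy, errors


# Toy example (invented, not taken from the Treebank files).
gold = "[ The/DT can/NN ] rusted/VBD ./."
out = "The/DT can/MD rusted/VBN ./."
accuracy, errors = tag_errors(gold, out)
for (gold_tag, test_tag), count in errors.most_common():
    print(f"gold {gold_tag} tagged as {test_tag}: {count}")
print(f"accuracy: {accuracy:.2f}")
```

Running the same tally separately on each text gives one simple way to see how the error pattern shifts from genre to genre.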
INCLUDE in your report: Log files exhibiting the tagging errors you are discussing, and your discussion of the errors, including any shifts as the text genre itself changes.
Part II: Comparing Tagger Performance – evaluation metrics
Quantitatively compare the performance of the two taggers. To do this, you will use this Perl program to compute the confusion matrices comparing each tagger's output to the gold standard and to compute Kappa for each tagger. This file is also in the /mit/6.863/tagging/ directory. A description of how to use the program is here.
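As a rough guide to what these quantities mean, the following Python sketch builds a confusion matrix from two aligned tag sequences and computes Cohen's kappa from it. It is an illustration only: the function names are hypothetical, and the Perl program provided for the course may define or report Kappa differently. The assumed definition is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the two marginal tag distributions.

```python
from collections import Counter


def confusion_matrix(gold_tags, test_tags):
    """Count (gold_tag, tagger_tag) pairs over two aligned tag sequences."""
    if len(gold_tags) != len(test_tags):
        raise ValueError("tag sequences must have the same length")
    return Counter(zip(gold_tags, test_tags))


def cohen_kappa(matrix):
    """Cohen's kappa for a confusion matrix keyed by (gold, tagger) pairs."""
    n = sum(matrix.values())
    if n == 0:
        raise ValueError("empty confusion matrix")
    gold_totals = Counter()
    test_totals = Counter()
    agree = 0
    for (gold_tag, test_tag), count in matrix.items():
        gold_totals[gold_tag] += count
        test_totals[test_tag] += count
        if gold_tag == test_tag:
            agree += count
    p_o = agree / n
    # Chance agreement: product of the two marginals, summed over tags.
    p_e = sum(gold_totals[t] * test_totals[t] for t in gold_totals) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Toy example (invented tags, not real tagger output).
gold_tags = ["DT", "NN", "VBD", "NN"]
test_tags = ["DT", "NN", "VBD", "VB"]
matrix = confusion_matrix(gold_tags, test_tags)
kappa = cohen_kappa(matrix)
print(f"kappa = {kappa:.3f}")
```

The off-diagonal cells of the matrix, such as the (NN, VB) entry in the toy example, are where systematic confusions between particular tag pairs show up.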
INCLUDE in your report: Kappa values for each tagger, and answers to the above questions as to performance, systematic errors, and possible fix-ups.
Part III: Taggers and Kimmo
It has probably not escaped your attention (since I’ve said it three times in class) that the tags that these engines use are impoverished – the so-called Brown corpus tags. We certainly don’t have the richness provided by the PC-KIMMO machinery, with fine-grained features and word decomposition – like the parsing of Spanish verbs into their tense and endings. So why use Kimmo at all? In no more than one page, please come up with the (rough!) specifications for a design that could integrate these two tasks: (1) PC-KIMMO word parsing; and (2) tagging, describing how each might assist the other, or not. In addition, there’s the issue of statistical variation. Please illustrate with three or four examples; there is no need to implement anything unless you feel very ambitious.
INCLUDE in your report: your answer to the above ‘resolution’ between PC-KIMMO and tagging.