Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
6.863J/9.611J Natural Language Processing, Spring, 2003
Handed Out: January 24 Due: March 05
Section 1. Warmup
Following the instructions for running the taggers, run each of them on the file magi.txt, which is in the directory /mit/6.863/tagging/ and is also available here. (Do not use the file magi-raw.txt: its punctuation is not properly tokenized, and it is meant only for you to read.) This is a complete version of O. Henry's famous short story “The Gift of the Magi.” Please direct any output to your own directory.
(a) The word let’s occurs three times in the magi text. What does the Brill tagger give as its final tag for each of these three occurrences? Are the tags correct? If any are incorrect, point out which ones, and explain why. Please include in your discussion the initial tag assigned to each occurrence, and which (if any) contextual rules the Brill tagger uses to change this initial setting to its intermediate and then final values. You will have to use the program option –I to write the intermediate results to a file (see the description of running the taggers). (The three occurrences: “Take yer hat off and let’s…”; “Let’s be happy. You don’t know…”; “let’s put our Christmas presents away…”)
(b) Repeat using the HMM tagger (except this tagger does not produce any intermediate output, of course) – we just want you to see what it does on let’s.
INCLUDE in your report: answers to parts (a) and (b) above.
Section 2. Comparing tagger performance, part 1
Now we will repeat this experiment on a much longer run of text. Run each of the taggers on the following texts from the Penn Treebank and compare their output to the “gold standard” tagged texts. In each of the lines below, the link to “Text n” (e.g., “Text 1”) is to a version of the text formatted with one sentence per line. This version is easier to read, but you should not use it for the actual tagging experiments. The second column has the untagged versions of the same texts, tokenized and formatted for use by the taggers. These are the files you should use as input to the taggers; they are all labeled *.raw. The third column, with the “Tagged” link, is the human-tagged file from the Treebank, which is the “gold standard” (the truth) to be used to evaluate the tagging. These corresponding tagged files are all labeled *.pos. Their tags are gospel. (You will notice that they are actually bracketed, or parsed, as well, but we will ignore this information for now.) The texts are divided into the chunks you see for ease of handling. All the texts are also bundled together in one file, which might prove too large to look at but is useful for evaluating the taggers. These files are linked to the web page directory in the AI Lab, but they are also in the course locker under /mit/6.863/tagging/.
In addition, as you may learn from looking at them, these texts are drawn from different sources. In particular, only the last three (Texts 11, 12, 13) are from the Wall Street Journal, the same material the taggers were trained on. Text 10 is particularly interesting, because it is from the so-called “Switchboard” data of actual phone conversations (more or less). Texts 1-9 are from written material other than the Wall Street Journal. Thus, we would expect performance to vary. Please study the differences in tagging errors (if any) from genre to genre (naturally, we might expect the last three texts to give the taggers the fewest problems).
INCLUDE in your report: Log files exhibiting the tagging errors you are discussing, and your discussion of the errors, including any shifts as the text genre itself changes.
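As a rough illustration of the token-by-token comparison this section asks for, here is a minimal Python sketch. It assumes both the tagger output and the gold standard have already been reduced to aligned lines of word/TAG tokens; that format is an assumption (the *.pos files are bracketed and would need to be flattened first), and the function names and the example sentence are hypothetical, not part of the course software.

```python
from collections import Counter


def parse_tagged(line, sep="/"):
    """Split a 'word/TAG word/TAG' line into (word, tag) pairs.

    The word/TAG layout is an assumed format, not necessarily the one
    the course taggers emit.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains '/' still keeps its tag intact.
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def tagging_errors(gold_line, predicted_line):
    """Return (word, gold_tag, predicted_tag) for every mismatched token."""
    gold = parse_tagged(gold_line)
    pred = parse_tagged(predicted_line)
    if len(gold) != len(pred):
        raise ValueError("token counts differ; the inputs are not aligned")
    return [(gw, gt, pt) for (gw, gt), (_, pt) in zip(gold, pred) if gt != pt]


# Hypothetical example sentence, not taken from the course texts.
gold = "The/DT dog/NN barks/VBZ ./."
pred = "The/DT dog/NN barks/NNS ./."

errors = tagging_errors(gold, pred)
# Tally which (gold tag, predicted tag) confusions occur most often.
error_counts = Counter((g, p) for _, g, p in errors)

for word, gold_tag, pred_tag in errors:
    print(f"{word}: gold={gold_tag} predicted={pred_tag}")
```

Running the same tally separately on each text makes it easier to see whether particular confusions become more frequent as the genre moves away from the Wall Street Journal.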
Section 3. Comparing tagger performance, part 2
Quantitatively compare the performance of the two taggers. To do this, you will use this Perl program to compute the confusion matrices comparing each tagger's output to the gold standard and to compute Kappa for each tagger. This file is also in the /mit/6.863/tagging/ directory. A description of how to use the program is here.
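To make the two quantities concrete, the following Python sketch builds a confusion matrix and computes Cohen's kappa from two aligned tag sequences. It is not the course's Perl program, and it assumes that “Kappa” here means Cohen's kappa, that is, (observed agreement - chance agreement) / (1 - chance agreement); the function names and the toy tag sequences are hypothetical.

```python
from collections import Counter


def confusion_matrix(gold_tags, predicted_tags):
    """Count how often each (gold tag, predicted tag) pair occurs."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must be the same length")
    return Counter(zip(gold_tags, predicted_tags))


def cohen_kappa(gold_tags, predicted_tags):
    """Cohen's kappa between the gold standard and a tagger's output.

    Assumption: this is the kappa the assignment has in mind; the
    course program may define or normalize it differently.
    """
    n = len(gold_tags)
    if n == 0:
        raise ValueError("no tokens to compare")
    matrix = confusion_matrix(gold_tags, predicted_tags)

    # Observed agreement: fraction of tokens on the matrix diagonal.
    observed = sum(c for (g, p), c in matrix.items() if g == p) / n

    # Chance agreement: product of the two marginal tag distributions.
    gold_totals = Counter(gold_tags)
    pred_totals = Counter(predicted_tags)
    expected = sum(gold_totals[t] * pred_totals[t] for t in gold_totals) / (n * n)

    if expected == 1.0:
        # Both sequences use one identical tag throughout; kappa is
        # undefined, so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data, not taken from the course texts.
gold = ["DT", "NN", "VBZ", "DT", "NN"]
pred = ["DT", "NN", "NNS", "DT", "NN"]

matrix = confusion_matrix(gold, pred)
kappa = cohen_kappa(gold, pred)
print(matrix)
print(f"kappa = {kappa:.3f}")
```

The off-diagonal cells of the matrix are where systematic errors show up: a large count in a single (gold, predicted) cell points to one specific confusion rather than scattered noise.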
INCLUDE in your report: Kappa values for each tagger, and answers to the above questions as to performance, systematic errors, and possible fix-ups.
Section 4. Taggers and Kimmo
It has probably not escaped your attention (since I’ve said it three times in class) that the tags that these engines use are impoverished – the so-called Brown corpus tags. We certainly don’t have the richness provided by the PC-KIMMO machinery, with fine-grained features and word decomposition – like the parsing of Spanish verbs into their tense and endings. So why use Kimmo at all? In no more than one page, please come up with the (rough!) specifications for a design that could integrate these two tasks, (1) PC-KIMMO word parsing and (2) tagging, describing how each might assist (or not) the other. Please illustrate with three or four examples (no need to implement unless you feel very ambitious).
INCLUDE in your report: your answer to the above ‘resolution’ between PC-KIMMO and tagging.