Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science

6.863J/9.611J Natural Language Processing, Spring, 2004

Laboratory 2, Component III: 

Statistics and Natural Language: Part of Speech Tagging Bake-Off

Handed Out: February 26          Due: March 5

Introduction: Goals of the Laboratory & Background Reading – The Red Pill or the Blue Pill?

In class we have observed that there are two current approaches to tackling natural language processing: a symbolic, rule-based approach and a statistical approach.  What are the strengths and weaknesses of each?  How can we combine them?  How do they help in building practical systems?  Exploring these questions is the goal of this laboratory.  We will carry out a deeper analysis of POS tagging on larger and more varied corpora. By observing the performance of these taggers and completing this laboratory, you should develop a deeper understanding of the two different perspectives on NLP: the symbolic, rule-based perspective and the statistical perspective.

General Background reading for this Laboratory – Please read if you haven’t already!

Basic probability theory; n-grams

If you are at all uncomfortable with the basic notions of probability theory, especially those required for understanding n-grams – conditional probability, the chain rule, Bayes’ rule, and so on – I strongly urge you to read the first selection below.  I would also strongly urge everyone to take a look at the second item. (A small worked bigram sketch follows the reading list below.)  I leave the reading in your textbook and the lecture notes to your own conscience and/or diligence (Chapter 6 in the text is a very good summary, however!)

1.      An excellent review of probability theory as applicable to N-grams:  here.

2.      Same document, with a terrific summary intro to bigrams, here.

3.      Your textbook, Chapter 6.
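
To make these ideas concrete, here is a tiny illustrative sketch, in Python, of bigram estimation by maximum likelihood.  Nothing in the lab requires you to run it; the toy corpus and the function name are invented purely for illustration.

    # Toy illustration of bigram maximum-likelihood estimation.
    # The corpus below is made up; in practice the counts would come
    # from a large training corpus such as the Penn Treebank text.
    from collections import Counter

    corpus = "the cat sat on the mat".split()

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def bigram_prob(prev, word):
        # MLE estimate of P(word | prev) = count(prev, word) / count(prev)
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    # The chain rule writes P(w1 ... wn) as P(w1) P(w2|w1) P(w3|w1,w2) ...;
    # the bigram (Markov) approximation replaces each factor by P(wi | w_{i-1}).
    print(bigram_prob("the", "cat"))   # 0.5: "the" occurs twice, once followed by "cat"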

Part of speech tagging

1.      Your textbook, Chapter 8.

2.     Eric Brill. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, Dec. 1995.

3.     Supplementary: if you want a thorough review of current methods and tests of part of speech taggers, see the recent technical report  [Schröder 2002].

Your basic job will be to run both taggers on the 13 text collections below.  We have partitioned the texts because some of the files come from different sources than the material the taggers were trained on; all the files are also collected together into one large pair of files, all.raw and all.pos. You will make use of this partitioning, as you’ll see below. As noted on the instructions page, you should capture the tagged output by redirecting it to a file in your own directory.  As usual, you should write up your results as a report in HTML format and submit the URL to the TA.  In particular, please include the information labeled “INCLUDE” below (in addition to the other discussion that we point out).

As before, instructions for running each of these taggers are available on a separate page.


Part I:  Comparing Tagger Performance – the Red Pill or the Blue Pill, take 2

We will now compare the Brill and HMM taggers on a much longer run of text. Run each of the taggers on the following texts from the Penn Treebank and compare their output to the "gold standard" tagged texts. In each of the lines below, the link to “Text n” (e.g., “Text 1”) is to a version of the text formatted with one sentence per line; this is easier to read, but you should not use it for the actual tagging experiments. The second column has the untagged versions of the same texts, tokenized and formatted for use by the taggers.  These are the files you should use as input to the taggers, and they are all labeled *.raw.  The third column, with the "Tagged" link, is to the human-tagged file from the Treebank, which is the “gold standard” – the truth – to be used to evaluate the tagging. These corresponding tagged files are all labeled *.pos, and their tags are to be considered gospel.  (You will notice that they are actually bracketed, or pre-parsed, as well, but we will ignore this information for now.)

The texts are divided into the chunks you see for ease of handling. All the texts are also bundled together in one file, which might prove too large to look at (it is, however, useful for evaluating the taggers).  These files are linked to the web page directory in the AI Lab, but they are also in the course locker under /mit/6.863/tagging/.

All the texts in one file: Untagged and Tagged.

Choose five tagging errors made by each tagger (i.e., 10 errors in total) and discuss the possible reasons for these errors.

In addition, as you may learn from looking at them, these texts are drawn from different sources.  In particular, only the last three (Texts 11, 12, 13) are from the Wall Street Journal – the same material the taggers were trained on.  Text 10 is particularly interesting, because it is from the so-called “Switchboard” data of actual phone conversations (more or less). Texts 1-9 are from written material other than the Wall Street Journal.  Thus, we would expect performance to vary. 

We have provided a very simple-minded Java program in the tagging directory to ease your work in finding the differences between the gold-standard text and the computer-tagged text.  It takes as input a correct, or ‘gold standard’, .pos file and a corresponding Brill- or HMM-tagged file, and then writes out an HTML page describing where the two differ (suitable for framing in your own home) – see the example output from this program here. The program can be invoked in the following way to create a web page like the example one you might have just clicked on:

                        java  ProcessTags  <gold-standard .pos file>  <tagged file>  >  <htmlfilename>.html
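
For instance, if the gold-standard file for one chunk were text11.pos and you had captured the Brill tagger’s output for that chunk in text11.brill, the comparison page could be produced with the command below (these particular file names are only illustrative; substitute whatever names you actually used):

                        java ProcessTags text11.pos text11.brill > text11-diff.html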

Please study the differences in tagging errors (if any) from genre to genre (naturally, we might expect the last three texts to give the taggers the fewest problems, since the taggers were trained on similar data).

INCLUDE in your report: Log files exhibiting the tagging errors you are discussing, and your discussion of the errors, including any shifts as the text genre itself changes.


Part II:  Comparing Tagger Performance – evaluation metrics

Quantitatively compare the performance of the two taggers. Which tagger performs better? Does either tagger make systematic errors, and if so, how might they be fixed? To carry out this comparison, you will use this perl program to compute the confusion matrices comparing each tagger's output to the gold standard, and to compute Kappa for each tagger.  (Kappa measures agreement between the tagger and the gold standard, corrected for the agreement that would be expected by chance; a rough sketch of the computation is given below.)  This file is also in the /mit/6.863/tagging/ directory. A description of how to use the program is here.
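
The perl program does this computation for you, but to see what Kappa actually measures, here is a rough sketch in Python.  This is not the course’s script; the function is hypothetical, and reading the word/tag pairs out of the .pos and tagger-output files is omitted.

    # Sketch of Cohen's Kappa for two tag sequences of equal length.
    from collections import Counter

    def kappa(gold_tags, system_tags):
        assert len(gold_tags) == len(system_tags)
        n = len(gold_tags)

        # Confusion matrix: how often each gold tag was assigned each system tag.
        confusion = Counter(zip(gold_tags, system_tags))

        # Observed agreement: fraction of tokens tagged identically.
        p_observed = sum(c for (g, s), c in confusion.items() if g == s) / n

        # Agreement expected by chance, from the marginal tag distributions.
        gold_marginals = Counter(gold_tags)
        system_marginals = Counter(system_tags)
        p_expected = sum(gold_marginals[t] * system_marginals[t]
                         for t in gold_marginals) / (n * n)

        # Kappa: how much better than chance the observed agreement is.
        return (p_observed - p_expected) / (1 - p_expected)

A Kappa of 1 means perfect agreement with the gold standard; a Kappa near 0 means the agreement is no better than what chance alone would produce.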

INCLUDE in your report: Kappa values for each tagger, and answers to the above questions as to performance, systematic errors, and possible fix-ups.


Part III:  Taggers and Kimmo

It has probably not escaped your attention (since I’ve said it three times in class) that the tags these engines use are impoverished – the so-called Brown corpus tags.  We certainly don’t have the richness provided by the PC-KIMMO machinery, with its fine-grained features and word decomposition – like the parsing of Spanish verbs into their tenses and endings.  So why use Kimmo at all?  In no more than one page, please come up with the (rough!) specifications for a design that could integrate these two tasks, (1) PC-KIMMO word parsing and (2) tagging, describing how each might assist the other, or not.  In addition, there is the issue of statistical variation.  Please illustrate your design with three or four examples; there is no need to implement it unless you feel very ambitious.

INCLUDE in your report: your answer to the above ‘resolution’ between PC-KIMMO and tagging.

This concludes Laboratory 2.