Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science

6.863J/9.611J Natural Language Processing, Spring, 2004


Laboratory 2, Component III: 

Statistics and Natural Language: Part of Speech Tagging Bake-Off




Handed Out: February 26                                        Due: March 05



Introduction: Goals of the Laboratory & Background Reading – The Red Pill or the Blue Pill?

In class we have observed that there are two current approaches to tackling natural language processing: a symbolic, rule-based approach and a statistical approach.  What are the strengths and weaknesses of each?  How can we combine them?  How do they help in building practical systems?  Answering these questions is the goal of this laboratory.  We will carry out a deeper analysis of POS tagging on larger and different corpora. By observing the performance of these taggers and finishing this laboratory, you should develop a deeper understanding of the two different perspectives on NLP: the symbolic rule-based perspective and the statistical perspective.

General Background reading for this Laboratory – Please read if you haven’t already!

Basic probability theory; n-grams

If you are at all uncomfortable with the basic notions of probability theory, especially those required for understanding n-grams (conditional probability, the chain rule, Bayes’ rule, and so on), I strongly urge you to read the first selection below.  I would also strongly urge everyone to take a look at the second item below. I leave the reading in your textbook and the lecture notes to your own conscience and/or diligence (Chapter 6 in the text is a very good summary, however!)

1.      An excellent review of probability theory as applicable to N-grams:  here.

2.      Same document, with a terrific summary intro to bigrams, here.

3.      Your textbook, Chapter 6.
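To make the n-gram readings above concrete, here is a minimal Python sketch (the toy corpus and function name are mine, purely for illustration) of the maximum-likelihood bigram estimate: the chain rule factors a sentence’s probability into conditional probabilities P(w_i | w_{i-1}), each estimated as count(w_{i-1}, w_i) / count(w_{i-1}).

```python
from collections import Counter

# Hypothetical toy corpus, just to make the counts easy to check by hand.
tokens = "the dog saw the cat the dog barked".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE conditional probability P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# "the" occurs 3 times; 2 of those occurrences are followed by "dog",
# so P(dog | the) = 2/3 under the MLE.
p = bigram_prob("the", "dog")
```

A real tagger would of course smooth these counts (unseen bigrams get probability zero under the raw MLE), which is one reason the readings matter.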

Part of speech tagging

1.      Your textbook, Chapter 8.

2.     Eric Brill. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, Dec. 1995.

3.     Supplementary: if you want a thorough review of current methods and tests of part of speech taggers, see the recent technical report  [Schröder 2002].


Your basic job will be to run both taggers on the 13 text collections below.  We have partitioned these because some of the files come from different texts than those the taggers were trained on; all the files are also collected together into one large file each, all.raw (untagged) and all.pos (tagged). You will make use of this partitioning, as you’ll see below. As you will note from the instructions page, you should capture the tagged output by redirecting it to a file in your own directory.  As usual, you should write up your results as a report in HTML format and submit the URL to the TA.  In particular, please include the information labeled “INCLUDE” below (in addition to other discussion that we shall point out).

As before, instructions for running each of these taggers are available on a separate page.

Part I:  Comparing Tagger Performance – the Red Pill or the Blue Pill, take 2

We will now compare the Brill and HMM taggers on a much longer run of text. Run each of the taggers on the following texts from the Penn Treebank and compare their output to the "gold standard" tagged texts. In each of the lines below, the “Text n” link (e.g., “Text 1”) is to a version of the text formatted with one sentence per line; this is easier to read, but you should not use it for the actual tagging experiments. The second column has the untagged versions of the same texts, tokenized and formatted for use by the taggers.  These are the files you should use as input to the taggers; they are all labeled *.raw.  The third column, with the "Tagged" link, is the human-tagged file from the Treebank, which is the “gold standard” – truth – to be used to evaluate the tagging. These corresponding tagged files are all labeled *.pos. Their tags are to be considered gospel.  (You will notice that they are actually bracketed, or pre-parsed, as well, but we will ignore this information for now.) The texts are divided into the chunks you see for ease of handling; all the texts are bundled together in one file as well, but that might prove too large to look at (it is, however, useful for evaluating the taggers).  These files are linked to the web page directory in the AI Lab, but they are also in the course locker under /mit/6.863/tagging/.

All the texts in one file: Untagged  and  Tagged. Choose five tagging errors made by each tagger (i.e., 10 errors in total) and discuss the possible reasons for these errors.

In addition, as you may learn from looking at them, these texts are drawn from different sources.  In particular, only the last three (Texts 11, 12, 13) are from the Wall Street Journal – the same material the taggers were trained on.  Text 10 is particularly interesting, because it is from the so-called “Switchboard” data of actual phone conversations (more or less). Texts 1-9 are from written material other than the Wall Street Journal.  Thus, we would expect performance to vary. 

We have provided a very simple-minded Java program in the tagging directory to ease your work in finding the differences between the gold standard text and the computer-tagged text.  It will take as input a correct, or ‘gold standard’, .pos file and a corresponding Brill- or HMM-tagged file, and then write out an HTML page describing where the two differ (suitable for framing in your own home) – see the example output from this program here. The program can be invoked in the following way, to direct the creation of a web page like the example one you might have just clicked on:

                        java  ProcessTags  <gold-standard .pos file>  <tagged file>  >  <htmlfilename>.html
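If you want to script further analysis yourself, a comparison along the lines of what ProcessTags produces can be sketched in a few lines of Python. This sketch assumes (hypothetically) that both files have already been reduced to flat word/TAG tokens; the real Treebank .pos files are bracketed, so they would need extra parsing first. The function names here are mine, not part of any course tool.

```python
def read_tags(path):
    """Read whitespace-separated word/TAG items into (word, tag) pairs.
    Assumes a flat word/TAG format, NOT the bracketed Treebank layout."""
    with open(path) as f:
        return [tuple(item.rsplit("/", 1)) for item in f.read().split()]

def diff_tags(gold, tagged):
    """Yield (position, word, gold_tag, system_tag) wherever the tags differ,
    assuming the two sequences are token-aligned."""
    for i, ((word, g), (_, t)) in enumerate(zip(gold, tagged)):
        if g != t:
            yield i, word, g, t
```

The output of diff_tags is exactly the raw material you need for the error discussion below: each mismatch gives you the word, what the Treebank says, and what the tagger guessed.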

Please study the differences in tagging errors (if any) from genre to genre (naturally, we might expect the last three texts to give the taggers the fewest problems, since the taggers were trained on data similar to this).

INCLUDE in your report: Log files exhibiting the tagging errors you are discussing, and your discussion of the errors, including any shifts as the text genre itself changes.

Part II:  Comparing Tagger Performance – evaluation metrics

Quantitatively compare the performance of the two taggers. To do this, you will use this Perl program to compute the confusion matrices comparing each tagger's output to the gold standard, and to compute Kappa for each tagger.  This file is also in the /mit/6.863/tagging/ directory. A description of how to use the program is here.
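For intuition about what the Perl program computes, here is a small Python sketch (function names mine) of a confusion matrix and Cohen’s Kappa over two token-aligned tag sequences: Kappa is the observed agreement p_o corrected for the chance agreement p_e predicted by each sequence’s tag marginals, (p_o - p_e) / (1 - p_e).

```python
from collections import Counter

def confusion_matrix(gold, system):
    """Map each (gold_tag, system_tag) pair to its count;
    off-diagonal entries are the tagging errors."""
    return Counter(zip(gold, system))

def kappa(gold, system):
    """Cohen's Kappa = (p_o - p_e) / (1 - p_e) for two aligned tag sequences."""
    n = len(gold)
    # Observed agreement: fraction of positions where the tags match.
    p_o = sum(g == s for g, s in zip(gold, system)) / n
    # Chance agreement: probability both pick the same tag independently,
    # estimated from each sequence's marginal tag frequencies.
    gold_marg, sys_marg = Counter(gold), Counter(system)
    p_e = sum(gold_marg[t] * sys_marg[t] for t in gold_marg) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A Kappa near 1 means agreement well beyond chance; raw accuracy alone can look flattering simply because a few tags (like NN) dominate the distribution, which is why Kappa is the metric asked for here.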

INCLUDE in your report: Kappa values for each tagger, and answers to the above questions as to performance, systematic errors, and possible fix-ups.

Part III:  Taggers and Kimmo

It has probably not escaped your attention (since I’ve said it three times in class) that the tags these engines use are impoverished – the so-called Brown corpus tags.  We certainly don’t have the richness provided by the PC-KIMMO machinery, with its fine-grained features and word decomposition – like the parsing of Spanish verbs into their tenses and endings.  So why use Kimmo at all?  In no more than one page, please come up with the (rough!) specifications for a design that could integrate these two tasks, (1) PC-KIMMO word parsing and (2) tagging, describing how each might assist the other, or not.  In addition, there is the issue of statistical variation.  Please illustrate with three or four examples – no need to implement unless you feel very ambitious.

INCLUDE in your report: your answer to the above ‘resolution’ between PC-KIMMO and tagging.

This concludes Laboratory 2.