Laboratory 2: Running the Part of Speech Taggers

This page provides more information to help you run the part-of-speech taggers for Laboratory 2.

Tokenizing the text

Note that you must "tokenize" the input to both taggers. In particular, you must perform the following substitutions:

Split punctuation from adjoining words
Convert double quotes (") to doubled backward single quotes (``) for opening quotes and doubled forward single quotes ('') for closing quotes
Split verb contractions and possessive 's from the component morphemes:

children's -> children 's
parents' -> parents '
won't -> wo n't
gonna -> gon na
I'm -> I 'm

The texts provided have been tokenized for you. If you want to try tagging any other text, please make sure that it is properly tokenized.
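
If you do want to tag other text, the substitutions listed above can be sketched in Python as follows. This is a minimal sketch and not the script used to prepare the lab texts: the function name tokenize is illustrative, the list of contraction endings is an assumption, and the period rule is deliberately crude (it also splits the period in abbreviations such as "Mr."). Single quotes used as quotation marks are not handled.

```python
import re


def tokenize(text: str) -> str:
    """Return text with tokens separated by single spaces (rough sketch)."""
    # A double quote at the start of the text, or after whitespace or an
    # opening bracket, is treated as an opening quote and becomes ``.
    text = re.sub(r'(^|[\s(\[{])"', r'\1`` ', text)
    # Any double quote that is left is treated as a closing quote.
    text = text.replace('"', " '' ")

    # Split common punctuation from adjoining words.
    text = re.sub(r'([,;:?!()\[\]{}])', r' \1 ', text)
    # Crude period rule: split a period followed by whitespace or end of text.
    text = re.sub(r'\.(?=\s|$)', ' .', text)

    # gonna -> gon na
    text = re.sub(r"\b(gon)(na)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # won't -> wo n't
    text = re.sub(r"n't\b", " n't", text, flags=re.IGNORECASE)
    # children's -> children 's, I'm -> I 'm (assumed list of endings)
    text = re.sub(r"'(s|m|re|ve|ll|d)\b", r" '\1", text, flags=re.IGNORECASE)
    # parents' -> parents '
    text = re.sub(r"(?<=s)'(?=\s|$)", " '", text)

    # Collapse the extra spaces introduced above.
    return " ".join(text.split())


if __name__ == "__main__":
    print(tokenize('"I\'m gonna go," she said.'))
```

Always inspect the result by eye before tagging; a short script like this will miss cases that a carefully prepared tokenizer handles.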

Running the taggers

Brill tagger

The Brill tagger is described in the textbook, and can be downloaded from Eric Brill's Home Page. This paper describes the tagger in more detail. The Brill tagger documentation directory, /mit/6.863/tagging/brilltagger/Docs/, contains four files that describe how to use the tagger, what options it takes, and how to train it: README.QUICK, README.LONG, README.NBEST, and README.TRAINING.

In particular, you should familiarize yourself with the tagger's -i <filename> output option, which you will need to answer the lab questions. This option makes the tagger write out its first, unamended tag guesses to a separate file, as well as its intermediate files and the rules it uses to change tags.

If you use the default arguments as given below, you are using a full system trained on the entire Brown corpus of one million words and on 5 million words from the Wall Street Journal. (The README files discuss this in much greater detail.)

To use the Brill tagger on an Athena Sun workstation:

1. Create a subdirectory to use with the Brill tagger; for example, you could create a directory called brill.
2. cd into the directory that you created.
3. To run the Brill tagger, cd to /mit/6.863/tagging/brilltagger/Bin_and_Data/ and run tagger from there. Make sure that filename points to the data file you want to tag; the data files are directly under the tagging directory, i.e., .../../magi.txt. For example:

athena> tagger LEXICON filename BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE

Here filename is the name of the file to tag, while LEXICON, BIGRAMS, LEXICALRULEFILE, and CONTEXTUALRULEFILE are literal strings that you type exactly as shown. This prints the tagged file to standard output. If you want to save the output in a file called outfile, you can redirect it like this:

athena> tagger LEXICON filename BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE > ~/brill/outfile

4. The instructions for running the tagger on an Athena Linux workstation are the same, except that you must cd to the directory /mit/6.863/linux/brilltagger/Bin_and_Data and make sure that your filename points to ../tagging/<filename>

5. If you forget the arguments, tagger -h lists them in the usual way.

6. The files in the brilltagging directory under Utilities contain some useful Perl scripts for comparing files, e.g., comparator, for analyzing the changes from stage to stage (see the README files if you are interested).

7. If you run into memory problems (less likely these days), the -s number option makes the tagger process the input number lines at a time.
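
One way to look at the changes from stage to stage, for example between the file written by -i and the final output, is to line up the two tagged files token by token. The following Python sketch assumes that each tagged token has the form word/TAG and that tokens are separated by whitespace; check this against your own output before relying on it. The function names parse_tagged and tag_changes are illustrative and are not part of the tagger distribution or its Utilities scripts.

```python
def parse_tagged(line):
    """Split a line of word/TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that words such as 1/2 stay intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: treat the whole token as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


def tag_changes(before, after):
    """List (word, old_tag, new_tag) where the same word received a new tag.

    Assumes both lines contain the same words in the same order.
    """
    changes = []
    for (w1, t1), (w2, t2) in zip(parse_tagged(before), parse_tagged(after)):
        if w1 == w2 and t1 != t2:
            changes.append((w1, t1, t2))
    return changes


if __name__ == "__main__":
    # Made-up example lines, not real tagger output.
    first_guess = "The/DT race/VB ended/VBD"
    final = "The/DT race/NN ended/VBD"
    for word, old, new in tag_changes(first_guess, final):
        print(word, old, "->", new)
```

To compare whole files, read both files line by line and call tag_changes on each pair of corresponding lines.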

LT-POS HMM tagger

The LT-POS tagger we will use for this assignment was developed by members of Edinburgh's Language Technology Group. As mentioned, this tagger does much more than tag: it also chunks words into groups, or phrases. We shall put this feature aside for now. Because the tagger is more sophisticated, it takes longer to load initially (allow about 30 seconds), but once it has loaded, tagging is very fast.

To use the HMM tagger on an Athena Sun workstation:

1. Create a subdirectory to use with the HMM tagger; for example, you could create a directory called hmm. Note that you probably want your Brill and HMM taggers working in different directories, or things might get messy...
2. To run the HMM tagger, cd to the course /tagging/ltchunk directory, and then run:

athena> cat filename | bin/ltchunk -show_tags

(Again, filename is the name of the file to tag, properly qualified so that it points to the data you want to process, e.g., ../magi.txt.) This prints the tagged (and chunked) data to standard output. If you want to save it in a file called outfile, redirect it as follows:

athena> cat filename | bin/ltchunk -show_tags > ~/hmm/outfile

Once again, on a Linux workstation the commands are the same, but you must first cd to the directory /mit/6.863/linux/ltchunk and adjust your data filename paths accordingly, pointing to the /mit/6.863/tagging/ directory.