This page provides more information to help you run the part-of-speech taggers for Laboratory 2.
Note that you must "tokenize" the input to both the taggers. In particular, you must perform the following substitutions:
The texts provided have been tokenized for you. If you want to try tagging any other text, please make sure that it is properly tokenized.
The Brill tagger is described in the textbook, and can be downloaded from Eric Brill's Home Page. This paper describes the tagger in more detail. Documentation on how it was trained, its options, and so on is in four files in the Brill tagger documentation directory, /mit/6.863/tagging/brilltagger/Docs/: README.QUICK, README.LONG, README.NBEST, and README.TRAINING. These describe how to use the tagger and how to train it. In particular, you should familiarize yourself with the -i <filename> output option of the tagger, which you will need to answer the lab questions. This makes the tagger write out its first, unamended guesses for tags to a separate file, as well as its intermediate files and the rules it is using to change tags. If you use the default arguments as given below, you are using a full system trained on the entire Brown corpus of one million words and 5 million words from the Wall Street Journal. (The README files discuss this in much greater detail.)
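As a rough illustration of how the -i option might be combined with output redirection, here is a minimal shell sketch. The function name run_brill, the output directory layout, and the .initial/.tagged file suffixes are hypothetical, chosen only for this example. It also assumes that options such as -i follow the five positional arguments; check the tagger's help output or README.LONG before relying on that.

```shell
# Sketch: tag one file, keeping both the final output and the tagger's
# first, unamended guesses. Run it from the directory that holds LEXICON,
# BIGRAMS, LEXICALRULEFILE, and CONTEXTUALRULEFILE.
# Assumption: -i is given after the five positional arguments.
run_brill() {
    infile=$1    # tokenized text to tag
    outdir=$2    # directory for results (hypothetical layout)
    base=$(basename "$infile")
    mkdir -p "$outdir" || return 1
    tagger LEXICON "$infile" BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE \
        -i "$outdir/$base.initial" > "$outdir/$base.tagged"
}

# Example call (paths are placeholders):
#   run_brill ../magi.txt ~/brill
```

Keeping the initial guesses and the final tags side by side in one directory makes it easier to see later which tags the rules changed.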
To use the Brill tagger on an Athena Sun workstation:
brill
athena> tagger LEXICON filename BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
Here filename is the name of the file to tag, and LEXICON, BIGRAMS, LEXICALRULEFILE, and CONTEXTUALRULEFILE are strings that you actually type. This will print the tagged file to standard output. If you want to save the output in a file called outfile, you can redirect it like this:
athena> tagger LEXICON filename BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE > ~/brill/outfile
4. The instructions for running the tagger on an Athena Linux workstation are the same, except that you must cd to the directory /mit/6.863/linux/brilltagger/Bin_and_Data and make sure that filename points to ../tagging/<filename>.
5. If you forget the arguments, tagger -h lists them in the usual way.
6. The files in the brilltagging directory under Utilities contain some useful Perl scripts for comparing files, e.g., comparator, for analyzing the changes from stage to stage (see the README files if you are interested).
7. If you run into memory problems (less likely these days), the -s number option makes the tagger process number lines at a time.
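If you just want a quick count of how many tags changed between two stages, a small awk wrapper can stand in for a first look. This is a minimal sketch, not the comparator script from the Utilities directory. It assumes both files contain the same tokens in the same order, written as whitespace-separated word/TAG pairs. The function name tag_diff is hypothetical.

```shell
# Sketch: report tokens whose tag differs between two tagged files.
# Assumes identical tokenization in both files and word/TAG pairs
# separated by whitespace.
tag_diff() {
    awk '
        # First file: remember every token in order.
        NR == FNR { for (i = 1; i <= NF; i++) first[++n] = $i; next }
        # Second file: compare token by token against the first.
        {
            for (i = 1; i <= NF; i++) {
                m++
                if (first[m] != $i) { changed++; print first[m], "->", $i }
            }
        }
        END { printf "%d of %d tokens changed\n", changed, m }
    ' "$1" "$2"
}

# Example call (file names are placeholders):
#   tag_diff initial-guesses tagged-output
```

Each differing token is printed as an old -> new pair, followed by a one-line summary. If the two files do not line up token for token, the counts are not meaningful.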
The LT-POS tagger we will use for this assignment was developed by members of Edinburgh's Language Technology Group. As mentioned, this tagger does much more than tag: it also chunks words into groups, or phrases. We shall put aside this feature for now. Because it is more sophisticated, it takes a bit longer to load initially (allow about 30 seconds), but once it has loaded, tagging is very fast.
To use the HMM tagger on an Athena Sun workstation:
hmm
.
Note that you probably want your Brill and HMM taggers working in different directories, or things might get messy.
athena> cat filename | bin/ltchunk -show_tags
(Again, filename is the name of the file to tag, properly qualified so that it points to the data you want to process, e.g., ../magi.txt.) This will print the tagged (and chunked) data to standard output. If you want to save it in a file outfile, redirect it as follows:
athena> cat filename | bin/ltchunk -show_tags > ~/hmm/outfile