Evaluating tagger output

This page explains how to compare the tagger output against the gold standard. We have provided a Perl program that computes confusion matrices and kappa for the taggers, comparing their output against the ‘truth’ in the .pos files. For a definition of kappa, please see the class notes or the Jurafsky text. Kappa is 0 when the tagger agrees with the gold standard no more often than chance would predict, and 1 when agreement is complete; values greater than 0.8 are typically considered excellent. The program also prints the P(A) and P(E) values used to compute kappa (recall that P(A) is the observed agreement between ‘truth’ and the tagger, while P(E) is the agreement expected by chance alone).
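As a rough illustration of how these quantities relate, the sketch below computes P(A), P(E), and kappa = (P(A) - P(E)) / (1 - P(E)) for two short tag sequences. It is not the code in compare-taggers.pl: the function name `kappa` and the sample tags are made up for this example, and P(E) is estimated here from each side's own tag frequencies, which is one common convention and may differ from what the provided program does.

```python
from collections import Counter


def kappa(gold_tags, tagger_tags):
    """Return (P(A), P(E), kappa) for two equal-length tag sequences.

    Hypothetical helper for illustration; not part of compare-taggers.pl.
    """
    if len(gold_tags) != len(tagger_tags):
        raise ValueError("tag sequences must have the same length")
    n = len(gold_tags)
    if n == 0:
        raise ValueError("tag sequences must not be empty")

    # P(A): fraction of tokens on which the tagger matches the gold standard.
    p_a = sum(g == t for g, t in zip(gold_tags, tagger_tags)) / n

    # P(E): chance agreement, assuming each side picks tags independently
    # according to its own observed tag frequencies.
    gold_counts = Counter(gold_tags)
    tagger_counts = Counter(tagger_tags)
    p_e = sum(gold_counts[tag] * tagger_counts[tag] for tag in gold_counts) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when P(E) is 1")
    return p_a, p_e, (p_a - p_e) / (1 - p_e)


# Made-up sample data: the tagger mislabels one verb as a noun.
gold = ["DT", "NN", "VBZ", "DT", "NN"]
tagged = ["DT", "NN", "NN", "DT", "NN"]

p_a, p_e, k = kappa(gold, tagged)
print(f"P(A) = {p_a:.3f}  P(E) = {p_e:.3f}  kappa = {k:.3f}")
```

With four of five tags matching, P(A) is 0.8, but kappa comes out lower because some of that agreement would be expected by chance.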

Running the program

The program, compare-taggers.pl, resides in the /mit/6.863/tagging/ directory, so you can run it from there.  Its basic syntax is as follows.

athena> compare-taggers.pl (-b|-h) (-m) (-k) tagger-output gold-standard


The parameters are as follows:


tagger-output    The file containing the output from the tagger
gold-standard    The file containing the "gold standard" tags
-b               Specify that the output is from the Brill tagger
-h               Specify that the output is from the HMM tagger
-m               Print out the confusion matrix
-k               Print out kappa

A further option prints out the results in a format suitable for input into a spreadsheet; its flag does not appear in the syntax line above.
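To make the idea of a confusion matrix concrete, here is a minimal sketch that counts (gold tag, tagger tag) pairs and renders them as tab-separated rows. The function names `confusion_matrix` and `format_matrix`, the sample tags, and the tab-separated layout are all assumptions made for this illustration; the actual output format of compare-taggers.pl may look different.

```python
from collections import Counter


def confusion_matrix(gold_tags, tagger_tags):
    """Count (gold tag, tagger tag) pairs for two equal-length sequences.

    Hypothetical helper for illustration; not part of compare-taggers.pl.
    """
    if len(gold_tags) != len(tagger_tags):
        raise ValueError("tag sequences must have the same length")
    return Counter(zip(gold_tags, tagger_tags))


def format_matrix(counts):
    """Render counts as tab-separated text: gold tags down, tagger tags across."""
    tags = sorted({tag for pair in counts for tag in pair})
    lines = ["\t".join(["gold/tagger"] + tags)]
    for gold_tag in tags:
        # A Counter returns 0 for pairs that never occurred.
        row = [str(counts[(gold_tag, tagger_tag)]) for tagger_tag in tags]
        lines.append("\t".join([gold_tag] + row))
    return "\n".join(lines)


# Made-up sample data: the tagger mislabels one verb as a noun.
gold = ["DT", "NN", "VBZ", "DT", "NN"]
tagged = ["DT", "NN", "NN", "DT", "NN"]

matrix = confusion_matrix(gold, tagged)
print(format_matrix(matrix))
```

Entries on the diagonal are tokens the tagger got right; each off-diagonal entry shows which gold tag was mistaken for which tagger tag.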


To print the confusion matrix and kappa for the file wsj_1975.brill, tagged by the Brill tagger, and save both in the file wsj_1975.results in your home directory:

athena> compare-taggers.pl -b -m -k wsj_1975.brill wsj_1975.pos > ~/wsj_1975.results

To print kappa only for the tagged file sw2019.hmm, tagged by the HMM tagger:

athena> compare-taggers.pl -h -k sw2019.hmm sw2019.pos