KTagger Reference Manual

tagging words using PC-Kimmo parsing
version 1.0b13
October 1997
by Evan Antworth and Stephen McConnel

Copyright (C) 2000 SIL International

Published by:

     Language Software Development
     SIL International
     7500 W. Camp Wisdom Road
     Dallas, TX 75236
     U.S.A.

Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.

The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.

Introduction to the KTagger program
***********************************

KTagger is a stand-alone application built with PC-Kimmo's basic
parsing functions. It accepts as input a word list file, consisting of
one word per line, and produces as output a structured text file
containing the morphological parse(s) of each word. The content and
format of the output file are determined by a "control" file
constructed by the user. KTagger can be used to do part-of-speech
tagging, or to produce a word lexicon or any other kind of structured
output.

To use KTagger, you need a PC-KIMMO language description such as
Englex. The description must include a word grammar file. You do not
need PC-Kimmo itself to use KTagger.

KTagger runs on these systems: MS-DOS, Windows, Macintosh, and Unix.

Running KTagger
***************

KTagger is a batch-oriented program. It reads a control file, and then
processes an input text file to produce an output analysis file.

KTagger uses an old-fashioned command line interface following the
convention of options starting with a dash character (`-'). The
available options are listed below in alphabetical order. Those options
which require an argument have the argument type following the option
letter.

`-h'
     displays a helpful (?) message describing these command line
     options.

`-i filename'
     selects the input text file.

`-l filename'
     selects the output log file.

`-o filename'
     selects the output file of tagged text.

`-q'
     causes KTagger to work quietly, with minimum output to the screen.

`-x filename'
     selects the control file.

The following options exist only in beta-test versions of the program,
since they are used only for debugging.

`-/'
     increments the debugging level. The default is zero (no debugging
     output).

`-z filename'
     opens a file for recording a memory allocation log.

`-Z address,count'
     traps the program at the point where `address' is allocated or
     freed for the `count''th time.

The KTagger Control File
************************

`\rules'
`\lexicon'
`\grammar'
     These specify the PC-Kimmo language description files. Each of
     these declarations must occur exactly once in the control file. If
     one of them is missing, the program aborts with an error message.

`\header'
`\footer'
     These specify what appears at the beginning (before any word
     records) and end (after all word records) of the output file. Each
     of these declarations may occur only once in the control file. If
     one of them is missing, then an empty string is used for that
     declaration.

     The content of the `\header' or `\footer' declarations is a string
     that is delimited by double-quote characters (an empty string is
     indicated by `""'). A string can contain these special characters:

          `\n'    newline
          `\t'    tab
          `\f'    formfeed
          `\"'    double quote
          `\\'    backslash

`\recordstarttag'
`\recordendtag'
     These specify what begins and ends each record in the output file.
     (Each word in the input file is represented by a record in the
     output file.) Each of these declarations may occur only once in
     the control file. If one of them is missing, then an empty string
     is used for that declaration.

`\field'
`\starttag'
`\endtag'
`\path'
     These define an output field as a group. The `\field' declaration
     has no content; it merely indicates the start of a field
     definition. The `\starttag' declaration contains a string
     (possibly empty) inserted before each instance of that field in
     the output file.
     The `\endtag' declaration contains a string (possibly empty)
     inserted after each instance of that field in the output file. The
     `\path' declaration contains a feature path specification that
     refers to the parse result of a word. There are five reserved path
     specifications:

          `'    the input word
          `'    the lexical form
          `'    the gloss
          `'    the parse tree
          `'    top node features

     In addition, the `\path' declaration may specify any feature path
     found in the top node features. Using Englex, a path declaration
     of would return all head features, while would return just the
     value of the pos feature. Thus it is possible to output any
     feature value made available by the grammar of the language
     description.

`\rem'
     marks a comment ("remark") in the control file. Any number of
     these fields may appear in the control file. They have no effect
     on the processing.

Examples
********

This chapter illustrates how to use KTagger by giving three sample
control files used in conjunction with the Englex PC-Kimmo description
of English and the following list of words:

     be
     began
     but
     by
     child
     children
     compute
     computer
     computerize
     could

Tab-delimited Format Output
===========================

Consider the following control file:

     \rem TDF.CTL - control file for KTagger
     \rem Produces output file in tab-delimited format
     \rem Uses Englex (English description for PC-KIMMO)
     \rules ../../../pckimmo/test/eng/english.rul
     \lexicon ../../../pckimmo/test/eng/english.lex
     \grammar ../../../pckimmo/test/eng/english.grm
     \header ""
     \footer ""
     \recordstarttag ""
     \recordendtag "\n"

     \field
     \starttag ""
     \endtag "\t"
     \path

     \field
     \starttag ""
     \endtag "\t"
     \path

     \field
     \starttag ""
     \endtag "\t"
     \path

     \field
     \starttag ""
     \endtag ""
     \path

For the given set of input words, the following output is created:

     be          be              V   [ pos:V vform:BASE ]
     be          be              AUX [ neg:- pos:AUX ]
     began       be`gan          V
             [ finite:+ pos:V tense:PAST vform:ED ]
     but         but             CJ  [ pos:CJ ]
     but         but             PP  [ pos:PP ]
     by          by              PP  [ pos:PP ]
     by          by              AV  [ pos:AV ]
     child       `child          N
             [ agr:[ 3sg:+ ]
             number:SG pos:N proper:- verbal:- ]
     children    `children       N
             [ agr:[ 3sg:- ] number:PL pos:N proper:- verbal:- ]
     compute     com`pute        V   [ pos:V vform:BASE ]
     computer    com`pute+er     N   [ agr:[ 3sg:+ ] number:SG pos:N ]
     computerize com`pute+er+ize V   [ finite:- pos:V vform:BASE ]
     could       could           AUX [ modal:+ neg:- pos:AUX ]

(Lines that are too long have been split, with the `' feature shown on
the second line indented one tab stop.)

Standard Format Output
======================

Consider the following control file:

     \rem SFM.CTL - control file for KTagger
     \rem Produces output file in standard format
     \rem Uses Englex (English description for PC-KIMMO)
     \rules ../../../pckimmo/test/eng/english.rul
     \lexicon ../../../pckimmo/test/eng/english.lex
     \grammar ../../../pckimmo/test/eng/english.grm
     \header ""
     \footer ""
     \recordstarttag ""
     \recordendtag "\n"

     \field
     \starttag "\\w "
     \endtag "\n"
     \path

     \field
     \starttag "\\lx "
     \endtag "\n"
     \path

     \field
     \starttag "\\pos "
     \endtag "\n"
     \path

     \field
     \starttag "\\lemma "
     \endtag "\n"
     \path

     \field
     \starttag "\\lempos "
     \endtag "\n"
     \path

For the given set of input words, the following output is created:

     \w be
     \lx be
     \pos V
     \lemma be
     \lempos V

     \w be
     \lx be
     \pos AUX
     \lemma be
     \lempos AUX

     \w began
     \lx be`gan
     \pos V
     \lemma be`gin
     \lempos V

     \w but
     \lx but
     \pos CJ
     \lemma but
     \lempos CJ

     \w but
     \lx but
     \pos PP
     \lemma but
     \lempos PP

     \w by
     \lx by
     \pos PP
     \lemma by
     \lempos PP

     \w by
     \lx by
     \pos AV
     \lemma by
     \lempos AV

     \w child
     \lx `child
     \pos N
     \lemma `child
     \lempos N

     \w children
     \lx `children
     \pos N
     \lemma `child
     \lempos N

     \w compute
     \lx com`pute
     \pos V
     \lemma com`pute
     \lempos V

     \w computer
     \lx com`pute+er
     \pos N
     \lemma com`pute
     \lempos V

     \w computerize
     \lx com`pute+er+ize
     \pos V
     \lemma com`pute
     \lempos V

     \w could
     \lx could
     \pos AUX
     \lemma could
     \lempos AUX

SGML Output
===========

Consider the following control file:

     \rem SGML.CTL - control file for KTagger
     \rem Produces output file in SGML LEXICON format
     \rem Uses Englex (English description for PC-KIMMO)
     \rules ../../../pckimmo/test/eng/english.rul
     \lexicon ../../../pckimmo/test/eng/english.lex
     \grammar ../../../pckimmo/test/eng/english.grm
     \header "\n \n \n \n \n \n \n ]>\n\n \n"
     \footer "\n"
     \recordstarttag "\n"
     \recordendtag "\n"

     \field
     \starttag ""
     \endtag "\n"
     \path

     \field
     \starttag ""
     \endtag "\n"
     \path

     \field
     \starttag ""
     \endtag "\n"
     \path

     \field
     \starttag ""
     \endtag "\n"
     \path

     \field
     \starttag ""
     \endtag "\n"
     \path

For the given set of input words, the following output is created:

     ]>
     be be V be V
     be be AUX be AUX
     began be`gan V be`gin V
     but but CJ but CJ
     but but PP but PP
     by by PP by PP
     by by AV by AV
     child `child N `child N
     children `children N `child N
     compute com`pute V com`pute V
     computer com`pute+er N com`pute V
     computerize com`pute+er+ize V com`pute V
     could could AUX could AUX

Note that this output contains exactly the same information as the
previous example, except for being packaged as SGML rather than as a
standard format file.

Copyright and fair use policy
*****************************

All of the files in this release of KTagger (source code, executables,
examples, documentation) are copyrighted by SIL International (Language
Software Development, 7500 W. Camp Wisdom Road, Dallas, TX 75236,
U.S.A.). Permission is hereby granted to the user to copy, use, and
distribute the KTagger files under the following conditions:

  1. if you distribute this original release of Englex, you must
     include all files in unmodified form;

  2. you may not charge money for distributing KTagger, in original or
     modified form, beyond minimal media cost without permission of SIL
     International; and

  3. KTagger may not be used in any commercial product without
     permission of SIL International.