KTagger Reference Manual

tagging words using PC-Kimmo parsing

version 1.0b13

October 1997

by Evan Antworth and Stephen McConnel



1. Introduction to the KTagger program

KTagger is a stand-alone application built with PC-Kimmo's basic parsing functions. It accepts as input a word list file, consisting of one word per line, and produces as output a structured text file containing the morphological parse(s) of each word. The content and format of the output file are determined by a "control" file constructed by the user. KTagger can be used to do part-of-speech tagging, or to produce a word lexicon or any other kind of structured output.

To use KTagger, you need a PC-Kimmo language description such as Englex. The description must include a word grammar file. You do not need PC-Kimmo itself to use KTagger.

KTagger runs on these systems: MS-DOS, Windows, Macintosh, and Unix.

2. Running KTagger

KTagger is a batch-oriented program. It reads a control file, and then processes an input text file to produce an output analysis file.

KTagger uses an old-fashioned command line interface following the convention of options starting with a dash character (`-'). The available options are listed below in alphabetical order. Options that require an argument are shown with the argument type after the option letter.

-h
displays a help message describing these command line options.
-i filename
selects the input text file.
-l filename
selects the output log file.
-o filename
selects the output file of tagged text.
-q
causes KTagger to work quietly, with minimum output to the screen.
-x filename
selects the control file.
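
As a rough illustration of how these options combine, the following Python sketch assembles a KTagger command line and runs it. The executable name `ktagger`, the function names, and the file names are assumptions for illustration; only the option letters come from the list above. Whether KTagger reports failure through a nonzero exit status is also an assumption.

```python
import subprocess


def build_ktagger_command(control, infile, outfile, logfile=None, quiet=False,
                          program="ktagger"):
    """Return the argument list for one KTagger run.

    `program` is an assumed executable name; adjust it for your installation.
    """
    cmd = [program, "-x", control, "-i", infile, "-o", outfile]
    if logfile is not None:
        cmd += ["-l", logfile]   # optional log file
    if quiet:
        cmd.append("-q")         # minimum output to the screen
    return cmd


def run_ktagger(*args, **kwargs):
    """Run KTagger; raises CalledProcessError if the exit status is nonzero
    (assuming KTagger signals errors that way)."""
    return subprocess.run(build_ktagger_command(*args, **kwargs), check=True)


# Hypothetical file names; the command is printed here rather than executed.
example = build_ktagger_command("tdf.ctl", "words.txt", "words.tdf",
                                logfile="words.log", quiet=True)
print(" ".join(example))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.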

The following options exist only in beta-test versions of the program, since they are used only for debugging.

-/
increments the debugging level. The default is zero (no debugging output).
-z filename
opens a file for recording a memory allocation log.
-Z address,count
traps the program at the point where address is allocated or freed for the count'th time.

3. The KTagger Control File

\rules
\lexicon
\grammar
These specify the PC-Kimmo language description files. Each of these declarations must occur exactly once in the control file. If one of them is missing, the program aborts with an error message.
\header
\footer
These specify what appears at the beginning (before any word records) and end (after all word records) of the output file. Each of these declarations may occur only once in the control file. If one of them is missing, then an empty string is used for that declaration. The content of the \header or \footer declarations is a string that is delimited by double-quote characters (an empty string is indicated by ""). A string can contain these special characters:
\n
newline
\t
tab
\f
formfeed
\"
double quote
\\
backslash
\recordstarttag
\recordendtag
These specify what begins and ends each record in the output file. (Each word in the input file is represented by a record in the output file.) Each of these declarations may occur only once in the control file. If one of them is missing, then an empty string is used for that declaration.
\field
\starttag
\endtag
\path
These define an output field as a group. The \field declaration has no content; it merely indicates the start of a field definition. The \starttag declaration contains a string (possibly empty) inserted before each instance of that field in the output file. The \endtag declaration contains a string (possibly empty) inserted after each instance of that field in the output file. The \path declaration contains a feature path specification that refers to the parse result of a word. There are five reserved path specifications:
<WORD>
the input word
<LEX>
the lexical form
<GLOSS>
the gloss
<TREE>
the parse tree
<FEAT>
top node features
In addition, the \path declaration may specify any feature path found in the top node features. Using Englex, a path declaration of <head> would return all head features, while <head pos> would return just the value of the pos feature. Thus it is possible to output any feature value made available by the grammar of the language description.
\rem
marks a comment ("remark") in the control file. Any number of these fields may appear in the control file. They have no effect on the processing.
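
The declarations above can be read as a recipe for assembling the output file. The Python sketch below models that recipe; it is not KTagger's implementation. The nested-dictionary representation of a parse, all function names, the bracketed rendering of feature structures, and the treatment of a missing path as an empty string are assumptions made for illustration. Only the escape sequences and the ordering of header, record tags, field tags, and footer come from the descriptions above.

```python
import re

# Escape sequences allowed inside quoted control-file strings.
ESCAPES = {"n": "\n", "t": "\t", "f": "\f", '"': '"', "\\": "\\"}
RESERVED = {"WORD", "LEX", "GLOSS", "TREE", "FEAT"}


def decode_string(body):
    """Decode the text between the double quotes of a declaration.
    Unknown escapes are left untouched (assumed behaviour)."""
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(0)), body)


def format_features(feats):
    """Render a nested dict roughly like '[ pos:V vform:BASE ]'."""
    parts = []
    for name in sorted(feats):
        value = feats[name]
        if isinstance(value, dict):
            value = format_features(value)
        parts.append(f"{name}:{value}")
    return "[ " + " ".join(parts) + " ]"


def resolve_path(parse, path):
    """Look up a path such as '<WORD>' or '<head pos>' in a parse dict."""
    names = path.strip("<>").split()
    if len(names) == 1 and names[0] in RESERVED:
        value = parse.get(names[0], "")
    else:
        value = parse.get("FEAT", {})   # other paths index the top node features
        for name in names:
            if not isinstance(value, dict) or name not in value:
                return ""               # assumed: missing feature gives empty text
            value = value[name]
    return format_features(value) if isinstance(value, dict) else value


def render(parses, fields, header="", footer="", record_start="", record_end=""):
    """Assemble output: header, then one record per parse, then footer."""
    out = [header]
    for parse in parses:
        out.append(record_start)
        for start, end, path in fields:
            out.append(start + resolve_path(parse, path) + end)
        out.append(record_end)
    out.append(footer)
    return "".join(out)


# Hypothetical parse of "began", modelled on Englex-style head features.
began = {"WORD": "began", "LEX": "be`gan",
         "FEAT": {"head": {"finite": "+", "pos": "V", "tense": "PAST", "vform": "ED"}}}
tab = decode_string(r"\t")
fields = [("", tab, "<WORD>"), ("", tab, "<LEX>"), ("", tab, "<head pos>"), ("", "", "<head>")]
print(render([began], fields, record_end=decode_string(r"\n")), end="")
```

Each field is a (start tag, end tag, path) triple, mirroring one \field group in the control file.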

4. Examples

This chapter illustrates how to use KTagger by giving three sample control files used in conjunction with the Englex PC-Kimmo description of English and the following list of words:

be
began
but
by
child
children
compute
computer
computerize
could

4.1 Tab-delimited Format Output

Consider the following control file:

\rem TDF.CTL - control file for KTagger
\rem Produces output file in tab-delimited format
\rem Uses Englex (English description for PC-KIMMO)

\rules   ../../../pckimmo/test/eng/english.rul
\lexicon ../../../pckimmo/test/eng/english.lex
\grammar ../../../pckimmo/test/eng/english.grm

\header ""
\footer ""

\recordstarttag ""
\recordendtag "\n"

\field
\starttag ""
\endtag "\t"
\path <WORD>

\field
\starttag ""
\endtag "\t"
\path <LEX>

\field
\starttag ""
\endtag "\t"
\path <head pos>

\field
\starttag ""
\endtag ""
\path <head>

For the given set of input words, the following output is created:

be      be      V       [ pos:V vform:BASE ] 
be      be      AUX     [ neg:- pos:AUX ] 
began   be`gan  V       [ finite:+ pos:V tense:PAST vform:ED ] 
but     but     CJ      [ pos:CJ ] 
but     but     PP      [ pos:PP ] 
by      by      PP      [ pos:PP ] 
by      by      AV      [ pos:AV ] 
child   `child  N
        [ agr:[ 3sg:+ ] number:SG pos:N proper:- verbal:- ]
children        `children       N
        [ agr:[ 3sg:- ] number:PL pos:N proper:- verbal:- ]
compute com`pute        V       [ pos:V vform:BASE ] 
computer        com`pute+er     N
        [ agr:[ 3sg:+ ] number:SG pos:N ] 
computerize     com`pute+er+ize V
        [ finite:- pos:V vform:BASE ] 
could   could   AUX     [ modal:+ neg:- pos:AUX ] 

(Lines that are too long have been split, with the <head> feature shown on the second line indented one tab stop.)
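
Because each record occupies one line with four tab-separated fields, this output is easy to post-process. The following Python sketch groups the records by input word, which makes ambiguous words (those with more than one parse) easy to spot. The function name and dictionary keys are illustrative assumptions; the sample string reproduces three records from the output above, without the line splitting used for display.

```python
from collections import defaultdict


def read_tdf(text):
    """Group tab-delimited KTagger records by input word."""
    by_word = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Four fields: word, lexical form, part of speech, head features.
        # A line with fewer fields raises ValueError.
        word, lex, pos, head = line.split("\t", 3)
        by_word[word].append({"lex": lex, "pos": pos, "head": head.strip()})
    return dict(by_word)


sample = (
    "be\tbe\tV\t[ pos:V vform:BASE ] \n"
    "be\tbe\tAUX\t[ neg:- pos:AUX ] \n"
    "began\tbe`gan\tV\t[ finite:+ pos:V tense:PAST vform:ED ] \n"
)
records = read_tdf(sample)
ambiguous = [word for word, parses in records.items() if len(parses) > 1]
print(ambiguous)  # prints ['be']
```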

4.2 Standard Format Output

Consider the following control file:

\rem SFM.CTL - control file for KTagger
\rem Produces output file in standard format
\rem Uses Englex (English description for PC-KIMMO)

\rules   ../../../pckimmo/test/eng/english.rul
\lexicon ../../../pckimmo/test/eng/english.lex
\grammar ../../../pckimmo/test/eng/english.grm

\header ""
\footer ""

\recordstarttag ""
\recordendtag "\n"

\field
\starttag "\\w "
\endtag "\n"
\path <WORD>

\field
\starttag "\\lx "
\endtag "\n"
\path <LEX>

\field
\starttag "\\pos "
\endtag "\n"
\path <head pos>

\field
\starttag "\\lemma "
\endtag "\n"
\path <root>

\field
\starttag "\\lempos "
\endtag "\n"
\path <root_pos>

For the given set of input words, the following output is created:

\w be
\lx be
\pos V
\lemma be
\lempos V

\w be
\lx be
\pos AUX
\lemma be
\lempos AUX

\w began
\lx be`gan
\pos V
\lemma be`gin
\lempos V

\w but
\lx but
\pos CJ
\lemma but
\lempos CJ

\w but
\lx but
\pos PP
\lemma but
\lempos PP

\w by
\lx by
\pos PP
\lemma by
\lempos PP

\w by
\lx by
\pos AV
\lemma by
\lempos AV

\w child
\lx `child
\pos N
\lemma `child
\lempos N

\w children
\lx `children
\pos N
\lemma `child
\lempos N

\w compute
\lx com`pute
\pos V
\lemma com`pute
\lempos V

\w computer
\lx com`pute+er
\pos N
\lemma com`pute
\lempos V

\w computerize
\lx com`pute+er+ize
\pos V
\lemma com`pute
\lempos V

\w could
\lx could
\pos AUX
\lemma could
\lempos AUX
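
Standard format output is also simple to read back: records are separated by blank lines, and each line consists of a backslash marker followed by its value. The Python sketch below turns such text into a list of dictionaries. The function name is an assumption, and the sample holds two records copied from the output above.

```python
def read_sfm(text):
    """Parse standard-format text into a list of {marker: value} dicts."""
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        marker, _, value = line.partition(" ")
        current[marker.lstrip("\\")] = value
    if current:
        records.append(current)  # final record without a trailing blank line
    return records


sample = (
    "\\w computer\n\\lx com`pute+er\n\\pos N\n\\lemma com`pute\n\\lempos V\n"
    "\n"
    "\\w could\n\\lx could\n\\pos AUX\n\\lemma could\n\\lempos AUX\n"
)
entries = read_sfm(sample)
for entry in entries:
    print(entry["w"], entry["pos"], entry["lemma"], entry["lempos"])
```

In the first record the word's category (N) differs from the category of its root (V), which is the kind of distinction the separate \pos and \lempos fields preserve.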

4.3 SGML Output

Consider the following control file:

\rem SGML.CTL - control file for KTagger
\rem Produces output file in SGML LEXICON format
\rem Uses Englex (English description for PC-KIMMO)

\rules   ../../../pckimmo/test/eng/english.rul
\lexicon ../../../pckimmo/test/eng/english.lex
\grammar ../../../pckimmo/test/eng/english.grm

\header "<!DOCTYPE LEXICON [\n
<!ELEMENT LEXICON  - - (LE+)>\n
<!ELEMENT LE       - - ( W, LX, POS, LEMMA, LEMPOS )>\n
<!ELEMENT W        - - (#PCDATA)>\n
<!ELEMENT LX       - - (#PCDATA)>\n
<!ELEMENT POS      - - (#PCDATA)>\n
<!ELEMENT LEMMA    - - (#PCDATA)>\n
<!ELEMENT LEMPOS   - - (#PCDATA)>\n
]>\n\n
<lexicon>\n"
\footer "</lexicon>\n"

\recordstarttag "<le>\n"
\recordendtag "</le>\n"

\field
\starttag "<w>"
\endtag "</w>\n"
\path <WORD>

\field
\starttag "<lx>"
\endtag "</lx>\n"
\path <LEX>

\field
\starttag "<pos>"
\endtag "</pos>\n"
\path <head pos>

\field
\starttag "<lemma>"
\endtag "</lemma>\n"
\path <root>

\field
\starttag "<lempos>"
\endtag "</lempos>\n"
\path <root_pos>

For the given set of input words, the following output is created:

<!DOCTYPE LEXICON [
<!ELEMENT LEXICON  - - (LE+)>
<!ELEMENT LE       - - ( W, LX, POS, LEMMA, LEMPOS )>
<!ELEMENT W        - - (#PCDATA)>
<!ELEMENT LX       - - (#PCDATA)>
<!ELEMENT POS      - - (#PCDATA)>
<!ELEMENT LEMMA    - - (#PCDATA)>
<!ELEMENT LEMPOS   - - (#PCDATA)>
]>

<lexicon>
<le>
<w>be</w>
<lx>be</lx>
<pos>V</pos>
<lemma>be</lemma>
<lempos>V</lempos>
</le>
<le>
<w>be</w>
<lx>be</lx>
<pos>AUX</pos>
<lemma>be</lemma>
<lempos>AUX</lempos>
</le>
<le>
<w>began</w>
<lx>be`gan</lx>
<pos>V</pos>
<lemma>be`gin</lemma>
<lempos>V</lempos>
</le>
<le>
<w>but</w>
<lx>but</lx>
<pos>CJ</pos>
<lemma>but</lemma>
<lempos>CJ</lempos>
</le>
<le>
<w>but</w>
<lx>but</lx>
<pos>PP</pos>
<lemma>but</lemma>
<lempos>PP</lempos>
</le>
<le>
<w>by</w>
<lx>by</lx>
<pos>PP</pos>
<lemma>by</lemma>
<lempos>PP</lempos>
</le>
<le>
<w>by</w>
<lx>by</lx>
<pos>AV</pos>
<lemma>by</lemma>
<lempos>AV</lempos>
</le>
<le>
<w>child</w>
<lx>`child</lx>
<pos>N</pos>
<lemma>`child</lemma>
<lempos>N</lempos>
</le>
<le>
<w>children</w>
<lx>`children</lx>
<pos>N</pos>
<lemma>`child</lemma>
<lempos>N</lempos>
</le>
<le>
<w>compute</w>
<lx>com`pute</lx>
<pos>V</pos>
<lemma>com`pute</lemma>
<lempos>V</lempos>
</le>
<le>
<w>computer</w>
<lx>com`pute+er</lx>
<pos>N</pos>
<lemma>com`pute</lemma>
<lempos>V</lempos>
</le>
<le>
<w>computerize</w>
<lx>com`pute+er+ize</lx>
<pos>V</pos>
<lemma>com`pute</lemma>
<lempos>V</lempos>
</le>
<le>
<w>could</w>
<lx>could</lx>
<pos>AUX</pos>
<lemma>could</lemma>
<lempos>AUX</lempos>
</le>
</lexicon>

Note that this output contains exactly the same information as the previous example, except for being packaged as SGML rather than as a standard format file.
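
The DOCTYPE declaration uses SGML syntax (for example the `- -` tag-minimization markers), so an XML parser should not be expected to accept the file as a whole. The element content, however, happens to be well-formed XML in this example, so one pragmatic approach is to skip the prolog and parse from `<lexicon>` onward. The following Python sketch does this with the standard library. The function name is an assumption, the sample is an abridged copy of the output above, and the approach assumes that no field value contains `<` or `&`, which holds here but is not guaranteed in general.

```python
import xml.etree.ElementTree as ET


def read_lexicon(text):
    """Return one dict per <le> element, skipping the SGML prolog."""
    start = text.index("<lexicon>")  # raises ValueError if the tag is absent
    root = ET.fromstring(text[start:])
    return [{child.tag: child.text or "" for child in le}
            for le in root.findall("le")]


sample = """<!DOCTYPE LEXICON [
<!ELEMENT LEXICON  - - (LE+)>
]>

<lexicon>
<le>
<w>children</w>
<lx>`children</lx>
<pos>N</pos>
<lemma>`child</lemma>
<lempos>N</lempos>
</le>
</lexicon>
"""
entries = read_lexicon(sample)
print(entries[0]["w"], entries[0]["lemma"])
```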

5. Copyright and fair use policy

All of the files in this release of KTagger (source code, executables, examples, documentation) are copyrighted by SIL International (Language Software Development, 7500 W. Camp Wisdom Road, Dallas, TX 75236, U.S.A.). Permission is hereby granted to the user to copy, use, and distribute the KTagger files under the following conditions:

  1. if you distribute this original release of KTagger, you must include all files in unmodified form;
  2. you may not charge money for distributing KTagger, in original or modified form, beyond minimal media cost without permission of SIL International; and
  3. KTagger may not be used in any commercial product without permission of SIL International.


This document was generated on 11 May 2000 using texi2html 1.55k.