KTagger Reference Manual
                 tagging words using PC-Kimmo parsing
                            version 1.0b13
                             October 1997

                 by Evan Antworth and Stephen McConnel

                 Copyright (C) 2000 SIL International

Published by:
Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.

Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.

The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.

Introduction to the KTagger program
***********************************

KTagger is a stand-alone application built with PC-Kimmo's basic
parsing functions.  It accepts as input a word list file, consisting of
one word per line, and produces as output a structured text file
containing the morphological parse(s) of each word.  The content and
format of the output file are determined by a "control" file constructed
by the user.  KTagger can be used to do part-of-speech tagging, or to
produce a word lexicon or any other kind of structured output.

To use KTagger, you need a PC-Kimmo language description such as
Englex.  The description must include a word grammar file.  You do not
need PC-Kimmo itself to use KTagger.

KTagger runs on these systems: MS-DOS, Windows, Macintosh, and Unix.

Running KTagger
***************

KTagger is a batch-oriented program.  It reads a control file, and
then processes an input text file to produce an output analysis file.

KTagger uses an old-fashioned command-line interface following the
convention of options starting with a dash character (`-').  The
available options are listed below in alphabetical order.  Options
that require an argument show the argument type after the option
letter.

`-h'
     displays a helpful (?) message describing these command line
     options.

`-i filename'
     selects the input text file.

`-l filename'
     selects the output log file.

`-o filename'
     selects the output file of tagged text.

`-q'
     causes KTagger to work quietly, with minimum output to the screen.

`-x filename'
     selects the control file.

The following options exist only in beta-test versions of the program,
since they are used only for debugging.

`-/'
     increments the debugging level.  The default is zero (no debugging
     output).

`-z filename'
     opens a file for recording a memory allocation log.

`-Z address,count'
     traps the program at the point where `address' is allocated or
     freed for the `count''th time.
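
As an illustration of how the non-debugging options above combine, the
following Python sketch assembles a command line for a typical run.  The
executable name `ktagger' and all of the file names are assumptions made
for this example; the manual does not state them.

```python
import shlex


def build_command(control, infile, outfile, logfile=None, quiet=False,
                  program="ktagger"):
    """Return an argument list for one KTagger run.

    Only options documented above are used.  The default program name
    "ktagger" is an assumption; adjust it to match the installed binary.
    """
    argv = [program, "-x", control, "-i", infile, "-o", outfile]
    if logfile is not None:
        argv += ["-l", logfile]   # optional output log file
    if quiet:
        argv.append("-q")         # minimum output to the screen
    return argv


# Hypothetical file names, for illustration only.
argv = build_command("tdf.ctl", "words.txt", "words.tdf", quiet=True)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)'
once the program name and paths match the local installation.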

The KTagger Control File
************************

`\rules'
`\lexicon'
`\grammar'
     These specify the PC-Kimmo language description files.  Each of
     these declarations must occur exactly once in the control file.
     If one of them is missing, the program aborts with an error
     message.

`\header'
`\footer'
     These specify what appears at the beginning (before any word
     records) and end (after all word records) of the output file.
     Each of these declarations may occur only once in the control
     file.  If one of them is missing, then an empty string is used for
     that declaration.

     The content of the `\header' or `\footer' declarations is a string
     that is delimited by double-quote characters (an empty string is
     indicated by `""').  A string can contain these special characters:
    `\n'
          newline

    `\t'
          tab

    `\f'
          formfeed

    `\"'
          double quote

    `\\'
          backslash

`\recordstarttag'
`\recordendtag'
     These specify what begins and ends each record in the output file.
     (Each word in the input file is represented by a record in the
     output file.)  Each of these declarations may occur only once in
     the control file.  If one of them is missing, then an empty string
     is used for that declaration.

`\field'
`\starttag'
`\endtag'
`\path'
     These define an output field as a group.  The `\field' declaration
     has no content; it merely indicates the start of a field
     definition.  The `\starttag' declaration contains a string
     (possibly empty) inserted before each instance of that field in
     the output file.  The `\endtag' declaration contains a string
     (possibly empty) inserted after each instance of that field in the
     output file.  The `\path' declaration contains a feature path
     specification that refers to the parse result of a word.  There
     are five reserved path specifications:
    `<WORD>'
          the input word

    `<LEX>'
          the lexical form

    `<GLOSS>'
          the gloss

    `<TREE>'
          the parse tree

    `<FEAT>'
          top node features

     In addition, the `\path' declaration may specify any feature path
     found in the top node features.  Using Englex, a path declaration
     of `<head>' would return all head features, while `<head pos>' would
     return just the value of the `pos' feature.  Thus it is possible to
     output any feature value made available by the grammar of the
     language description.

`\rem'
     marks a comment ("remark") in the control file.  Any number of
     these fields may appear in the control file.  They have no effect
     on the processing.
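
The special characters allowed in quoted strings (listed under `\header'
and `\footer' above) can be illustrated with a short Python sketch that
decodes one such string.  This is not KTagger's own code: the function
name is hypothetical, the string is assumed to sit on a single line, and
rejecting unlisted escape sequences is a choice made for this example.

```python
# The five special characters documented for control-file strings.
ESCAPES = {"n": "\n", "t": "\t", "f": "\f", '"': '"', "\\": "\\"}


def decode_string(text):
    """Decode a double-quoted control-file string into its literal value."""
    text = text.strip()
    if len(text) < 2 or text[0] != '"' or text[-1] != '"':
        raise ValueError("string must be delimited by double quotes")
    out = []
    chars = iter(text[1:-1])
    for ch in chars:
        if ch == "\\":
            code = next(chars, None)   # character following the backslash
            if code not in ESCAPES:
                # Assumption: anything outside the documented list is an error.
                raise ValueError("unsupported escape sequence")
            out.append(ESCAPES[code])
        else:
            out.append(ch)
    return "".join(out)


print(repr(decode_string(r'"\\w "')))     # start tag that emits a backslash
print(repr(decode_string(r'"</w>\n"')))   # end tag followed by a newline
```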

Examples
********

This chapter illustrates how to use KTagger by giving three sample
control files used in conjunction with the Englex PC-Kimmo description
of English and the following list of words:
     be
     began
     but
     by
     child
     children
     compute
     computer
     computerize
     could

Tab-delimited Format Output
===========================

Consider the following control file:
     \rem TDF.CTL - control file for KTagger
     \rem Produces output file in tab-delimited format
     \rem Uses Englex (English description for PC-KIMMO)
     
     \rules   ../../../pckimmo/test/eng/english.rul
     \lexicon ../../../pckimmo/test/eng/english.lex
     \grammar ../../../pckimmo/test/eng/english.grm
     
     \header ""
     \footer ""
     
     \recordstarttag ""
     \recordendtag "\n"
     
     \field
     \starttag ""
     \endtag "\t"
     \path <WORD>
     
     \field
     \starttag ""
     \endtag "\t"
     \path <LEX>
     
     \field
     \starttag ""
     \endtag "\t"
     \path <head pos>
     
     \field
     \starttag ""
     \endtag ""
     \path <head>

For the given set of input words, the following output is created:
     be      be      V       [ pos:V vform:BASE ]
     be      be      AUX     [ neg:- pos:AUX ]
     began   be`gan  V       [ finite:+ pos:V tense:PAST vform:ED ]
     but     but     CJ      [ pos:CJ ]
     but     but     PP      [ pos:PP ]
     by      by      PP      [ pos:PP ]
     by      by      AV      [ pos:AV ]
     child   `child  N
             [ agr:[ 3sg:+ ] number:SG pos:N proper:- verbal:- ]
     children        `children       N
             [ agr:[ 3sg:- ] number:PL pos:N proper:- verbal:- ]
     compute com`pute        V       [ pos:V vform:BASE ]
     computer        com`pute+er     N
             [ agr:[ 3sg:+ ] number:SG pos:N ]
     computerize     com`pute+er+ize V
             [ finite:- pos:V vform:BASE ]
     could   could   AUX     [ modal:+ neg:- pos:AUX ]

(Lines that are too long have been split, with the `<head>' feature
shown on the second line indented one tab stop.)
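
The difference between the `<head pos>' and `<head>' columns above can
be pictured as a lookup in a nested feature structure.  The Python
sketch below is only an illustration: representing features as nested
dictionaries, the function names, and printing features in alphabetical
order (to resemble the sample output) are all assumptions rather than a
description of KTagger's implementation.  The reserved paths such as
`<WORD>' are not handled.

```python
def resolve_path(features, path):
    """Follow a feature path such as "<head pos>" through nested dicts.

    Returns the value found, or None if any step of the path is missing.
    """
    node = features
    for name in path.strip().strip("<>").split():
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


def render(value):
    """Format a value in the bracketed style of the sample output."""
    if isinstance(value, dict):
        inner = " ".join(f"{key}:{render(val)}"
                         for key, val in sorted(value.items()))
        return f"[ {inner} ]"
    return str(value)


# Hypothetical top node features for the word "child".
top = {"head": {"agr": {"3sg": "+"}, "number": "SG", "pos": "N",
                "proper": "-", "verbal": "-"}}

print(render(resolve_path(top, "<head pos>")))
print(render(resolve_path(top, "<head>")))
```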

Standard Format Output
======================

Consider the following control file:
     \rem SFM.CTL - control file for KTagger
     \rem Produces output file in standard format
     \rem Uses Englex (English description for PC-KIMMO)
     
     \rules   ../../../pckimmo/test/eng/english.rul
     \lexicon ../../../pckimmo/test/eng/english.lex
     \grammar ../../../pckimmo/test/eng/english.grm
     
     \header ""
     \footer ""
     
     \recordstarttag ""
     \recordendtag "\n"
     
     \field
     \starttag "\\w "
     \endtag "\n"
     \path <WORD>
     
     \field
     \starttag "\\lx "
     \endtag "\n"
     \path <LEX>
     
     \field
     \starttag "\\pos "
     \endtag "\n"
     \path <head pos>
     
     \field
     \starttag "\\lemma "
     \endtag "\n"
     \path <root>
     
     \field
     \starttag "\\lempos "
     \endtag "\n"
     \path <root_pos>

For the given set of input words, the following output is created:
     \w be
     \lx be
     \pos V
     \lemma be
     \lempos V
     
     \w be
     \lx be
     \pos AUX
     \lemma be
     \lempos AUX
     
     \w began
     \lx be`gan
     \pos V
     \lemma be`gin
     \lempos V
     
     \w but
     \lx but
     \pos CJ
     \lemma but
     \lempos CJ
     
     \w but
     \lx but
     \pos PP
     \lemma but
     \lempos PP
     
     \w by
     \lx by
     \pos PP
     \lemma by
     \lempos PP
     
     \w by
     \lx by
     \pos AV
     \lemma by
     \lempos AV
     
     \w child
     \lx `child
     \pos N
     \lemma `child
     \lempos N
     
     \w children
     \lx `children
     \pos N
     \lemma `child
     \lempos N
     
     \w compute
     \lx com`pute
     \pos V
     \lemma com`pute
     \lempos V
     
     \w computer
     \lx com`pute+er
     \pos N
     \lemma com`pute
     \lempos V
     
     \w computerize
     \lx com`pute+er+ize
     \pos V
     \lemma com`pute
     \lempos V
     
     \w could
     \lx could
     \pos AUX
     \lemma could
     \lempos AUX

SGML Output
===========

Consider the following control file:
     \rem SGML.CTL - control file for KTagger
     \rem Produces output file in SGML LEXICON format
     \rem Uses Englex (English description for PC-KIMMO)
     
     \rules   ../../../pckimmo/test/eng/english.rul
     \lexicon ../../../pckimmo/test/eng/english.lex
     \grammar ../../../pckimmo/test/eng/english.grm
     
     \header "<!DOCTYPE LEXICON [\n
     <!ELEMENT LEXICON  - - (LE+)>\n
     <!ELEMENT LE       - - ( W, LX, POS, LEMMA, LEMPOS )>\n
     <!ELEMENT W        - - (#PCDATA)>\n
     <!ELEMENT LX       - - (#PCDATA)>\n
     <!ELEMENT POS      - - (#PCDATA)>\n
     <!ELEMENT LEMMA    - - (#PCDATA)>\n
     <!ELEMENT LEMPOS   - - (#PCDATA)>\n
     ]>\n\n
     <lexicon>\n"
     \footer "</lexicon>\n"
     
     \recordstarttag "<le>\n"
     \recordendtag "</le>\n"
     
     \field
     \starttag "<w>"
     \endtag "</w>\n"
     \path <WORD>
     
     \field
     \starttag "<lx>"
     \endtag "</lx>\n"
     \path <LEX>
     
     \field
     \starttag "<pos>"
     \endtag "</pos>\n"
     \path <head pos>
     
     \field
     \starttag "<lemma>"
     \endtag "</lemma>\n"
     \path <root>
     
     \field
     \starttag "<lempos>"
     \endtag "</lempos>\n"
     \path <root_pos>

For the given set of input words, the following output is created:
     <!DOCTYPE LEXICON [
     <!ELEMENT LEXICON  - - (LE+)>
     <!ELEMENT LE       - - ( W, LX, POS, LEMMA, LEMPOS )>
     <!ELEMENT W        - - (#PCDATA)>
     <!ELEMENT LX       - - (#PCDATA)>
     <!ELEMENT POS      - - (#PCDATA)>
     <!ELEMENT LEMMA    - - (#PCDATA)>
     <!ELEMENT LEMPOS   - - (#PCDATA)>
     ]>
     
     <lexicon>
     <le>
     <w>be</w>
     <lx>be</lx>
     <pos>V</pos>
     <lemma>be</lemma>
     <lempos>V</lempos>
     </le>
     <le>
     <w>be</w>
     <lx>be</lx>
     <pos>AUX</pos>
     <lemma>be</lemma>
     <lempos>AUX</lempos>
     </le>
     <le>
     <w>began</w>
     <lx>be`gan</lx>
     <pos>V</pos>
     <lemma>be`gin</lemma>
     <lempos>V</lempos>
     </le>
     <le>
     <w>but</w>
     <lx>but</lx>
     <pos>CJ</pos>
     <lemma>but</lemma>
     <lempos>CJ</lempos>
     </le>
     <le>
     <w>but</w>
     <lx>but</lx>
     <pos>PP</pos>
     <lemma>but</lemma>
     <lempos>PP</lempos>
     </le>
     <le>
     <w>by</w>
     <lx>by</lx>
     <pos>PP</pos>
     <lemma>by</lemma>
     <lempos>PP</lempos>
     </le>
     <le>
     <w>by</w>
     <lx>by</lx>
     <pos>AV</pos>
     <lemma>by</lemma>
     <lempos>AV</lempos>
     </le>
     <le>
     <w>child</w>
     <lx>`child</lx>
     <pos>N</pos>
     <lemma>`child</lemma>
     <lempos>N</lempos>
     </le>
     <le>
     <w>children</w>
     <lx>`children</lx>
     <pos>N</pos>
     <lemma>`child</lemma>
     <lempos>N</lempos>
     </le>
     <le>
     <w>compute</w>
     <lx>com`pute</lx>
     <pos>V</pos>
     <lemma>com`pute</lemma>
     <lempos>V</lempos>
     </le>
     <le>
     <w>computer</w>
     <lx>com`pute+er</lx>
     <pos>N</pos>
     <lemma>com`pute</lemma>
     <lempos>V</lempos>
     </le>
     <le>
     <w>computerize</w>
     <lx>com`pute+er+ize</lx>
     <pos>V</pos>
     <lemma>com`pute</lemma>
     <lempos>V</lempos>
     </le>
     <le>
     <w>could</w>
     <lx>could</lx>
     <pos>AUX</pos>
     <lemma>could</lemma>
     <lempos>AUX</lempos>
     </le>
     </lexicon>

Note that this output contains exactly the same information as the
previous example, except for being packaged as SGML rather than as a
standard format file.
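
To show that the same records can be recovered from the SGML form, the
Python sketch below reads the `<lexicon>' element with the standard
library's XML parser.  This relies on two assumptions: the DOCTYPE
declaration is skipped, because its SGML `- -' markers are unlikely to
be accepted by an XML parser, and the field values contain no characters
(such as `<' or `&') that would need escaping.  The function name is
hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_lexicon(text):
    """Return one dict per <le> element, keyed by the child tag names."""
    start = text.index("<lexicon>")          # skip the SGML DOCTYPE block
    root = ET.fromstring(text[start:].strip())
    return [{child.tag: (child.text or "") for child in entry}
            for entry in root.iter("le")]


# A shortened copy of the output shown above.
sample = """<!DOCTYPE LEXICON [
<!ELEMENT LEXICON  - - (LE+)>
<!ELEMENT LE       - - ( W, LX, POS, LEMMA, LEMPOS )>
]>

<lexicon>
<le>
<w>began</w>
<lx>be`gan</lx>
<pos>V</pos>
<lemma>be`gin</lemma>
<lempos>V</lempos>
</le>
<le>
<w>computer</w>
<lx>com`pute+er</lx>
<pos>N</pos>
<lemma>com`pute</lemma>
<lempos>V</lempos>
</le>
</lexicon>
"""

entries = parse_lexicon(sample)
for entry in entries:
    print(entry["w"], entry["pos"], entry["lemma"])
```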

Copyright and fair use policy
*****************************

All of the files in this release of KTagger (source code, executables,
examples, documentation) are copyrighted by SIL International (Language
Software Development, 7500 W. Camp Wisdom Road, Dallas, TX 75236,
U.S.A.).  Permission is hereby granted to the user to copy, use, and
distribute the KTagger files under the following conditions:
  1. if you distribute this original release of KTagger, you must include
     all files in unmodified form;

  2. you may not charge money for distributing KTagger, in original or
     modified form, beyond minimal media cost without permission of SIL
     International; and

  3. KTagger may not be used in any commercial product without
     permission of SIL International.