PC-Kimmo Reference Manual
           a two-level processor for morphological analysis
                             version 2.1.0
                             October 1997

                 by Evan Antworth and Stephen McConnel

                 Copyright (C) 2000 SIL International

Published by:
Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.

Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.

The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.

Introduction to the PC-Kimmo program
************************************

This document describes PC-Kimmo, an implementation of the two-level
computational linguistic formalism for personal computers.  It is
available for MS-DOS, Microsoft Windows, Macintosh, and Unix.(1)

The authors would appreciate feedback directed to the following
addresses.  For linguistic questions, contact:

         Gary Simons
         SIL International
         7500 W. Camp Wisdom Road
         Dallas, TX 75236                 gary.simons@sil.org
         U.S.A.

For programming questions, contact:

         Stephen McConnel                 (972)708-7361 (office)
         Language Software Development    (972)708-7561 (fax)
         SIL International
         7500 W. Camp Wisdom Road
         Dallas, TX 75236                 steve@acadcomp.sil.org
         U.S.A.                        or Stephen_McConnel@sil.org

An online user manual for PC-Kimmo is available on the world wide web
at the URL `http://www.sil.org/pckimmo/v2/doc/guide.html'.

---------- Footnotes ----------

(1) The Microsoft Windows implementation uses the Microsoft C QuickWin
function, and the Macintosh implementation uses the Metrowerks C SIOUX
function.

The Two-level Formalism
***********************

Two-level phonology is a linguistic tool developed by computational
linguists.  Its primary use is in systems for natural language
processing such as PC-Kimmo.  This chapter describes the linguistic and
computational basis of two-level phonology.(1)

---------- Footnotes ----------

(1) This chapter is excerpted from Antworth 1991.

Computational and linguistic roots
==================================

As the fields of computer science and linguistics have grown up
together during the past several decades, they have each benefited from
cross-fertilization.  Modern linguistics has especially been influenced
by the formal language theory that underlies computation.  The most
famous application of formal language theory to linguistics was
Chomsky's (1957) transformational generative grammar.  Chomsky's
strategy was to consider several types of formal languages to see if
they were capable of modeling natural language syntax.  He started by
considering the simplest type of formal languages, called finite state
languages.  As a general principle, computational linguists try to use
the least powerful computational devices possible.  This is because the
less powerful devices are better understood, their behavior is
predictable, and they are computationally more efficient.  Chomsky
(1957:18ff) demonstrated that natural language syntax could not be
effectively modeled as a finite state language; thus he rejected finite
state languages as a theory of syntax and proposed that syntax requires
the use of more powerful, non-finite state languages.  However, there
is no reason to assume that the same should be true for natural
language phonology.  A finite state model of phonology is especially
desirable from the computational point of view, since it makes possible
a computational implementation that is simple and efficient.

While various linguists proposed that generative phonological rules
could be implemented by finite state devices (see Johnson 1972, Kay
1983), the most successful model of finite state phonology was
developed by Kimmo Koskenniemi, a Finnish computer scientist.  He
called his model two-level morphology (Koskenniemi 1983), though his
use of the term morphology should be understood to encompass both what
linguists would consider morphology proper (the decomposition of words
into morphemes) and phonology (at least in the sense of
morphophonemics).  Our main interest in this article is the
phonological formalism used by the two-level model, hereafter called
two-level phonology.  Two-level phonology traces its linguistic
heritage to "classical" generative phonology as codified in `The Sound
Pattern of English' (Chomsky and Halle 1968).  The basic insight of
two-level phonology is due to the phonologist C. Douglas Johnson (1972)
who showed that the SPE theory of phonology could be implemented using
finite state devices by replacing sequential rule application with
simultaneous rule application.  At its core, then, two-level phonology
is a rule formalism, not a complete theory of phonology.  The following
sections of this article describe the mechanism of two-level rule
application by contrasting it with rule application in classical
generative phonology.  It should be noted that Chomsky and Halle's
theory of rule application became the focal point of much controversy
during the 1970s with the result that current theories of phonology
differ significantly from classical generative phonology.  The
relevance of two-level phonology to current theory is an important
issue, but one that will not be fully addressed here.  Rather, the
comparison of two-level phonology to classical generative phonology is
done mainly for expository purposes, recognizing that while classical
generative phonology has been superseded by subsequent theoretical
work, it constitutes a historically coherent view of phonology that
continues to influence current theory and practice.

One feature that two-level phonology shares with classical generative
phonology is linear representation.  That is, phonological forms are
represented as linear strings of symbols.  This is in contrast to the
nonlinear representations used in much current work in phonology,
namely autosegmental and metrical phonology (see Goldsmith 1990).  On
the computational side, two-level phonology is consistent with natural
language processing systems that are designed to operate on linear
orthographic input.

Two-level rule application
==========================

We will begin by reviewing the formal properties of generative rules.
Stated succinctly, generative rules are sequentially ordered rewriting
rules.  What does this mean?

First, rewriting rules are rules that change or transform one symbol
into another symbol.  For example, a rewriting rule of the form
`a --> b' interprets the relationship between the symbols `a' and `b'
as a dynamic change whereby the symbol `a' is rewritten or turned into
the symbol `b'.  This means that after this operation takes place, the
symbol `a' no longer "exists," in the sense that it is no longer
available to other rules.  In linguistic theory generative rules are
known as process rules.  Process rules attempt to characterize the
relationship between levels of representation (such as the phonemic and
phonetic levels) by specifying how to transform representations from
one level into representations on the other level.

Second, generative phonological rules apply sequentially, that is, one
after another, rather than applying simultaneously.  This means that
each rule creates as its output a new intermediate level of
representation.  This intermediate level then serves as the input to
the next rule.  As a consequence, the underlying form becomes
inaccessible to later rules.

Third, generative phonological rules are ordered; that is, the
description specifies the sequence in which the rules must apply.
Applying rules in any other order may result in incorrect output.

As an example of a set of generative rules, consider the following
rules:

     (1)    Vowel Raising
            e --> i / ___C_0 i
     
     (2)    Palatalization
            t --> c / ___i

Rule 1 (Vowel Raising) states that `e' becomes (is rewritten as) `i' in
the environment preceding `C_0 i' (where `C' stands for the set of
consonants and `C_0' stands for zero or more consonants).  Rule 2
(Palatalization) states that `t' becomes `c' preceding `i'.  A sample
derivation of forms to which these rules apply looks like this (where
UR stands for Underlying Representation, SR stands for Surface
Representation):(1)

         UR:    temi
         (1)    timi
         (2)    cimi
         SR:    cimi

Notice that in addition to the underlying and surface levels, an
intermediate level has been created as the result of sequentially
applying rules 1 and 2.  The application of rule 1 produces the
intermediate form `timi', which then serves as the input to rule 2.

Not only are these rules sequential, they are ordered, such that rule 1
must apply before rule 2.  Rule 1 has a feeding relationship to rule 2;
that is, rule 1 increases the number of forms that can undergo rule 2
by creating more instances of `i'.  Consider what would happen if they
were applied in the reverse order.  Given the input form `temi', rule 2
would do nothing, since its environment is not satisfied.  Rule 1 would
then apply to produce the incorrect surface form `timi'.
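
The effect of rule ordering can be sketched in a few lines of Python.
This is only an illustration of sequential rewriting, not part of
PC-Kimmo; the consonant class and the function names are assumptions
made for this toy data.

```python
import re

# Assumed consonant class covering only the toy data in this section.
CONSONANTS = "ptcm"

def vowel_raising(form: str) -> str:
    # Rule 1: e --> i / ___ C_0 i
    # Rewrite e as i when zero or more consonants and then i follow.
    return re.sub(rf"e(?=[{CONSONANTS}]*i)", "i", form)

def palatalization(form: str) -> str:
    # Rule 2: t --> c / ___ i
    return re.sub(r"t(?=i)", "c", form)

def derive(underlying, rules):
    """Apply the rules one after another, keeping every intermediate form."""
    steps = [underlying]
    for rule in rules:
        steps.append(rule(steps[-1]))
    return steps

print(derive("temi", [vowel_raising, palatalization]))  # ['temi', 'timi', 'cimi']
print(derive("temi", [palatalization, vowel_raising]))  # ['temi', 'temi', 'timi']
```

Each rule sees only the output of the previous rule, which is why the
reversed order stops at the incorrect form `timi'.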

Two-level rules differ from generative rules in the following ways.
First, whereas generative rules apply in a sequential order, two-level
rules apply simultaneously, which is better described as applying in
parallel.  Applying rules in parallel to an input form means that for
each segment in the form all of the rules must apply successfully, even
if only vacuously.

Second, whereas sequentially applied generative rules create
intermediate levels of derivation, simultaneously applied two-level
rules require only two levels of representation: the underlying or
lexical level and the surface level.  There are no intermediate levels
of derivation.  It is in this sense that the model is called two-level.

Third, whereas generative rules relate the underlying and surface
levels by rewriting underlying symbols as surface symbols, two-level
rules express the relationship between the underlying and surface
levels by positing direct, static correspondences between pairs of
underlying and surface symbols.  For instance, instead of rewriting
underlying `a' as surface `b', a two-level rule states that an
underlying `a' corresponds to a surface `b'.  The two-level rule does
not change `a' into `b', so `a' is available to other rules.  In other
words, after a two-level rule applies, both the underlying and surface
symbols still "exist."

Fourth, whereas generative rules have access only to the current
intermediate form at each stage of the derivation, two-level rules have
access to both underlying and surface environments.  Generative rules
cannot "look back" at underlying environments or "look ahead" to
surface environments.  In contrast, the environments of two-level rules
are stated as lexical-to-surface correspondences.  This means that a
two-level rule can easily refer to an underlying `a' that corresponds
to a surface `b', or to a surface `b' that corresponds to an underlying
`a'.  In generative phonology, the interaction between a pair of rules
is controlled by requiring that they apply in a certain sequential
order.  In two-level phonology, rule interactions are controlled not by
ordering the rules but by carefully specifying their environments as
strings of two-level correspondences.

Fifth, whereas generative rewriting rules are unidirectional (that is,
they operate only in an underlying to surface direction), two-level
rules are bidirectional.  Two-level rules can operate either in an
underlying to surface direction (generation mode) or in a surface to
underlying direction (recognition mode).  Thus in generation mode
two-level rules accept an underlying form as input and return a surface
form, while in recognition mode they accept a surface form as input and
return an underlying form.  The practical application of bidirectional
phonological rules is obvious: a computational implementation of
bidirectional rules is not limited to generation mode to produce words;
it can also be used in recognition mode to parse words.

---------- Footnotes ----------

(1) This made-up example is used for expository purposes. To make better
phonological sense, the forms should have internal morpheme boundaries,
for instance `te+mi' (otherwise there would be no basis for positing an
underlying `e'). See the section below on the use of zero to see how
morpheme boundaries are handled.

How a two-level description works
=================================

To understand how a two-level phonological description works, we will
use the example given above involving Raising and Palatalization.  The
two-level model treats the relationship between the underlying form
`temi' and the surface form `cimi' as a direct, symbol-to-symbol
correspondence:

         UR:    t e m i
         SR:    c i m i

Each pair of lexical and surface symbols is a correspondence pair.  We
refer to a correspondence pair with the notation
`<underlying symbol>:<surface symbol>', for instance `e:i' and `m:m'.
There must be an exact one-to-one correspondence between the symbols of
the underlying form and the symbols of the surface form.  Deletion and
insertion of symbols (explained in detail in the next section) are
handled by positing correspondences with zero, a null segment.  The
two-level model uses a notation for expressing two-level rules that is
similar to the notation linguists use for phonological rules.
Corresponding to the generative rule for Palatalization (rule 2 above),
here is the two-level rule for the `t:c' correspondence:

     (3)    Palatalization
            t:c <=> ___ @:i

This rule is a statement about the distribution of the pair `t:c' on
the left side of the arrow with respect to the context or environment
on the right side of the arrow.  A two-level rule has three parts: the
correspondence, the operator, and the environment.  The correspondence
part of rule 3 is the pair `t:c', which is the correspondence that the
rule sanctions.  The operator part of rule 3 is the double-headed
arrow.  It indicates the nature of the logical relationship between the
correspondence and the environment (thus it means something very
different from the rewriting arrow `-->' of generative phonology).  The
`<=>' arrow is equivalent to the biconditional operator of formal logic
and means that the correspondence occurs always and only in the stated
context; that is, `t:c' is allowed if and only if it is found in the
context `___ @:i'.  In short, rule 3 is an obligatory rule.  The
environment part of rule 3 is everything to the right of the arrow.
The long underline indicates the gap where the pair `t:c' occurs.
Notice that even the environment part of the rule is specified as
two-level correspondence pairs.

The environment part of rule 3 requires further explanation.  Instead
of using a correspondence such as `i:i', it uses the correspondence
`@:i'.  The `@' symbol is a special "wildcard" symbol that stands for
any phonological segment included in the description.  In the context
of rule 3, the correspondence `@:i' stands for all the feasible pairs
in the description whose surface segment is `i', in this case `e:i' and
`i:i'.  Thus by using the correspondence `@:i', we allow Palatalization
to apply in the environment of either a lexical `e' or lexical `i'.  In
other words, we are claiming that Palatalization is sensitive to a
surface (phonetic) environment rather than an underlying (phonemic)
environment.  Thus rule 3 will apply to both underlying forms `timi'
and `temi' to produce a surface form with an initial `c'.
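
As a rough illustration of how the wildcard can be read, the Python
sketch below expands a pattern such as `@:i' against a list of feasible
pairs.  The list is the one given for this example later in the chapter;
the function name `expand' is hypothetical and is not a PC-Kimmo
facility.

```python
# Feasible pairs for the toy description, written as (lexical, surface).
FEASIBLE_PAIRS = [("t", "t"), ("m", "m"), ("e", "e"), ("i", "i"),
                  ("t", "c"), ("e", "i")]

def expand(lexical, surface, pairs=FEASIBLE_PAIRS):
    """Return the feasible pairs matched by a pattern; '@' matches any segment."""
    return [(l, s) for (l, s) in pairs
            if lexical in ("@", l) and surface in ("@", s)]

print(expand("@", "i"))  # [('i', 'i'), ('e', 'i')]
```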

Corresponding to the generative rule for Raising (rule 1 above) is the
following two-level rule for the `e:i' correspondence:

     (4)    Vowel Raising
            e:i <=> ___ C:C* @:i

(The asterisk in `C:C*' indicates zero or more instances of the
correspondence `C:C'.)  Similar to rule 3 above, rule 4 uses the
correspondence `@:i' in its environment.  Thus rule 4 states that the
correspondence `e:i' occurs preceding a surface `i', regardless of
whether it is derived from a lexical `e' or `i'.  Why is this
necessary? Consider the case of an underlying form such as `pememi'.
In order to derive the surface form `pimimi', Raising must apply twice:
once before a lexical `i' and again before a lexical `e', both of which
correspond to a surface `i'.  Thus rule 4 will apply to both instances
of lexical `e', capturing the regressive spreading of Raising through
the word.

Applied in parallel, rules 3 and 4 work in concert to produce the right
output.  For example,

         UR:     t    e    m    i
                 |    |    |    |
         Rules   3    4    |    |
                 |    |    |    |
         SR:     c    i    m    i

Conceptually, a two-level phonological description of a data set such
as this can be understood as follows.  First, the two-level description
declares an alphabet of all the phonological segments used in the data
in both underlying and surface forms, in the case of our example, `t',
`m', `c', `e', and `i'.  Second, the description declares a set of
feasible pairs, which is the complete set of all underlying-to-surface
correspondences of segments that occur in the data.  The set of
feasible pairs for these data is the union of the set of default
correspondences, whose underlying and surface segments are identical
(namely `t:t', `m:m', `e:e', and `i:i') and the set of special
correspondences, whose underlying and surface segments are different
(namely `t:c' and `e:i').  Notice that since the segment `c' only
occurs as a surface segment in the feasible pairs, the description will
disallow any underlying form that contains a `c'.

A minimal two-level description, then, consists of nothing more than
this declaration of the feasible pairs.  Since it contains all possible
underlying-to-surface correspondences, such a description will produce
the correct output form, but because it does not constrain the
environments where the special correspondences can occur, it will also
allow many incorrect output forms.  For example, given the underlying
form `temi', it will produce the surface forms `temi', `timi', `cemi',
and `cimi', of which only the last is correct.
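
The overgeneration of a rule-less description is easy to reproduce.  The
following Python sketch (an illustration only; the function name is an
assumption) enumerates every surface form that the feasible pairs alone
allow for a given underlying form.

```python
from itertools import product

# Feasible pairs for the toy description, written as (lexical, surface).
FEASIBLE_PAIRS = [("t", "t"), ("m", "m"), ("e", "e"), ("i", "i"),
                  ("t", "c"), ("e", "i")]

def surface_candidates(underlying):
    """List every surface form reachable through feasible pairs alone."""
    options = [[s for (l, s) in FEASIBLE_PAIRS if l == ch] for ch in underlying]
    return ["".join(choice) for choice in product(*options)]

print(surface_candidates("temi"))  # ['temi', 'timi', 'cemi', 'cimi']
print(surface_candidates("cimi"))  # [] since no pair has lexical c
```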

Third, in order to restrict the output to only correct forms, we
include rules in the description that specify where the special
correspondences are allowed to occur.  Thus the rules function as
constraints or filters, blocking incorrect forms while allowing correct
forms to pass through.  For instance, rule 3 (Palatalization) states
that a lexical `t' must be realized as a surface `c' when it precedes
`@:i'; thus, given the underlying form `temi' it will block the
potential surface output forms `timi' (because the surface sequence
`ti' is prohibited) and `cemi' (because surface `c' is prohibited
before anything except surface `i').  Rule 4 (Raising) states that a
lexical `e' must be realized as a surface `i' when it precedes the
sequence `C:C' `@:i'; thus, given the underlying form `temi' it will
block the potential surface output forms `temi' and `cemi' (because the
surface sequence `emi' is prohibited).  Therefore of the four potential
surface forms, three are filtered out; rules 3 and 4 leave only the
correct form `cimi'.
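
The filtering view of rules 3 and 4 can be sketched in Python as
follows.  This is a brute-force illustration under stated assumptions,
not the finite state implementation that PC-Kimmo uses: the consonant
set, the reading of `C:C' as any pair of consonants, and all function
names are assumptions.  The sketch also has no lexicon, so recognition
returns every underlying form that the rules permit.

```python
from itertools import product

FEASIBLE_PAIRS = [("t", "t"), ("m", "m"), ("e", "e"), ("i", "i"),
                  ("t", "c"), ("e", "i")]
CONSONANTS = set("tmc")  # assumed consonant subset for the toy data

def is_cc(pair):
    return pair[0] in CONSONANTS and pair[1] in CONSONANTS

def before_surface_i(pairs, k, skip_consonants):
    """True if position k is followed by (optionally C:C*, then) @:i."""
    j = k + 1
    if skip_consonants:
        while j < len(pairs) and is_cc(pairs[j]):
            j += 1
    return j < len(pairs) and pairs[j][1] == "i"

def satisfies(pairs, lexical, surface, skip_consonants):
    """Check `lexical:surface <=> ___ (C:C*) @:i' at every position."""
    for k, (l, s) in enumerate(pairs):
        if l != lexical:
            continue
        in_context = before_surface_i(pairs, k, skip_consonants)
        # <=> means: the special pair occurs always and only in context.
        if (s == surface) != in_context:
            return False
    return True

def accepted(pairs):
    return (satisfies(pairs, "t", "c", skip_consonants=False)      # rule 3
            and satisfies(pairs, "e", "i", skip_consonants=True))  # rule 4

def alignments(form, side):
    """All pair sequences whose side (0 lexical, 1 surface) spells form."""
    options = [[p for p in FEASIBLE_PAIRS if p[side] == ch] for ch in form]
    return product(*options)

def generate(underlying):
    return ["".join(s for _, s in a)
            for a in alignments(underlying, 0) if accepted(a)]

def recognize(surface):
    return ["".join(l for l, _ in a)
            for a in alignments(surface, 1) if accepted(a)]

print(generate("temi"))    # ['cimi']
print(generate("tememi"))  # ['cimimi']
print(recognize("cimi"))   # ['timi', 'temi']
```

Because the same constraints are checked in both directions, the one
`accepted' test serves generation and recognition alike.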

Two-level phonology facilitates a rather different way of thinking
about phonological rules.  We think of generative rules as processes
that change one segment into another.  In contrast, two-level rules do
not perform operations on segments; rather, they state static
constraints on correspondences between underlying and surface forms.
Generative phonology and two-level phonology also differ in how they
characterize relationships between rules.  Rules in generative
phonology are described in terms of their relative order of application
and their effect on the input of other rules (the so-called feeding and
bleeding relations).  Thus the generative rule 1 for Raising precedes
and feeds rule 2 for Palatalization.  In contrast, rules in the
two-level model are categorized according to whether they apply in
lexical versus surface environments.  So we say that the two-level
rules for Raising and Palatalization are sensitive to a surface rather
than underlying environment.

With zero you can do (almost) anything
======================================

Phonological processes that delete or insert segments pose a special
challenge to two-level phonology.  Since an underlying form and its
surface form must correspond segment for segment, how can segments be
deleted from an underlying form or inserted into a surface form?  The
answer lies in the use of the special null symbol `0' (zero).  Thus the
correspondence `x:0' represents the deletion of `x', while `0:x'
represents the insertion of `x'.  (It should be understood that these
zeros are provided by the rule application mechanism and exist only
internally; that is, zeros are not included in input forms nor are they
printed in output forms.)  As an example of deletion, consider these
forms from Tagalog (where `+' represents a morpheme boundary):

         UR:    m a n + b i l i
         SR:    m a m 0 0 i l i

Using process terminology, these forms exemplify phonological
coalescence, whereby the sequence `nb' becomes `m'.  Since in the
two-level model a sequence of two underlying segments cannot correspond
to a single surface segment, coalescence must be interpreted as
simultaneous assimilation and deletion.  Thus we need two rules: an
assimilation rule for the correspondence `n:m' and a deletion rule for
the correspondence `b:0' (note that the morpheme boundary `+' is
treated as a special symbol that is always deleted).

     (5)    Nasal Assimilation
            n:m <=> ___ +:0 b:@
     
     (6)    Deletion
            b:0 <=> @:m +:0 ___

Notice the interaction between the rules: Nasal Assimilation occurs in
a lexical environment, namely a lexical `b' (which can correspond to
either a surface `b' or `0'), while Deletion occurs in a surface
environment, namely a surface `m' (which could be the realization of
either a lexical `n' or `m').  In this way the two rules interact with
each other to produce the correct output.
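
The interaction of rules 5 and 6 can be checked with a small Python
sketch.  The list of feasible pairs below is an assumption (the text
does not list them for the Tagalog data), as are the function names;
the sketch simply tries every pairing and keeps those that satisfy both
rules.

```python
from itertools import product

# Assumed feasible pairs for the Tagalog example, as (lexical, surface).
FEASIBLE_PAIRS = [("m", "m"), ("a", "a"), ("n", "n"), ("n", "m"),
                  ("+", "0"), ("b", "b"), ("b", "0"), ("i", "i"), ("l", "l")]

def nasal_assimilation_ok(pairs):
    # Rule 5: n:m <=> ___ +:0 b:@   (lexical environment)
    for k, (l, s) in enumerate(pairs):
        if l != "n":
            continue
        in_context = (k + 2 < len(pairs)
                      and pairs[k + 1] == ("+", "0")
                      and pairs[k + 2][0] == "b")
        if (s == "m") != in_context:
            return False
    return True

def deletion_ok(pairs):
    # Rule 6: b:0 <=> @:m +:0 ___   (surface environment)
    for k, (l, s) in enumerate(pairs):
        if l != "b":
            continue
        in_context = (k >= 2
                      and pairs[k - 2][1] == "m"
                      and pairs[k - 1] == ("+", "0"))
        if (s == "0") != in_context:
            return False
    return True

def generate(underlying):
    options = [[p for p in FEASIBLE_PAIRS if p[0] == ch] for ch in underlying]
    results = []
    for pairs in product(*options):
        if nasal_assimilation_ok(pairs) and deletion_ok(pairs):
            surface = "".join(s for _, s in pairs)
            results.append(surface.replace("0", ""))  # zeros are never printed
    return results

print(generate("man+bili"))  # ['mamili']
```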

Insertion correspondences, where the lexical segment is `0', enable one
to write rules for processes such as stress insertion, gemination,
infixation, and reduplication.  For example, Tagalog has a verbalizing
infix `um' that attaches between the first consonant and vowel of a
stem; thus the infixed form of `bili' is `bumili'.  To account for this
formation with two-level rules, we represent the underlying form of the
infix `um' as the prefix `X+', where `X' is a special symbol that has
no phonological purpose other than standing for the infix.  We then
write a rule that inserts the sequence `um' in the presence of `X+',
which is deleted.  Here is the two-level correspondence:

         UR:    X + b 0 0 i l i
         SR:    0 0 b u m i l i

and here is the two-level rule, which simultaneously deletes `X' and
inserts `um':

     (7)    Infixation
            X:0 <=> ___ +:0 C:C 0:u 0:m V:V
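
Since zeros exist only inside the alignment, the forms a user actually
types or sees are obtained by dropping them.  The Python sketch below
shows this for the infixation example; the helper name `strip_zeros' is
hypothetical.

```python
def strip_zeros(symbols):
    """Join the symbols of one level, leaving out the null symbol 0."""
    return "".join(ch for ch in symbols if ch != "0")

# The two-level correspondence for the infixed form, one pair per column.
alignment = list(zip("X+b00ili", "00bumili"))

lexical = strip_zeros(l for l, _ in alignment)
surface = strip_zeros(s for _, s in alignment)
print(lexical)  # X+bili
print(surface)  # bumili
```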

These examples involving deletion and insertion show that the invention
of zero is just as important for phonology as it was for arithmetic.
Without zero, two-level phonology would be limited to the most trivial
phonological processes; with zero, the two-level model has the
expressive power to handle complex phonological or morphological
phenomena (though not necessarily with the degree of felicity that a
linguist might desire).

Running PC-Kimmo
****************

PC-Kimmo is an interactive program.  It has a few command line options,
but it is controlled primarily by commands typed at the keyboard (or
loaded from a file previously prepared).

PC-Kimmo Command Line Options
=============================

The PC-Kimmo program uses an old-fashioned command line interface
following the convention of options starting with a dash character
(`-').  The available options are listed below in alphabetical order.
Those options which require an argument have the argument type
following the option letter.

`-g filename'
     loads the grammar from a PC-Kimmo grammar file.

`-l filename'
     loads an analysis lexicon from a PC-Kimmo lexicon file.

`-r filename'
     loads the two-level rules from a PC-Kimmo rules file.

`-s filename'
     loads a synthesis lexicon from a PC-Kimmo lexicon file.

`-t filename'
     opens a file containing one or more PC-Kimmo commands.  See
     `Interactive Commands' below.
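
As an illustration of how these options combine, the Python sketch
below assembles an argument list for launching the program from a
script.  The executable name `pckimmo' and the file names are
assumptions; only the option letters come from the list above.

```python
def build_argv(program="pckimmo", rules=None, lexicon=None,
               synthesis_lexicon=None, grammar=None, take=None):
    """Build a command line from the documented options (names assumed)."""
    argv = [program]
    for flag, value in (("-r", rules), ("-l", lexicon),
                        ("-s", synthesis_lexicon), ("-g", grammar),
                        ("-t", take)):
        if value is not None:
            argv += [flag, value]
    return argv

print(build_argv(rules="english.rul", lexicon="english.lex"))
# ['pckimmo', '-r', 'english.rul', '-l', 'english.lex']
```

The resulting list could be handed to a routine such as Python's
`subprocess.run' on a system where the program is installed.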

The following options exist only in beta-test versions of the program,
since they are used only for debugging.

`-/'
     increments the debugging level.  The default is zero (no debugging
     output).

`-z filename'
     opens a file for recording a memory allocation log.

`-Z address,count'
     traps the program at the point where `address' is allocated or
     freed for the `count''th time.

Interactive Commands
====================

Each of the commands available in PC-Kimmo is described below.  Each
command consists of one or more keywords followed by zero or more
arguments.  Keywords may be abbreviated to the minimum length necessary
to prevent ambiguity.
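
The abbreviation rule amounts to unique-prefix matching.  The Python
sketch below illustrates the idea; it is not PC-Kimmo's own code, and
the keyword list contains only the top-level keywords named in this
chapter, so results for other keywords may differ from the real program.

```python
# Partial keyword list (an assumption): only keywords named in this chapter.
KEYWORDS = ["cd", "clear", "close", "compare", "directory", "edit", "exit",
            "file", "generate", "load", "log", "quit", "recognize",
            "synthesize"]

def resolve(abbrev, keywords=KEYWORDS):
    """Return the keyword that abbrev names, or None if unknown or ambiguous."""
    if abbrev in keywords:
        return abbrev
    matches = [k for k in keywords if k.startswith(abbrev)]
    return matches[0] if len(matches) == 1 else None

def minimal_abbreviation(keyword, keywords=KEYWORDS):
    """Return the shortest prefix of keyword that resolves to it."""
    for n in range(1, len(keyword) + 1):
        if resolve(keyword[:n], keywords) == keyword:
            return keyword[:n]
    return keyword

print(minimal_abbreviation("clear"))    # cle
print(minimal_abbreviation("close"))    # clo
print(minimal_abbreviation("compare"))  # co
```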

cd
--

`cd' DIRECTORY changes the current directory to the one specified.
Spaces in the directory pathname are not permitted.

For MS-DOS or Windows, you can give a full path starting with the disk
letter and a colon (for example, `a:'); a path starting with `\' which
indicates a directory at the top level of the current disk; a path
starting with `..' which indicates the directory above the current one;
and so on.  Directories are separated by the `\' character.  (The
forward slash `/' works just as well as the backslash `\' for MS-DOS or
Windows.)

For the Macintosh, you can give a full path starting with the name of a
hard disk, a path starting with `:' which means the current folder, or
one starting `::' which means the folder containing the current one
(and so on).

For Unix, you can give a full path starting with a `/' (for example,
`/usr/pckimmo'); a path starting with `..' which indicates the
directory above the current one; and so on.  Directories are separated
by the `/' character.

clear
-----

`clear' erases all existing rules, lexicon, and grammar information,
allowing the user to prepare to load information for a new language.
Strictly speaking, it is not needed since the `load rules' command
erases any previously existing rules, the `load lexicon' command erases
any previously existing analysis lexicon, the `load synthesis-lexicon'
command erases any previously existing synthesis lexicon, and the
`load grammar' command erases any previously existing grammar.

`cle' is the minimal abbreviation for `clear'.

close
-----

`close' closes the current log file opened by a previous `log' command.

`clo' is the minimal abbreviation for `close'.

compare
-------

The `compare' commands all test the current language description files
by processing data against known (precomputed) results.

`co' is the minimal abbreviation for `compare'.  `file compare' is a
synonym for `compare'.

compare generate
................

`compare generate' <FILE> reads lexical and surface forms from the
specified file.  After reading a lexical form, PC-Kimmo generates the
corresponding surface form(s) and compares the result to the surface
form(s) read from the file.  If `VERBOSE' is `ON', then each form from
the file is echoed on the screen with a message indicating whether or
not the surface forms generated by PC-Kimmo and read from the file are
in agreement.  If `VERBOSE' is `OFF', then only the disagreements in
surface form are displayed fully.  Each result which agrees is
indicated by a single dot written to the screen.
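
The reporting behavior described above can be pictured with the
following Python sketch.  It is an illustration only: the message
wording, the function name, and the use of an in-memory table instead of
a `.gen' file are all assumptions.

```python
def compare_report(expected, generate, verbose):
    """Compare generated surface forms with the expected ones.

    expected maps each lexical form to the surface forms read from file;
    generate is any function that returns surface forms for a lexical form.
    """
    out = []
    for lexical, surfaces in expected.items():
        produced = generate(lexical)
        agree = sorted(produced) == sorted(surfaces)
        if verbose:
            # VERBOSE ON: echo every form with an agreement message.
            status = "agrees" if agree else "DISAGREES"
            out.append(f"{lexical}: {status}\n")
        elif agree:
            # VERBOSE OFF: one dot per agreeing result.
            out.append(".")
        else:
            # VERBOSE OFF: show disagreements fully.
            out.append(f"\n{lexical}: expected {surfaces}, got {produced}\n")
    return "".join(out)

# Hypothetical data: the second entry deliberately disagrees.
expected = {"temi": ["cimi"], "mimi": ["mimi"]}
stub_generator = {"temi": ["cimi"], "mimi": ["meme"]}.get
print(compare_report(expected, stub_generator, verbose=False))
```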

The default filetype extension for `compare generate' is `.gen', and
the default filename is `data.gen'.

`co g' is the minimal abbreviation for `compare generate'.
`file compare generate' is a synonym for `compare generate'.

compare pairs
.............

`compare pairs' <FILE> reads pairs of surface and lexical forms from
the specified file.  After reading a lexical form, PC-Kimmo produces
any corresponding surface form(s) and compares the result(s) to the
surface form read from the file.  For each surface form, PC-Kimmo also
produces any corresponding lexical form(s) and compares the result to
the lexical form read from the file.  If `VERBOSE' is `ON', then each
form from the file is echoed on the screen with a message indicating
whether or not the forms produced by PC-Kimmo and read from the file
are in agreement.  If `VERBOSE' is `OFF', then each result which agrees
is indicated by a single dot written to the screen, and only
disagreements in lexical forms are displayed fully.

The default filetype extension for `compare pairs' is `.pai', and the
default filename is `data.pai'.

`co p' is the minimal abbreviation for `compare pairs'.
`file compare pairs' is a synonym for `compare pairs'.

compare recognize
.................

`compare recognize' <FILE> reads surface and lexical forms from the
specified file.  After reading a surface form, PC-Kimmo produces any
corresponding lexical form(s) and compares the result(s) to the lexical
form(s) read from the file.  If `VERBOSE' is `ON', then each form from
the file is echoed on the screen with a message indicating whether or
not the lexical forms produced by PC-Kimmo and read from the file are
in agreement.  If `VERBOSE' is `OFF', then each result which agrees is
indicated by a single dot written to the screen, and only disagreements
in lexical forms are displayed fully.

The default filetype extension for `compare recognize' is `.rec', and
the default filename is `data.rec'.

`co r' is the minimal abbreviation for `compare recognize'.
`file compare recognize' is a synonym for `compare recognize'.

compare synthesize
..................

`compare synthesize' <FILE> reads morphological and surface forms from
the specified file.  After reading a morphological form, PC-Kimmo
produces any corresponding surface form(s) and compares the result(s)
to the surface form(s) read from the file.  If `VERBOSE' is `ON', then
each form from the file is echoed on the screen with a message
indicating whether or not the surface forms produced by PC-Kimmo and
read from the file are in agreement.  If `VERBOSE' is `OFF', then each
result which agrees is indicated by a single dot written to the screen,
and only disagreements in surface forms are displayed fully.

The default filetype extension for `compare synthesize' is `.syn', and
the default filename is `data.syn'.

`co s' is the minimal abbreviation for `compare synthesize'.
`file compare synthesize' is a synonym for `compare synthesize'.

directory
---------

`directory' lists the contents of the current directory.  This command
is available only for the MS-DOS and Unix implementations.  It does not
exist for the Microsoft Windows or Macintosh implementations.

edit
----

`edit' FILENAME attempts to edit the specified file using the program
indicated by the environment variable `EDITOR'.  If this environment
variable is not defined, then `edit' is used to edit the file on
MS-DOS, and `emacs' is used to edit the file on Unix.  This command is
not available for the Microsoft Windows or Macintosh implementations.
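
The choice of editor described above follows a simple fallback order,
sketched here in Python.  The function name and the platform labels are
assumptions made for the illustration.

```python
import os

def choose_editor(platform, environ=os.environ):
    """Pick the editor: EDITOR if defined, else the platform default."""
    if "EDITOR" in environ:
        return environ["EDITOR"]
    # Defaults named in the text; other platforms have no `edit' command.
    return {"msdos": "edit", "unix": "emacs"}.get(platform)

print(choose_editor("unix", {"EDITOR": "vi"}))  # vi
print(choose_editor("unix", {}))                # emacs
print(choose_editor("msdos", {}))               # edit
```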

exit
----

`exit' stops PC-Kimmo, returning control to the operating system.  This
is the same as `quit'.

file
----

The `file' commands process data from a file, optionally writing the
results to another file.  Each of these commands is described below.

file compare
............

The `file compare' commands all test the current language description
files by processing data against known (precomputed) results.

`f c' is the minimal abbreviation for `file compare'.  `file compare'
is a synonym for `compare'.  See `compare generate', `compare pairs',
`compare recognize', and `compare synthesize' above.

file generate
.............

`file generate' <INFILE> [<OUTFILE>] reads lexical forms from the
specified input file and writes the corresponding computed surface
forms either to the screen or to an optionally specified output file.

This command behaves the same as `generate' except that input comes
from a file rather than the keyboard, and output may go to a file
rather than the screen.  See `generate' below.

`f g' is the minimal abbreviation for `file generate'.

file recognize
..............

`file recognize' <INFILE> [<OUTFILE>] reads surface forms from the
specified input file and writes the corresponding computed
morphological and lexical forms either to the screen or to an
optionally specified output file.

This command behaves the same as `recognize' except that input comes
from a file rather than the keyboard, and output may go to a file
rather than the screen.  See `recognize' below.

`f r' is the minimal abbreviation for `file recognize'.

file synthesize
...............

`file synthesize' <INFILE> [<OUTFILE>] reads morphological forms from
the specified input file and writes the corresponding computed surface
forms either to the screen or to an optionally specified output file.

This command behaves the same as `synthesize' except that input comes
from a file rather than the keyboard, and output may go to a file
rather than the screen.  See `synthesize' below.

`f s' is the minimal abbreviation for `file synthesize'.

generate
--------

`generate' [<LEXICAL-FORM>] attempts to produce a surface form from a
lexical form provided by the user.  If a lexical form is typed on the
same line as the command, then that lexical form is used to generate a
surface form.  If the command is typed without a form, then PC-Kimmo
prompts the user for lexical forms with a special generator prompt, and
processes each form in turn.  This cycle of typing and generating is
terminated by typing an empty "form" (that is, nothing but the `Enter'
or `Return' key).

The rules must be loaded before using this command.  It does not
require either a lexicon or a grammar.

`g' is the minimal abbreviation for `generate'.

help
----

`help' COMMAND displays a description of the specified command.  If
`help' is typed by itself, PC-Kimmo displays a list of commands with
short descriptions of each command.

`h' is the minimal abbreviation for `help'.

list
----

The `list' commands all display information about the currently loaded
data.  Each of these commands is described below.

`li' is the minimal abbreviation for `list'.

list lexicon
............

`list lexicon' displays the names of all the (sub)lexicons currently
loaded.  The order of presentation is the order in which they are
referenced in the `ALTERNATION' declarations.

`li l' is the minimal abbreviation for `list lexicon'.

list pairs
..........

`list pairs' displays all the feasible pairs for the current set of
active rules.  The feasible pairs are displayed as pairs of lines, with
the lexical characters shown above the corresponding surface characters.

`li p' is the minimal abbreviation for `list pairs'.

list rules
..........

`list rules' displays the names of the current rules, preceded by the
number of the rule (used by the `set rules' command) and an indication
of whether the rule is `ON' or `OFF'.

`li r' is the minimal abbreviation for `list rules'.

load
----

The `load' commands all load information stored in specially formatted
files.  Each of the `load' commands is described below.

`l' is the minimal abbreviation for `load'.

load grammar
............

`load grammar' [<FILE>] erases any existing word grammar and reads a
new word grammar from the specified file.

The default filetype extension for `load grammar' is `.grm', and the
default filename is `grammar.grm'.

A grammar file can also be loaded by using the `-g' command line option
when starting PC-Kimmo.

`l g' is the minimal abbreviation for `load grammar'.

load lexicon
............

`load lexicon' [<FILE>] erases any existing analysis lexicon
information and reads a new analysis lexicon from the specified file.
A rules file must be loaded before an analysis lexicon file can be
loaded.

The default filetype extension for `load lexicon' is `.lex', and the
default filename is `lexicon.lex'.

An analysis lexicon file can also be loaded by using the `-l' command
line option when starting PC-Kimmo.  This requires that a `-r' option
also be used to load a rules file.

`l l' is the minimal abbreviation for `load lexicon'.

load rules
..........

`load rules' [<FILE>] erases any existing rules and reads a new set of
two-level rules from the specified file.

The default filetype extension for `load rules' is `.rul', and the
default filename is `rules.rul'.

A rules file can also be loaded by using the `-r' command line option
when starting PC-Kimmo.

`l r' is the minimal abbreviation for `load rules'.

load synthesis-lexicon
......................

`load synthesis-lexicon' [<FILE>] erases any existing synthesis lexicon
and reads a new synthesis lexicon from the specified file.  A rules
file must be loaded before a synthesis lexicon file can be loaded.

The default filetype extension for `load synthesis-lexicon' is `.lex',
and the default filename is `lexicon.lex'.

A synthesis lexicon file can also be loaded by using the `-s' command
line option when starting PC-Kimmo.  This requires that a `-r' option
also be used to load a rules file.

`l s' is the minimal abbreviation for `load synthesis-lexicon'.
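
The defaults described for the `load' commands can be summarized in a
short Python sketch.  It is not part of PC-Kimmo; the helper name
`resolve_load_filename' is hypothetical, and the rule that the default
extension is appended only when the user typed no extension is an
assumption about how a "default filetype extension" behaves.

```python
import os

# Defaults taken from the `load' command descriptions above:
# kind of file -> (default extension, default filename)
LOAD_DEFAULTS = {
    "grammar": (".grm", "grammar.grm"),
    "lexicon": (".lex", "lexicon.lex"),
    "rules": (".rul", "rules.rul"),
    "synthesis-lexicon": (".lex", "lexicon.lex"),
}


def resolve_load_filename(kind, name=None):
    """Hypothetical helper: apply the documented defaults to a file argument."""
    extension, default_name = LOAD_DEFAULTS[kind]
    if not name:
        # No file argument at all: fall back to the default filename.
        return default_name
    _, typed_extension = os.path.splitext(name)
    # Assumption: the default extension is added only when none was typed.
    return name if typed_extension else name + extension
```

Under these assumptions, `load rules english' would read `english.rul',
and `load rules' with no argument would read `rules.rul'.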

log
---

`log' [<FILE>] opens a log file.  Each item processed by a `generate',
`recognize', `synthesize', `compare', or `file' command is recorded in
the log file as well as being displayed on the screen.

If a filename is given on the same line as the `log' command, then that
file is used for the log file.  Any previously existing file with the
same name will be overwritten.  If no filename is provided, then the
file `pckimmo.log' in the current directory is used for the log file.

Use `close' to stop recording in a log file.  If a `log' command is
given when a log file is already open, then the earlier log file is
closed before the new log file is opened.

quit
----

`quit' stops PC-Kimmo, returning control to the operating system.  This
is the same as `exit'.

recognize
---------

`recognize' [<SURFACE-FORM>] attempts to produce lexical and
morphological forms from a surface wordform provided by the user.  If a
wordform is typed on the same line as the command, then that word is
parsed.  If the command is typed without a form, then PC-Kimmo prompts
the user for surface forms with a special recognizer prompt, and
processes each form in turn.  This cycle of typing and parsing is
terminated by typing an empty "word" (that is, nothing but the `Enter'
or `Return' key).

Both the rules and the lexicon must be loaded before using this
command.  A grammar may also be loaded and used to eliminate invalid
parses from the two-level processor results.  If a grammar is used,
then parse trees and feature structures may be displayed as well as the
lexical and morphological forms.

save
----

`save' [FILE.TAK] writes the current settings to the designated file in
the form of PC-Kimmo commands.  If the file is not specified, the
settings are written to `pckimmo.tak' in the current directory.

set
---

The `set' commands control program behavior by setting internal program
variables.  Each of these commands (and variables) is described below.

set ambiguities
...............

`set ambiguities' NUMBER limits the number of analyses printed to the
given number.  The default value is 10.  Note that this does not limit
the number of analyses produced, just the number printed.

set ample-dictionary
....................

`set ample-dictionary' VALUE determines whether or not the AMPLE
dictionary files are divided according to morpheme type.
`set ample-dictionary split' declares that the AMPLE dictionary is
divided into a prefix dictionary file, an infix dictionary file, a
suffix dictionary file, and one or more root dictionary files.  The
existence of the three affix dictionaries depends on settings in the
AMPLE analysis data file.  If they exist, the `load ample dictionary'
command requires that they be given in this relative order: prefix,
infix, suffix, root(s).

`set ample-dictionary unified' declares that any of the AMPLE
dictionary files may contain any type of morpheme.  This implies that
each dictionary entry may contain a field specifying the type of
morpheme (the default is ROOT), and that the dictionary code table
contains a `\unified' field.  One of the changes listed under
`\unified' must convert a backslash code to `T'.

The default is for the AMPLE dictionary to be *split*.(1)

---------- Footnotes ----------

(1) The unified dictionary is a new feature of AMPLE version 3.

set check-cycles
................

`set check-cycles' VALUE enables or disables a check to prevent cycles
in the parse chart.  `set check-cycles on' turns on this check, and
`set check-cycles off' turns it off.  This check slows down the parsing
of a sentence, but it makes the parser less vulnerable to hanging on
perverse grammars.  The default setting is `on'.

set comment
...........

`set comment' CHARACTER sets the comment character to the indicated
value.  If CHARACTER is missing (or equal to the current comment
character), then comment handling is disabled.  The default comment
character is `;' (semicolon).

set failures
............

`set failures' VALUE enables or disables GRAMMAR FAILURE MODE.
`set failures on' turns on grammar failure mode, and `set failures off'
turns it off.  When grammar failure mode is on, the partial results of
forms that fail the grammar module are displayed.  A form may fail the
grammar either by failing the feature constraints or by failing the
constituent structure rules. In the latter case, a partial tree (bush)
will be returned.  The default setting is `off'.

Be careful with this option.  Setting failures to `on' can cause
PC-Kimmo to go into an infinite loop for certain recursive grammars and
certain input sentences.  WE MAY TRY TO DO SOMETHING TO DETECT THIS
TYPE OF BEHAVIOR, AT LEAST PARTIALLY.

set features
............

`set features' VALUE determines how features will be displayed.

`set features all' enables the display of the features for all nodes of
the parse tree.

`set features top' enables the display of the feature structure for
only the top node of the parse tree.  This is the default setting.

`set features flat' causes features to be displayed in a flat, linear
string that uses less space on the screen.

`set features full' causes features to be displayed in an indented form
that makes the embedded structure of the feature set clear.  This is
the default setting.

`set features on' turns on features display mode, allowing features to
be shown.  This is the default setting.

`set features off' turns off features display mode, preventing features
from being shown.

set gloss
.........

`set gloss' VALUE enables the display of glosses in the parse tree
output if VALUE is `on', and disables the display of glosses if VALUE is
`off'.  If any glosses exist in the lexicon file, then `gloss' is
automatically turned `on' when the lexicon is loaded.  If no glosses
exist in the lexicon, then this flag is ignored.

set marker category
...................

`set marker category' MARKER establishes the marker for the field
containing the category (part of speech) feature.  The default is `\c'.

set marker features
...................

`set marker features' MARKER establishes the marker for the field
containing miscellaneous features.  (This field is not needed for many
words.)  The default is `\f'.

set marker gloss
................

`set marker gloss' MARKER establishes the marker for the field
containing the word gloss.  The default is `\g'.

set marker record
.................

`set marker record' MARKER establishes the field marker that begins a
new record in the lexicon file.  This may or may not be the same as the
`word' marker.  The default is `\w'.

set marker word
...............

`set marker word' MARKER establishes the marker for the word field.
The default is `\w'.

set timing
..........

`set timing' VALUE enables timing mode if VALUE is `on', and disables
timing mode if VALUE is `off'.  If timing mode is `on', then the
elapsed time required to process a command is displayed when the
command finishes.  If timing mode is `off', then the elapsed time is
not shown.  The default is `off'.  (This option is useful only to
satisfy idle curiosity.)

set top-down-filter
...................

`set top-down-filter' VALUE enables or disables top-down filtering
based on the categories.  `set top-down-filter on' turns on this
filtering, and `set top-down-filter off' turns it off.  The top-down
filter speeds up the parsing of a sentence, but might cause the parser
to miss some valid parses.  The default setting is `on'.

This should not be required in the final version of PC-Kimmo.

set tree
........

`set tree' VALUE specifies how parse trees should be displayed.

`set tree full' turns on the parse tree display, displaying the result
of the parse as a full tree.  This is the default setting.  A short
sentence would look something like this:
                Sentence
                    |
               Declarative
               _____|_____
              NP        VP
               |      ___|____
               N      V    COMP
             cows    eat     |
                            NP
                             |
                             N
                           grass

`set tree flat' turns on the parse tree display, displaying the result
of the parse as a flat tree structure in the form of a bracketed
string.  The same short sentence would look something like this:
      (Sentence (Declarative (NP
          (N  cows)) (VP (V  eat) (COMP
          (NP (N  grass))))))

`set tree indented' turns on the parse tree display, displaying the
result of the parse in an indented format sometimes called a *northwest
tree*.  The same short sentence would look like this:
         Sentence
             Declarative
                 NP
                     N  cows
                 VP
                     V  eat
                     COMP
                         NP
                             N  grass

`set tree off' disables the display of parse trees altogether.
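
The relationship between the `flat' and `indented' displays can be
illustrated with a short Python sketch.  It is not PC-Kimmo's own code:
the tuple representation of the tree and the function names `flat' and
`indented' are assumptions made for the example, and the spacing simply
imitates the sample output shown above.

```python
# A tree node is a tuple: (label, child, child, ...).
# A leaf is a tuple holding a category label and a word: (label, word).
TREE = (
    "Sentence",
    ("Declarative",
     ("NP", ("N", "cows")),
     ("VP", ("V", "eat"), ("COMP", ("NP", ("N", "grass"))))),
)


def is_leaf(node):
    """A leaf carries exactly one child, and that child is a word."""
    return len(node) == 2 and isinstance(node[1], str)


def flat(node):
    """Render the tree as a single bracketed string."""
    if is_leaf(node):
        return f"({node[0]}  {node[1]})"
    return "(" + node[0] + " " + " ".join(flat(c) for c in node[1:]) + ")"


def indented(node, depth=0):
    """Render the tree as a list of lines, four spaces per level."""
    pad = " " * (4 * depth)
    if is_leaf(node):
        return [f"{pad}{node[0]}  {node[1]}"]
    lines = [pad + node[0]]
    for child in node[1:]:
        lines.extend(indented(child, depth + 1))
    return lines
```

Calling `flat(TREE)' yields the bracketed string shown for
`set tree flat' (on one line rather than wrapped), and
`"\n".join(indented(TREE))' yields the same relative indentation as the
`set tree indented' sample.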

set trim-empty-features
.......................

`set trim-empty-features' VALUE disables the display of empty feature
values if VALUE is `on', and enables the display of empty feature
values if VALUE is `off'.  The default is not to display empty feature
values.

set unification
...............

`set unification' VALUE enables or disables feature unification.
`set unification on' turns on unification mode.  This is the default
setting.

`set unification off' turns off feature unification in the grammar.
Only the context-free phrase structure rules are used to guide the
parse; the feature constraints are ignored.  This can be dangerous, as
it is easy to introduce infinite cycles in recursive phrase structure
rules.

set verbose
...........

`set verbose' VALUE enables or disables the screen display of parse
trees in the `file parse' command.  `set verbose on' enables the screen
display of parse trees, and `set verbose off' disables such display.
The default setting is `off'.

set warnings
............

`set warnings' VALUE enables warning mode if VALUE is `on', and disables
warning mode if VALUE is `off'.  If warning mode is enabled, then
warning messages are displayed on the output. If warning mode is
disabled, then no warning messages are displayed.  The default setting
is `on'.

set write-ample-parses
......................

`set write-ample-parses' VALUE enables writing `\parse' and `\features'
fields at the end of each sentence in the disambiguated analysis file
if VALUE is `on', and disables writing these fields if VALUE is `off'.
The default setting is `off'.

This variable setting affects only the `file disambiguate' command.

show
----

The `show' commands display internal settings on the screen.  Each of
these commands is described below.

show lexicon
............

`show lexicon' prints the contents of the lexicon stored in memory on
the standard output.  THIS IS NOT VERY USEFUL, AND MAY BE REMOVED.

show status
...........

`show status' displays the names of the current grammar, sentences, and
log files, and the values of the switches established by the `set'
command.

`show' (by itself) and `status' are synonyms for `show status'.

status
------

`status' displays the names of the current grammar, sentences, and log
files, and the values of the switches established by the `set' command.

synthesize
----------

`synthesize' [<MORPHOLOGICAL-FORM>] attempts to produce surface forms
from a morphological form provided by the user.  If a morphological
form is typed on the same line as the command, then that form is
synthesized.  If the command is typed without a form, then PC-Kimmo
repeatedly prompts the user for morphological forms with a special
synthesizer prompt, processing each form.  This cycle of typing and
synthesizing is terminated by typing an empty "form" (that is, nothing
but the `Enter' or `Return' key).

Note that the morphemes in the morphological form must be separated by
spaces, and must match gloss entries loaded from the lexicon.  Also, the
morphemes must be given in the proper order.

Both the rules and the synthesis lexicon must be loaded before using
this command.  It does not use a grammar.

system
------

`system' [COMMAND] allows the user to execute an operating system
command (such as checking the available space on a disk) from within
PC-Kimmo.  This is available only for MS-DOS and Unix, not for
Microsoft Windows or the Macintosh.

If no system-level command is given on the line with the `system'
command, then PC-Kimmo is pushed into the background and a new system
command processor (shell) is started.  Control is usually returned to
PC-Kimmo in this case by typing `exit' as the operating system command.

`sys' is the minimal abbreviation for `system'.  `!' (exclamation
point) is a synonym for `system'.  (`!' does not require a space to
separate it from the command.)

take
----

`take' [FILE.TAK] redirects command input to the specified file.

The default filetype extension for `take' is `.tak', and the default
filename is `pckimmo.tak'.

`take' files can be nested three deep.  That is, the user types
`take file1', `file1' contains the command `take file2', and `file2'
has the command `take file3'.  It would be an error for `file3' to
contain a `take' command.  This should not prove to be a serious
limitation.

A `take' file can also be specified by using the `-t' command line
option when starting PC-Kimmo.  When started, PC-Kimmo looks for a
`take' file named `pckimmo.tak' in the current directory and uses it
to initialize itself.
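
The nesting limit on `take' files can be pictured with the following
Python sketch.  It only models the limit described above and is not
PC-Kimmo's implementation: the function `run_take', the error message,
and the dictionary standing in for the file system are all hypothetical.

```python
MAX_TAKE_DEPTH = 3  # `take' files can be nested three deep


def run_take(files, name, depth=1, executed=None):
    """Collect the commands reached from NAME, following nested `take's.

    FILES stands in for the file system: it maps a file name to the
    list of command lines that the file contains.
    """
    if depth > MAX_TAKE_DEPTH:
        raise RuntimeError(f"take files nested too deeply at {name}")
    if executed is None:
        executed = []
    for line in files[name]:
        words = line.split()
        if not words:
            continue  # skip blank lines
        if words[0] == "take":
            # With no argument, fall back to the documented default name.
            target = words[1] if len(words) > 1 else "pckimmo.tak"
            run_take(files, target, depth + 1, executed)
        else:
            executed.append(line)
    return executed
```

In this model, `file1' may take `file2', and `file2' may take `file3',
but a `take' command inside `file3' raises an error.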

The PC-Kimmo Rules File
***********************

The general structure of the rules file is a list of keyword
declarations.  Figure 1 shows the conventional structure of the rules
file.  Note that the notation `{x | y}' means either `x' or `y' (but
not both).

     Figure 1 Structure of the rules file
     
     COMMENT <CHARACTER>
     ALPHABET <SYMBOL LIST>
     NULL <CHARACTER>
     ANY <CHARACTER>
     BOUNDARY <CHARACTER>
     SUBSET <SUBSET NAME> <SYMBOL LIST>
     . (more subsets)
     .
     .
     RULE <RULE NAME> <NUMBER OF STATES> <NUMBER OF COLUMNS>
      <LEXICAL SYMBOL LIST>
      <SURFACE SYMBOL LIST>
     <STATE NUMBER>{: | .} <STATE NUMBER LIST>
       . (more states)
       .
       .
     . (more rules)
     .
     .
     END

The following specifications apply to the rules file.

   * Extra spaces, blank lines, and comment lines are ignored.  In the
     descriptions below, reference to the use of a space character
     implies any whitespace character (that is, any character treated
     like a space character).  The following control characters when
     used in a file are whitespace characters: `^I' (ASCII 9, tab),
     `^J' (ASCII 10, line feed), `^K' (ASCII 11, vertical tab), `^L'
     (ASCII 12, form feed), and `^M' (ASCII 13, carriage return).

   * Comments may be placed anywhere in the file.  All data following a
     comment character to the end of the line is ignored.  (See below
     on the `COMMENT' declaration.)

   * The set of valid keywords used to form declarations includes
     `COMMENT', `ALPHABET', `NULL', `ANY', `BOUNDARY', `SUBSET',
     `RULE', and `END'.

   * These declarations are obligatory and can occur only once in a
     file: `ALPHABET', `NULL', `ANY', `BOUNDARY'.

   * These declarations are optional and can occur any number of times
     in a file: `COMMENT', `SUBSET', and `RULE'.

   * The `COMMENT' declaration sets the comment character used in the
     rules file, lexicon files, and grammar file.  The `COMMENT'
     declaration can only be used in the rules file, not in the lexicon
     or grammar file.  The `COMMENT' declaration is optional.  If it is
     not used, the comment character is set to `;' (semicolon) as a
     default.

   * The `COMMENT' declaration can be used anywhere in the rules file
     and can be used more than once.  That is, different parts of the
     rules file can use different comment characters.  The `COMMENT'
     declaration can (and in practice usually does) occur as the first
     keyword in the rules file, followed by either one or more
     `COMMENT' declarations or the `ALPHABET' declaration.

   * Note that if you use the `COMMENT' declaration to declare the
     character that is already in use as the comment character, an error
     will result.  For instance, if semicolon is the current comment
     character, the declaration `COMMENT ;' will result in an error.

   * The comment character can no longer be set using a command line
     option or with a command in the user interface, as was the case in
     version 1 of PC-Kimmo.

   * The `ALPHABET' declaration must either occur first in the file or
     follow one or more `COMMENT' declarations only.  The other
     declarations can appear in any order.  The `COMMENT', `NULL',
     `ANY', `BOUNDARY', and `SUBSET' declarations can even be
     interspersed among the rules.  However, these declarations must
     appear before any rule that uses them or an error will result.

   * The `ALPHABET' declaration defines the set of symbols used in
     either lexical or surface representations.  The keyword `ALPHABET'
     is followed by a <SYMBOL LIST> of all alphabetic symbols.  Each
     symbol must be separated from the others by at least one space.
     The list can span multiple lines, and ends with the next valid
     keyword.  All alphanumeric characters (such as `a', `B', and `2'),
     symbols (such as `$' and `+'), and punctuation characters (such as
     `.' and `?') are available as alphabet members.  The characters in
     the IBM extended character set (above ASCII 127) are also
     available.  Control characters (below ASCII 32) can also be used,
     with the exception of whitespace characters (see above), `^Z' (end
     of file), and `^@' (null).  The alphabet can contain a maximum of
     255 symbols.  An alphabetic symbol can also be a multigraph, that
     is, a sequence of two or more characters.  The individual
     characters composing a multigraph do not necessarily have to also
     be declared as alphabetic characters.  For example, an alphabet
     could include the characters `s' and `z' and the multigraph `sz%',
     but not include `%' as an alphabetic character.  Note that a
     multigraph cannot also be interpreted as a sequence of the
     individual characters that make it up.

   * The keyword `NULL' is followed by a single <CHARACTER> that
     represents a null (empty, zero) element.  The `NULL' symbol is
     considered to be an alphabetic character, but cannot also be
     listed in the `ALPHABET' declaration.  The `NULL' symbol declared
     in the rules file is also used in the lexicon file to represent a
     null lexical entry.

   * The keyword `ANY' is followed by a single "wildcard" <CHARACTER>
     that represents a match of any character in the alphabet.  The
     `ANY' symbol is not considered to be an alphabetic character,
     though it is used in the column headers of state tables.  It
     cannot be listed in the `ALPHABET' declaration.  It is not used in
     the lexicon file.

   * The keyword `BOUNDARY' is followed by a single <CHARACTER> that
     represents an initial or final word boundary.  The
     `BOUNDARY' symbol is considered to be an alphabetic character, but
     cannot also be listed in the `ALPHABET' declaration.  When used in
     the column header of a state table, it can only appear as the pair
     `#:#' (where, for instance, `#' has been declared as the
     `BOUNDARY' symbol).  The `BOUNDARY' symbol is also used in the
     lexicon file in the continuation class field of a lexical entry to
     indicate the end of a word (that is, no continuation class).

   * The `SUBSET' declaration defines a set of characters that are
     referred to in the column headers of rules.  The keyword `SUBSET'
     is followed by the <SUBSET NAME> and <SYMBOL LIST>. <SUBSET NAME>
     is a single word (one or more characters) that names the list of
     characters that follows it.  The subset name must be unique (that
     is, if it is a single character it cannot also be in the alphabet
     or be any other declared symbol).  It can be composed of any
     characters (except space); that is, it is not limited to the
     characters declared in the `ALPHABET' section.  It must not be
     identical to any keyword used in the rules file.  The subset name
     is used in rules to represent all members of the subset of the
     alphabet that it defines.  Note that `SUBSET' declarations can be
     interspersed among the rules.  This allows subsets to be placed
     near the rule that uses them if such a style is desired.  However,
     a subset must be declared before a rule that uses it.

   * The <SYMBOL LIST> following a <SUBSET NAME> is a list of single
     symbols, each of which is separated by at least one space.  The
     list can span multiple lines.  Each symbol in the list must be a
     member of the previously defined `ALPHABET', with the exception of
     the `NULL' symbol, which can appear in a subset list but is not
     included in the `ALPHABET' declaration.  Neither the `ANY' symbol
     nor the `BOUNDARY' symbol can appear in a subset symbol list.

   * The keyword `RULE' signals that a state table immediately follows.
     Note that two-level rules must be expressed as a state table rather
     than in the form discussed in chapter 2 `The Two-level Formalism'
     above.

   * <RULE NAME> is the name or description of the rule which the state
     table encodes.  It functions as an annotation to the state table
     and has no effect on the computational operation of the table.  It
     is displayed by the `list rules' and `show rule' commands and is
     also displayed in traces.  The rule name must be surrounded by a
     pair of identical delimiter characters.  Any material can be used
     between the delimiters of the rule name with the exception of the
     current comment character and of course the rule name delimiter
     character of the rule itself.  Each rule in the file can use a
     different pair of delimiters.  The rule name must be all on one
     line, but it does not have to be on the same line as the `RULE'
     keyword.

   * <NUMBER OF STATES> is the number of states (rows in the table) that
     will be defined for this table.  The states must begin at 1 and go
     in sequence through the number defined here (that is, gaps in state
     numbers are not allowed).

   * <NUMBER OF COLUMNS> is the number of state transitions (columns in
     the table) that will be defined for each state.

   * <LEXICAL SYMBOL LIST> is a list of elements separated by one or
     more spaces.  Each element represents the lexical half of a
     lexical:surface correspondence which, when matched, defines a state
     transition.  Each element in the list must be either a member of
     the alphabet, a subset name, the `NULL' symbol, the `ANY' symbol,
     or the `BOUNDARY' symbol (in which case the corresponding surface
     character must also be the `BOUNDARY' symbol).  The list can span
     multiple lines, but the number of elements in the list must be
     equal to the number of columns defined for the rule.

   * <SURFACE SYMBOL LIST> is a list of elements separated by one or
     more spaces.  Each element represents the surface half of a
     lexical:surface correspondence which, when matched, defines a state
     transition.  Each element in the list must be either a member of
     the alphabet, a subset name, the `NULL' symbol, the `ANY' symbol,
     or the `BOUNDARY' symbol (in which case the corresponding lexical
     character must also be the `BOUNDARY' symbol).  The list can span
     multiple lines, but the number of elements in the list must be
     equal to the number of columns defined for the rule.

   * <STATE NUMBER> is the number of the state or row of the table.  The
     first state number must be 1, and subsequent state numbers must
     follow in numerical sequence without any gaps.

   * `{: | .}' is the final or nonfinal state indicator.  This should
     be a colon (`:') if the state is a final state and a period (`.')
     if it is a nonfinal state.  It must follow the <STATE NUMBER> with
     no intervening space.

   * <STATE NUMBER LIST> is a list of state transition numbers for a
     particular state.  Each number must be between 0 and the number of
     states (inclusive) declared for the table, where 0 indicates that
     no transition is possible (as in the sample rules file in Figure
     2).  The list can span multiple lines, but the number of elements
     in the list must be equal to the number of columns declared for
     this rule.

   * The keyword `END' follows all other declarations and indicates the
     end of the rules file.  Any material in the file thereafter is
     ignored by PC-Kimmo.  The `END' keyword is optional; the physical
     end of the file also terminates the rules file.
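
The constraints on a `RULE' declaration listed above can be checked
mechanically.  The following Python sketch reads one rule and verifies
the delimited rule name, the header lengths, the state numbering, and
the final/nonfinal indicators.  It is an illustration only, not the
parser used by PC-Kimmo; the function name `parse_rule' and the error
messages are hypothetical.  It accepts 0 as a transition value because
the sample rules file uses 0 for a failing transition.

```python
def parse_rule(text, comment=";"):
    """Parse one RULE declaration and check the constraints listed above."""
    # Comments run from the comment character to the end of the line.
    body = " ".join(line.split(comment, 1)[0] for line in text.splitlines())
    keyword, rest = body.split(None, 1)
    if keyword != "RULE":
        raise ValueError("expected the RULE keyword")
    # The rule name sits between two identical delimiter characters.
    delimiter = rest[0]
    name, rest = rest[1:].split(delimiter, 1)
    tokens = rest.split()
    n_states, n_columns = int(tokens[0]), int(tokens[1])
    lexical = tokens[2 : 2 + n_columns]
    surface = tokens[2 + n_columns : 2 + 2 * n_columns]
    tokens = tokens[2 + 2 * n_columns :]
    # Each state contributes its label plus one entry per column.
    if len(tokens) != n_states * (n_columns + 1):
        raise ValueError("state table does not match the declared size")
    transitions, final = {}, set()
    for state in range(1, n_states + 1):
        label, *cells = tokens[: n_columns + 1]
        tokens = tokens[n_columns + 1 :]
        # The state number is followed directly by ':' (final) or '.'.
        if label not in (f"{state}:", f"{state}."):
            raise ValueError(f"expected state {state}, found {label!r}")
        row = [int(cell) for cell in cells]
        if any(not 0 <= target <= n_states for target in row):
            raise ValueError(f"state {state}: transition out of range")
        transitions[state] = row
        if label.endswith(":"):
            final.add(state)
    return {"name": name, "lexical": lexical, "surface": surface,
            "transitions": transitions, "final": final}
```

A table whose state numbers skip a value, or whose rows do not have one
entry per column, is rejected with a `ValueError' in this sketch.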

Figure 2 shows a sample rules file.

     Figure 2 A sample rules file
     
     ALPHABET
       b c d f g h j k l m n p q r s t v w x y z +    ; + is morpheme boundary
       a e i o u
     NULL 0
     ANY  @
     BOUNDARY #
     SUBSET C b c d f g h j k l m n p q r s t v w x y z
     SUBSET V a e i o u
     ; more subsets
     
     RULE "Consonant defaults"  1 23
        b c d f g h j k l m n p q r s t v w x y z + @
        b c d f g h j k l m n p q r s t v w x y z 0 @
     1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     
     RULE "Vowel defaults"  1 6
        a e i o u @
        a e i o u @
     1: 1 1 1 1 1 1
     
     RULE "Voicing s:z <=> V___V" 4 4
        V s s @
        V z @ @
     1: 2 0 1 1
     2: 2 4 3 1
     3: 0 0 1 1
     4. 2 0 0 0
     
     ; more rules
     
     END
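
To see how a state table is applied, the "Voicing" rule of Figure 2 can
be stepped through by hand or with a short Python sketch such as the
one below.  This is a simplified illustration rather than PC-Kimmo's
algorithm: it ignores feasible pairs, and it selects the first column
(reading left to right) whose header matches the current
lexical:surface pair, which is an assumption that happens to suit this
table because its columns run from most to least specific.  The names
`matches' and `accepts' are hypothetical.

```python
SUBSETS = {"V": set("aeiou")}   # SUBSET V from Figure 2
ANY = "@"                       # the ANY symbol from Figure 2

# Column headers and rows of the "Voicing s:z <=> V___V" table.
LEXICAL = ["V", "s", "s", "@"]
SURFACE = ["V", "z", "@", "@"]
TABLE = {
    1: [2, 0, 1, 1],
    2: [2, 4, 3, 1],
    3: [0, 0, 1, 1],
    4: [2, 0, 0, 0],
}
FINAL = {1, 2, 3}               # states marked with ':' in the table


def matches(header, symbol):
    """Does a column header element match one character?"""
    if header == ANY:
        return True
    if header in SUBSETS:
        return symbol in SUBSETS[header]
    return header == symbol


def accepts(pairs):
    """Run the table over (lexical, surface) pairs, starting in state 1."""
    state = 1
    for lexical, surface in pairs:
        for column, (lex_head, surf_head) in enumerate(zip(LEXICAL, SURFACE)):
            if matches(lex_head, lexical) and matches(surf_head, surface):
                state = TABLE[state][column]
                break
        if state == 0:
            return False        # 0 means the rule has failed
    return state in FINAL       # must not stop in a nonfinal state
```

With this table, pairing lexical `kasa' with surface `kaza' is
accepted, because the `s' between vowels is realized as `z'.  Pairing
`kasa' with `kasa' is rejected, as is pairing `as' with `az', where the
`z' is not followed by a vowel and the table ends in nonfinal state 4.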

The PC-Kimmo Lexicon Files
**************************

A lexicon consists of one main lexicon file plus one or more files of
lexical entries.  The general structure of the main lexicon file is a
list of keyword declarations.  The set of valid keywords is
`ALTERNATION', `FEATURES', `FIELDCODE', `INCLUDE', and `END'.  Figure
3 shows the conventional structure of the main lexicon file.

     Figure 3 Structure of the main lexicon file
     
     ALTERNATION <ALTERNATION NAME> <SUBLEXICON NAME LIST>
     . (more ALTERNATIONs)
     .
     .
     FEATURES <FEATURE ABBREVIATION LIST>
     
     FIELDCODE <LEXICAL ITEM CODE> U
     FIELDCODE <SUBLEXICON CODE>  L
     FIELDCODE <ALTERNATION CODE>  A
     FIELDCODE <FEATURES CODE>  F
     FIELDCODE <GLOSS CODE>  G
     
     INCLUDE <FILESPEC>
     . (more INCLUDEd files)
     .
     .
     END

The following specifications apply to the main lexicon file.

   * Extra spaces, blank lines, and comment lines are ignored.  In the
     descriptions below, reference to the use of a space character
     implies any whitespace character (that is, any character treated
     like a space character).  The following control characters when
     used in a file are whitespace characters: `^I' (ASCII 9, tab),
     `^J' (ASCII 10, line feed), `^K' (ASCII 11, vertical tab), `^L'
     (ASCII 12, form feed), and `^M' (ASCII 13, carriage return).

   * The comment character declared in the rules file is operative in
     the main lexicon file.  Comments may be placed anywhere in the
     file.  All data following a comment character to the end of the
     line is ignored.

   * The set of valid keywords used to form declarations includes
     `ALTERNATION', `FEATURES', `FIELDCODE', `INCLUDE', and `END'.

   * The declarations can appear in any order with the proviso that any
     alternation name, feature name, or fieldcode used in a lexical
     entry must be declared before the lexical entry is read.  In
     practice, this means that the `INCLUDE' declarations should appear
     last, but the `ALTERNATION', `FEATURES', and `FIELDCODE'
     declarations can appear in any order.

   * The `ALTERNATION' declaration defines a set of sublexicon names
     that serve as the continuation class of a lexical item.  The
     `ALTERNATION' keyword is followed by an <ALTERNATION NAME> and a
     <SUBLEXICON NAME LIST>.  `ALTERNATION' declarations are optional
     (but nearly always used in practice) and can occur as many times
     as needed.

   * <ALTERNATION NAME> is a name associated with the following
     <SUBLEXICON NAME LIST>.  It is a word composed of one or more
     characters, not limited to the `ALPHABET' characters declared in
     the rules file.  An alternation name can be any word other than a
     keyword used in the lexicon file.  The program does not check to
     see if an alternation name is actually used in the lexicon file.

   * <SUBLEXICON NAME LIST> is a list of sublexicon names.  It can span
     multiple lines until the next valid keyword is encountered.  Each
     sublexicon name in the list must be used in the sublexicon field
     of a lexical entry.  Although it is not enforced at the time the
     lexicon file is loaded, an undeclared sublexicon named in a
     sublexicon name list will cause an error when the recognizer tries
     to use it.

   * The `FEATURES' keyword is followed by a <FEATURE ABBREVIATION
     LIST>.  A <FEATURE ABBREVIATION LIST> is a list of words, each of
     which is expanded into feature structures by the word grammar.

   * The `FIELDCODE' declaration is used to define what fieldcode will
     be used to mark each type of field in a lexical entry.  The
     `FIELDCODE' keyword is followed by a <CODE> and one of five
     possible internal codes: `U', `L', `A', `F', or `G'.  There must
     be five `FIELDCODE' declarations, one for each of these internal
     codes, where `U' indicates the lexical item field, `L' indicates
     the sublexicon field, `A' indicates the alternation field, `F'
     indicates the features field, and `G' indicates the gloss field.

   * The `INCLUDE' keyword is followed by a <FILESPEC> that names a
     file containing lexical entries to be loaded.  An `INCLUDE'd file
     cannot contain any declarations (such as a `FIELDCODE' or an
     `INCLUDE' declaration), only lexical entries and comment lines.

   * The keyword `END' follows all other declarations and indicates the
     end of the main lexicon file.  Any material in the file thereafter
     is ignored by PC-Kimmo.  The `END' keyword is optional; the
     physical end of the file also terminates the main lexicon file.

Figure 4 shows a sample main lexicon file.

     Figure 4 A sample main lexicon file
     
     ALTERNATION Begin PREF
     ALTERNATION Pref N AJ V AV
     ALTERNATION Stem SUFFIX
     
     FEATURES sg pl reg irreg
     
     FIELDCODE  lf   U   ;lexical item
     FIELDCODE  lx   L   ;sublexicon
     FIELDCODE  alt  A   ;alternation
     FIELDCODE  fea  F   ;features
     FIELDCODE  gl   G   ;gloss
     
     INCLUDE affix.lex    ;file of affixes
     INCLUDE noun.lex     ;file of nouns
     INCLUDE verb.lex     ;file of verbs
     INCLUDE adjectiv.lex ;file of adjectives
     INCLUDE adverb.lex   ;file of adverbs
     
     END
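
The declarations described above are simple enough to read with a few
lines of Python.  The following is a rough sketch under stated
assumptions, not the loader used by PC-Kimmo: the function name
`parse_main_lexicon' and the shape of the returned dictionary are
invented for illustration, keywords are matched case-sensitively, and
the comment character is assumed to be the semicolon unless another is
passed in.

```python
KEYWORDS = {"ALTERNATION", "FEATURES", "FIELDCODE", "INCLUDE", "END"}


def parse_main_lexicon(text, comment=";"):
    """Collect the declarations of a main lexicon file into a dictionary."""
    tokens = []
    for line in text.splitlines():
        # Everything from the comment character to end of line is ignored.
        tokens.extend(line.split(comment, 1)[0].split())

    result = {"alternations": {}, "features": [], "fieldcodes": {}, "includes": []}
    i = 0
    while i < len(tokens):
        keyword = tokens[i]
        if keyword == "END":
            break  # material after END is ignored
        if keyword not in KEYWORDS:
            raise ValueError(f"expected a keyword, found {keyword!r}")
        # The arguments run up to the next keyword, possibly across lines.
        j = i + 1
        while j < len(tokens) and tokens[j] not in KEYWORDS:
            j += 1
        args = tokens[i + 1:j]
        if keyword == "ALTERNATION":
            result["alternations"][args[0]] = args[1:]
        elif keyword == "FEATURES":
            result["features"].extend(args)
        elif keyword == "FIELDCODE":
            code, internal = args
            result["fieldcodes"][internal] = code
        else:  # INCLUDE
            result["includes"].extend(args)
        i = j

    missing = set("ULAFG") - set(result["fieldcodes"])
    if missing:
        raise ValueError(f"missing FIELDCODE declarations: {sorted(missing)}")
    return result
```

Applied to the text of Figure 4, this sketch would return three
alternations, four feature abbreviations, the five field codes keyed by
their internal codes, and the five included file names in order.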

Figure 5 shows the structure of a lexical entry.  Lexical entries are
encoded in "field-oriented standard format."  Standard format is an
information interchange convention developed by SIL International.  It
tags the kinds of information in ASCII text files by means of markers
which begin with backslash.  Field-oriented standard format (FOSF) is a
refinement of standard format geared toward representing data which has
a database-like record and field structure.

     Figure 5 Structure of a lexical entry
     
     \<LEXICAL ITEM CODE> <LEXICAL ITEM>
     \<SUBLEXICON CODE> <SUBLEXICON NAME>
     \<ALTERNATION CODE> {<ALTERNATION NAME> | <BOUNDARY SYMBOL>}
     \<FEATURES CODE> <FEATURES LIST>
     \<GLOSS CODE> <GLOSS STRING>

The following points provide an informal description of the syntax of
FOSF files.

   * A field-oriented standard format (FOSF) file consists of a
     sequence of records.

   * A record consists of a sequence of fields.

   * A field consists of a field marker and a field value.

   * A field marker consists of a backslash character at the beginning
     of a line, followed by an alphabetic or numeric character,
     followed by zero or more printable characters, and terminated by a
     space, tab, or the end of a line.  A field marker without its
     initial backslash character is termed a field code.

   * A field marker must begin in the first position of a line.
     Backslash characters occurring elsewhere in the file are not
     interpreted as field markers.

   * The first field marker of the record is considered the record
     marker, and thus the same field must occur first in every record
     of the file.

   * Each field marker is separated from the field value by one or more
     spaces, tabs, or newlines.  The field value continues up to the
     next field marker.

   * Any line that is empty or contains only whitespace characters is
     considered a comment line and is ignored.  Comment lines may occur
     between or within fields.

   * Fields and lines in an FOSF file can be arbitrarily long.

   * There are two basic types of fields in FOSF files: nonrepeating and
     repeating.  Repeating fields are multiple consecutive occurrences
     of fields marked by the same marker.  Individual fields within a
     repeating field can be called subfields.
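
The informal syntax above can be sketched as a small Python reader.
This is an illustration only: the function name `read_fosf' and the
representation of a record as a list of (field code, value) pairs are
assumptions made here, continuation lines are joined with a single
space, and no comment character is handled.

```python
import re

# A field marker: backslash in the first position of a line, then an
# alphabetic or numeric character, then any non-space characters.
MARKER = re.compile(r"\\([A-Za-z0-9]\S*)\s*(.*)")


def read_fosf(text):
    """Split FOSF text into records, each a list of (code, value) pairs."""
    records = []
    record_code = None  # the first field code seen marks the start of a record
    for line in text.splitlines():
        if not line.strip():
            continue  # empty and whitespace-only lines are comment lines
        match = MARKER.match(line)
        if match:
            code, value = match.group(1), match.group(2).strip()
            if record_code is None:
                record_code = code
            if code == record_code:
                records.append([])
            records[-1].append([code, value])
        elif records:
            # No marker in first position: the field value continues.
            field = records[-1][-1]
            field[1] = (field[1] + " " + line.strip()).strip()
    return [[tuple(field) for field in record] for record in records]
```

A backslash that does not begin a line is left inside the field value,
in keeping with the rule that only a marker in the first position of a
line is interpreted as a field marker.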

The following specifications apply to how FOSF is implemented in
PC-Kimmo.

   * Lexical entries are encoded as records in a FOSF file.

   * Only those fields whose field codes are declared in the main
     lexicon file are recognized (see above on the `FIELDCODE'
     declaration).  All other fields are considered to be extraneous
     and are ignored.

   * The first field of each lexical entry must be the lexical item
     field.  The lexical item field code is assigned to the internal
     code U by a `FIELDCODE' declaration in the main lexicon file.

   * Only nonrepeating fields are permitted.

   * The comment character declared in the rules file is operative in
     included files of lexical entries.  All data following a comment
     character to the end of the line is ignored.

A file of lexical entries is loaded by using an `INCLUDE' declaration
in the main lexicon file (see above).  An `INCLUDE'd file of lexical
entries cannot contain any declarations (such as a `FIELDCODE' or an
`INCLUDE' declaration), only lexical entries and comment lines.

The following specifications apply to lexical entries.

   * A lexical entry is composed of five fields: lexical item,
     sublexicon, alternation, features, and gloss.  The lexical item,
     sublexicon, and alternation fields are obligatory; the features
     and gloss fields are optional.  The first field of the entry must
     always be the lexical item.  The other fields can appear in any
     order, even differing from one entry to another.

   * Although the gloss field is optional, if a lexical entry does not
     include one, a warning message to that effect will be displayed
     when the entry is loaded.  To suppress this warning message, enter
     the command `set warnings off' (see section 3.2.17.19 `set
     warnings') before loading the lexicon.

   * If an entry has an empty gloss field (that is, the field marker
     for the gloss field is present but there is no data after it),
     then the contents of the lexical item field will also be used as
     the gloss for that entry.

   * A lexical item field consists of a <LEXICAL ITEM CODE> and a
     <LEXICAL ITEM>.

   * A <LEXICAL ITEM CODE> is a field code assigned to the internal
     code `U' by a `FIELDCODE' declaration in the main lexicon file.

   * A <LEXICAL ITEM> is one or more characters that represent an
     element (typically a morpheme or word) of the lexicon.  Each
     character (or multigraph) must be in the alphabet defined for the
     language.  The lexical item uses only the lexical subset of the
     alphabet.

   * A sublexicon field consists of a <SUBLEXICON CODE> and a
     <SUBLEXICON NAME>.

   * A <SUBLEXICON CODE> is a field code assigned to the internal code
     `L' by a `FIELDCODE' declaration in the main lexicon file.

   * A <SUBLEXICON NAME> is the name associated with a sublexicon.  It
     is a word composed of one or more characters, not limited to the
     alphabetic characters declared in the rules file.  Every lexical
     item must belong to a sublexicon.  Every lexicon must include a
     special sublexicon named INITIAL (that is, there must be at least
     one lexical entry that belongs to the INITIAL sublexicon).

   * Lexical entries belonging to a sublexicon do not have to be listed
     consecutively in a single file (as was the case for PC-Kimmo
     version 1); rather, lexical entries in a file can occur in any
     order, regardless of what sublexicon they belong to.  Lexical
     entries of a sublexicon can even be placed in two or more separate
     files.

   * An alternation field consists of an <ALTERNATION CODE> followed by
     either an <ALTERNATION NAME> or the <BOUNDARY SYMBOL>.

   * An <ALTERNATION NAME> is declared in an `ALTERNATION' declaration
     in the main lexicon file.  The <BOUNDARY SYMBOL> is declared in
     the rules file and indicates the end of all possible continuations
     in the lexicon.

   * A features field consists of a <FEATURES CODE> and a <FEATURES
     LIST>.

   * A <FEATURES CODE> is a field code assigned to the internal code
     `F' by a `FIELDCODE' declaration in the main lexicon file.

   * A <FEATURES LIST> is a list of feature abbreviations.  Each
     abbreviation is a single word consisting of alphanumeric
     characters or other characters except `(){}[]<>=:$!' (these are
     used for special purposes in the grammar file).  The character `\'
     should not be used as the first character of an abbreviation
     because that is how fields are marked in the lexicon file.  Upper
     and lower case letters used in template names are considered
     different.  For example, `PLURAL' is not the same as `Plural' or
     `plural'.  Feature abbreviations are expanded into full feature
     structures by the word grammar (see chapter 6 `The Grammar File').

   * A gloss field consists of a <GLOSS CODE> and a <GLOSS STRING>.

   * A <GLOSS CODE> is a field code assigned to the internal code `G'
     by a `FIELDCODE' declaration in the main lexicon file.

   * A <GLOSS STRING> is a string of text.  Any material can be used in
     the gloss field with the exception of the comment character.

Figure 6 shows a sample lexical entry.

     Figure 6 A sample lexical entry
     
     \lf  `knives
     \lx  N
     \alt Infl
     \fea pl irreg
     \gl  N(`knife)+PL
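
The rules for lexical entries given above can be summarized in a short
Python sketch.  It is not taken from PC-Kimmo: the function name
`build_entry', the (entry, warnings) return value, and the wording of
the messages are all assumptions made for illustration.  The input is
one record as a list of (field code, value) pairs, plus the mapping
from field codes to internal codes established by the `FIELDCODE'
declarations.

```python
def build_entry(fields, fieldcodes):
    """Check one lexical entry and return (entry, warnings).

    fields     -- list of (field code, value) pairs, in file order
    fieldcodes -- maps declared field codes to internal codes U, L, A, F, G
    """
    if not fields or fieldcodes.get(fields[0][0]) != "U":
        raise ValueError("the first field must be the lexical item field")

    entry = {}
    for code, value in fields:
        internal = fieldcodes.get(code)
        if internal is None:
            continue  # undeclared field codes are extraneous and ignored
        if internal in entry:
            raise ValueError(f"repeating field \\{code} is not permitted")
        entry[internal] = value

    for internal, name in (("L", "sublexicon"), ("A", "alternation")):
        if internal not in entry:
            raise ValueError(f"missing obligatory {name} field")

    warnings = []
    if "G" not in entry:
        warnings.append(f"no gloss field for {entry['U']}")
    elif entry["G"] == "":
        entry["G"] = entry["U"]  # an empty gloss defaults to the lexical item
    entry["F"] = entry.get("F", "").split()  # list of feature abbreviations
    return entry, warnings
```

For the entry in Figure 6, with the field codes of Figure 4, the sketch
yields the feature abbreviations `pl' and `irreg' and no warnings.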

The Grammar File
****************

The following specifications apply generally to the word grammar file:
   * Blank lines, spaces, and tabs separate elements of the grammar
     file from one another, but are ignored otherwise.

   * The comment character declared by the `set comment' command (see
     section 3.2.17.4 `set comment' above) is operative in the grammar
     file.  The default comment character is the semicolon (`;').
     Comments may be placed anywhere in the grammar file.  Everything
     following a comment character to the end of the line is ignored.

   * A grammar file is divided into fields identified by a small set of
     keywords.
       1. `Rule' starts a context-free phrase structure rule with its
          set of feature constraints.  These rules define how words
          join together to form phrases, clauses, or sentences.  The
          lexicon and grammar are tied together by using the lexical
          categories as the terminal symbols of the phrase structure
          rules and by using the other lexical features in the feature
          constraints.

       2. `Let' starts a feature template definition.  Feature
          templates are used as macros (abbreviations) in the lexicon.
          They may also be used to assign default feature structures to
          the categories.

       3. `Parameter' starts a program parameter definition.  These
          parameters control various aspects of the program.

       4. `Define' starts a lexical rule definition.  As noted in
          Shieber (1985), something more powerful than just
          abbreviations for common feature elements is sometimes needed
          to represent systematic relationships among the elements of a
          lexicon.  This need is met by lexical rules, which express
          transformations rather than mere abbreviations.

          Lexical rules are not yet implemented properly.  They may or
          may not be useful for word grammars used by PC-Kimmo.

       5. `Lexicon' starts a lexicon section.  This is only for
          compatibility with the original PATR-II.  The section name is
          skipped over properly, but nothing is done with it.

       6. `Word' starts an entry in the lexicon.  This is only for
          compatibility with the original PATR-II.  The entry is skipped
          over properly, but nothing is done with it.

       7. `End' effectively terminates the file.  Anything following
          this keyword is ignored.

     Note that these keywords are not case sensitive: `RULE' is the
     same as `rule', and both are the same as `Rule'.

   * Each of the fields in the grammar file may optionally end with a
     period.  If there is no period, the next keyword (in an
     appropriate slot) marks the end of one field and the beginning of
     the next.

Rules
=====

A PC-Kimmo word grammar rule has these parts, in the order listed:
  1. the keyword `Rule'

  2. an optional rule identifier enclosed in braces (`{}')

  3. the nonterminal symbol to be expanded

  4. an arrow (`->') or equal sign (`=')

  5. zero or more terminal or nonterminal symbols, possibly marked for
     alternation or optionality

  6. an optional colon (`:')

  7. zero or more feature constraints

  8. an optional period (`.')

The optional rule identifier consists of one or more words enclosed in
braces.  Its current utility is only as a special form of comment
describing the intent of the rule.  (Eventually it may be used as a tag
for interactively adding and removing rules.)  The only limits on the
rule identifier are that it must not contain the comment character and
that it must fit entirely on one line of the grammar file.

The terminal and nonterminal symbols in the rule have the following
characteristics:
   * Upper and lower case letters used in symbols are considered
     different.  For example, `NOUN' is not the same as `Noun', and
     neither is the same as `noun'.

   * The symbol X may be used to stand for any terminal or nonterminal.
     For example, this rule says that any category in the grammar
     rules can be replaced by two copies of the same category separated
     by a CJ.
                  Rule X -> X_1 CJ X_2
                          <X cat>  = <X_1 cat>
                          <X cat>  = <X_2 cat>
                          <X arg1> = <X_1 arg1>
                          <X arg1> = <X_2 arg1>
     The symbol X can be useful for capturing generalities.  Care
     must be taken, since it can be replaced by anything.

   * Index numbers are used to distinguish instances of a symbol that
     is used more than once in a rule.  They are added to the end of a
     symbol following an underscore character (`_').  This is
     illustrated in the rule for X above.

   * The characters `(){}[]<>=:/' cannot be used in terminal or
     nonterminal symbols since they are used for special purposes in the
     grammar file.  The character `_' can be used *only* for attaching
     an index number to a symbol.

   * By default, the left hand symbol of the first rule in the grammar
     file is the start symbol of the grammar.

The symbols on the right hand side of a phrase structure rule may be
marked or grouped in various ways:
   * Parentheses around an element of the expansion (right hand) part
     of a rule indicate that the element is optional. Parentheses may
     be placed around multiple elements. This makes an optional group
     of elements.

   * A forward slash (/) is used to separate alternative elements of the
     expansion (right hand) part of a rule.

   * Curly braces can be used for grouping elements. For example, the
     following says that an S consists of an NP followed by either a TVP
     or an IV:
                  Rule S -> NP {TVP / IV}

   * Alternatives are taken to be as long as possible. Thus if the curly
     braces were omitted from the rule above, as in the rule below, the
     TVP would be treated as part of the alternative containing the NP.
     It would not be allowed before the IV.
                  Rule S -> NP TVP / IV

   * Parentheses group enclosed elements the same as curly braces do.
     Alternatives and groups delimited by parentheses or curly braces
     may be nested to any depth.
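
The grouping conventions above can be made concrete with a small Python
sketch that expands the right hand side of a rule into the plain
sequences of symbols it abbreviates.  The function name `expand_rhs'
is invented for this illustration, and the sketch makes no claim about
how PC-Kimmo itself stores or processes rules.

```python
import re

CLOSERS = {"(": ")", "{": "}"}


def expand_rhs(rhs):
    """Expand optional and alternative elements into plain symbol sequences."""
    tokens = re.findall(r"[(){}/]|[^\s(){}/]+", rhs)
    pos = 0

    def alternatives():
        # One or more sequences separated by "/".
        nonlocal pos
        result = sequence()
        while pos < len(tokens) and tokens[pos] == "/":
            pos += 1
            result = result + sequence()
        return result

    def sequence():
        # Symbols and groups up to the next "/", ")" or "}".
        nonlocal pos
        result = [[]]
        while pos < len(tokens) and tokens[pos] not in ("/", ")", "}"):
            token = tokens[pos]
            pos += 1
            if token in CLOSERS:
                choices = alternatives()
                if pos >= len(tokens) or tokens[pos] != CLOSERS[token]:
                    raise ValueError("unbalanced group in rule")
                pos += 1
                if token == "(":
                    choices = choices + [[]]  # parentheses make the group optional
            else:
                choices = [[token]]
            result = [left + right for left in result for right in choices]
        return result

    expansions = alternatives()
    if pos != len(tokens):
        raise ValueError("unbalanced group in rule")
    return expansions
```

Under this sketch `NP {TVP / IV}' expands to the two sequences `NP TVP'
and `NP IV', whereas `NP TVP / IV' expands to `NP TVP' and `IV',
showing why the braces matter.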

A rule can be followed by zero or more *feature constraints* that refer
to symbols used in the rule.  A feature constraint has these parts, in
the order listed:
  1. a feature path that begins with one of the symbols from the
     phrase structure rule

  2. an equal sign

  3. either another path or a value

A feature constraint that refers only to symbols on the right hand side
of the rule constrains their co-occurrence.  In the following rule and
constraint, the values of the *agr* features for the NP and VP nodes of
the parse tree must unify:
             Rule S -> NP VP
                     <NP agr> = <VP agr>
If a feature constraint refers to a symbol on the right hand side
of the rule, and has an atomic value on its right hand side, then the
designated feature must not have a different value.  In the following
rule and constraint, the *head case* feature for the NP node of the
parse tree must either be originally undefined or equal to NOM:
             Rule S -> NP VP
                     <NP head case> = NOM
(After unification succeeds, the *head case* feature for the NP
node of the parse tree will be equal to NOM.)

A feature constraint that refers to the symbol on the left hand side of
the rule passes information up the parse tree.  In the following rule
and constraint, the value of the *tense* feature is passed from the VP
node up to the S node:
             Rule S -> NP VP
                     <S tense> = <VP tense>
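
The notion of unification behind these constraints can be sketched in
Python with feature structures represented as nested dictionaries and
atomic values as strings.  This is a simplified illustration, not the
PC-Kimmo algorithm: the names `unify' and `UnificationFailure' are
invented here, and shared (reentrant) structure is not modelled.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures carry conflicting values."""


def unify(a, b):
    """Return the unification of two feature structures."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for attribute, value in b.items():
            if attribute in merged:
                merged[attribute] = unify(merged[attribute], value)
            else:
                merged[attribute] = value  # undefined on one side: just add it
        return merged
    if a == b:
        return a  # identical atomic values
    raise UnificationFailure(f"{a!r} conflicts with {b!r}")
```

In terms of the constraint `<NP head case> = NOM', unifying an NP whose
*head case* is undefined with `{"head": {"case": "NOM"}}' succeeds and
leaves the value NOM in place, while an NP already marked with a
different case raises `UnificationFailure'.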

Feature templates
=================

A PC-Kimmo grammar feature template has these parts, in the order
listed:
  1. the keyword `Let'

  2. the template name

  3. the keyword `be'

  4. a feature definition

  5. an optional period (`.')

If the template name is a terminal category (a terminal symbol in one of
the phrase structure rules), the template defines the default features
for that category.  Otherwise the template name serves as an
abbreviation for the associated feature structure.

The characters `(){}[]<>=:' cannot be used in template names since they
are used for special purposes in the grammar file.  The characters `/_'
can be freely used in template names.  The character `\' should not be
used as the first character of a template name because that is how
fields are marked in the lexicon file.

The abbreviations defined by templates are usually used in the feature
field of entries in the lexicon file.  For example, the lexical entry
for the irregular plural form *feet* may have the abbreviation *pl* in
its features field.  The grammar file would define this abbreviation
with a template like this:
            Let pl be [number: PL]
The path notation may also be used:
            Let pl be <number> = PL
More complicated feature structures may be defined in templates.  For
example,
          Let 3sg be [tense:  PRES
                      agr:    3SG
                      finite: +
                      vform:  S]
which is equivalent to:
          Let 3sg be <tense>  = PRES
                     <agr>    = 3SG
                     <finite> = +
                     <vform>  = S
In the following example, the abbreviation *irreg* is defined using
another abbreviation:
            Let irreg be <reg> = -
                         pl
The abbreviation *pl* must be defined previously in the grammar
file or an error will result.  A subsequent template could also use the
abbreviation *irreg* in its definition.  In this way, an inheritance
hierarchy of features may be constructed.

Feature templates permit disjunctive definitions.  For example, the
lexical entry for the word *deer* may specify the feature abbreviation
*sg/pl*.  The grammar file would define this as a disjunction of
feature structures reflecting the fact that the word can be either
singular or plural:
         Let sg/pl be {[number:SG]
                       [number:PL]}
This has the effect of creating two entries for *deer*, one with
singular number and another with plural.  Note that there is no limit
to the number of disjunct structures listed between the braces.  Also,
there is no slash (`/') between the elements of the disjunction as
there is between the elements of a disjunction in the rules.  A shorter
version of the above template using the path notation looks like this:
         Let sg/pl be <number> = {SG PL}
Abbreviations can also be used in disjunctions, provided that they have
previously been defined:
           Let sg be <number> = SG
           Let pl be <number> = PL
           Let sg/pl be {[sg] [pl]}
Note the square brackets around the abbreviations *sg* and *pl*;
without square brackets they would be interpreted as simple values
instead.
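
How abbreviations that refer to other abbreviations, and disjunctive
abbreviations, expand into one or more feature structures can be
sketched in Python.  The data layout is an assumption made for this
illustration only: each template is a list of disjuncts, each disjunct
is a list of parts, and a part is either a flat dictionary of features
or the name of a previously defined template.  The names `TEMPLATES'
and `expand_template' are likewise invented.

```python
TEMPLATES = {
    "sg": [[{"number": "SG"}]],
    "pl": [[{"number": "PL"}]],
    "irreg": [[{"reg": "-"}, "pl"]],   # Let irreg be <reg> = -  pl
    "sg/pl": [["sg"], ["pl"]],         # Let sg/pl be {[sg] [pl]}
}


def expand_template(name, templates):
    """Return every feature structure that the abbreviation stands for."""
    results = []
    for disjunct in templates[name]:
        partial = [{}]
        for part in disjunct:
            # A string names another template; a dict is a literal structure.
            options = expand_template(part, templates) if isinstance(part, str) else [part]
            combined = []
            for left in partial:
                for right in options:
                    # Keep the combination only if no attribute conflicts.
                    if all(left.get(key, value) == value for key, value in right.items()):
                        combined.append({**left, **right})
            partial = combined
        results.extend(partial)
    return results
```

Here `expand_template("sg/pl", TEMPLATES)' returns two structures, one
singular and one plural, which corresponds to the two entries created
for *deer*.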

Feature templates can assign default atomic feature values, indicated
by prefixing an exclamation point (!).  A default value can be
overridden by an explicit feature assignment.  This template says that
all members of category N have singular number as a default value:
           Let N be <number> = !SG
The effect of this template is to make all nouns singular unless they
are explicitly marked as plural.  For example, regular nouns such as
*book* do not need any feature in their lexical entries to signal that
they are singular; but an irregular noun such as *feet* would have a
feature abbreviation such as *pl* in its lexical entry.  This would be
defined in the grammar as `[number: PL]', and would override the
default value for the feature number specified by the template above.
If the N template above used `SG' instead of `!SG', then the word
*feet* would fail to parse, since its *number* feature would have an
internal conflict between `SG' and `PL'.
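
The difference between a default value and an ordinary value can be
sketched in Python.  The function name `apply_category_template' and
its arguments are hypothetical; features are flat dictionaries, and
`defaults' names the attributes whose template values were written with
an exclamation point.

```python
def apply_category_template(entry_features, template, defaults=()):
    """Merge a category template into the features of a lexical entry.

    Returns the merged features, or None if an ordinary (non-default)
    template value conflicts with a value the entry already has.
    """
    result = dict(entry_features)
    for attribute, value in template.items():
        if attribute not in result:
            result[attribute] = value          # nothing explicit: use the template
        elif result[attribute] != value and attribute not in defaults:
            return None                        # ordinary value: conflict
        # otherwise the explicit value overrides the default
    return result
```

With `{"number": "SG"}' as the N template, an entry for *feet* carrying
`{"number": "PL"}' keeps its plural value when `number' is listed in
`defaults', and the merge returns None when it is not.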

Parameter settings
==================

A PC-Kimmo grammar parameter setting has these parts, in the order
listed:
  1. the keyword `Parameter'

  2. an optional colon (`:')

  3. one or more keywords identifying the parameter

  4. the keyword `is'

  5. the parameter value

  6. an optional period (`.')

PC-Kimmo recognizes the following grammar parameters:
`Start symbol'
     defines the start symbol of the grammar.  For example,
                  Parameter Start symbol is S
     declares that the parse goal of the grammar is the nonterminal
     category S.  The default start symbol is the left hand symbol of
     the first phrase structure rule in the grammar file.

`Restrictor'
     defines a set of features to use for top-down filtering, expressed
     as a list of feature paths.  For example,
                  Parameter Restrictor is <cat> <head form>
     declares that the *cat* and *head form* features should be used to
     screen rules before adding them to the parse chart.  The default
     is not to use any features for such filtering.  This filtering,
     named *restriction* in Shieber (1985), is performed in addition to
     the normal top-down filtering based on categories alone.
     Note that restriction is not yet implemented.  Whether it should
     replace the normal filtering, rather than supplement it, is still
     an open question.

`Attribute order'
     specifies the order in which feature attributes are displayed.  For
     example,
                  Parameter Attribute order is cat lex sense head
                                               first rest agreement
     declares that the *cat* attribute should be the first one
     shown in any output from PC-Kimmo, and that the other attributes
     should be shown in the relative order shown, with the *agreement*
     attribute shown last among those listed, but ahead of any
     attributes that are not listed above.  Attributes that are not
     listed are ordered according to their character code sort order.
     If the attribute order is not specified, then the category feature
     *cat* is shown first, with all other attributes sorted according
     to their character codes.

`Category feature'
     defines the label for the category attribute.  For example,
                  Parameter Category feature is Categ
     declares that *Categ* is the name of the category attribute.  The
     default name for this attribute is *cat*.

`Lexical feature'
     defines the label for the lexical attribute.  For example,
                  Parameter Lexical feature is Lex
     declares that *Lex* is the name of the lexical attribute.  The
     default name for this attribute is *lex*.

`Gloss feature'
     defines the label for the gloss attribute.  For example,
                  Parameter Gloss feature is Gloss
     declares that *Gloss* is the name of the gloss attribute.  The
     default name for this attribute is *gloss*.
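
The ordering described under `Attribute order' can be expressed as a
sort key.  The following Python sketch uses the invented name
`display_order' and is only meant to restate the rule: listed
attributes come first in the order given, and all others follow in
character code order.

```python
def display_order(attributes, attribute_order=None, category_feature="cat"):
    """Return attribute names in the order in which they would be shown."""
    # With no Attribute order parameter, only the category feature is "listed".
    listed = list(attribute_order) if attribute_order else [category_feature]
    rank = {name: position for position, name in enumerate(listed)}
    # Unlisted attributes share one rank and are then sorted by character code.
    return sorted(attributes, key=lambda name: (rank.get(name, len(listed)), name))
```

Because Python compares strings by code point, an unlisted attribute
beginning with an upper case letter sorts ahead of one beginning with a
lower case letter.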

Lexical rules
=============

A PC-Kimmo grammar lexical rule has these parts, in the order listed:
  1. the keyword `Define'

  2. the name of the lexical rule

  3. the keyword `as'

  4. the rule definition

  5. an optional period (`.')

The rule definition consists of one or more mappings.  Each mapping has
three parts: an output feature path, an assignment operator, and the
value assigned, either an input feature path or an atomic value.  Every
output path begins with the feature name `out' and every input path
begins with the feature name `in'.  The assignment operator is either
an equal sign (`=') or an equal sign followed by a "greater than" sign
(`=>').

As noted before, lexical rules are not yet implemented properly, and
may not prove to be useful for PC-Kimmo word grammars in any case.

Convlex: converting version 1 lexicons
**************************************

The format of the lexicon files changed significantly between version 1
and version 2 of PC-Kimmo.  For this reason, an auxiliary program to
convert lexicon files was written.

A version 1 PC-Kimmo lexicon file looks like this:
     ; SAMPLE.LEX  25-OCT-89
     
     ; To load this file, first load the rules file SAMPLE.RUL and
     ; then enter the command LOAD LEXICON SAMPLE.
     
     ALTERNATION  Begin      NOUN
     ALTERNATION  Noun       End
     
     LEXICON INITIAL
       0             Begin           "[ "
     
     LEXICON NOUN
       s'ati         Noun            "Noun1"
       s'adi         Noun            "Noun2"
       bab'at        Noun            "Noun3"
       bab'ad        Noun            "Noun4"
     
     LEXICON End
       0             #               " ]"
     
     END
For PC-Kimmo version 2, the same lexicon must be split into two
files.  The first one would look like this:
     ; SAMPLE.LEX  25-OCT-89
     
     ; To load this file, first load the rules file SAMPLE.RUL and
     ; then enter the command LOAD LEXICON SAMPLE.
     
     ALTERNATION  Begin      NOUN
     ALTERNATION  Noun       End
     
     FIELDCODE lf  U
     FIELDCODE lx  L
     FIELDCODE alt A
     FIELDCODE fea F
     FIELDCODE gl  G
     
     INCLUDE sample2.sfm
     END
Note that everything except the lexicon sections and entries has
been copied verbatim into this new primary lexicon file.  The
`FIELDCODE' statements define how to interpret the other lexicon files
containing the actual lexicon sections and entries.  These files are
indicated by `INCLUDE' statements, and look like this:
     \lf 0
     \lx INITIAL
     \alt Begin
     \fea
     \gl [
     
     \lf s'ati
     \lx NOUN
     \alt Noun
     \fea
     \gl Noun1
     
     \lf s'adi
     \lx NOUN
     \alt Noun
     \fea
     \gl Noun2
     
     \lf bab'at
     \lx NOUN
     \alt Noun
     \fea
     \gl Noun3
     
     \lf bab'ad
     \lx NOUN
     \alt Noun
     \fea
     \gl Noun4
     
     \lf 0
     \lx End
     \alt #
     \fea
     \gl  ]

`convlex' was written to make the transition from version 1 to version
2 of PC-Kimmo as painless as possible.  It reads a version 1 lexicon
file, including any `INCLUDE'd files, and writes a version 2 set of
lexicon files.  For a trivial case like the example above, the
interaction with the user might go something like this:
     C:\>convlex
     CONVLEX: convert lexicon from PC-KIMMO version 1 to version 2
     
     Comment character: [;]
     Input lexicon file: sample.lex
     Output lexicon file: sample2.lex
     Primary sfm lexicon file: sample2.sfm
For each `INCLUDE' statement in the version 1 lexicon file,
`convlex' prompts for a replacement filename like this:
     New sfm include file to replace noun.lex: noun2.sfm

The user interface is extremely crude, but since this is a program that
is run only once or twice by most users, that should not be regarded as
a problem.
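
The core of the conversion shown above, turning version 1 lexicon
sections into version 2 records, can be sketched in Python.  This is
not the source of `convlex': the function name `convert_entries' is
invented, the field codes are fixed to those used in the example, and
`INCLUDE' statements and glosses containing the comment character are
not handled.

```python
import re

# A version 1 entry: lexical item, alternation (or #), gloss in double quotes.
ENTRY = re.compile(r'^\s*(\S+)\s+(\S+)\s+"(.*)"\s*$')


def convert_entries(v1_text, comment=";"):
    """Rewrite the LEXICON sections of a version 1 file as FOSF records."""
    records = []
    sublexicon = None
    for line in v1_text.splitlines():
        line = line.split(comment, 1)[0].rstrip()
        words = line.split()
        if not words:
            continue
        if words[0] == "END":
            break
        if words[0] == "LEXICON":
            sublexicon = words[1]  # entries that follow belong to this sublexicon
            continue
        match = ENTRY.match(line)
        if match and sublexicon is not None:
            form, alternation, gloss = match.groups()
            records.append("\n".join([
                f"\\lf {form}",
                f"\\lx {sublexicon}",
                f"\\alt {alternation}",
                "\\fea",
                f"\\gl {gloss.strip()}",
            ]))
    return "\n\n".join(records) + "\n"
```

The `ALTERNATION' declarations are skipped by this function because
they stay in the primary lexicon file rather than in the files of
lexical entries.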

Bibliography
************

  1. Antworth, Evan L. 1990.  `PC-KIMMO: a two-level processor for
     morphological analysis'.  Occasional Publications in Academic
     Computing No. 16. Dallas, TX: Summer Institute of Linguistics.

  2. Antworth, Evan L. 1991.  Introduction to two-level phonology.
     `Notes on Linguistics' 53:4-18.  Dallas, TX: Summer Institute of
     Linguistics.

  3. Antworth, Evan L. 1995.  `User's Guide to PC-KIMMO version 2'.  URL
     ftp://ftp.sil.org/software/dos/pc-kimmo/guide.zip (visited August
     29, 1997).

  4. Chomsky, Noam. 1957.  `Syntactic structures.'  The Hague: Mouton.

  5. Chomsky, Noam, and Morris Halle. 1968.  `The sound pattern of
     English.'  New York: Harper and Row.

  6. Goldsmith, John A. 1990.  `Autosegmental and metrical phonology.'
     Basil Blackwell.

  7. Johnson, C. Douglas. 1972.  `Formal aspects of phonological
     description.'  The Hague: Mouton.

  8. Kay, Martin. 1983.  When meta-rules are not meta-rules.  In Karen
     Sparck Jones and Yorick Wilks, eds., `Automatic natural language
     parsing,' 94-116.  Chichester: Ellis Horwood Ltd. See pages
     100-104.

  9. Koskenniemi, Kimmo. 1983.  `Two-level morphology: a general
     computational model for word-form recognition and production.'
     Publication No. 11.  Helsinki: University of Helsinki Department
     of General Linguistics.