Writing Kimmo Lexicons
1.0 First, two general tips on rule writing and lexicon writing.
Besides tracing a rule, there are two other PCKIMMO commands
that you can use.
1. The first is SET RULE {ON|OFF}, which lets you turn
individual rules on or off. This is very helpful if you find that
your recognizer returns NO results, i.e., gives the output
****NONE***** when it should be giving some parse. You can then turn
off rules one by one until you isolate the offending rule.
(Usually it is a rule that leaves the system in a non-final
state when it should not.)
To use this, your rules must begin with numbered, double-quoted
comment lines, as in:
"Rule 3 EPENTHESIS.... mumble....." 5 6
The only thing that matters here in quotes is the "Rule 3" business.
If you want to see your list of rules, do SHOW RULES.
2. The second debugging aid is the SHOW RULE command.
This is very helpful in dealing with one of the trickier parts of
writing rules: how the different subsets and feasible pairs
(lexical/surface characters) interact, even within one rule. SHOW
RULE will tell you what characters are ACTUALLY being processed by the
automata, which can differ from what you wrote down in the automaton
spec.
Let me illustrate with an example.
Consider the following INCORRECT Epenthesis rule. Here S represents
the subset s, x, z. Remember that this is the rule that is supposed to
pair fox+s with foxes. That is, IF the pair 0:e appears, then it
must have a left context of S (where S is s, x, or z) followed by +:0,
and a right context of s#. (Note that the right context doesn't care
what the underlying lexical form is, so we don't write it down.) This
is a DECLARATIVE CONSTRAINT that says when the pair is OK. In fox+s,
the left context of 0:e is indeed one of the members of S:S (in this
case, X:x) followed by +:0, and the right context is indeed simply s
# on the surface. (We don't really have to mention the hash mark #,
the boundary symbol, in the lexical or underlying string; it is
assumed to be the same as in the surface string.) So we are really
lining up the following pairs of characters, where I have written the
surface characters in lower case. This is, in fact, how you can
develop your own automata: first try pairing up lexical and surface
strings for the Spanish examples.
F O X + 0 S # (lexical, or underlying)
f o x 0 e s # (surface)
RULE "3 Epenthesis. 0:e ==> S +:0____s#" 5 6
     s S + 0 # @
     s S 0 e # @
  1: 1 2 1 0 1 1
  2: 1 2 3 0 1 1
  3: 1 2 1 4 1 1
  4. 5 0 0 0 0 0
  5. 0 0 0 0 1 0
The 5 6 at the end of the RULE statement gives the number of rows and
columns in the state table. Recall that 0 here means a reject state.
The states are listed in the leftmost column. The transition arc
labels are the top two rows, read downward as (lexical, surface)
pairs: these are the feasible pairs. The inner cells say what the
next states are: if we are in state 1 and see an s:s, then we go to
state 1.
Given the lexical form fox+s, this table correctly produces foxes, but given
kiss+s, it fails to produce the form kisses. Doing the SHOW RULE command
will give us the following information, to tell us why it fails:
>show rule 3
3 on Epenthesis Epenthesis. 0:e ==> S +:0____s#"
s:s ( s:s )
S:S ( x:x z:z )
+:0 ( +:0 )
0:e ( 0:e )
#:# ( #:# )
@:@ ( b:b d:d F:F g:g j:j k:k l:l m:m n:n p:p q:q r:r t:t
v:v w:w y:y a:a e:e i:i o:o u:u ':' : :0 ':0 )
From this display, it is clear that the column header S:S does NOT
contain the pair s:s, as might be expected. This is because the column
headers s:s and S:S OVERLAP with respect to the pair s:s: this pair
matches BOTH. The pair s:s is assigned to the s:s column because that
header is MORE SPECIFIC than the S:S column header. That is, only ONE
feasible pair matches the s:s header, while three pairs match the S:S
header.
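To make the "most specific header wins" idea concrete, here is a
minimal Python sketch. It is not PC-KIMMO code: the function names, the
shortened list of feasible pairs, and the "fewest matching pairs"
measure of specificity are assumptions chosen only to mirror the
SHOW RULE display above.

```python
# Illustrative only: names and the toy pair list are assumptions,
# not PC-KIMMO internals.
SUBSETS = {"S": {"s", "x", "z"}}

HEADERS = [("s", "s"), ("S", "S"), ("+", "0"), ("0", "e"), ("#", "#"), ("@", "@")]

# A small sample of feasible pairs (lexical, surface); the real list is longer.
FEASIBLE = [("s", "s"), ("x", "x"), ("z", "z"), ("+", "0"), ("0", "e"),
            ("#", "#"), ("k", "k"), ("i", "i"), ("f", "f"), ("o", "o")]


def char_matches(header_char, char):
    """True if one header character covers one concrete character."""
    if header_char == "@":              # wildcard
        return True
    if header_char in SUBSETS:          # subset name such as S
        return char in SUBSETS[header_char]
    return header_char == char          # literal character


def header_matches(header, pair):
    return char_matches(header[0], pair[0]) and char_matches(header[1], pair[1])


def assign_columns(headers, pairs):
    """Give each pair to the matching header that covers the fewest pairs."""
    coverage = {h: sum(header_matches(h, p) for p in pairs) for h in headers}
    columns = {h: [] for h in headers}
    for p in pairs:
        matching = [h for h in headers if header_matches(h, p)]
        columns[min(matching, key=coverage.get)].append(p)
    return columns


columns = assign_columns(HEADERS, FEASIBLE)
for header, members in columns.items():
    print(":".join(header), members)
```

In this sketch the pair s:s ends up only under the s:s header, and the
S:S header is left with x:x and z:z, which is the same partition the
SHOW RULE display reports.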
Thus the input form kiss+s FAILS the rule because the final s in the
root "kiss" is matched to the s:s column, leaving the table in state
1, where we really want it to be in state 2 (the left context
mentions S:S, and that is what state 2 is tracking for us). The table
must be revised so that for the first three states the s:s column has
the same state transitions as the S:S column.
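The failure and the fix can be checked by hand-simulating the table.
The sketch below is only an illustration under stated assumptions: the
column lookup hard-codes the partition shown by SHOW RULE, the colon
after a state number is taken to mark a final state (so states 1 to 3
are final), and REVISED is one reading of "give the s:s column the
same transitions as the S:S column in the first three states". The
names ORIGINAL, REVISED, and accepts are made up for this example.

```python
# Columns, in order: s:s  S:S  +:0  0:e  #:#  @:@
ORIGINAL = {
    1: [1, 2, 1, 0, 1, 1],
    2: [1, 2, 3, 0, 1, 1],
    3: [1, 2, 1, 4, 1, 1],
    4: [5, 0, 0, 0, 0, 0],
    5: [0, 0, 0, 0, 1, 0],
}

# Assumed revision: in states 1-3 the s:s column copies the S:S column.
REVISED = {
    1: [2, 2, 1, 0, 1, 1],
    2: [2, 2, 3, 0, 1, 1],
    3: [2, 2, 1, 4, 1, 1],
    4: [5, 0, 0, 0, 0, 0],
    5: [0, 0, 0, 0, 1, 0],
}

FINAL_STATES = {1, 2, 3}  # assumption: rows written with a colon are final


def column_of(pair):
    """Column a pair falls into, following the SHOW RULE partition."""
    if pair == "s:s":
        return 0
    if pair in ("x:x", "z:z"):   # what is left of S:S once s:s is taken out
        return 1
    if pair == "+:0":
        return 2
    if pair == "0:e":
        return 3
    if pair == "#:#":
        return 4
    return 5                     # everything else goes to @:@


def accepts(table, pairs):
    """Run one state table over a space-separated string of lexical:surface pairs."""
    state = 1
    for pair in pairs.split():
        state = table[state][column_of(pair)]
        if state == 0:           # 0 is the reject state
            return False
    return state in FINAL_STATES


foxes = "f:f o:o x:x +:0 0:e s:s #:#"
kisses = "k:k i:i s:s s:s +:0 0:e s:s #:#"

print(accepts(ORIGINAL, foxes), accepts(ORIGINAL, kisses))
print(accepts(REVISED, foxes), accepts(REVISED, kisses))
```

With the original table, kisses is rejected at the 0:e pair because
the machine is still in state 1; with the revised rows both forms go
through.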
Finally, there is the command SHOW LEXICON, which will help
if you mistype a lexicon name, etc.
2.0 Writing KIMMO lexicons.
(My apologies if you received this more than once;
something went amiss with the AI mailer. Also, I've
posted this material on the web site.)
Here we explain how two-level rules work, how
they can be implemented as finite-state machines, and
how all the types of rule constraints can be translated
into finite-state tables. We then summarize the rule
semantics. This is followed by a detailed discussion
of rule conflicts; specificity and conflicts amongst
SUBSETS; and finally, an explanation of the rule
file format and the rules
in the pckimmo file english.rul.
It's a lot to read through, but I hope it is complete
and will guide you through Spanish.
1. How two-level rules work.
Consider Rule 2 (R2) below.
R2 t:c ==> ____i
The operator ==> means that lexical t is realized as a surface c only
(but not always) in the environment preceding i:i.
The correspondence t:c declared in R2 is a special correspondence.
All two-level descriptions must also contain a set of *default*
correspondences, such as t:t, i:i, etc. (This is the so-called
"BOGUS RULE"; it isn't really bogus, it is a default.)
Together, the special and default correspondences make up the total
set of valid correspondences, or feasible pairs, that can be used in
the description.
If a two-level description containing R2 (and all default
correspondences) is applied to the lexical (underlying) form "tati"
(without the quote marks), PCKIMMO proceeds as follows to produce the
corresponding surface form(s). (NOTE: this is why you can use
GENERATE without a dictionary and with JUST the .rul file.)
Beginning with the first character of the input form, it looks to see
if there is a correspondence declared for it. Due to R2, it will find
that lexical t can correspond to surface c, so it will begin by
positing that correspondence.
Lexical: t  a  t  i
Rule:    R2
Surface: c
At this point the generator has entered R2. For the posited t:c
correspondence to succeed, the generator MUST find an i:i
correspondence next; that is what R2 says. When the generator moves
on to the second character of the input word, it finds that it is a
lexical a, and thus R2 FAILS, so the generator must back up, undo
what it has done so far, and try to find a different path. Backing up
to the first character t, it now tries the DEFAULT correspondence t:t
(which is guaranteed to succeed, since it has NO conditions):
Lexical: t  a  t  i
Rule:    R2
Surface: t
The generator now moves on to the second character. No correspondence
for lexical a has been declared other than the default, so the
generator posits a surface a:
Lexical: t  a  t  i
Rule:    R2
Surface: t  a
Moving on to the third character, the generator again finds a lexical
t, so it posits a surface c and enters R2 again:
Lexical: t  a  t  i
Rule:    R2    R2
Surface: t  a  c
Now the generator looks at the fourth character, a lexical i. This
SATISFIES the environment of R2, so it keeps the i (NOTE that the
constraint refers only to a surface i, and says nothing about the
lexical, underlying character):
R2 t:c ==> ____i
Since the context of R2 requires an i, the generator must also posit
a surface i, so it does, and exits R2. NOTE that by the time R2 is
finished, TWO characters will have been posited.
Lexical: t  a  t  i
Rule:    R2    R2
Surface: t  a  c  i
Since there are no more characters in the lexical form, the generator
outputs the surface form "taci". However, the generator is not yet
done. It will continue backtracking, trying to find alternative
realizations of the lexical form. First, it will undo the i:i
correspondence of the last character of the input word, then it will
consider the third character, lexical t. Having already tried the
correspondence t:c, it will try the default correspondence t:t:
Lexical: t  a  t  i
Rule:    R2
Surface: t  a  t  i
Now the generator will try the final correspondence and succeed,
since R2 does NOT prohibit t:t before an i (rather, it prohibits t:c
in any environment EXCEPT BEFORE i). It will then output "tati".
The reader may confirm that no other outputs will be generated.
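The walkthrough above can be replayed with a few lines of Python. This
is a minimal sketch of the backtracking idea only, not how PC-KIMMO is
actually coded: the function name generate, the "pending" flag that
stands in for being inside R2, and the choice to try the special
correspondence t:c before the default t:t are all assumptions made to
match the order of the narrative.

```python
def generate(lexical):
    """Enumerate surface forms for `lexical` under R2 (t:c ==> ___i) by backtracking."""
    results = []

    def step(pos, surface, pending):
        # `pending` is True when t:c was just posited and R2 still needs i:i next.
        if pos == len(lexical):
            if not pending:          # R2 must not be left waiting for its i:i
                results.append(surface)
            return
        ch = lexical[pos]
        # Special correspondence first, then the default, as in the walkthrough.
        candidates = ["c", "t"] if ch == "t" else [ch]
        for surf in candidates:
            if pending and (ch, surf) != ("i", "i"):
                continue             # R2 fails here: back up and try another pair
            step(pos + 1, surface + surf, (ch, surf) == ("t", "c"))

    step(0, "", False)
    return results


print(generate("tati"))
```

Running it on "tati" yields taci first and then tati, in the same
order as the trace; the dead end after positing t:c for the first
character shows up as the `continue` branch.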
2. The ==> rule as a finite-state machine.
A key insight of PCKIMMO is that if phonological rules are written as
two-level rules, they can be implemented as FSTs running in
parallel. In the next 4 sections we briefly show how each of the four
rule types (==>, <==, <==>, and /<==) translates to an FST.
We then go on to describe conflicts in SUBSETS and RULES.
2.1 A ==> rule.
Consider rule R2 again.
A possible paraphrase is: if ever the correspondence t:c occurs, it
must be followed by i:i. In other words, if anything OTHER THAN t:c
occurs, this rule ignores it. This must be incorporated into our
two-level FST; call it T2 (for table 2):
     t i @
     c i @
  1: 2 1 1
  2. 0 1 0
The @:@ arc means ANY pair OTHER than t:c or i:i.
State 1 is a kind of 'default' state that ignores everything except
the substring crucial to the rule. It is also the only final,
accepting state.
Importantly, the state table is constructed such that the entire set
of feasible pairs in the rule description is partitioned among the
column headers WITH NO OVERLAP (overlap is the source of MANY bugs in
Kimmo rule systems). T2 specifies the special correspondence t:c and
the environment in which it is allowed. (On t:c the machine goes to
state 2 to anticipate that an i:i comes next; if it does, the machine
succeeds and returns to state 1; if not, it goes to state 0, the
rejecting state.)
The column header @:@ in T2 matches ALL the feasible pairs that are
defined by ALL THE OTHER FSTs of the system, thus saying that R2
'takes a pass' and doesn't care about any other feasible pairs. So,
with respect to T2, @:@ does not stand for all feasible pairs but
rather for all feasible pairs except t:c and i:i.
The default correspondences of the system must be declared in a
trivial FST like T3 (also see below, where we cover the .rul file
format). If we assume p, t, k, a, i, u in our alphabet, then we need:
     p t k a i u @
     p t k a i u @
  1: 1 1 1 1 1 1 1
(Table T3)
Even this table of correspondences must include @:@ as a column.
Otherwise, it would fail when it encountered a special correspondence
such as t:c, because all the rules in a two-level description apply in
parallel, and for each character in an input string ALL the rules
must succeed, even if vacuously. Now, given the lexical form tatik,
T2 and T3 together will generate the surface forms tatik and tacik.
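Here is a small Python sketch of T2 and T3 running in parallel as a
generator. It is a toy under stated assumptions, not the PC-KIMMO
implementation: the dictionary layout of the tables, the helper names
next_state and generate, and the rule that a pair with no header of
its own falls into the @:@ column are all choices made for this
illustration.

```python
# Each table: column headers, transition rows, and the set of final states.
T2 = {"headers": ["t:c", "i:i", "@:@"],
      "rows": {1: [2, 1, 1], 2: [0, 1, 0]},
      "final": {1}}
T3 = {"headers": ["p:p", "t:t", "k:k", "a:a", "i:i", "u:u", "@:@"],
      "rows": {1: [1, 1, 1, 1, 1, 1, 1]},
      "final": {1}}
TABLES = [T2, T3]

# Feasible pairs: every concrete pair named in some column header.
FEASIBLE = sorted({h for t in TABLES for h in t["headers"] if h != "@:@"})


def next_state(table, state, pair):
    """Look up the pair's own column if it has one, else the @:@ column."""
    headers = table["headers"]
    col = headers.index(pair) if pair in headers else headers.index("@:@")
    return table["rows"][state][col]


def generate(lexical):
    """Return all surface forms for which EVERY table succeeds at every step."""
    results = []

    def step(pos, surface, states):
        if pos == len(lexical):
            if all(s in t["final"] for s, t in zip(states, TABLES)):
                results.append(surface)
            return
        for pair in FEASIBLE:
            lex, surf = pair.split(":")
            if lex != lexical[pos]:
                continue
            new_states = [next_state(t, s, pair) for t, s in zip(TABLES, states)]
            if 0 not in new_states:   # any table in state 0 blocks the path
                step(pos + 1, surface + surf, new_states)

    step(0, "", [1] * len(TABLES))
    return results


print(generate("tatik"))
```

Note that T3 only survives the pair t:c because of its @:@ column,
which is the point made in the paragraph above.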
IMPORTANT. To understand how to represent two-level rules as state
tables, we must understand what the rules really mean. It is a
common tendency to think of them positively, that is, as statements of
where the correspondence succeeds. IN FACT STATE TABLES ARE FAILURE
DRIVEN: THEY SPECIFY WHERE THE CORRESPONDENCES MUST FAIL.
This point is perhaps THE biggest source of difficulty in building
the FSTs.
In our case above, it is natural to think of R2 as saying that t:c
succeeds when it occurs preceding i:i. But T2 actually works because
it FAILS when ANYTHING BUT i:i follows t:c.
2.2 A <== rule.
Now consider R4.
R4 t:c <== ____i
This rule says that lexical t is always realized as surface c when it
occurs before i:i, but NOT ONLY BEFORE i:i. Thus, the lexical form
tati will successfully match the surface form taci, but not tati.
Note, however, that it would also match "caci", since it does not
disallow t:c in any environment. Rather, its function is to disallow
t:t in the environment preceding i:i.
Remember that state tables are failure-driven, so the strategy for
writing the state table for R4 is to force it to fail if it
recognizes the sequence t:t i:i. So the state table for R4, viz., T4,
looks like this:
T4
     t t i @
     c t i @
  1: 1 2 1 1
  2: 1 2 0 1
In state 1, any occurrences of the pairs t:c, i:i, or any other
feasible pairs are allowed without leaving state 1. It is only the
correspondence t:t that forces a transition to state 2, where all
feasible pairs succeed except i:i. Note that state 2 must be a final
state; this allows all the other correspondences to succeed and return
to state 1. Also note that in state 2 the cell under the t:t column
contains a 2. This is necessary to allow for the possibility of a tt
sequence in the input. For example, tatti will surface as the form
tatci. This phenomenon is called "backlooping"; more on this below.
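A quick way to convince yourself of these claims is to run T4 over a
few lexical:surface pairings. The sketch below is an illustration, not
PC-KIMMO code, and it rests on one assumption worth flagging: the
state 2 row is taken to be 1 2 0 1, so that t:c succeeds from state 2,
in line with the statement that every pair except i:i succeeds there
and with the tatti/tatci example. The names T4_ROWS and t4_accepts are
made up for this example.

```python
T4_HEADERS = ["t:c", "t:t", "i:i", "@:@"]
T4_ROWS = {
    1: [1, 2, 1, 1],
    2: [1, 2, 0, 1],   # assumed row: only i:i fails after t:t
}
T4_FINAL = {1, 2}      # both states are final


def t4_accepts(pairs):
    """Run T4 over a space-separated string of lexical:surface pairs."""
    state = 1
    for pair in pairs.split():
        if pair in T4_HEADERS:
            col = T4_HEADERS.index(pair)
        else:
            col = T4_HEADERS.index("@:@")   # any other feasible pair
        state = T4_ROWS[state][col]
        if state == 0:                      # failure-driven: t:t i:i lands here
            return False
    return state in T4_FINAL


print(t4_accepts("t:t a:a t:c i:i"))        # tati  / taci
print(t4_accepts("t:t a:a t:t i:i"))        # tati  / tati
print(t4_accepts("t:c a:a t:c i:i"))        # tati  / caci
print(t4_accepts("t:t a:a t:t t:c i:i"))    # tatti / tatci
```

The second pairing is the only one rejected, since it is the only one
containing the sequence t:t i:i. The last pairing exercises the 2 in
the t:t column of state 2 (the backloop).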
Actually T4 is potentially overspecified. It is not really the pair
t:t that is disallowed before i, but rather the pair t:not-c
(lexical t and surface anything but c). Given that the more specific
correspondence t:c is already in the table, the more general
correspondence t:@ will take care of all the rest of the characters,
including t:t. (I'll leave the details of this to you.)
In summary, the rule type L:S <== E positively says that L is ALWAYS
realized as S in the environment E. Thus, it is a kind of OBLIGATORY
rule. Negatively, it says that L realized as any character but S
is not allowed in E. The state table must be written so that it
forces all correspondences of L with anything BUT S to fail in E.
2.3 A <==> rule.
R5 t:c <==> ____i
The state table for a <==> rule is simply the combination of the
tables for ==> and <==. You build it by ANDing (intersecting) the two
FSTs. So here, lexical t MUST be realized as c before i, and t:c may
occur NOWHERE ELSE.
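The ANDing can be sketched as a standard product construction: pair up
the states of the two tables and follow both at once. The code below
is an illustration under assumptions, not the procedure PC-KIMMO
itself uses: the ==> table is first widened to the four columns of the
<== table (t:t falls under its @:@ column), the <== table's state 2
row is taken as 1 2 0 1 so that only i:i fails after t:t, and the
names T_ONLY, T_ALWAYS, and intersect are made up here.

```python
from collections import deque

HEADERS = ["t:c", "t:t", "i:i", "@:@"]

# ==> table (T2) widened to four columns; state 2 is non-final.
T_ONLY = {"rows": {1: [2, 1, 1, 1], 2: [0, 0, 1, 0]}, "final": {1}}
# <== table (T4); assumed state 2 row lets everything but i:i succeed.
T_ALWAYS = {"rows": {1: [1, 2, 1, 1], 2: [1, 2, 0, 1]}, "final": {1, 2}}


def intersect(a, b):
    """Build the product table; a cell is 0 if either machine would reject."""
    start = (1, 1)
    number = {start: 1}          # product state -> new state number
    rows, final = {}, set()
    queue = deque([start])
    while queue:
        sa, sb = queue.popleft()
        row = []
        for col in range(len(HEADERS)):
            nxt = (a["rows"][sa][col], b["rows"][sb][col])
            if 0 in nxt:
                row.append(0)
                continue
            if nxt not in number:
                number[nxt] = len(number) + 1
                queue.append(nxt)
            row.append(number[nxt])
        rows[number[(sa, sb)]] = row
        # A product state is final only if both components are final.
        if sa in a["final"] and sb in b["final"]:
            final.add(number[(sa, sb)])
    return rows, final


rows, final = intersect(T_ONLY, T_ALWAYS)
for state in sorted(rows):
    mark = ":" if state in final else "."
    print(f"{state}{mark}", *rows[state])
```

The result has three states: state 2 is the non-final "just saw t:c,
need i:i" state inherited from the ==> table, and state 3 is the "just
saw t:t, must not see i:i" state inherited from the <== table.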
We next turn to the problem of what can happen when you have
more than one rule: rule conflicts, the use of SUBSETS, and
overlapping character descriptions.
[this part on rule conflicts, subsets, and an illustration via
the english.rul file will be sent next]