PC-Kimmo Reference Manual a two-level processor for morphological analysis version 2.1.0 October 1997 by Evan Antworth and Stephen McConnel Copyright (C) 2000 SIL International Published by: Language Software Development SIL International 7500 W. Camp Wisdom Road Dallas, TX 75236 U.S.A. Permission is granted to make and distribute verbatim copies of this file provided the copyright notice and this permission notice are preserved in all copies. The author may be reached at the address above or via email as `steve@acadcomp.sil.org'. Introduction to the PC-Kimmo program ************************************ This document describes PC-Kimmo, an implementation of the two-level computational linguistic formalism for personal computers. It is available for MS-DOS, Microsoft Windows, Macintosh, and Unix.(1) The authors would appreciate feedback directed to the following addresses. For linguistic questions, contact: Gary Simons SIL International 7500 W. Camp Wisdom Road Dallas, TX 75236 gary.simons@sil.org U.S.A. For programming questions, contact: Stephen McConnel (972)708-7361 (office) Language Software Development (972)708-7561 (fax) SIL International 7500 W. Camp Wisdom Road Dallas, TX 75236 steve@acadcomp.sil.org U.S.A. or Stephen_McConnel@sil.org An online user manual for PC-Kimmo is available on the world wide web at the URL `http://www.sil.org/pckimmo/v2/doc/guide.html'. ---------- Footnotes ---------- (1) The Microsoft Windows implementation uses the Microsoft C QuickWin function, and the Macintosh implementation uses the Metrowerks C SIOUX function. The Two-level Formalism *********************** Two-level phonology is a linguistic tool developed by computational linguists. Its primary use is in systems for natural language processing such as PC-Kimmo. This chapter describes the linguistic and computational basis of two-level phonology.(1) ---------- Footnotes ---------- (1) This chapter is excerpted from Antworth 1991. 
Computational and linguistic roots ================================== As the fields of computer science and linguistics have grown up together during the past several decades, they have each benefited from cross-fertilization. Modern linguistics has especially been influenced by the formal language theory that underlies computation. The most famous application of formal language theory to linguistics was Chomsky's (1957) transformational generative grammar. Chomsky's strategy was to consider several types of formal languages to see if they were capable of modeling natural language syntax. He started by considering the simplest type of formal languages, called finite state languages. As a general principle, computational linguists try to use the least powerful computational devices possible. This is because the less powerful devices are better understood, their behavior is predictable, and they are computationally more efficient. Chomsky (1957:18ff) demonstrated that natural language syntax could not be effectively modeled as a finite state language; thus he rejected finite state languages as a theory of syntax and proposed that syntax requires the use of more powerful, non-finite state languages. However, there is no reason to assume that the same should be true for natural language phonology. A finite state model of phonology is especially desirable from the computational point of view, since it makes possible a computational implementation that is simple and efficient. While various linguists proposed that generative phonological rules could be implemented by finite state devices (see Johnson 1972, Kay 1983), the most successful model of finite state phonology was developed by Kimmo Koskenniemi, a Finnish computer scientist. 
He called his model two-level morphology (Koskenniemi 1983), though his use of the term morphology should be understood to encompass both what linguists would consider morphology proper (the decomposition of words into morphemes) and phonology (at least in the sense of morphophonemics). Our main interest in this article is the phonological formalism used by the two-level model, hereafter called two-level phonology. Two-level phonology traces its linguistic heritage to "classical" generative phonology as codified in `The Sound Pattern of English' (Chomsky and Halle 1968). The basic insight of two-level phonology is due to the phonologist C. Douglas Johnson (1972) who showed that the SPE theory of phonology could be implemented using finite state devices by replacing sequential rule application with simultaneous rule application. At its core, then, two-level phonology is a rule formalism, not a complete theory of phonology. The following sections of this article describe the mechanism of two-level rule application by contrasting it with rule application in classical generative phonology. It should be noted that Chomsky and Halle's theory of rule application became the focal point of much controversy during the 1970s with the result that current theories of phonology differ significantly from classical generative phonology. The relevance of two-level phonology to current theory is an important issue, but one that will not be fully addressed here. Rather, the comparison of two-level phonology to classical generative phonology is done mainly for expository purposes, recognizing that while classical generative phonology has been superseded by subsequent theoretical work, it constitutes a historically coherent view of phonology that continues to influence current theory and practice. One feature that two-level phonology shares with classical generative phonology is linear representation. That is, phonological forms are represented as linear strings of symbols. 
This is in contrast to the nonlinear representations used in much current work in phonology, namely autosegmental and metrical phonology (see Goldsmith 1990). On the computational side, two-level phonology is consistent with natural language processing systems that are designed to operate on linear orthographic input. Two-level rule application ========================== We will begin by reviewing the formal properties of generative rules. Stated succinctly, generative rules are sequentially ordered rewriting rules. What does this mean? First, rewriting rules are rules that change or transform one symbol into another symbol. For example, a rewriting rule of the form `a --> b' interprets the relationship between the symbols `a' and `b' as a dynamic change whereby the symbol `a' is rewritten or turned into the symbol `b'. This means that after this operation takes place, the symbol `a' no longer "exists," in the sense that it is no longer available to other rules. In linguistic theory generative rules are known as process rules. Process rules attempt to characterize the relationship between levels of representation (such as the phonemic and phonetic levels) by specifying how to transform representations from one level into representations on the other level. Second, generative phonological rules apply sequentially, that is, one after another, rather than applying simultaneously. This means that each rule creates as its output a new intermediate level of representation. This intermediate level then serves as the input to the next rule. As a consequence, the underlying form becomes inaccessible to later rules. Third, generative phonological rules are ordered; that is, the description specifies the sequence in which the rules must apply. Applying rules in any other order may result in incorrect output. 
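As a minimal sketch of these three properties (rewriting, sequential application, and ordering), the following Python fragment applies a list of rewriting rules one after another and records every intermediate level. It is an illustration only and is not part of PC-Kimmo. The rule `a --> b' is the one mentioned above. The second rule `b --> c' and the function name `derive' are hypothetical, and the rules are context-free for simplicity.

```python
def derive(underlying, rules):
    """Apply each rewriting rule in the given order.

    Returns every level of the derivation: the underlying form, one
    intermediate form per rule, the last of which is the surface form.
    """
    levels = [underlying]
    for source, target in rules:
        # Rewriting: every `source` symbol is replaced, so later rules
        # see only the newest intermediate form, never the underlying one.
        levels.append(levels[-1].replace(source, target))
    return levels


A_TO_B = ("a", "b")  # the rule `a --> b' from the text
B_TO_C = ("b", "c")  # hypothetical second rule, for illustration only

print(derive("a", [A_TO_B, B_TO_C]))  # ['a', 'b', 'c']
print(derive("a", [B_TO_C, A_TO_B]))  # ['a', 'a', 'b']
```

The two calls differ only in rule order, yet they yield different surface forms, which is why a generative description must state the order explicitly.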
As an example of a set of generative rules, consider the following rules:

     (1) Vowel Raising
         e --> i / ___C_0 i

     (2) Palatalization
         t --> c / ___i

Rule 1 (Vowel Raising) states that `e' becomes (is rewritten as) `i' in the environment preceding `Ci' (where `C' stands for the set of consonants and `C_0' stands for zero or more consonants). Rule 2 (Palatalization) states that `t' becomes `c' preceding `i'. A sample derivation of forms to which these rules apply looks like this (where UR stands for Underlying Representation, SR stands for Surface Representation):(1)

     UR:  temi
     (1)  timi
     (2)  cimi
     SR:  cimi

Notice that in addition to the underlying and surface levels, an intermediate level has been created as the result of sequentially applying rules 1 and 2. The application of rule 1 produces the intermediate form `timi', which then serves as the input to rule 2. Not only are these rules sequential, they are ordered, such that rule 1 must apply before rule 2. Rule 1 has a feeding relationship to rule 2; that is, rule 1 increases the number of forms that can undergo rule 2 by creating more instances of `i'. Consider what would happen if they were applied in the reverse order. Given the input form `temi', rule 2 would do nothing, since its environment is not satisfied. Rule 1 would then apply to produce the incorrect surface form `timi'.

Two-level rules differ from generative rules in the following ways. First, whereas generative rules apply in a sequential order, two-level rules apply simultaneously, which is better described as applying in parallel. Applying rules in parallel to an input form means that for each segment in the form all of the rules must apply successfully, even if only vacuously. Second, whereas sequentially applied generative rules create intermediate levels of derivation, simultaneously applied two-level rules require only two levels of representation: the underlying or lexical level and the surface level. There are no intermediate levels of derivation.
It is in this sense that the model is called two-level. Third, whereas generative rules relate the underlying and surface levels by rewriting underlying symbols as surface symbols, two-level rules express the relationship between the underlying and surface levels by positing direct, static correspondences between pairs of underlying and surface symbols. For instance, instead of rewriting underlying `a' as surface `b', a two-level rule states that an underlying `a' corresponds to a surface `b'. The two-level rule does not change `a' into `b', so `a' is available to other rules. In other words, after a two-level rule applies, both the underlying and surface symbols still "exist." Fourth, whereas generative rules have access only to the current intermediate form at each stage of the derivation, two-level rules have access to both underlying and surface environments. Generative rules cannot "look back" at underlying environments or "look ahead" to surface environments. In contrast, the environments of two-level rules are stated as lexical-to-surface correspondences. This means that a two-level rule can easily refer to an underlying `a' that corresponds to a surface `b', or to a surface `b' that corresponds to an underlying `a'. In generative phonology, the interaction between a pair of rules is controlled by requiring that they apply in a certain sequential order. In two-level phonology, rule interactions are controlled not by ordering the rules but by carefully specifying their environments as strings of two-level correspondences. Fifth, whereas generative, rewriting rules are unidirectional (that is, they operate only in an underlying to surface direction), two-level rules are bidirectional. Two-level rules can operate either in an underlying to surface direction (generation mode) or in a surface to underlying direction (recognition mode). 
Thus in generation mode two-level rules accept an underlying form as input and return a surface form, while in recognition mode they accept a surface form as input and return an underlying form. The practical application of bidirectional phonological rules is obvious: a computational implementation of bidirectional rules is not limited to generation mode to produce words; it can also be used in recognition direction to parse words. ---------- Footnotes ---------- (1) This made-up example is used for expository purposes. To make better phonological sense, the forms should have internal morpheme boundaries, for instance `te+mi' (otherwise there would be no basis for positing an underlying `e'). See the section below on the use of zero to see how morpheme boundaries are handled. How a two-level description works ================================= To understand how a two-level phonological description works, we will use the example given above involving Raising and Palatalization. The two-level model treats the relationship between the underlying form `temi' and the surface form `cimi' as a direct, symbol-to-symbol correspondence: UR: t e m i SR: c i m i Each pair of lexical and surface symbols is a correspondence pair. We refer to a correspondence pair with the notation `:', for instance `e:i' and `m:m'. There must be an exact one-to-one correspondence between the symbols of the underlying form and the symbols of the surface form. Deletion and insertion of symbols (explained in detail in the next section) is handled by positing correspondences with zero, a null segment. The two-level model uses a notation for expressing two-level rules that is similar to the notation linguists use for phonological rules. 
Corresponding to the generative rule for Palatalization (rule 2 above), here is the two-level rule for the `t:c' correspondence: (3) Palatalization t:c <=> ___ @:i This rule is a statement about the distribution of the pair `t:c' on the left side of the arrow with respect to the context or environment on the right side of the arrow. A two-level rule has three parts: the correspondence, the operator, and the environment. The correspondence part of rule 3 is the pair `t:c', which is the correspondence that the rule sanctions. The operator part of rule 3 is the double-headed arrow. It indicates the nature of the logical relationship between the correspondence and the environment (thus it means something very different from the rewriting arrow `-->' of generative phonology). The `<=>' arrow is equivalent to the biconditional operator of formal logic and means that the correspondence occurs always and only in the stated context; that is, `t:c' is allowed if and only if it is found in the context `___i'. In short, rule 3 is an obligatory rule. The environment part of rule 3 is everything to the right of the arrow. The long underline indicates the gap where the pair `t:c' occurs. Notice that even the environment part of the rule is specified as two-level correspondence pairs. The environment part of rule 3 requires further explanation. Instead of using a correspondence such as `i:i', it uses the correspondence `@:i'. The `@' symbol is a special "wildcard" symbol that stands for any phonological segment included in the description. In the context of rule 3, the correspondence `@:i' stands for all the feasible pairs in the description whose surface segment is `i', in this case `e:i' and `i:i'. Thus by using the correspondence `@:i', we allow Palatalization to apply in the environment of either a lexical `e' or lexical `i'. In other words, we are claiming that Palatalization is sensitive to a surface (phonetic) environment rather than an underlying (phonemic) environment. 
Thus rule 3 will apply to both underlying forms `timi' and `temi' to produce a surface form with an initial `c'.

Corresponding to the generative rule for Raising (rule 1 above) is the following two-level rule for the `e:i' correspondence:

     (4) Vowel Raising
         e:i <=> ___ C:C* @:i

(The asterisk in `C:C*' indicates zero or more instances of the correspondence `C:C'.) Similar to rule 3 above, rule 4 uses the correspondence `@:i' in its environment. Thus rule 4 states that the correspondence `e:i' occurs preceding a surface `i', regardless of whether it is derived from a lexical `e' or `i'. Why is this necessary? Consider the case of an underlying form such as `pememi'. In order to derive the surface form `pimimi', Raising must apply twice: once before a lexical `i' and again before a lexical `e', both of which correspond to a surface `i'. Thus rule 4 will apply to both instances of lexical `e', capturing the regressive spreading of Raising through the word.

When rules 3 and 4 are applied in parallel, they work in concert to produce the right output. For example,

     UR:      t   e   m   i
              |   |   |   |
     Rules    3   4   |   |
              |   |   |   |
     SR:      c   i   m   i

Conceptually, a two-level phonological description of a data set such as this can be understood as follows. First, the two-level description declares an alphabet of all the phonological segments used in the data in both underlying and surface forms, in the case of our example, `t', `m', `c', `e', and `i'. Second, the description declares a set of feasible pairs, which is the complete set of all underlying-to-surface correspondences of segments that occur in the data. The set of feasible pairs for these data is the union of the set of default correspondences, whose underlying and surface segments are identical (namely `t:t', `m:m', `e:e', and `i:i') and the set of special correspondences, whose underlying and surface segments are different (namely `t:c' and `e:i').
Notice that since the segment `c' only occurs as a surface segment in the feasible pairs, the description will disallow any underlying form that contains a `c'. A minimal two-level description, then, consists of nothing more than this declaration of the feasible pairs. Since it contains all possible underlying-to-surface correspondences, such a description will produce the correct output form, but because it does not constrain the environments where the special correspondences can occur, it will also allow many incorrect output forms. For example, given the underlying form `temi', it will produce the surface forms `temi', `timi', `cemi', and `cimi', of which only the last is correct. Third, in order to restrict the output to only correct forms, we include rules in the description that specify where the special correspondences are allowed to occur. Thus the rules function as constraints or filters, blocking incorrect forms while allowing correct forms to pass through. For instance, rule 3 (Palatalization) states that a lexical `t' must be realized as a surface `c' when it precedes `@:i'; thus, given the underlying form `temi' it will block the potential surface output forms `timi' (because the surface sequence `ti' is prohibited) and `cemi' (because surface `c' is prohibited before anything except surface `i'). Rule 4 (Raising) states that a lexical `e' must be realized as a surface `i' when it precedes the sequence `C:C' `@:i'; thus, given the underlying form `temi' it will block the potential surface output forms `temi' and `cemi' (because the surface sequence `emi' is prohibited). Therefore of the four potential surface forms, three are filtered out; rules 3 and 4 leave only the correct form `cimi'. Two-level phonology facilitates a rather different way of thinking about phonological rules. We think of generative rules as processes that change one segment into another. 
In contrast, two-level rules do not perform operations on segments; rather, they state static constraints on correspondences between underlying and surface forms. Generative phonology and two-level phonology also differ in how they characterize relationships between rules. Rules in generative phonology are described in terms of their relative order of application and their effect on the input of other rules (the so-called feeding and bleeding relations). Thus the generative rule 1 for Raising precedes and feeds rule 2 for Palatalization. In contrast, rules in the two-level model are categorized according to whether they apply in lexical versus surface environments. So we say that the two-level rules for Raising and Palatalization are sensitive to a surface rather than underlying environment.

With zero you can do (almost) anything
======================================

Phonological processes that delete or insert segments pose a special challenge to two-level phonology. Since an underlying form and its surface form must correspond segment for segment, how can segments be deleted from an underlying form or inserted into a surface form? The answer lies in the use of the special null symbol `0' (zero). Thus the correspondence `x:0' represents the deletion of `x', while `0:x' represents the insertion of `x'. (It should be understood that these zeros are provided by the rule application mechanism and exist only internally; that is, zeros are not included in input forms nor are they printed in output forms.) As an example of deletion, consider these forms from Tagalog (where `+' represents a morpheme boundary):

     UR:  m a n + b i l i
     SR:  m a m 0 0 i l i

Using process terminology, these forms exemplify phonological coalescence, whereby the sequence `nb' becomes `m'. Since in the two-level model a sequence of two underlying segments cannot correspond to a single surface segment, coalescence must be interpreted as simultaneous assimilation and deletion.
Thus we need two rules: an assimilation rule for the correspondence `n:m' and a deletion rule for the correspondence `b:0' (note that the morpheme boundary `+' is treated as a special symbol that is always deleted).

     (5) Nasal Assimilation
         n:m <=> ___ +:0 b:@

     (6) Deletion
         b:0 <=> @:m +:0 ___

Notice the interaction between the rules: Nasal Assimilation occurs in a lexical environment, namely a lexical `b' (which can correspond to either a surface `b' or `0'), while Deletion occurs in a surface environment, namely a surface `m' (which could be the realization of either a lexical `n' or `m'). In this way the two rules interact with each other to produce the correct output.

Insertion correspondences, where the lexical segment is `0', enable one to write rules for processes such as stress insertion, gemination, infixation, and reduplication. For example, Tagalog has a verbalizing infix `um' that attaches between the first consonant and vowel of a stem; thus the infixed form of `bili' is `bumili'. To account for this formation with two-level rules, we represent the underlying form of the infix `um' as the prefix `X+', where `X' is a special symbol that has no phonological purpose other than standing for the infix. We then write a rule that inserts the sequence `um' in the presence of `X+', which is deleted. Here is the two-level correspondence:

     UR:  X + b 0 0 i l i
     SR:  0 0 b u m i l i

and here is the two-level rule, which simultaneously deletes `X' and inserts `um':

     (7) Infixation
         X:0 <=> ___ +:0 C:C 0:u 0:m V:V

These examples involving deletion and insertion show that the invention of zero is just as important for phonology as it was for arithmetic. Without zero, two-level phonology would be limited to the most trivial phonological processes; with zero, the two-level model has the expressive power to handle complex phonological or morphological phenomena (though not necessarily with the degree of felicity that a linguist might desire).
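
The view of two-level rules as filters on feasible pairs, described in this chapter, can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the finite state implementation that PC-Kimmo uses. All function and variable names are hypothetical, the pair `p:p' is added so that `pememi' can be processed, `C:C' is taken to mean any pair whose lexical and surface symbols are both consonants, and zero correspondences are left out. Only rules 3 and 4 are modeled.

```python
from itertools import product

# Feasible pairs as (lexical, surface) tuples.
FEASIBLE_PAIRS = {
    ("t", "t"), ("m", "m"), ("p", "p"), ("e", "e"), ("i", "i"),  # default
    ("t", "c"), ("e", "i"),                                      # special
}
CONSONANTS = {"t", "m", "p", "c"}


def before_surface_i(pairs, k):
    """Environment of rule 3:  ___ @:i"""
    return k + 1 < len(pairs) and pairs[k + 1][1] == "i"


def before_consonants_then_surface_i(pairs, k):
    """Environment of rule 4:  ___ C:C* @:i"""
    j = k + 1
    while (j < len(pairs) and pairs[j][0] in CONSONANTS
           and pairs[j][1] in CONSONANTS):
        j += 1  # skip zero or more consonant pairs
    return j < len(pairs) and pairs[j][1] == "i"


RULES = [
    (("t", "c"), before_surface_i),                  # (3) Palatalization
    (("e", "i"), before_consonants_then_surface_i),  # (4) Vowel Raising
]


def obeys(pairs, target, environment):
    """The <=> operator: wherever the lexical symbol of `target` occurs,
    the pair is `target` if and only if the environment holds."""
    return all(
        (pair == target) == environment(pairs, k)
        for k, pair in enumerate(pairs)
        if pair[0] == target[0]
    )


def analyses(form, side, rules):
    """Enumerate every string of feasible pairs matching `form` on the
    given side (0 = lexical, 1 = surface); keep those all rules accept."""
    options = [[p for p in sorted(FEASIBLE_PAIRS) if p[side] == ch]
               for ch in form]
    return [
        "".join(pair[1 - side] for pair in pairs)
        for pairs in product(*options)
        if all(obeys(pairs, target, env) for target, env in rules)
    ]


def generate(lexical, rules=RULES):
    return analyses(lexical, 0, rules)


def recognize(surface, rules=RULES):
    return analyses(surface, 1, rules)


print(generate("temi", rules=[]))  # ['cemi', 'cimi', 'temi', 'timi']
print(generate("temi"))            # ['cimi']
print(generate("pememi"))          # ['pimimi']
print(recognize("cimi"))           # ['temi', 'timi']
```

With no rules, the feasible pairs alone overgenerate; with rules 3 and 4, only `cimi' survives. The same constraints run in the recognition direction return both `temi' and `timi', since nothing in the rules distinguishes them without a lexicon.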
Running PC-Kimmo **************** PC-Kimmo is an interactive program. It has a few command line options, but it is controlled primarily by commands typed at the keyboard (or loaded from a file previously prepared). PC-Kimmo Command Line Options ============================= The PC-Kimmo program uses an old-fashioned command line interface following the convention of options starting with a dash character (`-'). The available options are listed below in alphabetical order. Those options which require an argument have the argument type following the option letter. `-g filename' loads the grammar from a PC-Kimmo grammar file. `-l filename' loads an analysis lexicon from a PC-Kimmo lexicon file. `-r filename' loads the two-level rules from a PC-Kimmo rules file. `-s filename' loads a synthesis lexicon from a PC-Kimmo lexicon file. `-t filename' opens a file containing one or more PC-Kimmo commands. See `Interactive Commands' below. The following options exist only in beta-test versions of the program, since they are used only for debugging. `-/' increments the debugging level. The default is zero (no debugging output). `-z filename' opens a file for recording a memory allocation log. `-Z address,count' traps the program at the point where `address' is allocated or freed for the `count''th time. Interactive Commands ==================== Each of the commands available in PC-Kimmo is described below. Each command consists of one or more keywords followed by zero or more arguments. Keywords may be abbreviated to the minimum length necessary to prevent ambiguity. cd -- `cd' DIRECTORY changes the current directory to the one specified. Spaces in the directory pathname are not permitted. For MS-DOS or Windows, you can give a full path starting with the disk letter and a colon (for example, `a:'); a path starting with `\' which indicates a directory at the top level of the current disk; a path starting with `..' which indicates the directory above the current one; and so on. 
Directories are separated by the `\' character. (The forward slash `/' works just as well as the backslash `\' for MS-DOS or Windows.) For the Macintosh, you can give a full path starting with the name of a hard disk, a path starting with `:' which means the current folder, or one starting `::' which means the folder containing the current one (and so on). For Unix, you can give a full path starting with a `/' (for example, `/usr/pckimmo'); a path starting with `..' which indicates the directory above the current one; and so on. Directories are separated by the `/' character. clear ----- `clear' erases all existing rules, lexicon, and grammar information, allowing the user to prepare to load information for a new language. Strictly speaking, it is not needed since the `load rules' command erases any previously existing rules, the `load lexicon' command erases any previously existing analysis lexicon, the `load synthesis-lexicon' command erases any previously existing synthesis lexicon, and the `load grammar' command erases any previously existing grammar. `cle' is the minimal abbreviation for `clear'. close ----- `close' closes the current log file opened by a previous `log' command. `clo' is the minimal abbreviation for `close'. compare ------- The `compare' commands all test the current language description files by processing data against known (precomputed) results. `co' is the minimal abbreviation for `compare'. `file compare' is a synonym for `compare'. compare generate ................ `compare generate' reads lexical and surface forms from the specified file. After reading a lexical form, PC-Kimmo generates the corresponding surface form(s) and compares the result to the surface form(s) read from the file. If `VERBOSE' is `ON', then each form from the file is echoed on the screen with a message indicating whether or not the surface forms generated by PC-Kimmo and read from the file are in agreement. 
If `VERBOSE' is `OFF', then only the disagreements in surface form are displayed fully. Each result which agrees is indicated by a single dot written to the screen. The default filetype extension for `compare generate' is `.gen', and the default filename is `data.gen'. `co g' is the minimal abbreviation for `compare generate'. `file compare generate' is a synonym for `compare generate'. compare pairs ............. `compare pairs' reads pairs of surface and lexical forms from the specified file. After reading a lexical form, PC-Kimmo produces any corresponding surface form(s) and compares the result(s) to the surface form read from the file. For each surface form, PC-Kimmo also produces any corresponding lexical form(s) and compares the result to the lexical form read from the file. If `VERBOSE' is `ON', then each form from the file is echoed on the screen with a message indicating whether or not the forms produced by PC-Kimmo and read from the file are in agreement. If `VERBOSE' is `OFF', then each result which agrees is indicated by a single dot written to the screen, and only disagreements in lexical forms are displayed fully. The default filetype extension for `compare pairs' is `.pai', and the default filename is `data.pai'. `co p' is the minimal abbreviation for `compare pairs'. `file compare pairs' is a synonym for `compare pairs'. compare recognize ................. `compare recognize' reads surface and lexical forms from the specified file. After reading a surface form, PC-Kimmo produces any corresponding lexical form(s) and compares the result(s) to the lexical form(s) read from the file. If `VERBOSE' is `ON', then each form from the file is echoed on the screen with a message indicating whether or not the lexical forms produced by PC-Kimmo and read from the file are in agreement. If `VERBOSE' is `OFF', then each result which agrees is indicated by a single dot written to the screen, and only disagreements in lexical forms are displayed fully. 
The default filetype extension for `compare recognize' is `.rec', and the default filename is `data.rec'. `co r' is the minimal abbreviation for `compare recognize'. `file compare recognize' is a synonym for `compare recognize'. compare synthesize .................. `compare synthesize' reads morphological and surface forms from the specified file. After reading a morphological form, PC-Kimmo produces any corresponding surface form(s) and compares the result(s) to the surface form(s) read from the file. If `VERBOSE' is `ON', then each form from the file is echoed on the screen with a message indicating whether or not the surface forms produced by PC-Kimmo and read from the file are in agreement. If `VERBOSE' is `OFF', then each result which agrees is indicated by a single dot written to the screen, and only disagreements in surface forms are displayed fully. The default filetype extension for `compare synthesize' is `.syn', and the default filename is `data.syn'. `co s' is the minimal abbreviation for `compare synthesize'. `file compare synthesize' is a synonym for `compare synthesize'. directory --------- `directory' lists the contents of the current directory. This command is available only for the MS-DOS and Unix implementations. It does not exist for the Microsoft Windows or Macintosh implementations. edit ---- `edit' FILENAME attempts to edit the specified file using the program indicated by the environment variable `EDITOR'. If this environment variable is not defined, then `edit' is used to edit the file on MS-DOS, and `emacs' is used to edit the file on Unix. This command is not available for the Microsoft Windows or Macintosh implementations. exit ---- `exit' stops PC-Kimmo, returning control to the operating system. This is the same as `quit'. file ---- The `file' commands process data from a file, optionally writing the results to another file. Each of these commands is described below. file compare ............ 
The `file compare' commands all test the current language description files by processing data against known (precomputed) results. `f c' is the minimal abbreviation for `file compare'. `file compare' is a synonym for `compare'. See `compare generate', `compare pairs', `compare recognize', and `compare synthesize' above. file generate ............. `file generate' [] reads lexical forms from the specified input file and writes the corresponding computed surface forms either to the screen or to an optionally specified output file. This command behaves the same as `generate' except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen. See `generate' below. `f g' is the minimal abbreviation for `file generate'. file recognize .............. `file recognize' [] reads surface forms from the specified input file and writes the corresponding computed morphological and lexical forms either to the screen or to an optionally specified output file. This command behaves the same as `recognize' except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen. See `recognize' below. `f r' is the minimal abbreviation for `file recognize'. file synthesize ............... `file synthesize' [] reads morphological forms from the specified input file and writes the corresponding computed surface forms either to the screen or to an optionally specified output file. This command behaves the same as `synthesize' except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen. See `synthesize' below. `f s' is the minimal abbreviation for `file synthesize'. generate -------- `generate' [] attempts to produce a surface form from a lexical form provided by the user. If a lexical form is typed on the same line as the command, then that lexical form is used to generate a surface form. 
If the command is typed without a form, then PC-Kimmo prompts the user for lexical forms with a special generator prompt, and processes each form in turn. This cycle of typing and generating is terminated by typing an empty "form" (that is, nothing but the `Enter' or `Return' key).

The rules must be loaded before using this command. It does not require either a lexicon or a grammar.

`g' is the minimal abbreviation for `generate'.

help
----

`help' COMMAND displays a description of the specified command. If `help' is typed by itself, PC-Kimmo displays a list of commands with short descriptions of each command.

`h' is the minimal abbreviation for `help'.

list
----

The `list' commands all display information about the currently loaded data. Each of these commands is described below.

`li' is the minimal abbreviation for `list'.

list lexicon
............

`list lexicon' displays the names of all the (sub)lexicons currently loaded. The order of presentation is the order in which they are referenced in the `ALTERNATION' declarations.

`li l' is the minimal abbreviation for `list lexicon'.

list pairs
..........

`list pairs' displays all the feasible pairs for the current set of active rules. The feasible pairs are displayed as pairs of lines, with the lexical characters shown above the corresponding surface characters.

`li p' is the minimal abbreviation for `list pairs'.

list rules
..........

`list rules' displays the names of the current rules, preceded by the number of the rule (used by the `set rules' command) and an indication of whether the rule is `ON' or `OFF'.

`li r' is the minimal abbreviation for `list rules'.

load
----

The `load' commands all load information stored in specially formatted files. Each of the `load' commands is described below.

`l' is the minimal abbreviation for `load'.

load grammar
............

`load grammar' [] erases any existing word grammar and reads a new word grammar from the specified file.
The default filetype extension for `load grammar' is `.grm', and the default filename is `grammar.grm'. A grammar file can also be loaded by using the `-g' command line option when starting PC-Kimmo. `l g' is the minimal abbreviation for `load grammar'. load lexicon ............ `load lexicon' [] erases any existing analysis lexicon information and reads a new analysis lexicon from the specified file. A rules file must be loaded before an analysis lexicon file can be loaded. The default filetype extension for `load lexicon' is `.lex', and the default filename is `lexicon.lex'. An analysis lexicon file can also be loaded by using the `-l' command line option when starting PC-Kimmo. This requires that a `-r' option also be used to load a rules file. `l l' is the minimal abbreviation for `load lexicon'. load rules .......... `load rules' [] erases any existing rules and reads a new set of two-level rules from the specified file. The default filetype extension for `load rules' is `.rul', and the default filename is `rules.rul'. A rules file can also be loaded by using the `-r' command line option when starting PC-Kimmo. `l r' is the minimal abbreviation for `load rules'. load synthesis-lexicon ...................... `load synthesis-lexicon' [] erases any existing synthesis lexicon and reads a new synthesis lexicon from the specified file. A rules file must be loaded before a synthesis lexicon file can be loaded. The default filetype extension for `load synthesis-lexicon' is `.lex', and the default filename is `lexicon.lex'. A synthesis lexicon file can also be loaded by using the `-s' command line option when starting PC-Kimmo. This requires that a `-r' option also be used to load a rules file. `l s' is the minimal abbreviation for `load synthesis-lexicon'. log --- `log' [] opens a log file. Each item processed by a `generate', `recognize', `synthesize', `compare', or `file' command is recorded in the log file as well as being displayed on the screen. 
If a filename is given on the same line as the `log' command, then that file is used for the log file. Any previously existing file with the same name will be overwritten. If no filename is provided, then the file `pckimmo.log' in the current directory is used for the log file. Use `close' to stop recording in a log file. If a `log' command is given when a log file is already open, then the earlier log file is closed before the new log file is opened. quit ---- `quit' stops PC-Kimmo, returning control to the operating system. This is the same as `exit'. recognize --------- `recognize' [] attempts to produce lexical and morphological forms from a surface wordform provided by the user. If a wordform is typed on the same line as a command, then that word is parsed. If the command is typed without a form, then PC-Kimmo prompts the user for surface forms with a special recognizer prompt, and processes each form in turn. This cycle of typing and parsing is terminated by typing an empty "word" (that is, nothing but the `Enter' or `Return' key). Both the rules and the lexicon must be loaded before using this command. A grammar may also be loaded and used to eliminate invalid parses from the two-level processor results. If a grammar is used, then parse trees and feature structures may be displayed as well as the lexical and morphological forms. save ---- `save' [FILE.TAK] writes the current settings to the designated file in the form of PC-Kimmo commands. If the file is not specified, the settings are written to `pckimmo.tak' in the current directory. set --- The `set' commands control program behavior by setting internal program variables. Each of these commands (and variables) is described below. set ambiguities ............... `set ambiguities' NUMBER limits the number of analyses printed to the given number. The default value is 10. Note that this does not limit the number of analyses produced, just the number printed. set ample-dictionary .................... 
`set ample-dictionary' VALUE determines whether or not the AMPLE dictionary files are divided according to morpheme type.

`set ample-dictionary split' declares that the AMPLE dictionary is divided into a prefix dictionary file, an infix dictionary file, a suffix dictionary file, and one or more root dictionary files. The existence of the three affix dictionaries depends on settings in the AMPLE analysis data file. If they exist, the `load ample dictionary' command requires that they be given in this relative order: prefix, infix, suffix, root(s).

`set ample-dictionary unified' declares that any of the AMPLE dictionary files may contain any type of morpheme. This implies that each dictionary entry may contain a field specifying the type of morpheme (the default is ROOT), and that the dictionary code table contains a `\unified' field. One of the changes listed under `\unified' must convert a backslash code to `T'.

The default is for the AMPLE dictionary to be *split*.(1)

---------- Footnotes ----------

(1) The unified dictionary is a new feature of AMPLE version 3.

set check-cycles
................

`set check-cycles' VALUE enables or disables a check to prevent cycles in the parse chart. `set check-cycles on' turns on this check, and `set check-cycles off' turns it off. This check slows down the parsing of a sentence, but it makes the parser less vulnerable to hanging on perverse grammars. The default setting is `on'.

set comment
...........

`set comment' CHARACTER sets the comment character to the indicated value. If CHARACTER is missing (or equal to the current comment character), then comment handling is disabled. The default comment character is `;' (semicolon).

set failures
............

`set failures' VALUE enables or disables GRAMMAR FAILURE MODE. `set failures on' turns on grammar failure mode, and `set failures off' turns it off. When grammar failure mode is on, the partial results of forms that fail the grammar module are displayed.
A form may fail the grammar either by failing the feature constraints or by failing the constituent structure rules. In the latter case, a partial tree (bush) will be returned. The default setting is `off'.

Be careful with this option. Setting failures to `on' can cause PC-Kimmo to go into an infinite loop for certain recursive grammars and certain input sentences. WE MAY TRY TO DO SOMETHING TO DETECT THIS TYPE OF BEHAVIOR, AT LEAST PARTIALLY.

set features
............

`set features' VALUE determines how features will be displayed.

`set features all' enables the display of the features for all nodes of the parse tree.

`set features top' enables the display of the feature structure for only the top node of the parse tree. This is the default setting.

`set features flat' causes features to be displayed in a flat, linear string that uses less space on the screen.

`set features full' causes features to be displayed in an indented form that makes the embedded structure of the feature set clear. This is the default setting.

`set features on' turns on features display mode, allowing features to be shown. This is the default setting.

`set features off' turns off features display mode, preventing features from being shown.

set gloss
.........

`set gloss' VALUE enables the display of glosses in the parse tree output if VALUE is `on', and disables the display of glosses if VALUE is `off'. If any glosses exist in the lexicon file, then `gloss' is automatically turned `on' when the lexicon is loaded. If no glosses exist in the lexicon, then this flag is ignored.

set marker category
...................

`set marker category' MARKER establishes the marker for the field containing the category (part of speech) feature. The default is `\c'.

set marker features
...................

`set marker features' MARKER establishes the marker for the field containing miscellaneous features. (This field is not needed for many words.) The default is `\f'.

set marker gloss
................
`set marker gloss' MARKER establishes the marker for the field containing the word gloss. The default is `\g'.

set marker record
.................

`set marker record' MARKER establishes the field marker that begins a new record in the lexicon file. This may or may not be the same as the `word' marker. The default is `\w'.

set marker word
...............

`set marker word' MARKER establishes the marker for the word field. The default is `\w'.

set timing
..........

`set timing' VALUE enables timing mode if VALUE is `on', and disables timing mode if VALUE is `off'. If timing mode is `on', then the elapsed time required to process a command is displayed when the command finishes. If timing mode is `off', then the elapsed time is not shown. The default is `off'. (This option is useful only to satisfy idle curiosity.)

set top-down-filter
...................

`set top-down-filter' VALUE enables or disables top-down filtering based on the categories. `set top-down-filter on' turns on this filtering, and `set top-down-filter off' turns it off. The top-down filter speeds up the parsing of a sentence, but might cause the parser to miss some valid parses. The default setting is `on'.

This should not be required in the final version of PC-Kimmo.

set tree
........

`set tree' VALUE specifies how parse trees should be displayed.

`set tree full' turns on the parse tree display, displaying the result of the parse as a full tree. This is the default setting. A short sentence would look something like this:

        Sentence
            |
       Declarative
       _____|_____
       NP       VP
       |      ___|____
       N      V    COMP
      cows   eat    |
                    NP
                    |
                    N
                  grass

`set tree flat' turns on the parse tree display, displaying the result of the parse as a flat tree structure in the form of a bracketed string.
The same short sentence would look something like this:

     (Sentence (Declarative (NP (N cows)) (VP (V eat) (COMP (NP (N grass))))))

`set tree indented' turns on the parse tree display, displaying the result of the parse in an indented format sometimes called a *northwest tree*. The same short sentence would look like this:

     Sentence
         Declarative
             NP
                 N cows
             VP
                 V eat
                 COMP
                     NP
                         N grass

`set tree off' disables the display of parse trees altogether.

set trim-empty-features
.......................

`set trim-empty-features' VALUE disables the display of empty feature values if VALUE is `on', and enables the display of empty feature values if VALUE is `off'. The default is not to display empty feature values.

set unification
...............

`set unification' VALUE enables or disables feature unification.

`set unification on' turns on unification mode. This is the default setting.

`set unification off' turns off feature unification in the grammar. Only the context-free phrase structure rules are used to guide the parse; the feature constraints are ignored. This can be dangerous, as it is easy to introduce infinite cycles in recursive phrase structure rules.

set verbose
...........

`set verbose' VALUE enables or disables the screen display of parse trees in the `file parse' command. `set verbose on' enables the screen display of parse trees, and `set verbose off' disables such display. The default setting is `off'.

set warnings
............

`set warnings' VALUE enables warning mode if VALUE is `on', and disables warning mode if VALUE is `off'. If warning mode is enabled, then warning messages are displayed on the output. If warning mode is disabled, then no warning messages are displayed. The default setting is `on'.

set write-ample-parses
......................

`set write-ample-parses' VALUE enables writing `\parse' and `\features' fields at the end of each sentence in the disambiguated analysis file if VALUE is `on', and disables writing these fields if VALUE is `off'.
The default setting is `off'. This variable setting affects only the `file disambiguate' command. show ---- The `show' commands display internal settings on the screen. Each of these commands is described below. show lexicon ............ `show lexicon' prints the contents of the lexicon stored in memory on the standard output. THIS IS NOT VERY USEFUL, AND MAY BE REMOVED. show status ........... `show status' displays the names of the current grammar, sentences, and log files, and the values of the switches established by the `set' command. `show' (by itself) and `status' are synonyms for `show status'. status ------ `status' displays the names of the current grammar, sentences, and log files, and the values of the switches established by the `set' command. synthesize ---------- `synthesize' [] attempts to produce surface forms from a morphological form provided by the user. If a morphological form is typed on the same line as the command, then that form is synthesized. If the command is typed without a form, then PC-Kimmo repeatedly prompts the user for morphological forms with a special synthesizer prompt, processing each form. This cycle of typing and synthesizing is terminated by typing an empty "form" (that is, nothing but the `Enter' or `Return' key). Note that the morphemes in the morphological form must be separated by spaces, and must match gloss entries loaded from the lexicon. Also, the morphemes must be given in the proper order. Both the rules and the synthesis lexicon must be loaded before using this command. It does not use a grammar. system ------ `system' [COMMAND] allows the user to execute an operating system command (such as checking the available space on a disk) from within PC-Kimmo. This is available only for MS-DOS and Unix, not for Microsoft Windows or the Macintosh. If no system-level command is given on the line with the `system' command, then PC-Kimmo is pushed into the background and a new system command processor (shell) is started. 
Control is usually returned to PC-Kimmo in this case by typing `exit' as the operating system command. `sys' is the minimal abbreviation for `system'. `!' (exclamation point) is a synonym for `system'. (`!' does not require a space to separate it from the command.) take ---- `take' [FILE.TAK] redirects command input to the specified file. The default filetype extension for `take' is `.tak', and the default filename is `pckimmo.tak'. `take' files can be nested three deep. That is, the user types `take file1', `file1' contains the command `take file2', and `file2' has the command `take file3'. It would be an error for `file3' to contain a `take' command. This should not prove to be a serious limitation. A `take' file can also be specified by using the `-t' command line option when starting PC-Kimmo. When started, PC-Kimmo looks for a `take' file named `pckimmo.tak' in the current directory to initialize itself with. The PC-Kimmo Rules File *********************** The general structure of the rules file is a list of keyword declarations. Figure 1 shows the conventional structure of the rules file. Note that the notation `{x | y}' means either `x' or `y' (but not both). Figure 1 Structure of the rules file COMMENT ALPHABET NULL ANY BOUNDARY SUBSET . (more subsets) . . RULE {: | .} . (more states) . . . (more rules) . . END The following specifications apply to the rules file. * Extra spaces, blank lines, and comment lines are ignored. In the descriptions below, reference to the use of a space character implies any whitespace character (that is, any character treated like a space character). The following control characters when used in a file are whitespace characters: `^I' (ASCII 9, tab), `^J' (ASCII 10, line feed), `^K' (ASCII 11, vertical tab), `^L' (ASCII 12, form feed), and `^M' (ASCII 13, carriage return). * Comments may be placed anywhere in the file. All data following a comment character to the end of the line is ignored. 
(See below on the `COMMENT' declaration.) * The set of valid keywords used to form declarations includes `COMMENT', `ALPHABET', `NULL', `ANY', `BOUNDARY', `SUBSET', `RULE', and `END'. * These declarations are obligatory and can occur only once in a file: `ALPHABET', `NULL', `ANY', `BOUNDARY'. * These declarations are optional and can occur one or more times in a file: `COMMENT', `SUBSET', and `RULE'. * The `COMMENT' declaration sets the comment character used in the rules file, lexicon files, and grammar file. The `COMMENT' declaration can only be used in the rules file, not in the lexicon or grammar file. The `COMMENT' declaration is optional. If it is not used, the comment character is set to `;' (semicolon) as a default. * The `COMMENT' declaration can be used anywhere in the rules file and can be used more than once. That is, different parts of the rules file can use different comment characters. The `COMMENT' declaration can (and in practice usually does) occur as the first keyword in the rules file, followed by either one or more `COMMENT' declarations or the `ALPHABET' declaration. * Note that if you use the `COMMENT' declaration to declare the character that is already in use as the comment character, an error will result. For instance, if semicolon is the current comment character, the declaration `COMMENT ;' will result in an error. * The comment character can no longer be set using a command line option or with a command in the user interface, as was the case in version 1 of PC-Kimmo. * The `ALPHABET' declaration must either occur first in the file or follow one or more `COMMENT' declarations only. The other declarations can appear in any order. The `COMMENT', `NULL', `ANY', `BOUNDARY', and `SUBSET' declarations can even be interspersed among the rules. However, these declarations must appear before any rule that uses them or an error will result. * The `ALPHABET' declaration defines the set of symbols used in either lexical or surface representations. 
The keyword `ALPHABET' is followed by a list of all alphabetic symbols. Each symbol must be separated from the others by at least one space. The list can span multiple lines, and ends with the next valid keyword. All alphanumeric characters (such as `a', `B', and `2'), symbols (such as `$' and `+'), and punctuation characters (such as `.' and `?') are available as alphabet members. The characters in the IBM extended character set (above ASCII 127) are also available. Control characters (below ASCII 32) can also be used, with the exception of whitespace characters (see above), `^Z' (end of file), and `^@' (null). The alphabet can contain a maximum of 255 symbols. An alphabetic symbol can also be a multigraph, that is, a sequence of two or more characters. The individual characters composing a multigraph do not necessarily have to also be declared as alphabetic characters. For example, an alphabet could include the characters `s' and `z' and the multigraph `sz%', but not include `%' as an alphabetic character. Note that a multigraph cannot also be interpreted as a sequence of the individual characters that comprise it.
* The keyword `NULL' is followed by a single character that represents a null (empty, zero) element. The `NULL' symbol is considered to be an alphabetic character, but cannot also be listed in the `ALPHABET' declaration. The `NULL' symbol declared in the rules file is also used in the lexicon file to represent a null lexical entry.
* The keyword `ANY' is followed by a single "wildcard" character that represents a match of any character in the alphabet. The `ANY' symbol is not considered to be an alphabetic character, though it is used in the column headers of state tables. It cannot be listed in the `ALPHABET' declaration. It is not used in the lexicon file.
* The keyword `BOUNDARY' is followed by a single character that represents an initial or final word boundary.
The `BOUNDARY' symbol is considered to be an alphabetic character, but cannot also be listed in the `ALPHABET' declaration. When used in the column header of a state table, it can only appear as the pair `#:#' (where, for instance, `#' has been declared as the `BOUNDARY' symbol). The `BOUNDARY' symbol is also used in the lexicon file in the continuation class field of a lexical entry to indicate the end of a word (that is, no continuation class).
* The `SUBSET' declaration defines a set of characters that is referred to in the column headers of rules. The keyword `SUBSET' is followed by the subset name and the subset symbol list. The subset name is a single word (one or more characters) that names the list of characters that follows it. The subset name must be unique (that is, if it is a single character it cannot also be in the alphabet or be any other declared symbol). It can be composed of any characters (except space); that is, it is not limited to the characters declared in the `ALPHABET' section. It must not be identical to any keyword used in the rules file. The subset name is used in rules to represent all members of the subset of the alphabet that it defines. Note that `SUBSET' declarations can be interspersed among the rules. This allows subsets to be placed near the rule that uses them if such a style is desired. However, a subset must be declared before a rule that uses it.
* The subset symbol list following a subset name is a list of single symbols, each of which is separated by at least one space. The list can span multiple lines. Each symbol in the list must be a member of the previously defined `ALPHABET', with the exception of the `NULL' symbol, which can appear in a subset list but is not included in the `ALPHABET' declaration. Neither the `ANY' symbol nor the `BOUNDARY' symbol can appear in a subset symbol list.
* The keyword `RULE' signals that a state table immediately follows. Note that two-level rules must be expressed as a state table rather than in the form discussed in chapter 2 `The Two-level Formalism' above.
* The rule name is the name or description of the rule which the state table encodes. It functions as an annotation to the state table and has no effect on the computational operation of the table. It is displayed by the `list rules' and `show rule' commands and is also displayed in traces. The rule name must be surrounded by a pair of identical delimiter characters. Any material can be used between the delimiters of the rule name with the exception of the current comment character and of course the rule name delimiter character of the rule itself. Each rule in the file can use a different pair of delimiters. The rule name must be all on one line, but it does not have to be on the same line as the `RULE' keyword.
* The rule name is followed by the number of states (rows in the table) that will be defined for this table. The states must begin at 1 and go in sequence through the number defined here (that is, gaps in state numbers are not allowed).
* The number of states is followed by the number of state transitions (columns in the table) that will be defined for each state.
* The lexical half of the column headers is a list of elements separated by one or more spaces. Each element represents the lexical half of a lexical:surface correspondence which, when matched, defines a state transition. Each element in the list must be either a member of the alphabet, a subset name, the `NULL' symbol, the `ANY' symbol, or the `BOUNDARY' symbol (in which case the corresponding surface character must also be the `BOUNDARY' symbol). The list can span multiple lines, but the number of elements in the list must be equal to the number of columns defined for the rule.
* The surface half of the column headers is a list of elements separated by one or more spaces. Each element represents the surface half of a lexical:surface correspondence which, when matched, defines a state transition. Each element in the list must be either a member of the alphabet, a subset name, the `NULL' symbol, the `ANY' symbol, or the `BOUNDARY' symbol (in which case the corresponding lexical character must also be the `BOUNDARY' symbol).
The list can span multiple lines, but the number of elements in the list must be equal to the number of columns defined for the rule.
* The state number is the number of the state or row of the table. The first state number must be 1, and subsequent state numbers must follow in numerical sequence without any gaps.
* `{: | .}' is the final or nonfinal state indicator. This should be a colon (`:') if the state is a final state and a period (`.') if it is a nonfinal state. It must follow the state number with no intervening space.
* The state transition list is a list of state transition numbers for a particular state. Each number must be either 0, which marks a failing transition (as in the "Voicing" table of Figure 2), or between 1 and the number of states (inclusive) declared for the table. The list can span multiple lines, but the number of elements in the list must be equal to the number of columns declared for this rule.
* The keyword `END' follows all other declarations and indicates the end of the rules file. Any material in the file thereafter is ignored by PC-Kimmo. The `END' keyword is optional; the physical end of the file also terminates the rules file.

Figure 2 shows a sample rules file.

Figure 2 A sample rules file

     ALPHABET
       b c d f g h j k l m n p q r s t v w x y z +  ; + is morpheme boundary
       a e i o u
     NULL 0
     ANY  @
     BOUNDARY #
     SUBSET C  b c d f g h j k l m n p q r s t v w x y z
     SUBSET V  a e i o u
     ; more subsets

     RULE "Consonant defaults" 1 23
        b c d f g h j k l m n p q r s t v w x y z + @
        b c d f g h j k l m n p q r s t v w x y z 0 @
     1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

     RULE "Vowel defaults" 1 6
        a e i o u @
        a e i o u @
     1: 1 1 1 1 1 1

     RULE "Voicing s:z <=> V___V" 4 4
        V s s @
        V z @ @
     1: 2 0 1 1
     2: 2 4 3 1
     3: 0 0 1 1
     4. 2 0 0 0

     ; more rules
     END

The PC-Kimmo Lexicon Files
**************************

A lexicon consists of one main lexicon file plus one or more files of lexical entries. The general structure of the main lexicon file is a list of keyword declarations. The set of valid keywords is `ALTERNATION', `FEATURES', `FIELDCODE', `INCLUDE', and `END'.
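As a rough illustration of this keyword-driven layout, the following Python sketch groups the words of a main lexicon file under the keyword that introduces them. It is not part of PC-Kimmo: the function name `split_declarations` is hypothetical, and the sketch assumes that the comment character is `;' and that no declaration argument is spelled like a keyword. The sample text is abridged from Figure 4.

```python
# Keywords named in this chapter for the main lexicon file.
KEYWORDS = {"ALTERNATION", "FEATURES", "FIELDCODE", "INCLUDE", "END"}


def split_declarations(text, comment_char=";"):
    """Return a list of (keyword, [arguments]) pairs, stopping at END.

    Hypothetical helper for illustration only; PC-Kimmo's own loader
    also validates the arguments of each declaration.
    """
    declarations = []
    for line in text.splitlines():
        # Everything from the comment character to end of line is ignored.
        line = line.split(comment_char, 1)[0]
        for word in line.split():
            if word == "END":
                # Material after END is ignored.
                return declarations
            if word in KEYWORDS:
                declarations.append((word, []))
            elif declarations:
                declarations[-1][1].append(word)
            else:
                raise ValueError(f"expected a keyword, found {word!r}")
    return declarations


sample = """\
ALTERNATION Begin PREF
ALTERNATION Pref N AJ V AV
FEATURES sg pl reg irreg
FIELDCODE lf U ;lexical item
INCLUDE affix.lex ;file of affixes
END
anything after END is ignored
"""

declarations = split_declarations(sample)
for keyword, arguments in declarations:
    print(keyword, arguments)
```

Because declarations end only at the next keyword, an argument list that spans several lines is collected into a single declaration without any special handling.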
Figure 3 shows the conventional structure of the main lexicon file.

Figure 3 Structure of the main lexicon file

     ALTERNATION
       .          (more ALTERNATIONs)
       .
       .
     FEATURES
     FIELDCODE U
     FIELDCODE L
     FIELDCODE A
     FIELDCODE F
     FIELDCODE G
     INCLUDE
       .          (more INCLUDEd files)
       .
       .
     END

The following specifications apply to the main lexicon file.
* Extra spaces, blank lines, and comment lines are ignored. In the descriptions below, reference to the use of a space character implies any whitespace character (that is, any character treated like a space character). The following control characters when used in a file are whitespace characters: `^I' (ASCII 9, tab), `^J' (ASCII 10, line feed), `^K' (ASCII 11, vertical tab), `^L' (ASCII 12, form feed), and `^M' (ASCII 13, carriage return).
* The comment character declared in the rules file is operative in the main lexicon file. Comments may be placed anywhere in the file. All data following a comment character to the end of the line is ignored.
* The set of valid keywords used to form declarations includes `ALTERNATION', `FEATURES', `FIELDCODE', `INCLUDE', and `END'.
* The declarations can appear in any order with the proviso that any alternation name, feature name, or fieldcode used in a lexical entry must be declared before the lexical entry is read. In practice, this means that the `INCLUDE' declarations should appear last, but the `ALTERNATION', `FEATURES', and `FIELDCODE' declarations can appear in any order.
* The `ALTERNATION' declaration defines a set of sublexicon names that serve as the continuation class of a lexical item. The `ALTERNATION' keyword is followed by an alternation name and a sublexicon name list. `ALTERNATION' declarations are optional (but nearly always used in practice) and can occur as many times as needed.
* The alternation name is a name associated with the sublexicon name list that follows it. It is a word composed of one or more characters, not limited to the `ALPHABET' characters declared in the rules file. An alternation name can be any word other than a keyword used in the lexicon file.
The program does not check to see if an alternation name is actually used in the lexicon file.
* The sublexicon name list is a list of sublexicon names. It can span multiple lines until the next valid keyword is encountered. Each sublexicon name in the list must be used in the sublexicon field of a lexical entry. Although it is not enforced at the time the lexicon file is loaded, an undeclared sublexicon named in a sublexicon name list will cause an error when the recognizer tries to use it.
* The `FEATURES' keyword is followed by a list of feature names. This is a list of words, each of which is expanded into feature structures by the word grammar.
* The `FIELDCODE' declaration is used to define what fieldcode will be used to mark each type of field in a lexical entry. The `FIELDCODE' keyword is followed by a fieldcode and one of five possible internal codes: `U', `L', `A', `F', or `G'. There must be five `FIELDCODE' declarations, one for each of these internal codes, where `U' indicates the lexical item field, `L' indicates the sublexicon field, `A' indicates the alternation field, `F' indicates the features field, and `G' indicates the gloss field.
* The `INCLUDE' keyword is followed by a filename that names a file containing lexical entries to be loaded. An `INCLUDE'd file cannot contain any declarations (such as a `FIELDCODE' or an `INCLUDE' declaration), only lexical entries and comment lines.
* The keyword `END' follows all other declarations and indicates the end of the main lexicon file. Any material in the file thereafter is ignored by PC-Kimmo. The `END' keyword is optional; the physical end of the file also terminates the main lexicon file.

Figure 4 shows a sample main lexicon file.
Figure 4 A sample main lexicon file

     ALTERNATION Begin    PREF
     ALTERNATION Pref     N AJ V AV
     ALTERNATION Stem     SUFFIX
     FEATURES sg pl reg irreg
     FIELDCODE lf   U     ;lexical item
     FIELDCODE lx   L     ;sublexicon
     FIELDCODE alt  A     ;alternation
     FIELDCODE fea  F     ;features
     FIELDCODE gl   G     ;gloss
     INCLUDE affix.lex    ;file of affixes
     INCLUDE noun.lex     ;file of nouns
     INCLUDE verb.lex     ;file of verbs
     INCLUDE adjectiv.lex ;file of adjectives
     INCLUDE adverb.lex   ;file of adverbs
     END

Figure 5 shows the structure of a lexical entry. Lexical entries are encoded in "field-oriented standard format." Standard format is an information interchange convention developed by SIL International. It tags the kinds of information in ASCII text files by means of markers which begin with backslash. Field-oriented standard format (FOSF) is a refinement of standard format geared toward representing data which has a database-like record and field structure.

Figure 5 Structure of a lexical entry

     \ \ \ { | } \ \

The following points provide an informal description of the syntax of FOSF files.
* A field-oriented standard format (FOSF) file consists of a sequence of records.
* A record consists of a sequence of fields.
* A field consists of a field marker and a field value.
* A field marker consists of a backslash character at the beginning of a line, followed by an alphabetic or numeric character, followed by zero or more printable characters, and terminated by a space, tab, or the end of a line. A field marker without its initial backslash character is termed a field code.
* A field marker must begin in the first position of a line. Backslash characters occurring elsewhere in the file are not interpreted as field markers.
* The first field marker of the record is considered the record marker, and thus the same field must occur first in every record of the file.
* Each field marker is separated from the field value by one or more spaces, tabs, or newlines. The field value continues up to the next field marker.
* Any line that is empty or contains only whitespace characters is considered a comment line and is ignored. Comment lines may occur between or within fields.
* Fields and lines in an FOSF file can be arbitrarily long.
* There are two basic types of fields in FOSF files: nonrepeating and repeating. Repeating fields are multiple consecutive occurrences of fields marked by the same marker. Individual fields within a repeating field can be called subfields.

The following specifications apply to how FOSF is implemented in PC-Kimmo.

* Lexical entries are encoded as records in an FOSF file.
* Only those fields whose field codes are declared in the main lexicon file are recognized (see above on the `FIELDCODE' declaration). All other fields are considered to be extraneous and are ignored.
* The first field of each lexical entry must be the lexical item field. The lexical item field code is assigned to the internal code `U' by a `FIELDCODE' declaration in the main lexicon file.
* Only nonrepeating fields are permitted.
* The comment character declared in the rules file is operative in included files of lexical entries. All data following a comment character to the end of the line is ignored.

A file of lexical entries is loaded by using an `INCLUDE' declaration in the main lexicon file (see above). An `INCLUDE'd file of lexical entries cannot contain any declarations (such as a `FIELDCODE' or an `INCLUDE' declaration), only lexical entries and comment lines.

The following specifications apply to lexical entries.

* A lexical entry is composed of five fields: lexical item, sublexicon, alternation, features, and gloss. The lexical item, sublexicon, and alternation fields are obligatory; the features and gloss fields are optional. The first field of the entry must always be the lexical item. The other fields can appear in any order, even differing from one entry to another.
* Although the gloss field is optional, if a lexical entry does not include one, a warning message to that effect will be displayed when the entry is loaded. To suppress this warning message, use the command `set warnings off' (see section 3.2.17.19 `set warnings') before loading the lexicon.
* If an entry has an empty gloss field (that is, the field marker for the gloss field is present but there is no data after it), then the contents of the lexical item field will also be used as the gloss for that entry.
* A lexical item field consists of a lexical item field code and a lexical item.
* A lexical item field code is a field code assigned to the internal code `U' by a `FIELDCODE' declaration in the main lexicon file.
* A lexical item is one or more characters that represent an element (typically a morpheme or word) of the lexicon. Each character (or multigraph) must be in the alphabet defined for the language. The lexical item uses only the lexical subset of the alphabet.
* A sublexicon field consists of a sublexicon field code and a sublexicon name.
* A sublexicon field code is a field code assigned to the internal code `L' by a `FIELDCODE' declaration in the main lexicon file.
* A sublexicon name is the name associated with a sublexicon. It is a word composed of one or more characters, not limited to the alphabetic characters declared in the rules file. Every lexical item must belong to a sublexicon. Every lexicon must include a special sublexicon named INITIAL (that is, there must be at least one lexical entry that belongs to the INITIAL sublexicon).
* Lexical entries belonging to a sublexicon do not have to be listed consecutively in a single file (as was the case for PC-Kimmo version 1); rather, lexical entries in a file can occur in any order, regardless of what sublexicon they belong to. Lexical entries of a sublexicon can even be placed in two or more separate files.
* An alternation field consists of an alternation field code followed by either an alternation name or the boundary symbol.
* An alternation name is declared in an `ALTERNATION' declaration in the main lexicon file.
The boundary symbol is declared in the rules file and indicates the end of all possible continuations in the lexicon.
* A features field consists of a features field code and a feature list.
* A features field code is a field code assigned to the internal code `F' by a `FIELDCODE' declaration in the main lexicon file.
* A feature list is a list of feature abbreviations. Each abbreviation is a single word consisting of alphanumeric characters or other characters except `(){}[]<>=:$!' (these are used for special purposes in the grammar file). The character `\' should not be used as the first character of an abbreviation because that is how fields are marked in the lexicon file. Upper and lower case letters used in template names are considered different. For example, `PLURAL' is not the same as `Plural' or `plural'. Feature abbreviations are expanded into full feature structures by the word grammar (see chapter 6 `The Grammar File').
* A gloss field consists of a gloss field code and a gloss.
* A gloss field code is a field code assigned to the internal code `G' by a `FIELDCODE' declaration in the main lexicon file.
* A gloss is a string of text. Any material can be used in the gloss field with the exception of the comment character.

Figure 6 shows a sample lexical entry.

Figure 6 A sample lexical entry

    \lf `knives
    \lx N
    \alt Infl
    \fea pl irreg
    \gl N(`knife)+PL

The Grammar File
****************

The following specifications apply generally to the word grammar file:

* Blank lines, spaces, and tabs separate elements of the grammar file from one another, but are ignored otherwise.
* The comment character declared by the `set comment' command (see section 3.2.17.4 `set comment' above) is operative in the grammar file. The default comment character is the semicolon (`;'). Comments may be placed anywhere in the grammar file. Everything following a comment character to the end of the line is ignored.
* A grammar file is divided into fields identified by a small set of keywords.
1. `Rule' starts a context-free phrase structure rule with its set of feature constraints.
These rules define how words join together to form phrases, clauses, or sentences. The lexicon and grammar are tied together by using the lexical categories as the terminal symbols of the phrase structure rules and by using the other lexical features in the feature constraints.
2. `Let' starts a feature template definition. Feature templates are used as macros (abbreviations) in the lexicon. They may also be used to assign default feature structures to the categories.
3. `Parameter' starts a program parameter definition. These parameters control various aspects of the program.
4. `Define' starts a lexical rule definition. As noted in Shieber (1985), something more powerful than just abbreviations for common feature elements is sometimes needed to represent systematic relationships among the elements of a lexicon. This need is met by lexical rules, which express transformations rather than mere abbreviations. Lexical rules are not yet implemented properly. They may or may not be useful for word grammars used by PC-Kimmo.
5. `Lexicon' starts a lexicon section. This is only for compatibility with the original PATR-II. The section name is skipped over properly, but nothing is done with it.
6. `Word' starts an entry in the lexicon. This is only for compatibility with the original PATR-II. The entry is skipped over properly, but nothing is done with it.
7. `End' effectively terminates the file. Anything following this keyword is ignored.

Note that these keywords are not case sensitive: `RULE' is the same as `rule', and both are the same as `Rule'.

* Each of the fields in the grammar file may optionally end with a period. If there is no period, the next keyword (in an appropriate slot) marks the end of one field and the beginning of the next.

Rules
=====

A PC-Kimmo word grammar rule has these parts, in the order listed:

1. the keyword `Rule'
2. an optional rule identifier enclosed in braces (`{}')
3. the nonterminal symbol to be expanded
4.
an arrow (`->') or equal sign (`=')
5. zero or more terminal or nonterminal symbols, possibly marked for alternation or optionality
6. an optional colon (`:')
7. zero or more feature constraints
8. an optional period (`.')

The optional rule identifier consists of one or more words enclosed in braces. Its current utility is only as a special form of comment describing the intent of the rule. (Eventually it may be used as a tag for interactively adding and removing rules.) The only limits on the rule identifier are that it not contain the comment character and that it appear entirely on one line in the grammar file.

The terminal and nonterminal symbols in the rule have the following characteristics:

* Upper and lower case letters used in symbols are considered different. For example, `NOUN' is not the same as `Noun', and neither is the same as `noun'.
* The symbol X may be used to stand for any terminal or nonterminal. For example, this rule says that any category in the grammar rules can be replaced by two copies of the same category separated by a CJ.

    Rule X -> X_1 CJ X_2 = = = =

  The symbol X can be useful for capturing generalities. Care must be taken, since it can be replaced by anything.
* Index numbers are used to distinguish instances of a symbol that is used more than once in a rule. They are added to the end of a symbol following an underscore character (`_'). This is illustrated in the rule for X above.
* The characters `(){}[]<>=:/' cannot be used in terminal or nonterminal symbols since they are used for special purposes in the grammar file. The character `_' can be used *only* for attaching an index number to a symbol.
* By default, the left hand symbol of the first rule in the grammar file is the start symbol of the grammar.

The symbols on the right hand side of a phrase structure rule may be marked or grouped in various ways:

* Parentheses around an element of the expansion (right hand) part of a rule indicate that the element is optional.
Parentheses may be placed around multiple elements. This makes an optional group of elements.
* A forward slash (/) is used to separate alternative elements of the expansion (right hand) part of a rule.
* Curly braces can be used for grouping elements. For example, the following says that an S consists of an NP followed by either a TVP or an IV:

    Rule S -> NP {TVP / IV}

* Alternatives are taken to be as long as possible. Thus if the curly braces were omitted from the rule above, as in the rule below, the TVP would be treated as part of the alternative containing the NP. It would not be allowed before the IV.

    Rule S -> NP TVP / IV

* Parentheses group enclosed elements the same as curly braces do. Alternatives and groups delimited by parentheses or curly braces may be nested to any depth.

A rule can be followed by zero or more *feature constraints* that refer to symbols used in the rule. A feature constraint has these parts, in the order listed:

1. a feature path that begins with one of the symbols from the phrase structure rule
2. an equal sign
3. either another path or a value

A feature constraint that refers only to symbols on the right hand side of the rule constrains their co-occurrence. In the following rule and constraint, the values of the *agr* features for the NP and VP nodes of the parse tree must unify:

    Rule S -> NP VP
        <NP agr> = <VP agr>

If a feature constraint refers to a symbol on the right hand side of the rule, and has an atomic value on its right hand side, then the designated feature must not have a different value. In the following rule and constraint, the *head case* feature for the NP node of the parse tree must either be originally undefined or equal to NOM:

    Rule S -> NP VP
        <NP head case> = NOM

(After unification succeeds, the *head case* feature for the NP node of the parse tree will be equal to NOM.)

A feature constraint that refers to the symbol on the left hand side of the rule passes information up the parse tree.
In the following rule and constraint, the value of the *tense* feature is passed from the VP node up to the S node:

    Rule S -> NP VP
        <S tense> = <VP tense>

Feature templates
=================

A PC-Kimmo grammar feature template has these parts, in the order listed:

1. the keyword `Let'
2. the template name
3. the keyword `be'
4. a feature definition
5. an optional period (`.')

If the template name is a terminal category (a terminal symbol in one of the phrase structure rules), the template defines the default features for that category. Otherwise the template name serves as an abbreviation for the associated feature structure.

The characters `(){}[]<>=:' cannot be used in template names since they are used for special purposes in the grammar file. The characters `/_' can be freely used in template names. The character `\' should not be used as the first character of a template name because that is how fields are marked in the lexicon file.

The abbreviations defined by templates are usually used in the feature field of entries in the lexicon file. For example, the lexical entry for the irregular plural form *feet* may have the abbreviation *pl* in its features field. The grammar file would define this abbreviation with a template like this:

    Let pl be [number: PL]

The path notation may also be used:

    Let pl be <number> = PL

More complicated feature structures may be defined in templates. For example,

    Let 3sg be [tense: PRES agr: 3SG finite: + vform: S]

which is equivalent to:

    Let 3sg be <tense> = PRES
               <agr> = 3SG
               <finite> = +
               <vform> = S

In the following example, the abbreviation *irreg* is defined using another abbreviation:

    Let irreg be = - pl

The abbreviation *pl* must be defined previously in the grammar file or an error will result. A subsequent template could also use the abbreviation *irreg* in its definition. In this way, an inheritance hierarchy of features may be constructed.

Feature templates permit disjunctive definitions. For example, the lexical entry for the word *deer* may specify the feature abbreviation *sg/pl*.
The grammar file would define this as a disjunction of feature structures reflecting the fact that the word can be either singular or plural:

    Let sg/pl be {[number:SG] [number:PL]}

This has the effect of creating two entries for *deer*, one with singular number and another with plural. Note that there is no limit to the number of disjunct structures listed between the braces. Also, there is no slash (`/') between the elements of the disjunction as there is between the elements of a disjunction in the rules. A shorter version of the above template using the path notation looks like this:

    Let sg/pl be <number> = {SG PL}

Abbreviations can also be used in disjunctions, provided that they have previously been defined:

    Let sg be <number> = SG
    Let pl be <number> = PL
    Let sg/pl be {[sg] [pl]}

Note the square brackets around the abbreviations *sg* and *pl*; without square brackets they would be interpreted as simple values instead.

Feature templates can assign default atomic feature values, indicated by prefixing an exclamation point (!). A default value can be overridden by an explicit feature assignment. This template says that all members of category N have singular number as a default value:

    Let N be <number> = !SG

The effect of this template is to make all nouns singular unless they are explicitly marked as plural. For example, regular nouns such as *book* do not need any feature in their lexical entries to signal that they are singular; but an irregular noun such as *feet* would have a feature abbreviation such as *pl* in its lexical entry. This would be defined in the grammar as `[number: PL]', and would override the default value for the feature number specified by the template above. If the N template above used `SG' instead of `!SG', then the word *feet* would fail to parse, since its *number* feature would have an internal conflict between `SG' and `PL'.

Parameter settings
==================

A PC-Kimmo grammar parameter setting has these parts, in the order listed:

1. the keyword `Parameter'
2.
an optional colon (`:')
3. one or more keywords identifying the parameter
4. the keyword `is'
5. the parameter value
6. an optional period (`.')

PC-Kimmo recognizes the following grammar parameters:

`Start symbol'
    defines the start symbol of the grammar. For example,

        Parameter Start symbol is S

    declares that the parse goal of the grammar is the nonterminal category S. The default start symbol is the left hand symbol of the first phrase structure rule in the grammar file.

`Restrictor'
    defines a set of features to use for top-down filtering, expressed as a list of feature paths. For example,

        Parameter Restrictor is <cat> <head form>

    declares that the *cat* and *head form* features should be used to screen rules before adding them to the parse chart. The default is not to use any features for such filtering. This filtering, named *restriction* in Shieber (1985), is performed in addition to the normal top-down filtering based on categories alone. RESTRICTION IS NOT YET IMPLEMENTED. SHOULD IT BE INSTEAD OF NORMAL FILTERING RATHER THAN IN ADDITION TO?

`Attribute order'
    specifies the order in which feature attributes are displayed. For example,

        Parameter Attribute order is cat lex sense head first rest agreement

    declares that the *cat* attribute should be the first one shown in any output from PC-Kimmo, and that the other attributes should be shown in the relative order shown, with the *agreement* attribute shown last among those listed, but ahead of any attributes that are not listed above. Attributes that are not listed are ordered according to their character code sort order. If the attribute order is not specified, then the category feature *cat* is shown first, with all other attributes sorted according to their character codes.

`Category feature'
    defines the label for the category attribute. For example,

        Parameter Category feature is Categ

    declares that *Categ* is the name of the category attribute. The default name for this attribute is *cat*.
`Lexical feature'
    defines the label for the lexical attribute. For example,

        Parameter Lexical feature is Lex

    declares that *Lex* is the name of the lexical attribute. The default name for this attribute is *lex*.

`Gloss feature'
    defines the label for the gloss attribute. For example,

        Parameter Gloss feature is Gloss

    declares that *Gloss* is the name of the gloss attribute. The default name for this attribute is *gloss*.

Lexical rules
=============

A PC-Kimmo grammar lexical rule has these parts, in the order listed:

1. the keyword `Define'
2. the name of the lexical rule
3. the keyword `as'
4. the rule definition
5. an optional period (`.')

The rule definition consists of one or more mappings. Each mapping has three parts: an output feature path, an assignment operator, and the value assigned, either an input feature path or an atomic value. Every output path begins with the feature name `out' and every input path begins with the feature name `in'. The assignment operator is either an equal sign (`=') or an equal sign followed by a "greater than" sign (`=>').

As noted before, lexical rules are not yet implemented properly, and may not prove to be useful for PC-Kimmo word grammars in any case.

Convlex: converting version 1 lexicons
**************************************

The format of the lexicon files changed significantly between version 1 and version 2 of PC-Kimmo. For this reason, an auxiliary program to convert lexicon files was written. A version 1 PC-Kimmo lexicon file looks like this:

    ; SAMPLE.LEX 25-OCT-89
    ; To load this file, first load the rules file SAMPLE.RUL and
    ; then enter the command LOAD LEXICON SAMPLE.

    ALTERNATION Begin NOUN
    ALTERNATION Noun End

    LEXICON INITIAL
    0 Begin "[ "

    LEXICON NOUN
    s'ati Noun "Noun1"
    s'adi Noun "Noun2"
    bab'at Noun "Noun3"
    bab'ad Noun "Noun4"

    LEXICON End
    0 # " ]"

    END

For PC-Kimmo version 2, the same lexicon must be split into two files.
The first one would look like this:

    ; SAMPLE.LEX 25-OCT-89
    ; To load this file, first load the rules file SAMPLE.RUL and
    ; then enter the command LOAD LEXICON SAMPLE.

    ALTERNATION Begin NOUN
    ALTERNATION Noun End
    FIELDCODE lf U
    FIELDCODE lx L
    FIELDCODE alt A
    FIELDCODE fea F
    FIELDCODE gl G
    INCLUDE sample2.sfm
    END

Note that everything except the lexicon sections and entries has been copied verbatim into this new primary lexicon file. The `FIELDCODE' statements define how to interpret the other lexicon files containing the actual lexicon sections and entries. These files are indicated by `INCLUDE' statements, and look like this:

    \lf 0
    \lx INITIAL
    \alt Begin
    \fea
    \gl [

    \lf s'ati
    \lx NOUN
    \alt Noun
    \fea
    \gl Noun1

    \lf s'adi
    \lx NOUN
    \alt Noun
    \fea
    \gl Noun2

    \lf bab'at
    \lx NOUN
    \alt Noun
    \fea
    \gl Noun3

    \lf bab'ad
    \lx NOUN
    \alt Noun
    \fea
    \gl Noun4

    \lf 0
    \lx End
    \alt #
    \fea
    \gl ]

`convlex' was written to make the transition from version 1 to version 2 of PC-Kimmo as painless as possible. It reads a version 1 lexicon file, including any `INCLUDE'd files, and writes a version 2 set of lexicon files. For a trivial case like the example above, the interaction with the user might go something like this:

    C:\>convlex
    CONVLEX: convert lexicon from PC-KIMMO version 1 to version 2
    Comment character: [;]
    Input lexicon file: sample.lex
    Output lexicon file: sample2.lex
    Primary sfm lexicon file: sample2.sfm

For each `INCLUDE' statement in the version 1 lexicon file, `convlex' prompts for a replacement filename like this:

    New sfm include file to replace noun.lex: noun2.sfm

The user interface is extremely crude, but since this is a program that is run only once or twice by most users, that should not be regarded as a problem.

Bibliography
************

1. Antworth, Evan L. 1990. `PC-KIMMO: a two-level processor for morphological analysis'. Occasional Publications in Academic Computing No. 16. Dallas, TX: Summer Institute of Linguistics.
2. Antworth, Evan L. 1991.
Introduction to two-level phonology. `Notes on Linguistics' 53:4-18. Dallas, TX: Summer Institute of Linguistics.
3. Antworth, Evan L. 1995. `User's Guide to PC-KIMMO version 2'. URL ftp://ftp.sil.org/software/dos/pc-kimmo/guide.zip (visited August 29, 1997).
4. Chomsky, Noam. 1957. `Syntactic structures.' The Hague: Mouton.
5. Chomsky, Noam, and Morris Halle. 1968. `The sound pattern of English.' New York: Harper and Row.
6. Goldsmith, John A. 1990. `Autosegmental and metrical phonology.' Basil Blackwell.
7. Johnson, C. Douglas. 1972. `Formal aspects of phonological description.' The Hague: Mouton.
8. Kay, Martin. 1983. When meta-rules are not meta-rules. In Karen Sparck Jones and Yorick Wilks, eds., `Automatic natural language parsing,' 94-116. Chichester: Ellis Horwood Ltd. See pages 100-104.
9. Koskenniemi, Kimmo. 1983. `Two-level morphology: a general computational model for word-form recognition and production.' Publication No. 11. Helsinki: University of Helsinki Department of General Linguistics.