PC-KIMMO REFERENCE MANUAL

version 1.0, May 1990
Evan L. Antworth
Copyright 1990 Summer Institute of Linguistics

1 Introduction and technical specifications
2 Installing PC-KIMMO
3 Starting PC-KIMMO
4 Entering commands and getting on-line help
5 Command reference by function
  5.1 Get help
  5.2 Load rules and lexicon
  5.3 Select new language
  5.4 Take commands from a file
  5.5 List rule names, feasible pairs, or sublexicon names
  5.6 Set system parameters
  5.7 Turn logging on or off
  5.8 Show system status
  5.9 Show rule or sublexicon
  5.10 Generate surface forms from a lexical form
  5.11 Recognize lexical forms from a surface form
  5.12 Compare data from a file
  5.13 Generate forms from a file
  5.14 Recognize forms from a file
  5.15 Execute an operating system command
  5.16 Edit a file
  5.17 Halt the program
6 Alphabetic list of commands
7 File formats
  7.1 Rules file
  7.2 Lexicon file
  7.3 Generation comparison file
  7.4 Recognition comparison file
  7.5 Pairs comparison file
  7.6 Generation file
  7.7 Recognition file
  7.8 Summary of default file names and extensions
8 Trace formats
  8.1 Generator trace
  8.2 Recognizer trace
9 Algorithms
  9.1 Generating surface forms
  9.2 Recognizing lexical forms
10 Error messages
  10.1 Errors related to reading and parsing commands
  10.2 Errors related to reading the rules file
  10.3 Errors related to reading the lexicon file
  10.4 Errors related to recognizing or generating a form
  10.5 Errors that abort program execution
References
Errata

1 Introduction and technical specifications

PC-KIMMO is a new implementation for microcomputers of a program dubbed KIMMO after its inventor Kimmo Koskenniemi. Koskenniemi's two-level model was designed to generate and recognize words (see Koskenniemi 1983). Work on PC-KIMMO was begun in 1985, following the specifications of the LISP implementation of Koskenniemi's model described in Karttunen 1983. The aim was to develop a version of the two-level processor that would run on an IBM PC compatible computer and that would include an environment for testing and debugging a linguistic description.

The PC-KIMMO program is actually a shell program that serves as an interactive user interface to the primitive PC-KIMMO functions. These functions are available as a source code library that can be included in a program written by the user.
The coding has been done in Microsoft C by David Smith and Stephen McConnel under the direction of Gary Simons and under the auspices of the Summer Institute of Linguistics.

Every effort has been made to maintain portability. Both the PC-KIMMO shell and the program modules will run on any hardware using MS-DOS or PC-DOS version 2.0 or higher. PC-KIMMO can run in as little as 256KB of memory, but will use up to 640KB. PC-KIMMO has also been compiled and tested for UNIX System V (SCO UNIX V/386 and A/UX) and for 4.2 BSD UNIX. We have also ported PC-KIMMO to the Macintosh, though it retains its command-line interface rather than using the graphical user interface one expects from Macintosh programs. Also, a few commands are not available in the Macintosh version; see the README file on the Macintosh version of the PC-KIMMO release diskette for detailed information.

There are two versions of the PC-KIMMO release diskette, one for IBM PC compatibles and one for the Macintosh. Each contains the executable PC-KIMMO program, examples of language descriptions, and the source code library for the primitive PC-KIMMO functions. The PC-KIMMO executable program and the source code library are copyrighted but are made freely available to the general public under the condition that they not be resold or used for commercial purposes. For those who wish to compile PC-KIMMO for their UNIX system, the complete source code for both the user shell and the primitive functions is available for the cost of the media and shipping from Academic Computing Department, Summer Institute of Linguistics, 7500 W. Camp Wisdom Road, Dallas, TX 75236.

The English description referred to in this chapter is based on Karttunen and Wittenburg 1983 as modified by Steve Echerd and Evan Antworth; see appendix A for a detailed exposition of the English description. The English files are found in the ENGLISH subdirectory on the PC-KIMMO release diskette.

2 Installing PC-KIMMO

The following instructions apply to installing the IBM PC version of PC-KIMMO. Most of the information also applies to the UNIX version. For information on installing and running the Macintosh version, see the README file on the Macintosh version of the PC-KIMMO release diskette.

If your computer has floppy disks only, make a working copy of the PC-KIMMO release diskette that came with this book. Store the original in a safe place. Insert your working copy of the PC-KIMMO diskette in drive A of your machine.

If your computer has a hard disk, use the INSTALL.BAT procedure on the PC-KIMMO diskette to install the system on your hard disk. To do this, insert the PC-KIMMO diskette in one of your disk drives. Type A: (or whatever the name of the drive is) to make that drive the current drive. Now type INSTALL followed by the name of the hard disk on which you want to install PC-KIMMO (for instance, INSTALL C:). This will create a subdirectory called PCKIMMO on your hard disk and copy the contents of the release diskette (with all its subdirectories) into it.

Whether you are using a floppy or hard disk system, the operating system's PATH variable must be set to include the directory where the PC-KIMMO program is found. The AUTOEXEC.BAT file on your boot disk should contain a path statement that specifies all the disks and directories that contain programs. On a floppy disk system, the path statement should include as a minimum the root directory of drive A, for instance, PATH=A:\. On a hard disk system, add ;C:\PCKIMMO to the end of the path statement. For the path statement to become effective, you must reboot the computer. (If you want to change the path variable without changing the AUTOEXEC.BAT file and rebooting, enter a path command directly at the operating system prompt.)
In order to use PC-KIMMO's EDIT command, you must set the operating system environment variable EDITOR to the name of your text editing program. This is done by including in the AUTOEXEC.BAT file a line of this form:

SET EDITOR=<filename>

where <filename> specifies the path and full file name of your editing program. For example, if your editor's file name is EMACS.EXE and is found in the UTIL subdirectory directly under the root directory, include this line:

SET EDITOR=\UTIL\EMACS.EXE

3 Starting PC-KIMMO

Be sure that DOS is logged onto the drive where PC-KIMMO is located. To change to the subdirectory that contains the English example, enter CD \ENGLISH on a floppy disk system, or CD \PCKIMMO\ENGLISH on a hard disk system. Now type PCKIMMO (if your PATH variable is not correctly set to include the PC-KIMMO subdirectory, type ..\PCKIMMO). When PC-KIMMO has successfully started up, you will see a version message and the PC-KIMMO command line prompt.

PC-KIMMO can also be started with optional command line arguments. The format of the command line is:

pckimmo [-c <char>] [-r <rules file>] [-l <lexicon file>] [-t <command file>]

The options are used as follows:

o The -c option changes the character used to delimit comments in files used by PC-KIMMO. The argument is a single character. If this option is not specified, the semicolon (;) will be used as the comment delimiter. This option is equivalent to issuing the SET COMMENT command from the program prompt.
o The -r option specifies a rules file to be loaded. It is equivalent to issuing the LOAD RULES command from the program prompt.
o The -l option specifies a lexicon file to be loaded. It is equivalent to issuing the LOAD LEXICON command from the program prompt. It must be used with the -r option.
o The -t option specifies a command file from which PC-KIMMO reads and executes commands. It is equivalent to issuing the TAKE command from the program prompt.
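The startup options above can be illustrated with a minimal Python sketch that assembles a pckimmo command line and enforces the one stated constraint, namely that -l must be used with -r. The function name build_pckimmo_argv and its parameter names are hypothetical and not part of PC-KIMMO; the order in which the options are emitted is also an assumption, since the manual does not say whether order matters.

```python
def build_pckimmo_argv(comment_char=None, rules_file=None,
                       lexicon_file=None, take_file=None):
    """Assemble an argument list for starting pckimmo (illustrative helper)."""
    # The manual states that -l must be used with the -r option.
    if lexicon_file is not None and rules_file is None:
        raise ValueError("-l must be used together with -r")
    # The argument to -c is a single character.
    if comment_char is not None and len(comment_char) != 1:
        raise ValueError("-c takes a single character")

    argv = ["pckimmo"]
    # Emission order is an assumption of this sketch.
    for flag, value in (("-c", comment_char), ("-r", rules_file),
                        ("-l", lexicon_file), ("-t", take_file)):
        if value is not None:
            argv.extend([flag, value])
    return argv
```

For instance, build_pckimmo_argv(rules_file="english.rul", lexicon_file="english.lex") yields the arguments for loading the English rules and lexicon at startup, while passing only lexicon_file raises an error.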

4 Entering commands and getting on-line help

The user interacts with PC-KIMMO by entering commands at the command line prompt, in much the same way that one enters commands at the operating system prompt. Case is ignored for all command keywords. Keywords can be shortened to any unambiguous form. For instance, LOAD RULES, LOAD RUL, LOAD R, and LOA R are all acceptable. Typing just L is ambiguous for the commands LOAD, LOG, and LIST. However, because LOAD is such a frequently used command, it takes special precedence over the other commands beginning with L, which means that typing just L will execute only the LOAD command.

PC-KIMMO can be used with a TSR (Terminate and Stay Resident) command line editor such as CED or NDOSEDIT. This allows the user to recall and edit several previous command lines. The list of previous PC-KIMMO command lines is kept separate from the list of previous operating system command lines. If you exit PC-KIMMO and then run it again, the set of command lines from your previous PC-KIMMO session is still available. Neither of the command line editors remembers a command shorter than three characters. It should be noted that CED uses the ^ character as a kind of "virtual carriage return." This means that forms containing ^ as an alphabetic character cannot be entered from the keyboard with the GENERATE and RECOGNIZE commands, though of course such words can be read from a file.

Screen scrolling can be halted by pressing Ctrl-S (that is, hold down the Ctrl (Control) key and press S); any key will resume scrolling. Processing can be interrupted by pressing Ctrl-C. Note that this action does not abort PC-KIMMO, but returns it to the program prompt. It is useful for stopping a long screen display (such as a trace) or a file processing command. Pressing Ctrl-P causes screen output to be echoed to the printer. Pressing Ctrl-P again stops printer echoing.

There are several ways to get on-line help:

o To get a list of the available commands, type ?.
o To get information on what these commands do, type HELP.
o To get the specific syntax and use for a command, type HELP plus a specific command name.
o To get a list of the keywords that can go with a particular command, type the command name followed by ?. Note however that if the command does not take a keyword it will be executed; for instance typing NEW ? will execute the NEW command.

5 Command reference by function

The following subsections document each command, arranged by function, of the PC-KIMMO system. Square brackets in the command line summaries indicate optional elements. The notation {x | y} means either x or y (but not both). Command keywords and arguments in boldface are typed literally; for instance, the command summary SET TRACING {ON | OFF} means to type either SET TRACING ON or SET TRACING OFF. Command arguments in italics are replaced by elements of the specified type; for instance, the command summary SET COMMENT <char> means to replace <char> with a single character, such as set comment ;.

5.1 Get help

?

Displays a list of command names.

HELP [<command name>]

Issuing the HELP command with no argument displays a list of commands with a brief description of their function. Issuing the HELP command with the name of a specific command displays a usage summary for the command.

<command name> ?

Typing a command name followed by ?, instead of a keyword, displays a message listing the keywords expected for that command.

5.2 Load rules and lexicon

The LOAD command is used to load either rules or a lexicon from a file.

LOAD RULES [<filename>]

The LOAD RULES command loads a set of rules from the file specified on the command line. The <filename> can contain a path, for example, B:\ENGLISH\ENGLISH.RUL. The default file name extension is .RUL; thus, the command LOAD RULES ENGLISH will load the file ENGLISH.RUL. If no file name is given, the default file name RULES.RUL is used. The rules file must be in the format described later in this chapter (see section 7.1).
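The default file name and extension behavior described for LOAD RULES can be sketched in Python as follows. This is an illustration only: the function name resolve_filename is hypothetical, and the rule that the default extension is appended only when the argument has no extension of its own is an assumption, since the manual states the defaults but not the exact test PC-KIMMO applies.

```python
import ntpath  # DOS-style path handling, usable on any platform


def resolve_filename(arg, default_name, default_ext):
    """Apply a default file name and extension to a command argument."""
    if not arg:
        # No file name given: fall back to the default file name.
        return default_name
    _root, ext = ntpath.splitext(arg)
    # Assumption: append the default extension only if none was typed.
    return arg if ext else arg + default_ext
```

With the LOAD RULES defaults, resolve_filename("ENGLISH", "RULES.RUL", ".RUL") gives ENGLISH.RUL, and an empty argument gives RULES.RUL. The same helper would cover the other commands by passing their own defaults, such as LEXICON.LEX and .LEX.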
An error in the format of the rules file will cause the program to stop loading the file, erase the rules already loaded, and report an error message with the line number where the error was encountered. Refer to section 10 on error messages for more details.

The rules file must be loaded before the lexicon and before performing any generation or recognition operations.

The LOAD RULES command can also be invoked by using the -r command line option when starting up PC-KIMMO (see section 3).

LOAD LEXICON [<filename>]

The LOAD LEXICON command loads a lexicon from the file specified on the command line. The <filename> can contain a path, for example, B:\ENGLISH\ENGLISH.LEX. The default file name extension is .LEX; thus, the command LOAD LEXICON ENGLISH will load the file ENGLISH.LEX. If no file name is given, the default file name LEXICON.LEX is used. The lexicon file must be in the format described later in this chapter (see section 7.2).

An error in the format of the lexicon file will cause the program to stop loading the file, erase the parts of the lexicon already loaded, and report an error message with the line number where the error was encountered. Refer to section 10 on error messages for more details.

The rules file must be loaded before the lexicon. The lexicon file must be loaded before performing any recognition operations. A generation operation can be performed without loading the lexicon.

The LOAD LEXICON command can also be invoked by using the -l command line option when starting up PC-KIMMO (see section 3).

5.3 Select new language

NEW

The NEW command clears the rules and lexicon currently loaded. Strictly speaking it is not needed, since the LOAD RULES command erases all existing rules and the LOAD LEXICON command erases any existing lexicon.

5.4 Take commands from a file

TAKE [<filename>]

The TAKE command causes PC-KIMMO to read and execute commands from a file. The <filename> can contain a path, for example, B:\KIMMO\ENGLISH.TAK.
The TAKE command recognizes the default file name PCKIMMO.TAK and the default file extension .TAK.

The command file can itself issue the TAKE command to call another command file, down to a depth of three files. That is, the user can specify a command file <file1> that contains the command TAKE <file2>, where <file2> itself contains the command TAKE <file3>. It would be an error for <file3> to contain a TAKE command.

A command file can also be specified by using the -t command line option when starting up PC-KIMMO (see section 3).

Note that a command file cannot submit forms to the special generator and recognizer prompts (see sections 5.10 and 5.11).

5.5 List rule names, feasible pairs, or sublexicon names

The LIST command is used to display either rule names, feasible pairs, or sublexicon names.

LIST PAIRS

The LIST PAIRS command displays on the screen the set of feasible pairs specified by the set of rules currently turned on.

LIST RULES

The LIST RULES command displays on the screen the current state of the rules that are loaded. The display consists of each rule by number, an indication of whether the rule is on or off, and the rule name from the header lines of its state table in the rules file.

LIST LEXICON

The LIST LEXICON command displays on the screen the names of the sublexicons of the lexicon currently in use.

5.6 Set system parameters

The SET command is used to turn tracing on or off, to turn on or off certain rules, to turn on or off various processing flags, and to change the comment delimiter character.

SET TRACING {ON | OFF | <level>}

The SET TRACING command allows you to turn the tracing mechanism on or off. When tracing is on, details of the analysis of a form are displayed on the screen during generation or recognition operations. If logging (see section 5.7) is on, the trace will also be written to the log file.
Tracing is operative for these commands: GENERATE, RECOGNIZE, FILE COMPARE GENERATE, FILE COMPARE RECOGNIZE, FILE COMPARE PAIRS, FILE GENERATE, and FILE RECOGNIZE.

The amount of detail shown in the trace display is set by the tracing level. The <level> argument to the SET TRACING command can range from 0 to 3, where 0 is no tracing at all and 3 is the most detailed level of tracing. Issuing the command SET TRACING OFF sets tracing to level 0. Issuing the command SET TRACING ON sets tracing to level 2.

At level 1, no information is given as to which feasible pair is being tried or the condition of the rules (that is, what state each automaton is in). Both the generator and recognizer report each RESULT line, with all NULL symbols being explicitly printed. The recognizer also displays lexicon information; that is, it reports which sublexicon is being entered or backed out of.

At level 2, the feasible pairs being tried and the state of each rule (automaton) are displayed. The recognizer displays lexicon information as it does at level 1.

At level 3, more detailed information is given on which feasible pairs are being tried and the state of each rule.

For more information on the format of the trace display, see section 8 on trace formats.

SET RULES {ON | OFF} {<rule numbers> | ALL}

The SET RULES command allows you to turn selected rules on or off for testing or debugging purposes. When a rule is turned off, it is completely ignored in the recognition or generation of forms. One effect of this is to cause the recalculation of feasible pairs, considering only the rules which remain on. Use the LIST PAIRS command to see the set of feasible pairs currently in use.

On the command line, you can specify the action ON or OFF followed by a list of rule numbers or the keyword ALL (in which case all rules are turned on or off). Specific rules are turned on or off by listing their rule numbers (shown by the LIST RULES command), each separated by a space.
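The effect of SET RULES on the set of active rules can be sketched in Python as below. This is a toy model rather than PC-KIMMO's implementation: the names set_rules and active_rules, the dictionary of rule states, and the error raised for an unknown rule number are all assumptions made for illustration.

```python
def set_rules(rule_states, action, targets):
    """Return a copy of rule_states with the selected rules switched.

    rule_states: dict mapping rule number -> True (on) or False (off)
    action:      "ON" or "OFF" (case is ignored, as for command keywords)
    targets:     the string "ALL" or an iterable of rule numbers
    """
    turn_on = {"ON": True, "OFF": False}[action.upper()]
    if isinstance(targets, str) and targets.upper() == "ALL":
        selected = list(rule_states)
    else:
        selected = list(targets)
    unknown = [n for n in selected if n not in rule_states]
    if unknown:
        # Hypothetical handling; the manual does not describe this case here.
        raise ValueError(f"no such rule number(s): {unknown}")
    updated = dict(rule_states)
    for n in selected:
        updated[n] = turn_on
    return updated


def active_rules(rule_states):
    """Rule numbers that would take part in generation and recognition."""
    return sorted(n for n, on in rule_states.items() if on)
```

Starting from three rules that are all on, set_rules(states, "OFF", [2, 3]) leaves only rule 1 active, which corresponds to typing SET RULES OFF 2 3; set_rules(states, "ON", "ALL") restores them all.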

SET COMMENT <char>

The SET COMMENT command changes the comment delimiter character (see section 7). The default is semicolon (;). The comment delimiter can also be set with the -c command line option when starting up PC-KIMMO (see section 3).

SET LIMIT {ON | OFF}

The SET LIMIT command limits the result of a generation or recognition function to one form. That is, if limit is set off, then PC-KIMMO backtracks after finding a correct result so that it can find every possible result. With limit set on, after finding one correct result form PC-KIMMO does not backtrack to try to find more results.

SET TIMING {ON | OFF}

The SET TIMING command uses the computer's system clock to time the execution of generation and recognition operations. It displays the result as the number of seconds the operation lasted. It applies to these commands: GENERATE, RECOGNIZE, FILE COMPARE GENERATE, FILE COMPARE RECOGNIZE, FILE COMPARE PAIRS, FILE GENERATE, and FILE RECOGNIZE.

SET VERBOSE {ON | OFF}

The SET VERBOSE command affects the amount of information displayed on the screen during a file comparison operation (either generate, recognize, or pairs, see section 5.12). If verbose is set off, a file comparison operation displays only a dot for each form correctly analyzed, though any exceptional results will cause the complete form and warning messages to be displayed. If verbose is set on, a file comparison operation displays the complete contents of the file (minus comments) plus confirmation and warning messages.

5.7 Turn logging on or off

The LOG and CLOSE commands are used to turn logging on and off.

LOG [<filename>]

The LOG command turns the logging mechanism on. When logging is on, the information displayed on the screen during execution of generation or recognition operations is also written to a disk file whose name is specified in the command line. The <filename> can contain a path, for example, B:\ENGLISH\ENGLISH.LOG.
If no file name is given, a log file named PCKIMMO.LOG is written to the default directory. If a LOG command is given when a log file is already open, then the open log file is closed before the new log file is created.

Logging records the processing of these commands: GENERATE, RECOGNIZE, FILE COMPARE GENERATE, FILE COMPARE RECOGNIZE, FILE COMPARE PAIRS, FILE GENERATE, and FILE RECOGNIZE. Tracing displays are also recorded in a log file.

CLOSE

The CLOSE command turns logging off and closes the log file.

5.8 Show system status

The STATUS command is used to display on the screen the status of various system parameters.

STATUS

The STATUS command displays the names of the rules and lexicon files currently loaded, the name of the log file (if logging is on), the comment delimiter character, and the status of the limit, timing, tracing and verbose flags. It can also be invoked with the synonyms SHOW STATUS or SHOW.

5.9 Show rule or sublexicon

SHOW RULE <rule number>

The SHOW RULE command first displays the number, on/off status, and name of the rule (similar to the LIST RULES command). If the rule is turned on, it then displays each column header of the state table for that rule with the set of feasible pairs that it specifies. This command is used primarily for debugging purposes.

SHOW LEXICON <sublexicon name>

The SHOW LEXICON command displays the contents of a sublexicon. It shows each lexical item, its gloss, and its continuation class. If the continuation class of a lexical entry names an alternation, the alternation is expanded into a list of sublexicon names. Note that this command displays the parts of the lexical entry in the following order (rather than the order in which they appear in the lexicon file): lexical item, gloss, continuation class.

5.10 Generate surface forms from a lexical form

GENERATE [<lexical form>]

The GENERATE command accepts as input a lexical form and returns one or more surface forms.
If no lexical form argument is given, PC-KIMMO supplies a special generator prompt where forms can be typed in directly without the GENERATE keyword. Entering a blank line at the generator prompt returns the program to the main command line prompt.

5.11 Recognize lexical forms from a surface form

RECOGNIZE [<surface form>]

The RECOGNIZE command accepts as input a surface form and returns one or more lexical forms. If no surface form argument is given, PC-KIMMO supplies a special recognizer prompt where forms can be typed in directly without the RECOGNIZE keyword. Entering a blank line at the recognizer prompt returns the program to the main command line prompt.

5.12 Compare data from a file

The COMPARE commands compare data prepared by the user to the results of data processed by PC-KIMMO. The data are contained in files whose formats are described in section 7.

[FILE] COMPARE GENERATE [<filename>]

The COMPARE GENERATE command reads lexical forms from a file, submits them to the generator for analysis, and compares the resulting surface form(s) with the expected results listed in the file. The <filename> can contain a path, for example, B:\ENGLISH\ENGLISH.GEN. A generation comparison file has the default extension .GEN and the default file name DATA.GEN. The format of the generation comparison file is described in section 7.3.

Results of the comparison are reported according to the setting of the verbosity flag (see the SET VERBOSE command described in section 5.6). If verbosity is set off, only exceptions (that is, actual results from the generator that are different from the expected results as specified in the file) are reported. A dot is displayed on the screen as each input (lexical) form is processed. If verbosity is set on, each group of lexical and surface forms in the file is displayed, either with an error message for wrong comparisons or the message OK if the actual and expected results match exactly.

[FILE] COMPARE RECOGNIZE [<filename>]

The COMPARE RECOGNIZE command reads surface forms from a file, submits them to the recognizer for analysis, and compares the resulting lexical form(s) with the expected results specified in the file. The <filename> can contain a path, for example, B:\ENGLISH\ENGLISH.REC. A recognition comparison file has the default extension .REC and the default file name DATA.REC. The format of the recognition comparison file is described in section 7.4.

Results of the comparison are reported according to the setting of the verbosity flag (see the SET VERBOSE command described in section 5.6). If verbosity is set off, only exceptions (that is, actual results from the recognizer that are different from the expected results as specified in the file) are reported. A dot is displayed on the screen as each input (surface) form is processed. If verbosity is set on, each group of surface and lexical forms in the file is displayed, either with an error message for wrong comparisons or the message OK if the actual and expected results match exactly.

[FILE] COMPARE PAIRS [<filename>]

The COMPARE PAIRS command allows lexical:surface pairs of forms listed in the file specified on the command line to be compared in both directions. The <filename> can contain a path, for example, B:\ENGLISH\ENGLISH.PAI. A pairs comparison file has the default extension .PAI and the default file name DATA.PAI. The format of the pairs comparison file is described in section 7.5.

PC-KIMMO considers each pair of forms (a lexical form followed by its surface form). The lexical form is input to the generator to produce one or more surface forms. The surface form listed in the file is compared with the generated surface forms to see if there is a successful match. The surface form listed in the file is then input to the recognizer to produce one or more lexical forms. The lexical form listed in the file is compared with the recognized lexical forms to see if there is a successful match.
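The two-directional check performed for each pair can be sketched in Python as follows. This is only an illustration of the logic described above: compare_pair is a hypothetical name, and the generator and recognizer are stand-in lookup tables with made-up toy data, not the PC-KIMMO functions or the contents of the English description.

```python
def compare_pair(lexical, surface, generate, recognize):
    """Check one lexical:surface pair in both directions.

    generate and recognize are callables that return a list of forms.
    Returns (generation_ok, recognition_ok).
    """
    generated = generate(lexical)      # lexical form -> surface forms
    recognized = recognize(surface)    # surface form -> lexical forms
    return surface in generated, lexical in recognized


# Toy stand-ins for the generator and recognizer (assumed data).
TOY_GENERATOR = {"fox+s": ["foxes"]}
TOY_RECOGNIZER = {"foxes": ["fox+s"]}


def toy_generate(lexical):
    return TOY_GENERATOR.get(lexical, [])


def toy_recognize(surface):
    return TOY_RECOGNIZER.get(surface, [])
```

With these toy tables, the pair fox+s and foxes succeeds in both directions, while fox+s paired with foxs fails both comparisons and would be reported as an exception.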
Results of the comparison are reported according to the setting of the verbosity flag (see the SET VERBOSE command described in section 5.6). If verbosity is set off, only exceptions (that is, cases where one of the comparisons failed) are reported. A dot is displayed on the screen as each pair of forms is processed. If verbosity is set on, each pair of lexical and surface forms in the file is displayed, either with an error message for wrong comparisons or the message OK if the forms match exactly.

5.13 Generate forms from a file

FILE GENERATE <input filename> [<output filename>]

The FILE GENERATE command reads lexical forms from a file, submits them to the generator for analysis, and returns each lexical form followed by the resulting surface form(s). The format of the generation input file is described in section 7.6. If an <output filename> argument is specified, the results are written to that file; otherwise, the results are displayed on the screen. The format of the output file created by this command is identical to a generation comparison file. The <filename> of either file can contain a path, for example, B:\ENGLISH\ENGLISH.LST. The command does not recognize any default file names or extensions. The verbosity flag (see the SET VERBOSE command described in section 5.6) has no effect on the FILE GENERATE command.

5.14 Recognize forms from a file

FILE RECOGNIZE <input filename> [<output filename>]

The FILE RECOGNIZE command reads surface forms from a file, submits them to the recognizer for analysis, and returns each surface form followed by the resulting lexical form(s). The format of the recognition input file is described in section 7.7. If an <output filename> argument is specified, the results are written to that file; otherwise the results are displayed on the screen. The format of the output file created by this command is identical to a recognition comparison file. The <filename> of either file can contain a path, for example, B:\ENGLISH\ENGLISH.LST. The command does not recognize any default file names or extensions.
The verbosity flag (see the SET VERBOSE command described in section 5.6) has no effect on the FILE RECOGNIZE command.

5.15 Execute an operating system command

SYSTEM [<command>]

The SYSTEM command allows you to execute an operating system command from within PC-KIMMO. For example, on an IBM PC-compatible computer, the command SYSTEM DIR will execute the DOS directory command. If no command argument is given, then PC-KIMMO is pushed into the background and a new system command processor shell is started. While you are in the shell, you can execute any commands or programs. To leave the shell and return to PC-KIMMO, type EXIT.

On an IBM PC-compatible computer, the SYSTEM command will not work unless a copy of the DOS system file COMMAND.COM is available. Note that if you are running PC-KIMMO under MS-DOS version 2, issuing the SYSTEM command with no argument will NOT invoke a new processor shell. To get a new shell you must enter the command SYSTEM COMMAND. This will directly execute COMMAND.COM. Type EXIT to return to PC-KIMMO.

The SYSTEM command has the alias ! (exclamation point), which does not require a space between it and the following command. For example, !DIR performs the DOS directory command.

5.16 Edit a file

EDIT <filename>

The EDIT command attempts to edit a file using the editing program specified by the operating system environment variable EDITOR. If this environment variable is not defined, then the command will try to use EDLIN (on a DOS machine) or vi (on a UNIX machine) to edit the file. To set the environment variable, include a line such as this in your AUTOEXEC.BAT file:

SET EDITOR=<filename>

where <filename> specifies the path and full file name of your editing program, for example, \UTIL\EMACS.EXE.

You can use the EDIT command, for example, to invoke your text editor and modify the rules or lexicon files.
After saving the files and leaving the editor, you must LOAD the files again in order for PC-KIMMO to utilize the changes.

5.17 Halt the program

EXIT

The EXIT command causes PC-KIMMO to exit back to the operating system.

QUIT

The command QUIT is the same as EXIT.

6 Alphabetic list of commands

This section documents each command, arranged alphabetically, of the PC-KIMMO system. Square brackets in the command line summaries indicate optional elements. The notation {x | y} means either x or y (but not both). Command keywords and arguments in boldface are typed literally; for instance, the command summary SET TRACING {ON | OFF} means to type either SET TRACING ON or SET TRACING OFF. Command arguments in italics are replaced by elements of the specified type; for instance, the command summary SET COMMENT <char> means to replace <char> with a single character, such as set comment ;.

! [<command>]
    Executes an operating system command or invokes a new command processor shell (same as SYSTEM).

?
    Displays a list of command names.

CLOSE
    Turns logging off and closes the log file.

EDIT <filename>
    Edits <filename> using the editing program specified by the operating system environment variable EDITOR.

EXIT
    Exits PC-KIMMO and returns to the operating system.

[FILE] COMPARE GENERATE [<filename>]
    Reads lexical forms from <filename>, submits them to the generator, and compares the resulting surface form(s) with the expected results listed in <filename>.

[FILE] COMPARE RECOGNIZE [<filename>]
    Reads surface forms from <filename>, submits them to the recognizer, and compares the resulting lexical form(s) with the expected results listed in <filename>.

[FILE] COMPARE PAIRS [<filename>]
    Reads pairs of lexical and surface forms from <filename> and analyzes them to see if the surface form can be generated from the lexical form and the lexical form can be recognized from the surface form.

FILE GENERATE <input filename> [<output filename>]
    Reads a list of lexical forms from <input filename>, submits them to the generator, and returns each lexical form followed by the resulting surface form(s).
FILE RECOGNIZE []
Reads a list of surface forms from the recognition file, submits them to the recognizer, and returns each surface form followed by the resulting lexical form(s).

GENERATE []
Accepts as input a lexical form and returns one or more surface forms.

HELP []
Without a command name argument, displays a list of commands with a brief explanation of each. With a command name argument, displays a usage summary for the command.

LIST LEXICON
Displays on the screen the names of the sublexicons of the lexicon currently in use.

LIST PAIRS
Displays the set of feasible pairs specified by the set of rules currently turned on.

LIST RULES
Displays the current state of the rules that are loaded.

LOAD LEXICON []
Loads the lexicon from the lexicon file.

LOAD RULES []
Loads rules from the rules file.

LOG []
Turns the logging mechanism on.

NEW
Clears the rules and lexicon currently loaded.

QUIT
Same as EXIT.

RECOGNIZE []
Accepts as input a surface form and returns one or more lexical forms.

SET COMMENT
Changes the comment delimiter character. The default is semicolon (;).

SET LIMIT {ON | OFF}
Limits the result of a generation or recognition function to one form.

SET RULES {ON | OFF} { | ALL}
Turns selected rules on or off.

SET TIMING {ON | OFF}
Times the execution of generation and recognition functions and displays the result.

SET TRACING {ON | OFF | }
Turns the tracing mechanism on or off.

SET VERBOSE {ON | OFF}
Determines the amount of information shown on the screen during a file comparison operation.

SHOW [STATUS]
Same as STATUS.

SHOW LEXICON
Displays the contents of the named sublexicon. For each lexical entry it shows the lexical form, gloss, and continuation class.

SHOW RULE
Displays the number, on/off status, and name of the rule (similar to the LIST RULES command). If the rule is turned on, it then displays each column header of the state table for that rule with the set of feasible pairs that it specifies.
STATUS
Displays the names of the rules and lexicon files currently loaded, the name of the log file (if logging is on), the comment delimiter character, and the status of the limit, timing, tracing, and verbose flags. Obeys the synonyms SHOW STATUS and SHOW.

SYSTEM []
Executes an operating system command or invokes a new command processor shell. See also !.

TAKE []
Reads and executes commands from the take file.

7 File formats

This section describes the formats for the files that are used as input to PC-KIMMO. In any of the files, comments can be added to any line by preceding the comment with the comment delimiter character. This character is normally a semicolon (;), but can be changed either on the PC-KIMMO command line with the -c option (see section 3) or with the SET COMMENT command (see section 5.6). Anything following a comment delimiter (until the end of the line) is considered part of the comment and is ignored by PC-KIMMO.

In the descriptions below, reference to the use of a space character implies any whitespace character (that is, any character treated like a space character). The following control characters when used in a file are whitespace characters: ^I (ASCII 9, tab), ^J (ASCII 10, line feed), ^K (ASCII 11, vertical tab), ^L (ASCII 12, form feed), and ^M (ASCII 13, carriage return). The control character ^Z (ASCII 26) cannot be used because MS-DOS interprets it as marking the end of a file. The control character ^@ (ASCII 0, null) also cannot be used.

Examples of each of the following file types are found on the release diskette as part of the English description.

7.1 Rules file

The general structure of the rules file is a list of declarations composed of a keyword followed by data. The set of valid keywords is ALPHABET, NULL, ANY, BOUNDARY, SUBSET, RULE, and END. Only the SUBSET and RULE keywords can appear more than once. The ALPHABET declaration must appear first in the file.
The other declarations can appear in any order. The NULL, ANY, BOUNDARY, and SUBSET declarations can even be interspersed among the rules. However, these declarations must appear before any rule that uses them or an error will result. Figure 1 shows the structure of a rules file. The order of the keyword declarations is according to common style. Note that the notation {x | y} means either x or y (but not both). The following specifications apply to the rules file. PC-KIMMO Reference Manual Page 18 Figure 1 Structure of the rules file ALPHABET NULL ANY <"wildcard" symbol> BOUNDARY SUBSET . (more subsets) . . RULE {: | .} . (more states) . . . (more rules) . . END o Extra spaces, blank lines, and comment lines are ignored. o The first line of the file (excluding comment lines) must contain the keyword ALPHABET. o is a list of single characters that make up the combined alphabet of all the characters used in both lexical and surface representations. Each character must be separated from the others by at least one space. The list can span multiple lines, but ends with the next valid keyword. All alphanumeric characters (such as a, B, and 2), symbols (such as $ and +), and punctuation characters (such as . and ?) are available as alphabet members. The characters in the IBM extended character set (above ASCII 127) are also available. Control characters (below ASCII 32) can also be used, with the exception of whitespace characters (see above), ^Z (end of file), and ^@ (null). The alphabet can contain a maximum of 255 characters. o After the ALPHABET declaration, the NULL, ANY, BOUNDARY, SUBSET, and RULE declarations can occur in any order. o The BOUNDARY declaration is obligatory, even if the rules do not use a BOUNDARY symbol. This is because the lexicon file requires a BOUNDARY symbol. The NULL, ANY, and SUBSET declarations are not obligatory if the rules do not use a NULL symbol, an ANY symbol, or subsets. 
o The keyword NULL is followed by a single character that represents a null (empty, zero) element. The NULL symbol is considered to be an alphabetic character, but cannot also be listed in the ALPHABET declaration. The NULL symbol declared in the rules file is also used in the lexicon file to represent a null lexical entry.

o The keyword ANY is followed by a <"wildcard" symbol>, a single character that represents a match of any character in the alphabet. The ANY symbol is not considered to be an alphabetic character, though it is used in the column headers of state tables. It cannot be listed in the ALPHABET declaration. It is not used in the lexicon file.

o The keyword BOUNDARY is followed by a single character that represents an initial or final word boundary. The BOUNDARY symbol is considered to be an alphabetic character, but cannot also be listed in the ALPHABET declaration. When used in the column header of a state table, it can only appear as the pair #:# (where, for instance, # has been declared as the BOUNDARY symbol). The BOUNDARY symbol is also used in the lexicon file in the continuation class field of a lexical entry to indicate the end of a word (that is, no continuation class).

o The keyword SUBSET is followed by the subset name and the subset's character list. The subset name is a single word (one or more characters) that names the list of characters that follows it. The subset name must be unique (that is, if it is a single character it cannot also be in the alphabet or be any other declared symbol). It can be composed of any characters (except space); that is, it is not limited to the characters declared in the ALPHABET section. It must not be identical to any keyword used in the rules file. The subset name is used in rules to represent all members of the subset of the alphabet that it defines. Note that SUBSET declarations can be interspersed among the rules. This allows subsets to be placed near the rule that uses them if such a style is desired.
However, a subset must be declared before a rule that uses it. o is a list of single characters, each of which is separated by at least one space. The list can span multiple lines. Each character in the list must be a member of the previously defined ALPHABET with the exception of the NULL symbol, which can appear in a subset list but is not included in the ALPHABET declaration. Neither the ANY symbol nor the BOUNDARY symbol can appear in a subset character list. o The keyword RULE signals that a state table immediately follows. o is the name or description of the rule which the state table encodes. It functions as an annotation to the state table and has no effect on the computational operation of the table. It is displayed by the LIST RULES and SHOW RULE commands and is also displayed in traces. The rule name must be surrounded by a pair of identical delimiter characters. Any material can be used between the delimiters of the rule name with the exception of the current comment delimiter character and of course the rule name delimiter character of the rule itself. Each rule in the file can use a different pair of delimiters. The rule name must be all on one line, but it does not have to be on the same line as the RULE keyword. o is the number of states (rows in the table) that will be defined for this table. The states must begin at 1 and go in sequence through the number defined here (that is, gaps in state numbers are not allowed). o is the number of state transitions (columns in the table) that will be defined for each state. PC-KIMMO Reference Manual Page 20 o is a list of elements separated by one or more spaces. Each element represents the lexical half of a lexical:surface correspondence which, when matched, defines a state transition. Each element in the list must be either a member of the alphabet, a subset name, the NULL symbol, the ANY symbol, or the BOUNDARY symbol (in which case the corresponding surface character must also be the BOUNDARY symbol). 
The list can span multiple lines, but the number of elements in the list must be equal to the number of columns defined for the rule.

o is a list of elements separated by one or more spaces. Each element represents the surface half of a lexical:surface correspondence which, when matched, defines a state transition. Each element in the list must be either a member of the alphabet, a subset name, the NULL symbol, the ANY symbol, or the BOUNDARY symbol (in which case the corresponding lexical character must also be the BOUNDARY symbol). The list can span multiple lines, but the number of elements in the list must be equal to the number of columns defined for the rule.

o is the number of the state or row of the table. The first state number must be 1, and subsequent state numbers must follow in numerical sequence without any gaps.

o {: | .} is the final or nonfinal state indicator. This should be a colon (:) if the state is a final state and a period (.) if it is a nonfinal state. It must follow the state number with no intervening space.

o is a list of state transition numbers for a particular state. Each number must be between 1 and the number of states (inclusive) declared for the table. The list can span multiple lines, but the number of elements in the list must be equal to the number of columns declared for this rule.

o The keyword END follows all rules and indicates the end of the rules file. Any material in the file thereafter is ignored by PC-KIMMO. The END keyword is optional; the physical end of the file also terminates the rules file.

7.2 Lexicon file

The general structure of the lexicon file is a list of declarations composed of a keyword followed by data. The set of valid keywords is ALTERNATION, LEXICON, INCLUDE, and END. The only required declaration is LEXICON INITIAL; that is, a lexicon file must minimally be composed of one sublexicon named INITIAL.
The declarations can appear in any order with the exception that an alternation name used in the continuation class field of a lexical entry (including a lexical entry in an INCLUDE file) must first be declared with the ALTERNATION keyword. Figure 2 shows the structure of a lexicon file. The order of the keyword declarations is according to common style. Note that the notation {x | y} means either x or y (but not both). The following specifications apply to the lexicon file. PC-KIMMO Reference Manual Page 21 Figure 2 Structure of the lexicon file ALTERNATION . (more alternations) . . LEXICON INITIAL { | } . (more lexical entries) . . INCLUDE . (more include files) . . LEXICON { | } . (more lexical entries) . . . (more sublexicons) . . END o Extra spaces, blank lines, and comment lines are ignored. o The keyword ALTERNATION is followed by an and an . o is a name associated with the following . It is a word composed of one or more characters, not limited to the ALPHABET characters declared in the rules file. An alternation name can be any word other than a keyword used in the lexicon file. The program does not check to see if an alternation name is actually used in the lexicon file. o is a list of sublexicon names. It can span multiple lines until the next valid keyword is encountered. Each sublexicon name in the list must be declared at some point in the file with the LEXICON keyword. Although it is not enforced at the time the lexicon file is loaded, an undeclared sublexicon named in an alternation list will cause an error when a recognition function tries to use it. o The keyword LEXICON is followed by a and a list of lexical entries. o is the name associated with a sublexicon. It is a word composed of one or more characters, not limited to the alphabetic characters declared in the rules file. A sublexicon name can be any word other than a keyword used in the lexicon file. 
o Each sublexicon section contains lexical entries, each of which is composed of three parts or fields separated by one or more spaces. Each lexical entry must be entirely on one line. The three parts are the lexical item, the continuation class, and the gloss.

o The lexical item is one or more characters that represent an element (typically a morpheme or word) of the lexicon. Each character must be in the alphabet defined for the language. The lexical item uses only the lexical subset of the alphabet.

o The continuation class field of a lexical entry must contain either an alternation name or the BOUNDARY symbol declared in the rules file.

o The gloss is a string of text surrounded by a pair of identical delimiter characters. Whenever the lexical item in the lexical entry is matched, everything between the delimiters is appended to the result. If there is no gloss associated with the lexical item, the gloss field must contain a pair of delimiters with nothing in between (for example, ""). Any material can be used between the delimiters of the gloss with the exception of the current comment delimiter character and of course the gloss delimiter character of the entry itself. Each lexical entry in the file can use a different pair of delimiters. The gloss must be all on one line with the rest of the lexical entry.

o The INCLUDE keyword is followed by the name of a file containing another lexicon file. This included lexicon file has the same structure and specifications as the main lexicon file with the exception that it cannot contain an INCLUDE declaration; that is, INCLUDE files cannot be nested. Alternation names and sublexicon names in INCLUDE files must be unique; that is, not used anywhere else in the lexicon. The END keyword (or the physical end of the file) will terminate reading of the included file and return to reading the main lexicon file.

o The keyword END follows all lexical information and indicates the end of the lexicon file.
Any material in the file thereafter is ignored by PC-KIMMO. See also the use of the END keyword in an included file. The END keyword is optional; the physical end of the file also terminates the lexicon file.

7.3 Generation comparison file

The generation comparison file serves as input to the COMPARE GENERATE command (see section 5.12). It consists of groupings of a lexical form followed by one or more surface forms that are expected to be generated from the lexical form. The following specifications apply to the generation comparison file.

o Each form must be on a separate line.

o Leading spaces are ignored.

o A blank line (or end of file) indicates the end of a grouping. Extra blank lines are ignored.

o The first form in each grouping is the lexical form to be input to the generator. Its gloss does not have to be included, since the generator does not use the lexicon; however, including a gloss with the lexical form does no harm--it is simply ignored.

o Succeeding forms in each grouping are surface forms that are the expected output of the generator.

7.4 Recognition comparison file

The recognition comparison file serves as input to the COMPARE RECOGNIZE command (see section 5.12). It consists of groupings of a surface form followed by one or more lexical forms that are expected to be recognized from the surface form. The following specifications apply to the recognition comparison file.

o Each form must be on a separate line.

o Leading spaces are ignored.

o A blank line (or end of file) indicates the end of a grouping. Extra blank lines are ignored.

o The first form in each grouping is the surface form to be input to the recognizer.

o Succeeding forms in each grouping are lexical forms that are the expected output of the recognizer. The gloss of a form follows it on the same line, separated by one or more spaces. The gloss must match exactly (including spaces) the way it is output from the recognizer.
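The grouping rules shared by the generation and recognition comparison files can be illustrated with a short Python sketch. This is not part of PC-KIMMO; the function name parse_comparison_groupings is hypothetical, and the sketch assumes the text contains no comment lines (comment handling is omitted). It only shows how blank lines delimit groupings and how the first form of each grouping is separated from the expected forms.

```python
def parse_comparison_groupings(text):
    """Split comparison-file text into (input_form, expected_forms) pairs.

    Illustrative only: comment handling is omitted, and treating a
    whitespace-only line as blank is an assumption of this sketch.
    """
    groupings = []
    current = []
    for raw_line in text.splitlines():
        line = raw_line.lstrip()          # leading spaces are ignored
        if not line.strip():              # blank line ends a grouping
            if current:                   # extra blank lines are ignored
                groupings.append((current[0], current[1:]))
                current = []
            continue
        current.append(line)
    if current:                           # end of file also ends a grouping
        groupings.append((current[0], current[1:]))
    return groupings


# One generation-style grouping and one recognition-style grouping,
# using the forms from the English example in this manual.
sample = "`fox+s\n  foxes\n\n\nfoxes\n`fox+s [ N(fox)+PL ]\n"
for input_form, expected in parse_comparison_groupings(sample):
    print(input_form, "->", expected)
```

In a generation comparison file the first form of each grouping is lexical and the rest are surface forms; in a recognition comparison file the roles are reversed, and the expected lexical forms carry their glosses on the same line.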
7.5 Pairs comparison file

The pairs comparison file serves as input to the COMPARE PAIRS command (see section 5.12). It consists of pairs of lexical and surface forms; that is, a lexical form followed by exactly one surface form. It is expected that the surface form will be generated from the lexical form and that the lexical form will be recognized from the surface form. Glosses do not have to be included with lexical forms, since the generator does not use the lexicon; however, including a gloss with the lexical form does no harm--it is simply ignored. When recognizing a surface form, the lexicon is used to identify the constituent morphemes and verify that they occur in the correct order, but the gloss part of a lexical entry is not used. The following specifications apply to the pairs comparison file.

o Each form must be on a separate line.

o Leading spaces are ignored.

o A blank line (or end of file) indicates the end of a grouping. Extra blank lines are ignored.

o The first form of a pair is the lexical form, which is input to the generator. It is the expected output on inputting the second (surface) form to the recognizer. The gloss is not included with the lexical form.

o The second form of a pair is the surface form, which is input to the recognizer. It is the expected output on inputting the first (lexical) form to the generator.

7.6 Generation file

The generation file consists of a list of lexical forms. It serves as input to the FILE GENERATE command (see section 5.13), which returns a file (or screen display) whose format is identical to the generation comparison file. The following specifications apply to the generation file.

o Each form must be on a separate line.

o Extra white space, blank lines, and comment lines are ignored.

o Each form is assumed to be a lexical form. If a gloss is included, it is ignored.

7.7 Recognition file

The recognition file consists of a list of surface forms.
It serves as input to the FILE RECOGNIZE command (see section 5.14), which returns a file (or screen display) whose format is identical to the recognition comparison file. The following specifications apply to the recognition file.

o Each form must be on a separate line.

o Extra spaces, blank lines, and comment lines are ignored.

o Each form is assumed to be a surface form.

7.8 Summary of default file names and extensions

Figure 3 summarizes the default file names and extensions assumed by PC-KIMMO. Two entries are given for the different kinds of files. The first is the name PC-KIMMO will assume if no file name at all is given to a command that expects that kind of file. The second entry (with the *) shows what extension PC-KIMMO will add if a file name without an extension is given.

Figure 3 Default file names and extensions

Rules file:                   RULES.RUL     *.RUL
Lexicon file:                 LEXICON.LEX   *.LEX
Generation comparison file:   DATA.GEN      *.GEN
Recognition comparison file:  DATA.REC      *.REC
Pairs comparison file:        DATA.PAI      *.PAI
Take file:                    PCKIMMO.TAK   *.TAK
Log file:                     PCKIMMO.LOG

8 Trace formats

This section explains how to read the output of the generator and recognizer traces. Traces are produced by the SET TRACING command described in section 5.6. The amount of detail shown in the trace display is set by the tracing level. The argument to the SET TRACING command can range from 0 to 3, where 0 is no tracing at all and 3 is the most detailed level of tracing.

8.1 Generator trace

The purpose of the generator trace is to allow the user to see how a lexical form is processed through multiple recursive calls to the generator. The generator algorithm used to process the form is described in section 9.1.
Figure 4 Level 1 generator trace `fox+s RESULT = 0fox0es foxes There are three levels of tracing differing in the amount of detail they display: Level 1 gives the least amount of detail, level 2 (the default) gives a moderate amount of detail, and level 3 gives the most detail. Figure 4 is a level 1 generator trace of the lexical form `fox+s (taken from the English example). The only difference from no tracing at all is that the RESULT line is displayed. This line differs from the normal result that is returned because it prints all NULL symbols in the output surface form. Figure 5 is from a level 2 generator trace for the form `fox+s. To limit the size of the trace, the Gemination rules (14 and 15) were turned off. Line numbers and column numbers are printed here for reference in the description that follows. Each description refers to an element beginning at the line and column indicated. PC-KIMMO Reference Manual Page 26 Figure 5 Level 2 generator trace 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 ` fox+s 2 0 #:# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 0 `:0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 f:f 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 0 5 2 o:o 1 1 1 1 2 2 1 3 3 1 1 1 1 1 1 0f 6 3 x:x 1 1 1 1 1 1 1 7 4 2 1 1 1 1 1 0fo 7 4 +:0 1 1 3 3 2 2 1 4 4 1 1 1 1 1 1 0fox 8 5 s:s 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 0fox0 9 6 #:# 1 1 6 2 2 2 3 3 4 1 1 1 1 1 1 0fox0s 10 6- BLOCKED BY RULE 3: Epenthesis, 0:0 /<= [S|ch|sh|y:i] +:0___s[+:0|#] 11 5< 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 0fox0 12 5 s:0 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 0fox0 13 5- BLOCKED BY RULE 7: S-deletion, s:0 <=> +:0 (0:e) s +:0 '___ 14 5 0:e 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 0fox0 15 6 s:s 1 1 1 6 1 1 2 4 4 1 1 1 1 1 1 0fox0e 16 7 #:# 1 1 4 7 2 2 3 3 4 1 1 1 1 1 1 0fox0es 17 7 1 1 1 1 1 1 1 3 4 1 1 1 1 1 1 0fox0es 18 19 RESULT = 0fox0es 20 21 6< 1 1 1 6 1 1 2 4 4 1 1 1 1 1 1 0fox0e 22 6 s:0 1 1 1 6 1 1 2 4 4 1 1 1 1 1 1 0fox0e 23 6- BLOCKED BY RULE 4: Epenthesis, 0:e => [S|ch|sh|y:i] +:0___s[+:0|#] 24 6 0:e 1 1 1 6 1 1 2 4 4 1 1 1 1 1 1 0fox0e 
25 6- BLOCKED BY RULE 4: Epenthesis, 0:e => [S|ch|sh|y:i] +:0___s[+:0|#] 26 5< 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 0fox0 27 4< 1 1 3 3 2 2 1 4 4 1 1 1 1 1 1 0fox 28 4 0:e 1 1 3 3 2 2 1 4 4 1 1 1 1 1 1 0fox 29 4- BLOCKED BY RULE 4: Epenthesis, 0:e => [S|ch|sh|y:i] +:0___s[+:0|#] ... 39 0< 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 40 0 0:e 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 41 0- BLOCKED BY RULE 4: Epenthesis, 0:e => [S|ch|sh|y:i] +:0___s[+:0|#] 42 foxes o Line 1: Input line. Lexical form input to the generator function. o Line 19: RESULT line. Surface form produced by the generator function. At the point where the input lexical form is empty and each automaton is in a final state, the trace shows that the generator has recorded a result. The generator continues looking for additional results (lines 21 and following). o Column 1: Level number (all lines except 1, 19, and 42). This represents the level of recursion. Level 0 represents the initial invocation of the generator. Notice that the number coincides with the number of characters in the result string so far. o Column 1: Backtracking indicator (lines 10, 11, 13). The symbol - indicates that the generator is blocked at that level. The symbol < indicates that the generator is backtracking (that is, returning to a lower level to try another path). o Column 2: Input pair (lines 2-9, 12, 14-16). This is the lexical:surface pair (from the set of feasible pairs) that is PC-KIMMO Reference Manual Page 27 currently being considered by the generator (for example, f:f on line 4). The rest of the line shows the results of stepping the automata with the pair as input. The results are indicated by either a new state configuration (for example, line 5) or a BLOCKED BY RULE message (for example, line 10). o Lines 10, 13: BLOCKED BY RULE message. Indicates that a feasible pair input to the function that steps the automata caused a rule to fail. Gives the number and name of the rule (from the header line of the state table) that failed. 
o Columns 3-17: State configuration (lines 2-9, 11-12, 14-17). These are the current states of each of the rules. The leftmost number is the state of rule 1, the second is rule 2, and so on. o Column 18: Result (lines 4-9, 11-12, 14-17). This is the current value of the result string. o Lines 21-41: The generator continues to backtrack, looking for other possible paths to a result, until finding no other path it returns to its initial state. There is one other tracing message not exemplified in the above display. This is the END OF INPUT message. It indicates that the end of the input form has been reached but the generator function has failed on the rule specified because it was not in a final state. For example, END OF INPUT, FAILED RULE 4: Palatalization would indicate that when the end of the input form was reached, rule 4 was not left in a final state. Figure 6 is part of a level 3 trace for the same form. Figure 6 Level 3 generator trace 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 `fox+s 2 0 #:# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 0 `:0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 f:f 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 0 5 2 o:o 1 1 1 1 2 2 1 3 3 1 1 1 1 1 1 0f 6 3 x:x 1 1 1 1 1 1 1 7 4 2 1 1 1 1 1 0fo 7 4 +:0 1 1 3 3 2 2 1 4 4 1 1 1 1 1 1 0fox 8 5 s:s 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 0fox0 9 6 #:# 1 1 6 2 2 2 3 3 4 1 1 1 1 1 1 0fox0s 10 6- 1 1 0 ? ? ? ? ? ? ? ? ? ? ? ? 0fox0s 11 BLOCKED BY RULE 3: Epenthesis, 0:0 /<= [S|ch|sh|y:i] +:0___s[+:0|#] 12 5< 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 0fox0 13 5 s:0 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 0fox0 14 5- 1 1 1 1 1 1 0 ? ? ? ? ? ? ? ? 
0fox0 15 BLOCKED BY RULE 7: S-deletion, s:0 <=> +:0 (0:e) s +:0 '___ 16 5 0:e 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 0fox0 17 6 s:s 1 1 1 6 1 1 2 4 4 1 1 1 1 1 1 0fox0e 18 7 #:# 1 1 4 7 2 2 3 3 4 1 1 1 1 1 1 0fox0es 19 7 1 1 1 1 1 1 1 3 4 1 1 1 1 1 1 0fox0es 20 21 RESULT = 0fox0es 22 23 foxes PC-KIMMO Reference Manual Page 28 The level 3 trace differs from the level 2 trace in how it displays rule failures that block the generator. Compare line 10 in the level 2 trace with lines 10 and 11 of the level 3 trace. The level 3 trace explicitly shows what state the automata are in after stepping them. In line 10 of the level 3 trace we can see that the proposed input pair puts rule 3 in state 0, which means that it fails. Notice that the rest of the state array is filled with question marks. This is because if one rule fails the whole configuration fails, so the rest of the rules are not even tried. (This shows that even though conceptually the automata operate in parallel they must still be stepped one at a time). 8.2 Recognizer trace The purpose of the recognizer trace is to allow the user to see how a surface form is processed through multiple recursive calls to the recognizer. The recognizer algorithm used to process the form is described in section 9.2. There are three levels of tracing differing in the amount of detail they display: level 1 gives the least amount of detail, level 2 (the default) gives a moderate amount of detail, and level 3 gives the most detail. Figure 7 is a level 1 recognizer trace of the surface form foxes (taken from the English example). 
Figure 7 Level 1 recognizer trace foxes ENTERING LEXICON INITIAL ENTERING LEXICON N_ROOT ENTERING LEXICON NUMBER ENTERING LEXICON GENITIVE ENTERING LEXICON End RESULT = `fox+0s [ N(fox)+ PL ] BACKING UP FROM LEXICON End TO LEXICON GENITIVE BACKING UP FROM LEXICON GENITIVE TO LEXICON NUMBER ENTERING LEXICON GENITIVE ENTERING LEXICON End BACKING UP FROM LEXICON End TO LEXICON GENITIVE BACKING UP FROM LEXICON GENITIVE TO LEXICON NUMBER BACKING UP FROM LEXICON NUMBER TO LEXICON N_ROOT BACKING UP FROM LEXICON N_ROOT TO LEXICON INITIAL ENTERING LEXICON ADJ_PREFIX ... BACKING UP FROM LEXICON V_ROOT_NEG TO LEXICON V_PREFIX BACKING UP FROM LEXICON V_PREFIX TO LEXICON INITIAL `fox+s [ N(fox)+ PL ] Like the level 1 generator trace, the level 1 recognizer trace displays the RESULT line but does not show the feasible pairs as they are tried or the states of the rules. However, it does display a record of how the recognizer moves through the lexicon, either with an ENTERING or a BACKING UP message. PC-KIMMO Reference Manual Page 29 Figure 8 is from a level 2 recognizer trace of the form foxes. To limit the size of the trace, the Gemination rules (14 and 15) were turned off. Line numbers and column numbers are printed here for reference in the description that follows. Each description refers to an element beginning at the line and column indicated. 
Figure 8 Level 2 recognizer trace 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 foxes 2 0 #:# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 ENTERING LEXICON INITIAL 4 ACCEPTING NULL ENTRY 5 ENTERING LEXICON N_ROOT 6 0 `:0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [ 7 1 s:0 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 ` [ 8 1- BLOCKED BY RULE 7: S-deletion, s:0 <=> +:0 (0:e) s +:0 '___ 9 1 f:f 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 ` [ 10 2 o:o 1 1 1 1 2 2 1 3 3 1 1 1 1 1 1 `f [ 11 3 x:x 1 1 1 1 1 1 1 7 4 2 1 1 1 1 1 `fo [ 12 ENTERING LEXICON NUMBER 13 4 +:0 1 1 3 3 2 2 1 4 4 1 1 1 1 1 1 `fox [ N(fox) 14 5 s:0 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 `fox+ [ N(fox) 15 5- BLOCKED BY RULE 7: S-deletion, s:0 <=> +:0 (0:e) s +:0 '___ 16 5 0:e 1 1 5 5 1 1 2 4 4 1 1 1 1 1 1 `fox+ [ N(fox) 17 6 s:0 1 1 1 6 1 1 2 4 4 1 1 1 1 1 1 `fox+0 [ N(fox) 18 6- BLOCKED BY RULE 4: Epenthesis, 0:e => [S|ch|sh|y:i] +:0___s[+:0|#] 19 6 s:s 1 1 1 6 1 1 2 4 4 1 1 1 1 1 1 `fox+0 [ N(fox) 20 ENTERING LEXICON GENITIVE 21 7 +:0 1 1 4 7 2 2 3 3 4 1 1 1 1 1 1 `fox+0s [ N(fox)+PL 22 8- BLOCKED IN LEXICON GENITIVE, INPUT = 23 7< 1 1 4 7 2 2 3 3 4 1 1 1 1 1 1 `fox+0s [ N(fox)+PL 24 ACCEPTING NULL ENTRY 25 ENTERING LEXICON End 26 ACCEPTING NULL ENTRY 27 7 #:# 1 1 4 7 2 2 3 3 4 1 1 1 1 1 1 `fox+0s [ N(fox)+PL 28 7 1 1 1 1 1 1 1 3 4 1 1 1 1 1 1 `fox+0s [ N(fox)+PL 29 30 RESULT = `fox+0s [ N(fox)+PL ] ... 108 BACKING UP FROM LEXICON V_ROOT_NEG TO LEXICON V_PREFIX 109 0< 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [ 110 BACKING UP FROM LEXICON V_PREFIX TO LEXICON INITIAL 111 0< 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 112 `fox+s [ N(fox)+PL ] o Line 1: Input line. Surface form input to the recognizer function. o Line 30: RESULT line. At the point where there are no lexicons in the continuation class of an entry, the input surface form is empty, and each automaton is in a final state, the trace shows that the recognizer has recorded a result. The recognizer continues looking for additional results (lines 32 and following). o Column 1: Level number (lines 2, 6-11, 13-19, 21-23, 27-28). 
This represents the level of recursion. Level 0 represents the initial invocation of the recognizer. Notice that the number coincides with the number of characters in the result string so far. PC-KIMMO Reference Manual Page 30 o Column 1: Backtracking indicator (lines 8, 15, 18, 22-23). The symbol - indicates that the recognizer is blocked at that level. The symbol < indicates that the recognizer is backtracking (that is, returning to a lower level to try another path). o Column 2: Input pair (lines 2, 6-7, 9-11, and so on). This is the lexical:surface pair (from the set of feasible pairs) that is currently being considered by the recognizer (for example, f:f on line 9). The results of stepping the automata with the pair as input are indicated by either a new state configuration (for example, line 10) or a BLOCKED BY RULE message (for example, line 15). o Lines 3, 5, 12, 20, 25: ENTERING LEXICON message. This is the name of the sublexicon that the recognizer is about to search. o Lines 4, 24, 26: ACCEPTING NULL ENTRY message. Indicates that a null lexical entry (that is, an entry whose lexical item is the NULL symbol) has been accepted. o Line 22: BLOCKED IN LEXICON message. Indicates that no lexical entry could be found in the current lexicon that continues with the input pair under consideration. The remaining part of the input form is displayed on the line (in line 22 it happens that nothing is left of the input form). o Lines 108, 110: BACKING UP message. Indicates that there were no further sublexicons left in the continuation class, so the recognizer must back up to the previous lexicon branch. o Lines 8, 15, 18: BLOCKED BY RULE message. Indicates that a feasible pair input to the function that steps the automata caused a rule to fail. Gives the number and name of the rule (from the header line of the state table) that failed. o Columns 3-17: State configuration (lines 2, 6-7, 9-11, and so on). These are the current states of each of the rules. 
The leftmost number is the state of rule 1, the second is rule 2, and so on.
o Column 18: Result (lines 6-7 and so on). This is the current value of the result string.
o Lines 108-111: The recognizer continues to backtrack, looking for other possible paths to a result, until, finding no other path, it returns to its initial state.

The END OF INPUT message may also occur in a recognizer trace. See section 8.1 on the generator trace for an explanation of it.

Figure 9 Level 3 recognizer trace

     1  2     3  4  5  6  7  8  9 10 11 12 13 14 15 16 17  18
  1  foxes
  2  0  #:#   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  3  ENTERING LEXICON INITIAL
  4  0- -:0 LEXICAL CHARACTER NOT MATCHED
  5  0- `:0 LEXICAL CHARACTER NOT MATCHED
  6  0- +:0 LEXICAL CHARACTER NOT MATCHED
  7  0- s:0 LEXICAL CHARACTER NOT MATCHED
  8  0- e:0 LEXICAL CHARACTER NOT MATCHED
  9  0- f:f LEXICAL CHARACTER NOT MATCHED
 10  ACCEPTING NULL ENTRY
 11  ENTERING LEXICON N_ROOT
 12  0- -:0 LEXICAL CHARACTER NOT MATCHED
 13  0  `:0   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  [
 14  1- -:0 LEXICAL CHARACTER NOT MATCHED
 15  1- `:0 LEXICAL CHARACTER NOT MATCHED
 16  1- +:0 LEXICAL CHARACTER NOT MATCHED
 17  1  s:0   1  1  1  1  1  1  1  2  2  1  1  1  1  1  1  ` [
 18  1-       1  1  1  1  1  1  0  ?  ?  ?  ?  ?  ?  ?  ?  ` [
 19  BLOCKED BY RULE 7: S-deletion, s:0 <=> +:0 (0:e) s +:0 '___
     ...
 75  ACCEPTING NULL ENTRY
 76  7  #:#   1  1  4  7  2  2  3  3  4  1  1  1  1  1  1  `fox+0s [ N(fox)+PL
 77  7        1  1  1  1  1  1  1  3  4  1  1  1  1  1  1  `fox+0s [ N(fox)+PL
 78
 79  RESULT = `fox+0s [ N(fox)+PL ]
 80
 81  `fox+s [ N(fox)+PL ]

Figure 9 is part of a level 3 trace for the same form. Like level 3 of the generator trace, level 3 of the recognizer trace explicitly shows the state array when a rule fails. Compare line 8 of the level 2 trace with lines 18 and 19 of the level 3 trace. In addition, the level 3 recognizer trace shows pairs that are weeded out by the lexicon even before they are tried with the rules. Compare lines 3-4 of the level 2 trace with lines 3-10 of the level 3 trace.
In lines 4-9 the level 3 trace shows explicitly several pairs that are tried but immediately fail. Since the recognizer is at the beginning of the input form, the only possible feasible pairs to try are those whose surface character is 0 (the NULL symbol) or f (the first character of the input form foxes). Rather than trying each of these pairs with the rules, the recognizer first looks to see if the lexical character of each pair matches any lexical character available in the sublexicon it is currently searching. In each case the match fails, indicated by the message LEXICAL CHARACTER NOT MATCHED. After trying all the pairs, the lexicon accepts the null entry and enters a new sublexicon. This exhaustive process takes place at each point in the recognition process where the recognizer is trying a new pair.

9 Algorithms

The algorithms used by PC-KIMMO to generate surface forms and recognize lexical forms are based on descriptions in Karttunen 1983.

9.1 Generating surface forms

The generator function recursively computes surface forms from a lexical form using a set of two-level rules expressed as finite state automata. The generator function does not make use of the lexicon. This means that it will accept input forms that are not found in the lexicon or that even violate the lexicon's constraints on morpheme order, and will still apply the phonological rules to them. To produce a surface form from a lexical form, the generator processes the input form one character at a time, left to right. For each lexical character, it tries every surface character that has been declared as corresponding to it in a feasible pair sanctioned by the description.

The generator function has these inputs:

   Lexical form: Initially the input form, this string contains whatever is left to process. As the function is recursively called, this string gets shorter as the result string gets longer.
   Result: Initially empty, this string contains the results of the generator up to the point of the current function call.
   Rules: This is the set of active finite state automata defined for this language.
   Configuration: This is an array representing the current state of all rules (automata). Initially, all states are set to 1.

The generator function also uses a list of feasible pairs sanctioned by the set of rules; these are all the lexical:surface pairs of alphabetic characters that appear as column headers in the state tables. The input pair is a feasible pair selected by the generator as a possible next lexical:surface pair in the process of computing a surface form that corresponds to the given lexical form. Each time the generator is called it iteratively goes through the list of feasible pairs, selecting one as the input pair.

The generator algorithm works as follows:

1. If the lexical form is empty (that is, there are no more characters in it to process), do the following steps:
   (a) If any of the state tables contains a word boundary column header, step the automata using an input pair consisting of the BOUNDARY symbol as both the lexical and surface character. If this fails, then the result is rejected and the function returns to the previous level.
   (b) Check that the configuration array contains a valid final state for each of the rules. If so, then the result is accepted and added to the output list. Otherwise, it is rejected. In either case, the function returns to the previous level.

Otherwise, if the lexical form is not empty (that is, it contains more characters to process), do steps 2 and 3.

2. For each input pair containing the first character in the lexical form as the lexical character, do the following steps:
   (a) Step the automata using the input pair and the input configuration array, producing a new configuration.
   (b) If this succeeds, recursively call the generator function with these inputs:
      Lexical form: This is the input lexical form with the first character removed.
      Result: This is the input result string with the surface character from the current input pair appended.
      Configuration: This is the state array produced by stepping the automata.
   (c) If this fails, choose another input pair from the list of feasible pairs and do either step 2 or step 3.

3. For each input pair containing the NULL symbol as the lexical character, do the following steps:
   (a) Step the automata using the input pair and the input configuration array to produce a new configuration.
   (b) If this succeeds, recursively call the generator function with these inputs:
      Lexical form: This is the input lexical form with no character removed (since the lexical character posited was NULL).
      Result: This is the input result string with the surface character from the current input pair appended.
      Configuration: This is the state array produced by stepping the automata.
   (c) If this fails, choose another input pair from the list of feasible pairs and do either step 2 or step 3.

9.2 Recognizing lexical forms

The recognizer function recursively computes lexical forms from a surface form using a lexicon and a set of two-level rules expressed as finite state automata. The recognizer function operates in a way similar to the generator, only in a surface to lexical direction. The recognizer processes the surface input form one character at a time, left to right. For each surface character, it tries every lexical character that has been declared as corresponding to it in a feasible pair sanctioned by the description.

The recognizer also consults the lexicon. The lexical items recorded in the lexicon are structured as a letter tree. When the recognizer tries a lexical character, it moves down the branch of the letter tree that has that character as its head node.
If there is no branch starting with that letter, the lexicon blocks further progress and forces the recognizer to backtrack and try a different lexical character. For example, figure 10 is a letter tree for the lexical items spiel, spit, spy, and sty.

Figure 10 A lexical letter tree

                   +-----e-----l
                   |
                   |
           +-----i-+
           |       |
           |       |
   +-----p-+       +-----t
   |       |
   |       |
 s-|       +-----y
   |
   |
   +-----t-----y

Besides applying the phonological rules and identifying morphemes, the recognizer also must enforce morpheme order constraints. The PC-KIMMO lexicon is divided into classes of lexical items that behave alike with respect to order constraints. These lexical classes are called sublexicons. The entry for each lexical item specifies the name of the sublexicon that can follow it. This following sublexicon is called a continuation class. Lexical items that occur only at the end of a word have no continuation class, indicated by the BOUNDARY symbol.

The names of the sublexicons that make up the entire lexicon are used as nodes at the head of branches of the letter tree. The piece of a letter tree shown in figure 10 may actually be under a branch node called Noun. When the recognizer successfully finds a lexical item in the letter tree, it looks at its specified continuation class and jumps to the branch of the lexicon it names.

It is often the case that at a given point in a word, more than one continuation is possible. Sets of alternative continuing sublexicons are called alternations. Thus the continuation class field of a lexical entry may contain the name of an alternation that specifies a list of the sublexicons that can follow it.

When the recognizer successfully recognizes a lexical item (word or morpheme), it reads its gloss from its lexical entry and appends it to the gloss string being built up for the entire word.

The recognizer function has these inputs:

   Surface form: Initially the input form, this string contains whatever is left to process.
   As the function is recursively called, this string gets shorter as the result string gets longer.
   Result: Initially empty, this string contains the results of the recognizer up to the point of the current function call.
   Gloss: Initially empty, this string contains glosses for the lexical items contained in the result string.
   Rules: This is the set of active finite state automata defined for this language.
   Configuration: This is an array representing the current state of all rules (automata). Initially, all states are set to 1.
   Lexicon: Initially, this is the entire lexicon defined for the language. During the process of recognition it is restricted to a branch of the lexicon.

Like the generator, the recognizer function uses a list of feasible pairs sanctioned by the set of rules; these are all the lexical:surface pairs of alphabetic characters that appear as column headers in the state tables. The input pair is a feasible pair selected by the recognizer as a possible next lexical:surface pair in the process of computing a lexical form that corresponds to the given surface form. Each time the recognizer is called it iteratively goes through the list of feasible pairs, selecting one as the input pair.

When a complete lexical item has been recognized, the lexicon is at a terminal node of the letter tree. Terminal nodes have glosses and continuation classes attached to them.

The recognizer algorithm is initialized as though it has successfully recognized a lexical item and the lexicon is at a terminal node pointing to a continuation class consisting of the INITIAL sublexicon. It then proceeds as follows:

1. If the input lexicon is at a terminal node, then for each sublexicon in the continuation class of that item, recursively call the recognizer function with these inputs:
      Surface form: This string contains whatever is left to process.
      Result: This string contains the results of the recognizer up to the point of the current function call.
      Gloss: This is the input gloss string with the gloss of the current lexical entry appended.
      Rules: This is the input set of rules.
      Configuration: This is the input configuration.
      Lexicon: This is the current continuation sublexicon.
   If the continuation class of the lexical entry is empty (that is, the lexical item can only be followed by word boundary) and the input surface form is empty, do the following steps:
   (a) If any of the state tables contains a word boundary column header, step the automata using an input pair consisting of the BOUNDARY symbol as both the lexical and surface character. If this fails, then the result is rejected and the function returns to the previous level.
   (b) Check that the configuration array contains a valid final state for each of the rules. If so, then the result is accepted, the gloss of the lexical entry is appended to the gloss, and both the result and the gloss are added to the output list. Otherwise, the result is rejected. In either case, the function returns to the previous level.
   If the continuation class of the lexical entry is empty but the surface form is not empty, the result is rejected and the function returns to the previous level.

2. For each input pair that has the head of a branch in the lexicon as the lexical character and the first character of the surface form as the surface character, do the following steps:
   (a) Step the automata using the input pair and the input configuration array to produce a new configuration.
   (b) If this succeeds, recursively call the recognizer function with these inputs:
      Surface form: This is the input surface form with the first character removed.
      Result: This is the input result string with the lexical character from the current input pair appended.
      Gloss: This is the input gloss string.
      Rules: This is the input set of rules.
      Configuration: This is the state array produced by stepping the automata.
      Lexicon: This is the branch of the lexicon corresponding to the lexical character from the current input pair.

3. For each input pair that has the head of a branch in the lexicon as the lexical character and has the NULL symbol as the surface character, do the following steps:
   (a) Step the automata using the input pair and the input configuration array to produce a new configuration.
   (b) If this succeeds, recursively call the recognizer function with these inputs:
      Surface form: This is the input surface form.
      Result: This is the input result string with the lexical character from the current input pair appended.
      Gloss: This is the input gloss string.
      Rules: This is the input set of rules.
      Configuration: This is the state array produced by stepping the automata.
      Lexicon: This is the branch of the lexicon corresponding to the lexical character from the current input pair.

4. If the NULL symbol is the head of a branch of the lexicon (that is, a null lexical entry), recursively call the recognizer function with these inputs:
      Surface form: This is the input surface form.
      Result: This is the input result string.
      Gloss: This is the input gloss string.
      Rules: This is the input set of rules.
      Configuration: This is the input state array.
      Lexicon: This is the branch of the lexicon which has the NULL symbol as its head.

10 Error messages

This section lists the various error and warning messages you may encounter. They are listed in numerical sequence and are generally grouped according to the type of error or warning. A warning means that the operation in progress has successfully completed, but an anomalous condition may have resulted. An error means that the operation in progress could not be successfully completed and was therefore prematurely terminated. Only in the case of a memory error is the PC-KIMMO program aborted and control returned to the operating system.
Note that in the following error messages the words printed in angled brackets are not literal but are cover terms for a set of items of the type suggested by the term. For instance, when the error message "Missing keyword in command" actually appears on the computer screen, the term will be replaced by a specific command name, such as LOAD or SET.

10.1 Errors related to reading and parsing commands

WARNING 100 Input line too long -- ignoring after first characters

ERROR 101 Ambiguous command:
   did not specify a unique command. Type more of the command name to ensure that it is not ambiguous.

ERROR 102 Invalid command:
   is not a valid command. Type ? or HELP for a list of valid commands.

ERROR 103 Missing keyword in command
   Expected a keyword to be used with the command. Type the command name followed by ? for a list of valid keywords.

ERROR 104 Missing argument in command
   Expected an argument to complete the command. Type HELP followed by the command name for an explanation of what arguments the command needs.

ERROR 105 Ambiguous keyword in command:
   did not specify a unique keyword. Type more of the keyword to ensure that it is not ambiguous.

ERROR 106 Invalid keyword in command:
   is not a valid keyword. Type the command name followed by ? for a list of valid keywords for that command.

ERROR 107 Invalid argument in command:
   was not valid for the command. Type HELP followed by the command name for an explanation of what arguments the command needs.

ERROR 108 Missing input file argument in command
   Expected a file name with the command.

ERROR 109 Cannot open input file in command
   Cannot find the file . Check to see if the file is in the current directory or the path you specified in the command. The command may also be expecting a different default file name or extension.

ERROR 110 Cannot open output file in command
   Check to see if the file is in the current directory or in the path you specified in the command.
   The command may also be expecting a different default file name or extension.

ERROR 111 Must load rules before loading lexicon
   The rules file must be loaded before the lexicon in order to verify the lexical forms in the lexicon against the alphabet defined in the rules file.

ERROR 112 TAKE files nested too deeply
   TAKE files can only be nested three deep.

ERROR 113 TAKE file aborted due to invalid command:
   is not a valid command. Type ? or HELP for a list of valid commands.

ERROR 114 No log file was open
   Result of issuing the CLOSE command when no log file has been opened.

WARNING 115 Closing the existing log file
   Occurs when the LOG command is issued when a log file is already open.

ERROR 116 Missing file name for EDIT command
   EDIT command must specify a file to be edited.

10.2 Errors related to reading the rules file

ERROR 200 Rules file could not be opened:
   Check to see if the file is in the current directory or in the path you specified in the command. The command may also be expecting a different default file name or extension.

ERROR 201 Unexpected end of rules file:
   The rules file is incomplete. Check to see if the last table in the file has fewer states than expected.

ERROR 202 Expected ALPHABET keyword
   The first declaration in a rules file must be the ALPHABET declaration.

ERROR 203 Alphabet contains no members
   The ALPHABET keyword does not have any characters listed after it.

WARNING 204 Too many characters in the alphabet
   The alphabet can contain a maximum of 255 characters.

WARNING 205 Character is already in the alphabet:
   A character has been repeated in the ALPHABET declaration.

ERROR 206 No value given for NULL keyword
   A single character must appear after the NULL keyword.

ERROR 207 Value given for NULL symbol was already declared as alphabetic:
   The character specified for NULL may not also be declared in the ALPHABET.

ERROR 208 The NULL symbol has already been defined
   There is more than one NULL declaration.

ERROR 209 Value given for NULL symbol was already declared for ANY

ERROR 210 Value given for NULL symbol was already declared for BOUNDARY

ERROR 211 No value given for ANY keyword
   A single character must appear after the ANY keyword.

ERROR 212 Value given for ANY symbol was already declared as alphabetic:
   The character specified for ANY may not also be declared in the ALPHABET.

ERROR 213 The ANY symbol has already been defined
   There is more than one ANY declaration.

ERROR 214 Value given for ANY symbol was already declared NULL

ERROR 215 Value given for ANY symbol was already declared for BOUNDARY

ERROR 216 No value given for BOUNDARY keyword
   A single character must appear after the BOUNDARY keyword.

ERROR 217 Value given for BOUNDARY symbol was already declared as alphabetic:
   The character specified for BOUNDARY may not also be declared in the ALPHABET.

ERROR 218 The BOUNDARY symbol has already been defined
   There is more than one BOUNDARY declaration.

ERROR 219 Value given for BOUNDARY symbol was already declared for NULL

ERROR 220 Value given for BOUNDARY symbol was already declared for ANY

ERROR 221 Subset name not given
   Occurs if there is a SUBSET keyword with nothing after it until the next keyword.

ERROR 222 Subset name is not unique
   A subset name, if it is a single character, cannot be the same as one of the characters specified in the ALPHABET, NULL, ANY, or BOUNDARY declarations. If the subset name is more than one character, then it is a duplicate of another subset name already declared.

ERROR 223 Subset contains no members

ERROR 224 Subset contains a nonalphabetic character:
   All characters used in subsets must be listed in the ALPHABET declaration, with the exception of the NULL symbol, which can appear in a subset but is not included in the ALPHABET list.

WARNING 225 Subset already contains
   A character has been repeated.

ERROR 226 Invalid keyword:
   The only valid keywords in a rules file are ALPHABET, NULL, ANY, BOUNDARY, SUBSET, and RULE.

WARNING 227 ANY symbol not defined
   Are you sure the rules do not use an ANY symbol?

WARNING 228 NULL symbol not defined
   Are you sure the rules do not use a NULL symbol?

WARNING 229 BOUNDARY symbol not defined
   The BOUNDARY declaration is obligatory. Even if the BOUNDARY symbol is not used in the rules file, it must be used in the lexicon file.

WARNING 230 Missing closing delimiter for the name of a rule:
   The first nonspace character after the RULE keyword is the opening delimiter of the rule name. A matching delimiter (identical character) was not found in the same line; thus PC-KIMMO will use everything up to the end of the line as the rule name. This is because the rule name must be contained in one line.

ERROR 231 Invalid number of rows:
   Must be a number greater than zero.

ERROR 232 Invalid number of columns:
   Must be a number greater than zero.

ERROR 233 Invalid state number:
   State (row) numbers must start with 1 and ascend consecutively.

ERROR 234 Expected final (:) or nonfinal (.) state indicator:
   A state (row) number must be followed by colon (:) or period (.) with no intervening space.

ERROR 235 State table entry out of range:
   must not be greater than the specified number of states for the table.

ERROR 236 Lexical character not in alphabet:
   A character in a table's lexical character list is not a member of the alphabet declared earlier in the rules file.

ERROR 237 Surface character not in alphabet:
   A character in a table's surface character list is not a member of the alphabet declared earlier in the rules file.

ERROR 238 Nonnumeric character in state table:
   Expected a numeric state table entry but found a nonnumeric character.

ERROR 239 Rule number , column pairs a BOUNDARY symbol with something else:
   Occurs if a column header pairs a BOUNDARY symbol with anything but another BOUNDARY symbol; only #:# is allowed.

WARNING 240 No feasible pairs for this set of rules
   Either there are no rules in the file or the rules contain only subset correspondences. In the latter case, simple rules listing all the default correspondences are needed.

WARNING 241 RULE () -- : specified by both columns (:) and (:)
   There is an overlap between two columns of the state table. Issue a SHOW RULE command for the rule causing the warning and examine the set of pairs specified by each column header.

WARNING 242 RULE () -- : not specified by any column
   The entire set of feasible pairs must be specified by each table. The table is probably missing an ANY:ANY column.

ERROR 243 Rule number , column pairs two NULL symbols:
   NULL:NULL is not a legal column header, since it cannot be a feasible pair.

10.3 Errors related to reading the lexicon file

ERROR 300 Lexicon file could not be opened:
   Check to see if the file is in the current directory or in the path you specified in the command. The command may also be expecting a different default file name or extension.

ERROR 301 No data in lexicon file

ERROR 302 Missing alternation name
   The ALTERNATION keyword must be followed by an alternation name.

WARNING 303 Empty alternation definition:
   An ALTERNATION keyword was found with no following alternation name or list of lexicon names.

WARNING 304 Adding to existing alternation:

ERROR 305 No lexicon sections in lexicon file
   A lexicon file must contain sublexicons.

ERROR 306 Missing lexicon name
   The keyword LEXICON must be followed by a sublexicon name.

WARNING 307 Lexicon section is not listed as a member of any alternations
   This will not necessarily result in a processing error if this is what you intended to do.

ERROR 308 Expected continuation class or BOUNDARY symbol for
   A lexical entry is missing its continuation class element.

ERROR 309 Invalid continuation class for
   A name appearing in the continuation class field of a lexical entry must be the name of an ALTERNATION that has already been declared.

ERROR 310 Expected gloss element for
   Each lexical entry must have a gloss element.

ERROR 311 Invalid gloss element for
   The gloss element must be bracketed by matching delimiters (identical characters).

ERROR 312 Form contains character not in alphabet:
   Each character used in lexical items must be listed in the ALPHABET declaration of the rules file.

ERROR 313 INITIAL lexicon not found
   A lexicon file must as a minimum have a sublexicon named INITIAL.

ERROR 314 Cannot nest lexicon INCLUDE files
   An INCLUDE file cannot call another INCLUDE file.

ERROR 315 Missing INCLUDE file name
   An INCLUDE keyword must be followed by a file name.

ERROR 316 Lexicon INCLUDE file could not be opened:

ERROR 317 Invalid lexicon file keyword:
   The only valid keywords in a lexicon file are ALTERNATION, LEXICON, INCLUDE, and END.

10.4 Errors related to recognizing or generating a form

WARNING 400 Surface form not found in comparison pairs file
   A lexical:surface pair in a pairs comparison file is missing the surface form.

ERROR 800 Form [] contains character not in alphabet:
   An input form contains a character that was not listed in the ALPHABET declaration in the rules file.

ERROR 801 RULE is invalid--input : is not specified by any column
   Could happen if a table does not have an ANY:ANY column.

ERROR 802 Invalid lexicon for recognizer
   Probably will never occur!

ERROR 803 Lexicon section is empty
   There are no lexical entries in the named sublexicon.

ERROR 804 Cannot recognize forms without a lexicon
   The lexicon is not loaded.

10.5 Errors that abort program execution

ERROR 900 Out of memory
   The rules and lexicon are too large to fit in memory.

Runtime error--stack overflow
   Occurs when the generator or recognizer gets into an infinite loop due to an incorrectly written rule or lexicon continuation.

References

Antworth, Evan L. 1990. PC-KIMMO: a two-level processor for morphological analysis. Occasional Publications in Academic Computing No. 16. Dallas, TX: Summer Institute of Linguistics. ISBN 0-88312-639-7, 273 pages, paperbound.

Karttunen, Lauri. 1983. KIMMO: a general morphological processor. Texas Linguistic Forum 22:163-186.

_____ and K. Wittenburg. 1983. A two-level morphological analysis of English. Texas Linguistic Forum 22:217-228.

Koskenniemi, Kimmo. 1983. Two-level morphology: a general computational model for word-form recognition and production. Publication No. 11. University of Helsinki: Department of General Linguistics.

Errata

The generator algorithm described in section 9.1 (pages 32-33) is slightly misleading. Step 3 (testing all feasible pairs containing a NULL lexical character, and recursively invoking the algorithm for each pair that successfully steps the automata) should be carried out even when the lexical form is empty. In other words, Step 3 actually takes place before Step 1. This reflects a bug in the implementation that was partially fixed in version 1.0B, and fully fixed in version 1.0.3 of PC-KIMMO.
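
As a rough illustration of the corrected step order described in this erratum, the following Python sketch generates surface forms with a single toy rule. It is not PC-KIMMO source code. The rule table (a simplified epenthesis rule, 0:e <=> x +:0 ___ s), the feasible pair list, and the names step and generate are assumptions made for this example only. The sketch tries the pairs with a NULL lexical character (step 3) before it checks for an empty lexical form (step 1). It omits the BOUNDARY stepping of step 1(a) because the toy rule has no word boundary column, and it strips NULL symbols from the result only when a form is accepted.

```python
NULL = "0"  # NULL symbol, written 0 as in the traces of section 8

# Hypothetical toy rule (an assumption, not one of the manual's English rules):
#   0:e <=> x +:0 ___ s
# Each row maps a lexical:surface pair to the next state; 0 means the rule fails.
# The "@:@" column stands for any feasible pair not named by another column.
EPENTHESIS = {
    "finals": {1, 2, 3},
    "table": {
        1: {"x:x": 2, "+:0": 1, "s:s": 1, "0:e": 0, "@:@": 1},
        2: {"x:x": 2, "+:0": 3, "s:s": 1, "0:e": 0, "@:@": 1},
        3: {"x:x": 2, "+:0": 1, "s:s": 0, "0:e": 4, "@:@": 1},
        4: {"x:x": 0, "+:0": 0, "s:s": 1, "0:e": 0, "@:@": 0},
    },
}
RULES = [EPENTHESIS]
FEASIBLE_PAIRS = [("f", "f"), ("o", "o"), ("x", "x"), ("s", "s"),
                  ("+", NULL), (NULL, "e")]


def step(rules, config, pair):
    """Step every automaton with one pair; return the new configuration,
    or None if any rule fails (moves to state 0)."""
    key = f"{pair[0]}:{pair[1]}"
    new_config = []
    for rule, state in zip(rules, config):
        row = rule["table"][state]
        next_state = row.get(key, row["@:@"])
        if next_state == 0:
            return None
        new_config.append(next_state)
    return tuple(new_config)


def generate(lexical, result="", config=None, out=None):
    """Collect every surface form for a lexical form into the list out."""
    if config is None:
        config = tuple(1 for _ in RULES)  # all automata start in state 1
    if out is None:
        out = []
    # Step 3, done first and even when the lexical form is empty (see Errata):
    # try each pair whose lexical character is NULL, consuming no input.
    for lex, surf in FEASIBLE_PAIRS:
        if lex == NULL:
            new_config = step(RULES, config, (lex, surf))
            if new_config is not None:
                generate(lexical, result + surf, new_config, out)
    # Step 1(b): input exhausted, so accept only if every rule is in a final state.
    if not lexical:
        if all(s in r["finals"] for r, s in zip(RULES, config)):
            out.append(result.replace(NULL, ""))
        return out
    # Step 2: try each pair whose lexical character is the next input character.
    for lex, surf in FEASIBLE_PAIRS:
        if lex == lexical[0]:
            new_config = step(RULES, config, (lex, surf))
            if new_config is not None:
                generate(lexical[1:], result + surf, new_config, out)
    return out
```

With this toy rule, generate("fox+s") returns the one-element list containing foxes, and generate("fox+") returns the list containing only fox. In the second case the 0:e pair is tried after the lexical form is exhausted, as the erratum requires, but that path is rejected because the rule is left in a nonfinal state.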