KText Reference Manual analyzing/synthesizing texts with PC-Kimmo functions version 2.0b17 October 1997 by Evan Antworth and Stephen McConnel Copyright (C) 2000 SIL International Published by: Language Software Development SIL International 7500 W. Camp Wisdom Road Dallas, TX 75236 U.S.A. Permission is granted to make and distribute verbatim copies of this file provided the copyright notice and this permission notice are preserved in all copies. The author may be reached at the address above or via email as `steve@acadcomp.sil.org'. Overview of KTEXT ***************** This section briefly describes what KTEXT does, places KTEXT in its computational context, lists technical specifications of the program, and gives information on use and support of the program. What does KTEXT do? =================== KTEXT is a text processing program that uses the PC-KIMMO parser (see below about PC-KIMMO). KTEXT operates in two modes: analysis and synthesis. In analysis mode, KTEXT reads a text from a disk file, parses each word, and writes the results to a new disk file. This new file is in the form of a structured text file where each word of the original text is represented as a database record composed of several fields. Each word record contains a field for the original word, a field for the underlying or lexical form of the word, and a field for the gloss string. For example, if the text in the input file contains the word `beginning' (to use an English example), KTEXT's output file will have a record of this format: \a be`gin +ING \d be`gin-+ING \cat V \fd ing \w beginning This record consists of five fields, each tagged with a backslash code.(1) The first field, tagged with \a for analysis, contains the gloss string for the word. The second field, tagged with \d for (morpheme) decomposition, contains the underlying or lexical form of the word. The third field, tagged with \cat for category, contains the grammatical category of the word. 
The fourth field, tagged with \fd for feature descriptions, contains a list of feature abbreviations associated with the word, and the fifth field, tagged with \w for word, contains the original word.

The word `pictures' (which can be analyzed as either a verb or a noun) demonstrates how KTEXT handles multiple parses:

     \a %2%`picture +3SG%`picture +PL%
     \d %2%`picture-+s%`picture-+s%
     \cat %2%V%N%
     \fd %d%s%-3sg pl%
     \w pictures

Percent signs (or some other designated character) separate the multiple results in the \a, \d, \cat, and \fd fields, with a number indicating how many results were found.

A word record also saves any capitalization or punctuation associated with the original word. For example, if a sentence begins "Obviously, this hypothesis...", KTEXT will output the first word like this:

     \a `obvious +AVR1
     \d `obvious-+ly
     \cat AV
     \w obviously
     \c 1
     \n ,

The \w field contains the original word without capitalization or the following comma. The \c field contains the number 1 which indicates that the first letter of the original word is upper case. The \n field contains the comma that follows the original word. The purpose of retaining the capitalization and punctuation of the original text is, of course, to enable one to recover the original text from KTEXT's output file.

In synthesis mode, KTEXT takes an analysis file compatible with that produced by KTEXT in analysis mode and produces an orthographic text file comparable to the original.

   ---------- Footnotes ----------

   (1) The particular choice of field markers and the order of fields in a record is due to the fact that KTEXT uses the same text-handling routines as an existing program called AMPLE (Weber et al., 1988). This has the advantage that KTEXT's output is compatible with that program, but the disadvantage that the record structure is perhaps not consistent with terminology already established for PC-KIMMO.
It should also be noted that the quasi-database design of KTEXT's output is used by many other programs developed by SIL International.

Placing KTEXT in its context
============================

KTEXT is best understood in relation to two other programs: PC-KIMMO and CARLA. First, we will take a look at PC-KIMMO.

KTEXT is intended to be used with PC-KIMMO (though it is a stand-alone program). PC-KIMMO is a program for doing computational phonology and morphology. It is typically used to build morphological parsers for natural language processing systems. PC-KIMMO is described in the book `PC-KIMMO: a two-level processor for morphological analysis' by Evan L. Antworth, published by the Summer Institute of Linguistics (1990). The PC-KIMMO software is available for MS-DOS and Windows (IBM PCs and compatibles), Macintosh, and UNIX. The book (including software) is available for $23.00 (plus postage) from:

     International Academic Bookstore
     7500 W. Camp Wisdom Road
     Dallas, TX 75236
     U.S.A.

     phone: 972/708-7404
     fax: 972/708-7433

The KTEXT program which this document describes will be of very little use to you without the PC-KIMMO program and book. The remainder of this document assumes that you are familiar with PC-KIMMO.

PC-KIMMO was deliberately designed to be reusable. The core of PC-KIMMO is a library of functions such as "load rules", "load lexicon", "generate", and "recognize". The PC-KIMMO program supplied on the release diskette is just a user shell built around these basic functions. This shell provides an environment for developing and testing sets of rules and lexicons. Since the shell is a development environment, it has very little built-in data processing capability. But because PC-KIMMO is modular and portable, you can write your own data processing program that uses PC-KIMMO's function library. KTEXT is an example of how to use PC-KIMMO to create a new natural language processing program.
KTEXT is a text processing program that uses PC-KIMMO to do morphological parsing. KTEXT is also closely related to a system called CARLA, which stands for Computer Assisted Related Language Adaptation. CARLA is a type of machine translation system designed to work between closely related languages. CARLA is based on the Analysis Transfer Synthesis (ATS) paradigm of adaptation. This paradigm involves three stages: 1. *Analysis.* The text to be adapted is converted to an abstract representation, composed of units (in our case, words and morphemes) which are defined in source language dictionaries. (No attempt is made to represent the meaning of the text, only the units that comprise the text.) 2. *Transfer.* Given known, systematic differences between the source and target languages, the result of Analysis is converted to an abstract representation of what it should be for the target language, using units defined in target language dictionaries. 3. *Synthesis.* Given information about the target language, the abstract representation resulting from Transfer is converted to a concrete, textual form. When used in analysis mode, KTEXT performs the Analysis task. In the original CARLA system, analysis is done by a program called AMPLE (Weber et al. 1988), which is also a morphological parser designed to process text. KTEXT was created by replacing AMPLE's parsing engine with the PC-KIMMO parser. Thus KTEXT has the same text-handling mechanisms as AMPLE and produces output similar or even identical to AMPLE. The advantages of this design are (1) we were able to develop KTEXT very quickly and easily since it involved very little new code, and (2) existing programs that use AMPLE's output format can also use KTEXT's output. The disadvantage of basing KTEXT on AMPLE is that the format of the output file is perhaps not consistent with terminology already established for PC-KIMMO. When KTEXT is used in synthesis mode, it performs the Synthesis task. 
In the original CARLA system, synthesis is done by a program called STAMP (Weber et al. 1990). However, STAMP also performs the Transfer task; KTEXT does not have this capability. Technical specifications ======================== KTEXT runs under four operating systems: * MS-DOS (IBM PC compatibles with a 386 or higher processor) * Microsoft Windows * UNIX * Apple Macintosh KTEXT does not require any graphics capability. It handles eight-bit characters (such as the IBM PC extended character set or the Windows ANSI character set). The Windows and Macintosh versions have the same user interface as the MS-DOS and UNIX versions, namely a batch-processing, command-line interface. In other words, a GUI version does not exist. The MS-DOS executable requires a 386 or newer CPU and a DPMI server. This has the advantage of allowing the program to use as much memory as necessary without constraining it to the archaic 640K limit. (DPMI is provided automatically by Windows. A free DPMI server is distributed with the MS-DOS executable.) The program is written in C and is very portable. The Macintosh version was compiled with the Metrowerks C compiler. The sources available at URL ftp://ftp.sil.org/software/unix/ktext-*.zip can be compiled for any of the four target platforms. Program status ============== KTEXT was developed by Stephen McConnel and Evan Antworth of SIL International. Several qualifications apply to its use and support: 1. This software, source code and executable program, is copyrighted by SIL International. You may use this software at no cost. You are granted the right to distribute this software to others, provided that all files are included in unmodified form and that you charge no fee (except cost of media). This software is intended for academic or other personal use only, and may not be distributed or used for commercial profit without express permission of SIL International. 2. 
This software represents work in progress and bears no warranty, either expressed or implied, of its fitness for any particular purpose.

  3. In releasing this software, SIL International is making no commitment to maintain it. It is, however, committed to forwarding user feedback to the software's authors who may or may not choose to develop the software further.

Bug reports, wish lists, requests for support, and positive feedback should be directed to Stephen McConnel at this address:

     Stephen McConnel
     Language Software Development
     SIL International
     7500 W. Camp Wisdom Road
     Dallas, TX 75236

     phone: 972/708-7361
     email: Stephen_McConnel@sil.org

Examples of using KTEXT
***********************

Using KTEXT to analyze a text
=============================

Typically, the steps involved in using KTEXT to analyze texts are:

  1. Collect a corpus of language data suitable for phonological and morphological analysis (typically paradigms of words).
  2. Do phonological and morphological analysis on the data.
  3. Use the PC-KIMMO shell to develop a rules file, lexicon, and grammar file that encode your phonological and morphological analyses and to test them against your corpus of data.
  4. Select a text and keyboard it.
  5. Set up the additional control files required for KTEXT analysis.
  6. Use the rules, lexicon, and grammar you developed to process the text with KTEXT in analysis mode.
  7. Edit KTEXT's output file to remove multiple parses.
  8. Use the edited file as input to some other program.

To demonstrate how to use KTEXT to process a text in analysis mode, we will use Englex, a morphological grammar of English for PC-KIMMO, and analyze the opening of `Alice's Adventures in Wonderland', by Lewis Carroll. The opening of the text is shown in figure 1.

Figure 1.
Excerpt from Alice

     \id Alice.txt - Lewis Carroll's Alice's Adventures in Wonderland
     \ti Down the Rabbit-Hole
     \p Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?"
     \p So she was considering, in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

The text was keyboarded using a very simple system of document markup that tags parts of the document with backslash codes. The `\id' tag identifies the text; the `\ti' tag indicates the title of the story; and the `\p' tag indicates the beginning of a paragraph.

The next step is to process the text with KTEXT in analysis mode. Run the KTEXT application with these command line options:

     ktext -x ana.ctl -i alice.txt -o alice.ana -l ana.log

where `ana.ctl' is the analysis control file, `alice.txt' is the input text file, `alice.ana' is the output analysis file, and `ana.log' is the analysis log file. The following display will appear on the screen:

     KTEXT (analyze/synthesize words using PC-Kimmo functions)
     Version 2.0b11 (November 1, 1996), Copyright 1996 SIL
     Beta test version compiled Nov 7 1996 15:11:16
     with PC-Kimmo functions version 2.1b7 (November 6, 1996)
     and PC-PATR functions version 0.99b0 (November 7, 1996)
     For 386 CPU (or better) under MS-DOS [compiled with DJGPP 2.1/GNU C 2.7]
     affix.lex 255 entries
     noun.lex 10461 entries
     verb.lex 4215 entries
     adjectiv.lex 3345 entries
     adverb.lex 400 entries
     minor.lex 379 entries
     proper.lex 1057 entries
     abbrev.lex 127 entries
     technica.lex 813 entries
     natural.lex 435 entries
     foreign.lex 88 entries
     5...2.2..2 ..22.2.2..
2.222.2.22 ..22.2.222 22..22.22. 2..23.22.. 2.2.22.225 2..2...22. ......2... 4.22.2..4. 100 ...2..2..2 2.222 Each dot represents one word successfully processed. Multiple analyses of a word are indicated by numbers; thus the first word down received five analyses. When the program is done, it will return you to the operating system prompt. A fragment of the resulting output file is shown in figure 2. Figure 2 Output of KTEXT \a %5%`down%`down%`down%`down%`down% \d %5%`down%`down%`down%`down%`down% \cat %5%AV%V%AJ%N%PP% \fd %5%%vbase%%sg%% \w down \c 1 \a the \d the \cat DT \w the \a `rabbit - `hole \d `rabbit---`hole \cat N \fd sg \w rabbit-hole \c 516 \n \n One obvious way to continue is to reassemble the text in interlinear format. That is, we could write a program that would take the data structures shown in figure 2 and create a new file where the text is stored in interlinear format. The resulting interlinear text is shown in figure 3. An interlinear text editor like IT(1) could then be used to add more lines of annotations to the text. Figure 3 An English example of interlinear text format Down the Rabbit - Hole `down the `rabbit - `hole PP DT N - N Interlinear translation is a time-honored format for presenting analyzed vernacular texts. An interlinear text consists of a baseline text and one or more lines of annotations that are vertically aligned with the baseline. In the text shown in figure 3, the first line is the baseline text. The second line provides the lexical form of each original word, including morpheme breaks. The third line gives the category or part-of-speech of each word. Another way to proceed would be to take the output of KTEXT as shown in figure 2 and format it directly for printing. In other words, there would be no disk file of interlinear text corresponding to figure 3; rather, the interlinear text is created on the fly as it is prepared for printing. Fortunately, the software required to print interlinear text is now available. 
As a complement to the IT program, a system for formatting interlinear text for typesetting has recently been developed (see Kew and McConnel, 1991). Called ITF, for Interlinear Text Formatter,(2) it is a set of TeX(3) macros that can format an arbitrary number of aligning annotations with up to two freeform (nonaligning) annotations. While ITF is primarily intended to format the data files produced by IT (similar to the interlinear text shown in figure 3), an auxiliary program provided with ITF accepts the output of the KTEXT program. The final printed result of the formatting process is shown in figure 4.(4) It should be noted that this is just one of many formats that ITF can produce. Because ITF is built on a full-featured typesetting system, virtually all aspects of the formatting detail can be customized, including half a dozen different schemes for laying out the freeform annotations relative to the interlinear text. Figure 4. Output of ITF [ This figure is not available in plain text documentation. ] ---------- Footnotes ---------- (1) IT (pronounced "eye-tee") is an interlinear text editor that maintains the vertical alignment of the interlinear lines of text and uses a lexicon to semi-automatically gloss the text. See Simons and Versaw (1991) and Simons and Thomson (1988). (2) ITF was developed by the Academic Computing Department of the Summer Institute of Linguistics. It runs under MS-DOS, UNIX, and the Apple Macintosh. (3) TeX is a typesetting language developed by Donald Knuth (see Knuth, 1986). (4) The plain text version of this documentation does not include figure 4, since it is an image of typeset output. Using KTEXT to synthesize a text ================================ Normally, in an adaptation project, the text is adapted from a source language to a target language via a Transfer component. For the purpose of this example, we will use English as both the source and target language, thus obviating the need for a Transfer component. 
If the synthesis operation produces a text which is identical to the original text, then we have demonstrated the efficacy of the system.

Typically, the steps involved in using KTEXT in synthesis mode are:

  1. Collect a corpus of language data suitable for phonological and morphological analysis (typically paradigms of words).
  2. Do phonological and morphological analysis on the data.
  3. Use the PC-KIMMO shell to develop a rules file and a lexicon file that encode your phonological and morphological analyses and to test them against your corpus of data.
  4. Set up the control files required by KTEXT for synthesis mode.
  5. Using the rules and lexicon you developed, use KTEXT in synthesis mode to process an analyzed text.

To synthesize the original text from the analysis file, run the KTEXT application with these command line options:

     ktext -s -x syn.ctl -i alice.ana -o alice.syn -l syn.log

where `syn.ctl' is the synthesis control file, `alice.ana' is the input analysis file, `alice.syn' is the synthesized output text file, and `syn.log' is the synthesis log file. The following display will appear on the screen:

     KTEXT (analyze/synthesize words using PC-Kimmo functions)
     Version 2.0b11 (November 1, 1996), Copyright 1996 SIL
     Beta test version compiled Nov 7 1996 15:11:16
     with PC-Kimmo functions version 2.1b7 (November 6, 1996)
     and PC-PATR functions version 0.99b0 (November 7, 1996)
     For 386 CPU (or better) under MS-DOS [compiled with DJGPP 2.1/GNU C 2.7]
     affix.lex 255 entries
     noun.lex 10461 entries
     verb.lex 4215 entries
     adjectiv.lex 3345 entries
     adverb.lex 400 entries
     minor.lex 379 entries
     proper.lex 1057 entries
     abbrev.lex 127 entries
     technica.lex 813 entries
     natural.lex 435 entries
     foreign.lex 88 entries
     .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... 100
     .......... .....

Notice that every word received a single synthesis.
Open the output file `alice.syn' and you will see that it is identical to the input text file shown in figure 1. Running KTEXT ************* This section describes KTEXT's user interface and the input files it uses. KTEXT is a batch-processing program. This means that the program takes as input a text from a disk file and returns as output the processed text in a new disk file. KTEXT is run from the command line by giving it the information it needs (file names and other options). It does not have an interactive interface. The user controls KTEXT's operation by means of special files that contain all the information KTEXT needs to process the input text. These files are called control files. The operation of the program is controlled by using command line options. To see a list of the command line options, run the KTEXT application with `-h' as a command line option. You will see a display similar to this: KTEXT (analyze/synthesize words using PC-Kimmo functions) Version 2.0b11 (November 1, 1996), Copyright 1996 SIL Beta test version compiled Nov 7 1996 15:11:16 with PC-Kimmo functions version 2.1b7 (November 6, 1996) and PC-PATR functions version 0.99b0 (November 7, 1996) For 386 CPU (or better) under MS-DOS [compiled with DJGPP 2.1/GNU C 2.7] Usage: ktext [options] -c char make char the comment character for the control files (default is ;) -s synthesis mode (default is analysis) -v for synthesis, verify each result with a word parse -x ctlfile specify the KTEXT control file (default is ktext.ctl) -i infile specify the input file (required: no default) -o outfile specify the output file (default is based on infile) -l logfile specify the KTEXT log file (default is none) The command line options (`-c', `-s', and so on) are all lower case letters. Here is a detailed description of each command line option. `-c' The `-c' option takes an argument that sets the comment character used in the KTEXT control files (analysis, synthesis, TEXTIN, and TXTOUT control files). 
It has no effect on any other files used by KTEXT. If the `-c' option is not used, the semicolon (;) is used as the default comment character.

`-s'
     The `-s' option causes KTEXT to run in synthesis mode. Without this option, KTEXT runs in analysis mode.

`-v'
     The `-v' option applies only to synthesis mode; it causes KTEXT to verify the synthesis by using the word grammar (if one is specified in the analysis control file). The default is not to use the word grammar (even if one is specified).

`-x'
     The `-x' option takes an argument that specifies the name of either the analysis or synthesis control file. These control files contain the names of the TEXTIN or TXTOUT control files and the names of the rules, lexicon, and word grammar files. They can also specify consistent changes to be made to the output. The `-x' option accepts a default file name extension of `.ctl'; for example, if you use `-x english', KTEXT will try to load the file `english.ctl'. If the `-x' option is not used, KTEXT will try to load a control file with the default file name `ktext.ctl'.

`-i'
     The `-i' option takes an argument that specifies the name of the input file containing the text that KTEXT will process. If the `-i' option is not used, KTEXT displays an error message and quits.

`-o'
     The `-o' option takes an argument that specifies the name of the output file that KTEXT creates. If the `-o' option is not used, the output filename is constructed from the input filename.

`-l'
     The `-l' option takes an argument that specifies the name of a log file. The log file contains messages about any analysis failures or other anomalous behavior during processing of the input text.

In all instances where file names are supplied to KTEXT, an optional directory path can be included; for example, `-i c:\texts\alice.txt'.
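The options described above can be assembled programmatically when KTEXT is driven from a script. The following Python sketch builds an argument list using only the options documented in this section. The function name `build_ktext_argv' and its parameter names are hypothetical, the program name `ktext' is taken from the examples in this manual, and the check that rejects `-v' without `-s' is a choice made by this helper, not documented KTEXT behavior.

```python
from typing import List, Optional


def build_ktext_argv(infile: str,
                     ctlfile: Optional[str] = None,
                     outfile: Optional[str] = None,
                     logfile: Optional[str] = None,
                     synthesis: bool = False,
                     verify: bool = False,
                     comment_char: Optional[str] = None,
                     program: str = "ktext") -> List[str]:
    """Build a KTEXT argument list (hypothetical helper, not part of KTEXT)."""
    if not infile:
        # The manual states that -i is required and has no default.
        raise ValueError("an input file (-i) is required")
    if verify and not synthesis:
        # Helper policy only: -v is documented as applying to synthesis mode.
        raise ValueError("-v applies only to synthesis mode (-s)")
    if comment_char is not None and len(comment_char) != 1:
        raise ValueError("the comment character must be a single character")

    argv = [program]
    if comment_char is not None:
        argv += ["-c", comment_char]
    if synthesis:
        argv.append("-s")
    if verify:
        argv.append("-v")
    if ctlfile is not None:
        argv += ["-x", ctlfile]   # KTEXT falls back to ktext.ctl if omitted
    argv += ["-i", infile]
    if outfile is not None:
        argv += ["-o", outfile]   # otherwise derived from the input file name
    if logfile is not None:
        argv += ["-l", logfile]   # no log file is written by default
    return argv


if __name__ == "__main__":
    # Reproduces the analysis command used in the Alice example.
    print(" ".join(build_ktext_argv("alice.txt", ctlfile="ana.ctl",
                                    outfile="alice.ana", logfile="ana.log")))
```

The resulting list can be passed to a process launcher such as Python's `subprocess.run'; keeping the arguments as a list avoids quoting problems with directory paths that contain backslashes.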
KTEXT's functional structure
****************************

Analysis mode
=============

KTEXT uses three main functional modules in analysis mode: the "text input" module, the "analysis" module, and the "structured output" module. The diagram in figure 5 shows the flow of data through these modules. The input text is fed into the text input module which outputs the text as a stream of normalized words with capitalization and punctuation stripped out and saved. The text input module is controlled by a file that specifies orthographic changes. Each word is then passed to the analysis module where it is parsed. The analysis module is controlled by the PC-KIMMO rules, lexicon, and grammar files. The parsed words are then passed to the structured output module and written to the output file as database records.

Figure 5. An overview of KTEXT analysis

                original input text file
                            |
                            |
                 +--------------------------------+
                 |          |                     |
                 |    +------------+              |
text input       |    |    TEXT    |              |
control file --->|--->|   INPUT    |-----+        |
                 |    +------------+     |        |
                 |          |      punctuation    |
                 |        words    formatting     |
                 |          |      capitalization |
                 |          |            |        |
rules file ----->|--->+------------+     |        |
lexicon files -->|--->|  ANALYSIS  |     |        |
grammar file --->|--->+------------+     |        |
                 |          |            |        |
                 |     parsed words      |        |
                 |          |            |        |
                 |    +------------+     |        |
                 |    | STRUCTURED |<----+        |
                 |    |   OUTPUT   |              |
                 |    +------------+              |
                 |          |                     |
                 +--------------------------------+
                            |
                            |
         structured text file with parsed words

In analysis mode, KTEXT uses six different input files and produces one output file (plus an optional log file). These six input files are:

  1. the text data file,
  2. the KTEXT control file,
  3. the text input control file (optional),
  4. the PC-KIMMO rules file,
  5. the PC-KIMMO lexicon file, and
  6. the PC-KIMMO grammar file (optional).

The PC-KIMMO rules, lexicon, and grammar files are described in the PC-KIMMO documentation and will not be discussed further in this document; see Antworth (1990) and Antworth (1995).
The other input files and the analysis output data file are described in the following chapters.

Synthesis mode
==============

KTEXT also uses three main functional modules in synthesis mode: the "structured input" module, the "synthesis" module, and the "text output" module. The diagram in figure 6 shows the flow of data through these modules. A structured input text containing parsed words is fed into the structured input module, which outputs the text as a stream of parsed words with capitalization and punctuation stripped out and saved. Each parsed word is then passed to the synthesis module where it is rebuilt from its pieces. The synthesis module is controlled by the PC-KIMMO rules and lexicon files. (Synthesis normally does not use the grammar file.) The synthesized words are then passed to the text output module and written to the output file as a synthesized text with the punctuation and capitalization merged back in. The text output module is controlled by a file that specifies orthographic changes.

Figure 6. An overview of KTEXT synthesis

         structured text file with parsed words
                            |
                            |
                 +--------------------------------+
                 |          |                     |
                 |    +------------+              |
                 |    | STRUCTURED |              |
                 |    |   INPUT    |-----+        |
                 |    +------------+     |        |
                 |          |      punctuation    |
                 |        parsed   formatting     |
                 |        words    capitalization |
                 |          |            |        |
rules file ----->|--->+------------+     |        |
lexicon files -->|--->| SYNTHESIS  |     |        |
                 |    +------------+     |        |
                 |          |            |        |
                 |   synthesized words   |        |
                 |          |            |        |
                 |    +------------+     |        |
text output      |    |    TEXT    |<----+        |
control file --->|--->|   OUTPUT   |              |
                 |    +------------+              |
                 |          |                     |
                 +--------------------------------+
                            |
                            |
              synthesized output text file

In synthesis mode, KTEXT also uses six different input files and produces one output file (plus an optional log file). These six input files are:

  1. the analysis data file,
  2. the KTEXT control file,
  3. the text output control file (optional),
  4. the PC-KIMMO rules file,
  5. the PC-KIMMO lexicon file, and
  6. the PC-KIMMO grammar file (optional).
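The analysis data file listed first above consists of word records like those shown earlier in figure 2. As an illustration of what the structured input module has to read, the following Python sketch loads such records into dictionaries and splits ambiguous field values. It is not KTEXT's own code: the function names are hypothetical, and the sketch assumes one field per line, that every record begins with an \a field, and that the ambiguity character is the default percent sign.

```python
from typing import Dict, List


def split_ambiguities(value: str, marker: str = "%") -> List[str]:
    """Split a value such as '%2%V%N%' into its alternatives ['V', 'N'].

    A value that does not start with the marker is unambiguous and is
    returned as a one-element list.
    """
    if not value.startswith(marker):
        return [value]
    parts = value.split(marker)      # '%2%V%N%' -> ['', '2', 'V', 'N', '']
    # parts[1] is the count; the alternatives follow it.
    if value.endswith(marker):
        return parts[2:-1]
    return parts[2:]


def parse_records(text: str) -> List[Dict[str, str]]:
    """Group backslash-coded fields into word records (assumed to start at \\a)."""
    records: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue                 # sketch: skip blank or continuation lines
        code, _, value = line[1:].partition(" ")
        if code == "a":              # assumption: \a opens a new word record
            current = {}
            records.append(current)
        if current is not None:
            current[code] = value.strip()
    return records


if __name__ == "__main__":
    sample = (
        "\\a %2%`picture +3SG%`picture +PL%\n"
        "\\d %2%`picture-+s%`picture-+s%\n"
        "\\cat %2%V%N%\n"
        "\\w pictures\n"
    )
    for record in parse_records(sample):
        print(record["w"], split_ambiguities(record["cat"]))
```

Empty alternatives are kept on purpose, so that a value such as `%5%%vbase%%sg%%' still yields five entries that line up with the five alternatives in the other fields of the same record.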
The PC-KIMMO rules, lexicon, and grammar files are described in the PC-KIMMO documentation and will not be discussed further in this document; see Antworth (1990) and Antworth (1995). The other input files and the synthesis output text file are described in the following chapters.

The input text file
*******************

The input text file contains the text that KTEXT will process. It must be a plain text file, not a file formatted by a word processor. If you use a word processor such as Microsoft Word to create your text, you must save it as plain text with no formatting.

KTEXT preserves all the "white space" used in the text file. That is, it saves in its output file the location of all line breaks, blank lines, tabs, spaces, and other nonalphabetic characters. This enables you to recover from the output file the precise format and page layout of the original text.

While KTEXT will accept text with no formatting information other than white space characters, it will also handle text that contains special format markers. These format markers can indicate parts of the text such as sentences, paragraphs, sections, section headings, and titles. The use of special format markers is called descriptive markup. KTEXT (because it is based on AMPLE) works best with a system of descriptive markup called "standard format" that is used by SIL International. SIL standard format marks the beginning of each text unit with a format marker. There is no explicit indication of the end of a unit. A format marker is composed of a special character (a backslash by default) followed by a code of one or more letters, for example, `\ti' for title, `\ch' for chapter, `\p' for paragraph, `\s' for sentence, and so on. KTEXT does not "know" any particular format markers. You can use whatever markers you like, as long as you declare them in the text input control file. For more on format markers, see section 7.3.2 below.
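To make the idea of standard format concrete, the following Python sketch divides a text into units, each introduced by a format marker and running up to the next marker. It illustrates the markup convention only and is not how KTEXT itself is implemented; the function name `split_units' is hypothetical, and the sketch assumes that a marker ends at the first white space character.

```python
import re
from typing import List, Tuple


def split_units(text: str, flag: str = "\\") -> List[Tuple[str, str]]:
    """Return (marker, content) pairs for a standard format text.

    A marker is the flag character plus everything up to the next white
    space. Text that precedes the first marker is returned with an empty
    marker string.
    """
    pattern = re.compile(re.escape(flag) + r"\S*")
    units: List[Tuple[str, str]] = []
    marker = ""
    pos = 0
    for match in pattern.finditer(text):
        content = text[pos:match.start()].strip()
        if marker or content:
            units.append((marker, content))
        marker = match.group(0)
        pos = match.end()
    tail = text[pos:].strip()
    if marker or tail:
        units.append((marker, tail))
    return units


if __name__ == "__main__":
    sample = "\\ti Down the Rabbit-Hole\n\\p Alice was beginning to get very tired"
    for marker, content in split_units(sample):
        print(marker, "->", content)
```

Because a unit has no explicit end, the content of each marker simply runs to the start of the next one, which is why the sketch needs no closing markers. Unlike KTEXT, this sketch discards the white space around each unit.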
One of the best known systems of descriptive markup is SGML (Standard Generalized Markup Language). One very significant difference between SGML and SIL standard format is that SGML uses markers in pairs, one at the beginning of a text unit and a matching one at the end. This should not pose a problem for KTEXT, since KTEXT just preserves all format markers wherever they occur. Another difference is that SGML flags format markers with angle brackets rather than with a backslash. KTEXT can recognize SGML markers if the format marker flag character is changed from backslash to left angle bracket (see section 7.3.2 below). Recognizing the end of the SGML format marker is a bit of a problem. While SGML uses a matching right angle bracket to indicate the end of the marker, SIL standard format simply uses a space to delineate the format marker from the following text. This means that for KTEXT to find the end of an SGML tag, you must leave at least one space after it, and there must not be any spaces in the middle of the SGML tag.

The KTEXT control file
**********************

KTEXT uses an overall control file to customize its operation. This file is structured as a "standard format database", composed of various fields marked by backslash codes. The fields in the control file are as follows.

`\textin'
     specifies the text input control file. This is used only in analysis mode.

`\textout'
     specifies the text output control file. This is used only in synthesis mode.

`\rules'
     specifies the PC-KIMMO phonological rules file.

`\lexicon'
     specifies the primary PC-KIMMO lexicon file.

`\grammar'
     specifies the PC-KIMMO word grammar file. This is normally used only in analysis mode.

`\ach'
     defines an analysis field change. This is used only in analysis mode, after a word has been parsed.

`\dch'
     defines a decomposition field change. This is used only in analysis mode, after a word has been parsed.

`\scl'
     defines a string class for use with the analysis or decomposition changes.
`\cat' defines how to extract the word category (part of speech) from the feature structure built by the word grammar. This is used only in analysis mode, and only if a word grammar file is loaded. `\fd' defines a labeled feature structure. This is used only in analysis mode, and only if a word grammar file is loaded. This gives names to feature structures produced by the word grammar for output to the analysis data file. `\rd' defines the root delimiters. The default pair of delimiters are < and >. Figure 7 shows a sample KTEXT control file. Figure 7. Sample KTEXT control file \textin engintx.ctl \rules d:\opac\test\ktext\englex\english.rul \lexicon d:\opac\test\ktext\englex\english.lex \grammar d:\opac\test\ktext\englex\english.grm \textout engoutx.ctl \cat \fd singular = SG \fd plural = PL When KTEXT reads its control file, it ignores any lines beginning with field codes other than those listed above. For example, a line beginning `\co' would be ignored. Such lines are treated as comments. Comments in the control file can also be indicated with the comment character, which by default is semicolon. This is the only way to place comments on the same line as a field. The comment character can be changed with the command line option `-c' when running KTEXT (see chapter 3). The text input control file *************************** This chapter describes the expected characteristics of an input text file, and the options offered for describing these characteristics by a "text input control file".(1) ---------- Footnotes ---------- (1) This chapter is adapted from chapters 7, 8, and 9 of Weber (1988). Input text files ================ Text input control files define a simple model of input text files. They are plain text files with two types of embedded format markers. 1. A primary format marker consists of one or more contiguous characters beginning with a special flag character. The default character initiating format markers is the backslash (`\'). 
Thus, each of the following would be recognized as a format marker and would not be processed by the program: \ \p \sp \begin{enumerate} \very-long.and;muddled/format*marker,to#be$sure Note that format markers cannot have a space or tab embedded in them; the first space or tab encountered terminates the format marker. One final note: the format character under discussion here applies only to the input text files which are to be processed. It has absolutely nothing to do with the use of backslash (`\') to flag field codes in control files such as the text input control file. 2. A secondary type of marker consists of a flag character followed by a single character from a list of known values. This secondary flag character must be different than the primary flag character. Its default value is the vertical bar (`|'), causing this type of format marker to be frequently called a bar code. The following could be valid (secondary) format markers and would not be processed by the program: |b |i |r Consider the following two lines of input text: \bgoodbye\r |bgoodbye|r Using the default definitions of format markers, the first line is considered to be a single format marker, and provides nothing which the program should try to parse. The second line, however contains two format markers, `|b' and `|r', and the word `goodbye' which would be processed by the program. The primary format markers serve to divide the text into fields. See `Fields to Exclude: \excl' and `Fields to Include: \incl' below for details on how these fields are used. There is no requirement that the format markers be at the beginning of a line as with the field codes used in KTEXT control files. Ambiguity Marker Character: \ambig ================================== The `\ambig' field defines the character used to mark ambiguities and failures in the analysis output file. 
For example, to use the hash mark (`#'), the text input control file would include: \ambig # This would cause an ambiguous analysis to be output as follows: \a #3#< N0 kay >#< V1 ka > IMP#< V1 ka > INF# It makes sense to use the `\ambig' field only once in the text input control file. If multiple `\ambig' fields do occur in the file, the value given in the first one is used. If the text input control file does not have an `\ambig' field, the percent sign (`%') is used. The first printing character following the `\ambig' field code is used as the ambiguity marker. The character currently being used to mark comments cannot be assigned to also mark ambiguities in the output file. Thus, the semicolon (`;') cannot normally be used as the ambiguity marker. Logically, this field should be in the KTEXT control file rather than the text *input* control file since it affects output instead of input. Nevertheless, compatibility demands that it stays this way. Bar code format marker character: \barchar ========================================== The `\barchar' defines the character that begins a two-character secondary format marker. For example, if this type of format marker begins with the dollar sign (`$'), the following would be placed in the text input control file: \barchar $ An empty `\barchar' field in the text input control file prevents any bar code format markers from being recognized. Thus, the following field effectively turns off special treatment of this style of format marking (assuming the `;' is marking comments): \barchar ; no bar character It makes sense to use the `\barchar' field only once in the text input control file. If multiple `\barchar' fields do occur in the file, the value given in the first one is used. The first printing character following the `\barchar' field code is used as the bar code format marker. The character currently being used to mark comments cannot be assigned to also flag format markers in input text files. 
That is, `\barchar ;' is treated as `\barchar' followed only by a comment, which effectively removes the concept of bar codes since no marker character is defined. Bar Code Format Code Characters: \barcodes ========================================== In conjunction with the special format marking character discussed in the previous section, the `\barcodes' field defines the individual characters used within bar codes. These characters may be separated by spaces or lumped together. Thus, the following two fields are equivalent: \barcodes abcdefg ; lumped together \barcodes a b c d e f g ; separated If more than one `\barcodes' field is provided in the text input control file, the combination of all characters defined in all such fields is used. No check is made for repeated characters: the previous example would be accepted without complaint despite the redundancy of the second line. The default value for the bar codes is the set of lowercase alphabetic letters `a'-`z'. Therefore, if the text input control file contains neither a `\barchar' nor a `\barcodes' field, the following bar codes are considered to be formatting information by KTEXT: `|a', `|b', `|c', ..., `|x', `|y', and `|z'. Text Orthography Change: \ch ============================ An orthography change is defined by the `\ch' field code followed by the actual orthography change. Any number of orthography changes may be defined in the text input control file. The output of each change serves as the input to the following change. That is, each change is applied as many times as necessary to an input word before the next change from the text input control file is applied. Basic changes ------------- To substitute one string of characters for another, these must be made known to the program in a change. (The technical term for this sort of change is a production, but we will simply call them changes.)
In the simplest case, a change is given in three parts: (1) the field code `\ch' must be given at the extreme left margin to indicate that this line contains a change; (2) the match string is the string for which the program must search; and (3) the substitution string is the replacement for the match string, wherever it is found. The beginning and end of the match and substitution strings must be marked. The first printing character following `\ch' (with at least one space or tab between) is used as the delimiter for that line. The match string is taken as whatever lies between the first and second occurrences of the delimiter on the line and the substitution string is whatever lies between the third and fourth occurrences. For example, the following lines indicate the change of hi to bye, where the delimiters are the double quote mark (`"'), the single quote mark (`''), the period (`.'), and the at sign (`@'). \ch "hi" "bye" \ch 'hi' 'bye' \ch .hi. .bye. \ch @hi@ @bye@ Throughout this document, we use the double quote mark as the delimiter unless there is some reason to do otherwise. Change tables follow these conventions: 1. Any characters (other than the delimiter) may be placed between the match and substitution strings. This allows various notations to symbolize the change. For example, the following are equivalent: \ch "thou" "you" \ch "thou" to "you" \ch "thou" > "you" \ch "thou" --> "you" \ch "thou" becomes "you" 2. Comments included after the substitution string are initiated by a semicolon (`;'), or whatever is indicated as the comment character by means of the `-c' option when KTEXT is started. The following lines illustrate the use of comments: \ch "qeki" "qiki" ; for cases like wawqeki \ch "thou" "you" ; for modern English 3. A change can be ignored temporarily by turning it into a comment field. This is done either by placing an unrecognized field code in front of the normal `\ch', or by placing the comment character (`;') in front of it. 
For example, only the first of the following three lines would effect a change: \ch "nb" "mp" \no \ch "np" "np" ;\ch "mb" "nb" The changes in the text input control file are applied as an ordered set of changes. The first change is applied to the entire word by searching from left to right for any matching strings and, upon finding any, replacing them with the substitution string. After the first change has been applied to the entire word, then the next change is applied, and so on. Thus, each change applies to the result of all prior changes. When all the changes have been applied, the resulting word is returned. For example, suppose we have the following changes: \ch "aib" > "ayb" \ch "yb" > "yp" Consider the effect these have on the word paiba. The first changes i to y, yielding payba; the second changes b to p, to yield paypa. (This would be better than the single change of aib to ayp if there were sources of yb other than the output of the first rule.) The way in which change tables are applied allows certain tricks. For example, suppose that for Quechua, we wish to change hw to f, so that hwista becomes fista and hwis becomes fis. However, we do not wish to change the sequence shw or chw to sf or cf (respectively). This could be done by the following sequence of changes. (Note, `@' and `$' are not otherwise used in the orthography.) \ch "shw" > "@" ; (1) \ch "chw" > "$" ; (2) \ch "hw" > "f" ; (3) \ch "@" > "shw" ; (4) \ch "$" > "chw" ; (5) Lines (1) and (2) protect the sequences shw and chw by changing them to distinguished symbols. This clears the way for the change of hw to f in (3). Then lines (4) and (5) restore `@' and `$' to shw and chw, respectively. (An alternative, simpler way to do this is discussed in the next section.) Environmentally constrained changes ----------------------------------- It is possible to impose string environment constraints (SECs) on changes in the orthography change tables.
The syntax of SECs is described in detail in `Syntax of Orthography Changes' below. For example, suppose we wish to change the mid vowels (e and o) to high vowels (i and u respectively) immediately before and after q. This could be done with the following changes: \ch "o" "u" / _ q / q _ \ch "e" "i" / _ q / q _ This is not entirely a hypothetical example; some Quechua practical orthographies write the mid vowels e and o. However, in the environment of /q/ these could be considered phonemically high vowels /i/ and /u/. Changing the mid vowels to high upon loading texts has the advantage that, for cases like upun "he drinks" and upoq "the one who drinks", the root needs to be represented internally only as upu "drink". But note, because of Spanish loans, it is not possible to change all cases of e to i and o to u. The changes must be conditioned. In reality, the regressive vowel-lowering effect of /q/ can pass over various intervening consonants, including /y/, /w/, /l/, /ll/, /r/, /m/, /n/, and /n~/. For example, /ullq/ becomes ollq, /irq/ becomes erq, and so on. Rather than list each of these cases as a separate constraint, it is convenient to define a class (which we label `+resonant') and use this class to simplify the SEC. Note that the string class must be defined (with the `\scl' field code) before it is used in a constraint. \scl +resonant y w l ll r m n n~ \ch "o" "u" / q _ / _ ([+resonant]) q \ch "e" "i" / q _ / _ ([+resonant]) q This says that the mid vowels become high vowels after /q/ and before /q/, possibly with an intervening /y/, /w/, /l/, /ll/, /r/, /m/, /n/, or /n~/. Consider the problem posed for Quechua in the previous section, that of changing hw to f. An alternative is to condition the change so that it does not apply immediately after a member of the string class `Affric' which contains s and c.
\scl Affric c s \ch "hw" "f" / [Affric] ~_ It is sometimes convenient to make certain changes only at word boundaries, that is, to change a sequence of characters only if they initiate or terminate the word. This conditioning is easily expressed, as shown in the following examples. \ch "this" "that" ; anywhere in the word \ch "this" "that" / # _ ; only if word initial \ch "this" "that" / _ # ; only if word final \ch "this" "that" / # _ # ; only if entire word Using text orthography changes ------------------------------ The purpose of orthography change is to convert text from an external orthography to an internal representation more suitable for morphological analysis. In many cases this is unnecessary, the practical orthography being completely adequate as the internal representation. In other cases, the practical orthography is an inconvenience that can be circumvented by converting to a more phonemic representation. Let us take a simple example from Latin. In the Latin orthography, the nominative singular masculine of the word "king" is rex. However, phonemically, this is really /reks/; /rek/ is the root meaning king and the /s/ is an inflectional suffix. If the program is to recover such an analysis, then it is necessary to convert the x of the external, practical orthography into ks internally. This can be done by including the following orthography change in the text input control file: \ch "x" "ks" In this, x is the match string and ks is the substitution string, as discussed in `Basic changes' above. Whenever x is found, ks is substituted for it. Let us consider next an example from Huallaga Quechua. The practical orthography currently represents long vowels by doubling the vowel. For example, what is written as kaa is /ka:/ "I am", where the length (represented by a colon) is the morpheme meaning "first person subject".
Other examples, such as upoo /upu:/ "I drink" and upichee /upi-chi-:/ "I extinguish", motivate us to convert all long vowels into a vowel followed by a colon. The following changes do this: \ch "aa" "a:" \ch "ee" "i:" \ch "ii" "i:" \ch "oo" "u:" \ch "uu" "u:" Note that the long high vowels (i and u) have become mid vowels (e and o respectively); consequently, the vowel in the substitution string is not necessarily the same as that of the match string. What is the utility of these changes? In the lexicon, the morphemes can be represented in their phonemic forms; they do not have to be represented in all their orthographic variants. For example, the first person subject morpheme can be represented simply as a colon (-:), rather than as -a in cases like kaa, as -o in cases like qoo, and as -e as in cases like upichee. Further, the verb "drink" can be represented as upu and the causative suffix (in upichee) can be represented as -chi; these are the forms these morphemes have in other (nonlowered) environments. As the next example, let us suppose that we are analyzing Spanish, and that we wish to work internally with k rather than c (before a, o, and u) and qu (before i and e). (Of course, this is probably not the only change we would want to make.) Consider the following changes: \ch "ca" "ka" \ch "co" "ko" \ch "cu" "ku" \ch "qu" "k" The first three handle c and the last handles qu. By virtue of including the vowel after c, we avoid changing ch to kh. There are other ways to achieve the same effect. One way exploits the fact that each change is applied to the output of all previous changes. Thus, we could first protect ch by changing it to some distinguished character (say `@'), then changing c to k, and then restoring `@' to ch: \ch "ch" "@" \ch "c" "k" \ch "@" "ch" \ch "qu" "k" Another approach conditions the change by the adjacent characters. 
The changes could be rewritten as \ch "c" "k" / _a / _o / _u ; only before a, o, or u \ch "qu" "k" ; in all cases The first change says, "change c to k when followed by a, o, or u." (This would, for example, change como to komo, but would not affect chal.) The syntax of such conditions is exactly that used in string environment constraints; see `Syntax of Orthography Changes' below. Where orthography changes apply ------------------------------- Input orthography changes are made when the text being processed may be written in a practical orthography. Rather than requiring that it be converted as a prerequisite to running the program, it is possible to have the program convert the orthography as it loads and before it processes each word. The changes loaded from the text input control file are applied after all the text is converted to lower case (and the information about upper and lower case, along with information about format marking, punctuation and white space, has been put to one side). Consequently, the match strings of these orthography changes should be all lower case; any change that has an uppercase character in the match string will never apply. A sample orthography change table --------------------------------- We include here the entire orthography input change table for Caquinte (a language of Peru). There are basically four changes that need to be made: (1) nasals, which in the practical orthography reflect their assimilation to the point of articulation of a following noncontinuant, must be changed into an unspecified nasal, represented by N; (2) c and qu are changed to k; (3) j is changed to h; and (4) gu is changed to g before i and e.
     \ch "mp"  "Np"     ; for unspecified nasals
     \ch "nch" "Nch"
     \ch "nc"  "Nk"
     \ch "nqu" "Nk"
     \ch "nt"  "Nt"
     \ch "ch"  "@"      ; to protect ch
     \ch "c"   "k"      ; other c's to k
     \ch "@"   "ch"     ; to restore ch
     \ch "qu"  "k"
     \ch "j"   "h"
     \ch "gue" "ge"
     \ch "gui" "gi"

This change table can be simplified by the judicious use of string environment constraints:

     \ch "m"  > "N" / _p
     \ch "n"  > "N" / _c / _t / _qu
     \ch "c"  > "k" / _~h
     \ch "qu" > "k"
     \ch "j"  > "h"
     \ch "gu" > "g" / _e /_i

As suggested by the preceding examples, the text orthography change table is composed of all the `\ch' fields found in the text input control file. These may appear anywhere in the file relative to the other fields. It is recommended that all the orthography changes be placed together in one section of the text input control file, rather than being mixed in with other fields.

Syntax of Orthography Changes
-----------------------------

This section presents a grammatical description of the syntax of orthography changes in BNF notation.

     1a. <orthochange>  ::= <basic_change>
     1b.                    <basic_change> <constraints>

     2a. <basic_change> ::= <quote><quote> <quote><string><quote>
     2b.                    <quote><string><quote> <quote><quote>
     2c.                    <quote><string><quote> <quote><string><quote>

     3.  <quote>        ::= any printing character not used in either
                            the ``from'' string or the ``to'' string

     4.  <string>       ::= one or more characters other than the quote
                            character used by this orthography change

     5a. <constraints>  ::= <change_envir>
     5b.                    <change_envir> <constraints>

     6a. <change_envir> ::= <marker> <leftside> <envbar> <rightside>
     6b.                    <marker> <leftside> <envbar>
     6c.                    <marker> <envbar> <rightside>

     7a. <leftside>     ::= <side>
     7b.                    <boundary>
     7c.                    <boundary> <side>

     8a. <rightside>    ::= <side>
     8b.                    <boundary>
     8c.                    <side> <boundary>

     9a. <side>         ::= <item>
     9b.                    <item> <side>
     9c.                    <item> ... <side>

     10a. <item>        ::= <piece>
     10b.                   ( <piece> )

     11a. <piece>       ::= ~ <piece>
     11b.                   <literal>
     11c.                   [ <literal> ]

     12. <marker>       ::= /
                            +/

     13. <envbar>       ::= _
                            ~_

     14. <boundary>     ::= #
                            ~#

     15. <literal>      ::= one or more contiguous characters

Comments on selected BNF rules
..............................

2. The same `<quote>' character must be used at both the beginning and the end of both the "from" string and the "to" string.
3. The double quote (`"') and single quote (`'') characters are most often used.
7-8. Note that what can appear to the left of the environment bar is a mirror image of what can appear to the right.
9c. An ellipsis (`...') indicates a possible break in contiguity.
10b. Something enclosed in parentheses is optional.
11a.
A tilde (`~') reverses the desirability of an element, causing the constraint to fail if it is found rather than fail if it is not found. 11c. A literal enclosed in square brackets must be the name of a string class defined by a `\scl' field in the analysis data file, or earlier in the dictionary orthography change file. 12. A `+/' is usually used for morpheme environment constraints, but may be used for change environment constraints in `\ch' fields in the dictionary orthography change table file. 13. A tilde attached to the environment bar (`~_') inverts the sense of the constraint as a whole. 14b. The boundary marker preceded by a tilde (`~#') indicates that it must not be a word boundary. 15. The special characters used by environment constraints can be included in a literal only if they are immediately preceded by a backslash: \+ \/ \# \~ \[ \] \( \) \. \_ \\ Decomposition Separation Character: \dsc ======================================== The `\dsc' field defines the character used to separate the morphemes in the decomposition field of the output analysis file. For example, to use the equal sign (`='), the text input control file would include: \dsc = This would cause a decomposition field to be output as follows: \d %3%kay%ka=y%ka=y% It makes sense to use the `\dsc' field only once in the text input control file. If multiple `\dsc' fields do occur in the file, the value given in the first one is used. If the text input control file does not have a `\dsc' field, a dash (`-') is used. The first printing character following the `\dsc' field code is used as the morpheme decomposition separator character. The same character cannot be used both for separating decomposed morphemes in the analysis output file and for marking comments in the input control files. Thus, one normally cannot use the semicolon (`;') as the decomposition separation character.
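
How the ambiguity marker and the decomposition separator combine in a `\d' field can be seen in a minimal Python sketch. It is not part of KTEXT; the function name `parse_decomposition' is hypothetical, and the layout of an ambiguous field (marker, count, marker, then each alternative followed by a marker) is inferred from the examples in this manual. Records for failed analyses are not handled, since their layout is not shown here.

```python
def parse_decomposition(field, ambig="%", dsc="-"):
    # Return a list of alternatives, each a list of morpheme strings.
    # Defaults mirror the documented defaults: '%' for \ambig, '-' for \dsc.
    if field.startswith(ambig):
        # Assumed layout: <ambig>COUNT<ambig>ALT1<ambig>ALT2<ambig>...
        count, *alternatives = field.strip(ambig).split(ambig)
        if int(count) != len(alternatives):
            raise ValueError("count does not match number of alternatives")
    else:
        # Unambiguous field: a single decomposition with no markers.
        alternatives = [field]
    return [alt.split(dsc) for alt in alternatives]


print(parse_decomposition("%3%kay%ka=y%ka=y%", dsc="="))
# [['kay'], ['ka', 'y'], ['ka', 'y']]
print(parse_decomposition("be`gin-+ING"))
# [['be`gin', '+ING']]
```

The sketch also shows why the separator should be a character that never occurs inside a morpheme: splitting is purely textual.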
Logically, this field should be in the KTEXT control file rather than the text *input* control file since it affects output instead of input. Nevertheless, compatibility demands that it stays this way. Fields to Exclude: \excl ======================== The `\excl' field excludes one or more fields from processing. For example, to have the program ignore everything in `\co' and `\id' fields, the following line is included in the text input control file: \excl \co \id ; ignore these fields If more than one `\excl' field is found in the text input control file, the contents of each field is added to the overall list of text fields to exclude. This list is initially empty, and stays empty unless the text input control file contains an `\excl' field. Thus, no text fields are excluded from processing by default. If the text input control file contains `\excl' fields, then only those text fields are not processed. Every word in every text field not mentioned explicitly in an `\excl' field will be processed. Note that every text field in the input text files is processed unless the text input control file contains either an `\excl' or an `\incl' field. One or the other is used to limit processing, but never both. Primary format marker character: \format ======================================== The `\format' field designates a single character to flag the beginning of a primary format marker. For example, if the format markers in the text files begin with the at sign (`@'), the following would be placed in the text input control file: \format @ This would be used, for example, if the text contained format markers like the following: @ @p @sp @make(Article) @very-long.and;muddled/format*marker,to#be$sure If a `\format' field occurs in the text input control file without a following character to serve for flagging format markers, then the program will not recognize any format markers and will try to parse everything other than punctuation characters. 
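
The effect of the flag character can be illustrated with a small Python sketch. It is only an approximation of the behavior described above, not KTEXT's own code: the name `split_markers' is hypothetical, bar codes are ignored, and it assumes that a marker starts wherever the flag character occurs and runs to the next whitespace.

```python
import re


def split_markers(text, flag="\\"):
    # Return (markers, remaining_tokens) for one stretch of input text.
    if not flag:
        # No flag character defined: nothing is treated as a format marker.
        return [], text.split()
    # A marker is the flag character plus everything up to the next whitespace.
    pattern = re.compile(re.escape(flag) + r"\S*")
    markers = pattern.findall(text)
    remaining = pattern.sub(" ", text).split()
    return markers, remaining


print(split_markers("@p Hello @make(Article) world", flag="@"))
# (['@p', '@make(Article)'], ['Hello', 'world'])
print(split_markers("\\bgoodbye\\r"))
# (['\\bgoodbye\\r'], [])
```

With an empty flag, as when `\format' is given without a character, every token is left for the parser.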
It makes sense to use the `\format' field only once in the text input control file. If multiple `\format' fields do occur in the file, the value given in the first one is used. The first printing character following the `\format' field code is used to flag format markers. The character currently used to mark comments cannot be assigned to also flag format markers. Thus, the semicolon (`;') cannot normally be used to flag format markers. Fields to Include: \incl ======================== The `\incl' field explicitly includes one or more text fields for processing, excluding all other fields. For instance, to process everything in `\txt' and `\qt' fields, but ignore everything else, the following line is placed in the text input control file: \incl \txt \qt ; process these fields If more than one `\incl' field is found in the text input control file, the contents of each field are added to the overall list of text fields to process. This list is initially empty, and stays empty unless the text input control file contains an `\incl' field. If the text input control file contains `\incl' fields, then only those text fields are processed. Every word in every text field not mentioned explicitly in an `\incl' field will not be processed. Note that every text field in the input text files is processed unless the text input control file contains either an `\excl' or an `\incl' field. One or the other is used to limit processing, but never both. Lowercase/uppercase character pairs: \luwfc =========================================== To break a text into words, the program needs to know which characters are used to form words. It always assumes that the letters `A' through `Z' and `a' through `z' are used as word formation characters. If the orthography of the language the user is working in uses any other characters that have lowercase and uppercase forms, these must be given in a `\luwfc' field in the text input control file.
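
Why the extra characters must be declared can be seen in a short Python sketch of word breaking. This is an illustration under simplifying assumptions rather than KTEXT's actual algorithm: the function `break_into_words' is hypothetical, and every character outside the word formation set is simply treated as a word boundary.

```python
import string


def break_into_words(text, extra_wfc=""):
    # Word formation characters: A-Z and a-z, plus any declared extras.
    wfc = set(string.ascii_letters) | set(extra_wfc)
    words, current = [], []
    for ch in text:
        if ch in wfc:
            current.append(ch)
        elif current:
            # Any other character ends the word being collected.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(break_into_words("El niño come."))
# ['El', 'ni', 'o', 'come']
print(break_into_words("El niño come.", extra_wfc="ñÑ"))
# ['El', 'niño', 'come']
```

Without the declaration, the undeclared letter splits the word in two, and neither fragment is likely to parse.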
The `\luwfc' field defines pairs of characters; the first member of each pair is a lowercase character and the second is the corresponding uppercase character. Several such pairs may be placed in the field or they may be placed in separate fields. Whitespace may be interspersed freely. For example, the following three examples are equivalent: \luwfc éÉ ñÑ or \luwfc éÉ ; e with acute accent \luwfc ñÑ ; enye or \luwfc é É ñ Ñ Note that comments can be used as well (just as they can in any KTEXT control file). This means that the comment character cannot be designated as a word formation character. If the orthography includes the semicolon (`;'), then a different comment character must be defined with the `-c' command line option when KTEXT is initiated; see `Running KTEXT' above. The `\luwfc' field can be entered anywhere in the text input control file, although a natural place would be before the `\wfc' (word formation character) field. Any standard alphabetic character (that is `a' through `z' or `A' through `Z') in the `\luwfc' field will override the standard lower-upper case pairing. For example, the following will treat `X' as the upper case equivalent of `z': \luwfc z X Note that `Z' will still have `z' as its lower-case equivalent in this case. The `\luwfc' field is allowed to map multiple lower case characters to the same upper case character, and vice versa. This is needed for languages that do not mark tone on upper case letters. Multibyte lowercase/uppercase character pairs: \luwfcs ====================================================== The `\luwfcs' field extends the character pair definitions of the `\luwfc' field to multibyte character sequences. Like the `\luwfc' field, the `\luwfcs' field defines pairs of characters; the first member of each pair is a multibyte lowercase character and the second is the corresponding multibyte uppercase character. Several such pairs may be placed in the field or they may be placed in separate fields.
Whitespace separates the members of each pair, and the pairs from each other. For example, the following three examples are equivalent: \luwfcs e' E` n~ N^ ç C& or \luwfcs e' E` ; e with acute accent \luwfcs n~ N^ ; enye \luwfcs ç C& ; c cedilla or \luwfcs e' E` n~ N^ ç C& Note that comments can be used as well (just as they can in any KTEXT control file). This means that the comment character cannot be designated as a word formation character. If the orthography includes the semicolon (`;'), then a different comment character must be defined with the `-c' command line option when KTEXT is initiated; see `Running KTEXT' above. Also note that there is no requirement that the lowercase form be the same length (number of bytes) as the uppercase form. The examples shown above are only one or two bytes (character codes) in length, but there is no limit placed on the length of a multibyte character. The `\luwfcs' field can be entered anywhere in the text input control file. `\luwfcs' fields may be mixed with `\luwfc' fields in the same file. Any standard alphabetic character (that is `a' through `z' or `A' through `Z') in the `\luwfcs' field will override the standard lower-upper case pairing. For example, the following will treat `X' as the upper case equivalent of `z': \luwfcs z X Note that `Z' will still have `z' as its lowercase equivalent in this case. The `\luwfcs' field is allowed to map multiple multibyte lowercase characters to the same multibyte uppercase character, and vice versa. This may be useful in some situations, but it introduces an element of ambiguity into the decapitalization and recapitalization processes. If ambiguous capitalization is supported, then for the previous example, `z' will have both `X' and `Z' as uppercase equivalents, and `X' will have both `x' and `z' as lowercase equivalents.
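
The ambiguous case just described can be made concrete with a Python sketch. It models only the "ambiguous capitalization" reading, in which declared pairs are added to the standard pairs instead of replacing them; the names `parse_pairs' and `build_case_maps' are hypothetical, and the field value is assumed to have had any comment removed already.

```python
import string
from collections import defaultdict


def parse_pairs(field_value):
    # Whitespace-separated tokens, taken two at a time as (lower, upper).
    tokens = field_value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens")
    return list(zip(tokens[0::2], tokens[1::2]))


def build_case_maps(pairs):
    # Each form maps to a list, so one form can have several equivalents.
    upper_of = defaultdict(list)
    lower_of = defaultdict(list)
    standard = list(zip(string.ascii_lowercase, string.ascii_uppercase))
    for lower, upper in standard + list(pairs):
        if upper not in upper_of[lower]:
            upper_of[lower].append(upper)
        if lower not in lower_of[upper]:
            lower_of[upper].append(lower)
    return upper_of, lower_of


upper_of, lower_of = build_case_maps(parse_pairs("z X n~ N^"))
print(upper_of["z"])   # ['Z', 'X']
print(lower_of["X"])   # ['x', 'z']
print(lower_of["Z"])   # ['z']
print(upper_of["n~"])  # ['N^']
```

Because a capital such as `X' now has two lowercase candidates, a fully capitalized word can yield several decapitalized forms, one per combination of candidates.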
Maximum number of decapitalizations: \maxdecap ============================================== The `\maxdecap' field sets the maximum number of different decapitalizations allowed. Since the `\luwfc' field can map several lowercase characters onto a single uppercase character, a word with uppercase characters can (logically) generate a number of alternatives when decapitalized. This is especially true of words that are entirely capitalized to begin with. The default limit is 100. Prevent Any Decapitalization: \nocap ==================================== The usual behavior is to normalize input words to lowercase. The program remembers the case of the word as one of four possibilities: 1. all uppercase 2. all lowercase 3. only the first letter uppercase 4. mixed uppercase and lowercase However, not all orthographies use the concept of capitalization. To help deal with these, the field code `\nocap' disables all case normalization if it appears anywhere in the text input control file. Prevent Decapitalization of Individual Characters: \noincap =========================================================== The handling of mixed uppercase and lowercase is limited in utility, and sometimes causes more problems than it solves. For this reason, the `\noincap' field code turns off mixed case decapitalization. The program would still decapitalize words that are entirely capitalized and words that begin with a capital letter. String class: \scl ================== A string class is defined by the `\scl' field code followed by the class name, which is followed in turn by one or more contiguous character strings or (previously defined) string class names. A string class name used as part of the class definition must be enclosed in square brackets. The class name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. The actual character strings have no such restrictions. 
The individual members of the class are separated by spaces, tabs, or newlines. Each `\scl' field defines a single string class. Any number of `\scl' fields may appear in the file. The only restriction is that a string class must be defined before it is used. For example, the first two lines of the simpler Caquinte example above could be given as follows: \scl -bilabial c t qu \ch "m" > "N" / _ p \ch "n" > "N" / _ [-bilabial] The string class definition could be in another control file: string classes defined elsewhere can be used in the text input control file as well. If no `\scl' fields appear in the text input control file, then KTEXT does not allow any string classes in text input orthography change environment constraints unless they are defined in the KTEXT control file. Caseless word formation characters: \wfc ======================================== To break a text into words, the program needs to know which characters are used to form words. It always assumes that the letters `A' through `Z' and `a' through `z' are used as word formation characters. If the orthography of the language the user is working in uses any characters that do not have different lowercase and uppercase forms, these must be given in a `\wfc' field in the text input control file. For example, English uses an apostrophe character (`'') that could be considered a word formation character. This information is provided by the following example: \wfc ' ; needed for words like don't Notice that the characters in the `\wfc' field may be separated by spaces, although this is not required. If more than one `\wfc' field occurs in the text input control file, the program uses the combination of all characters defined in all such fields as word formation characters. The comment character cannot be designated as a word formation character.
If the orthography includes the semicolon (`;'), then a different
comment character must be defined with the `-c' command line option
when KTEXT is initiated; see `Running KTEXT' above.

Multibyte caseless word formation characters: \wfcs
===================================================

The `\wfcs' field allows multibyte characters to be defined as
"caseless" word formation characters. It has the same relationship to
`\wfc' that `\luwfcs' has to `\luwfc'. The multibyte word formation
characters are separated from each other by whitespace.

A sample text input control file
================================

The following is the complete text input control file for Huallaga
Quechua (a language of Peru):

     \id HGTEXT.CTL - for Huallaga Quechua, 25-May-88
     \co WORD FORMATION CHARACTERS
     \wfc ' ~

     \co FIELDS TO EXCLUDE
     \excl \id               ; identification fields

     \co ORTHOGRAPHY CHANGES
     \ch "aa" > "a:"         ; for long vowels
     \ch "ee" > "i:"
     \ch "ii" > "i:"
     \ch "oo" > "u:"
     \ch "uu" > "u:"
     \ch "qeki" > "qiki"     ; for cases like wawqeki
     \ch "~n" > "n~"         ; for typos
                             ; for Spanish loans like hwista
     \scl sib s c            ; sibilants
     \ch "hw" > "f" / ~[sib]_

The text output control file
****************************

The text output module restores a processed document from the internal
format to its textual form. It re-imposes capitalization on words and
restores punctuation, format markers, white space, and line breaks.
Also, orthography changes can be made, and the delimiter that marks
ambiguities and failures can be changed. This chapter describes the
control file given to the text output module.(1)

---------- Footnotes ----------

(1) This chapter is adapted from chapter 8 of Weber (1990).

Text output ambiguity delimiter: \ambig
=======================================

The text output module flags words that either produced no results or
multiple results when processed. These are flagged with percent signs
(`%') by default, but this can be changed by declaring the desired
character with the `\ambig' field code.
For example, the following would change the ambiguity delimiter to `@':

     \ambig @

Text output orthographic changes: \ch
=====================================

The text output module allows orthographic changes to be made to the
processed words. These are given in the text output control file.
(They have exactly the same form as the input orthographic changes.)

The output orthographic changes allow conversion from the internal
representation used by the program to the practical orthography of the
target language. These changes are applied to the words after they have
been processed, but before the text is re-assembled (from the internal
format) for output. For example:

     \ch "N" "m" / _ p       ; assimilates before p
     \ch "N" "n"             ; otherwise stays n

The first change makes N into m when it directly precedes p; the second
makes all other N's into n.

Decomposition Separation Character: \dsc
========================================

The `\dsc' field defines the character used to separate the morphemes
in the decomposition field of the input analysis file. For example, to
use the equal sign (`='), the text output control file would include:

     \dsc =

This would handle a decomposition field like the following:

     \d %3%kay%ka=y%ka=y%

It makes sense to use the `\dsc' field only once in the text output
control file. If multiple `\dsc' fields do occur in the file, the value
given in the first one is used. If the text output control file does
not have a `\dsc' field, a dash (`-') is used.

The first printing character following the `\dsc' field code is used as
the morpheme decomposition separator character. The same character
cannot be used both for separating decomposed morphemes in the analysis
output file and for marking comments in the output control files. Thus,
one normally cannot use the semicolon (`;') as the decomposition
separation character.

This field is provided for use by the INTERGEN program. It is of little
use to KTEXT.
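As an illustration of how a program reading the analysis file might use
the separator, the following Python sketch splits a decomposition field
like the one shown above into its alternatives and then into morphemes.
It is not part of KTEXT: the function names `split_alternatives' and
`split_morphemes' are hypothetical, and the sketch assumes the layout
seen in the example (delimiter, count, delimiter, then each alternative
followed by a delimiter).

```python
def split_alternatives(value, ambig="%"):
    """Split a field value into (count, alternatives).

    Assumes the layout "%N%alt1%alt2%...%"; a value that does not start
    with the delimiter is treated as a single unambiguous result.
    """
    value = value.strip()
    if not value.startswith(ambig):
        return 1, [value]
    parts = value.split(ambig)  # "%2%a%b%" -> ["", "2", "a", "b", ""]
    count = int(parts[1])
    # Drop the empty string produced by the trailing delimiter, if any.
    alternatives = parts[2:-1] if parts[-1] == "" else parts[2:]
    return count, alternatives


def split_morphemes(decomposition, dsc="-"):
    """Split one decomposition into morphemes on the separator character."""
    return decomposition.split(dsc)


count, alternatives = split_alternatives("%3%kay%ka=y%ka=y%")
morphemes = [split_morphemes(alt, dsc="=") for alt in alternatives]
print(count, morphemes)
```

With the `\d' field shown above and `=' as the separator, this yields
three alternatives: kay, ka + y, and ka + y.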
Primary format marker character: \format
========================================

The `\format' field designates a single character to flag the beginning
of a primary format marker. For example, if the format markers in the
text files begin with the at sign (`@'), the following would be placed
in the text input control file:

     \format @

This would be used, for example, if the text contained format markers
like the following:

     @
     @p
     @sp
     @make(Article)
     @very-long.and;muddled/format*marker,to#be$sure

If a `\format' field occurs in the text input control file without a
following character to serve for flagging format markers, then the
program will not recognize any format markers and will try to parse
everything other than punctuation characters.

It makes sense to use the `\format' field only once in the text input
control file. If multiple `\format' fields do occur in the file, the
value given in the first one is used.

The first printing character following the `\format' field code is used
to flag format markers. The character currently used to mark comments
cannot be assigned to also flag format markers. Thus, the semicolon
(`;') cannot normally be used to flag format markers.

This field is provided for use by the INTERGEN program. It is of little
use to KTEXT.

Lowercase/uppercase character pairs: \luwfc
===========================================

To break a text into words, the program needs to know which characters
are used to form words. It always assumes that the letters `A' through
`Z' and `a' through `z' are used as word formation characters. If the
orthography of the language the user is working in uses any other
characters that have lowercase and uppercase forms, these must be given
in a `\luwfc' field in the text input control file.

The `\luwfc' field defines pairs of characters; the first member of
each pair is a lowercase character and the second is the corresponding
uppercase character.
Several such pairs may be placed in the field or they may be placed in
separate fields. Whitespace may be interspersed freely. For example,
the following three examples are equivalent:

     \luwfc éÉ ñÑ

or

     \luwfc éÉ     ; e with acute accent
     \luwfc ñÑ     ; enyee

or

     \luwfc é É ñ Ñ

Note that comments can be used as well (just as they can in any KTEXT
control file). This means that the comment character cannot be
designated as a word formation character. If the orthography includes
the semicolon (`;'), then a different comment character must be defined
with the `-c' command line option when KTEXT is initiated; see `Running
KTEXT' above.

The `\luwfc' field can be entered anywhere in the text input control
file, although a natural place would be before the `\wfc' (word
formation character) field.

Any standard alphabetic character (that is `a' through `z' or `A'
through `Z') in the `\luwfc' field will override the standard
lower-upper case pairing. For example, the following will treat `X' as
the upper case equivalent of `z':

     \luwfc z X

Note that `Z' will still have `z' as its lower-case equivalent in this
case.

The `\luwfc' field is allowed to map multiple lower case characters to
the same upper case character, and vice versa. This is needed for
languages that do not mark tone on upper case letters.

Multibyte lowercase/uppercase character pairs: \luwfcs
======================================================

The `\luwfcs' field extends the character pair definitions of the
`\luwfc' field to multibyte character sequences. Like the `\luwfc'
field, the `\luwfcs' field defines pairs of characters; the first
member of each pair is a multibyte lowercase character and the second
is the corresponding multibyte uppercase character. Several such pairs
may be placed in the field or they may be placed in separate fields.
Whitespace separates the members of each pair, and the pairs from each
other.
For example, the following three examples are equivalent:

     \luwfcs e' E` n~ N^ ç C&

or

     \luwfcs e' E`     ; e with acute accent
     \luwfcs n~ N^     ; enyee
     \luwfcs ç C&      ; c cedilla

or

     \luwfcs e' E`
             n~ N^
             ç  C&

Note that comments can be used as well (just as they can in any KTEXT
control file). This means that the comment character cannot be
designated as a word formation character. If the orthography includes
the semicolon (`;'), then a different comment character must be defined
with the `-c' command line option when KTEXT is initiated; see `Running
KTEXT' above.

Also note that there is no requirement that the lowercase form be the
same length (number of bytes) as the uppercase form. The examples shown
above are only one or two bytes (character codes) in length, but there
is no limit placed on the length of a multibyte character.

The `\luwfcs' field can be entered anywhere in the text input control
file. `\luwfcs' fields may be mixed with `\luwfc' fields in the same
file.

Any standard alphabetic character (that is `a' through `z' or `A'
through `Z') in the `\luwfcs' field will override the standard
lower-upper case pairing. For example, the following will treat `X' as
the upper case equivalent of `z':

     \luwfcs z X

Note that `Z' will still have `z' as its lowercase equivalent in this
case.

The `\luwfcs' field is allowed to map multiple multibyte lowercase
characters to the same multibyte uppercase character, and vice versa.
This may be useful in some situations, but it introduces an element of
ambiguity into the decapitalization and recapitalization processes. If
ambiguous capitalization is supported, then for the previous example,
`z' will have both `X' and `Z' as uppercase equivalents, and `X' will
have both `x' and `z' as lowercase equivalents.

Text output string classes: \scl
================================

It is possible to define string classes, as discussed in section
`String class: \scl' above.
For example, the sample text output control file given below contains
the following lines:

     a. \scl X t s c
     b. \ch "h" "j" / [X] ~_

Line a defines a string class including t, s, and c; change rule b
makes use of this class to block the change of h to j when it occurs in
the digraphs th, sh, and ch.

Changes in the text output control file may also make use of string
classes defined in the KTEXT control file.

Caseless word formation characters: \wfc
========================================

To break a text into words, the program needs to know which characters
are used to form words. It always assumes that the letters `A' through
`Z' and `a' through `z' are used as word formation characters. If the
orthography of the language the user is working in uses any characters
that do not have different lowercase and uppercase forms, these must be
given in a `\wfc' field in the text input control file.

For example, English uses an apostrophe character (`'') that could be
considered a word formation character. This information is provided by
the following example:

     \wfc '     ; needed for words like don't

Notice that the characters in the `\wfc' field may be separated by
spaces, although this is not required. If more than one `\wfc' field
occurs in the text input control file, the program uses the combination
of all characters defined in all such fields as word formation
characters.

The comment character cannot be designated as a word formation
character. If the orthography includes the semicolon (`;'), then a
different comment character must be defined with the `-c' command line
option when KTEXT is initiated; see `Running KTEXT' above.

Multibyte caseless word formation characters: \wfcs
===================================================

The `\wfcs' field allows multibyte characters to be defined as
"caseless" word formation characters. It has the same relationship to
`\wfc' that `\luwfcs' has to `\luwfc'.
The multibyte word formation characters are separated from each other
by whitespace.

A sample text output control file
=================================

A complete text output control file used for adapting to Asheninca
Campa is given below.

     \id AEouttx.ctl for Asheninca Campa
     \ch "N" "m" / _ p           ; assimilates before p
     \ch "N" "n"                 ; otherwise becomes n
     \ch "ny" "n~"
     \ch "ts" "th" / ~_ i        ; (N)tsi is unchanged
     \ch "tsy" "ch"
     \ch "sy" "sh"
     \ch "t" "tz" / n _ i
     \ch "k" "qu" / _ i / _ e
     \ch "k" "q" / _ y
     \ch "k" "c"
     \scl X t s c                ; define class of t s c
     \ch "h" "j" / [X] ~_        ; change except in th, sh, ch
     \ch "#" " "                 ; remove fixed space
     \ch "@" ""                  ; remove blocking character

KTEXT analysis output
*********************

Analysis files are "record oriented standard format files". This means
that the files are divided into records, each representing a single
word in the original input text file, and records are divided into
fields. An analysis file contains at least one record, and may contain
a large number of records. Each record contains one or more fields.
Each field occupies at least one line, and is marked by a "field code"
at the beginning of the line. A field code begins with a backslash
character (`\'), and contains one or more letters in addition.

Analysis file fields
====================

This section describes the possible fields in an analysis file. The
only field that is guaranteed to exist is the analysis (`\a') field.
All other fields are either data dependent or optional.

Analysis field: \a
------------------

The analysis field (`\a') starts each record of an analysis file. It
has the following form:

     \a PFX IFX PFX < CAT root CAT root > SFX IFX SFX

where `PFX' is a prefix morphname, `IFX' is an infix morphname, `SFX'
is a suffix morphname, `CAT' is a root category, and `root' is a root
gloss or etymology. In the simplest case, an analysis field would look
like this:

     \a < CAT root >

where `CAT' is a root category and `root' is a root gloss or etymology.
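The bracketed layout above can be taken apart mechanically. The
following Python sketch is illustrative only and is not part of KTEXT:
the function name `parse_analysis' and the dictionary keys it returns
are hypothetical, and it assumes a single, unambiguous analysis in
exactly the form shown, with categories and roots alternating between
the `<' and `>' markers.

```python
def parse_analysis(value):
    """Split the text of an \\a field into affix names and root pairs.

    Assumes the form "PFX ... < CAT root [CAT root ...] > SFX ...".
    Raises ValueError if either angle-bracket marker is missing.
    """
    tokens = value.split()
    start = tokens.index("<")
    end = tokens.index(">")
    inside = tokens[start + 1:end]
    # Pair each category with the root that follows it.
    roots = list(zip(inside[0::2], inside[1::2]))
    return {
        "before_root": tokens[:start],    # prefix and infix morphnames
        "roots": roots,                   # (category, root) pairs
        "after_root": tokens[end + 1:],   # suffix and infix morphnames
    }


parsed = parse_analysis("PFX IFX PFX < CAT root CAT root > SFX IFX SFX")
print(parsed["roots"])
```

The sketch does not try to tell prefixes from infixes, since the field
itself carries only the morphnames in order.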
Decomposition field: \d
-----------------------

The morpheme decomposition field (`\d') follows the analysis field. It
has the following form:

     \d anti-dis-establish-ment-arian-ism-s

where the hyphens separate the individual morphemes in the surface form
of the word.

The `\dsc' field in the text input control file can replace the hyphen
with another character for separating the morphemes; see `Decomposition
Separation Character: \dsc' above.

Category field: \cat
--------------------

The category field (`\cat') provides rudimentary category information.
This may be useful for sentence level parsing. It has the following
form:

     \cat CAT

where `CAT' is the word category.

To request KTEXT to output the final category, include the field `\cat'
in the KTEXT control file. This field specifies the feature path in the
word level feature structure that contains the grammatical category
(part of speech). Note that this requires a word grammar to be loaded.

If there are multiple analyses, there will be multiple categories in
the output, separated by ambiguity markers.

Feature Descriptors field: \fd
------------------------------

The feature descriptor field (`\fd') contains the feature names
associated with each morpheme in the analysis. It has the following
form:

     \fd ==feat1 feat2=feat3=

where `feat1', `feat2', and `feat3' are feature descriptors. The equal
signs (`=') serve to separate the feature descriptors of the individual
morphemes. Note that morphemes may have more than one feature
descriptor, with the names separated by spaces, or no feature
descriptors at all.

The feature descriptor field requires a word grammar and one or more
`\feat' fields in the KTEXT control file.

If there are multiple analyses, there will be multiple feature sets in
the output, separated by ambiguity markers.

Word field: \w
--------------

The original word field (`\w') contains the original input word as it
looks before decapitalization and orthography changes.
It looks like this:

     \w The

Note that this is a gratuitous change from earlier versions of AMPLE
and KTEXT, which wrote the decapitalized form.

Formatting field: \f
--------------------

The format information field (`\f') records any formatting codes or
punctuation that appeared in the input text file before the word. It
looks like this:

     \f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
     	\\c 5\n\n
     	\\s

where backslashes (`\') in the input text are doubled, newlines are
represented by `\n', and additional lines in the field start with a tab
character.

The format information field is written to the output analysis file
whenever it is needed, that is, whenever formatting codes or
punctuation exist before words.

Capitalization field: \c
------------------------

The capitalization field (`\c') records any capitalization of the input
word. It looks like this:

     \c 1

where the number following the field code has one of these values:

`1'
     the first (or only) letter of the word is capitalized

`2'
     all letters of the word are capitalized

`4-32767'
     some letters of the word are capitalized and some are not

Note that the third form is of limited utility, but still exists
because of words like the author's last name.

The capitalization field is written to the output analysis file
whenever any of the letters in the word are capitalized.

Nonalphabetic field: \n
-----------------------

The nonalphabetic field (`\n') records any trailing punctuation, bar
code or whitespace characters. It looks like this:

     \n |r.\n

where newlines are represented by `\n'. The nonalphabetic field ends
with the last whitespace character immediately following the word.

The nonalphabetic field is written to the output analysis file whenever
the word is followed by anything other than a single space character.
This includes the case when a word ends a file with nothing following
it.
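The escaping conventions described for the `\f' and `\n' fields can be
undone with a short routine. The following Python sketch is a minimal
illustration and not KTEXT's own code: the function name
`decode_field_text' is hypothetical, and it handles only the two
conventions stated above (doubled backslashes and `\n' for a newline).
It does not deal with the tab that begins continuation lines of a `\f'
field.

```python
def decode_field_text(raw):
    """Turn the escaped text of a \\f or \\n field back into plain text.

    A doubled backslash becomes one backslash and the two-character
    sequence backslash + "n" becomes a newline. Anything else is copied
    through unchanged.
    """
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            if nxt == "\\":
                out.append("\\")
                i += 2
                continue
            if nxt == "n":
                out.append("\n")
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


# The value of the \n field shown above: bar code, period, newline.
trailing = decode_field_text("|r.\\n")
# One line of the \f field shown above: a format marker and two newlines.
leading = decode_field_text("\\\\c 5\\n\\n")
print(repr(trailing), repr(leading))
```

Scanning left to right matters here: a doubled backslash followed by
`n' must decode to a backslash and the letter n, not to a newline.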
Ambiguous analyses
==================

The previous section assumed that only one analysis is produced for
each word. This is not always possible since words in isolation are
frequently ambiguous. Multiple analyses are handled by writing each
analysis field in parallel, with the number of analyses at the
beginning of each output field. For example,

     \a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
     \d %2%imaika-Npa-ni%imaika-Npani%
     \cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
     \p %2%==%=%
     \fd %2%==%=%
     \u %2%imaika-Npa-ni%imaika-Npani%
     \w Imaicampani
     \f \\v124
     \c 1
     \n \n

where the percent sign (`%') separates the different analyses in each
field. Note that only those fields which contain analysis information
are marked for ambiguity. The other fields (`\w', `\f', `\c', and `\n')
are the same regardless of the number of analyses.

The `\ambig' field in the text input control file can replace the
percent sign with another character for separating the analyses; see
`Ambiguity Marker Character: \ambig' above.

Analysis failures
=================

The previous sections assumed that words are successfully analyzed.
This does not always happen. Analysis failures are marked the same way
as multiple analyses, but with zero (`0') for the ambiguity count. For
example,

     \a %0%ta%
     \d %0%ta%
     \cat %0%%
     \p %0%%
     \fd %0%%
     \u %0%%
     \w TA
     \f \\v 12 |b
     \c 2
     \n |r\n

Note that only the `\a' and `\d' fields contain any information, and
those both have the original word as a place holder. The other analysis
fields (`\cat', `\p', `\fd', and `\u') are marked for failure, but
otherwise left empty.

The `\ambig' field in the text input control file can replace the
percent sign with another character for marking analysis failures and
ambiguities; see `Ambiguity Marker Character: \ambig' above.

KTEXT synthesis output
**********************

KTEXT tries to recreate the format of the original input to analysis in
its synthesis output.
The main feature worth noting is that synthesis ambiguities and
failures are marked similarly to analysis ambiguities and failures in
KTEXT analysis output.

Bibliography
************

  1. Antworth, Evan L. 1990. `PC-KIMMO: a two-level processor for
     morphological analysis'. Occasional Publications in Academic
     Computing No. 16. Dallas, TX: Summer Institute of Linguistics.

  2. Antworth, Evan L. 1995. `User's Guide to PC-KIMMO version 2'. URL
     ftp://ftp.sil.org/software/dos/pc-kimmo/guide.zip
     (visited 1997, June 11).

  3. Bloomfield, Leonard. 1917. `Tagalog texts with grammatical
     analysis'. Urbana, IL: University of Illinois.

  4. Jang, Taeho. 1995. `Computer assisted adaptation of text from
     Turkish to Korean: design and implementation'. Master of Arts in
     Linguistics, University of Texas at Arlington, Arlington, TX.

  5. Kew, Jonathan and Stephen R. McConnel. 1991. `Formatting
     interlinear text'. Occasional Publications in Academic Computing
     No. 17. Dallas, TX: Summer Institute of Linguistics.

  6. Knuth, Donald E. 1986. `The TeXbook'. Reading, MA: Addison Wesley
     Publishing Company.

  7. Oflazer, Kemal. 1994a. Two-level Description of Turkish
     Morphology. `Literary and Linguistic Computing'. 9(2), 137-148.

  8. Oflazer, Kemal. 1994b. `TURKLEX'. URL
     ftp://crl.nmsu.edu/CLR/tools/ling-analysis/morphology/turklex/turklex.tar.z
     (visited 1997, June 11).

  9. Simons, Gary F., and John Thomson. 1988. `How to use IT:
     interlinear text processing on the Macintosh'. Edmonds, WA:
     Linguist's Software.

 10. Simons, Gary F., and Larry Versaw. 1991. `How to use IT: a guide
     to interlinear text processing', 3rd ed. Dallas, TX: Summer
     Institute of Linguistics.

 11. Weber, David J., H. Andrew Black, and Stephen R. McConnel. 1988.
     `AMPLE: a tool for exploring morphology'. Occasional Publications
     in Academic Computing No. 12. Dallas, TX: Summer Institute of
     Linguistics.

 12. Weber, David J., H. Andrew Black, Stephen R. McConnel, and Alan
     Buseman. 1990. `STAMP: a tool for dialect adaptation'. Occasional
     Publications in Academic Computing No. 15. Dallas, TX: Summer
     Institute of Linguistics.