KText Reference Manual
         analyzing/synthesizing texts with PC-Kimmo functions
                            version 2.0b17
                             October 1997

                 by Evan Antworth and Stephen McConnel

                 Copyright (C) 2000 SIL International

Published by:
Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.

Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.

The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.

Overview of KTEXT
*****************

This section briefly describes what KTEXT does, places KTEXT in its
computational context, lists technical specifications of the program,
and gives information on use and support of the program.

What does KTEXT do?
===================

KTEXT is a text processing program that uses the PC-KIMMO parser (see
below about PC-KIMMO).  KTEXT operates in two modes: analysis and
synthesis.  In analysis mode, KTEXT reads a text from a disk file,
parses each word, and writes the results to a new disk file.  This new
file is in the form of a structured text file where each word of the
original text is represented as a database record composed of several
fields.  Each word record contains a field for the original word, a
field for the underlying or lexical form of the word, and a field for
the gloss string.  For example, if the text in the input file contains
the word `beginning' (to use an English example), KTEXT's output file
will have a record of this format:

     \a be`gin +ING
     \d be`gin-+ING
     \cat V
     \fd ing
     \w beginning

This record consists of five fields, each tagged with a backslash
code.(1)  The first field, tagged with \a for analysis, contains the
gloss string for the word.  The second field, tagged with \d for
(morpheme) decomposition, contains the underlying or lexical form of
the word.  The third field, tagged with \cat for category, contains the
grammatical category of the word.  The fourth field, tagged with \fd
for feature descriptions, contains a list of feature abbreviations
associated with the word, and the fifth field, tagged with \w for
word, contains the original word.  The word `pictures' (which can be
analyzed as either a verb or a noun) demonstrates how KTEXT handles
multiple parses:

     \a %2%`picture +3SG%`picture +PL%
     \d %2%`picture-+s%`picture-+s%
     \cat %2%V%N%
     \fd %d%s%-3sg pl%
     \w pictures

Percent signs (or some other designated character) separate the
multiple results in the \a, \d, \cat, and \fd fields, with a number
indicating how many results were found.
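
A program that reads KTEXT's output has to unpack these ambiguity
fields.  The following is a minimal sketch in Python, not part of
KTEXT; the function name `split_ambiguity' is hypothetical, and the
sketch assumes the percent sign is the separator and that an ambiguous
value always has the form marker, count, marker, results, marker.

```python
def split_ambiguity(value, marker="%"):
    """Return the list of alternative results held in one field value.

    An unambiguous value such as "V" is returned as a one-item list.
    An ambiguous value such as "%2%V%N%" is returned as ["V", "N"].
    """
    value = value.strip()
    if not value.startswith(marker):
        return [value]
    parts = value.split(marker)
    # parts[0] is the empty string before the first marker and
    # parts[1] is the count of results that follow.
    count = int(parts[1])
    alternatives = parts[2:]
    # Drop the empty string produced by the closing marker.
    if value.endswith(marker) and alternatives and alternatives[-1] == "":
        alternatives = alternatives[:-1]
    if len(alternatives) != count:
        raise ValueError("expected %d results, found %d"
                         % (count, len(alternatives)))
    return alternatives


print(split_ambiguity("%2%`picture +3SG%`picture +PL%"))
```

An empty alternative is kept as an empty string, so a value in which
some parses carry no features still yields one item per parse.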

A word record also saves any capitalization or punctuation associated
with the original word.  For example, if a sentence begins "Obviously,
this hypothesis...", KTEXT will output the first word like this:

     \a `obvious +AVR1
     \d `obvious-+ly
     \cat AV
     \w obviously
     \c 1
     \n ,

The \w field contains the original word without capitalization or the
following comma.  The \c field contains the number 1 which indicates
that the first letter of the original word is upper case.  The \n field
contains the comma that follows the original word.  The purpose of
retaining the capitalization and punctuation of the original text is,
of course, to enable one to recover the original text from KTEXT's
output file.
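
As an illustration of how the saved information can be merged back
in, here is a small Python sketch (not KTEXT's own code).  It assumes
a record is held as a dictionary keyed by field marker, and it
handles only the \c value 1 described above; any other \c value is
left alone because its encoding is not described in this section.
The name `restore_word' is hypothetical.

```python
def restore_word(record):
    """Rebuild the surface form of one word from its \\w, \\c and \\n fields."""
    word = record.get("w", "")
    # \c 1 means the first letter of the original word was upper case.
    if record.get("c", "").strip() == "1":
        word = word[:1].upper() + word[1:]
    # \n holds whatever followed the word in the original text.
    return word + record.get("n", "")


record = {"a": "`obvious +AVR1", "d": "`obvious-+ly", "cat": "AV",
          "w": "obviously", "c": "1", "n": ","}
print(restore_word(record))  # prints Obviously,
```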

In synthesis mode, KTEXT takes an analysis file compatible with that
produced by KTEXT in analysis mode and produces an orthographic text
file comparable to the original.

---------- Footnotes ----------

(1) The particular choice of field markers and the order of fields in a
record is due to the fact that KTEXT uses the same text-handling
routines as an existing program called AMPLE (Weber et al., 1988).
This has the advantage that KTEXT's output is compatible with that
program, but the disadvantage that the record structure is perhaps not
consistent with terminology already established for PC-KIMMO.  It
should also be noted that the quasi-database design of KTEXT's output is
used by many other programs developed by SIL International.

Placing KTEXT in its context
============================

KTEXT is best understood in relation to two other programs: PC-KIMMO
and CARLA.  First, we will take a look at PC-KIMMO.

KTEXT is intended to be used with PC-KIMMO (though it is a stand-alone
program).  PC-KIMMO is a program for doing computational phonology and
morphology.  It is typically used to build morphological parsers for
natural language processing systems.  PC-KIMMO is described in the book
`PC-KIMMO: a two-level processor for morphological analysis' by Evan L.
Antworth, published by the Summer Institute of Linguistics (1990).  The
PC-KIMMO software is available for MS-DOS and Windows (IBM PCs and
compatibles), Macintosh, and UNIX.  The book (including software) is
available for $23.00 (plus postage) from:

     International Academic Bookstore
         7500 W. Camp Wisdom Road
         Dallas, TX 75236
         U.S.A.
     
         phone: 972/708-7404
         fax:   972/708-7433

The KTEXT program which this document describes will be of very little
use to you without the PC-KIMMO program and book.  The remainder of
this document assumes that you are familiar with PC-KIMMO.

PC-KIMMO was deliberately designed to be reusable.  The core of
PC-KIMMO is a library of functions such as "load rules", "load
lexicon", "generate", and "recognize".  The PC-KIMMO program supplied
on the release diskette is just a user shell built around these basic
functions.  This shell provides an environment for developing and
testing sets of rules and lexicons.  Since the shell is a development
environment, it has very little built-in data processing capability.
But because PC-KIMMO is modular and portable, you can write your own
data processing program that uses PC-KIMMO's function library.  KTEXT
is an example of how to use PC-KIMMO to create a new natural language
processing program.  KTEXT is a text processing program that uses
PC-KIMMO to do morphological parsing.

KTEXT is also closely related to a system called CARLA, which stands
for Computer Assisted Related Language Adaptation.  CARLA is a type of
machine translation system designed to work between closely related
languages.  CARLA is based on the Analysis Transfer Synthesis (ATS)
paradigm of adaptation.  This paradigm involves three stages:

  1. *Analysis.*  The text to be adapted is converted to an abstract
     representation, composed of units (in our case, words and
     morphemes) which are defined in source language dictionaries.  (No
     attempt is made to represent the meaning of the text, only the
     units that comprise the text.)

  2. *Transfer.*  Given known, systematic differences between the
     source and target languages, the result of Analysis is converted
     to an abstract representation of what it should be for the target
     language, using units defined in target language dictionaries.

  3. *Synthesis.* Given information about the target language, the
     abstract representation resulting from Transfer is converted to a
     concrete, textual form.

When used in analysis mode, KTEXT performs the Analysis task.  In the
original CARLA system, analysis is done by a program called AMPLE
(Weber et al. 1988), which is also a morphological parser designed to
process text.  KTEXT was created by replacing AMPLE's parsing engine
with the PC-KIMMO parser.  Thus KTEXT has the same text-handling
mechanisms as AMPLE and produces output similar or even identical to
AMPLE.  The advantages of this design are (1) we were able to develop
KTEXT very quickly and easily since it involved very little new code,
and (2) existing programs that use AMPLE's output format can also use
KTEXT's output.  The disadvantage of basing KTEXT on AMPLE is that the
format of the output file is perhaps not consistent with terminology
already established for PC-KIMMO.

When KTEXT is used in synthesis mode, it performs the Synthesis task.
In the original CARLA system, synthesis is done by a program called
STAMP (Weber et al. 1990).  However, STAMP also performs the Transfer
task; KTEXT does not have this capability.

Technical specifications
========================

KTEXT runs under four operating systems:

   * MS-DOS (IBM PC compatibles with a 386 or higher processor)

   * Microsoft Windows

   * UNIX

   * Apple Macintosh

KTEXT does not require any graphics capability.  It handles eight-bit
characters (such as the IBM PC extended character set or the Windows
ANSI character set).  The Windows and Macintosh versions have the same
user interface as the MS-DOS and UNIX versions, namely a
batch-processing, command-line interface.  In other words, a GUI
version does not exist.

The MS-DOS executable requires a 386 or newer CPU and a DPMI server.
This has the advantage of allowing the program to use as much memory as
necessary without constraining it to the archaic 640K limit.  (DPMI is
provided automatically by Windows.  A free DPMI server is distributed
with the MS-DOS executable.)

The program is written in C and is very portable.  The Macintosh
version was compiled with the Metrowerks C compiler.  The sources
available at URL ftp://ftp.sil.org/software/unix/ktext-*.zip can be
compiled for any of the four target platforms.

Program status
==============

KTEXT was developed by Stephen McConnel and Evan Antworth of SIL
International.  Several qualifications apply to its use and support:

  1. This software, source code and executable program, is copyrighted
     by SIL International.  You may use this software at no cost.  You
     are granted the right to distribute this software to others,
     provided that all files are included in unmodified form and that
     you charge no fee (except cost of media).  This software is
     intended for academic or other personal use only, and may not be
     distributed or used for commercial profit without express
     permission of SIL International.

  2. This software represents work in progress and bears no warranty,
     either expressed or implied, of its fitness for any particular
     purpose.

  3. In releasing this software, SIL International is making no
     commitment to maintain it.  It is, however, committed to
     forwarding user feedback to the software's authors who may or may
     not choose to develop the software further.

Bug reports, wish lists, requests for support, and positive feedback
should be directed to Stephen McConnel at this address:

     Stephen McConnel
         Language Software Development
         SIL International
         7500 W. Camp Wisdom Road
         Dallas, TX  75236
         phone: 972/708-7361
         email: Stephen_McConnel@sil.org

Examples of using KTEXT
***********************

Using KTEXT to analyze a text
=============================

Typically, the steps involved in using KTEXT to analyze texts are:

  1. Collect a corpus of language data suitable for phonological and
     morphological analysis (typically paradigms of words).

  2. Do phonological and morphological analysis on the data.

  3. Use the PC-KIMMO shell to develop a rules file, lexicon, and
     grammar file that encode your phonological and morphological
     analyses and to test them against your corpus of data.

  4. Select a text and keyboard it.

  5. Set up the additional control files required for KTEXT analysis.

  6. Use the rules, lexicon, and grammar you developed to process the
     text with KTEXT in analysis mode.

  7. Edit KTEXT's output file to remove multiple parses.

  8. Use the edited file as input to some other program.

To demonstrate how to use KTEXT to process a text in analysis, we will
use Englex, a morphological grammar of English for PC-KIMMO, and
analyze a paragraph of `Alice's Adventures in Wonderland', by Lewis
Carroll.  The beginning of the text is shown in figure 1.

     Figure 1. Excerpt from Alice
     
     \id Alice.txt - Lewis Carroll's Alice's Adventures in Wonderland
     
     \ti Down the Rabbit-Hole
     \p
     Alice was beginning to get very tired of sitting by her sister
     on the bank and of having nothing to do: once or twice she had
     peeped into the book her sister was reading, but it had no pictures
     or conversations in it, "and what is the use of a book," thought Alice,
     "without pictures or conversations?"
     \p
     So she was considering, in her own mind (as well as she could, for
     the hot day made her feel very sleepy and stupid), whether the pleasure
     of making a daisy-chain would be worth the trouble of getting up and
     picking the daisies, when suddenly a White Rabbit with pink eyes ran
     close by her.

The text was keyboarded using a very simple system of document markup
that tags parts of the document with backslash codes.  The `\id' tag
identifies the text; the `\ti' tag indicates the title of the story;
and the `\p' tag indicates the beginning of a paragraph.  The next step
is to process the text with KTEXT in analysis mode.  Run the KTEXT
application with these command line options:

     ktext -x ana.ctl -i alice.txt -o alice.ana -l ana.log

where `ana.ctl' is the analysis control file, `alice.txt' is the input
text file, `alice.ana' is the output analysis file, and `ana.log' is
the analysis log file.  The following display will appear on the screen:

     KTEXT (analyze/synthesize words using PC-Kimmo functions)
     Version 2.0b11 (November 1, 1996), Copyright 1996 SIL
     Beta test version compiled Nov  7 1996 15:11:16
     with PC-Kimmo functions version 2.1b7 (November 6, 1996)
       and PC-PATR functions version 0.99b0 (November 7, 1996)
     For 386 CPU (or better) under MS-DOS [compiled with DJGPP 2.1/GNU C 2.7]
     
       affix.lex                          255 entries
       noun.lex                         10461 entries
       verb.lex                          4215 entries
       adjectiv.lex                      3345 entries
       adverb.lex                         400 entries
       minor.lex                          379 entries
       proper.lex                        1057 entries
       abbrev.lex                         127 entries
       technica.lex                       813 entries
       natural.lex                        435 entries
       foreign.lex                         88 entries
     
     5...2.2..2  ..22.2.2..  2.222.2.22  ..22.2.222  22..22.22.
     2..23.22..  2.2.22.225  2..2...22.  ......2...  4.22.2..4.  100
     ...2..2..2  2.222

Each dot represents one word successfully processed.  Multiple analyses
of a word are indicated by numbers; thus the first word, `down', received
five analyses.  When the program is done, it will return you to the
operating system prompt.  A fragment of the resulting output file is
shown in figure 2.

     Figure 2. Output of KTEXT
     
     \a	%5%`down%`down%`down%`down%`down%
     \d	%5%`down%`down%`down%`down%`down%
     \cat	%5%AV%V%AJ%N%PP%
     \fd	%5%%vbase%%sg%%
     \w	down
     \c	1
     
     \a	the
     \d	the
     \cat	DT
     \w	the
     
     \a	`rabbit - `hole
     \d	`rabbit---`hole
     \cat	N
     \fd	sg
     \w	rabbit-hole
     \c	516
     \n	\n
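
Records of the kind shown in figure 2 are simple to read back into a
program.  The Python sketch below is one possible reader, not part of
KTEXT.  It assumes that each field occupies one line, that the marker
is separated from its value by white space, and that a record ends at
a blank line or at the next \a field; the name `parse_records' is
hypothetical.

```python
def parse_records(text):
    """Split KTEXT analysis output into a list of {marker: value} records."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            # A blank line closes the record in progress.
            if not line and current:
                records.append(current)
                current = {}
            continue
        pieces = line[1:].split(None, 1)
        if not pieces:
            continue
        marker = pieces[0]
        value = pieces[1] if len(pieces) > 1 else ""
        # An \a field also starts a new record.
        if marker == "a" and current:
            records.append(current)
            current = {}
        current[marker] = value
    if current:
        records.append(current)
    return records


SAMPLE = "\n".join([
    "\\a\tthe",
    "\\d\tthe",
    "\\cat\tDT",
    "\\w\tthe",
    "",
    "\\a\t`rabbit - `hole",
    "\\d\t`rabbit---`hole",
    "\\cat\tN",
    "\\fd\tsg",
    "\\w\trabbit-hole",
    "\\c\t516",
    "\\n\t\\n",
])

records = parse_records(SAMPLE)
print(len(records), records[1]["w"])  # prints 2 rabbit-hole
```

Field values are kept as uninterpreted strings, so the \c and \n
values of the second record come back exactly as they appear in the
file.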

One obvious way to continue is to reassemble the text in interlinear
format.  That is, we could write a program that would take the data
structures shown in figure 2 and create a new file where the text is
stored in interlinear format.  The resulting interlinear text is shown
in figure 3.  An interlinear text editor like IT(1) could then be used
to add more lines of annotations to the text.

     Figure 3. An English example of interlinear text format
     
     Down	the	Rabbit	-	Hole
     `down	the	`rabbit	-	`hole
     PP	DT	N	-	N

Interlinear translation is a time-honored format for presenting
analyzed vernacular texts.  An interlinear text consists of a baseline
text and one or more lines of annotations that are vertically aligned
with the baseline.  In the text shown in figure 3, the first line is
the baseline text.  The second line provides the lexical form of each
original word, including morpheme breaks.  The third line gives the
category or part-of-speech of each word.
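
A program of the kind suggested above could be sketched in Python as
follows.  This is only an illustration under several assumptions:
records are dictionaries keyed by field marker, columns are separated
by tabs, and the first alternative is taken whenever a field is
ambiguous.  The function names are hypothetical.  Unlike figure 3,
the sketch does not restore capitalization or split hyphenated
compounds, and it does no disambiguation, so an ambiguous word may
show a different category than a hand-edited text would.

```python
def first_alternative(value, marker="%"):
    """Return the first result of an ambiguous field, or the value itself."""
    if value.startswith(marker):
        parts = value.split(marker)
        return parts[2] if len(parts) > 2 else ""
    return value


def interlinear(records):
    """Build three aligned lines: baseline, lexical form, category."""
    baseline, lexical, category = [], [], []
    for record in records:
        baseline.append(record.get("w", ""))
        lexical.append(first_alternative(record.get("d", "")))
        category.append(first_alternative(record.get("cat", "")))
    return "\n".join("\t".join(row) for row in (baseline, lexical, category))


words = [
    {"w": "down", "d": "%5%`down%`down%`down%`down%`down%",
     "cat": "%5%AV%V%AJ%N%PP%"},
    {"w": "the", "d": "the", "cat": "DT"},
]
print(interlinear(words))
```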

Another way to proceed would be to take the output of KTEXT as shown in
figure 2 and format it directly for printing.  In other words, there
would be no disk file of interlinear text corresponding to figure
3; rather, the interlinear text is created on the fly as it is prepared
for printing.  Fortunately, the software required to print interlinear
text is now available.  As a complement to the IT program, a system for
formatting interlinear text for typesetting has recently been developed
(see Kew and McConnel, 1991).  Called ITF, for Interlinear Text
Formatter,(2)  it is a set of TeX(3) macros that can format an
arbitrary number of aligning annotations with up to two freeform
(nonaligning) annotations.  While ITF is primarily intended to format
the data files produced by IT (similar to the interlinear text shown in
figure 3), an auxiliary program provided with ITF accepts the output of
the KTEXT program.  The final printed result of the formatting process
is shown in figure 4.(4)  It should be noted that this is just one of
many formats that ITF can produce.  Because ITF is built on a
full-featured typesetting system, virtually all aspects of the
formatting detail can be customized, including half a dozen different
schemes for laying out the freeform annotations relative to the
interlinear text.

Figure 4. Output of ITF

[ This figure is not available in plain text documentation. ]

---------- Footnotes ----------

(1) IT (pronounced "eye-tee") is an interlinear text editor that
maintains the vertical alignment of the interlinear lines of text and
uses a lexicon to semi-automatically gloss the text.  See Simons and
Versaw (1991) and Simons and Thomson (1988).

(2) ITF was developed by the Academic Computing Department of the
Summer Institute of Linguistics. It runs under MS-DOS, UNIX, and the
Apple Macintosh.

(3) TeX is a typesetting language developed by Donald Knuth (see Knuth,
1986).

(4) The plain text version of this documentation does not include
figure 4, since it is an image of typeset output.

Using KTEXT to synthesize a text
================================

Normally, in an adaptation project, the text is adapted from a source
language to a target language via a Transfer component.  For the purpose
of this example, we will use English as both the source and target
language, thus obviating the need for a Transfer component.  If the
synthesis operation produces a text which is identical to the original
text, then we have proved the efficacy of the system.

Typically, the steps involved in using KTEXT in synthesis mode are:

  1. Collect a corpus of language data suitable for phonological and
     morphological analysis (typically paradigms of words).

  2. Do phonological and morphological analysis on the data.

  3. Use the PC-KIMMO shell to develop a rules file and a lexicon file
     that encode your phonological and morphological analyses and to
     test them against your corpus of data.

  4. Set up the control files required by KTEXT for synthesis mode.
     Using the rules and lexicon you developed, use KTEXT in synthesis
     mode to process an analyzed text.

To synthesize the original text from the analysis file, run the KTEXT
application with these command line options:

     ktext -s -x syn.ctl -i alice.ana -o alice.syn -l syn.log

where `syn.ctl' is the synthesis control file, `alice.ana' is the input
analysis file, `alice.syn' is the synthesized output text file, and
`syn.log' is the synthesis log file.  The following display will appear
on the screen:

     KTEXT (analyze/synthesize words using PC-Kimmo functions)
     Version 2.0b11 (November 1, 1996), Copyright 1996 SIL
     Beta test version compiled Nov  7 1996 15:11:16
     with PC-Kimmo functions version 2.1b7 (November 6, 1996)
       and PC-PATR functions version 0.99b0 (November 7, 1996)
     For 386 CPU (or better) under MS-DOS [compiled with DJGPP 2.1/GNU C 2.7]
     
       affix.lex                          255 entries
       noun.lex                         10461 entries
       verb.lex                          4215 entries
       adjectiv.lex                      3345 entries
       adverb.lex                         400 entries
       minor.lex                          379 entries
       proper.lex                        1057 entries
       abbrev.lex                         127 entries
       technica.lex                       813 entries
       natural.lex                        435 entries
       foreign.lex                         88 entries
     .......... .......... .......... .......... ..........
     .......... .......... .......... .......... ..........  100
     .......... .....

Notice that every word received a single synthesis.  Open the output
file `alice.syn' and you will see that it is identical to the input
text file shown in figure 1.
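
Rather than comparing the two files by eye, the check can be
automated.  The Python sketch below is a hypothetical helper, not
something supplied with KTEXT.  It compares two texts line by line
and reports the number of the first line that differs; it ignores
differences in line-ending style and in the final newline.

```python
def first_difference(original, synthesized):
    """Return the 1-based number of the first differing line, or None."""
    a = original.splitlines()
    b = synthesized.splitlines()
    for number, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return number
    if len(a) != len(b):
        # One text is a prefix of the other.
        return min(len(a), len(b)) + 1
    return None


print(first_difference("Alice was\nbeginning", "Alice was\nbegining"))  # prints 2
```

In practice the two arguments would be the contents of `alice.txt'
and `alice.syn', read with ordinary file operations.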

Running KTEXT
*************

This section describes KTEXT's user interface and the input files it
uses.

KTEXT is a batch-processing program.  This means that the program takes
as input a text from a disk file and returns as output the processed
text in a new disk file.  KTEXT is run from the command line by giving
it the information it needs (file names and other options).  It does
not have an interactive interface.  The user controls KTEXT's operation
by means of special files that contain all the information KTEXT needs
to process the input text.  These files are called control files.

The operation of the program is controlled by using command line
options.  To see a list of the command line options, run the KTEXT
application with `-h' as a command line option.  You will see a display
similar to this:

     KTEXT (analyze/synthesize words using PC-Kimmo functions)
     Version 2.0b11 (November 1, 1996), Copyright 1996 SIL
     Beta test version compiled Nov  7 1996 15:11:16
     with PC-Kimmo functions version 2.1b7 (November 6, 1996)
       and PC-PATR functions version 0.99b0 (November 7, 1996)
     For 386 CPU (or better) under MS-DOS [compiled with DJGPP 2.1/GNU C 2.7]
     
     Usage:  ktext [options]
         -c char      make char the comment character for the control files
                      (default is ;)
         -s           synthesis mode (default is analysis)
         -v           for synthesis, verify each result with a word parse
         -x ctlfile   specify the KTEXT control file (default is ktext.ctl)
         -i infile    specify the input file (required: no default)
         -o outfile   specify the output file (default is based on infile)
         -l logfile   specify the KTEXT log file (default is none)

The command line options (`-c', `-s', and so on) are all lower case
letters.  Here is a detailed description of each command line option.

`-c'
     The `-c' option takes an argument that sets the comment character
     used in the KTEXT control files (analysis, synthesis, TEXTIN, and
     TXTOUT control files).  It has no effect on any other files used by
     KTEXT.  If the `-c' option is not used, the semicolon (;) is used
     as the default comment character.

`-s'
     The `-s' option causes KTEXT to run in synthesis mode. Without this
     option, KTEXT by default will run in analysis mode.

`-v'
     The `-v' option applies only to synthesis mode; it causes KTEXT to
     verify the synthesis by using the word grammar (if one is
     specified in the analysis control file).  The default is not to
     use the word grammar (even if one is specified).

`-x'
     The `-x' option takes an argument that specifies the name of
     either the analysis or synthesis control file.  These control files
     contain the name of the TEXTIN or TXTOUT control file and the
     names of the rules, lexicon, and word grammar files.  They can
     also specify consistent changes to be made to the output.  The
     `-x' option accepts a default file name extension of `.ctl'; for
     example if you use `-x english' KTEXT will try to load the file
     `english.ctl'.  If the `-x' option is not used, KTEXT will try to
     load a control file with the default file name `ktext.ctl'.

`-i'
     The `-i' option takes an argument that specifies the name of the
     input file containing the text that KTEXT will process.  If the
     `-i' option is not used, KTEXT displays an error message and quits.

`-o'
     The `-o' option takes an argument that specifies the name of the
     output file that KTEXT creates.  If the `-o' option is not used,
     the output filename is constructed from the input filename.

`-l'
     The `-l' option takes an argument that specifies the name of a log
     file.  The log file contains messages about any analysis failures
     or other anomalous behavior during processing of the input text.

In all instances where file names are supplied to KTEXT, an optional
directory path can be included; for example, `-i c:\texts\alice.txt'.
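
When KTEXT is driven from a script, it can help to assemble the
command line in one place.  The Python sketch below builds an
argument list from the options described above.  The function
`ktext_command' and its parameter names are hypothetical; only the
option letters come from the usage display, and the program name
`ktext' is assumed to be on the search path.

```python
def ktext_command(infile, ctlfile="ktext.ctl", outfile=None, logfile=None,
                  synthesize=False, verify=False, comment_char=None,
                  program="ktext"):
    """Return the argument list for one KTEXT run."""
    if verify and not synthesize:
        raise ValueError("-v applies only to synthesis mode")
    argv = [program]
    if comment_char is not None:
        argv += ["-c", comment_char]
    if synthesize:
        argv.append("-s")
    if verify:
        argv.append("-v")
    argv += ["-x", ctlfile, "-i", infile]
    # -o and -l are optional; KTEXT supplies its own defaults.
    if outfile is not None:
        argv += ["-o", outfile]
    if logfile is not None:
        argv += ["-l", logfile]
    return argv


print(" ".join(ktext_command("alice.ana", ctlfile="syn.ctl",
                             outfile="alice.syn", logfile="syn.log",
                             synthesize=True)))
# prints ktext -s -x syn.ctl -i alice.ana -o alice.syn -l syn.log
```

The resulting list can be handed to `subprocess.run(argv, check=True)'
from the Python standard library to run the program.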

KTEXT's functional structure
****************************

Analysis mode
=============

KTEXT uses three main functional modules in analysis mode: the "text
input" module, the "analysis" module, and the "structured output"
module.  The diagram in figure 5 shows the flow of data through these
modules.  The input text is fed into the text input module which
outputs the text as a stream of normalized words with capitalization
and punctuation stripped out and saved.  The text input module is
controlled by a file that specifies orthographic changes.  Each word is
then passed to the analysis module where it is parsed.  The analysis
module is controlled by the PC-KIMMO rules, lexicon, and grammar files.
The parsed words are then passed to the structured output module and
written to the output file as database records.

     Figure 5. An overview of KTEXT analysis
     
                     original input text file
                                  |
                                  |
                      +--------------------------------+
                      |           |                    |
                      |    +------------+              |
     text input       |    |    TEXT    |              |
     control file --->|--->|    INPUT   |-----+        |
                      |    +------------+     |        |
                      |           |      punctuation   |
                      |         words     formatting   |
                      |           |     capitalization |
                      |           |           |        |
     rules file ----->|--->+------------+     |        |
     lexicon files -->|--->|  ANALYSIS  |     |        |
     grammar file --->|--->+------------+     |        |
                      |           |           |        |
                      |     parsed words      |        |
                      |           |           |        |
                      |    +------------+     |        |
                      |    | STRUCTURED |<----+        |
                      |    |   OUTPUT   |              |
                      |    +------------+              |
                      |           |                    |
                      +--------------------------------+
                                  |
                                  |
                  structured text file with parsed words
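
To make the role of the text input module concrete, here is a rough
Python sketch of the normalization step for a single token.  It is
not KTEXT's implementation.  It assumes that punctuation means the
ASCII punctuation characters, it handles only punctuation that
follows the word, it records only whether the first letter was upper
case, and it ignores the orthographic changes that the text input
control file can specify.  The name `normalize' is hypothetical.

```python
import string


def normalize(token):
    """Split one token into (normalized word, capitalization flag, trailing text)."""
    core = token.rstrip(string.punctuation)
    trailing = token[len(core):]
    # 1 records an upper case first letter, 0 records none.
    cap = 1 if core[:1].isupper() else 0
    return core.lower(), cap, trailing


print(normalize("Obviously,"))  # prints ('obviously', 1, ',')
```

The first item is what would be passed to the analysis module; the
other two are the saved information that bypasses analysis and
rejoins the parsed word in the structured output module.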

In analysis mode, KTEXT uses six different input files and produces one
output file (plus an optional log file).  These six input files are:

  1. the text data file,

  2. the KTEXT control file,

  3. the text input control file (optional),

  4. the PC-KIMMO rules file,

  5. the PC-KIMMO lexicon file, and

  6. the PC-KIMMO grammar file (optional).

The PC-KIMMO rules, lexicon, and grammar files are described in the
PC-KIMMO documentation and will not be discussed further in this
document; see Antworth (1990) and Antworth (1995).  The other input
files and the analysis output data file are described in the following
chapters.

Synthesis mode
==============

KTEXT also uses three main functional modules in synthesis mode: the
"structured input" module, the "synthesis" module, and the "text
output" module.  The diagram in figure 6 shows the flow of data through
these modules.  A structured input text containing parsed words is fed
into the structured input module, which outputs the text as a stream of
parsed words with capitalization and punctuation stripped out and
saved.  Each parsed word is then passed to the synthesis module where
it is rebuilt from its pieces.  The synthesis module is controlled by
the PC-KIMMO rules and lexicon files.  (Synthesis normally does not use
the grammar file.)  The synthesized words are then passed to the text
output module and written to the output file as a synthesized text with
the punctuation and capitalization merged back in.  The text output
module is controlled by a file that specifies orthographic changes.

     Figure 6. An overview of KTEXT synthesis
     
                  structured text file with parsed words
                                  |
                                  |
                      +--------------------------------+
                      |           |                    |
                      |    +------------+              |
                      |    | STRUCTURED |              |
                      |    |    INPUT   |-----+        |
                      |    +------------+     |        |
                      |           |      punctuation   |
                      |        parsed     formatting   |
                      |         words   capitalization |
                      |           |           |        |
     rules file ----->|--->+------------+     |        |
     lexicon files -->|--->| SYNTHESIS  |     |        |
                      |    +------------+     |        |
                      |           |           |        |
                      |   synthesized words   |        |
                      |           |           |        |
                      |    +------------+     |        |
     text output      |    |    TEXT    |<----+        |
     control file --->|--->|   OUTPUT   |              |
                      |    +------------+              |
                      |           |                    |
                      +--------------------------------+
                                  |
                                  |
                     synthesized output text file

In synthesis mode, KTEXT also uses six different input files and
produces one output file (plus an optional log file).  These six input
files are:

  1. the analysis data file,

  2. the KTEXT control file,

  3. the text output control file (optional),

  4. the PC-KIMMO rules file,

  5. the PC-KIMMO lexicon file, and

  6. the PC-KIMMO grammar file (optional).

The PC-KIMMO rules, lexicon, and grammar files are described in the
PC-KIMMO documentation and will not be discussed further in this
document; see Antworth (1990) and Antworth (1995).  The other input
files and the synthesis output text file are described in the following
chapters.

The input text file
*******************

The input text file contains the text that KTEXT will process.  It must
be a plain text file, not a file formatted by a word processor.  If you
use a word processor such as Microsoft Word to create your text, you
must save it as plain text with no formatting.  KTEXT preserves all the
"white space" used in the text file.  That is, it saves in its output
file the location of all line breaks, blank lines, tabs, spaces, and
other nonalphabetic characters.  This enables you to recover from the
output file the precise format and page layout of the original text.

While KTEXT will accept text with no formatting information other than
white space characters, it will also handle text that contains special
format markers.  These format markers can indicate parts of the text
such as sentences, paragraphs, sections, section headings, and titles.
The use of special format markers is called descriptive markup.  KTEXT
(because it is based on AMPLE) works best with a system of descriptive
markup called "standard format" that is used by SIL International.  SIL
standard format marks the beginning of each text unit with a format
marker.  There is no explicit indication of the end of a unit.  A format
marker is composed of a special character (a backslash by default)
followed by a code of one or more letters.  For example, `\ti' for
title, `\ch' for chapter, `\p' for paragraph, `\s' for sentence, and so
on.  KTEXT does not "know" any particular format markers.  You can use
whatever markers you like, as long as you declare them in the text
input control file.  For more on format markers, see section 7.3.2
below.

One of the best known systems of descriptive markup is SGML (Standard
Generalized Markup Language).  One very significant difference between
SGML and SIL standard format is that SGML uses markers in pairs, one at
the beginning of a text unit and a matching one at the end.  This
should not pose a problem for KTEXT, since KTEXT just preserves all
format markers wherever they occur.  Another difference is that SGML
flags format markers with angle brackets, for instance <paragraph>.
KTEXT can recognize SGML markers if you change the format marker flag
character from backslash to left angle bracket (see section 7.3.2
below).  Recognizing the end of the SGML format marker is a bit of a
problem.  While SGML uses a matching right angled bracket to indicate
the end of the marker, SIL standard format simply uses a space to
delineate the format marker from the following text.  This means that
for KTEXT to find the end of an SGML tag, you must leave at least one
space after it, and there must not be any spaces in the middle of the
SGML tag.

The KTEXT control file
**********************

KTEXT uses an overall control file to customize its operation.  This
file is structured as a "standard format database", composed of various
fields marked by backslash codes.  The fields in the control file are
as follows.

`\textin'
     specifies the text input control file.  This is used only in
     analysis mode.

`\textout'
     specifies the text output control file.  This is used only in
     synthesis mode.

`\rules'
     specifies the PC-KIMMO phonological rules file.

`\lexicon'
     specifies the primary PC-KIMMO lexicon file.

`\grammar'
     specifies the PC-KIMMO word grammar file.  This is normally used
     only in analysis mode.

`\ach'
     defines an analysis field change.  This is used only in analysis
     mode, after a word has been parsed.

`\dch'
     defines a decomposition field change.  This is used only in
     analysis mode, after a word has been parsed.

`\scl'
     defines a string class for use with the analysis or decomposition
     changes.

`\cat'
     defines how to extract the word category (part of speech) from the
     feature structure built by the word grammar.  This is used only in
     analysis mode, and only if a word grammar file is loaded.

`\fd'
     defines a labeled feature structure.  This is used only in analysis
     mode, and only if a word grammar file is loaded.  This gives names
     to feature structures produced by the word grammar for output to
     the analysis data file.

`\rd'
     defines the root delimiters.  The default pair of delimiters are <
     and >.

Figure 7 shows a sample KTEXT control file.

     Figure 7. Sample KTEXT control file
     
     \textin  engintx.ctl
     \rules   d:\opac\test\ktext\englex\english.rul
     \lexicon d:\opac\test\ktext\englex\english.lex
     \grammar d:\opac\test\ktext\englex\english.grm
     \textout engoutx.ctl
     
     \cat <head pos>
     
     \fd singular <number> = SG
     \fd plural   <number> = PL

When KTEXT reads its control file, it ignores any lines beginning with
field codes other than those listed above.  For example, a line
beginning `\co' would be ignored.  Such lines are treated as comments.
Comments in the control file can also be indicated with the comment
character, which by default is a semicolon.  This is the only way to
place comments on the same line as a field.  The comment character can
be changed with the command line option `-c' when running KTEXT (see
chapter 3).
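
The way fields and comments are recognized can be sketched in a few
lines of Python.  This is only an illustration of the rules just
described, not KTEXT's own code: the function name `read_control' is
hypothetical, and the sketch assumes that every field fits on a single
line.

```python
# Field codes listed in this chapter; anything else is treated as a comment.
KNOWN_FIELDS = {"textin", "textout", "rules", "lexicon", "grammar",
                "ach", "dch", "scl", "cat", "fd", "rd"}

def read_control(text, comment_char=";"):
    """Return (code, value) pairs for the recognized fields, in file order."""
    fields = []
    for line in text.splitlines():
        # Drop everything from the comment character to the end of the line.
        line = line.split(comment_char, 1)[0].strip()
        if not line.startswith("\\"):
            continue
        parts = line[1:].split(None, 1)
        if not parts or parts[0] not in KNOWN_FIELDS:
            continue                      # unknown field codes act as comments
        fields.append((parts[0], parts[1] if len(parts) > 1 else ""))
    return fields

sample = "\n".join([
    r"\textin  engintx.ctl   ; text input control file",
    r"\co      this whole field is ignored",
    r"\cat <head pos>",
])
print(read_control(sample))
```

Passing a different `comment_char' plays the role of the `-c' command
line option in this sketch.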

The text input control file
***************************

This chapter describes the expected characteristics of an input text
file, and the options offered for describing these characteristics by a
"text input control file".(1)

---------- Footnotes ----------

(1) This chapter is adapted from chapters 7, 8, and 9 of Weber (1988).

Input text files
================

Text input control files define a simple model of input text files.
They are plain text files with two types of embedded format markers.
  1. A primary format marker consists of one or more contiguous
     characters beginning with a special flag character.  The default
     character initiating format markers is the backslash (`\').  Thus,
     each of the following would be recognized as a format marker and
     would not be processed by the program:

          \
          \p
          \sp
          \begin{enumerate}
          \very-long.and;muddled/format*marker,to#be$sure


     Note that format markers cannot have a space or tab embedded in
     them; the first space or tab encountered terminates the format
     marker.

     One final note: the format character under discussion here applies
     only to the input text files which are to be processed.  It has
     absolutely nothing to do with the use of backslash (`\') to flag
     field codes in control files such as the text input control file.

  2. A secondary type of marker consists of a flag character followed
     by a single character from a list of known values.  This secondary
     flag character must be different from the primary flag character.
     Its default value is the vertical bar (`|'), causing this type of
     format marker to be frequently called a bar code.  The following
     could be valid (secondary) format markers and would not be
     processed by the program:

          |b
          |i
          |r


Consider the following two lines of input text:

     \bgoodbye\r
     |bgoodbye|r


Using the default definitions of format markers, the first line is
considered to be a single format marker, and provides nothing which the
program should try to parse.  The second line, however, contains two
format markers, `|b' and `|r', and the word `goodbye' which would be
processed by the program.
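
The difference between the two lines can be made concrete with a small
Python sketch.  It follows the model described above (a primary marker
runs to the next space or tab; a secondary marker is the flag character
plus one known code character), but the function name `split_markers'
and the returned data layout are assumptions made for illustration.
The "text" pieces still contain white space and punctuation, which a
real program would set aside separately.

```python
import string

def split_markers(line, format_char="\\", bar_char="|",
                  bar_codes=string.ascii_lowercase):
    """Split a line into ("marker", ...) and ("text", ...) pieces."""
    pieces, start, i = [], 0, 0
    while i < len(line):
        end = None
        if format_char and line[i] == format_char:
            # A primary marker extends up to the next space or tab.
            end = i + 1
            while end < len(line) and line[end] not in " \t":
                end += 1
        elif (bar_char and line[i] == bar_char
              and i + 1 < len(line) and line[i + 1] in bar_codes):
            end = i + 2                   # flag plus one code character
        if end is None:
            i += 1
            continue
        if start < i:
            pieces.append(("text", line[start:i]))
        pieces.append(("marker", line[i:end]))
        start = i = end
    if start < len(line):
        pieces.append(("text", line[start:]))
    return pieces

print(split_markers("\\bgoodbye\\r"))
print(split_markers("|bgoodbye|r"))
```

An empty `format_char' or `bar_char' switches off the corresponding
kind of marker in this sketch.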

The primary format markers serve to divide the text into fields.  See
`Fields to Exclude: \excl' and `Fields to Include: \incl' below for
details on how these fields are used.  There is no requirement that the
format markers be at the beginning of a line as with the field codes
used in KTEXT control files.

Ambiguity Marker Character: \ambig
==================================

The `\ambig' field defines the character used to mark ambiguities and
failures in the analysis output file.  For example, to use the hash
mark (`#'), the text input control file would include:

     \ambig  #

This would cause an ambiguous analysis to be output as follows:

     \a #3#< N0 kay >#< V1 ka > IMP#< V1 ka > INF#

It makes sense to use the `\ambig' field only once in the text input
control file.  If multiple `\ambig' fields do occur in the file, the
value given in the first one is used.  If the text input control file
does not have an `\ambig' field, the percent sign (`%') is used.

The first printing character following the `\ambig' field code is used
as the ambiguity marker.  The character currently being used to mark
comments cannot be assigned to also mark ambiguities in the output file.
Thus, the semicolon (`;') cannot normally be used as the ambiguity
marker.  Logically, this field should be in the KTEXT control file
rather than the text *input* control file since it affects output
instead of input.  Nevertheless, compatibility demands that it stays
this way.
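
The layout of the ambiguous `\a' field shown above (marker, number of
analyses, then each analysis followed by the marker) can be sketched in
Python as follows.  The function names are hypothetical, the layout is
inferred only from the example in this section, and the sketch assumes
that the marker character never occurs inside an analysis.  How
failures are written is not covered here.

```python
def format_ambiguous(analyses, marker="%"):
    """Lay out alternative analyses like the ambiguous example above."""
    return marker + str(len(analyses)) + marker + marker.join(analyses) + marker

def split_ambiguous(field, marker="%"):
    """Recover the alternatives from a field built by format_ambiguous."""
    parts = field.split(marker)
    if len(parts) < 4 or parts[0] or parts[-1] or not parts[1].isdigit():
        raise ValueError("not an ambiguous field")
    alternatives = parts[2:-1]
    if int(parts[1]) != len(alternatives):
        raise ValueError("count does not match the alternatives")
    return alternatives

field = format_ambiguous(["< N0 kay >", "< V1 ka > IMP", "< V1 ka > INF"], "#")
print(field)
print(split_ambiguous(field, "#"))
```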

Bar code format marker character: \barchar
==========================================

The `\barchar' defines the character that begins a two-character
secondary format marker.  For example, if this type of format marker
begins with the dollar sign (`$'), the following would be placed in the
text input control file:

     \barchar  $

An empty `\barchar' field in the text input control file prevents any
bar code format markers from being recognized.  Thus, the following
field effectively turns off special treatment of this style of format
marking (assuming the `;' is marking comments):

     \barchar       ; no bar character

It makes sense to use the `\barchar' field only once in the text input
control file.  If multiple `\barchar' fields do occur in the file, the
value given in the first one is used.

The first printing character following the `\barchar' field code is
used as the bar code format marker.  The character currently being used
to mark comments cannot be assigned to also flag format markers in
input text files.  That is, `\barchar ;' is treated as `\barchar'
followed only by a comment, which effectively removes the concept of
bar codes since no marker character is defined.

Bar Code Format Code Characters: \barcodes
==========================================

In conjunction with the special format marking character discussed in
the previous section, the `\barcodes' field defines the individual
characters used within bar codes.  These characters may be separated by
spaces or lumped together.  Thus, the following two fields are
equivalent:

     \barcodes    abcdefg         ; lumped together
     \barcodes    a b c d e f g   ; separated


If more than one `\barcodes' field is provided in the text input
control file, the combination of all characters defined in all such
fields is used.  No check is made for repeated characters: the previous
example would be accepted without complaint despite the redundancy of
the second line.

The default value for the bar codes is the set of lowercase alphabetic
letters `a'-`z'.  Therefore, if the text input control file contains
neither a `\barchar' nor a `\barcodes' field, the following bar codes
are considered to be formatting information by KTEXT: `|a', `|b', `|c',
..., `|x', `|y', and `|z'.

Text Orthography Change: \ch
============================

An orthography change is defined by the `\ch' field code followed by
the actual orthography change.  Any number of orthography changes may
be defined in the text input control file.  The output of each change
serves as the input to the following change.  That is, each change is
applied as many times as necessary to an input word before the next
change from the text input control file is applied.

Basic changes
-------------

To substitute one string of characters for another, these must be made
known to the program in a change.  (The technical term for this sort of
change is a production, but we will simply call them changes.)  In the
simplest case, a change is given in three parts: (1) the field code
`\ch' must be given at the extreme left margin to indicate that this
line contains a change; (2) the match string is the string for which
the program must search; and (3) the substitution string is the
replacement for the match string, wherever it is found.

The beginning and end of the match and substitution strings must be
marked.  The first printing character following `\ch' (with at least
one space or tab between) is used as the delimiter for that line.  The
match string is taken as whatever lies between the first and second
occurrences of the delimiter on the line and the substitution string is
whatever lies between the third and fourth occurrences.  For example,
the following lines indicate the change of hi to bye, where the
delimiters are the double quote mark (`"'), the single quote mark
(`''), the period (`.'), and the at sign (`@').
     \ch "hi" "bye"
     \ch 'hi' 'bye'
     \ch .hi. .bye.
     \ch @hi@ @bye@

Throughout this document, we use the double quote mark as the delimiter
unless there is some reason to do otherwise.

Change tables follow these conventions:
  1. Any characters (other than the delimiter) may be placed between the
     match and substitution strings.  This allows various notations to
     symbolize the change.  For example, the following are equivalent:
          \ch "thou" "you"
          \ch "thou" to "you"
          \ch "thou" > "you"
          \ch "thou" --> "you"
          \ch "thou" becomes "you"

  2. Comments included after the substitution string are initiated by a
     semicolon (`;'), or whatever is indicated as the comment character
     by means of the `-c' option when KTEXT is started.  The following
     lines illustrate the use of comments:
          \ch "qeki" "qiki" ; for cases like wawqeki
          \ch "thou" "you"  ; for modern English

  3. A change can be ignored temporarily by turning it into a comment
     field.  This is done either by placing an unrecognized field code
     in front of the normal `\ch', or by placing the comment character
     (`;') in front of it.  For example, only the first of the
     following three lines would effect a change:
          \ch "nb" "mp"
          \no \ch "np" "np"
          ;\ch "mb" "nb"
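
The delimiter rule and the conventions above can be sketched in Python.
The function name `parse_change' is hypothetical, and the sketch
deliberately ignores anything after the substitution string (comments
and environment constraints alike); it is meant only to show how the
four delimiter occurrences divide the line.

```python
def parse_change(line):
    """Return (match, substitution) for a \\ch line, or None otherwise."""
    # The field code must be at the left margin, followed by a space or tab.
    if not line.startswith("\\ch") or not line[3:4].isspace():
        return None
    body = line[3:].lstrip()
    if not body:
        return None
    delim = body[0]                 # first printing character after \ch
    parts = body.split(delim)
    # parts[0] is empty, parts[1] is the match string, parts[2] is any
    # notation such as "-->", and parts[3] is the substitution string.
    if len(parts) < 5:
        return None
    return parts[1], parts[3]

for line in [r'\ch "thou" --> "you"',
             r'\ch @hi@ @bye@',
             r'\ch "qeki" "qiki" ; for cases like wawqeki',
             r'\no \ch "np" "np"',
             r';\ch "mb" "nb"']:
    print(line, "=>", parse_change(line))
```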

The changes in the text input control file are applied as an ordered
set of changes.  The first change is applied to the entire word by
searching from left to right for any matching strings and, upon finding
any, replacing them with the substitution string.  After the first
change has been applied to the entire word, then the next change is
applied, and so on.  Thus, each change applies to the result of all
prior changes.  When all the changes have been applied, the resulting
word is returned.  For example, suppose we have the following changes:
     \ch "aib" > "ayb"
     \ch "yb"  > "yp"

Consider the effect these have on the word paiba.  The first changes i
to y, yielding payba; the second changes b to p, to yield paypa.  (This
would be better than the single change of aib to ayp if there were
sources of yb other than the output of the first rule.)
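
The ordered application of changes can be imitated with a short Python
sketch.  It uses Python's `str.replace', which makes one left-to-right
pass per change; that is an approximation chosen for illustration, and
the function name `apply_changes' is hypothetical.

```python
def apply_changes(word, changes):
    """Apply each (match, substitution) pair in order to the whole word."""
    for match, subst in changes:
        # Each change sees the result of all the changes before it.
        word = word.replace(match, subst)
    return word

changes = [("aib", "ayb"), ("yb", "yp")]
print(apply_changes("paiba", changes))
print(apply_changes("paiba", list(reversed(changes))))
```

Reversing the two changes gives a different result, because yb does not
yet exist when the second change is tried first.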

The way in which change tables are applied allows certain tricks.  For
example, suppose that for Quechua, we wish to change hw to f, so that
hwista becomes fista and hwis becomes fis.  However, we do not wish to
change the sequence shw or chw to sf or cf (respectively).  This could
be done by the following sequence of changes. (Note, `@' and `$' are
not otherwise used in the orthography.)
     \ch "shw" > "@"     ; (1)
     \ch "chw" > "$"      ; (2)
     \ch "hw"  > "f"      ; (3)
     \ch "@"   > "shw"   ; (4)
     \ch "$"   > "chw"    ; (5)

Lines (1) and (2) protect shw and chw by changing them to
distinguished symbols.  This clears the way for the change of hw to f
in (3).  Then lines (4) and (5) restore `@' and `$' to shw and chw,
respectively. (An alternative, simpler way to do this is discussed in
the next section.)
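
The effect of the protecting changes can be checked with a small Python
sketch that compares the five-line table with the single change of hw
to f.  The helper `apply_changes' is hypothetical and uses plain string
replacement; the test word ashwa is made up for the purpose.

```python
def apply_changes(word, changes):
    """Apply each (match, substitution) pair in order to the whole word."""
    for match, subst in changes:
        word = word.replace(match, subst)
    return word

protected = [("shw", "@"), ("chw", "$"), ("hw", "f"),
             ("@", "shw"), ("$", "chw")]
unprotected = [("hw", "f")]

for word in ["hwista", "hwis", "ashwa"]:
    print(word, apply_changes(word, protected), apply_changes(word, unprotected))
```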

Environmentally constrained changes
-----------------------------------

It is possible to impose string environment constraints (SECs) on
changes in the orthography change tables.  The syntax of SECs is
given in BNF notation in `Syntax of Orthography Changes' below.

For example, suppose we wish to change the mid vowels (e and o) to high
vowels (i and u respectively) immediately before and after q.  This
could be done with the following changes:
     \ch "o" "u"  / _ q  / q _
     \ch "e" "i"  / _ q  / q _

This is not entirely a hypothetical example; some Quechua practical
orthographies write the mid vowels e and o.  However, in the
environment of /q/ these could be considered phonemically high vowels
/i/ and /u/.  Changing the mid vowels to high upon loading texts has
the advantage that, for cases like upun "he drinks" and upoq "the one
who drinks", the root needs to be represented internally only as upu
"drink".  But note, because of Spanish loans, it is not possible to
change all cases of e to i and o to u.  The changes must be conditioned.

In reality, the regressive vowel-lowering effect of /q/ can pass over
various intervening consonants, including /y/, /w/, /l/, /ll/, /r/,
/m/, /n/, and /n~/.  For example, /ullq/ becomes ollq, /irq/ becomes erq,
and so on.  Rather than list each of these cases as a separate
constraint, it is convenient to define a class (which we label
`+resonant') and use this class to simplify the SEC.  Note that the
string class must be defined (with the `\scl' field code) before it is
used in a constraint.
     \scl +resonant y w l ll r m n n~
     \ch "o" "u" / q _ / _ ([+resonant]) q
     \ch "e" "i" / q _ / _ ([+resonant]) q

This says that the mid vowels become high vowels after /q/ and before
/q/, possibly with an intervening /y/, /w/, /l/, /ll/, /r/, /m/, /n/,
or /n~/.

Consider the problem posed for Quechua in the previous section, that of
changing hw to f.  An alternative is to condition the change so that it
does not apply immediately after a member of the string class `Affric',
which contains s and c.
     \scl Affric c s
     \ch "hw" "f" / [Affric] ~_

It is sometimes convenient to make certain changes only at word
boundaries, that is, to change a sequence of characters only if they
initiate or terminate the word.  This conditioning is easily expressed,
as shown in the following examples.
     \ch "this" "that"           ; anywhere in the word
     \ch "this" "that"  / # _    ; only if word initial
     \ch "this" "that"  /   _ #  ; only if word final
     \ch "this" "that"  / # _ #  ; only if entire word

Using text orthography changes
------------------------------

The purpose of orthography change is to convert text from an external
orthography to an internal representation more suitable for
morphological analysis.  In many cases this is unnecessary, the
practical orthography being completely adequate as the internal
representation.  In other cases, the practical orthography is an
inconvenience that can be circumvented by converting to a more phonemic
representation.

Let us take a simple example from Latin.  In the Latin orthography, the
nominative singular masculine of the word "king" is rex.  However,
phonemically, this is really /reks/; /rek/ is the root meaning king and
the /s/ is an inflectional suffix.  If the program is to recover such
an analysis, then it is necessary to convert the x of the external,
practical orthography into ks internally.  This can be done by
including the following orthography change in the text input control
file:
     \ch  "x"  "ks"

In this, x is the match string and ks is the substitution string, as
discussed in `Basic changes' above.  Whenever x is found, ks is
substituted for it.

Let us consider next an example from Huallaga Quechua.  The practical
orthography currently represents long vowels by doubling the vowel.
For example, what is written as kaa is /ka:/ "I am", where the length
(represented by a colon) is the morpheme meaning "first person
subject".  Other examples, such as upoo /upu:/ "I drink" and upichee
/upi-chi-:/ "I extinguish", motivate us to convert all long vowels into
a vowel followed by a colon.  The following changes do this:
     \ch  "aa"  "a:"
     \ch  "ee"  "i:"
     \ch  "ii"  "i:"
     \ch  "oo"  "u:"
     \ch  "uu"  "u:"

Note that the long high vowels (i and u) have become mid vowels (e and
o respectively); consequently, the vowel in the substitution string is
not necessarily the same as that of the match string.  What is the
utility of these changes?  In the lexicon, the morphemes can be
represented in their phonemic forms; they do not have to be represented
in all their orthographic variants.  For example, the first person
subject morpheme can be represented simply as a colon (-:), rather than
as -a in cases like kaa, as -o in cases like qoo, and as -e in cases
like upichee.  Further, the verb "drink" can be represented as upu and
the causative suffix (in upichee) can be represented as -chi; these are
the forms these morphemes have in other (nonlowered) environments.

As the next example, let us suppose that we are analyzing Spanish, and
that we wish to work internally with k rather than c (before a, o, and
u) and qu (before i and e). (Of course, this is probably not the only
change we would want to make.)  Consider the following changes:
     \ch  "ca"  "ka"
     \ch  "co"  "ko"
     \ch  "cu"  "ku"
     \ch  "qu"  "k"

The first three handle c and the last handles qu.  By virtue of
including the vowel after c, we avoid changing ch to kh.  There are
other ways to achieve the same effect.  One way exploits the fact that
each change is applied to the output of all previous changes.  Thus, we
could first protect ch by changing it to some distinguished character
(say `@'), then change c to k, and then restore `@' to ch:
     \ch  "ch"  "@"
     \ch  "c"  "k"
     \ch  "@"  "ch"
     \ch  "qu"  "k"

Another approach conditions the change by the adjacent characters.  The
changes could be rewritten as
     \ch  "c"  "k"  / _a  / _o  / _u  ; only before a, o, or u
     \ch  "qu"  "k"                   ; in all cases

The first change says, "change c to k when followed by a, o, or u."
(This would, for example, change como to komo, but would not affect
chal.)  The syntax of such conditions is exactly that used in string
environment constraints; see `Syntax of Orthography Changes' below.

Where orthography changes apply
-------------------------------

Input orthography changes are made when the text being processed may be
written in a practical orthography.  Rather than requiring that it be
converted as a prerequisite to running the program, it is possible to
have the program convert the orthography as it loads and before it
processes each word.

The changes loaded from the text input control file are applied after
all the text is converted to lower case (and the information about
upper and lower case, along with information about format marking,
punctuation and white space, has been put to one side).  Consequently,
the match strings of these orthography changes should be all lower
case; any change that has an uppercase character in the match string
will never apply.

A sample orthography change table
---------------------------------

We include here the entire orthography input change table for Caquinte
(a language of Peru).  There are basically four changes that need to be
made: (1) nasals, which in the practical orthography reflect their
assimilation to the point of articulation of a following noncontinuant,
must be changed into an unspecified nasal, represented by N; (2) c and
qu are changed to k; (3) j is changed to h; and (4) gu is changed to g
before i and e.

     \ch  "mp"  "Np"     ; for unspecified nasals
     \ch  "nch" "Nch"
     \ch  "nc"  "Nk"
     \ch  "nqu" "Nk"
     \ch  "nt"  "Nt"
     
     \ch  "ch"  "@"     ; to protect ch
     \ch  "c"   "k"      ; other c's to k
     \ch  "@"   "ch"    ; to restore ch
     \ch  "qu"  "k"
     
     \ch  "j"   "h"
     
     \ch  "gue" "ge"
     \ch  "gui" "gi"

This change table can be simplified by the judicious use of string
environment constraints:

     \ch  "m"  >  "N"  / _p
     \ch  "n"  >  "N"  / _c  / _t  / _qu
     
     \ch  "c"  >  "k"  / _~h
     \ch  "qu" >  "k"
     
     \ch  "j"  >  "h"
     
     \ch  "gu" >  "g"  / _e  /_i

As suggested by the preceding examples, the text orthography change
table is composed of all the `\ch' fields found in the text input
control file.  These may appear anywhere in the file relative to the
other fields.  It is recommended that all the orthography changes be
placed together in one section of the text input control file, rather
than being mixed in with other fields.

Syntax of Orthography Changes
-----------------------------

This section presents a grammatical description of the syntax of
orthography changes in BNF notation.


      1a. <orthochange>  ::= <basic_change>
      1b.                    <basic_change> <constraints>
     
      2a. <basic_change> ::= <quote><quote> <quote><string><quote>
      2b.                    <quote><string><quote> <quote><quote>
      2c.                    <quote><string><quote> <quote><string><quote>
     
      3.  <quote>        ::= any printing character not used in either
                             the ``from'' string or the ``to'' string
     
      4.  <string>       ::= one or more characters other than the quote
                             character used by this orthography change
     
      5a. <constraints>  ::= <change_envir>
      5b.                    <change_envir> <constraints>
     
      6a. <change_envir> ::= <marker> <leftside> <envbar> <rightside>
      6b.                    <marker> <leftside> <envbar>
      6c.                    <marker> <envbar> <rightside>
     
      7a. <leftside>   ::= <side>
      7b.                  <boundary>
      7c.                  <boundary> <side>
     
      8a. <rightside>  ::= <side>
      8b.                  <boundary>
      8c.                  <side> <boundary>
     
      9a. <side>       ::= <item>
      9b.                  <item> <side>
      9c.                  <item> ... <side>
     
     10a. <item>       ::= <piece>
     10b.                  ( <piece> )
     
     11a. <piece>      ::= ~ <piece>
     11b.                  <literal>
     11c.                  [ <literal> ]
     
     12.  <marker>     ::= /
                           +/
     
     13.  <envbar>     ::= _
                           ~_
     
     14.  <boundary>   ::= #
                           ~#
     
     15.  <literal>    ::= one or more contiguous characters


Comments on selected BNF rules
..............................

2.
     The same `<quote>' character must be used at both the beginning
     and the end of both the "from" string and the "to" string.

3.
     The double quote (`"') and single quote (`'') characters are most
     often used.

7-8.
     Note that what can appear to the left of the environment bar is a
     mirror image of what can appear to the right.

9c.
     An ellipsis (`...') indicates a possible break in contiguity.

10b.
     Something enclosed in parentheses is optional.

11a.
     A tilde (`~') reverses the desirability of an element, causing the
     constraint to fail if it is found rather than fail if it is not
     found.

11c.
     A literal enclosed in square brackets must be the name of a string
     class defined by a `\scl' field in the analysis data file, or
     earlier in the dictionary orthography change file.

12.
     A `+/' is usually used for morpheme environment constraints, but
     may be used for change environment constraints in `\ch' fields in the
     dictionary orthography change table file.

13.
     A tilde attached to the environment bar (`~_') inverts the sense of
     the constraint as a whole.

14b.
     The boundary marker preceded by a tilde (`~#') indicates that it
     must not be a word boundary.

15.
     The special characters used by environment constraints can be
     included in a literal only if they are immediately preceded by a
     backslash:

          \+  \/  \#  \~  \[  \]  \(  \)  \.  \_  \\

Decomposition Separation Character: \dsc
========================================

The `\dsc' field defines the character used to separate the morphemes
in the decomposition field of the output analysis file.  For example,
to use the equal sign (`='), the text input control file would include:

     \dsc  =

This would cause a decomposition field to be output as follows:

     \d %3%kay%ka=y%ka=y%

It makes sense to use the `\dsc' field only once in the text input
control file.  If multiple `\dsc' fields do occur in the file, the
value given in the first one is used.  If the text input control file
does not have a `\dsc' field, a dash (`-') is used.

The first printing character following the `\dsc' field code is used as
the morpheme decomposition separator character.  The same character
cannot be used both for separating decomposed morphemes in the analysis
output file and for marking comments in the input control files.  Thus,
one normally cannot use the semicolon (`;') as the decomposition
separation character.

Logically, this field should be in the KTEXT control file rather than
the text *input* control file since it affects output instead of input.
Nevertheless, compatibility demands that it stays this way.

Fields to Exclude: \excl
========================

The `\excl' field excludes one or more fields from processing.  For
example, to have the program ignore everything in `\co' and `\id'
fields, the following line is included in the text input control file:

     \excl  \co  \id      ; ignore these fields

If more than one `\excl' field is found in the text input control file,
the contents of each field are added to the overall list of text fields
to exclude.  This list is initially empty, and stays empty unless the
text input control file contains an `\excl' field.  Thus, no text
fields are excluded from processing by default.

If the text input control file contains `\excl' fields, then only those
text fields are not processed.  Every word in every text field not
mentioned explicitly in an `\excl' field will be processed.

Note that every text field in the input text files is processed unless
the text input control file contains either an `\excl' or an `\incl'
field.  One or the other is used to limit processing, but never both.

Primary format marker character: \format
========================================

The `\format' field designates a single character to flag the beginning
of a primary format marker.  For example, if the format markers in the
text files begin with the at sign (`@'), the following would be placed
in the text input control file:

     \format  @

This would be used, for example, if the text contained format markers
like the following:

     @
     @p
     @sp
     @make(Article)
     @very-long.and;muddled/format*marker,to#be$sure


If a `\format' field occurs in the text input control file without a
following character to serve for flagging format markers, then the
program will not recognize any format markers and will try to parse
everything other than punctuation characters.

It makes sense to use the `\format' field only once in the text input
control file.  If multiple `\format' fields do occur in the file, the
value given in the first one is used.

The first printing character following the `\format' field code is used
to flag format markers.  The character currently used to mark comments
cannot be assigned to also flag format markers.  Thus, the
semicolon (`;') cannot normally be used to flag format markers.

Fields to Include: \incl
========================

The `\incl' field explicitly includes one or more text fields for
processing, excluding all other fields.  For instance, to process
everything in `\txt' and `\qt' fields, but ignore everything else, the
following line is placed in the text input control file:

     \incl  \txt  \qt      ; process these fields

If more than one `\incl' field is found in the text input control file,
the contents of each field are added to the overall list of text fields
to process.  This list is initially empty, and stays empty unless the
text input control file contains an `\incl' field.

If the text input control file contains `\incl' fields, then only those
text fields are processed.  Every word in every text field not
mentioned explicitly in an `\incl' field will not be processed.

Note that every text field in the input text files is processed unless
the text input control file contains either an `\excl' or an `\incl'
field.  One or the other is used to limit processing, but never both.
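
The interplay of `\incl' and `\excl' can be summarized in a few lines
of Python.  The function name `should_process' is hypothetical, and
raising an error when both lists are given is an assumption of this
sketch; this chapter states only that the two are never used together.

```python
def should_process(marker, incl=(), excl=()):
    """Decide whether the text field introduced by marker is processed."""
    if incl and excl:
        # Assumption: treat mixing the two kinds of field as an error.
        raise ValueError("use \\incl or \\excl, but not both")
    if incl:
        return marker in incl         # only the listed fields are processed
    return marker not in excl         # everything but the listed fields

print(should_process("\\txt", incl=["\\txt", "\\qt"]))
print(should_process("\\co", excl=["\\co", "\\id"]))
print(should_process("\\p"))
```

With neither list given, every field is processed, which matches the
default described above.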

Lowercase/uppercase character pairs: \luwfc
===========================================

To break a text into words, the program needs to know which characters
are used to form words.  It always assumes that the letters `A' through
`Z' and `a' through `z' are used as word formation characters.  If the
orthography of the language the user is working in uses any other
characters that have lowercase and uppercase forms, these must be given
in a `\luwfc' field in the text input control file.

The `\luwfc' field defines pairs of characters; the first member of
each pair is a lowercase character and the second is the corresponding
uppercase character.  Several such pairs may be placed in the field or
they may be placed on separate fields.  Whitespace may be interspersed
freely.  For example, the following three examples are equivalent:

     \luwfc  éÉ ñÑ
or

     \luwfc  éÉ      ; e with acute accent
     \luwfc  ñÑ      ; enyee

or

     \luwfc  é É  ñ Ñ

Note that comments can be used as well (just as they can in any KTEXT
control file).  This means that the comment character cannot be
designated as a word formation character.  If the orthography includes
the semicolon (`;'), then a different comment character must be defined
with the `-c' command line option when KTEXT is initiated; see `Running
KTEXT' above.

The `\luwfc' field can be entered anywhere in the text input control
file, although a natural place would be before the `\wfc' (word
formation character) field.

Any standard alphabetic character (that is `a' through `z' or `A'
through `Z') in the `\luwfc' field will override the standard lower-
upper case pairing.  For example, the following will treat `X' as the
upper case equivalent of `z':

     \luwfc z X

Note that `Z' will still have `z' as its lower-case equivalent in this
case.

The `\luwfc' field is allowed to map multiple lower case characters to
the same upper case character, and vice versa.  This is needed for
languages that do not mark tone on upper case letters.
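
As a rough sketch of how such pairs could be read, the following Python
function builds lookup tables from the text that follows each `\luwfc'
field code.  The name `parse_luwfc' is hypothetical, the sketch treats
each Python character as one orthographic character, and it assumes the
default semicolon comment character; KTEXT's actual implementation may
differ.

```python
from collections import defaultdict


def parse_luwfc(bodies, comment_char=";"):
    r"""Build case maps from the text following each \luwfc field code.

    bodies -- one string per \luwfc field, without the field code itself
    Returns (lower_to_upper, upper_to_lower); each maps a character to a
    list, because several characters may share one counterpart.
    """
    chars = []
    for body in bodies:
        body = body.split(comment_char, 1)[0]        # drop any comment
        chars.extend(ch for ch in body if not ch.isspace())
    if len(chars) % 2:
        raise ValueError("characters must come in lowercase/uppercase pairs")
    lower_to_upper = defaultdict(list)
    upper_to_lower = defaultdict(list)
    # Characters alternate: lowercase first, then its uppercase partner.
    for lower, upper in zip(chars[0::2], chars[1::2]):
        lower_to_upper[lower].append(upper)
        upper_to_lower[upper].append(lower)
    return dict(lower_to_upper), dict(upper_to_lower)
```

Because whitespace is ignored and comments are stripped, one field
holding both pairs and two fields holding one pair each produce the
same tables.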

Multibyte lowercase/uppercase character pairs: \luwfcs
======================================================

The `\luwfcs' field extends the character pair definitions of the
`\luwfc' field to multibyte character sequences.  Like the `\luwfc'
field, the `\luwfcs' field defines pairs of characters; the first
member of each pair is a multibyte lowercase character and the second
is the corresponding multibyte uppercase character.  Several such pairs
may be placed in the field or they may be placed on separate fields.
Whitespace separates the members of each pair, and the pairs from each
other.  For example, the following three examples are equivalent:

     \luwfcs  e' E` n~ N^ ç C&
or

     \luwfcs  e' E`      ; e with acute accent
     \luwfcs  n~ N^      ; enyee
     \luwfcs  ç  C&      ; c cedilla

or

     \luwfcs  e' E`
              n~ N^
              ç  C&


Note that comments can be used as well (just as they can in any KTEXT
control file).  This means that the comment character cannot be
designated as a word formation character.  If the orthography includes
the semicolon (`;'), then a different comment character must be defined
with the `-c' command line option when KTEXT is initiated; see `Running
KTEXT' above.

Also note that there is no requirement that the lowercase form be the
same length (number of bytes) as the uppercase form.  The examples shown
above are only one or two bytes (character codes) in length, but there
is no limit placed on the length of a multibyte character.

The `\luwfcs' field can be entered anywhere in the text input control
file.  `\luwfcs' fields may be mixed with `\luwfc' fields in the same
file.

Any standard alphabetic character (that is `a' through `z' or `A'
through `Z') in the `\luwfcs' field will override the standard lower-
upper case pairing.  For example, the following will treat `X' as the
upper case equivalent of `z':

     \luwfcs z X

Note that `Z' will still have `z' as its lowercase equivalent in this
case.

The `\luwfcs' field is allowed to map multiple multibyte lowercase
characters to the same multibyte uppercase character, and vice versa.
This may be useful in some situations, but it introduces an element of
ambiguity into the decapitalization and recapitalization processes.  If
ambiguous capitalization is supported, then for the previous example,
`z' will have both `X' and `Z' as uppercase equivalents, and `X' will
have both `x' and `z' as lowercase equivalents.
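
Since whitespace delimits the members of a `\luwfcs' field, reading it
amounts to splitting on whitespace and pairing the resulting tokens.
The Python sketch below shows this under the assumption of the default
semicolon comment character; `parse_luwfcs' is a hypothetical name, not
part of KTEXT.

```python
def parse_luwfcs(bodies, comment_char=";"):
    r"""Return (lowercase, uppercase) pairs from \luwfcs field bodies.

    bodies -- one string per \luwfcs field, without the field code; a
              body may span several lines.
    """
    tokens = []
    for body in bodies:
        for line in body.splitlines():
            # Strip the comment from each line, then split on whitespace.
            tokens.extend(line.split(comment_char, 1)[0].split())
    if len(tokens) % 2:
        raise ValueError("multibyte characters must come in pairs")
    # Tokens alternate: lowercase sequence, then its uppercase sequence.
    return list(zip(tokens[0::2], tokens[1::2]))
```

The members of a pair need not have the same length, so the sketch keeps
them as strings rather than single characters.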

Maximum number of decapitalizations: \maxdecap
==============================================

The `\maxdecap' field sets the maximum number of different
decapitalizations allowed.  Since the `\luwfc' field can map several
lowercase characters onto a single uppercase character, a word with
uppercase characters can (logically) generate a number of alternatives
when decapitalized.  This is especially true of words that are entirely
capitalized to begin with.  The default limit is 100.
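
The growth in alternatives can be illustrated with a short Python
sketch.  It is not KTEXT's algorithm: the function name, the form of the
`upper_to_lower' table, and the choice to keep simply the first
`max_decap' candidates are all assumptions made for illustration.

```python
from itertools import islice, product


def decapitalizations(word, upper_to_lower, max_decap=100):
    """List candidate lowercase forms of word, at most max_decap of them.

    upper_to_lower -- maps an uppercase character to the list of
                      lowercase characters it may stand for (assumed
                      to come from the \\luwfc definitions).
    """
    # Characters without an entry in the table are left unchanged.
    choices = [upper_to_lower.get(ch, [ch]) for ch in word]
    # Every combination of choices is a candidate; stop at the limit.
    return ["".join(combo) for combo in islice(product(*choices), max_decap)]
```

A word of n capitals that each have two lowercase counterparts yields
2^n candidates, which is why a fully capitalized word can reach the
limit quickly.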

Prevent Any Decapitalization: \nocap
====================================

The usual behavior is to normalize input words to lowercase.  The
program remembers the case of the word as one of four possibilities:

  1. all uppercase

  2. all lowercase

  3. only the first letter uppercase

  4. mixed uppercase and lowercase

However, not all orthographies use the concept of capitalization.  To
help deal with these, the field code `\nocap' disables all case
normalization if it appears anywhere in the text input control file.
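
The four possibilities can be sketched as a small Python classifier.
This is an illustration only: `classify_case' is a hypothetical name,
and the sketch relies on Python's own notion of upper and lower case
rather than on the pairs defined by `\luwfc' fields.

```python
def classify_case(word):
    """Classify the capitalization of word into one of four patterns."""
    letters = [ch for ch in word if ch.isalpha()]
    if not letters or all(ch.islower() for ch in letters):
        return "all lowercase"
    # Checked before "all uppercase" so that a single capital letter
    # counts as "first letter uppercase".
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "first letter uppercase"
    if all(ch.isupper() for ch in letters):
        return "all uppercase"
    return "mixed"
```

When `\nocap' is present no such classification is needed, since words
are passed on without case normalization.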

Prevent Decapitalization of Individual Characters: \noincap
===========================================================

The handling of mixed uppercase and lowercase is limited in utility,
and sometimes causes more problems than it solves.  For this reason,
the `\noincap' field code turns off mixed case decapitalization.  The
program would still decapitalize words that are entirely capitalized
and words that begin with a capital letter.

String class: \scl
==================

A string class is defined by the `\scl' field code followed by the
class name, which is followed in turn by one or more contiguous
character strings or (previously defined) string class names.  A string
class name used as part of the class definition must be enclosed in
square brackets.

The class name must be a single, contiguous sequence of printing
characters.  Characters and words which have special meanings in tests
should not be used.  The actual character strings have no such
restrictions.  The individual members of the class are separated by
spaces, tabs, or newlines.

Each `\scl' field defines a single string class.  Any number of `\scl'
fields may appear in the file.  The only restriction is that a string
class must be defined before it is used.

For example, the first two lines of the simpler Caquinte example above
could be given as follows:
     \scl  -bilabial  c t qu
     \ch  "m"  >  "N"  / _ p
     \ch  "n"  >  "N"  / _ [-bilabial]

The string class definition could be in another control file: string
classes defined elsewhere can be used in the text input control file as
well.

If no `\scl' fields appear in the text input control file, then KTEXT
does not allow any string classes in text input orthography change
environment constraints unless they are defined in the KTEXT control
file.

Caseless word formation characters: \wfc
========================================

To break a text into words, the program needs to know which characters
are used to form words.  It always assumes that the letters `A' through
`Z' and `a' through `z' are used as word formation characters.  If the
orthography of the language the user is working in uses any characters
that do not have different lowercase and uppercase forms, these must be
given in a `\wfc' field in the text input control file.

For example, English uses an apostrophe character (`'') that could be
considered a word formation character.  This information is provided by
the following example:

     \wfc  '    ; needed for words like don't

Notice that the characters in the `\wfc' field may be separated by
spaces, although it is not required to do so.  If more than one `\wfc'
field occurs in the text input control file, the program uses the
combination of all characters defined in all such fields as word
formation characters.

The comment character cannot be designated as a word formation
character.  If the orthography includes the semicolon (`;'), then a
different comment character must be defined with the `-c' command line
option when KTEXT is initiated; see `Running KTEXT' above.
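
The effect of `\wfc' on word breaking can be sketched in Python.  The
function `split_words' is hypothetical and much simpler than KTEXT's
text input module; it only shows how adding a character to the word
formation set changes where words begin and end.

```python
import string


def split_words(text, wfc=""):
    r"""Split text into words made of A-Z, a-z and any \wfc characters.

    wfc -- the text following the \wfc field code; spaces between the
           characters are optional and ignored.
    """
    word_chars = set(string.ascii_letters) | set(wfc.replace(" ", ""))
    words, current = [], []
    for ch in text:
        if ch in word_chars:
            current.append(ch)
        elif current:
            # Any other character ends the word being collected.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

Without the apostrophe in `wfc', a word such as "don't" is broken in two
at the apostrophe; with it, the word stays whole.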

Multibyte caseless word formation characters: \wfcs
===================================================

The `\wfcs' field allows multibyte characters to be defined as
"caseless" word formation characters.  It has the same relationship to
`\wfc' that `\luwfcs' has to `\luwfc'.  The multibyte word formation
characters are separated from each other by whitespace.

A sample text input control file
================================

The following is the complete text input control file for Huallaga
Quechua (a language of Peru):
     \id HGTEXT.CTL - for Huallaga Quechua, 25-May-88
     
     \co         WORD FORMATION CHARACTERS
     
     \wfc  ' ~
     
     \co         FIELDS TO EXCLUDE
     
     \excl  \id            ; identification fields
     
     \co         ORTHOGRAPHY CHANGES
     
     \ch  "aa" > "a:"      ; for long vowels
     \ch  "ee" > "i:"
     \ch  "ii" > "i:"
     \ch  "oo" > "u:"
     \ch  "uu" > "u:"
     \ch  "qeki" > "qiki"  ; for cases like wawqeki
     \ch  "~n" > "n~"      ; for typos
     ; for Spanish loans like hwista
     \scl sib s c          ; sibilants
     \ch  "hw" > "f"  / ~[sib]_

The text output control file
****************************

The text output module restores a processed document from the internal
format to its textual form.  It re-imposes capitalization on words and
restores punctuation, format markers, white space, and line breaks.
Also, orthography changes can be made, and the delimiter that marks
ambiguities and failures can be changed.  This chapter describes the
control file given to the text output module.(1)

---------- Footnotes ----------

(1) This chapter is adapted from chapter 8 of Weber (1990).

Text output ambiguity delimiter: \ambig
=======================================

The text output module flags words that either produced no results or
multiple results when processed.  These are flagged with percent signs
(`%') by default, but this can be changed by declaring the desired
character with the `\ambig' field code.  For example, the following
would change the ambiguity delimiter to `@':
     \ambig @

Text output orthographic changes: \ch
=====================================

The text output module allows orthographic changes to be made to the
processed words.  These are given in the text output control file.
They have exactly the same form as the input orthographic changes given
in the text input control file.

The output orthographic changes allow conversion from the internal
representation used by the program to the practical orthography of the
target language.  These changes are applied to the words after they
have been processed, but before the text is re-assembled (from the
internal format) for output.  For example:
     \ch "N"   "m"  / _ p       ; assimilates before p
     \ch "N"   "n"              ; otherwise becomes n

The first change makes N into m when it directly precedes p; the second
makes all other N's into n.
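
The interaction of the two changes depends on their order, which the
following Python sketch tries to make concrete.  It is a simplification
under stated assumptions: `apply_changes' is a hypothetical name, each
change is applied to the whole word before the next one is tried, and
only a single "following string" environment (as in `/ _ p') is
supported.

```python
import re


def apply_changes(word, changes):
    """Apply ordered orthography changes to word.

    changes -- list of (match, replacement, following) tuples, where
               following is the string that must come right after the
               match, or None if the change is unconditional.
    """
    for match, replacement, following in changes:
        pattern = re.escape(match)
        if following is not None:
            # Lookahead: require the environment without consuming it.
            pattern += "(?=" + re.escape(following) + ")"
        # A function replacement avoids escape processing of the text.
        word = re.sub(pattern, lambda m, r=replacement: r, word)
    return word
```

With the changes given as `[("N", "m", "p"), ("N", "n", None)]', every N
directly before p becomes m first, and only the remaining N's are then
turned into n.  Reversing the order would leave no N for the first
change to match.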

Decomposition Separation Character: \dsc
========================================

The `\dsc' field defines the character used to separate the morphemes
in the decomposition field of the input analysis file.  For example, to
use the equal sign (`='), the text output control file would include:

     \dsc  =

This would handle a decomposition field like the following:

     \d %3%kay%ka=y%ka=y%

It makes sense to use the `\dsc' field only once in the text output
control file.  If multiple `\dsc' fields do occur in the file, the
value given in the first one is used.  If the text output control file
does not have a `\dsc' field, a dash (`-') is used.

The first printing character following the `\dsc' field code is used as
the morpheme decomposition separator character.  The same character
cannot be used both for separating decomposed morphemes in the analysis
output file and for marking comments in the output control files.  Thus,
one normally cannot use the semicolon (`;') as the decomposition
separation character.

This field is provided for use by the INTERGEN program.  It is of little
use to KTEXT.

Primary format marker character: \format
========================================

The `\format' field designates a single character to flag the beginning
of a primary format marker.  For example, if the format markers in the
text files begin with the at sign (`@'), the following would be placed
in the text input control file:

     \format  @

This would be used, for example, if the text contained format markers
like the following:

     @
     @p
     @sp
     @make(Article)
     @very-long.and;muddled/format*marker,to#be$sure


If a `\format' field occurs in the text input control file without a
following character to serve for flagging format markers, then the
program will not recognize any format markers and will try to parse
everything other than punctuation characters.

It makes sense to use the `\format' field only once in the text input
control file.  If multiple `\format' fields do occur in the file, the
value given in the first one is used.

The first printing character following the `\format' field code is used
to flag format markers.  The character currently used to mark comments
cannot be assigned to also flag format markers.  Thus, the
semicolon (`;') cannot normally be used to flag format markers.

This field is provided for use by the INTERGEN program.  It is of little
use to KTEXT.

Lowercase/uppercase character pairs: \luwfc
===========================================

To break a text into words, the program needs to know which characters
are used to form words.  It always assumes that the letters `A' through
`Z' and `a' through `z' are used as word formation characters.  If the
orthography of the language the user is working in uses any other
characters that have lowercase and uppercase forms, these must be given
in a `\luwfc' field in the text input control file.

The `\luwfc' field defines pairs of characters; the first member of
each pair is a lowercase character and the second is the corresponding
uppercase character.  Several such pairs may be placed in the field or
they may be placed on separate fields.  Whitespace may be interspersed
freely.  For example, the following three examples are equivalent:

     \luwfc  éÉ ñÑ
or

     \luwfc  éÉ      ; e with acute accent
     \luwfc  ñÑ      ; enyee

or

     \luwfc  é É  ñ Ñ

Note that comments can be used as well (just as they can in any KTEXT
control file).  This means that the comment character cannot be
designated as a word formation character.  If the orthography includes
the semicolon (`;'), then a different comment character must be defined
with the `-c' command line option when KTEXT is initiated; see `Running
KTEXT' above.

The `\luwfc' field can be entered anywhere in the text input control
file, although a natural place would be before the `\wfc' (word
formation character) field.

Any standard alphabetic character (that is `a' through `z' or `A'
through `Z') in the `\luwfc' field will override the standard lower-
upper case pairing.  For example, the following will treat `X' as the
upper case equivalent of `z':

     \luwfc z X

Note that `Z' will still have `z' as its lower-case equivalent in this
case.

The `\luwfc' field is allowed to map multiple lower case characters to
the same upper case character, and vice versa.  This is needed for
languages that do not mark tone on upper case letters.

Multibyte lowercase/uppercase character pairs: \luwfcs
======================================================

The `\luwfcs' field extends the character pair definitions of the
`\luwfc' field to multibyte character sequences.  Like the `\luwfc'
field, the `\luwfcs' field defines pairs of characters; the first
member of each pair is a multibyte lowercase character and the second
is the corresponding multibyte uppercase character.  Several such pairs
may be placed in the field or they may be placed on separate fields.
Whitespace separates the members of each pair, and the pairs from each
other.  For example, the following three examples are equivalent:

     \luwfcs  e' E` n~ N^ ç C&
or

     \luwfcs  e' E`      ; e with acute accent
     \luwfcs  n~ N^      ; enyee
     \luwfcs  ç  C&      ; c cedilla

or

     \luwfcs  e' E`
              n~ N^
              ç  C&


Note that comments can be used as well (just as they can in any KTEXT
control file).  This means that the comment character cannot be
designated as a word formation character.  If the orthography includes
the semicolon (`;'), then a different comment character must be defined
with the `-c' command line option when KTEXT is initiated; see `Running
KTEXT' above.

Also note that there is no requirement that the lowercase form be the
same length (number of bytes) as the uppercase form.  The examples shown
above are only one or two bytes (character codes) in length, but there
is no limit placed on the length of a multibyte character.

The `\luwfcs' field can be entered anywhere in the text input control
file.  `\luwfcs' fields may be mixed with `\luwfc' fields in the same
file.

Any standard alphabetic character (that is `a' through `z' or `A'
through `Z') in the `\luwfcs' field will override the standard lower-
upper case pairing.  For example, the following will treat `X' as the
upper case equivalent of `z':

     \luwfcs z X

Note that `Z' will still have `z' as its lowercase equivalent in this
case.

The `\luwfcs' field is allowed to map multiple multibyte lowercase
characters to the same multibyte uppercase character, and vice versa.
This may be useful in some situations, but it introduces an element of
ambiguity into the decapitalization and recapitalization processes.  If
ambiguous capitalization is supported, then for the previous example,
`z' will have both `X' and `Z' as uppercase equivalents, and `X' will
have both `x' and `z' as lowercase equivalents.

Text output string classes: \scl
================================

It is possible to define string classes, as discussed in section
`String class: \scl' above.  For example, the sample text output
control file given below contains the following lines:
     a. \scl X t s c
     b. \ch "h"   "j"   / [X] ~_

Line a defines a string class including t, s, and c; change rule b
makes use of this class to block the change of h to j when it occurs in
the digraphs th, sh, and ch.

Changes in the text output control file may also make use of string
classes defined in the KTEXT control file.

Caseless word formation characters: \wfc
========================================

To break a text into words, the program needs to know which characters
are used to form words.  It always assumes that the letters `A' through
`Z' and `a' through `z' are used as word formation characters.  If the
orthography of the language the user is working in uses any characters
that do not have different lowercase and uppercase forms, these must be
given in a `\wfc' field in the text input control file.

For example, English uses an apostrophe character (`'') that could be
considered a word formation character.  This information is provided by
the following example:

     \wfc  '    ; needed for words like don't

Notice that the characters in the `\wfc' field may be separated by
spaces, although it is not required to do so.  If more than one `\wfc'
field occurs in the text input control file, the program uses the
combination of all characters defined in all such fields as word
formation characters.

The comment character cannot be designated as a word formation
character.  If the orthography includes the semicolon (`;'), then a
different comment character must be defined with the `-c' command line
option when KTEXT is initiated; see `Running KTEXT' above.

Multibyte caseless word formation characters: \wfcs
===================================================

The `\wfcs' field allows multibyte characters to be defined as
"caseless" word formation characters.  It has the same relationship to
`\wfc' that `\luwfcs' has to `\luwfc'.  The multibyte word formation
characters are separated from each other by whitespace.

A sample text output control file
=================================

A complete text output control file used for adapting to Asheninca
Campa is given below.

     \id AEouttx.ctl for Asheninca Campa
     \ch "N"   "m"  / _ p       ; assimilates before p
     \ch "N"   "n"              ; otherwise becomes n
     \ch "ny"  "n~"
     
     \ch "ts"  "th" / ~_ i      ; (N)tsi is unchanged
     \ch "tsy" "ch"
     \ch "sy"  "sh"
     \ch "t"   "tz" / n _ i
     
     \ch "k"   "qu" / _ i / _ e
     \ch "k"   "q"  / _ y
     \ch "k"   "c"
     
     \scl X t s c               ; define class of  t   s   c
     \ch "h"   "j"   / [X] ~_   ; change except in th, sh, ch
     
     \ch "#"   " "              ; remove fixed space
     \ch "@"   ""              ; remove blocking character

KTEXT analysis output
*********************

Analysis files are "record oriented standard format files".  This means
that the files are divided into records, each representing a single
word in the original input text file, and records are divided into
fields.  An analysis file contains at least one record, and may contain
a large number of records.  Each record contains one or more fields.
Each field occupies at least one line, and is marked by a "field code"
at the beginning of the line.  A field code consists of a backslash
character (`\') followed by one or more letters.

Analysis file fields
====================

This section describes the possible fields in an analysis file.  The
only field that is guaranteed to exist is the analysis (`\a') field.
All other fields are either data dependent or optional.
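
A program that reads analysis files might group the fields into records
as in the Python sketch below.  The sketch assumes that every record
begins with an `\a' field, that a field code is separated from its
contents by a single space, and that lines not beginning with a
backslash continue the previous field.  `parse_records' is a
hypothetical name, not a function supplied with KTEXT.

```python
def parse_records(text):
    r"""Split analysis-file text into a list of {field code: value} dicts."""
    records = []
    last_code = None
    for line in text.splitlines():
        if line.startswith("\\"):
            code, _, value = line.partition(" ")
            if code == "\\a" or not records:
                records.append({})          # an \a field opens a new record
            records[-1][code] = value
            last_code = code
        elif line.strip() and last_code is not None:
            # Continuation line: append it to the field being read.
            records[-1][last_code] += line.strip()
    return records
```

Each record is returned as a dictionary keyed by field code, so the
original word of the first record is `records[0]["\\w"]' when that field
is present.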

Analysis field: \a
------------------

The analysis field (`\a') starts each record of an analysis file.  It
has the following form:

     \a PFX IFX PFX < CAT root CAT root > SFX IFX SFX

where `PFX' is a prefix morphname, `IFX' is an infix morphname, `SFX'
is a suffix morphname, `CAT' is a root category, and `root' is a root
gloss or etymology.  In the simplest case, an analysis field would look
like this:

     \a < CAT root >

where `CAT' is a root category and `root' is a root gloss or etymology.

Decomposition field: \d
-----------------------

The morpheme decomposition field (`\d') follows the analysis field.  It
has the following form:

     \d anti-dis-establish-ment-arian-ism-s

where the hyphens separate the individual morphemes in the surface form
of the word.

The `\dsc' field in the text input control file can replace the hyphen
with another character for separating the morphemes; see `Decomposition
Separation Character: \dsc' above.

Category field: \cat
--------------------

The category field (`\cat') provides rudimentary category information.
This may be useful for sentence level parsing.  It has the following
form:

     \cat CAT

where `CAT' is the word category.

To request KTEXT to output the final category, include the field `\cat'
in the KTEXT control file.  This field specifies the feature path in
the word level feature structure that contains the grammatical category
(part of speech).  Note that this requires a word grammar to be loaded.

If there are multiple analyses, there will be multiple categories in
the output, separated by ambiguity markers.

Feature Descriptors field: \fd
------------------------------

The feature descriptor field (`\fd') contains the feature names
associated with each morpheme in the analysis.  It has the following
form:

     \fd ==feat1 feat2=feat3=

where `feat1', `feat2', and `feat3' are feature descriptors.  The equal
signs (`=') serve to separate the feature descriptors of the individual
morphemes.  Note that morphemes may have more than one feature
descriptor, with the names separated by spaces, or no feature
descriptors at all.

The feature descriptor field requires a word grammar and one or more
`\feat' fields in the KTEXT control file.

If there are multiple analyses, there will be multiple feature sets in
the output, separated by ambiguity markers.

Word field: \w
--------------

The original word field (`\w') contains the original input word as it
looks before decapitalization and orthography changes.  It looks like
this:

     \w The

Note that this is a gratuitous change from earlier versions of AMPLE
and KTEXT, which wrote the decapitalized form.

Formatting field: \f
--------------------

The format information field (`\f') records any formatting codes or
punctuation that appeared in the input text file before the word.  It
looks like this:

     \f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
             \\c 5\n\n
             \\s


where backslashes (`\') in the input text are doubled, newlines are
represented by `\n', and additional lines in the field start with a tab
character.

The format information field is written to the output analysis file
whenever it is needed, that is, whenever formatting codes or
punctuation exist before words.
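
Recovering the original text from a `\f' value means undoing the two
escapes described above.  The Python sketch below does this in a single
left-to-right pass; `decode_format_field' is a hypothetical name, and
continuation lines are assumed to have been joined already.

```python
def decode_format_field(value):
    r"""Turn doubled backslashes into one and the two characters
    backslash + n into a newline."""
    out = []
    i = 0
    while i < len(value):
        pair = value[i:i + 2]
        if pair == "\\\\":          # doubled backslash -> one backslash
            out.append("\\")
            i += 2
        elif pair == "\\n":         # escaped newline -> real newline
            out.append("\n")
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)
```

The single pass matters: a doubled backslash followed by the letter n
must decode to a backslash and an n, not to a newline, which two
successive global replacements could get wrong.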

Capitalization field: \c
------------------------

The capitalization field (`\c') records any capitalization of the input
word.  It looks like this:

     \c 1

where the number following the field code has one of these values:
`1'
     the first (or only) letter of the word is capitalized

`2'
     all letters of the word are capitalized

`4-32767'
     some letters of the word are capitalized and some are not

Note that the third form is of limited utility, but still exists
because of words like the author's last name.

The capitalization field is written to the output analysis file
whenever any of the letters in the word are capitalized.

Nonalphabetic field: \n
-----------------------

The nonalphabetic field (`\n') records any trailing punctuation, bar
code or whitespace characters.  It looks like this:

     \n |r.\n

where newlines are represented by `\n'.  The nonalphabetic field ends
with the last whitespace character immediately following the word.

The nonalphabetic field is written to the output analysis file whenever
the word is followed by anything other than a single space character.
This includes the case when a word ends a file with nothing following
it.

Ambiguous analyses
==================

The previous section assumed that only one analysis is produced for
each word.  This is not always possible since words in isolation are
frequently ambiguous.  Multiple analyses are handled by writing each
analysis field in parallel, with the number of analyses at the
beginning of each output field.  For example,

     \a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
     \d %2%imaika-Npa-ni%imaika-Npani%
     \cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
     \p %2%==%=%
     \fd %2%==%=%
     \u %2%imaika-Npa-ni%imaika-Npani%
     \w Imaicampani
     \f \\v124
     \c 1
     \n \n


where the percent sign (`%') separates the different analyses in each
field.  Note that only those fields which contain analysis information
are marked for ambiguity.  The other fields (`\w', `\f', `\c', and
`\n') are the same regardless of the number of analyses.

The `\ambig' field in the text input control file can replace the
percent sign with another character for separating the analyses; see
`Ambiguity Marker Character: \ambig' above.
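
A field value in this notation can be taken apart as in the following
Python sketch.  It assumes that the marker character never occurs inside
an analysis; `split_analyses' is a hypothetical name used only for
illustration.

```python
def split_analyses(value, marker="%"):
    """Return (count, alternatives) for one analysis-file field value.

    A value that does not start with the marker is a single analysis.
    """
    if not value.startswith(marker):
        return 1, [value]
    # "%2%x%y%" splits into ['', '2', 'x', 'y', ''].
    parts = value.split(marker)
    count = int(parts[1])
    alternatives = parts[2:-1]
    return count, alternatives
```

Applied to the `\d' field of the example above, the sketch returns the
count 2 and the two decompositions `imaika-Npa-ni' and `imaika-Npani'.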

Analysis failures
=================

The previous sections assumed that words are successfully analyzed.
This does not always happen.  Analysis failures are marked the same way
as multiple analyses, but with zero (`0') for the ambiguity count.  For
example,

     \a %0%ta%
     \d %0%ta%
     \cat %0%%
     \p %0%%
     \fd %0%%
     \u %0%%
     \w TA
     \f \\v 12 |b
     \c 2
     \n |r\n


Note that only the `\a' and `\d' fields contain any information, and
those both have the original word as a place holder.  The other
analysis fields (`\cat', `\p', `\fd', and `\u') are marked for failure,
but otherwise left empty.

The `\ambig' field in the text input control file can replace the
percent sign with another character for marking analysis failures and
ambiguities; see `Ambiguity Marker Character: \ambig' above.

KTEXT synthesis output
**********************

KTEXT tries to recreate the format of the original input to analysis in
its synthesis output.  The main feature worth noting is that synthesis
ambiguities and failures are marked similarly to analysis ambiguities
and failures in KTEXT analysis output.

Bibliography
************

  1. Antworth, Evan L. 1990.  `PC-KIMMO: a two-level processor for
     morphological analysis'.  Occasional Publications in Academic
     Computing No. 16. Dallas, TX: Summer Institute of Linguistics.

  2. Antworth, Evan L. 1995.  `User's Guide to PC-KIMMO version 2'.  URL
     ftp://ftp.sil.org/software/dos/pc-kimmo/guide.zip (visited 1997,
     June 11).

  3. Bloomfield, Leonard. 1917.  `Tagalog texts with grammatical
     analysis'.  Urbana, IL: University of Illinois.

  4. Jang, Taeho. 1995.  `Computer assisted adaptation of text from
     Turkish to Korean: design and implementation'.  Master of Arts in
     Linguistics, University of Texas at Arlington, Arlington, TX.

  5. Kew, Jonathan and Stephen R. McConnel. 1991.  `Formatting
     interlinear text'.  Occasional Publications in Academic Computing
     No. 17.  Dallas, TX: Summer Institute of Linguistics.

  6. Knuth, Donald E. 1986.  `The TeXbook'.  Reading, MA:
     Addison-Wesley Publishing Company.

  7. Oflazer, Kemal. 1994a.  Two-level Description of Turkish
     Morphology.  `Literary and Linguistic Computing'.  9(2), 137-148.

  8. Oflazer, Kemal. 1994b.  `TURKLEX'.  URL
     ftp://crl.nmsu.edu/CLR/tools/ling-analysis/morphology/turklex/turklex.tar.z
     (visited 1997, June 11).

  9. Simons, Gary F., and John Thomson. 1988.  `How to use IT:
     interlinear text processing on the Macintosh'.  Edmonds, WA:
     Linguist's Software.

 10. Simons, Gary F., and Larry Versaw. 1991.  `How to use IT: a guide
     to interlinear text processing', 3rd ed.  Dallas, TX: Summer
     Institute of Linguistics.

 11. Weber, David J., H. Andrew Black, and Stephen R. McConnel. 1988.
     `AMPLE: a tool for exploring morphology'.  Occasional Publications
     in Academic Computing No. 12.  Dallas, TX: Summer Institute of
     Linguistics.

 12. Weber, David J., H. Andrew Black, Stephen R. McConnel, and Alan
     Buseman. 1990.  `STAMP: a tool for dialect adaptation'.
     Occasional Publications in Academic Computing No. 15.  Dallas, TX:
     Summer Institute of Linguistics.