STAMP Reference Manual
            Synthesizing after Transferring AMPLE Analyses
                             version 2.1b1
                               May 1999

        by Stephen McConnel (and H. Andrew Black for v. 2.1b1)

                 Copyright (C) 2000 SIL International

Published by:
Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.
Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.

The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.

Introduction to the STAMP program
*********************************

This manual describes STAMP, a computer program for adapting text in
conjunction with the AMPLE program.  This combination falls under the
Analysis Transfer Synthesis (ATS) paradigm.  It involves the following
steps:

(1) AMPLE, a morphological parser, is applied to source language text
    to analyze each word into morphemes.

(2) STAMP is applied to these analyses to make changes that will
    produce the corresponding target language word.

(3) An interactive editor is applied to STAMP output to resolve the
    words that AMPLE failed to analyze and those for which AMPLE and
    STAMP produced multiple possibilities.  The result is a
    word-for-word draft of the source language text in the target
    language.

(4) After analysis failures and ambiguities have been eliminated, the
    text must be checked and corrected by a competent speaker of the
    target language.

STAMP incorporates no language-specific facts; the user makes linguistic
facts known to STAMP entirely through external files.  STAMP is
sufficiently general to serve over a wide range of language families.
(However, AMPLE and STAMP do not adequately handle highly isolating
languages, that is, languages which have virtually no morphology.)

The name STAMP is derived by taking AMP from AMPLE, and T and S from
STAMP's main modules, TRANSFER and SYNTHESIS, which are applied in
succession to the output of AMPLE.  Thus one can think of STAMP as
S(T(AMP)), or more explicitly as:

     adapted text = Synthesis[Transfer[AMPLE[source text]]]

Note: much of this reference manual is based almost verbatim on the book
published in 1990 (Weber, Black, McConnel, and Buseman), without
explicit permission from the coauthors.

New features
............

1. Version 2.1 (May 1999) introduced punctuation environment
   constraints in the allomorph fields of the dictionary files.  These
   are handled by a new built-in test called PEC_ST.  This version also
   added two punctuation-oriented clauses to user-written tests.

Running STAMP
*************

STAMP is a batch process oriented program.  It reads a number of control
files, and then processes one or more input analysis files to produce an
equal number of output files.

STAMP Command Options
=====================

The STAMP program uses an old-fashioned command line interface
following the convention of options starting with a dash character
(`-').  The available options are listed below in alphabetical order.
Those options which require an argument have the argument type
following the option letter.

`-a'
     causes all possible syntheses to be generated.

`-c character'
     selects the control file comment character.  The default is the
     vertical bar (`|').

`-d number'
     selects the maximum dictionary trie depth.  The default is 2, which
     favors reduced memory needs over speed.

`-f filename'
     opens a command file containing the names of the control and data
     files.  The default is to read those names from the standard input
     (keyboard); see `Program Interaction' below.

`-i filename'
     selects a single input analysis file.

`-m'
     monitors progress of an analysis: `*' means an analysis failure,
     `.' means a single analysis, `2'-`9' means 2-9 ambiguities, and
     `>' means 10 or more ambiguities.  This is not compatible with the
     `-q' option.

`-n'
     prevents root categories from being checked.

`-o filename'
     selects a single output text (or analysis if the `-x' option is
     also selected) file.

`-q'
     causes STAMP to operate "quietly" with minimal screen output.  This
     is not compatible with the `-m' option.

`-r'
     causes morphnames that are not found in the dictionary to be
     reported.

`-t'
     causes the transfer and synthesis processes to be traced.  This
     produces a huge amount of output.

`-u'
     signals that dictionaries are unified, not split into prefix,
     infix, suffix, and root files.

`-v'
     verifies tests by pretty printing the parse trees.

`-x'
     restricts STAMP to only the transfer process.
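
The progress characters written by the `-m' option can be summarized in
a few lines of Python.  This is only an illustrative sketch of the
mapping described above, not code taken from STAMP; the function name
`monitor_symbol' is hypothetical.

```python
def monitor_symbol(count):
    """Return the -m progress character for a word with `count` results."""
    if count < 0:
        raise ValueError("count must not be negative")
    if count == 0:
        return "*"            # failure
    if count == 1:
        return "."            # a single result
    if count <= 9:
        return str(count)     # 2-9 ambiguities
    return ">"                # 10 or more ambiguities


# Five words with 1, 0, 3, 12, and 1 results respectively.
print("".join(monitor_symbol(n) for n in [1, 0, 3, 12, 1]))
```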

The following options exist only in beta-test versions of the program,
since they are used only for debugging.

`-/'
     increments the debugging level.  The default is zero (no debugging
     output).

`-z filename'
     opens a file for recording a memory allocation log.

`-Z address,count'
     traps the program at the point where `address' is allocated or
     freed for the `count''th time.

Program Interaction
===================

If the `-f', `-i', and `-o' command options are not used, STAMP prompts
for a number of file names, reading the standard input for the desired
values.  The interactive dialog goes like this:

     C> stamp
     STAMP: Synthesis(Transfer(AMPle(text))) = adapted text
     Version 2.0b1 (July 21, 1998), Copyright 1998 SIL, Inc.
     Beta test version compiled Jul 27 1998 16:04:11
            Transfer/Synthesis Performed Tue Jul 28 14:54:04 1998
     STAMP declarations file (zzSTAMP.DEC): pnstamp.dec
     Transfer file (xxzzTR.CHG) [none]: hgpntr.chg
     Synthesis file (zzSYNT.CHG) [none]: pnsynt.chg
     Dictionary code table (zzSYCD.TAB): pnsycd.tab
     Dictionary orthography change table (zzORDC.TAB) [none]: pnordc.tab
     	10 changes loaded from suffix dictionary code table.
     Suffix dictionary file (zzSF01.DIC): pnsf01.dic
             SUFFIX DICTIONARY: Loaded 137 records
     	10 changes loaded from root dictionary code table.
     Root dictionary file (xxRTnn.DIC): pnsyrt.dic
             ROOT DICTIONARY: Loaded 176 records
     Next Root dictionary file (xxRTnn.DIC) [no more]:
     Output text control file (zzOUTTX.CTL) [none]: pnoutx.ctl
     10 output orthography changes were loaded from pnoutx.ctl
     
     First Input file: pntest.ana
     Output file: pntest.syn
     
     Next Input file (or RETURN if no more):
     C>


Note that each prompt contains a reminder of the expected form of the
answer in parentheses and ends with a colon.  Several of the prompts
also contain the default answer in brackets.

Using the command options does not change the appearance of the program
screen output significantly, but the program displays the answers to
each of its prompts without waiting for input.  Assume that the file
`pntest.cmd' contains the following, which is the same as the answers
given above:

     pnstamp.dec
     hgpntr.chg
     pnsynt.chg
     pnsycd.tab
     pnordc.tab
     pnsf01.dic
     pnsyrt.dic
     
     pnoutx.ctl


Then running STAMP with the command options produces screen output like
the following:

     C> stamp -f pntest.cmd -i pntest.ana -o pntest.syn
     STAMP: Synthesis(Transfer(AMPle(text))) = adapted text
     Version 2.0b1 (July 21, 1998), Copyright 1998 SIL, Inc.
     Beta test version compiled Jul 27 1998 16:04:11
            Transfer/Synthesis Performed Tue Jul 28 14:59:34 1998
     STAMP declarations file (zzSTAMP.DEC): pnstamp.dec
     Transfer file (xxzzTR.CHG) [none]: hgpntr.chg
     Synthesis file (zzSYNT.CHG) [none]: pnsynt.chg
     Dictionary code table (zzSYCD.TAB): pnsycd.tab
     Dictionary orthography change table (zzORDC.TAB) [none]: pnordc.tab
     	10 changes loaded from suffix dictionary code table.
     Suffix dictionary file (zzSF01.DIC): pnsf01.dic
             SUFFIX DICTIONARY: Loaded 137 records
     	10 changes loaded from root dictionary code table.
     Root dictionary file (xxRTnn.DIC): pnsyrt.dic
             ROOT DICTIONARY: Loaded 176 records
     Next Root dictionary file (xxRTnn.DIC) [no more]:
     Output text control file (zzOUTTX.CTL) [none]: pnoutx.ctl
     10 output orthography changes were loaded from pnoutx.ctl
     
     C>


The only difference in the screen output is that the prompts for the
input analysis file and the output file are not displayed.
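
When STAMP is driven from a script, the command file and the command
line can be generated rather than typed.  The following Python sketch
shows one way to do that using only the `-f', `-i', and `-o' options
described above.  The helper names are hypothetical, and the program
name `stamp' is assumed to match the transcript; an empty string in the
answer list stands for pressing RETURN at a prompt.

```python
def command_file_text(answers):
    """Join the prompt answers, one per line, as they would appear in a
    command file.  An empty string becomes an empty line."""
    return "\n".join(answers) + "\n"


def stamp_argv(command_file, analysis_file, output_file, program="stamp"):
    """Build the argument list for a non-interactive run."""
    return [program, "-f", command_file, "-i", analysis_file,
            "-o", output_file]


answers = [
    "pnstamp.dec",   # declarations file
    "hgpntr.chg",    # transfer file
    "pnsynt.chg",    # synthesis file
    "pnsycd.tab",    # dictionary code table
    "pnordc.tab",    # dictionary orthography change table
    "pnsf01.dic",    # suffix dictionary
    "pnsyrt.dic",    # first root dictionary
    "",              # no more root dictionaries
    "pnoutx.ctl",    # output text control file
]

text = command_file_text(answers)
argv = stamp_argv("pntest.cmd", "pntest.ana", "pntest.syn")
print(" ".join(argv))
```

The text returned by `command_file_text' would be written to
`pntest.cmd' before the argument list is handed to whatever mechanism
launches the program.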

Standard format
***************

The input control files and input analysis file that STAMP reads are
all "standard format" files.  This means that the files are divided
into records and fields.  Each file contains at least one record, and
some files may contain a large number of records.  Each record contains
one or more fields.  Each field occupies at least one line, and is
marked by a "field code" at the beginning of the line.  A field code
begins with a backslash character (`\') followed by one or more
printing characters (usually alphabetic).

If the file is designed to have multiple records, then one of the field
codes must be designated to be the "record marker", and every record
begins with that field, even if it is empty apart from the field code.
If the file contains only one record, then the relative order of the
fields is constrained only by their semantics.

It is worth emphasizing that field codes must be at the *beginning* of
a line.  Even a single space before the backslash character prevents it
from being recognized as a field code.

It is also worth emphasizing that record markers *must* be present even
if that field has no information for that record.  Omitting the record
marker causes two records to be merged into a single record, with
unpredictable results.
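
The rules above can be illustrated with a small reader written in
Python.  It is a sketch of the format as described in this chapter, not
STAMP's own reader.  The function names and the field codes `\rec' and
`\x' are hypothetical, and the treatment of continuation lines (joined
to the preceding field, with blank lines dropped) is an assumption.

```python
def parse_fields(text):
    """Split standard format text into (field_code, value) pairs."""
    fields = []
    for line in text.splitlines():
        if line.startswith("\\"):
            # A backslash in column one starts a new field.
            parts = line.split(None, 1)
            value = parts[1].rstrip() if len(parts) > 1 else ""
            fields.append([parts[0], value])
        elif fields and line.strip():
            # Anything else, including a line whose backslash follows a
            # space, continues the current field.
            joined = fields[-1][1] + "\n" + line.rstrip()
            fields[-1][1] = joined.lstrip("\n")
    return [tuple(field) for field in fields]


def split_records(fields, record_marker):
    """Group fields into records, each starting at the record marker."""
    records = []
    for code, value in fields:
        if code == record_marker or not records:
            records.append([])
        records[-1].append((code, value))
    return records


sample = (
    "\\rec one\n"
    "\\x first value\n"
    " \\x leading space, so this is not a field code\n"
    "\\rec\n"
    "\\x second value\n"
)
records = split_records(parse_fields(sample), "\\rec")
print(len(records))
```

In the sample, the third line is absorbed into the preceding `\x' field
because of its leading space, and the empty `\rec' field still starts a
second record.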

STAMP Declarations File
***********************

The fields that STAMP recognizes in its "declarations file" are
described below.  Fields that start with any other backslash codes are
ignored by STAMP.

Analytic ambiguity delimiter: \ambig
====================================

When AMPLE produces more than one analysis, each analysis is set off by
a unique character.  Likewise, when AMPLE fails to analyze a source
language word, it flags this word with the same character, the default
for which is a percent sign (`%').  However, a user may override
AMPLE's default.

Like AMPLE, STAMP assumes this delimiter to be a percent sign.  If an
analyzed text does not use this character, STAMP must be informed as to
what character was used.  To do this, use the \ambig field to define the
desired character.  For example, the following would change the analytic
ambiguity delimiter to `@':

     \ambig @
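
To show what the delimiter does, the Python sketch below splits an
ambiguity-marked string into its alternatives.  The layout it assumes
(delimiter, count, delimiter, then each alternative followed by the
delimiter) is inferred from examples elsewhere in this manual, such as
`%2%aywarqun%aywarqan%'.  The function name is hypothetical, and the
marking of analysis failures is not handled.

```python
def split_alternatives(field, delimiter="%"):
    """Return the alternatives packed into an ambiguity-marked string."""
    if len(delimiter) != 1:
        raise ValueError("the delimiter must be a single character")
    if not field.startswith(delimiter):
        return [field]                    # not marked as ambiguous
    parts = field.split(delimiter)
    # parts[0] is empty and parts[1] is the count; the rest are the
    # alternatives, followed by one empty string after the last delimiter.
    return [part for part in parts[2:] if part]


print(split_alternatives("%2%aywarqun%aywarqan%"))
print(split_alternatives("@2@aywarqun@aywarqan@", delimiter="@"))
```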

Allomorph property declaration: \ap
===================================

Allomorph properties are defined by the field code `\ap' followed by
one or more allomorph property names.  An allomorph property name must
be a single, contiguous sequence of printing characters.  Characters
and words which have special meanings in tests should not be used.  For
example, the following would declare the allomorph properties
`deletedK', `deletedG', and `underlyingV':


     \ap deletedK deletedG  | elided morpheme final velars
     \ap underlyingV        | underlying long vowel


A maximum of 255 properties (including both allomorph and morpheme
properties) may be defined unless the `\maxprops' field is used to
define a larger number.  Any number of `\ap' fields may be used so long
as the number of property names does not exceed 255 (or the number
defined by the `\maxprops' field).  Note that any `\maxprops' field
must occur before any `\ap' or `\mp' fields.

Category declarations: \ca
==========================

Categories are defined by the field code `\ca' followed by one or more
category names.  A category name must be a single, contiguous sequence
of printing characters.  Characters and words which have special
meanings in tests should not be used.

A maximum of 255 categories may be defined.  Any number of `\ca' fields
may be used so long as the number of category names does not exceed 255.

Category output control: \cat
=============================

The category information to write to the analysis output file is
defined by the field code `\cat' followed by one or two words.  The
first word must be either `prefix' or `suffix' (or an abbreviation of
one of those words), either capitalized or lowercase.

The `\cat' field may appear any number of times, but once is enough.
If more than one such field occurs, the last one is the one that is
used.

NOTE: at present, STAMP accepts this field but does nothing with it.
It may be removed in a future version.

Category class declaration: \ccl
================================

A category class declaration has three parts: the field code `\ccl',
the name of the class, and the list of categories in the class
(separated by spaces).  For example, the following defines the class
`IVERB' containing the categories `V1X' and `V1Y':

     \ccl IVERB V1X V1Y

The class name must be a single, contiguous sequence of printing
characters.  Characters and words which have special meanings in tests
should not be used.  The category names must have been defined by an
earlier `\ca' field; see `Category declarations' above.

In transfer, category classes can only be used in the match strings of
lexical changes; see `Lexical change' below.

Each `\ccl' field defines a single category class.  Any number of
`\ccl' fields may appear in the file.
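
The dependency between `\ca' and `\ccl' fields can be checked
mechanically.  The Python sketch below reads declaration lines in file
order and reports class members that were not declared by an earlier
`\ca' field, as well as a category count above 255.  It is an
illustration only: the function name and the wording of the messages
are hypothetical, the second class in the example is invented, and how
STAMP itself reports such problems is not shown here.

```python
def read_categories(lines, comment="|"):
    """Collect categories and category classes from declaration lines."""
    categories, classes, errors = [], {}, []
    for line in lines:
        words = line.split(comment, 1)[0].split()   # drop any comment
        if not words:
            continue
        code, args = words[0], words[1:]
        if code == "\\ca":
            categories.extend(args)
        elif code == "\\ccl" and args:
            name, members = args[0], args[1:]
            for member in members:
                if member not in categories:
                    errors.append(f"class {name}: undeclared category {member}")
            classes[name] = members
    if len(categories) > 255:
        errors.append("more than 255 categories declared")
    return categories, classes, errors


categories, classes, errors = read_categories([
    "\\ca V1X V1Y N     | categories",
    "\\ccl IVERB V1X V1Y",
    "\\ccl NOMINAL N ADJ",
])
print(classes["IVERB"], errors)
```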

Maximum number of properties: \maxprops
=======================================

The maximum number of properties that can be defined can be increased
from the default of 255 by giving the `\maxprops' field code followed
by a number greater than or equal to 255 but less than 65536.

The `\maxprops' field may appear any number of times, but once is
enough.  If more than one such field occurs, the one containing the
largest valid value is the one that is used.

The `\maxprops' field must appear before any properties are defined.
This is the case for both morpheme and allomorph properties.

If no `\maxprops' fields appear in the declarations file, then STAMP
limits the number of properties which can be defined to 255.
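
The selection of the effective limit can be sketched in Python as
follows.  The function name is hypothetical.  Ignoring a `\maxprops'
field that follows a property declaration, with a warning, is an
assumption made for the sketch; this manual says only that the field
must come first.

```python
DEFAULT_LIMIT = 255


def property_limit(fields):
    """Return (limit, warnings) for (code, value) pairs in file order."""
    limit = DEFAULT_LIMIT
    seen_property = False
    warnings = []
    for code, value in fields:
        if code in ("\\ap", "\\mp"):
            seen_property = True
        elif code == "\\maxprops":
            if seen_property:
                warnings.append("\\maxprops after a property declaration")
                continue
            try:
                number = int(value.split()[0])
            except (ValueError, IndexError):
                continue                   # not a usable number
            # Only values from 255 through 65535 are valid; the largest
            # valid value wins.
            if 255 <= number < 65536 and number > limit:
                limit = number
    return limit, warnings


limit, warnings = property_limit([
    ("\\maxprops", "300"),
    ("\\maxprops", "1000"),
    ("\\maxprops", "70000"),     # out of range, ignored
    ("\\ap", "deletedK deletedG"),
    ("\\maxprops", "2000"),      # too late, ignored in this sketch
])
print(limit, warnings)
```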

Morpheme class declaration: \mcl
================================

A morpheme class declaration has three parts: the field code `\mcl',
the name of the class, and the list of morphnames in the class
(separated by spaces).  For example, a morpheme class `DIRECTION' could
be defined as follows:

     \mcl DIRECTION UP DOWN IN OUT

Such a class could be used in conditioning environments for lexical
changes, insertion rules, or substitution rules.  For example, the
following environment would limit the rule to apply only preceding one
of the directional morphemes:

     / _ [DIRECTION]

The class name must be a single, contiguous sequence of printing
characters.  Characters and words which have special meanings in tests
should not be used.  The morpheme names should be defined by an entry
in one of the dictionary files.

Each `\mcl' field defines a single morpheme class.  Any number of
`\mcl' fields may appear in the file.

Morpheme property declaration: \mp
==================================

Morpheme properties are defined by the field code `\mp' followed by one
or more morpheme property names.  A morpheme property name must be a
single, contiguous sequence of printing characters.  Characters and
words which have special meanings in tests should not be used.  For
example, the following would declare the morpheme properties `XYZ',
`ABC', and `DEF':

     \mp XYZ
     \mp ABC DEF


A maximum of 255 properties (including both allomorph and morpheme
properties) may be defined unless the `\maxprops' field is used to
define a larger number.  Any number of `\mp' fields may be used so long
as the number of property names does not exceed 255 (or the number
defined by the `\maxprops' field).  Note that any `\maxprops' field
must occur before any `\mp' or `\ap' fields.

Punctuation class: \pcl
=======================

A punctuation class is defined by the field code `\pcl' followed by the
class name, which is followed in turn by one or more punctuation
characters or (previously defined) punctuation class names.  A
punctuation class name used as part of the class definition must be
enclosed in square brackets.

The class name must be a single, contiguous sequence of printing
characters.  The individual members of the class are separated by
spaces, tabs, or newlines.

Each `\pcl' field defines a single punctuation class.  Any number of
`\pcl' fields may appear in the file.

If no `\pcl' fields appear in the declarations file, then STAMP does
not allow any punctuation classes in tests, and does not allow any
punctuation classes in punctuation environment constraints.
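
Because a punctuation class may refer to previously defined classes,
reading the `\pcl' fields amounts to expanding those references in file
order.  The Python sketch below illustrates this under stated
assumptions: the class names `STOP', `PAUSE', and `ANY' are invented,
the function name is hypothetical, and expanding a bracketed reference
into the members of the named class is an interpretation of the
description above rather than a statement about STAMP's internals.

```python
def read_punctuation_classes(values):
    """Expand the contents of \\pcl fields, given in file order."""
    classes = {}
    for value in values:
        words = value.split()            # spaces, tabs, or newlines
        name, members = words[0], words[1:]
        expanded = []
        for member in members:
            if len(member) > 2 and member[0] == "[" and member[-1] == "]":
                reference = member[1:-1]
                if reference not in classes:
                    raise KeyError(f"class {reference} is not yet defined")
                new_members = classes[reference]
            else:
                new_members = [member]
            for item in new_members:
                if item not in expanded:
                    expanded.append(item)
        classes[name] = expanded
    return classes


classes = read_punctuation_classes([
    "STOP . ? !",
    "PAUSE , ; :",
    "ANY [STOP] [PAUSE] -",
])
print(classes["ANY"])
```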

Root delimiter: \rd
===================

For each analysis, the root (or roots), and the category of the first
root, are delimited by a pair of reserved characters.  By default, AMPLE
uses wedges (`<' and `>').  If some characters other than wedges are
used for this purpose, they must be declared using the `\rd' field.
(`\rd' is mnemonic for "root delimiter".)  For example, the following
line might be included in the declarations file:

     \rd ( )

Two characters are expected after the field code, optionally separated
by white space.  The first is taken to be the opening (that is, left)
delimiter and the second is taken to be the closing (that is, right)
delimiter.  Different characters must be used for the opening and
closing delimiters.

The delimiters used to set off the root should not be used for any other
purpose in the analysis field.  The following may not be used for a
delimiter: the backslash (`\'), whatever character is used to indicate
analytic failures and ambiguities, or any orthographic character.

The `\rd' field may appear any number of times, but once is enough.  If
more than one such field occurs, the last one is the one that is used.

If no `\rd' fields appear in the declarations file, then STAMP uses the
delimiter characters `<' and `>'.
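
The constraints on the `\rd' field, and the way the delimiters set off
the root, can be sketched in Python.  Both function names are
hypothetical.  The check covers only the backslash and the ambiguity
character; whether a character is orthographic cannot be decided
without the orthography, so that restriction is left out.  The sample
analysis is the one used in the copy rule examples later in this
manual.

```python
def parse_root_delimiters(value, ambig="%"):
    """Return (opening, closing) from the contents of an \\rd field."""
    chars = [c for c in value if not c.isspace()]
    if len(chars) != 2:
        raise ValueError("exactly two delimiter characters are expected")
    opening, closing = chars
    if opening == closing:
        raise ValueError("the two delimiters must be different")
    for c in chars:
        if c == "\\" or c == ambig:
            raise ValueError(f"{c!r} may not be used as a root delimiter")
    return opening, closing


def root_part(analysis, delimiters=("<", ">")):
    """Return the text between the root delimiters of one analysis."""
    opening, closing = delimiters
    start = analysis.index(opening)
    end = analysis.index(closing, start + 1)
    return analysis[start + 1:end].strip()


print(root_part("< V1 *aywa > PAST 3"))
print(root_part("( V1 *aywa ) PAST 3", parse_root_delimiters("( )")))
```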

String class declaration: \scl
==============================

A string class declaration has three parts: the field code `\scl', the
name of the class, and the list of strings in the class (separated by
spaces).  String classes are used in synthesis in specifying string
environment constraints on regular sound changes and on allomorph
entries in the dictionaries.

The class name must be a single, contiguous sequence of printing
characters.  Characters and words which have special meanings in tests
should not be used.  The actual character strings have no such
restrictions.  The individual members of the class are separated by
spaces, tabs, or newlines.

Each `\scl' field defines a single string class.  Any number of `\scl'
fields may appear in the file.

Valid allomorph and string environment characters: \strcheck
============================================================

The characters considered to be valid for allomorph strings and string
environment constraints are defined by a `\strcheck' field code
followed by the list of characters.  Spaces are not significant in this
list.

The `\strcheck' field may appear any number of times, but once is
enough.  If more than one such field occurs, the last one is the one
that is used.

If no `\strcheck' fields appear in the declarations file, then STAMP
does not check whether allomorph strings and string environment
constraints contain only valid characters.
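
The check that a `\strcheck' field enables can be pictured with the
following Python sketch.  The function names and the character list are
invented for the illustration, and the sketch looks only at a bare
allomorph string, not at the full syntax of an environment constraint.

```python
def valid_characters(strcheck_value):
    """Return the set of characters listed in a \\strcheck field.
    Spaces are not significant, so whitespace is dropped."""
    return {c for c in strcheck_value if not c.isspace()}


def invalid_characters(string, valid):
    """Return the characters of `string` that are not in `valid`."""
    return sorted({c for c in string if c not in valid})


valid = valid_characters("a i u k q r w y n")
print(invalid_characters("aywarqun", valid))
print(invalid_characters("aywa-rqun", valid))
```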

Transfer Control File
*********************

The transfer control file for the STAMP program is a standard format
file containing a single data record.

Transfer file fields
====================

Analytic ambiguity delimiter: \ambig
------------------------------------

This field can also occur in the STAMP declarations file or the STAMP
synthesis control file instead; see `Analytic ambiguity delimiter' in
the chapter `STAMP declarations file' above.

Allomorph property declaration: \ap
-----------------------------------

This field can also occur in the STAMP declarations file or the STAMP
synthesis control file instead; see `Allomorph property declaration' in
the chapter `STAMP declarations file' above.

Category declarations: \ca
--------------------------

This field can also occur in the STAMP declarations file or the STAMP
synthesis control file instead; see `Category declarations' in the
chapter `STAMP declarations file' above.

Category class declaration: \ccl
--------------------------------

This field can also occur in the STAMP declarations file or the STAMP
synthesis control file instead; see `Category class declaration' in the
chapter `STAMP declarations file' above.

Copy rule: \cr
--------------

Suppose that a single source language morpheme corresponds to either of
two target language morphemes, where the choice between them is not
determined by any contextual factor within the word.  For example,
Huallaga Quechua -ra `PAST' (simple past) corresponds to two suffixes in
Pasco Quechua, in some cases to -rqU `RECPST' (recent past) and in other
cases to -rqa `REMPST' (remote past).  Whether the recent or remote past
tense is appropriate is a semantic matter, not determinable by any
structural factor, morphological or syntactic.

The best one can do in such a case is to create ambiguous output for
every instance of the past tense morpheme and leave the choice between
them to the person who edits the computer-adapted text.  Thus, for every
Huallaga Quechua word containing -ra `PAST' (as in 1 below) the program
should produce two Pasco Quechua words (as in 2), one with -rqU `RECPST'
and the other with -rqa `REMPST':

     1. aywaran
     2. %2%aywarqun%aywarqan%


This can be accomplished by means of a copying rule.  A copying rule
produces two output analyses.  It produces a copy of the input and then
the result of applying the rule as though it were a substitution rule;
see `Substitution rule' below.  The only syntactic difference between a
copying rule and a substitution rule is that the former begins with the
field code `\cr' and the latter begins with `\sr'.  See `Syntax of
transfer rules' below for a description of the syntax for these rules.

Returning to the Quechua example, the copying rule in 3 would apply to
the analysis in 4 to produce the two analyses in 5:

     3. \cr "PAST" "REMPST"
     4. < V1 *aywa > PAST 3
     5. %2%< V1 *aywa > PAST 3%< V1 *aywa > REMPST 3%


Copying rules apply to the output of previous rules in the transfer
file, and subsequent rules apply to each of the outputs.  Because
subsequent rules apply to each of the outputs of a copying rule, it is
possible for copying rules to feed copying rules.  For example,
consider the two hypothetical copying rules in 6 below.  If these are
applied to a single analysis containing A and Q (as schematized in 7
below), they would produce the four analyses shown in 9.

     6. \cr "A" "B"
        \cr "Q" "R"
     7. ...A...Q...
     8. %2%...A...Q...%...B...Q...%
     9. %4%...A...Q...%...A...R...%...B...Q...%...B...R...%


That is, the first rule of 6 produces the two analyses in 8.  Then the
second rule of 6 applies to each of these outputs, producing two
outputs for each as in 9.

Note that if the original analysis had been ambiguous (with two analyses
each containing A and Q), the two copying rules would have produced an
eight-way ambiguity.  The moral: if copying rules are used too
liberally, there might be a dramatic increase in the levels of
ambiguity produced.  Let the user beware!
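
The growth in ambiguity can be reproduced with a short Python model.
It treats an analysis as a plain list of morphnames and a copying rule
as a pair of morphnames, ignoring flags and environments.  The function
names are hypothetical, and replacing every occurrence of the match is
an assumption of the sketch.

```python
def apply_copy_rule(analyses, match, replacement):
    """Apply one copying rule to every analysis in the list."""
    result = []
    for analysis in analyses:
        result.append(list(analysis))            # the copy of the input
        if match in analysis:
            # ...followed by the result of the substitution.
            result.append([replacement if m == match else m
                           for m in analysis])
    return result


def show(analyses, delimiter="%"):
    """Format analyses in the style of the examples above."""
    words = [" ".join(analysis) for analysis in analyses]
    if len(words) == 1:
        return words[0]
    return delimiter + str(len(words)) + delimiter + \
        delimiter.join(words) + delimiter


step1 = apply_copy_rule([["A", "Q"]], "A", "B")
step2 = apply_copy_rule(step1, "Q", "R")
print(show(step1))
print(show(step2))

# Starting from a two-way ambiguity gives eight analyses.
doubled = apply_copy_rule(
    apply_copy_rule([["A", "Q"], ["A", "Q"]], "A", "B"), "Q", "R")
print(len(doubled))
```

Each copying rule that matches doubles the number of analyses, so the
count grows as a power of two in the number of matching rules.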

Flag rule: \fl
--------------

Flags are a mechanism for temporarily remembering some information about
an analysis.  A rule conditioned by one or more flags affects an
analysis only if all the conditions implied by those flags are true for
that analysis.  Flags make it possible to "insulate" source language
phenomena from target language phenomena.

The definition of a flag has three obligatory parts: (1) the field code
`\fl', (2) the name of the flag, and (3) the list of morphnames which
trigger the raising of that flag.  For example, consider the following
definition:

     \fl PLURAL PLIMPF PLDIR PLSTAT

The name of the flag is `PLURAL'.  The morphnames whose presence causes
the flag to be raised are `PLIMPF', `PLDIR', and `PLSTAT'.

Flag definitions are a type of rule.  Recall that rules are applied in
the order in which they are given in the transfer file.  (This excludes
lexical changes, which are applied before all rules.)  Thus, a flag is
raised only if one of the morphnames in its definition is present in the
analysis resulting from all previous rules (in the order they are given
in the transfer file).  For example, the plural flag defined above would
be raised only if `PLIMPF', `PLDIR', or `PLSTAT' were present in an
analysis at the point where the rule is defined in the transfer file.

Suppose there are two rules.  The first deletes `PLIMPF'.  The next one
is a flag definition which raises the `PLURAL' flag whenever `PLIMPF'
is present in an analysis.  This flag-raising rule only sees the result
of all previous rules.  Thus it would never raise the `PLURAL' flag in
this case, since `PLIMPF' would always have been deleted by the
preceding rule.  To get the proper effect, the flag definition rule
should be ordered before the rule which deletes the morphname that
causes the flag to be raised.

Flags cannot be used in a rule until they have been defined with a
`\fl' field.  Sometimes it is conceptually simpler to define all the
flags at the beginning of the transfer file; in other cases it is
advantageous to define each of them close to--but preceding--the rules
which use them.

Flags may be tested in copy and substitution rules following the match
and substitution strings, or in insertion rules following the string to
be inserted.  The flag names must always precede any conditioning
environments.  A flag name can be preceded by a tilde (`~') to
complement the sense of the flag; that is, a rule so modified applies
only if the named flag is not raised.

When a particular analysis has undergone all the rules of the transfer
file, all the flags are automatically lowered before another analysis is
considered.  To put it another way, flags do not stay raised from one
word to another.  Many flags are raised and never lowered until the next
word, but in some cases it is desirable to have the flags lowered before
some subsequent rule.

Whenever an analysis changes as the result of a rule that tests a flag
(or flags), then that rule's flags are automatically lowered.  The user
does not have to do anything because TRANSFER is designed to
automatically lower flags under this condition.  This avoids the
application of subsequent rules on the basis of the same flag.  For
example, consider the following rules.  (Insertion rules are discussed
in the next section.)

     \fl XFLG M1 M2       |raise flag when M1 or M2 present
     \ir "X1" XFLG / M1 _ |insert X1 after M1 when XFLG up
     \ir "X2" XFLG / _ M2 |insert X2 before M2 when XFLG up


Given an analysis with the sequence `M1 M2', the result of these
changes is `M1 X1 M2'.  It would not be `M1 X1 X2 M2' because, when the
first insertion rule applies, it lowers `XFLG'.  Consequently, the
second insertion rule does not apply.
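
The effect of automatic lowering can be seen in a small Python model of
these three rules.  The model is a deliberate simplification and not a
description of TRANSFER's implementation: rules are written as tuples,
each insertion rule tests one flag and has one environment naming a
single neighboring morphname, and the function name and the
`auto_lower' switch are hypothetical.

```python
def run_rules(analysis, rules, auto_lower=True):
    """Apply flag and insertion rules in order to one analysis."""
    morphs = list(analysis)
    raised = set()
    for rule in rules:
        if rule[0] == "fl":
            _, flag, triggers = rule
            if any(trigger in morphs for trigger in triggers):
                raised.add(flag)
        else:                                     # an "ir" rule
            _, new, flag, side, neighbor = rule
            if flag not in raised or neighbor not in morphs:
                continue
            at = morphs.index(neighbor)
            morphs.insert(at + 1 if side == "after" else at, new)
            if auto_lower:
                raised.discard(flag)              # the rule changed the analysis
    return morphs


RULES = [
    ("fl", "XFLG", ["M1", "M2"]),
    ("ir", "X1", "XFLG", "after", "M1"),
    ("ir", "X2", "XFLG", "before", "M2"),
]

print(run_rules(["M1", "M2"], RULES))
print(run_rules(["M1", "M2"], RULES, auto_lower=False))
```

The second call switches the lowering off purely for contrast; it shows
the `M1 X1 X2 M2' result that the automatic lowering prevents.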

Suppose one wished to drop a flag (say `PFLG') whenever a particular
morpheme (say `XY') is present in an analysis.  The following rule
would do it.  (Substitution rules are discussed below.)

     \sr "XY" "XY" PFLG |substitute XY for XY when PFLG up

When this rule applies, it produces no net change in the analysis, but
it has the important side effect of dropping the `PFLG' flag.

Consider the following substitution rule, where the `FLG' flag is
followed by three environments:

     \sr "XX" "YY" FLG / M1 _ / M2 _ / M3 _

This is equivalent to the following sequence of rules:

     \sr "XX" "YY" FLG / M1 _
     \sr "XX" "YY" FLG / M2 _
     \sr "XX" "YY" FLG / M3 _


Note that if the first applies, `FLG' is lowered so the second and
third could not apply.  Likewise, if the second applies, the third could
not apply.  Only one rule could possibly apply; transfer's behavior is
the same whether multiple environments are included in a single rule
or spread across several rules.

It is possible to limit a rule by a complemented flag, in which case the
rule applies only if the flag is not raised.  For example, definition 1
would raise the flag `KUFLAG' when `REFL', `INTNS', or `CMPLT' were
present, and rule 2 would insert `KU' when `KUFLAG' is not raised and
`INSERTKUFLAG' is raised:

     1. \fl KUFLAG REFL INTNS CMPLT
     2. \ir "KU" ~KUFLAG INSERTKUFLAG


Note that the constraint imposed by each of these flags must be
simultaneously met for the rule to apply.

In general, a flag is automatically lowered whenever a rule constrained
by that flag applies.  Note, however, that a complemented flag that
constrains a rule is not changed when the rule applies.  If rule 2
above were to apply, then `INSERTKUFLAG' would be lowered but `KUFLAG'
would not be affected; that is, it would remain lowered.
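
The way plain and complemented flags combine can be stated compactly in
Python.  The two functions below are a sketch with hypothetical names.
They assume that the set of raised flags is already known and, for the
second function, that the rule has applied and changed the analysis.

```python
def rule_applies(conditions, raised):
    """True if every flag condition holds.  A name starting with '~'
    is a complemented flag, which must NOT be raised."""
    for condition in conditions:
        if condition.startswith("~"):
            if condition[1:] in raised:
                return False
        elif condition not in raised:
            return False
    return True


def lower_tested_flags(conditions, raised):
    """Return the raised flags left after the rule has applied.
    Only the uncomplemented flags of the rule are lowered."""
    tested = {c for c in conditions if not c.startswith("~")}
    return set(raised) - tested


conditions = ["~KUFLAG", "INSERTKUFLAG"]
print(rule_applies(conditions, {"INSERTKUFLAG"}))
print(rule_applies(conditions, {"KUFLAG", "INSERTKUFLAG"}))
print(lower_tested_flags(conditions, {"INSERTKUFLAG"}))
```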

Flags and morpheme classes have some interesting similarities and
differences.  Both are defined in the same way (but with different field
codes), and both are used as conditions on rules.  They differ, however,
in where they are used in rules: morpheme classes occur in environments,
whereas flags occur after the rule's main part and before any
conditioning environments.  Perhaps the most important difference is one
of persistence.  Once defined, a morpheme class persists until STAMP
finishes; there is simply no way to "undefine" a class.  But flags are
volatile: they are raised when certain morphemes are present in an
analysis and lowered when a rule that tests the flag changes the
analysis.

Insertion rule: \ir
-------------------

Insertion rules insert morphnames into an analysis.  An insertion rule
may have (in order): (1) the field code \ir, (2) the morphname to be
inserted, (3) optionally one or more flags, (4) optionally one or more
environments into which the morphname should be inserted, and (5) a
comment.  The insertion string is delimited by some printing character.
The morphname is inserted into an analysis if all the rule's flags are
raised (or, for complemented flags, not raised) and at least one of its
environments is satisfied (if any are specified).

Of the five parts listed above, only the first two are obligatory.
However, an insertion rule consisting only of a field code and the
morphname (without any conditioning flags or environments) would insert
the morphname into every analysis.  The following example has the first
three parts; it would insert `PL' whenever the `PLFLG' flag is up:

     \ir "PL" PLFLG

(How transfer determines where to insert `PL' is discussed below.)  The
following inserts `PXT' immediately after `BDJ':

     \ir "PXT" / BDJ _

The following inserts `PLDIR' whenever both the `PLURAL' flag is up and
a directional suffix (that is, a member of the class `DIR') is present.
`PLDIR' is inserted immediately following the directional morpheme.

     \ir "PLDIR" PLURAL / [DIR] _

When an insertion rule has multiple environments, it applies only for
the first environment satisfied by a given analysis.  For example,
consider the following:

     \ir "PXT" / BDJ _  / QMR _

This rule is applied in the following way.  Potential insertion sites in
the current analysis are considered in order from left to right.  At the
first one, if `BDJ' occurs to the left, `PXT' is inserted, and nothing
more is done by this rule.  If `BDJ' does not occur there but `QMR'
does, `PXT' is inserted and nothing more is done by this rule.  Failing
to find either `BDJ' or `QMR' to the left, the potential insertion site
is shifted one place to the right and the process is repeated.  In this
way, all potential insertion sites in the analysis are evaluated until
either an insertion is made or there are no more potential insertion
sites in the analysis.  When one of these conditions is met, the next
rule in the transfer file is applied.

Each insertion rule may affect an analysis only once.  This prevents
multiple insertions in cases where more than one environment is
satisfied.  For example, consider the following rule:

     \ir "X" / _ Y / Z _

Consider how this affects an analysis with the sequence `Z Y'.  Both
environments are satisfied, so if multiple insertions were permitted by
a single rule, the result would be `Z X X Y'.  However, the desired
result is more likely to be `Z X Y', which is what the program will
produce.  Note that it is possible to get the former result by using two
rules:

     \ir "X" / _ Y
     \ir "X" / Z _


Since there are two rules, both would apply to `Z Y', the first
producing `Z X Y', the second applying to `Z X Y' to produce `Z X X Y'.
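
The behaviour described above can be sketched in a few lines of Python.
This is only an illustration, not STAMP's actual code: the function name
`apply_insertion' and the representation of an environment as a
(left, right) pair of neighbouring morphnames are assumptions made for
the example, and flags, classes, and ellipsis marking are ignored.

```python
def apply_insertion(analysis, morphname, environments):
    """Apply one insertion rule to a list of morphnames, at most once.

    Each environment is a (left, right) pair naming the morphname that
    must stand immediately before or after the insertion site; None
    means that side is unconstrained.  (Hypothetical representation.)
    """
    # Candidate sites are the gaps between morphnames, scanned left to right.
    for site in range(len(analysis) + 1):
        left = analysis[site - 1] if site > 0 else None
        right = analysis[site] if site < len(analysis) else None
        for want_left, want_right in environments:
            if want_left is not None and left != want_left:
                continue
            if want_right is not None and right != want_right:
                continue
            # First satisfied environment wins; the rule is then finished.
            return analysis[:site] + [morphname] + analysis[site:]
    return list(analysis)  # no environment satisfied: analysis unchanged


# One rule with two environments:  \ir "X" / _ Y / Z _
one_rule = apply_insertion(["Z", "Y"], "X", [(None, "Y"), ("Z", None)])

# Two separate rules:  \ir "X" / _ Y   followed by   \ir "X" / Z _
step = apply_insertion(["Z", "Y"], "X", [(None, "Y")])
two_rules = apply_insertion(step, "X", [("Z", None)])

print(one_rule)   # ['Z', 'X', 'Y']
print(two_rules)  # ['Z', 'X', 'X', 'Y']
```

Because the function returns as soon as one environment is satisfied, a
single rule can never insert twice, whereas chaining two calls
reproduces the two-rule result.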

Insertion rules are frequently conditioned by flags; thus some comments
about them are in order.  First, recall the discussion above: the
application of a rule automatically results in the lowering of any
flags in that rule.

Second, flags may be complemented by prefixing a tilde (`~').  The
following set of rules illustrates the use of a complemented flag.  It
is motivated by a situation in Quechua, where pluralization with -pakU
occurs only in the absence of other morphemes which have kU.

     \fl KUFLG REFL CMPL INTNS |flag for suffixes with kU
     \fl PLFLG PL1 PL2         |flag for pluralizers
     \sr "PL1" ""              |remove PL1
     \sr "PL2" ""              |remove PL2
     \ir "PLKU" PLFLG ~KUFLG   |insert if plural and no kU


The first line defines a flag for suffixes containing kU; the second
defines a flag for pluralizers.  The third and fourth lines delete the
pluralizers.  The last line is a rule that inserts `PLKU' if the
`PLFLG' is up and the `KUFLG' is down, that is, the original analysis
had a pluralizer and no suffix with kU is now present in the analysis.
If `PLKU' is ever inserted by this rule, `PLFLG' will be lowered.
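
The flag logic of this example can be sketched in Python as follows.
The sketch is illustrative only.  It assumes that a flag is up whenever
one of the morphnames listed in its `\fl' field occurs in the analysis
being transferred, and the function names are invented for the example;
where `PLKU' is actually placed is a separate question and is not
modelled here.

```python
# Flag definitions corresponding to the two \fl lines above.
FLAG_DEFS = {
    "KUFLG": {"REFL", "CMPL", "INTNS"},
    "PLFLG": {"PL1", "PL2"},
}


def raised_flags(analysis, flag_defs=FLAG_DEFS):
    """Return the flags that are up for this analysis (assumed rule: a
    flag is up if any of its morphnames occurs in the analysis)."""
    present = set(analysis)
    return {name for name, morphs in flag_defs.items() if morphs & present}


def conditions_met(conditions, up):
    """True if every plain flag is up and every ~flag is down."""
    for cond in conditions:
        if cond.startswith("~"):
            if cond[1:] in up:
                return False
        elif cond not in up:
            return False
    return True


def lower_flags(conditions, up):
    """Lower the plain flags named in a rule that has just applied."""
    return up - {c for c in conditions if not c.startswith("~")}


rule = ["PLFLG", "~KUFLG"]          # \ir "PLKU" PLFLG ~KUFLG

up = raised_flags(["root", "PL1"])
print(conditions_met(rule, up))     # True: plural, no kU suffix
print(lower_flags(rule, up))        # set(): PLFLG has been lowered

up = raised_flags(["root", "REFL", "PL2"])
print(conditions_met(rule, up))     # False: KUFLG is up
```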

Determining the site for insertion
----------------------------------

If an insertion rule has an environment with a simple environment bar,
then the position of the bar defines the site for insertion.  But when
the rule has no environment, or when the environment bar has ellipsis
marking, then the insertion site is not explicitly defined.  TRANSFER
has mechanisms for treating these cases.

Generally, the items to be inserted are either prefixes or suffixes.  In
the absence of an explicit environment statement, prefixes are inserted
somewhere before the leftmost root and suffixes are inserted somewhere
after the rightmost root.  TRANSFER determines whether the morpheme to
be inserted is a prefix or a suffix by determining which dictionary it
occurs in.  (For this reason each affix should have a unique morphname.)
Then it uses the orderclass of the morpheme, as defined in the
dictionary entry, to determine exactly where to insert the morpheme.

Consider an insertion rule with no environment, such as the following
one which inserts `ABC' whenever `XYZFLAG' is raised:

     \ir "ABC" XYZFLAG

TRANSFER determines the orderclass of `ABC' from the affix dictionaries
of the target language.  Given an analysis, if the `XYZFLAG' is up at
the point this insertion rule is applied, TRANSFER searches for an
acceptable place to put `ABC', attempting to place it as far right as
possible without violating orderclass, that is, without placing it
after an affix with a greater orderclass.  To illustrate, consider an
analysis like the following (with the orderclasses given below each
morphname):

     < C1 root > M1 M2 M3 M4 M5
                 10 20 30 40 50


Assuming that the orderclass of `ABC' is 40, the result of the
insertion rule would be the following:

     < C1 root > M1 M2 M3 M4 ABC M5

If it is necessary to insert a sequence of morphnames, they can be
inserted by a sequence of insertion rules.  For example, the following
three rules insert `ABC DEF GHI' when `XYZFLAG' is up:

     \ir "ABC" XYZFLAG
     \ir "DEF" / ABC _
     \ir "GHI" / DEF _


(A slightly more complicated solution would be needed if there were
analyses containing `ABC' or `DEF' into which these rules would
incorrectly insert `DEF' or `GHI'.)  Applied to the previous example,
the result would be:

     < C1 root > M1 M2 M3 M4 ABC DEF GHI M5

Whenever the insertion site is not precisely defined by the environment
bar, insertion will be based on orderclass.  Therefore, ellipsis marking
can be used to constrain an insertion by the presence of one or more
morphnames and yet have the insertion based on orderclass.  For example,
either of the following rules inserts `PQR' as far right as possible
without violating orderclass whenever `M4' occurs in an analysis:

     \ir "PQR" / _... M4
     \ir "PQR" / M4 ..._
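
The orderclass-based placement described in this section can be
sketched in Python.  This is a simplified illustration rather than
TRANSFER's real algorithm: it handles suffixes only, represents each
suffix as a hypothetical (morphname, orderclass) pair, and places the
new morpheme immediately before the first suffix whose orderclass is
greater.

```python
def insert_suffix_by_orderclass(suffixes, new_name, new_orderclass):
    """Insert a suffix as far right as possible without placing it after
    a suffix with a greater orderclass.

    suffixes: list of (morphname, orderclass) pairs that follow the
    rightmost root, in their current order.
    """
    site = len(suffixes)  # default: append at the very end
    for index, (_, orderclass) in enumerate(suffixes):
        if orderclass > new_orderclass:
            site = index  # must stay to the left of this suffix
            break
    return suffixes[:site] + [(new_name, new_orderclass)] + suffixes[site:]


suffixes = [("M1", 10), ("M2", 20), ("M3", 30), ("M4", 40), ("M5", 50)]
result = insert_suffix_by_orderclass(suffixes, "ABC", 40)
print(" ".join(name for name, _ in result))  # M1 M2 M3 M4 ABC M5
```

A suffix with the same orderclass as the new morpheme does not block the
search, which is why `ABC' lands after `M4' rather than before it.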


Lexical change: \lc
-------------------

FIX ME!

Maximum number of properties: \maxprops
---------------------------------------

This field can also occur in the STAMP declarations file or the STAMP
synthesis control file instead; see `Maximum number of properties' in
the chapter `STAMP declarations file' above.

Morpheme class declaration: \mcl
--------------------------------

This field can also occur in the STAMP declarations file or the STAMP
synthesis control file instead; see `Morpheme class declaration' in the
chapter `STAMP declarations file' above.

Morpheme property declaration: \mp
----------------------------------

This field can also occur in the STAMP declarations file or the STAMP
synthesis control file instead; see `Morpheme property declaration' in
the chapter `STAMP declarations file' above.

Punctuation class: \pcl
-----------------------

This field can also occur in the STAMP declarations file or the STAMP
synthesis control file instead; see `Punctuation class' in the chapter
`STAMP declarations file' above.

Root delimiter: \rd
-------------------

This field can also occur in the STAMP declarations file or the STAMP
synthesis control file instead; see `Root delimiter' in the chapter
`STAMP declarations file' above.

String class declaration: \scl
------------------------------

This field can also occur in the STAMP declarations file or the STAMP
synthesis control file instead; see `String class declaration' in the
chapter `STAMP declarations file' above.

Substitution rule: \sr
----------------------

FIX ME!

Syntax of transfer rules
========================

FIX ME!

Syntax of lexical changes
=========================

FIX ME!

Synthesis Control File
**********************

Analytic ambiguity delimiter: \ambig
====================================

This field can also occur in the STAMP declarations file or the STAMP
transfer control file instead; see `Analytic ambiguity delimiter' in
the chapter `STAMP declarations file' above.

Allomorph property declaration: \ap
===================================

This field can also occur in the STAMP declarations file or the STAMP
transfer control file instead; see `Allomorph property declaration' in
the chapter `STAMP declarations file' above.

Category declarations: \ca
==========================

This field can also occur in the STAMP declarations file or the STAMP
transfer control file instead; see `Category declarations' in the
chapter `STAMP declarations file' above.

Category class declaration: \ccl
================================

This field can also occur in the STAMP declarations file or the STAMP
transfer control file instead; see `Category class declaration' in the
chapter `STAMP declarations file' above.

Lexical change: \lc
===================

FIX ME!

Maximum number of properties: \maxprops
=======================================

This field can also occur in the STAMP declarations file or the STAMP
transfer control file instead; see `Maximum number of properties' in
the chapter `STAMP declarations file' above.

Morpheme class declaration: \mcl
================================

This field can also occur in the STAMP declarations file or the STAMP
transfer control file instead; see `Morpheme class declaration' in the
chapter `STAMP declarations file' above.

Morpheme property declaration: \mp
==================================

This field can also occur in the STAMP declarations file or the STAMP
transfer control file instead; see `Morpheme property declaration' in
the chapter `STAMP declarations file' above.

Punctuation class: \pcl
=======================

This field can also occur in the STAMP declarations file or the STAMP
transfer control file instead; see `Punctuation class' in the chapter
`STAMP declarations file' above.

Root delimiter: \rd
===================

This field can also occur in the STAMP declarations file or the STAMP
transfer control file instead; see `Root delimiter' in the chapter
`STAMP declarations file' above.

Regular sound change: \rsc
==========================

FIX ME!

Regular sound change markers: \rscid
====================================

FIX ME!

String class declaration: \scl
==============================

This field can also occur in the STAMP declarations file or the STAMP
transfer control file instead; see `String class declaration' in the
chapter `STAMP declarations file' above.

Synthesis test: \test
=====================

FIX ME!

Dictionary Code Table File
**************************

The fourth control file read by STAMP contains the dictionary code
table.  Each entry of a STAMP dictionary (whether for roots, prefixes,
infixes, or suffixes) is structured by field codes that indicate the
type of information that follows.  The dictionary code table maps the
field codes used in the dictionary files onto the internal codes that
STAMP uses.  This allows linguists to use their favorite dictionary
field codes rather than constraining them to a predefined set.

The dictionary code table is divided into one or more sections, one for
each type of dictionary file.  Each section contains several mappings
of field codes in the form of simple changes.  The field codes used in
the dictionary code table file are described in the remainder of this
chapter.

Change standard format marker to internal code: \ch
===================================================

A dictionary field code change is defined by `\ch' followed by two
quoted strings.  The first string is the field code used in the
dictionary (including the leading backslash character).  The second
string is the single capital letter designating the field type.  For
the lists of dictionary field type codes, see `Dictionary Files' below.

Any character not found in either the dictionary field code string or
the dictionary field type code may be used as the quoting character.
The double quote (`"') and single quote (`'') are most often used for
this purpose.
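
As an illustration of the quoting convention, the following Python
sketch splits a `\ch' line into its two strings, taking whatever
character opens each string as its quoting character.  The function
name and the field codes `\a' and `\mn' are hypothetical; the sketch
does not check that the second string is a valid field type code.

```python
def parse_code_change(line):
    """Return (dictionary_field_code, field_type_code) from a \\ch line."""
    body = line.strip()
    if not body.startswith("\\ch"):
        raise ValueError("not a \\ch field: %r" % line)
    rest = body[3:].lstrip()
    strings = []
    for _ in range(2):
        if not rest:
            raise ValueError("expected two quoted strings: %r" % line)
        quote = rest[0]              # any character may act as the quote
        end = rest.index(quote, 1)   # raises ValueError if never closed
        strings.append(rest[1:end])
        rest = rest[end + 1:].lstrip()
    return tuple(strings)


print(parse_code_change(r'\ch "\a" "A"'))    # ('\\a', 'A')
print(parse_code_change(r"\ch '\mn' 'M'"))   # ('\\mn', 'M')
```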

Infix dictionary fields: \infix
===============================

The set of dictionary field code changes for an infix dictionary file
begins with `\infix', optionally followed by the record marker field
code for the infix dictionary.  If the record marker is not given, then
the field code ("from string") from the first infix dictionary field
code change is used.  See `Dictionary Files' below for the set of infix
dictionary field type codes.

Prefix dictionary fields: \prefix
=================================

The set of dictionary field code changes for a prefix dictionary file
begins with `\prefix', optionally followed by the record marker field
code for the prefix dictionary.  If the record marker is not given,
then the field code ("from string") from the first prefix dictionary
field code change is used.  See `Dictionary Files' below for the set of
prefix dictionary field type codes.

Root dictionary fields: \root
=============================

The set of dictionary field code changes for a root dictionary file
begins with `\root', optionally followed by the record marker field
code for the root dictionary.  If the record marker is not given, then
the field code ("from string") from the first root dictionary field
code change is used.  See `Dictionary Files' below for the set of root
dictionary field type codes.

Suffix dictionary fields: \suffix
=================================

The set of dictionary field code changes for a suffix dictionary file
begins with `\suffix', optionally followed by the record marker field
code for the suffix dictionary.  If the record marker is not given,
then the field code ("from string") from the first suffix dictionary
field code change is used.  See `Dictionary Files' below for the set of
suffix dictionary field type codes.

Unified dictionary fields: \unified
===================================

The set of dictionary field code changes for a unified dictionary file
begins with `\unified', optionally followed by the record marker field
code for the unified dictionary.  If the record marker is not given,
then the field code ("from string") from the first unified dictionary
field code change is used.  See `Dictionary Files' below for the set of
unified dictionary field type codes.

Dictionary Orthography Change Table File
****************************************

The fifth control file read by STAMP, and the third optional one,
contains the dictionary orthography change table.  This table maps the
allomorph strings in the dictionary files into the internal
orthographic representation.  When the text and internal orthographies
differ, it may be desirable to have the allomorphs in the dictionaries
stored in the same orthography as the texts, or it may be desirable to
have them in the internal form, or it might even be desirable to have
them in a third form.  STAMP allows for any of these choices.

The dictionary orthography change table is defined by a special
standard format file.  This file contains a single record with two
types of fields, either of which may appear any number of times.  The
rest of this chapter describes these fields, focusing on the syntax of
the orthography changes.

Dictionary Orthography Change: \ch
==================================

An orthography change is defined by the `\ch' field code followed by
the actual orthography change.  Any number of orthography changes may
be defined in the dictionary orthography change table.  The output of
each change serves as the input to the following change.  That is, each
change is applied as many times as necessary to a dictionary allomorph
before the next change from the dictionary orthography change table is
applied.  See `Text Orthography Change: \ch' below for the syntax of
orthography changes.
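
The effect of this ordering can be seen in a small Python sketch.  It
models only unconditioned changes as hypothetical (match, replacement)
string pairs and ignores environments and string classes; it is meant
to show why the order of the `\ch' fields matters, not how STAMP
implements them.

```python
def apply_changes(allomorph, changes):
    """Apply each (match, replacement) pair in order; every change sees
    the output of the one before it."""
    for match, replacement in changes:
        # Replace every occurrence before moving on to the next change.
        allomorph = allomorph.replace(match, replacement)
    return allomorph


print(apply_changes("chaki", [("ch", "c"), ("c", "k")]))  # kaki
print(apply_changes("chaki", [("c", "k"), ("ch", "c")]))  # khaki
```

In the first call the output of the `ch' change feeds the `c' change;
in the second the `c' change applies first and leaves nothing for the
`ch' change to match.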

String class: \scl
==================

A string class is defined by the `\scl' field code followed by the
class name, which is followed in turn by one or more contiguous
character strings or (previously defined) string class names.  A string
class name used as part of the class definition must be enclosed in
square brackets.

The class name must be a single, contiguous sequence of printing
characters.  Characters and words which have special meanings in tests
should not be used.  The actual character strings have no such
restrictions.  The individual members of the class are separated by
spaces, tabs, or newlines.

Each `\scl' field defines a single string class.  Any number of `\scl'
fields may appear in the file.  The only restriction is that a string
class must be defined before it is used.
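
The define-before-use restriction can be illustrated with a short Python
sketch.  It is an assumption-laden model: classes are kept in a plain
dictionary, a bracketed member is expanded into the members of the
class it names, and the class names `VOWEL' and `SEG' are invented for
the example.  How STAMP stores nested classes internally is not stated
in this manual.

```python
def define_string_class(classes, name, members):
    """Record a string class, expanding [NAME] references to classes
    that have already been defined."""
    expanded = []
    for member in members:
        if len(member) > 2 and member.startswith("[") and member.endswith("]"):
            referenced = member[1:-1]
            if referenced not in classes:
                raise KeyError("string class used before definition: "
                               + referenced)
            expanded.extend(classes[referenced])
        else:
            expanded.append(member)
    classes[name] = expanded


classes = {}
define_string_class(classes, "VOWEL", "a e i o u".split())
define_string_class(classes, "SEG", ["[VOWEL]", "p", "t"])
print(classes["SEG"])  # ['a', 'e', 'i', 'o', 'u', 'p', 't']
```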

If no `\scl' fields appear in the dictionary orthography changes file,
then STAMP does not allow any string classes in dictionary orthography
change environment constraints unless they are defined in the STAMP
declarations file, the transfer control file, or the synthesis control
file.

Dictionary Files
****************

This chapter describes the content of STAMP dictionary files.  These
are normally divided into
  1. a prefix dictionary file (if needed),

  2. an infix dictionary file (if needed),

  3. a suffix dictionary file (if needed), and

  4. one or more root dictionary files.

With the `-u' command line option in conjunction with the `\unified'
field in the dictionary code table file, the dictionary can be stored
as one or more files containing entries of any type: prefix, infix,
suffix, or root.

The following sections describe the different types of fields used in
the different types of dictionary files.  Remember, the mapping from
the actual field codes used in the dictionary files to the type codes
that STAMP uses internally is controlled by the dictionary code table
file (see `Dictionary Code Table File' above).

Allomorph (internal code A)
===========================

Each dictionary entry must contain one or more allomorph fields.  Each
of these contains one of the morpheme's allomorphs, that is, the string
of characters by which the morpheme is represented in text and
recognized by STAMP.

If an affix has multiple allomorphs, each one must be entered in its own
allomorph field.  These fields should be ordered with those on which the
strictest constraints have been imposed preceding those with less strict
or no constraints.  The only exception to this is the use of indexed
string classes to indicate reduplication.  (See rules 26 and 27 below.)

Properties, constraints, and comments may follow the allomorph string.
Any properties must be listed before any constraints.  String,
punctuation and morpheme environment constraints may be intermixed, but
must come before any comments.  A complete BNF grammar of an allomorph
field is given below.


      1a. <allomorph_field> ::= <allomorph>
      1b.                       <allomorph> <properties>
      1c.                       <allomorph> <constraints>
      1d.                       <allomorph> <properties> <constraints>
      1e.                       <allomorph> <comment>
      1f.                       <allomorph> <properties> <comment>
      1g.                       <allomorph> <constraints> <comment>
      1h.                       <allomorph> <properties> <constraints> <comment>
     
      2a. <allomorph>       ::= <literal>
      2b.                       <literal> { <literal> }
      2c.                       <redup_pattern>
      2d.                       <redup_pattern> { <literal> }
     
      3a. <properties>      ::= <literal>
      3b.                       <literal> <properties>
     
      4a. <constraints>     ::= <string_constraint>
      4b.                       <morph_constraint>
      4c.                       <punct_constraint>
      4d.                       <string_constraint> <constraints>
      4e.                       <morph_constraint> <constraints>
      4f.                       <punct_constraint> <constraints>
     
      5.  <comment>         ::= <comment_char> anything to the end of the line
     
      6a. <string_constraint> ::= / <envbar> <string_right>
      6b.                         / <string_left> <envbar>
      6c.                         / <string_left> <envbar> <string_right>
     
      7a. <string_left>       ::= <string_side>
      7b.                         <boundary>
      7c.                         <boundary> <string_side>
      7d.                         <string_side> # <string_side>
      7e.                         <boundary> <string_side> # <string_side>
     
      8a. <string_right>      ::= <string_side>
      8b.                         <boundary>
      8c.                         <string_side> <boundary>
      8d.                         <string_side> # <string_side>
      8e.                         <string_side> # <string_side> <boundary>
     
      9a. <string_side>       ::= <string_item>
      9b.                         <string_item> <string_side>
      9c.                         <string_item> ... <string_side>
     
     10a. <string_item>       ::= <string_piece>
     10b.                         ( <string_piece> )
     
     11a. <string_piece>      ::= ~ <string_piece>
     11b.                         <literal>
     11c.                         [ <literal> ]
     11d.                         [ <indexed_literal> ]
     
     12a. <morph_constraint>  ::= +/ <envbar> <morph_right>
     12b.                         +/ <morph_left> <envbar>
     12c.                         +/ <morph_left> <envbar> <morph_right>
     
     13a. <morph_left>        ::= <morph_side>
     13b.                         <boundary>
     13c.                         <boundary> <morph_side>
     13d.                         <morph_side> # <morph_side>
     13e.                         <boundary> <morph_side> # <morph_side>
     
     14a. <morph_right>       ::= <morph_side>
     14b.                         <boundary>
     14c.                         <morph_side> <boundary>
     14d.                         <morph_side> # <morph_side>
     14e.                         <morph_side> # <morph_side> <boundary>
     
     15a. <morph_side>        ::= <morph_item>
     15b.                         <morph_item> <morph_side>
     15c.                         <morph_item> ... <morph_side>
     
     16a. <morph_item>        ::= <morph_piece>
     16b.                         ( <morph_piece> )
     
     17a. <morph_piece>       ::= ~ <morph_piece>
     17b.                         <literal>
     17c.                         [ <literal> ]
     17d.                         { <literal> }
     
     18a. <punct_constraint>  ::= ./ <envbar> <punct_right>
     18b.                         ./ <punct_left> <envbar>
     18c.                         ./ <punct_left> <envbar> <punct_right>
     
     19a. <punct_left>        ::= <punct_side>
     19b.                         <boundary>
     19c.                         <boundary> <punct_side>
     
     20a. <punct_right>       ::= <punct_side>
     20b.                         <boundary>
     20c.                         <punct_side> <boundary>
     
     21a. <punct_side>        ::= <punct_item>
     21b.                         <punct_item> <punct_side>
     
     22a. <punct_item>        ::= <punct_piece>
     22b.                         ( <punct_piece> )
     
     23a. <punct_piece>       ::= ~ <punct_piece>
     23b.                         <literal>
     23c.                         [ <literal> ]
     
     24a. <envbar>            ::= _
     24b.                         ~_
     
     25a. <boundary>          ::= #
     25b.                         ~#
     
     26a. <redup_pattern>     ::= [ <indexed_literal> ]
     26b.                         <literal> [ <indexed_literal> ]
     26c.                         [ <indexed_literal> ] <literal>
     26d.                         [ <indexed_literal> ] <redup_pattern>
     26e.                         <redup_pattern> [ <indexed_literal> ]
     
     27.  <indexed_literal>   ::= <literal> ^ <number>
     
     28.  <literal>           ::= one or more contiguous characters
     
     29.  <comment_char>      ::= character defined by `-c' command
                                  line option, or `|' by default
     
     30.  <number>            ::= one or more contiguous digits (0-9)


Comments on selected BNF rules
..............................

2.
     The (first) literal string is a surface form representation of the
     morpheme.  The literal string enclosed in braces is a unique
     allomorph identification string.  (The identification string is a
     feature added to support LinguaLinks.  It is not stored unless the
     `-b' command line option is used.)

3.
     Each literal string is an allomorph property defined by a `\ap'
     field in the analysis data file.

4.
     String, punctuation and morpheme constraints can be mixed
     together, but it is recommended that you group the string
     constraints together, the punctuation constraints together and the
     morpheme constraints together.

5.
     A comment begins with a specified character and ends with the end
     of the line.

7-8.
     Note that what can appear to the left of the environment bar is a
     mirror image of what can appear to the right.

7de.
8de.
     These should be avoided, and other means used to prune analyses
     based on adjacent words.

9c.
     An ellipsis (`...') indicates a possible break in contiguity.

10b.
     Something enclosed in parentheses is optional.

11a.
     A tilde (`~') reverses the desirability of an element, causing the
     constraint to fail if it is found rather than fail if it is not
     found.

11b.
     A literal is matched against the surface form of the word.

11c.
     A literal enclosed in square brackets must be the name of a string
     class defined by a `\scl' field in the analysis data file or the
     dictionary orthography change table file.

11d.
     The indexed literal enclosed in square brackets must match an
     indexed literal given as part of the reduplication allomorph
     pattern.  (See 2c, 2d, and 26.)

13-14.
     Note that what can appear to the left of the environment bar is a
     mirror image of what can appear to the right.

13de.
14de.
     These should be avoided, and other means used to prune analyses
     based on adjacent words.

15c.
     An ellipsis (`...') indicates a possible break in contiguity.

16b.
     Something enclosed in parentheses is optional.

17a.
     A tilde (`~') reverses the desirability of an element, causing the
     constraint to fail if it is found rather than fail if it is not
     found.

17b.
     A literal is a morphname from one of the dictionary files.

17c.
     A literal enclosed in square brackets must be the name of a
     morpheme class defined by a `\mcl' field in the analysis data file.

17d.
     A literal enclosed in curly braces must be one of the following
     (checked in this order):
       1. one of the keywords `root', `prefix', `infix', or `suffix'

       2. a property name defined by an `\ap' or `\mp' field in the
          analysis data file

       3. a category name defined by a `\ca' field in the analysis data
          file

       4. a category class name defined by a `\ccl' field in the
          analysis data file

       5. a morpheme class name defined by a `\mcl' field in the
          analysis data file

19-20.
     Note that what can appear to the left of the environment bar is a
     mirror image of what can appear to the right.

22b.
     Something enclosed in parentheses is optional.

23a.
     A tilde (`~') reverses the desirability of an element, causing the
     constraint to fail if it is found rather than fail if it is not
     found.

23b.
     A literal is a punctuation character.

     The punctuation characters can match punctuation characters either
     before or after the current word.  Unlike string constraints,
     punctuation constraints effectively ignore the position of the
     conditioned allomorph within the word.  All that matters are any
     punctuation characters immediately preceding or following the
     current word.  Further note that neither ellipsis nor cross word
     boundary conditions are allowed.

24.
     A tilde (`~') attached to the environment bar inverts the sense of
     the constraint as a whole.

25b.
     The boundary marker preceded by a tilde (`~#') indicates that it
     must not be a word boundary.

26-27.
     Although the BNF has spaces in it to improve readability, these
     two items cannot have embedded spaces in the dictionary file.

26.
     The reduplication allomorph pattern contains references to string
     classes and possibly literal strings.  The string class names are
     indexed to indicate identical shared values, either in the string
     environment constraint or in more than one location in the
     reduplication allomorph pattern itself.  *Note: this has been
     implemented only for AMPLE at this point.*

27.
     The literal (without the following index given by an ASCII caret
     (`^') and a number) must be the name of a string class defined by a
     `\scl' field in the analysis data file or the dictionary
     orthography change table file.

28.
     The special characters used by environment constraints can be
     included in a literal only if they are immediately preceded by a
     backslash:

          \+  \/  \#  \~  \[  \]  \(  \)  \{  \}  \.  \_  \\

The allomorph field is used in all types of dictionary entries: prefix,
infix, suffix, and root.
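
The escaping convention given under rule 28 above lends itself to a
small helper when dictionary entries are generated by a script.  The
following Python sketch is one possible way to do it; the function name
is hypothetical, and the set of special characters is copied from the
list under rule 28.

```python
# The thirteen characters listed under rule 28.
SPECIAL = set("+/#~[](){}._\\")


def escape_literal(text):
    """Prefix each special character with a backslash so that the text
    can be used as a literal in an environment constraint."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


print(escape_literal("a.b_c"))  # a\.b\_c
```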

Category (internal code C)
==========================

Each dictionary entry must contain a category field.  If multiple
category fields exist, then their contents are merged together.

For affix entries, this field must contain at least one category pair
for the morpheme, but may contain any number of category pairs
separated by spaces or tabs.  Each category pair consists of two
category names separated by a slash (`/').  The category names must
have been defined by a `\ca' field in the analysis data file.  The
first category is the "from category", that is, the category of the
unit to which this morpheme can be affixed.  The second category is the
"to category", that is, the category of the result after this morpheme
has been affixed.
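
The structure of an affix category field can be made concrete with a
Python sketch that splits the field into (from, to) pairs and checks
each name against the declared categories.  The function name and the
category names `N' and `V' are hypothetical, and raising an error for a
bad field is an assumption about reasonable behaviour, not a
description of what STAMP does.

```python
def parse_category_pairs(field, declared):
    """Split an affix category field into (from_category, to_category)
    pairs, checking every name against the declared categories."""
    pairs = []
    for token in field.split():          # pairs are separated by spaces or tabs
        from_cat, slash, to_cat = token.partition("/")
        if not slash or not from_cat or not to_cat:
            raise ValueError("not a category pair: " + token)
        for category in (from_cat, to_cat):
            if category not in declared:
                raise ValueError("undeclared category: " + category)
        pairs.append((from_cat, to_cat))
    if not pairs:
        raise ValueError("an affix needs at least one category pair")
    return pairs


print(parse_category_pairs("N/V\tV/V", {"N", "V"}))  # [('N', 'V'), ('V', 'V')]
```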

For root entries, this field contains one or more morphological
categories as defined by a `\ca' field in the analysis data file.  If
multiple categories are listed, they should be separated by spaces or
tabs.

The category field is used in all types of dictionary entries: prefix,
infix, suffix, and root.

Elsewhere Allomorph (internal code E)
=====================================

WRITE ME!

The elsewhere allomorph field is used in all types of dictionary
entries: prefix, infix, suffix, and root.

Infix location (internal code L)
================================

The infix location field serves to restrict where infixes may be found,
and must be included in each infix dictionary entry.  Subject to the
constraints imposed by the infix location field, STAMP searches the
rest of the word for any occurrence of any allomorph string of the
infix.  This makes infixes rather expensive, computationally, so they
should be constrained as much as possible.


      1.  <infix_location> ::= <types> <constraints>
     
      2a. <types>          ::= <type>
      2b.                      <type> <types>
     
      3a. <constraints>    ::= <environment>
      3b.                      <environment> <constraints>
     
      4a. <environment>    ::= <marker> <leftside> <envbar> <rightside>
      4b.                      <marker> <leftside> <envbar>
      4c.                      <marker> <envbar> <rightside>
     
      5a. <leftside>       ::= <side>
      5b.                      <boundary>
      5c.                      <boundary> <side>
     
      6a. <rightside>      ::= <side>
      6b.                      <boundary>
      6c.                      <side> <boundary>
     
      7a. <side>           ::= <item>
      7b.                      <item> <side>
      7c.                      <item> ... <side>
     
      8a. <item>           ::= <piece>
      8b.                      ( <piece> )
     
      9a. <piece>          ::= ~ <piece>
      9b.                      <literal>
      9c.                      [ <literal> ]
     
     10a. <type>           ::= prefix
     10b.                      root
     10c.                      suffix
     
     11a. <marker>         ::= /
     11b.                      +/
     
     12a. <envbar>         ::= _
     12b.                      ~_
     
     13a. <boundary>       ::= #
     13b.                      ~#
     
     14.  <literal>        ::= one or more contiguous characters


Comments on selected BNF rules
..............................

2.
     The first part of the infix location field lists the type of
     morpheme in which the infix may be hidden.  This consists of one
     or more of the words `prefix', `root', or `suffix'.  If `prefix'
     is given, then STAMP looks for infixes after exhausting the
     possible prefixes at a given point in the word, and resumes
     looking for more prefixes after finding an infix.  Similarly, if
     `root' is given, then STAMP looks for infixes after running out of
     roots while parsing the word, and if it finds an infix, it looks
     for more roots.  Suffixes are treated the same way if `suffix' is
     given in the infix location field.

5.
     A boundary marker (`#') on the left side of the environment bar
     refers to the place in the word which the parse has reached before
     looking for infixes, not to the beginning of the word.

6.
     A boundary marker (`#') on the right side of the environment bar
     refers to the end of the word.

7c.
     An ellipsis (`...') indicates a possible break in contiguity.

8b.
     Something enclosed in parentheses is optional.

9a.
     A tilde (`~') reverses the desirability of an element, causing the
     constraint to fail if it is found rather than fail if it is not
     found.

11.
     A `+/' is usually used for morpheme environment constraints, but
     may be used for infix location environment constraints as well.

12.
     A tilde attached to the environment bar (`~_') inverts the sense of
     the constraint as a whole.

13b.
     The boundary marker preceded by a tilde (`~#') indicates that it
     must not be a word boundary.

14.
     The special characters used by environment constraints can be
     included in a literal only if they are immediately preceded by a
     backslash:

          \+  \/  \#  \~  \[  \]  \(  \)  \{  \}  \.  \_  \\

The infix location field is used only in infix dictionary entries.

Morphname (internal code M)
===========================

A morphname is an arbitrary name for a given morpheme.  Only the first
word (string of contiguous nonspace characters) following the morphname
field code is used as the morphname.  Morphnames must be less than 64
characters long.

A morphname serves two important functions:
  1. It identifies a morpheme in morpheme environment constraints,
     morpheme co-occurrence constraints, ad hoc pairs, and tests.

  2. It is the default morpheme identifier written to the output
     analysis file.

Generally, a morphname is an identifier of a morpheme and does not need
to faithfully represent that morpheme's meaning or function.

If a dictionary entry has more than one morphname field, the morphname
from the first one is used; the others cause an error message.  The
morphname field is used in all types of dictionary entries: prefix,
infix, suffix, and root.  The usage differs somewhat between affix and
root dictionary entries, so these two types of morphnames are described
separately.

Affix morphnames
----------------

Every affix dictionary entry must have a morphname field.  Users are
strongly encouraged to observe the following suggestions in creating
affix morphnames:

  1. Make each morphname unique.  If two morphemes have the same name,
     it is impossible to refer unambiguously to them.  The same
     morphname should not be used in different affix dictionaries (that
     is, in the prefix dictionary and in the suffix dictionary).

  2. Keep morphnames short.  This reduces the size of analysis files and
     makes text glossing more aesthetically pleasing.  For example, for
     a verbal person marker, use simply `1' rather than `1P' unless
     there is good reason to add the `P' for person or possessive.  For
     a first person object marker, `1O' might serve as well as `1OBJ'.

  3. Use only uppercase alphabetic characters and numbers for contrast
     with root morphnames, which are generally made up of lowercase
     alphabetic characters.  Be cautious in using hyphens, periods,
     underscores, slashes, backslashes, or other nonalphanumeric
     characters.  The reason to avoid these is that other programs
     which apply to the resulting analysis may make use of
     nonalphanumerics in different ways.

  4. Design a syntax of names and stick to it for inflectional morphemes
     which combine more than one semantic notion.  For example, for
     Latin nominal inflections, which indicate gender, number, and
     case, the syntax might be

          MORPHNAME = GENDER CASE NUMBER

     where `GENDER' is `M' for masculine, `F' for feminine and `N' for
     neuter; `CASE' is `N' for nominative, `A' for accusative, `G' for
     genitive, and so on; and `NUMBER' is `S' for singular and `P' for
     plural.  The name for masculine nominative singular would then be
     `MNS'.
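
The naming syntax in suggestion 4 can be sketched in Python as follows.
This is only an illustration of composing names consistently; the
function name `latin_morphname' and the dictionary layout are
hypothetical, and only the codes listed above are included.

```python
# Codes taken from the Latin example above; further cases are omitted.
GENDER = {"masculine": "M", "feminine": "F", "neuter": "N"}
CASE = {"nominative": "N", "accusative": "A", "genitive": "G"}
NUMBER = {"singular": "S", "plural": "P"}


def latin_morphname(gender, case, number):
    """Compose MORPHNAME = GENDER CASE NUMBER (hypothetical helper)."""
    return GENDER[gender] + CASE[case] + NUMBER[number]


print(latin_morphname("masculine", "nominative", "singular"))  # MNS
```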

Root morphnames
---------------

Root morphnames are generally either glosses or etymologies.
Etymologies are frequently marked with a leading asterisk (`*').  (This
is used by STAMP to indicate regular sound changes.)

If the morphname field contains only an asterisk, the morphname becomes
an asterisk followed by whatever allomorph is matched.  If the
morphname field is omitted, or if it contains only a comment, STAMP
puts whatever allomorph was matched in the text into the analysis.  If
the morpheme contains any alternate forms, it is wise to include an
explicit morphname field.

Order class (internal code O)
=============================

The order class of an affix is a number indicating its position
relative to other morphemes.  Prefixes should be assigned negative
numbers and suffixes should be assigned positive numbers.  Infixes
should be assigned order class values appropriate to where they can
appear in the word relative to the prefixes and suffixes.

If the order class field is omitted, then a default value of zero (0)
is assigned to the affix.  Order class values must be between -32767
and 32767.

Order classes are used only by tests in the analysis data file.  They
are needed only if appropriate tests are written to take advantage of
them.

The order class field is used only in affix type dictionary entries:
prefix, infix, and suffix.  Roots always have an implicit order class
of zero.

Morpheme property (internal code P)
===================================

This field contains one or more morpheme properties.  These properties
must have been defined by a `\mp' field in the analysis data file.  A
morpheme property is inherited by all allomorphs of the morpheme.

The morpheme property field is optional, and may be repeated.  If
multiple properties apply to a morpheme, they may be given all in a
single field or each in a separate field.

Morpheme properties typically indicate a characteristic of the morpheme
which conditions the occurrence of allomorphs of an adjacent morpheme.
Morpheme properties are used in tests defined in the analysis data file
and in morpheme environment constraints.

The morpheme property field is used in all types of dictionary entries:
prefix, infix, suffix, and root.

Morpheme type (internal code T)
===============================

In a unified dictionary, the type of an entry is determined by the
first letter following the morpheme type field code: `p' or `P' for
prefixes, `i' or `I' for infixes, `s' or `S' for suffixes, and `r' or
`R' for roots.  The morpheme type field is not needed for root entries
because the entry type defaults to root.

The morpheme type field is used only in unified dictionary files, since
the morpheme type is otherwise implicit.

Do not load (internal code !)
=============================

When a "do not load" field is included in a record, STAMP ignores the
record altogether.  This makes it possible to include records in the
dictionary for linguistic purposes, while not needlessly taking up
memory space if the dictionary is used for some other purpose.

The "do not load" field is used in all types of dictionary entries:
prefix, infix, suffix, and root.

Text Output Control File
************************

The text output module restores a processed document from the internal
format to its textual form.  It re-imposes capitalization on words and
restores punctuation, format markers, white space, and line breaks.
Also, orthography changes can be made, and the delimiter that marks
ambiguities and failures can be changed.  This chapter describes the
control file given to the text output module.(1)

---------- Footnotes ----------

(1) This chapter is adapted from chapter 8 of Weber (1990).

Text output ambiguity delimiter: \ambig
=======================================

The text output module flags words that either produced no results or
multiple results when processed.  These are flagged with percent signs
(`%') by default, but this can be changed by declaring the desired
character with the \ambig field code.  For example, the following would
change the ambiguity delimiter to `@':
     \ambig @

Text output orthographic changes: \ch
=====================================

The text output module allows orthographic changes to be made to the
processed words.  These are given in the text output control file.

An orthography change is defined by the `\ch' field code followed by
the actual orthography change.  Any number of orthography changes may
be defined in the text output control file.  The output of each change
serves as the input to the following change.  That is, each change is
applied as many times as necessary to an input word before the next
change from the text output control file is applied.

Basic changes
-------------

To substitute one string of characters for another, these must be made
known to the program in a change.  (The technical term for this sort of
change is a production, but we will simply call them changes.)  In the
simplest case, a change is given in three parts: (1) the field code
`\ch' must be given at the extreme left margin to indicate that this
line contains a change; (2) the match string is the string for which
the program must search; and (3) the substitution string is the
replacement for the match string, wherever it is found.

The beginning and end of the match and substitution strings must be
marked.  The first printing character following `\ch' (with at least
one space or tab between) is used as the delimiter for that line.  The
match string is taken as whatever lies between the first and second
occurrences of the delimiter on the line and the substitution string is
whatever lies between the third and fourth occurrences.  For example,
the following lines indicate the change of hi to bye, where the
delimiters are the double quote mark (`"'), the single quote mark
(`''), the period (`.'), and the at sign (`@').
     \ch "hi" "bye"
     \ch 'hi' 'bye'
     \ch .hi. .bye.
     \ch @hi@ @bye@

Throughout this document, we use the double quote mark as the delimiter
unless there is some reason to do otherwise.

Change tables follow these conventions:
  1. Any characters (other than the delimiter) may be placed between the
     match and substitution strings.  This allows various notations to
     symbolize the change.  For example, the following are equivalent:
          \ch "thou" "you"
          \ch "thou" to "you"
          \ch "thou" > "you"
          \ch "thou" --> "you"
          \ch "thou" becomes "you"

  2. Comments included after the substitution string are initiated by
     the comment character, which is the vertical bar (`|') by default.
     The following lines illustrate the use of comments:
          \ch "qeki" "qiki" | for cases like wawqeki
          \ch "thou" "you"  | for modern English

  3. A change can be ignored temporarily by turning it into a comment
     field.  This is done either by placing an unrecognized field code
     in front of the normal `\ch', or by placing the comment character
     (`|') in front of it.  For example, only the first of the
     following three lines would effect a change:
          \ch "nb" "mp"
          \no \ch "np" "np"
          |\ch "mb" "nb"
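
The delimiter conventions above can be sketched in Python.  This is
only an illustration of the rules as stated here, not STAMP's own
parser: the function name `parse_change' is hypothetical, and anything
after the substitution string (comments or environment constraints) is
simply ignored.

```python
def parse_change(line):
    """Return (match, substitution) for an active \\ch line, else None."""
    text = line.strip()
    # A leading comment character or any other field code disables the change.
    if not text.startswith("\\ch") or not text[3:4].isspace():
        return None
    rest = text[3:].lstrip()
    if not rest:
        return None
    delim = rest[0]  # first printing character after \ch
    parts = rest.split(delim)
    # parts: ['', match, filler, substitution, trailing text, ...]
    if len(parts) < 5:
        return None
    return parts[1], parts[3]


print(parse_change('\\ch "thou" --> "you"'))   # ('thou', 'you')
print(parse_change('|\\ch "mb" "nb"'))         # None
```

Because the filler between the second and third delimiters is
discarded, notations such as `to', `>', and `-->' all yield the same
pair of strings.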

The changes in the text output control file are applied as an ordered
set of changes.  The first change is applied to the entire word by
searching from left to right for any matching strings and, upon finding
any, replacing them with the substitution string.  After the first
change has been applied to the entire word, then the next change is
applied, and so on.  Thus, each change applies to the result of all
prior changes.  When all the changes have been applied, the resulting
word is returned.  For example, suppose we have the following changes:
     \ch "aib" > "ayb"
     \ch "yb"  > "yp"

Consider the effect these have on the word paiba.  The first changes i
to y, yielding payba; the second changes b to p, to yield paypa.  (This
would be better than the single change of aib to ayp if there were
sources of yb other than the output of the first rule.)

The way in which change tables are applied allows certain tricks.  For
example, suppose that for Quechua, we wish to change hw to f, so that
hwista becomes fista and hwis becomes fis.  However, we do not wish to
change the sequence shw or chw to sf or cf (respectively).  This could
be done by the following sequence of changes. (Note, `@' and `$' are
not otherwise used in the orthography.)
     \ch "shw" > "@"      | (1)
     \ch "chw" > "$"      | (2)
     \ch "hw"  > "f"      | (3)
     \ch "@"   > "shw"    | (4)
     \ch "$"   > "chw"    | (5)

Lines (1) and (2) protect the sequences shw and chw by changing them to
distinguished symbols.  This clears the way for the change of hw to f
in (3).  Then lines (4) and (5) restore `@' and `$' to shw and chw,
respectively. (An alternative, simpler way to do this is discussed in
the next section.)
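
The ordered application of changes can be approximated in Python with
successive calls to `str.replace'.  This is a minimal sketch, assuming
that one left-to-right pass per change is an adequate model; the
function name `apply_changes' is hypothetical, and the form ashwa is a
made-up test string.

```python
def apply_changes(word, changes):
    """Apply each (match, substitution) pair, in order, to the whole word."""
    for match, substitution in changes:
        word = word.replace(match, substitution)  # left-to-right scan
    return word


vowel_changes = [("aib", "ayb"), ("yb", "yp")]
protect_changes = [("shw", "@"), ("chw", "$"), ("hw", "f"),
                   ("@", "shw"), ("$", "chw")]

print(apply_changes("paiba", vowel_changes))     # paypa
print(apply_changes("hwista", protect_changes))  # fista
print(apply_changes("ashwa", protect_changes))   # ashwa
```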

Environmentally constrained changes
-----------------------------------

It is possible to impose string environment constraints (SECs) on
changes in the orthography change tables.  The syntax of SECs is
given in BNF notation in `Syntax of Orthography Changes' below.

For example, suppose we wish to change the mid vowels (e and o) to high
vowels (i and u respectively) immediately before and after q.  This
could be done with the following changes:
     \ch "o" "u"  / _ q  / q _
     \ch "e" "i"  / _ q  / q _

This is not entirely a hypothetical example; some Quechua practical
orthographies write the mid vowels e and o.  However, in the
environment of /q/ these could be considered phonemically high vowels
/i/ and /u/.  Changing the mid vowels to high upon loading texts has
the advantage that, for cases like upun "he drinks" and upoq "the one
who drinks", the root needs to be represented internally only as upu
"drink".  But note, because of Spanish loans, it is not possible to
change all cases of e to i and o to u.  The changes must be conditioned.

In reality, the regressive vowel-lowering effect of /q/ can pass over
various intervening consonants, including /y/, /w/, /l/, /ll/, /r/,
/m/, /n/, and /n~/.  For example, /ullq/ becomes ollq, /irq/ becomes erq,
and so on.  Rather than list each of these cases as a separate
constraint, it is convenient to define a class (which we label
`+resonant') and use this class to simplify the SEC.  Note that the
string class must be defined (with the `\scl' field code) before it is
used in a constraint.
     \scl +resonant y w l ll r m n n~
     \ch "o" "u" / q _ / _ ([+resonant]) q
     \ch "e" "i" / q _ / _ ([+resonant]) q

This says that the mid vowels become high vowels after /q/ and before
/q/, possibly with an intervening /y/, /w/, /l/, /ll/, /r/, /m/, /n/,
or /n~/.
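
For readers more familiar with regular expressions, the two constrained
changes can be approximated in Python as shown below.  This is a sketch
under the assumption that lookbehind and lookahead assertions are a
fair stand-in for the environments `/ q _' and `/ _ ([+resonant]) q';
it is not how STAMP itself evaluates constraints, and the function name
`raise_mid_vowels' is hypothetical.

```python
import re

# Members of the +resonant class, with two-character members listed first.
RESONANTS = ["ll", "n~", "y", "w", "l", "r", "m", "n"]
RESONANT = "(?:" + "|".join(re.escape(r) for r in RESONANTS) + ")"


def raise_mid_vowels(word):
    """Change o > u and e > i after q, or before q with an optional resonant."""
    for mid, high in (("o", "u"), ("e", "i")):
        pattern = f"(?<=q){mid}|{mid}(?={RESONANT}?q)"
        word = re.sub(pattern, high, word)
    return word


print(raise_mid_vowels("upoq"))  # upuq
print(raise_mid_vowels("ollq"))  # ullq
print(raise_mid_vowels("como"))  # como
```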

Consider the problem posed for Quechua in the previous section, that of
changing hw to f.  An alternative is to condition the change so that it
does not apply immediately after a member of the string class `Affric',
which contains s and c.
     \scl Affric c s
     \ch "hw" "f" / [Affric] ~_
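
The negated environment bar can be compared to a negative lookbehind in
a regular expression.  The Python sketch below assumes that
correspondence for illustration only; the function name `hw_to_f' is
hypothetical, and ashwa and achwa are made-up test strings.

```python
import re


def hw_to_f(word):
    """Change hw to f unless c or s immediately precedes it."""
    # (?<![cs]) plays the role of "/ [Affric] ~_" in the change above.
    return re.sub(r"(?<![cs])hw", "f", word)


print(hw_to_f("hwista"))  # fista
print(hw_to_f("ashwa"))   # ashwa
```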

It is sometimes convenient to make certain changes only at word
boundaries, that is, to change a sequence of characters only if they
initiate or terminate the word.  This conditioning is easily expressed,
as shown in the following examples.
     \ch "this" "that"           | anywhere in the word
     \ch "this" "that"  / # _    | only if word initial
     \ch "this" "that"  /   _ #  | only if word final
     \ch "this" "that"  / # _ #  | only if entire word

Using text orthography changes
------------------------------

The purpose of orthography change is to convert text from an external
orthography to an internal representation more suitable for
morphological analysis.  In many cases this is unnecessary, the
practical orthography being completely adequate as the internal
representation.  In other cases, the practical orthography is an
inconvenience that can be circumvented by converting to a more phonemic
representation.

Let us take a simple example from Latin.  In the Latin orthography, the
nominative singular masculine of the word "king" is rex.  However,
phonemically, this is really /reks/; /rek/ is the root meaning king and
the /s/ is an inflectional suffix.  If the program is to recover such
an analysis, then it is necessary to convert the x of the external,
practical orthography into ks internally.  This can be done by
including the following orthography change in the text output control
file:
     \ch  "x"  "ks"

In this, x is the match string and ks is the substitution string, as
discussed under `Basic changes' above.  Whenever x is found, ks is
substituted for it.

Let us consider next an example from Huallaga Quechua.  The practical
orthography currently represents long vowels by doubling the vowel.
For example, what is written as kaa is /ka:/ "I am", where the length
(represented by a colon) is the morpheme meaning "first person
subject".  Other examples, such as upoo /upu:/ "I drink" and upichee
/upi-chi-:/ "I extinguish", motivate us to convert all long vowels into
a vowel followed by a colon.  The following changes do this:
     \ch  "aa"  "a:"
     \ch  "ee"  "i:"
     \ch  "ii"  "i:"
     \ch  "oo"  "u:"
     \ch  "uu"  "u:"

Note that the long high vowels (i and u) have become mid vowels (e and
o respectively); consequently, the vowel in the substitution string is
not necessarily the same as that of the match string.  What is the
utility of these changes?  In the lexicon, the morphemes can be
represented in their phonemic forms; they do not have to be represented
in all their orthographic variants.  For example, the first person
subject morpheme can be represented simply as a colon (-:), rather than
as -a in cases like kaa, as -o in cases like qoo, and as -e in cases
like upichee.  Further, the verb "drink" can be represented as upu and
the causative suffix (in upichee) can be represented as -chi; these are
the forms these morphemes have in other (nonlowered) environments.

As the next example, let us suppose that we are analyzing Spanish, and
that we wish to work internally with k rather than c (before a, o, and
u) and qu (before i and e). (Of course, this is probably not the only
change we would want to make.)  Consider the following changes:
     \ch  "ca"  "ka"
     \ch  "co"  "ko"
     \ch  "cu"  "ku"
     \ch  "qu"  "k"

The first three handle c and the last handles qu.  By virtue of
including the vowel after c, we avoid changing ch to kh.  There are
other ways to achieve the same effect.  One way exploits the fact that
each change is applied to the output of all previous changes.  Thus, we
could first protect ch by changing it to some distinguished character
(say `@'), then change c to k, and then restore `@' to ch:
     \ch  "ch"  "@"
     \ch  "c"  "k"
     \ch  "@"  "ch"
     \ch  "qu"  "k"

Another approach conditions the change by the adjacent characters.  The
changes could be rewritten as
     \ch  "c"  "k"  / _a  / _o  / _u  | only before a, o, or u
     \ch  "qu"  "k"                   | in all cases

The first change says, "change c to k when followed by a, o, or u."
(This would, for example, change como to komo, but would not affect
chal.)  The syntax of such conditions is exactly that used in string
environment constraints; see `Syntax of Orthography Changes' below.

Where orthography changes apply
-------------------------------

Input orthography changes are made when the text being processed may be
written in a practical orthography.  Rather than requiring that it be
converted as a prerequisite to running the program, it is possible to
have the program convert the orthography as it loads and before it
processes each word.

The changes loaded from the text output control file are applied after
all the text is converted to lower case (and the information about
upper and lower case, along with information about format marking,
punctuation and white space, has been put to one side).  Consequently,
the match strings of these orthography changes should be all lower
case; any change that has an uppercase character in the match string
will never apply.

A sample orthography change table
---------------------------------

We include here the entire input orthography change table for Caquinte
(a language of Peru).  There are basically four changes that need to be
made: (1) nasals, which in the practical orthography reflect their
assimilation to the point of articulation of a following noncontinuant,
must be changed into an unspecified nasal, represented by N; (2) c and
qu are changed to k; (3) j is changed to h; and (4) gu is changed to g
before i and e.

     \ch  "mp"  "Np"     | for unspecified nasals
     \ch  "nch" "Nch"
     \ch  "nc"  "Nk"
     \ch  "nqu" "Nk"
     \ch  "nt"  "Nt"
     
     \ch  "ch"  "@"      | to protect ch
     \ch  "c"   "k"      | other c's to k
     \ch  "@"   "ch"     | to restore ch
     \ch  "qu"  "k"
     
     \ch  "j"   "h"
     
     \ch  "gue" "ge"
     \ch  "gui" "gi"

This change table can be simplified by the judicious use of string
environment constraints:

     \ch  "m"  >  "N"  / _p
     \ch  "n"  >  "N"  / _c  / _t  / _qu
     
     \ch  "c"  >  "k"  / _~h
     \ch  "qu" >  "k"
     
     \ch  "j"  >  "h"
     
     \ch  "gu" >  "g"  / _e  / _i

As suggested by the preceding examples, the text orthography change
table is composed of all the `\ch' fields found in the text output
control file.  These may appear anywhere in the file relative to the
other fields.  It is recommended that all the orthography changes be
placed together in one section of the text output control file, rather
than being mixed in with other fields.
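
One way to gain confidence that a simplified table behaves like the
original is to run both over the same strings.  The Python sketch below
does this for the two Caquinte tables, rendering each constrained
change as a regular expression with lookahead.  That rendering is an
assumption made for illustration, the function names are hypothetical,
and the sample strings are made-up letter sequences rather than
attested Caquinte words.

```python
import re

# The first table, as plain ordered substitutions.
PLAIN = [("mp", "Np"), ("nch", "Nch"), ("nc", "Nk"), ("nqu", "Nk"),
         ("nt", "Nt"), ("ch", "@"), ("c", "k"), ("@", "ch"), ("qu", "k"),
         ("j", "h"), ("gue", "ge"), ("gui", "gi")]

# The simplified table; each environment becomes a lookahead assertion.
CONSTRAINED = [(r"m(?=p)", "N"), (r"n(?=c|t|qu)", "N"),
               (r"c(?!h)", "k"), (r"qu", "k"),
               (r"j", "h"), (r"gu(?=e|i)", "g")]


def with_plain(word):
    for match, substitution in PLAIN:
        word = word.replace(match, substitution)
    return word


def with_constraints(word):
    for pattern, substitution in CONSTRAINED:
        word = re.sub(pattern, substitution, word)
    return word


SAMPLES = ["ampa", "anchi", "anca", "anqui", "anta", "chaca", "jogue", "guita"]
for sample in SAMPLES:
    assert with_plain(sample) == with_constraints(sample)
print(with_plain("chaca"))  # chaka
```

Agreement on a handful of strings does not prove that the two tables
are equivalent in general; it only guards against obvious slips.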

Syntax of Orthography Changes
-----------------------------

This section presents a grammatical description of the syntax of
orthography changes in BNF notation.


      1a. <orthochange>  ::= <basic_change>
      1b.                    <basic_change> <constraints>
     
      2a. <basic_change> ::= <quote><quote> <quote><string><quote>
      2b.                    <quote><string><quote> <quote><quote>
      2c.                    <quote><string><quote> <quote><string><quote>
     
      3.  <quote>        ::= any printing character not used in either
                             the ``from'' string or the ``to'' string
     
      4.  <string>       ::= one or more characters other than the quote
                             character used by this orthography change
     
      5a. <constraints>  ::= <change_envir>
      5b.                    <change_envir> <constraints>
     
      6a. <change_envir> ::= <marker> <leftside> <envbar> <rightside>
      6b.                    <marker> <leftside> <envbar>
      6c.                    <marker> <envbar> <rightside>
     
      7a. <leftside>   ::= <side>
      7b.                  <boundary>
      7c.                  <boundary> <side>
     
      8a. <rightside>  ::= <side>
      8b.                  <boundary>
      8c.                  <side> <boundary>
     
      9a. <side>       ::= <item>
      9b.                  <item> <side>
      9c.                  <item> ... <side>
     
     10a. <item>       ::= <piece>
     10b.                  ( <piece> )
     
     11a. <piece>      ::= ~ <piece>
     11b.                  <literal>
     11c.                  [ <literal> ]
     
     12a. <marker>     ::= /
     12b.                  +/
     
     13a. <envbar>     ::= _
     13b.                  ~_
     
     14a. <boundary>   ::= #
     14b.                  ~#
     
     15.  <literal>    ::= one or more contiguous characters


Comments on selected BNF rules
..............................

2.
     The same `<quote>' character must be used at both the beginning
     and the end of both the "from" string and the "to" string.

3.
     The double quote (`"') and single quote (`'') characters are most
     often used.

7-8.
     Note that what can appear to the left of the environment bar is a
     mirror image of what can appear to the right.

9c.
     An ellipsis (`...') indicates a possible break in contiguity.

10b.
     Something enclosed in parentheses is optional.

11a.
     A tilde (`~') reverses the desirability of an element, causing the
     constraint to fail if it is found rather than fail if it is not
     found.

11c.
     A literal enclosed in square brackets must be the name of a string
     class defined by a `\scl' field in the analysis data file, or
     earlier in the dictionary orthography change file.

12.
     A `+/' is usually used for morpheme environment constraints, but
     may also be used for change environment constraints in `\ch' fields
     in the dictionary orthography change table file.

13.
     A tilde attached to the environment bar (`~_') inverts the sense of
     the constraint as a whole.

14b.
     The boundary marker preceded by a tilde (`~#') indicates that it
     must not be a word boundary.

15.
     The special characters used by environment constraints can be
     included in a literal only if they are immediately preceded by a
     backslash:

          \+  \/  \#  \~  \[  \]  \(  \)  \.  \_  \\

Decomposition Separation Character: \dsc
========================================

The `\dsc' field defines the character used to separate the morphemes
in the decomposition field of the input analysis file.  For example, to
use the equal sign (`='), the text output control file would include:

     \dsc  =

This would handle a decomposition field like the following:

     \d %3%kay%ka=y%ka=y%

It makes sense to use the `\dsc' field only once in the text output
control file.  If multiple `\dsc' fields do occur in the file, the
value given in the first one is used.  If the text output control file
does not have a `\dsc' field, a dash (`-') is used.
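
The effect of the separator on the sample field can be sketched in
Python.  The layout assumed here (an ambiguity delimiter, a leading
number, then the alternatives) is inferred only from the example above,
and the function name `split_decomposition' is hypothetical.

```python
def split_decomposition(field, ambig="%", dsc="-"):
    """Split a decomposition field value into lists of morphemes."""
    if not field.startswith(ambig):
        return [field.split(dsc)]  # a single, unambiguous decomposition
    parts = field.strip(ambig).split(ambig)
    alternatives = parts[1:]  # parts[0] is the leading number in the example
    return [alternative.split(dsc) for alternative in alternatives]


print(split_decomposition("%3%kay%ka=y%ka=y%", dsc="="))
# [['kay'], ['ka', 'y'], ['ka', 'y']]
```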

The first printing character following the `\dsc' field code is used as
the morpheme decomposition separator character.  The same character
cannot be used both for separating decomposed morphemes in the analysis
output file and for marking comments in the output control files.  Thus,
one normally cannot use the vertical bar (`|') as the decomposition
separation character.

This field is provided for use by the INTERGEN program.  It is of little
use to STAMP.

Primary format marker character: \format
========================================

The `\format' field designates a single character to flag the beginning
of a primary format marker.  For example, if the format markers in the
text files begin with the at sign (`@'), the following would be placed
in the text input control file:

     \format  @

This would be used, for example, if the text contained format markers
like the following:

     @
     @p
     @sp
     @make(Article)
     @very-long.and;muddled/format*marker,to#be$sure


If a `\format' field occurs in the text input control file without a
following character to serve for flagging format markers, then the
program will not recognize any format markers and will try to parse
everything other than punctuation characters.

It makes sense to use the `\format' field only once in the text input
control file.  If multiple `\format' fields do occur in the file, the
value given in the first one is used.

The first printing character following the `\format' field code is used
to flag format markers.  The character currently used to mark comments
cannot be assigned to also flag format markers.  Thus, the
vertical bar (`|') cannot normally be used to flag format markers.

This field is provided for use by the INTERGEN program.  It is of little
use to STAMP.

Lowercase/uppercase character pairs: \luwfc
===========================================

To break a text into words, the program needs to know which characters
are used to form words.  It always assumes that the letters `A' through
`Z' and `a' through `z' are used as word formation characters.  If the
orthography of the language the user is working in uses any other
characters that have lowercase and uppercase forms, these must be given
in a `\luwfc' field in the text input control file.

The `\luwfc' field defines pairs of characters; the first member of
each pair is a lowercase character and the second is the corresponding
uppercase character.  Several such pairs may be placed in the field or
they may be placed on separate fields.  Whitespace may be interspersed
freely.  For example, the following three examples are equivalent:

     \luwfc  éÉ ñÑ
or

     \luwfc  éÉ      | e with acute accent
     \luwfc  ñÑ      | enyee

or

     \luwfc  é É  ñ Ñ

Note that comments can be used as well (just as they can in any STAMP
control file).  This means that the comment character cannot be
designated as a word formation character.  If the orthography includes
the vertical bar (`|'), then a different comment character must be
defined with the `-c' command line option when STAMP is initiated; see
`STAMP Command Options' above.

The `\luwfc' field can be entered anywhere in the text input control
file, although a natural place would be before the `\wfc' (word
formation character) field.

Any standard alphabetic character (that is `a' through `z' or `A'
through `Z') in the `\luwfc' field will override the standard lower-
upper case pairing.  For example, the following will treat `X' as the
upper case equivalent of `z':

     \luwfc z X

Note that `Z' will still have `z' as its lower-case equivalent in this
case.

The `\luwfc' field is allowed to map multiple lower case characters to
the same upper case character, and vice versa.  This is needed for
languages that do not mark tone on upper case letters.
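
The pairing rule for `\luwfc' fields can be sketched in Python.  The
sketch assumes one character per member and uses e with acute accent
and enyee, the characters named in the comments of the examples above;
the function name `parse_luwfc' is hypothetical.

```python
def parse_luwfc(field_body, comment_char="|"):
    """Return [(lower, upper), ...] from the text following \\luwfc."""
    body = field_body.split(comment_char, 1)[0]  # drop any trailing comment
    chars = "".join(body.split())                # whitespace is insignificant
    if len(chars) % 2:
        raise ValueError("unpaired character in \\luwfc field")
    return [(chars[i], chars[i + 1]) for i in range(0, len(chars), 2)]


one_field = parse_luwfc("éÉ ñÑ")
two_fields = (parse_luwfc("éÉ      | e with acute accent")
              + parse_luwfc("ñÑ      | enyee"))
spaced = parse_luwfc("é É  ñ Ñ")
print(one_field == two_fields == spaced)  # True
```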

Multibyte lowercase/uppercase character pairs: \luwfcs
======================================================

The `\luwfcs' field extends the character pair definitions of the
`\luwfc' field to multibyte character sequences.  Like the `\luwfc'
field, the `\luwfcs' field defines pairs of characters; the first
member of each pair is a multibyte lowercase character and the second
is the corresponding multibyte uppercase character.  Several such pairs
may be placed in the field or they may be placed on separate fields.
Whitespace separates the members of each pair, and the pairs from each
other.  For example, the following three examples are equivalent:

     \luwfcs  e' E` n~ N^ ç C&
or

     \luwfcs  e' E`      | e with acute accent
     \luwfcs  n~ N^      | enyee
     \luwfcs  ç  C&      | c cedilla

or

     \luwfcs  e' E`
              n~ N^
              ç  C&


Note that comments can be used as well (just as they can in any STAMP
control file).  This means that the comment character cannot be
designated as a word formation character.  If the orthography includes
the vertical bar (`|'), then a different comment character must be
defined with the `-c' command line option when STAMP is initiated; see
`STAMP Command Options' above.

Also note that there is no requirement that the lowercase form be the
same length (number of bytes) as the uppercase form.  The examples shown
above are only one or two bytes (character codes) in length, but there
is no limit placed on the length of a multibyte character.

The `\luwfcs' field can be entered anywhere in the text input control
file.  `\luwfcs' fields may be mixed with `\luwfc' fields in the same
file.

Any standard alphabetic character (that is `a' through `z' or `A'
through `Z') in the `\luwfcs' field will override the standard lower-
upper case pairing.  For example, the following will treat `X' as the
upper case equivalent of `z':

     \luwfcs z X

Note that `Z' will still have `z' as its lowercase equivalent in this
case.

The `\luwfcs' field is allowed to map multiple multibyte lowercase
characters to the same multibyte uppercase character, and vice versa.
This may be useful in some situations, but it introduces an element of
ambiguity into the decapitalization and recapitalization processes.  If
ambiguous capitalization is supported, then for the previous example,
`z' will have both `X' and `Z' as uppercase equivalents, and `X' will
have both `x' and `z' as lowercase equivalents.
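
The ambiguous case can be pictured as a pair of many-to-many tables.
The following Python sketch models only that case, in which the
standard pairings are kept alongside the pairs from the field.  The
names `build_case_maps', `upper_of', and `lower_of' are hypothetical
and do not come from STAMP.

```python
import string
from collections import defaultdict


def build_case_maps(extra_pairs):
    """Build lower->upper and upper->lower tables that allow ambiguity."""
    upper_of = defaultdict(set)
    lower_of = defaultdict(set)
    # Start from the standard a-z / A-Z pairing, then add the extra pairs.
    standard = list(zip(string.ascii_lowercase, string.ascii_uppercase))
    for lower, upper in standard + list(extra_pairs):
        upper_of[lower].add(upper)
        lower_of[upper].add(lower)
    return upper_of, lower_of


# The pair from the example field "\luwfcs z X".
upper_of, lower_of = build_case_maps([("z", "X")])
```

With these tables, `upper_of["z"]' holds both `X' and `Z', and
`lower_of["X"]' holds both `x' and `z'.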

Text output string classes: \scl
================================

A string class is defined by the `\scl' field code followed by the
class name, which is followed in turn by one or more contiguous
character strings or (previously defined) string class names.  A string
class name used as part of the class definition must be enclosed in
square brackets.  For example, the sample text output control file
given below contains the following lines:

     a. \scl X t s c
     b. \ch "h"   "j"   / [X] ~_

Line a defines a string class including t, s, and c; change rule b
makes use of this class to block the change of h to j when it occurs in
the digraphs th, sh, and ch.
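
As a rough illustration of what rule b accomplishes, the Python sketch
below changes h to j unless the h is immediately preceded by a member
of class X.  It mimics only the behavior described above with a regular
expression and is not how STAMP evaluates change rules; the function
name `apply_h_to_j' is hypothetical.

```python
import re


def apply_h_to_j(word, class_members=("t", "s", "c")):
    """Change h to j except right after a member of the string class."""
    # One negative lookbehind per class member, so members may differ in length.
    blockers = "".join("(?<!" + re.escape(m) + ")" for m in class_members)
    return re.sub(blockers + "h", "j", word)


result = apply_h_to_j("thaha")
```

Here the first h follows t and is left alone, while the second h
follows a vowel and is changed.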

The class name must be a single, contiguous sequence of printing
characters.  Characters and words which have special meanings in tests
should not be used.  The actual character strings have no such
restrictions.  The individual members of the class are separated by
spaces, tabs, or newlines.

Each `\scl' field defines a single string class.  Any number of `\scl'
fields may appear in the file.  The only restriction is that a string
class must be defined before it is used.
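
The definition rules can be sketched in Python as follows.  The sketch
assumes that comments have already been removed from the field, and the
names `define_string_class' and `classes', as well as the class `Y',
are hypothetical.

```python
def define_string_class(classes, field_body):
    """Add the class defined by the body of one \\scl field to `classes`."""
    name, *items = field_body.split()
    members = []
    for item in items:
        if item.startswith("[") and item.endswith("]"):
            # A bracketed item names a previously defined string class.
            ref = item[1:-1]
            if ref not in classes:
                raise KeyError("string class %r used before definition" % ref)
            members.extend(classes[ref])
        else:
            members.append(item)
    classes[name] = members
    return classes


classes = {}
define_string_class(classes, "X t s c")
define_string_class(classes, "Y [X] qu")   # Y reuses the members of X
```

Because a referenced class is expanded when the new class is defined, a
reference to a class that has not yet been defined is reported as an
error.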

Caseless word formation characters: \wfc
========================================

To break a text into words, the program needs to know which characters
are used to form words.  It always assumes that the letters `A' through
`Z' and `a' through `z' are used as word formation characters.  If the
orthography of the language the user is working in uses any characters
that do not have different lowercase and uppercase forms, these must
be given in a `\wfc' field in the text output control file.

For example, English uses an apostrophe character (`'') that could be
considered a word formation character.  This information is provided by
the following example:

     \wfc  '    | needed for words like don't

Notice that the characters in the `\wfc' field may be separated by
spaces, although this is not required.  If more than one `\wfc'
field occurs in the text output control file, the program uses the
combination of all characters defined in all such fields as word
formation characters.
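
The effect of a `\wfc' field on word breaking can be sketched in Python
as shown below.  This is an illustration only: it ignores characters
declared with `\luwfc' and the multibyte fields, and the function name
`split_words' is hypothetical.

```python
import string


def split_words(text, wfc=""):
    """Split text into words made of A-Z, a-z, and the \\wfc characters."""
    # Whitespace inside the \wfc value only separates the characters.
    word_chars = set(string.ascii_letters) | set("".join(wfc.split()))
    words, current = [], []
    for ch in text:
        if ch in word_chars:
            current.append(ch)
        elif current:
            # A non-word character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


without_wfc = split_words("I don't know.")
with_wfc = split_words("I don't know.", wfc="'")
```

Without the field, the apostrophe splits the word into two pieces; with
it, the word stays intact.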

The comment character cannot be designated as a word formation
character.  If the orthography includes the vertical bar (`|'), then a
different comment character must be defined with the `-c' command line
option when STAMP is initiated; see `STAMP Command Options' above.

Multibyte caseless word formation characters: \wfcs
===================================================

The `\wfcs' field allows multibyte characters to be defined as
"caseless" word formation characters.  It has the same relationship to
`\wfc' that `\luwfcs' has to `\luwfc'.  The multibyte word formation
characters are separated from each other by whitespace.

A sample text output control file
=================================

A complete text output control file used for adapting to Asheninca
Campa is given below.

     \id AEouttx.ctl for Asheninca Campa
     \ch "N"   "m"  / _ p       | assimilates before p
     \ch "N"   "n"              | otherwise becomes n
     \ch "ny"  "n~"
     
     \ch "ts"  "th" / ~_ i      | (N)tsi is unchanged
     \ch "tsy" "ch"
     \ch "sy"  "sh"
     \ch "t"   "tz" / n _ i
     
     \ch "k"   "qu" / _ i / _ e
     \ch "k"   "q"  / _ y
     \ch "k"   "c"
     
     \scl X t s c               | define class of  t   s   c
     \ch "h"   "j"   / [X] ~_   | change except in th, sh, ch
     
     \ch "#"   " "              | remove fixed space
     \ch "@"   ""               | remove blocking character

Input Analysis Files
********************

Analysis files are "record oriented standard format files".  This means
that the files are divided into records, each representing a single
word in the original input text file, and records are divided into
fields.  An analysis file contains at least one record, and may contain
a large number of records.  Each record contains one or more fields.
Each field occupies at least one line, and is marked by a "field code"
at the beginning of the line.  A field code consists of a backslash
character (`\') followed by one or more letters.
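
A reader for this layout might look like the following Python sketch.
It relies on the fact, stated later in this manual, that the `\a' field
starts each record, and it assumes that a line not beginning with a
backslash continues the previous field.  The function name
`read_records' and the sample data are hypothetical.

```python
def read_records(text):
    """Split analysis-file text into records, each a list of (code, value)."""
    records = []
    for line in text.splitlines():
        if line.startswith("\\"):
            code, _, value = line[1:].partition(" ")
            # An \a field opens a new record.
            if code == "a" or not records:
                records.append([])
            records[-1].append((code, value.lstrip()))
        elif line.strip() and records:
            # Assumed continuation line: join it to the last field's value.
            code, value = records[-1][-1]
            records[-1][-1] = (code, value + line.lstrip())
    return records


sample = r"""\a < CAT root >
\d root
\w The
\c 1
\a < CAT root >
\w root
"""
records = read_records(sample)
```

Each record keeps its fields in file order, so optional fields simply
do not appear in the list for records that lack them.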

Analysis file fields
====================

This section describes the possible fields in an analysis file.  The
only field that is guaranteed to exist is the analysis (`\a') field.
All other fields are either data dependent or optional.

Analysis field: \a
------------------

The analysis field (`\a') starts each record of an analysis file.  It
has the following form:

     \a PFX IFX PFX < CAT root CAT root > SFX IFX SFX

where `PFX' is a prefix morphname, `IFX' is an infix morphname, `SFX'
is a suffix morphname, `CAT' is a root category, and `root' is a root
gloss or etymology.  In the simplest case, an analysis field would look
like this:

     \a < CAT root >

that is, a single root with no affixes.
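
The general form can be taken apart with a few string operations, as in
the Python sketch below.  It assumes that a root gloss contains no
spaces, and it does not distinguish infixes from prefixes or suffixes,
since position alone does not identify them.  The function name
`parse_analysis' is hypothetical.

```python
def parse_analysis(value):
    """Split an \\a field value into (before-root names, roots, after-root names)."""
    before, _, rest = value.partition("<")
    inside, _, after = rest.partition(">")
    items = inside.split()
    # Inside the angle brackets, items alternate: category, root, category, ...
    roots = list(zip(items[0::2], items[1::2]))
    return before.split(), roots, after.split()


parts = parse_analysis("PFX IFX PFX < CAT root CAT root > SFX IFX SFX")
```

The result is a tuple holding the morphnames before the roots, the
(category, root) pairs, and the morphnames after the roots.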

Decomposition field: \d
-----------------------

The morpheme decomposition field (`\d') follows the analysis field.  It
has the following form:

     \d anti-dis-establish-ment-arian-ism-s

where the hyphens separate the individual morphemes in the surface form
of the word.

Category field: \cat
--------------------

The category field (`\cat') provides rudimentary category information.
This may be useful for sentence level parsing.  It has the following
form:

     \cat CAT

where `CAT' is the word category.

If there are multiple analyses, there will be multiple categories in
the output, separated by ambiguity markers.

Properties field: \p
--------------------

The properties field (`\p') contains the names of any allomorph or
morpheme properties found in the analysis of the word.  It has the form:

     \p ==prop1 prop2=prop3=

where `prop1', `prop2', and `prop3' are property names.  The equal
signs (`=') serve to separate the property information of the
individual morphemes.  Note that morphemes may have more than one
property, with the names separated by spaces, or no properties at all.
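
A per-morpheme reading of this field can be sketched in Python as
follows.  The sketch assumes that N equal signs delimit N+1 morpheme
slots, which matches the ambiguous-analysis example later in this
manual, where a three-morpheme analysis has `==' in this field.  The
function name `split_by_morpheme' is hypothetical.

```python
def split_by_morpheme(value):
    """Return one list of names per morpheme slot of a \\p field value."""
    # Splitting on "=" yields one (possibly empty) slot per morpheme.
    return [slot.split() for slot in value.split("=")]


props = split_by_morpheme("==prop1 prop2=prop3=")
```

For the form shown above this gives five slots: two empty ones, one
holding `prop1' and `prop2', one holding `prop3', and a final empty
one.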

Feature Descriptors field: \fd
------------------------------

The feature descriptor field (`\fd') contains the feature names
associated with each morpheme in the analysis.  It has the following
form:

     \fd ==feat1 feat2=feat3=

where `feat1', `feat2', and `feat3' are feature descriptors.  The equal
signs (`=') serve to separate the feature descriptors of the individual
morphemes.  Note that morphemes may have more than one feature
descriptor, with the names separated by spaces, or no feature
descriptors at all.

If there are multiple analyses, there will be multiple feature sets in
the output, separated by ambiguity markers.

Underlying form field: \u
-------------------------

The underlying form field (`\u') is similar to the decomposition field
except that it shows underlying forms instead of surface forms.  It
looks like this:

     \u a-para-a-i-ri-me

where the hyphens separate the individual morphemes.

Word field: \w
--------------

The original word field (`\w') contains the original input word as it
looks before decapitalization and orthography changes.  It looks like
this:

     \w The

Note that this is a gratuitous change from earlier versions of AMPLE
and KTEXT, which wrote the decapitalized form.

Formatting field: \f
--------------------

The format information field (`\f') records any formatting codes or
punctuation that appeared in the input text file before the word.  It
looks like this:

     \f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
             \\c 5\n\n
             \\s


where backslashes (`\') in the input text are doubled, newlines are
represented by `\n', and additional lines in the field start with a tab
character.
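
The escaping described here can be reversed mechanically.  The Python
sketch below converts between the stored form and the original text;
it assumes that continuation lines have already been joined into one
string, and the function names `decode_format' and `encode_format' are
hypothetical.

```python
import re


def decode_format(value):
    """Turn a stored \\f value back into the original formatting text."""
    def replace(match):
        # A doubled backslash is one backslash; backslash-n is a newline.
        return "\\" if match.group(1) == "\\" else "\n"
    return re.sub(r"\\(\\|n)", replace, value)


def encode_format(text):
    """Escape formatting text the way it is stored in a \\f field."""
    # Double the backslashes first so the added backslash-n stays single.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


decoded = decode_format(r"\\c 5\n\n")
```

Scanning from left to right in `decode_format' matters: a doubled
backslash followed by the letter n is read as a backslash and an n, not
as a newline.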

The format information field is written to the output analysis file
whenever it is needed, that is, whenever formatting codes or
punctuation exist before words.

Capitalization field: \c
------------------------

The capitalization field (`\c') records any capitalization of the input
word.  It looks like this:

     \c 1

where the number following the field code has one of these values:

`1'
     the first (or only) letter of the word is capitalized

`2'
     all letters of the word are capitalized

`4-32767'
     some letters of the word are capitalized and some are not

Note that the third form is of limited utility, but still exists
because of words like the author's last name (McConnel).

The capitalization field is written to the output analysis file
whenever any of the letters in the word are capitalized.
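
The first two values are enough to restore most words, as the Python
sketch below shows.  It leaves the word unchanged for the values from 4
to 32767, because this section does not say how those values encode
mixed capitalization.  It also uses Python's built-in case conversion
rather than the `\luwfc' and `\luwfcs' tables, and the function name
`recapitalize' is hypothetical.

```python
def recapitalize(word, code):
    """Apply a \\c value of 1 or 2 to a decapitalized word."""
    if code == 1:
        # Capitalize the first (or only) letter.
        return word[:1].upper() + word[1:]
    if code == 2:
        # Capitalize every letter.
        return word.upper()
    # Other values are not interpreted in this sketch.
    return word


restored = recapitalize("the", 1)
```

A record with `\w The' and `\c 1' corresponds to the first case.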

Nonalphabetic field: \n
-----------------------

The nonalphabetic field (`\n') records any trailing punctuation, bar
codes, or whitespace characters.  It looks like this:

     \n |r.\n

where newlines are represented by `\n'.  The nonalphabetic field ends
with the last whitespace character immediately following the word.

The nonalphabetic field is written to the output analysis file whenever
the word is followed by anything other than a single space character.
This includes the case when a word ends a file with nothing following
it.

Ambiguous analyses
==================

The previous section assumed that only one analysis is produced for
each word.  This is not always possible since words in isolation are
frequently ambiguous.  Multiple analyses are handled by writing each
analysis field in parallel, with the number of analyses at the
beginning of each output field.  For example,

     \a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
     \d %2%imaika-Npa-ni%imaika-Npani%
     \cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
     \p %2%==%=%
     \fd %2%==%=%
     \u %2%imaika-Npa-ni%imaika-Npani%
     \w Imaicampani
     \f \\v124
     \c 1
     \n \n


where the percent sign (`%') separates the different analyses in each
field.  Note that only those fields which contain analysis information
are marked for ambiguity.  The other fields (`\w', `\f', `\c', and
`\n') are the same regardless of the number of analyses.
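
Splitting such a field into its separate analyses is straightforward,
as in the Python sketch below.  It assumes that the ambiguity marker is
the percent sign shown in the example and that an unmarked value holds
exactly one analysis; the function name `split_analyses' is
hypothetical.

```python
def split_analyses(value, marker="%"):
    """Return (count, analyses) for one analysis-bearing field value."""
    if not value.startswith(marker):
        # No marker: the field holds a single analysis.
        return 1, [value]
    parts = value.split(marker)
    # parts has the shape ['', count, first, second, ..., ''].
    return int(parts[1]), parts[2:-1]


count, analyses = split_analyses("%2%imaika-Npa-ni%imaika-Npani%")
```

Because every marked field of a record carries the same count, the
lists obtained from `\a', `\d', `\cat', `\p', `\fd', and `\u' can be
matched up by position.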

Analysis failures
=================

The previous sections assumed that words are successfully analyzed.
This does not always happen.  Analysis failures are marked the same way
as multiple analyses, but with zero (`0') for the ambiguity count.  For
example,

     \a %0%ta%
     \d %0%ta%
     \cat %0%%
     \p %0%%
     \fd %0%%
     \u %0%%
     \w TA
     \f \\v 12 |b
     \c 2
     \n |r\n


Note that only the `\a' and `\d' fields contain any information, and
those both have the original word as a placeholder.  The other
analysis fields (`\cat', `\p', `\fd', and `\u') are marked for failure,
but otherwise left empty.
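
A program reading analysis files can tell the three situations apart by
looking at the start of the `\a' field, as in the Python sketch below.
It assumes the percent sign as the ambiguity marker; the function name
`analysis_status' and the three status labels are hypothetical.

```python
import re


def analysis_status(a_value, marker="%"):
    """Classify an \\a field value as 'failure', 'ambiguous', or 'unique'."""
    escaped = re.escape(marker)
    match = re.match(escaped + r"(\d+)" + escaped, a_value)
    if match is None:
        # No leading count: a single successful analysis.
        return "unique"
    count = int(match.group(1))
    if count == 0:
        return "failure"
    return "ambiguous" if count > 1 else "unique"


status = analysis_status("%0%ta%")
```

Records classified as failures or as ambiguous are the ones that need
attention in the later editing step.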

Bibliography
************

  1. Weber, David J., H. Andrew Black, and Stephen R. McConnel. 1988.
     `AMPLE: a tool for exploring morphology'.  Occasional Publications
     in Academic Computing No. 12.  Dallas, TX: Summer Institute of
     Linguistics.

  2. Weber, David J., H. Andrew Black, Stephen R. McConnel, and Alan
     Buseman. 1990.  `STAMP: a tool for dialect adaptation'.
     Occasional Publications in Academic Computing No. 15.  Dallas, TX:
     Summer Institute of Linguistics.

