AMPLE Reference Manual

A Morphological Parser for Linguistic Exploration

version 3.3

April 2000

by Stephen McConnel and H. Andrew Black

1. Introduction to the AMPLE program

Since it was released in 1988, the AMPLE program has been used for morphological analysis in many different languages. It is a complex program designed to tackle a complex problem. This manual is intended for reference purposes, to clarify fine points of input and behavior. It is not designed as a tutorial or as a "cookbook" of how to use AMPLE.

AMPLE uses a plethora of input files to control its behavior. These include two mandatory control files (the analysis data file and dictionary code table file), two optional control files (the dictionary orthography change table file and text control file), and a set of dictionary files. The format of each of these files is described in this manual.

1.0.0.1 New features

1. Version 3.1 (July 1998) introduced enhanced multibyte character handling, especially with regard to capitalization.
2. Version 3.2 (October 1998) introduced reduplication patterns in the allomorph fields of the dictionary files.
3. Version 3.3 (May 1999) introduced punctuation environment constraints in the allomorph fields of the dictionary files. These are handled by a new built-in test called PEC_ST. This version also added two punctuation-oriented clauses to user-written tests.
4. Version 3.3.4 (November 1999) added XAMPLE compilation to the standard distribution, and added the \patr field to the analysis data file for use by XAMPLE in controlling the PCPATR word parser.
5. Version 3.3.7 (January 2000) added the PromoteDefAtoms value to the \patr field in the analysis data file for use by XAMPLE in controlling the PCPATR word parser.
6. Version 3.3.10 (April 2000) added the PropertyIsFeature value to the \patr field in the analysis data file for use by XAMPLE in controlling the PCPATR word parser.

2. Running AMPLE

AMPLE is a batch-oriented program. It reads a number of control files, and then processes one or more input text files, producing one output analysis file for each input file.

2.1 AMPLE Command Options

The AMPLE program uses an old-fashioned command line interface following the convention of options starting with a dash character (`-'). The available options are listed below in alphabetical order. Those options which require an argument have the argument type following the option letter.

-a: causes debugging output for allomorph conditions.
-b: allows the allomorph identifiers to be stored in memory. (This feature was added to support LinguaLinks.)
-c character: selects the control file comment character. The default is the vertical bar (|).
-d number: selects the maximum dictionary trie depth. The default is 2, which favors reduced memory needs over speed.
-e filename: selects the PCPATR grammar file for XAMPLE to use. (XAMPLE is a version of AMPLE that adds a PCPATR style word parser to AMPLE.) This option is not recognized by AMPLE.
-f filename: opens a command file containing the names of the control and data files. The default is to read those names from the standard input (keyboard); see section 2.2 Program Interaction.
-g: causes root glosses to be output in the analysis file, and enables the internal code G in the dictionary code table.
-i filename: selects a single input text file.
-m: monitors progress of an analysis: * means an analysis failure, . means a single analysis, 2-9 means 2-9 ambiguities, and > means 10 or more ambiguities. This is not compatible with the `-q' option.
-n number: sets the maximum recommended morphname length. Any morphnames longer than number characters are truncated (with a warning message).
-o filename: selects a single output analysis file.
-q: causes AMPLE to operate "quietly" with minimal screen output. This is not compatible with the `-m' option.
-p: causes ambiguous word percentages to be reported.
-r: checks references to morphnames in all tests.
-s filename: opens a file containing morphnames (or allomorphs) for a selective analysis. This is usually used together with the `-t' (trace) option.
-t: causes analyses to be traced. This produces a huge amount of output. Repeating the -t option causes SGML style trace output to be produced.
-u: signals that dictionaries are unified, not split into prefix, infix, suffix, and root files.
-w fields: selects one or more of these optional output fields for writing to the analysis file:
    d enables writing the \d (morpheme decomposition) field
    p enables writing the \p (properties) field
    w enables writing the \w (original word) field
  The default is to ask interactively about the \d and \w fields, and to write the \p field without asking. All three fields can be selected for output by `-w dpw' or by `-w d -w p -w w'.
-x fields: prevents one or more of these optional output fields from being written to the analysis file:
    d disables writing the \d (morpheme decomposition) field
    p disables writing the \p (properties) field
    w disables writing the \w (original word) field
  The default is to ask interactively about the \d and \w fields, and to write the \p field without asking. All three fields can be excluded from output by `-x dpw' or by `-x d -x p -x w'.
-v: verifies tests by pretty printing the parse trees.
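
The progress characters described for the `-m' option can be summarized in a few lines of Python. This is only an illustrative sketch of the mapping stated above, not code taken from AMPLE; the function name monitor_symbol is hypothetical.

```python
def monitor_symbol(analysis_count: int) -> str:
    """Return the progress character described for the -m option."""
    if analysis_count < 0:
        raise ValueError("analysis_count must be non-negative")
    if analysis_count == 0:
        return "*"                  # analysis failure
    if analysis_count == 1:
        return "."                  # a single analysis
    if analysis_count <= 9:
        return str(analysis_count)  # 2-9 ambiguities
    return ">"                      # 10 or more ambiguities


if __name__ == "__main__":
    # Five words with 1, 0, 3, 12, and 1 analyses respectively.
    print("".join(monitor_symbol(n) for n in [1, 0, 3, 12, 1]))  # prints .*3>.
```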

The following options exist only in beta-test versions of the program, since they are used only for debugging.

-/: increments the debugging level. The default is zero (no debugging output).
-z filename: opens a file for recording a memory allocation log.
-Z address,count: traps the program at the point where address is allocated or freed for the count'th time.

2.2 Program Interaction

If the `-f', `-i', and `-o' command options are not used, AMPLE prompts for a number of file names, reading the standard input for the desired values. The interactive dialog goes like this:


C> ample
AMPLE: A Morphological Parser for Linguistic Exploration
Version 3.0b9 (April 4, 1997), Copyright 1997 SIL, Inc.
Beta test version compiled Apr  4 1997 12:18:27
                Analysis Performed Wed Apr  4 14:41:02 1997
Analysis data file (xxAD01.CTL): hgad01.ctl
Dictionary code table (xxANCD.TAB or xxGyCD.TAB): hgancd.tab
Dictionary orthography change table (xxORDC.TAB) [none]:

Suffix dictionary file (xxSF01.DIC): hgsf01.dic
        8 changes loaded from suffix dictionary code table.
        SUFFIX DICTIONARY: Loaded 116 records

Root dictionary file (xxRTnn.DIC): hgrt01.dic
        7 changes loaded from root dictionary code table.
        ROOT DICTIONARY: Loaded 43 records
Next Root dictionary file (xxRTnn.DIC) [no more]:
Text Control File (xxINTX.CTL) [none]: hgintx.ctl
Include the original word in the output (Y or N) [n]? y
Include the morpheme decomposition in the output (Y or N) [n]? y

First Input file: hgtest.txt
Output file: hgtest.ana

INPUT: 78 words processed.

Next Input file [no more]:
C>

Note that each prompt contains a reminder of the expected form of the answer in parentheses and ends with a colon. Several of the prompts also contain the default answer in brackets.

Using the command options does not change the appearance of the program screen output significantly, but the program displays the answers to each of its prompts without waiting for input. Assume that the file `hgtest.cmd' contains the following, which is the same as the answers given above:


hgad01.ctl
hgancd.tab

hgsf01.dic
hgrt01.dic

hgintx.ctl
y
y

Then running AMPLE with the command options produces screen output like the following:


C> ample -f hgtest.cmd -i hgtest.txt -o hgtest.ana
AMPLE: A Morphological Parser for Linguistic Exploration
Version 3.0b9 (April 4, 1997), Copyright 1997 SIL, Inc.
Beta test version compiled Apr  4 1997 12:18:27
                Analysis Performed Wed Apr  4 14:41:32 1997
Analysis data file (xxAD01.CTL): hgad01.ctl
Dictionary code table (xxANCD.TAB or xxGyCD.TAB): hgancd.tab
Dictionary orthography change table (xxORDC.TAB) [none]:

Suffix dictionary file (xxSF01.DIC): hgsf01.dic
        8 changes loaded from suffix dictionary code table.
        SUFFIX DICTIONARY: Loaded 116 records

Root dictionary file (xxRTnn.DIC): hgrt01.dic
        7 changes loaded from root dictionary code table.
        ROOT DICTIONARY: Loaded 43 records
Next Root dictionary file (xxRTnn.DIC) [no more]:
Text Control File (xxINTX.CTL) [none]: hgintx.ctl
Include the original word in the output (Y or N) [n]? y
Include the morpheme decomposition in the output (Y or N) [n]? y

INPUT: 78 words processed.
C>

The only difference in the screen output is that the prompts for the input text file and the output analysis file are not displayed.
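
When AMPLE is driven from a script, the command file and the command line can be generated rather than typed. The following Python sketch reproduces the `hgtest.cmd' contents and the command line shown above. The helper names command_file_text and ample_argv are hypothetical, and the sketch assumes that an empty line accepts the bracketed default of the corresponding prompt, as in the example command file.

```python
# Answers to the prompts, in the order AMPLE asks for them.
# An empty string accepts the default shown in brackets.
ANSWERS = [
    "hgad01.ctl",  # analysis data file
    "hgancd.tab",  # dictionary code table
    "",            # dictionary orthography change table [none]
    "hgsf01.dic",  # suffix dictionary file
    "hgrt01.dic",  # root dictionary file
    "",            # next root dictionary file [no more]
    "hgintx.ctl",  # text control file
    "y",           # include the original word
    "y",           # include the morpheme decomposition
]


def command_file_text(answers):
    """Join the prompt answers into command file text, one per line."""
    return "".join(answer + "\n" for answer in answers)


def ample_argv(command_file, input_file, output_file):
    """Build the argument list for the -f, -i, and -o options."""
    return ["ample", "-f", command_file, "-i", input_file, "-o", output_file]


if __name__ == "__main__":
    print(command_file_text(ANSWERS), end="")
    print(" ".join(ample_argv("hgtest.cmd", "hgtest.txt", "hgtest.ana")))
```

The text returned by command_file_text would be saved as `hgtest.cmd', and the argument list could then be passed to a process launcher such as Python's subprocess.run.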

3. Standard format

The input control files that AMPLE reads and the output analysis files that AMPLE writes are all standard format files. This means that the files are divided into records and fields. Each file contains at least one record, and some files may contain a large number of records. Each record contains one or more fields. Each field occupies at least one line, and is marked by a field code at the beginning of the line. A field code consists of a backslash character (\) followed by one or more printing characters (usually alphabetic).

If the file is designed to have multiple records, then one of the field codes must be designated to be the record marker, and every record begins with that field, even if it is empty apart from the field code. If the file contains only one record, then the relative order of the fields is constrained only by their semantics.

It is worth emphasizing that field codes must be at the beginning of a line. Even a single space before the backslash character prevents it from being recognized as a field code.

It is also worth emphasizing that record markers must be present even if that field has no information for that record. Omitting the record marker causes two records to be merged into a single record, with unpredictable results.
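
The rules above can be illustrated with a small Python reader for standard format text. This is a minimal sketch, not AMPLE's own reader. The field codes \rec and \val are hypothetical, and two choices are assumptions of the sketch: a line that does not begin with a backslash is treated as a continuation of the current field, and any fields that precede the first record marker are grouped into a record of their own.

```python
def parse_fields(text):
    """Split standard format text into (code, value) pairs."""
    fields = []
    for line in text.splitlines():
        if line.startswith("\\"):
            # A backslash at the very beginning of a line starts a field.
            parts = line[1:].split(None, 1)
            code = "\\" + (parts[0] if parts else "")
            value = parts[1] if len(parts) > 1 else ""
            fields.append([code, value])
        elif fields:
            # Anything else, including " \val", continues the current field.
            fields[-1][1] += "\n" + line
    return [(code, value.strip()) for code, value in fields]


def group_records(fields, marker):
    """Start a new record at every occurrence of the record marker."""
    records = []
    for code, value in fields:
        if code == marker or not records:
            records.append([])
        records[-1].append((code, value))
    return records


SAMPLE = "\n".join([
    r"\rec first",
    r"\val one",
    r" \val not a field",  # leading space: not recognized as a field code
    r"\rec",               # empty record marker still starts a record
    r"\val two",
]) + "\n"

if __name__ == "__main__":
    for record in group_records(parse_fields(SAMPLE), "\\rec"):
        print(record)
```

In the sample, the indented line is absorbed into the preceding \val field, and the empty \rec line still separates the second record from the first.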

4. Analysis Data File

The primary control file for the AMPLE program is called the analysis data file. It is a standard format file containing a single data record.

4.1 Analysis Data File Fields

The fields that AMPLE recognizes for the analysis data file are described below. Fields that start with any other backslash codes are ignored by AMPLE.

4.1.1 Allomorph properties: \ap

Allomorph properties are defined by the field code \ap followed by one or more allomorph property names. An allomorph property name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used.

A maximum of 255 properties (including both allomorph and morpheme properties) may be defined. Any number of \ap fields may be used so long as the number of property names does not exceed 255.

If no \ap fields appear in the analysis data file, then AMPLE does not allow allomorph properties to be used in the dictionary files or in the tests.

4.1.2 Categories: \ca

Categories are defined by the field code \ca followed by one or more category names. A category name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used.

A maximum of 255 categories may be defined. Any number of \ca fields may be used so long as the number of category names does not exceed 255.

If no \ca fields appear in the analysis data file, then AMPLE does not allow categories to be used in the dictionary entries or in the tests. This is inconceivable for AMPLE's model of morphology.

4.1.3 Category output control: \cat

The category information to write to the analysis output file is defined by the field code \cat followed by one or two words. The first word must be either prefix or suffix (or an abbreviation of one of those words), either capitalized or lowercase. The second word, if present, must be morpheme (or an abbreviation thereof), either capitalized or lowercase.

The \cat field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.

If no \cat fields appear in the analysis data file, then AMPLE does not write any category information to the output file.

4.1.4 Category class: \ccl

A category class is defined by the field code \ccl followed by the class name, which is followed in turn by one or more category names or (previously defined) category class names. A category class name used as part of the class definition must be enclosed in square brackets.

The class name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. The category names must have been defined by an earlier \ca field.

Each \ccl field defines a single category class. Any number of \ccl fields may appear in the file.

If no \ccl fields appear in the analysis data file, then AMPLE does not allow any category classes to be used in tests or morpheme environment constraints.
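
The definition rules for category classes can be sketched in Python as follows. This is an illustrative model only: the category names N, V, ADJ, and ADV and the class names are hypothetical, representing a class as a set of category names is an assumption of the sketch, and the error messages are not AMPLE's.

```python
def define_category_class(definition, categories, classes):
    """Expand the body of one category class field into category names.

    definition: the text after the field code, e.g. "Nominal N ADJ"
    categories: category names already declared
    classes:    previously defined classes (name -> set of categories)
    """
    words = definition.split()
    if len(words) < 2:
        raise ValueError("a class needs a name and at least one member")
    name, members = words[0], words[1:]
    expanded = set()
    for member in members:
        if member.startswith("[") and member.endswith("]"):
            # A bracketed member refers to a previously defined class.
            ref = member[1:-1]
            if ref not in classes:
                raise ValueError(f"category class {ref!r} is not defined yet")
            expanded |= classes[ref]
        elif member in categories:
            expanded.add(member)
        else:
            raise ValueError(f"category {member!r} is not defined")
    classes[name] = expanded
    return expanded


if __name__ == "__main__":
    categories = {"N", "V", "ADJ", "ADV"}   # as if declared by \ca N V ADJ ADV
    classes = {}
    define_category_class("Nominal N ADJ", categories, classes)
    define_category_class("Open [Nominal] V", categories, classes)
    print(sorted(classes["Open"]))  # prints ['ADJ', 'N', 'V']
```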

4.1.5 Compound root category pair: \cr

An allowable compound root category pair is defined by the \cr field code followed by two category names previously defined in a \ca field. The order of the category names is significant.

Any number of compound root category pairs may be declared. If compound roots are not allowed by a \maxr field, then the compound root category pairs are ignored.

If no \cr fields appear in the analysis data file, then AMPLE does not allow any compound roots. This is, of course, immaterial if the maximum number of roots is one (1).

4.1.6 Dictionary decapitalization control: \dicdecap

The \dicdecap field indicates that allomorph strings in dictionary entries should be decapitalized. Only the field code is significant; anything else in the field is ignored.

The \dicdecap field may appear any number of times, but once is enough.

If no \dicdecap fields appear in the analysis data file, then AMPLE stores dictionary entries verbatim without decapitalizing allomorph strings.

4.1.7 Final test: \ft

A final test is defined by the \ft field code followed by the test name and possibly a test body. The test body is not needed if the test name is that of a built-in test (either MEC_FT or MCC_FT), or a previously defined successor test that is to be used as a final test.

Any number of final tests may be defined in the file. For details about the syntax of final tests, see section 4.2 Test Syntax.

If no \ft fields appear in the analysis data file, AMPLE still applies the built-in final tests MEC_FT and MCC_FT.

4.1.8 Infix ad hoc pair: \iah

An infix ad hoc pair is defined by the \iah field code followed by two morpheme identifiers. The first morphname may belong to a prefix, root, or suffix depending on what is allowed by the infix dictionary entries. The second must belong to an infix.

Any number of infix ad hoc pairs may be defined in the file. However, their use is strongly discouraged on linguistic grounds.

If no \iah fields appear in the analysis data file, then AMPLE never eliminates any analyses via the infix ADHOC_ST test.

4.1.9 Infix successor test: \it

An infix successor test is defined by the \it field code followed by the test name and possibly a test body. The test body is not needed if the test name is that of a built-in test (either SEC_ST, ADHOC_ST, or PEC_ST), or a previously defined prefix test that is to be used as an infix test.

Infix tests are applied in the order they appear in the analysis data file. If not explicitly listed, SEC_ST, ADHOC_ST, and PEC_ST are applied after all the user-defined infix tests.

Any number of infix successor tests may be defined in the file. For the syntax of successor tests, see section 4.2 Test Syntax.

If no \it fields appear in the analysis data file, AMPLE still applies the built-in infix tests SEC_ST, ADHOC_ST and PEC_ST.

4.1.10 Maximum number of infixes: \maxi

The maximum number of infixes that may appear in a word is defined by the \maxi field code followed by a number greater than or equal to zero.

The \maxi field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.

If no \maxi fields appear in the analysis data file, then AMPLE assumes that the language does not have infixes.

4.1.11 Maximum number of null allomorphs: \maxnull

The maximum number of null allomorphs that may appear in a word is defined by the \maxnull field code followed by a number greater than or equal to zero.

The \maxnull field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.

If no \maxnull fields appear in the analysis data file, then AMPLE limits the number of null allomorphs in a word to ten (10).

4.1.12 Maximum number of prefixes: \maxp

The maximum number of prefixes that may appear in a word is defined by the \maxp field code followed by a number greater than or equal to zero.

The \maxp field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.

If no \maxp fields appear in the analysis data file, then AMPLE assumes that the language does not have prefixes.

4.1.13 Maximum number of properties: \maxprops

The maximum number of properties that can be defined can be increased from the default of 255 by giving the \maxprops field code followed by a number greater than or equal to 255 but less than 65536.

The \maxprops field may appear any number of times, but once is enough. If more than one such field occurs, the one containing the largest valid value is the one that is used.

The \maxprops field must appear before any properties are defined. This applies to both morpheme and allomorph properties.

If no \maxprops fields appear in the analysis data file, then AMPLE limits the number of properties which can be defined to 255.

4.1.14 Maximum number of roots: \maxr

The maximum number of roots that may appear in a word is defined by the \maxr field code followed by a number greater than or equal to one.

The \maxr field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.

If no \maxr fields appear in the analysis data file, then AMPLE assumes that only a single root can appear in a word.

4.1.15 Maximum number of suffixes: \maxs

The maximum number of suffixes that may appear in a word is defined by the \maxs field code followed by a number greater than or equal to zero.

The \maxs field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.

If no \maxs fields appear in the analysis data file, then AMPLE assumes that up to 100 suffixes can occur in a word.
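
The defaults and repetition rules for the \maxi, \maxnull, \maxp, \maxprops, \maxr, and \maxs fields can be summarized in a short Python sketch. It follows the rules stated in the sections above: the last occurrence wins, except for \maxprops, where the largest valid value wins. Treating an out-of-range or non-numeric value as ignored is an assumption of the sketch; the manual does not say how AMPLE reports such values. The function name effective_limits is hypothetical.

```python
# Values AMPLE assumes when a field does not appear at all.
DEFAULTS = {
    r"\maxi": 0,        # no infixes
    r"\maxnull": 10,
    r"\maxp": 0,        # no prefixes
    r"\maxprops": 255,
    r"\maxr": 1,        # a single root
    r"\maxs": 100,
}


def effective_limits(fields):
    """Compute the limits in effect from (code, text) pairs in file order."""
    limits = dict(DEFAULTS)
    for code, text in fields:
        if code not in DEFAULTS:
            continue
        try:
            value = int(text.split()[0])
        except (ValueError, IndexError):
            continue  # assumption: unusable values are skipped
        if code == r"\maxprops":
            # The largest valid value is used, not the last one.
            if 255 <= value < 65536:
                limits[code] = max(limits[code], value)
        elif code == r"\maxr":
            if value >= 1:
                limits[code] = value
        elif value >= 0:
            limits[code] = value  # the last occurrence is used
    return limits


if __name__ == "__main__":
    fields = [(r"\maxp", "3"), (r"\maxp", "5"),
              (r"\maxprops", "1000"), (r"\maxprops", "300")]
    print(effective_limits(fields))
```

With the sample fields, the sketch yields 5 for \maxp (the last value) and 1000 for \maxprops (the largest value).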

4.1.16 Morpheme Co-occurrence Constraint: \mcc

A morpheme co-occurrence constraint is defined by the \mcc field code followed by one or more morpheme names or morpheme class names, and finally a morpheme environment constraint. Each morpheme class name must be enclosed in square brackets, and must have been defined by a prior \mcl field.

For the syntax of morpheme co-occurrence constraints, see section 4.3 Morpheme Co-occurrence Constraint Syntax.

If no \mcc fields appear in the analysis data file, then AMPLE does not eliminate any analyses by the MCC_FT test.

4.1.17 Morpheme class: \mcl

A morpheme class is defined by the \mcl field code followed by the class name, which is followed in turn by one or more morpheme names or (previously defined) morpheme class names. A morpheme class name used as part of the class definition must be enclosed in square brackets.

The class name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. The morpheme names should be defined by an entry in one of the dictionary files.

Each \mcl field defines a single morpheme class. Any number of \mcl fields may appear in the file.

If no \mcl fields appear in the analysis data file, then AMPLE does not allow any morpheme classes in morpheme environment constraints or tests.

4.1.18 Morpheme properties: \mp

Morpheme properties are defined by the field code \mp followed by one or more morpheme property names. A morpheme property name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used.

A maximum of 255 properties (including both allomorph and morpheme properties) may be defined. Any number of \mp fields may be used so long as the number of property names does not exceed 255.

If no \mp fields appear in the analysis data file, then AMPLE does not allow any morpheme properties in dictionary files or tests.

4.1.19 Prefix ad hoc pair: \pah

A prefix ad hoc pair is defined by the \pah field code followed by two morpheme identifiers. The first morphname may belong to either a prefix or an infix (if infixes exist and can mingle with prefixes). The second must belong to a prefix.

Any number of prefix ad hoc pairs may be defined in the file. However, their use is strongly discouraged on linguistic grounds.

If no \pah fields appear in the analysis data file, then AMPLE never eliminates any analyses via the prefix ADHOC_ST test.

4.1.20 Word parser parameter settings: \patr

The \patr field is recognized only by XAMPLE, not by AMPLE, and has effect only if a grammar file is selected by the -e command line option. Each instance of this field sets one of the PCPATR control parameters. Several instances of the field can occur in the analysis data file in order to set several different parameters. Each field contains a parameter name followed by an argument giving its value. These parameters and allowable arguments are discussed below.

Note that the parameter names and arguments following the \patr field code are not case sensitive: ON is the same as On, which is the same as on. Also, the parameter names and arguments may be abbreviated to the shortest unique value: off could be written of, since that is sufficient to distinguish it from on.

CheckCycles: This parameter controls a check against introducing cycles into the parse chart. This makes the parse safer, but slows it down. Legal grammars should not introduce cycles, but it can happen while developing grammars. \patr CheckCycles ON enables this check, and \patr CheckCycles OFF disables it. The default is ON.
DebuggingLevel: This parameter specifies the amount of PCPATR debugging information which will be written to the log file. Its argument is a number greater than or equal to zero. If zero, then no extra debugging information will be written to the log file. The default value is 0. NOTE: this parameter is most useful for the programmer. It can produce huge amounts of cryptic output.
FeatureStyle: This parameter controls the way that feature structures are written to either the output analysis file or the log file, but not whether they are written. \patr FeatureStyle Full causes features to be displayed in an indented format that makes obvious the embedded structure of each feature. \patr FeatureStyle Flat causes features to be displayed in a flat, linear string that uses less space. The default style is Flat.
MaxAmbiguity: This parameter controls the maximum number of different parses for a particular AMPLE word analysis that will be written to either the output analysis file or the log file. Its argument is a number greater than or equal to one. The default maximum is 10.
PromoteDefAtoms: This parameter controls whether default atomic feature values loaded from the lexicon are "promoted" to ordinary atomic feature values before parsing begins. \patr PromoteDefAtoms On causes default atomic values to be promoted. \patr PromoteDefAtoms Off causes parsing to use default atomic values still marked as default. (This can affect feature unification since a conflicting default value does not cause a failure: the default value merely disappears.) The default value is On.
PropertyIsFeature: This parameter controls whether the values in the AMPLE analysis \p (property) field are to be interpreted as feature template names, the same as the values in the AMPLE analysis \fd (feature descriptor) field. \patr PropertyIsFeature On turns on this behavior, and \patr PropertyIsFeature Off turns it off. The default value is On.
ShowAllFeatures: This parameter controls whether the feature structures for all nodes in the parse tree are written to the output files, or just the feature structure for the top node in the parse tree. \patr ShowAllFeatures On causes features for all nodes to be written. \patr ShowAllFeatures Off causes only the feature structure for the top node of the parse to be written. The default value is On.
ShowFailures: This parameter controls how the parser handles parse failures. An AMPLE analysis may fail to parse either by failing the feature constraints or by failing the phrase structure rules. \patr ShowFailures On causes partial results indicating the cause of parse failures to be written to the log file. \patr ShowFailures Off prevents any extra output to the log file. The default value is Off. NOTE: since the purpose of using the PCPATR word parser in XAMPLE is to weed out incorrect AMPLE analyses, a large number of parse failures are to be expected, which can cause huge log files. This parameter is best used in conjunction with the -t command line option when tracing the analysis of a single word, or a small number of words.
ShowFeatures: This parameter controls whether or not any feature structures are written to the output analysis file or the log file. It does not affect any of the other parameters related to how feature structures are written. \patr ShowFeatures On enables writing feature structures to the output files. \patr ShowFeatures Off disables writing feature structures. The default value is On.
ShowGlosses: This parameter controls whether morpheme glosses are displayed in the parse tree output. \patr ShowGlosses On enables writing glosses in the parse tree output. \patr ShowGlosses Off disables writing glosses. If no morpheme glosses exist in the dictionary, then this parameter is ignored. The default value is On.
TimeLimit: This parameter limits the amount of time that parsing an AMPLE analysis can take. Its argument is a number greater than or equal to zero, which is the maximum number of seconds that a parse is allowed before being cancelled. The default value is 0, which has the special meaning that no limit is imposed. NOTE: this feature is new and still somewhat experimental. It may not be fully debugged, and may cause unforeseen side effects such as program crashes some time after one or more parses are cancelled due to exceeding the set time limit.
TopDownFilter: This parameter controls whether simple top-down filtering based on the grammar categories is applied to the parse process. \patr TopDownFilter On enables this top-down filtering. \patr TopDownFilter Off disables the top-down filtering, slowing down the parse but possibly finding more solutions. The default value is On.
TreeStyle: This parameter controls how parse trees are written to either the output analysis file or the log file. \patr TreeStyle Full causes parses to be written in a somewhat graphic tree display format, using ASCII characters to draw the branches of the tree. \patr TreeStyle Flat causes parses to be written as parenthesized strings, similar to the way that LISP represents trees. This is the default value: it may be cryptic, but it requires the least space. \patr TreeStyle Indented causes parses to be written in an indented format sometimes called a northwest tree. \patr TreeStyle XML causes parses to be written in an XML format, with each node containing the feature structure associated with that node of the parse tree. This setting causes the FeatureStyle parameter to be ignored. \patr TreeStyle Off prevents parses from being written. This allows PCPATR word grammars to be used for filtering invalid AMPLE analyses without cluttering up the output analysis files.
TrimEmptyFeatures: This parameter controls whether empty feature structures are written to the output files. \patr TrimEmptyFeatures On disables the display of empty feature values. \patr TrimEmptyFeatures Off enables the display of empty features. The default value is Off.
Unification: This parameter controls whether the parsing process allows unification failures to block successful parsing. \patr Unification On causes the constituent structure rules to constrain the parse. \patr Unification Off causes feature unification failures to be ignored while parsing. (Most likely, this would be useful only while debugging the word grammar.) The default value is On.
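
The case-insensitive, shortest-unique-abbreviation rule for \patr parameter names and arguments can be modelled with a short Python sketch. The parameter names are those listed above. The function name resolve is hypothetical, and raising an error for an unknown or ambiguous abbreviation is an assumption of the sketch; the manual does not state how XAMPLE responds in those cases.

```python
PARAMETERS = [
    "CheckCycles", "DebuggingLevel", "FeatureStyle", "MaxAmbiguity",
    "PromoteDefAtoms", "PropertyIsFeature", "ShowAllFeatures",
    "ShowFailures", "ShowFeatures", "ShowGlosses", "TimeLimit",
    "TopDownFilter", "TreeStyle", "TrimEmptyFeatures", "Unification",
]


def resolve(word, choices):
    """Match word against choices, ignoring case and allowing any
    abbreviation that identifies exactly one choice."""
    lowered = word.lower()
    exact = [c for c in choices if c.lower() == lowered]
    if exact:
        return exact[0]
    matches = [c for c in choices if c.lower().startswith(lowered)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown value: {word!r}")
    raise ValueError(f"ambiguous abbreviation {word!r}: {', '.join(matches)}")


if __name__ == "__main__":
    print(resolve("checkc", PARAMETERS))   # prints CheckCycles
    print(resolve("of", ["On", "Off"]))    # prints Off
```

Under this model, ShowF is not a usable abbreviation because it matches both ShowFailures and ShowFeatures, and o cannot stand for either On or Off.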

4.1.21 Punctuation class: \pcl

A punctuation class is defined by the field code \pcl followed by the class name, which is followed in turn by one or more punctuation characters or (previously defined) punctuation class names. A punctuation class name used as part of the class definition must be enclosed in square brackets.

The class name must be a single, contiguous sequence of printing characters. The individual members of the class are separated by spaces, tabs, or newlines.

Each \pcl field defines a single punctuation class. Any number of \pcl fields may appear in the file.

If no \pcl fields appear in the analysis data file, then AMPLE does not allow any punctuation classes in tests, and does not allow any punctuation classes in punctuation environment constraints.

4.1.22 Prefix successor test: \pt

A prefix successor test is defined by the \pt field code followed by the test name and possibly a test body. The test body is not needed if the test name is that of a built-in test (either SEC_ST, ADHOC_ST, or PEC_ST).

Prefix tests are applied in the order they appear in the analysis data file. If not explicitly listed, SEC_ST, ADHOC_ST, and PEC_ST are applied after all the user-defined prefix tests.

Any number of prefix successor tests may be defined in the file. For the syntax of successor tests, see section 4.2 Test Syntax.

If no \pt fields appear in the analysis data file, AMPLE still applies the built-in prefix tests SEC_ST, ADHOC_ST, and PEC_ST.

4.1.23 Root ad hoc pair: \rah

A root ad hoc pair is defined by the \rah field code followed by two morpheme identifiers. The first identifier may belong to a prefix, an infix (if infixes exist and can mingle with prefixes or roots), or a root (if compound roots are allowed). The second morpheme identifier must belong to a root.

A prefix or infix identifier in a root ad hoc pair must be the affix's morphname. A root identifier in a root ad hoc pair must be given exactly as it occurs in the analysis (an etymology or a gloss, depending on the assignment to the M field in the root section of the dictionary code table).

Any number of root ad hoc pairs may be defined in the file. However, their use is strongly discouraged on linguistic grounds.

If no \rah fields appear in the analysis data file, then AMPLE never eliminates any analyses via the root ADHOC_ST test.

4.1.24 Root Delimiter Characters: \rd

The root delimiter characters used in the output analysis file are defined by the \rd field code followed by two characters, possibly separated by spaces. The first character is used to mark the beginning of a root analysis and the second is used to mark its end.

The \rd field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.

If no \rd fields appear in the analysis data file, then AMPLE uses the delimiter characters < and >.
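
As a brief illustration, the \rd rules above (two characters, optional spaces between them, last field wins, default < and >) can be expressed in Python. The function name root_delimiters is hypothetical, and skipping a field with fewer than two characters is an assumption of the sketch.

```python
def root_delimiters(field_bodies):
    """Return (begin, end) delimiters from the text of each \\rd field,
    given in file order."""
    delimiters = ("<", ">")  # used when no \rd field appears
    for body in field_bodies:
        chars = [ch for ch in body if not ch.isspace()]
        if len(chars) >= 2:
            delimiters = (chars[0], chars[1])  # later fields override
    return delimiters


if __name__ == "__main__":
    print(root_delimiters([]))             # prints ('<', '>')
    print(root_delimiters(["()", "[ ]"]))  # prints ('[', ']')
```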

4.1.25 Root successor test: \rt

A root successor test is defined by the \rt field code followed by the test name and possibly a test body. The test body is not needed if the test name is that of a built-in test (SEC_ST, ADHOC_ST, ROOTS_ST, or PEC_ST), or a previously defined prefix or infix test that is to be used as a root test.

Root tests are applied in the order they appear in the analysis data file. If not explicitly listed, SEC_ST, ADHOC_ST, ROOTS_ST, and PEC_ST are applied after all the user-defined root tests.

Any number of root successor tests may be defined in the file. For the syntax of successor tests, see section 4.2 Test Syntax.

If no \rt fields appear in the analysis data file, AMPLE still applies the built-in root tests SEC_ST, ADHOC_ST, ROOTS_ST, and PEC_ST.
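
The following sketch shows one way the ordering rule might be used. Order_PT is a hypothetical user-defined prefix test (its name and body are invented for illustration, and the body follows rule 9a of section 4.2 Test Syntax). It is then reused as a root test, so no test body is given in the \rt field.

```
\pt Order_PT
    left orderclass <= current orderclass
\rt SEC_ST
\rt Order_PT
```

Under the rule described above, the root tests would be applied in the order SEC_ST, then Order_PT, followed by the built-in tests that are not explicitly listed (ADHOC_ST, ROOTS_ST, and PEC_ST).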

4.1.26 Suffix ad hoc pair: \sah

A suffix ad hoc pair is defined by the \sah field code followed by two morpheme identifiers. The first identifier may belong to a root, an infix (if infixes exist and can mingle with roots or suffixes), or a suffix. The second morpheme identifier must belong to a suffix.

A suffix or infix identifier in a suffix ad hoc pair must be the affix's morphname. A root identifier in a suffix ad hoc pair must be given exactly as it occurs in the analysis (an etymology or a gloss, depending on the assignment to the M field in the root section of the dictionary code table).

Any number of suffix ad hoc pairs may be defined in the file. However, their use is strongly discouraged on linguistic grounds.

If no \sah fields appear in the analysis data file, then AMPLE never eliminates any analyses via the suffix ADHOC_ST test.

4.1.27 String class: \scl

A string class is defined by the \scl field code followed by the class name, which is followed in turn by one or more contiguous character strings or (previously defined) string class names. A string class name used as part of the class definition must be enclosed in square brackets.

The class name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. The actual character strings have no such restrictions. The individual members of the class are separated by spaces, tabs, or newlines.

Each \scl field defines a single string class. Any number of \scl fields may appear in the file.

If no \scl fields appear in the analysis data file, then AMPLE does not allow any string classes in tests, and does not allow any string classes in string environment constraints unless they are defined in the text input control file or the dictionary orthography changes file.
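
A short sketch of string class definitions follows. The class names V, Nasal, and Sonorant and their members are invented for illustration.

```
\scl V a e i o u
\scl Nasal m n ng
\scl Sonorant [V] [Nasal] l r
```

The third class includes the two previously defined classes by enclosing their names in square brackets, and adds the strings l and r. Note that ng is a single member consisting of two characters.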

4.1.28 Suffix successor test: \st

A suffix successor test is defined by the \st field code followed by the test name and possibly a test body. The test body is not needed if the test name is that of a built-in test (either SEC_ST, ADHOC_ST, or PEC_ST), or a previously defined prefix, infix, or root test that is to be used as a suffix test.

Suffix tests are applied in the order they appear in the analysis data file. If not explicitly listed, SEC_ST, ADHOC_ST, and PEC_ST are applied after all the user-defined suffix tests.

Any number of suffix successor tests may be defined in the file. For the syntax of successor tests, see section 4.2 Test Syntax.

If no \st fields appear in the analysis data file, AMPLE still applies the built-in suffix tests SEC_ST, ADHOC_ST, and PEC_ST.
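
As an illustration of a user-defined suffix test with a body, consider the following sketch. The test name, the property mass (which would have to be defined with \mp or \ap), and the morphname PL are all hypothetical.

```
\st NoPluralAfterMass_ST
    NOT ( (left property is mass) AND (current morphname is "PL") )
```

As we read the grammar in section 4.2 Test Syntax, this test is meant to fail when the morpheme to the left has the property mass and the current morpheme is PL, and to succeed otherwise.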

4.1.29 Valid allomorph and string environment characters: \strcheck

The characters considered to be valid for allomorph strings and string environment constraints are defined by a \strcheck field code followed by the list of characters. Spaces are not significant in this list.

The \strcheck field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.

If no \strcheck fields appear in the analysis data file, then AMPLE does not check allomorph strings and string environment constraints for containing only valid characters.
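
For example, assuming a language written with the lowercase Latin letters plus an apostrophe and a hyphen (an assumption made only for this sketch), the field might read:

```
\strcheck abcdefghijklmnopqrstuvwxyz '-
```

Because spaces are not significant in the list, the space before the apostrophe merely improves readability and does not make the space character valid.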

4.2 Test Syntax

The remainder of this chapter presents grammatical descriptions of the syntax of tests and morpheme co-occurrence constraints in BNF notation. The following comments explain how to read the syntax rules given below:

Names shown inside wedges (<>) are nonterminal symbols. These must eventually be expanded into terminal symbols.
The symbol ::= means "is replaced by."
Items on the righthand side of the rule (following the ::=) that are not enclosed in wedges are terminal symbols, and appear in the rule exactly as they must appear in an AMPLE control file. Whitespace is largely optional; it is required only to separate identifiers and keywords. (Keywords are the alphabetic terminal symbols shown in the rules below.)
Alternative ways of replacing a nonterminal symbol are listed on separate lines.


 1.  <test>          ::= <identifier> <body>

 2a. <body>          ::= <body> <logop> <factor>
 2b.                     IF <factor> THEN <factor>
 2c.                     <forleft> <factor>
 2d.                     <forright> <factor>
 2e.                     <factor>

 3a. <factor>        ::= NOT <factor>
 3b.                     ( <body> )
 3c.                     <property_expr>
 3d.                     <string_expr>
 3e.                     <type_expr>
 3f.                     <category_expr>
 3g.                     <order_expr>
 3h.                     <cap_expr>
 3i.                     <punct_expr>

 4.  <property_expr> ::= <position> property is <identifier>

 5a. <string_expr>   ::= <position> morphname is <identifier>
 5b.                     <position> morphname is member <identifier>
 5c.                     <position> morphname is <position> morphname
 5d.                     <position> allomorph is <identifier>
 5e.                     <position> allomorph is member <identifier>
 5f.                     <position> allomorph is <position> allomorph
 5g.                     <position> allomorph matches <identifier>
 5h.                     <position> allomorph matches member <identifier>
 5i.                     <position> allomorph matches <position> allomorph
 5j.                     <position> surface is <identifier>
 5k.                     <position> surface is member <identifier>
 5l.                     <position> surface is <position> allomorph
 5m.                     <position> surface matches <identifier>
 5n.                     <position> surface matches member <identifier>
 5o.                     <position> surface matches <position> allomorph
 5p.                     <neighbor> word is <identifier>
 5q.                     <neighbor> word is member <identifier>
 5r.                     <neighbor> word matches <identifier>
 5s.                     <neighbor> word matches member <identifier>

 6.  <type_expr>     ::= <position> type is <type>

 7a. <category_expr> ::= <position> fromcategory is <position> fromcategory
 7b.                     <position> fromcategory is <position> tocategory
 7c.                     <position> tocategory is <position> fromcategory
 7d.                     <position> tocategory is <position> tocategory
 7e.                     <position> fromcategory is member <identifier>
 7f.                     <position> tocategory is member <identifier>
 7g.                     <position> fromcategory is <identifier>
 7h.                     <position> tocategory is <identifier>

 8a. <cap_expr>      ::= <position> allomorph is capitalized
 8b.                     word is capitalized

 9a. <order_expr>    ::= <position> orderclass <relop> <position> orderclass
 9b.                     <position> orderclass <relop> <constant>

10a. <punct_expr>    ::= <neighbor> punctuation is <identifier>
10b.                     <neighbor> punctuation is member <identifier>

11.  <logop>         ::= AND
                         OR
                         XOR
                         IFF

12.  <forleft>       ::= FOR_ALL_LEFT
                         FOR-ALL-LEFT
                         FORALLLEFT
                         FOR_SOME_LEFT
                         FOR-SOME-LEFT
                         FORSOMELEFT

13.  <forright>      ::= FOR_ALL_RIGHT
                         FOR-ALL-RIGHT
                         FORALLRIGHT
                         FOR_SOME_RIGHT
                         FOR-SOME-RIGHT
                         FORSOMERIGHT

14.  <neighbor>      ::= last
                         next

15.  <type>          ::= prefix
                         infix
                         root
                         suffix
                         initial
                         final

16.  <relop>        ::= =
                        >
                        >=
                        <=
                        <
                        ~=

17.  <position>     ::= left
                        right
                        current
                        LEFT
                        RIGHT
                        initial
                        final

18a. <identifier>   ::= "<word>"
18b.                    '<word>'
18c.                    .<word>.
18d.                    [<word>]
18e.                    <word>

19.  <word>         ::= <wchar>
                        <wchar><word>

20.  <wchar>        ::= one of the following characters:
                            !"#$%&'*+,-./0123456789:;?
                            @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
                            `abcdefghijklmnopqrstuvwxyz{}
                            \200-\376 (character codes 128-254)

21.  <constant>     ::= <number>
                        -<number>

22.  <number>       ::= <digit>
                        <digit><number>

23.  <digit>        ::= one of the following characters:  0123456789

4.2.0.1 Comments on selected BNF rules

1.: A test consists of an identifier followed by the body of the test. The identifier is the name by which a test is known. The body consists of the expressions which are interpreted to evaluate the test.
4.: The identifier in a property expression must be a property name defined with \mp or \ap in the analysis data file.
5a.: In a string expression involving morphnames, an identifier must be equal to some morphname; for example, left morphname is "PAST" indicates that the name of the morpheme to the left is PAST.
5b.: A member identifier in such expressions must be the name of a class of morphnames defined with \mcl in the analysis data file.
5dgjmpr.: In a string expression involving allomorphs, surface strings, or adjacent words, an identifier must be equal to some portion of a word after any orthography change has been applied. For example, left allomorph is "abadaba" indicates that the allomorph of the morpheme to the left is abadaba.
5ehknqs.: A member identifier in such expressions must be the name of a class of strings defined with \scl in the analysis data file.
5d-i.: If reference is made to left, LEFT or INITIAL, the allomorph is tested to see if it ends with the string; for example, left allomorph matches "ba" indicates that the allomorph of the morpheme to the left ends in ba. If reference is made to current, right, RIGHT, or FINAL, the allomorph is tested to see if it begins with the string.
5j-o.: If reference is made to left, LEFT or INITIAL, the surface string is tested to see if it ends with the given value. If reference is made to current, right, RIGHT, or FINAL, the surface string is tested to see if it begins with the given value.
5p-s.: If reference is made to last, the word is tested to see if it ends with the string. If reference is made to next, the word is tested to see if it begins with the string. (These should be avoided, and other means used to prune analyses based on adjacent words.)
6.: The type must be a keyword indicating whether the morpheme referred to is a prefix, an infix, a root, and so on (see rule 15).
7ef.: The identifier must be the name of a class of categories defined with \ccl in the analysis data file.
7gh.: The identifier must be a category defined with \ca in the analysis data file.
9b.: A constant is an integer between -32767 and (positive) 32767. The relational operator (relop) must be among those listed in rule 16.
10.: Punctuation expressions always refer to punctuation either immediately before or after the current word. A <neighbor> value of last refers to immediately before the current word and a <neighbor> value of next refers to immediately after the current word.
18a-d.: The quoted forms of an identifier are needed only if the identifier is the same as one of the AMPLE test keywords. It is recommended that the quoted identifier not contain the closing quote character.
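
The comments above can be illustrated with a few sketched tests. All test names, the string classes Nasal and V (which would have to be defined with \scl), and the morpheme named root are hypothetical.

```
\st EndsInBa_ST
    left allomorph matches "ba"
\st AfterNasal_ST
    IF (current allomorph matches member Nasal)
    THEN (left surface matches member V)
\st Keyword_ST
    NOT (left morphname is "root")
```

Following the comments on rules 5d-i and 5j-o, the first test checks that the allomorph to the left ends in ba. The second checks that, if the current allomorph begins with a member of Nasal, the surface string to the left ends with a member of V. In the third test the identifier is quoted because root is also a test keyword (rule 15).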

4.3 Morpheme Co-occurrence Constraint Syntax

This section presents a grammatical description of the syntax of morpheme co-occurrence constraints in BNF notation. These constraints are found either in the analysis data file (see section 4.1.16 Morpheme Co-occurrence Constraint: \mcc) or in a dictionary file (see section 7.12 Morpheme Co-occurrence Constraint (internal code Z)).


 1a. <constraint>   ::= <morphnames> <environments>
 1b.                    { <literal> } <morphnames> <environments>

 2a. <morphnames>   ::= <literal>
 2b.                    <literal> <morphnames>
 2c.                    [ <literal> ]
 2d.                    [ <literal> ] <morphnames>

 3a. <environments> ::= <environment>
 3b.                    <environment> <environments>

 4a. <environment>  ::= <marker> <leftside> <envbar> <rightside>
 4b.                    <marker> <leftside> <envbar>
 4c.                    <marker> <envbar> <rightside>

 5a. <leftside>     ::= <side>
 5b.                    <boundary>
 5c.                    <boundary> <side>
 5d.                    <side> # <side>
 5e.                    <boundary> <side> # <side>

 6a. <rightside>  ::= <side>
 6b.                  <boundary>
 6c.                  <side> <boundary>
 6d.                  <side> # <side>
 6e.                  <side> # <side> <boundary>

 7a. <side>       ::= <item>
 7b.                  <item> <side>
 7c.                  <item> ... <side>

 8a. <item>       ::= <piece>
 8b.                  ( <piece> )

 9a. <piece>      ::= ~ <piece>
 9b.                  <literal>
 9c.                  [ <literal> ]
 9d.                  { <literal> }

10.  <marker>     ::= /
                      +/

11.  <envbar>     ::= _
                      ~_

12.  <boundary>   ::= #
                      ~#

13.  <literal>    ::= one or more contiguous characters

4.3.0.1 Comments on selected BNF rules

1b.

A literal enclosed in braces is an arbitrary identifier for this morpheme co-occurrence constraint. (This feature was added to support LinguaLinks.)

2ab.

A literal is a morphname from one of the dictionary files.

2cd.

A literal enclosed in square brackets must be the name of a morpheme class defined by a \mcl field in the analysis data file.

5-6.

Note that what can appear to the left of the environment bar is a mirror image of what can appear to the right.

5de.

6de.

These should be avoided, and other means used to prune analyses based on adjacent words.

7c.

An ellipsis (...) indicates a possible break in contiguity.

8b.

Something enclosed in parentheses is optional.

9a.

A tilde (~) reverses the desirability of an element, causing the constraint to fail if it is found rather than fail if it is not found.

9b.

A literal is a morphname from one of the dictionary files.

9c.

A literal enclosed in square brackets must be the name of a morpheme class defined by a \mcl field in the analysis data file.

9d.

A literal enclosed in curly braces must be one of the following (checked in this order):

one of the keywords root, prefix, infix, or suffix
a property name defined by an \ap or \mp field in the analysis data file
a category name defined by a \ca field in the analysis data file
a category class name defined by a \ccl field in the analysis data file
a morpheme class name defined by a \mcl field in the analysis data file

10.

A / is usually used for string environment constraints, but may be used for morpheme environment constraints in \mcc fields in the analysis data file.

11.

A tilde (~) attached to the environment bar inverts the sense of the constraint as a whole.

12b.

The boundary marker preceded by a tilde (~#) indicates that it must not be a word boundary.

13.

The special characters used by environment constraints can be included in a literal only if they are immediately preceded by a backslash:


\+  \/  \#  \~  \[  \]  \(  \)  \{  \}  \.  \_  \\
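
A few morpheme co-occurrence constraints are sketched here as they might appear in the analysis data file. The morphnames PL, NEG, FUT, and DIM, the morpheme class NounStems (which would have to be defined with \mcl), and the identifier c1 are invented for illustration.

```
\mcc PL +/ [NounStems] _
\mcc NEG +/ _ ~FUT
\mcc {c1} DIM +/ {root} _
```

As we read the rules above, the first constraint requires PL to be immediately preceded by a member of the class NounStems. The second fails if NEG is immediately followed by FUT. The third carries the arbitrary identifier c1 (rule 1b) and requires DIM to be immediately preceded by a root (rule 9d).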

5. Dictionary Code Table File

The second control file read by AMPLE contains the dictionary code table. Each entry of an AMPLE dictionary (whether for roots, prefixes, infixes, or suffixes) is structured by field codes that indicate the type of information that follows. The dictionary code table maps the field codes used in the dictionary files onto the internal codes that AMPLE uses. This allows linguists to use their favorite dictionary field codes rather than constraining them to a predefined set.

The dictionary code table is divided into one or more sections, one for each type of dictionary file. Each section contains several mappings of field codes in the form of simple changes. The field codes used in the dictionary code table file are described in the remainder of this chapter.

5.1 Change standard format marker to internal code: \ch

A dictionary field code change is defined by \ch followed by two quoted strings. The first string is the field code used in the dictionary (including the leading backslash character). The second string is the single capital letter designating the field type. For the lists of dictionary field type codes, see section 7. Dictionary Files.

Any character not found in either the dictionary field code string or the dictionary field type code may be used as the quoting character. The double quote (") or single quote (') are most often used for this purpose.

5.2 Infix dictionary fields: \infix

The set of dictionary field code changes for an infix dictionary file begins with \infix, optionally followed by the record marker field code for the infix dictionary. If the record marker is not given, then the field code ("from string") from the first infix dictionary field code change is used. See section 7. Dictionary Files, for the set of infix dictionary field type codes.

5.3 Prefix dictionary fields: \prefix

The set of dictionary field code changes for a prefix dictionary file begins with \prefix, optionally followed by the record marker field code for the prefix dictionary. If the record marker is not given, then the field code ("from string") from the first prefix dictionary field code change is used. See section 7. Dictionary Files, for the set of prefix dictionary field type codes.

5.4 Root dictionary fields: \root

The set of dictionary field code changes for a root dictionary file begins with \root, optionally followed by the record marker field code for the root dictionary. If the record marker is not given, then the field code ("from string") from the first root dictionary field code change is used. See section 7. Dictionary Files, for the set of root dictionary field type codes.

5.5 Suffix dictionary fields: \suffix

The set of dictionary field code changes for a suffix dictionary file begins with \suffix, optionally followed by the record marker field code for the suffix dictionary. If the record marker is not given, then the field code ("from string") from the first suffix dictionary field code change is used. See section 7. Dictionary Files, for the set of suffix dictionary field type codes.

5.6 Unified dictionary fields: \unified

The set of dictionary field code changes for a unified dictionary file begins with \unified, optionally followed by the record marker field code for the unified dictionary. If the record marker is not given, then the field code ("from string") from the first unified dictionary field code change is used. See section 7. Dictionary Files, for the set of unified dictionary field type codes.
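
A dictionary code table with two sections might be sketched as follows. The dictionary field codes (\lx, \ps, \gl, \sfx, \cat, \mn) are invented for illustration. The internal codes A and C are those of the allomorph and category fields; M is assumed here to be the code for the field that supplies the morpheme identifier. See section 7. Dictionary Files, for the actual lists of codes.

```
\root \lx
\ch "\lx" "A"
\ch "\ps" "C"
\ch "\gl" "M"
\suffix
\ch "\sfx" "A"
\ch "\cat" "C"
\ch "\mn" "M"
```

The root section names \lx explicitly as the record marker. The suffix section gives no record marker, so the field code of its first change, \sfx, would be used.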

6. Dictionary Orthography Change Table File

The third control file read by AMPLE, and the first optional one, contains the dictionary orthography change table. This table maps the allomorph strings in the dictionary files into the internal orthographic representation. When the text and internal orthographies differ, it may be desirable to have the allomorphs in the dictionaries stored in the same orthography as the texts, or it may be desirable to have them in the internal form, or it might even be desirable to have them in a third form. AMPLE allows for any of these choices.

The dictionary orthography change table is defined by a special standard format file. This file contains a single record with two types of fields, either of which may appear any number of times. The rest of this chapter describes these fields, focusing on the syntax of the orthography changes.

6.1 Dictionary Orthography Change: \ch

An orthography change is defined by the \ch field code followed by the actual orthography change. Any number of orthography changes may be defined in the dictionary orthography change table. The output of each change serves as the input to the following change. That is, each change is applied as many times as necessary to a dictionary allomorph before the next change from the dictionary orthography change table is applied. See section 8.5 Text Orthography Change: \ch, for the syntax of orthography changes.

6.2 String class: \scl

Each \scl field defines a single string class. Any number of \scl fields may appear in the file. The only restriction is that a string class must be defined before it is used.

If no \scl fields appear in the dictionary orthography changes file, then AMPLE does not allow any string classes in dictionary orthography change environment constraints unless they are defined in the analysis data file.

7. Dictionary Files

This chapter describes the content of AMPLE dictionary files. These are normally divided into

a prefix dictionary file (if needed),
an infix dictionary file (if needed),
a suffix dictionary file (if needed), and
one or more root dictionary files.

With the `-u' command line option in conjunction with the \unified field in the dictionary code table file, the dictionary can be stored as one or more files containing entries of any type: prefix, infix, suffix, or root.

The following sections describe the different types of fields used in the different types of dictionary files. Remember, the mapping from the actual field codes used in the dictionary files to the type codes that AMPLE uses internally is controlled by the dictionary code table file (see section 5. Dictionary Code Table File).

7.1 Allomorph (internal code A)

Each dictionary entry must contain one or more allomorph fields. Each of these contains one of the morpheme's allomorphs, that is, the string of characters by which the morpheme is represented in text and recognized by AMPLE.

If an affix has multiple allomorphs, each one must be entered in its own allomorph field. These fields should be ordered with those on which the strictest constraints have been imposed preceding those with less strict or no constraints. The only exception to this is the use of indexed string classes to indicate reduplication. (See rules 26 and 27 below.)

Properties, constraints, and comments may follow the allomorph string. Any properties must be listed before any constraints. String, punctuation and morpheme environment constraints may be intermixed, but must come before any comments. A complete BNF grammar of an allomorph field is given below.


 1a. <allomorph_field> ::= <allomorph>
 1b.                       <allomorph> <properties>
 1c.                       <allomorph> <constraints>
 1d.                       <allomorph> <properties> <constraints>
 1e.                       <allomorph> <comment>
 1f.                       <allomorph> <properties> <comment>
 1g.                       <allomorph> <constraints> <comment>
 1h.                       <allomorph> <properties> <constraints> <comment>

 2a. <allomorph>       ::= <literal>
 2b.                       <literal> { <literal> }
 2c.                       <redup_pattern>
 2d.                       <redup_pattern> { <literal> }

 3a. <properties>      ::= <literal>
 3b.                       <literal> <properties>

 4a. <constraints>     ::= <string_constraint>
 4b.                       <morph_constraint>
 4c.                       <punct_constraint>
 4d.                       <string_constraint> <constraints>
 4e.                       <morph_constraint> <constraints>
 4f.                       <punct_constraint> <constraints>

 5.  <comment>         ::= <comment_char> anything to the end of the line

 6a. <string_constraint> ::= / <envbar> <string_right>
 6b.                         / <string_left> <envbar>
 6c.                         / <string_left> <envbar> <string_right>

 7a. <string_left>       ::= <string_side>
 7b.                         <boundary>
 7c.                         <boundary> <string_side>
 7d.                         <string_side> # <string_side>
 7e.                         <boundary> <string_side> # <string_side>

 8a. <string_right>      ::= <string_side>
 8b.                         <boundary>
 8c.                         <string_side> <boundary>
 8d.                         <string_side> # <string_side>
 8e.                         <string_side> # <string_side> <boundary>

 9a. <string_side>       ::= <string_item>
 9b.                         <string_item> <string_side>
 9c.                         <string_item> ... <string_side>

10a. <string_item>       ::= <string_piece>
10b.                         ( <string_piece> )

11a. <string_piece>      ::= ~ <string_piece>
11b.                         <literal>
11c.                         [ <literal> ]
11d.                         [ <indexed_literal> ]

12a. <morph_constraint>  ::= +/ <envbar> <morph_right>
12b.                         +/ <morph_left> <envbar>
12c.                         +/ <morph_left> <envbar> <morph_right>

13a. <morph_left>        ::= <morph_side>
13b.                         <boundary>
13c.                         <boundary> <morph_side>
13d.                         <morph_side> # <morph_side>
13e.                         <boundary> <morph_side> # <morph_side>

14a. <morph_right>       ::= <morph_side>
14b.                         <boundary>
14c.                         <morph_side> <boundary>
14d.                         <morph_side> # <morph_side>
14e.                         <morph_side> # <morph_side> <boundary>

15a. <morph_side>        ::= <morph_item>
15b.                         <morph_item> <morph_side>
15c.                         <morph_item> ... <morph_side>

16a. <morph_item>        ::= <morph_piece>
16b.                         ( <morph_piece> )

17a. <morph_piece>       ::= ~ <morph_piece>
17b.                         <literal>
17c.                         [ <literal> ]
17d.                         { <literal> }

18a. <punct_constraint>  ::= ./ <envbar> <punct_right>
18b.                         ./ <punct_left> <envbar>
18c.                         ./ <punct_left> <envbar> <punct_right>

19a. <punct_left>        ::= <punct_side>
19b.                         <boundary>
19c.                         <boundary> <punct_side>

20a. <punct_right>       ::= <punct_side>
20b.                         <boundary>
20c.                         <punct_side> <boundary>

21a. <punct_side>        ::= <punct_item>
21b.                         <punct_item> <punct_side>

22a. <punct_item>        ::= <punct_piece>
22b.                         ( <punct_piece> )

23a. <punct_piece>       ::= ~ <punct_piece>
23b.                         <literal>
23c.                         [ <literal> ]

24a. <envbar>            ::= _
24b.                         ~_

25a. <boundary>          ::= #
25b.                         ~#

26a. <redup_pattern>     ::= [ <indexed_literal> ]
26b.                         <literal> [ <indexed_literal> ]
26c.                         [ <indexed_literal> ] <literal>
26d.                         [ <indexed_literal> ] <redup_pattern>
26e.                         <redup_pattern> [ <indexed_literal> ]

27.  <indexed_literal>   ::= <literal> ^ <number>

28.  <literal>           ::= one or more contiguous characters

29.  <comment_char>      ::= character defined by `-c' command
                             line option, or | by default

30.  <number>            ::= one or more contiguous digits (0-9)

7.1.0.1 Comments on selected BNF rules

2.

The (first) literal string is a surface form representation of the morpheme. The literal string enclosed in braces is a unique allomorph identification string. (The identification string is a feature added to support LinguaLinks. It is not stored unless the `-b' command line option is used.)

3.

Each literal string is an allomorph property defined by a \ap field in the analysis data file.

4.

String, punctuation and morpheme constraints can be mixed together, but it is recommended that you group the string constraints together, the punctuation constraints together and the morpheme constraints together.

5.

A comment begins with a specified character and ends with the end of the line.

7-8.

Note that what can appear to the left of the environment bar is a mirror image of what can appear to the right.

7de.

8de.

These should be avoided, and other means used to prune analyses based on adjacent words.

9c.

An ellipsis (...) indicates a possible break in contiguity.

10b.

Something enclosed in parentheses is optional.

11a.

A tilde (~) reverses the desirability of an element, causing the constraint to fail if it is found rather than fail if it is not found.

11b.

A literal is matched against the surface form of the word.

11c.

A literal enclosed in square brackets must be the name of a string class defined by a \scl field in the analysis data file or the dictionary orthography change table file.

11d.

The indexed literal enclosed in square brackets must match an indexed literal given as part of the reduplication allomorph pattern. (See 2c, 2d, and 26.)

13-14.

Note that what can appear to the left of the environment bar is a mirror image of what can appear to the right.

13de.

14de.

These should be avoided, and other means used to prune analyses based on adjacent words.

15c.

An ellipsis (...) indicates a possible break in contiguity.

16b.

Something enclosed in parentheses is optional.

17a.

A tilde (~) reverses the desirability of an element, causing the constraint to fail if it is found rather than fail if it is not found.

17b.

A literal is a morphname from one of the dictionary files.

17c.

A literal enclosed in square brackets must be the name of a morpheme class defined by a \mcl field in the analysis data file.

17d.

A literal enclosed in curly braces must be one of the following (checked in this order):

one of the keywords root, prefix, infix, or suffix
a property name defined by an \ap or \mp field in the analysis data file
a category name defined by a \ca field in the analysis data file
a category class name defined by a \ccl field in the analysis data file
a morpheme class name defined by a \mcl field in the analysis data file

19-20.

Note that what can appear to the left of the environment bar is a mirror image of what can appear to the right.

22b.

Something enclosed in parentheses is optional.

23a.

A tilde (~) reverses the desirability of an element, causing the constraint to fail if it is found rather than fail if it is not found.

23b.

A literal is a punctuation character. No such punctuation character should be listed in the set of word formation characters (see section 8. Text Input Control File). The punctuation characters can match punctuation characters either before or after the current word. Unlike string constraints, punctuation constraints effectively ignore the position of the conditioned allomorph within the word. All that matters are any punctuation characters immediately preceding or following the current word. Note further that neither ellipses nor conditions that cross a word boundary are allowed.

24.

A tilde (~) attached to the environment bar inverts the sense of the constraint as a whole.

25b.

The boundary marker preceded by a tilde (~#) indicates that it must not be a word boundary.

26-27.

Although the BNF has spaces in it to improve readability, these two items cannot have embedded spaces in the dictionary file.

26.

The reduplication allomorph pattern contains references to string classes and possibly literal strings. The string class names are indexed to indicate identical shared values, either in the string environment constraint or in more than one location in the reduplication allomorph pattern itself. Note: this has been implemented only for AMPLE at this point.

27.

The literal (without the following index given by an ASCII caret (^) and a number) must be the name of a string class defined by a \scl field in the analysis data file or the dictionary orthography change table file.

28.

The special characters used by environment constraints can be included in a literal only if they are immediately preceded by a backslash:


\+  \/  \#  \~  \[  \]  \(  \)  \{  \}  \.  \_  \\

The allomorph field is used in all types of dictionary entries: prefix, infix, suffix, and root.

7.2 Category (internal code C)

Each dictionary entry must contain a category field. If multiple category fields exist, then their contents are merged together.

For affix entries, this field must contain at least one category pair for the morpheme, but may contain any number of category pairs separated by spaces or tabs. Each category pair consists of two category names separated by a slash (/). The category names must have been defined by a \ca field in the analysis data file. The first category is the from category, that is, the category of the unit to which this morpheme can be affixed. The second category is the to category, that is, the category of the result after this morpheme has been affixed.

For root entries, this field contains one or more morphological categories as defined by a \ca field in the analysis data file. If multiple categories are listed, they should be separated by spaces or tabs.

The category field is used in all types of dictionary entries: prefix, infix, suffix, and root.
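
The category-pair format for affix entries can be sketched as follows. This is only an illustration in Python, not AMPLE's own code; the function name `parse_affix_categories` and its `known` parameter (the set of names assumed to come from \ca fields) are hypothetical.

```python
def parse_affix_categories(field_text, known):
    """Split an affix category field into (from, to) category pairs."""
    pairs = []
    for token in field_text.split():          # pairs are separated by spaces or tabs
        from_cat, slash, to_cat = token.partition("/")
        if not slash or not from_cat or not to_cat:
            raise ValueError(f"not a category pair: {token!r}")
        for name in (from_cat, to_cat):
            if name not in known:              # names must already be defined
                raise ValueError(f"undefined category: {name!r}")
        pairs.append((from_cat, to_cat))
    if not pairs:
        raise ValueError("an affix entry needs at least one category pair")
    return pairs
```

Under these assumptions, a field containing `V/V N/V` would yield the pairs (V, V) and (N, V).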

7.3 Elsewhere Allomorph (internal code E)

For compatibility with STAMP, the "elsewhere" field defines an allomorph. In AMPLE, this field also provides a default value for the underlying form.

The syntax of the elsewhere allomorph field is the same as the syntax of the normal allomorph field. See section 7.1 Allomorph (internal code A).

The elsewhere allomorph field is used in all types of dictionary entries: prefix, infix, suffix, and root.

7.4 Feature Descriptor (internal code F)

The feature descriptor field is always optional. It contains the names of one or more features that are written verbatim to the \fd field of the output analysis file. It is not otherwise used by AMPLE.

If a dictionary entry contains multiple feature descriptor fields, their contents are merged together.

The feature descriptor field is used in all types of dictionary entries: prefix, infix, suffix, and root.

7.5 Root Gloss (internal code G)

The root gloss field contains an alternative morphname for writing to the output analysis file. It is enabled by the `-g' command line option. Without this command line option, it is totally ignored by AMPLE. See section 7.7 Morphname (internal code M). Only one root gloss field is allowed in each dictionary entry. If an entry has more than one root gloss field, then the first one is used and the others trigger an error message.

The root gloss field is used only in root dictionary entries.

7.6 Infix location (internal code L)

The infix location field serves to restrict where infixes may be found, and must be included in each infix dictionary entry. Subject to the constraints imposed by the infix location field, AMPLE searches the rest of the word for any occurrence of any allomorph string of the infix. This makes infixes rather expensive, computationally, so they should be constrained as much as possible.


 1.  <infix_location> ::= <types> <constraints>

 2a. <types>          ::= <type>
 2b.                      <type> <types>

 3a. <constraints>    ::= <environment>
 3b.                      <environment> <constraints>

 4a. <environment>    ::= <marker> <leftside> <envbar> <rightside>
 4b.                      <marker> <leftside> <envbar>
 4c.                      <marker> <envbar> <rightside>

 5a. <leftside>       ::= <side>
 5b.                      <boundary>
 5c.                      <boundary> <side>

 6a. <rightside>      ::= <side>
 6b.                      <boundary>
 6c.                      <side> <boundary>

 7a. <side>           ::= <item>
 7b.                      <item> <side>
 7c.                      <item> ... <side>

 8a. <item>           ::= <piece>
 8b.                      ( <piece> )

 9a. <piece>          ::= ~ <piece>
 9b.                      <literal>
 9c.                      [ <literal> ]

10a. <type>           ::= prefix
10b.                      root
10c.                      suffix

11a. <marker>         ::= /
11b.                      +/

12a. <envbar>         ::= _
12b.                      ~_

13a. <boundary>       ::= #
13b.                      ~#

14.  <literal>        ::= one or more contiguous characters

7.6.0.1 Comments on selected BNF rules

2.

The first part of the infix location field lists the type of morpheme in which the infix may be hidden. This consists of one or more of the words prefix, root, or suffix. If prefix is given, then AMPLE looks for infixes after exhausting the possible prefixes at a given point in the word, and resumes looking for more prefixes after finding an infix. Similarly, if root is given, then AMPLE looks for infixes after running out of roots while parsing the word, and if it finds an infix, it looks for more roots. Suffixes are treated the same way if suffix is given in the infix location field.

5.

A boundary marker (#) on the left side of the environment bar refers to the place in the word which the parse has reached before looking for infixes, not to the beginning of the word.

6.

A boundary marker (#) on the right side of the environment bar refers to the end of the word.

7c.

An ellipsis (...) indicates a possible break in contiguity.

8b.

Something enclosed in parentheses is optional.

9a.

A tilde (~) reverses the desirability of an element, causing the constraint to fail if it is found rather than fail if it is not found.

11.

A +/ is usually used for morpheme environment constraints, but may be used for infix location environment constraints as well.

12.

A tilde attached to the environment bar (~_) inverts the sense of the constraint as a whole.

13b.

The boundary marker preceded by a tilde (~#) indicates that it must not be a word boundary.

14.

The special characters used by environment constraints can be included in a literal only if they are immediately preceded by a backslash:


\+  \/  \#  \~  \[  \]  \(  \)  \{  \}  \.  \_  \\
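
When dictionary entries are generated by a script, the escaping rule above can be applied mechanically. The following Python sketch assumes the special characters are exactly those in the list above; the name `escape_literal` is hypothetical and not part of AMPLE.

```python
# Characters the manual lists as special in environment constraints.
SPECIAL = set("+/#~[](){}._\\")

def escape_literal(text):
    """Prefix each special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)

print(escape_literal("a.b_c"))   # a\.b\_c
```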

The infix location field is used only in infix dictionary entries.

7.7 Morphname (internal code M)

A morphname is an arbitrary name for a given morpheme. Only the first word (string of contiguous nonspace characters) following the morphname field code is used as the morphname. Morphnames must be less than 64 characters long.

A morphname serves two important functions:

It identifies a morpheme in morpheme environment constraints, morpheme co-occurrence constraints, ad hoc pairs, and tests.
It is the default morpheme identifier written to the output analysis file. See section 7.5 Root Gloss (internal code G).

Generally, a morphname is an identifier of a morpheme and does not need to faithfully represent that morpheme's meaning or function.

If a dictionary entry has more than one morphname field, the morphname from the first one is used; the others cause an error message. The morphname field is used in all types of dictionary entries: prefix, infix, suffix, and root. The usage differs somewhat between affix and root dictionary entries, so these two types of morphnames are described separately.

7.7.1 Affix morphnames

Every affix dictionary entry must have a morphname field. Users are strongly encouraged to observe the following suggestions in creating affix morphnames:

Make each morphname unique. If two morphemes have the same name, it is impossible to refer unambiguously to them. The same morphname should not be used in different affix dictionaries (that is, in the prefix dictionary and in the suffix dictionary).
Keep morphnames short. This reduces the size of analysis files and makes text glossing more aesthetically pleasing. For example, for a verbal person marker, use simply 1 rather than 1P unless there is good reason to add the P for person or possessive. For a first person object marker, 1O might serve as well as 1OBJ.
Use only uppercase alphabetic characters and numbers for contrast with root morphnames, which are generally made up of lowercase alphabetic characters. Be cautious in using hyphens, periods, underscores, slashes, backslashes, or other nonalphanumeric characters. The reason to avoid these is that other programs which apply to the resulting analysis may make use of nonalphanumerics in different ways.
Design a syntax of names and stick to it for inflectional morphemes which combine more than one semantic notion. For example, for Latin nominal inflections, which indicate gender, number, and case, the syntax might be
```
MORPHNAME = GENDER CASE NUMBER
```
where GENDER is M for masculine, F for feminine and N for neuter; CASE is N for nominative, A for accusative, G for genitive, and so on; and NUMBER is S for singular and P for plural. The name for masculine nominative singular would then be MNS.
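
A naming syntax of this kind is easy to enforce with a small helper when building a dictionary. The Python sketch below covers only the codes named above; the function and table names are hypothetical.

```python
GENDER = {"masculine": "M", "feminine": "F", "neuter": "N"}
CASE = {"nominative": "N", "accusative": "A", "genitive": "G"}   # other cases omitted
NUMBER = {"singular": "S", "plural": "P"}

def latin_morphname(gender, case, number):
    """Build MORPHNAME = GENDER CASE NUMBER from the three code tables."""
    return GENDER[gender] + CASE[case] + NUMBER[number]

print(latin_morphname("masculine", "nominative", "singular"))   # MNS
```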

7.7.2 Root morphnames

Root morphnames are generally either glosses or etymologies. Etymologies are frequently marked with a leading asterisk (*). (This is used by STAMP to indicate regular sound changes.)

If the morphname field contains only an asterisk, the morphname becomes an asterisk followed by whatever allomorph is matched. If the morphname field is omitted, or if it contains only a comment, AMPLE puts whatever allomorph was matched in the text into the analysis. If the morpheme contains any alternate forms, it is wise to include an explicit morphname field.
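
The rules for root morphnames can be summarized in a short Python sketch. It is an illustration of the behavior described here, not AMPLE's implementation; `root_morphname` is a hypothetical name, and `field_text` is assumed to be None when the entry has no morphname field.

```python
def root_morphname(field_text, matched_allomorph, comment_char="|"):
    """Resolve the morphname of a root from its morphname field."""
    if field_text is None:                       # field omitted
        return matched_allomorph
    words = field_text.split(comment_char, 1)[0].split()
    if not words:                                # field holds only a comment
        return matched_allomorph
    name = words[0]                              # only the first word is used
    if name == "*":                              # bare asterisk
        return "*" + matched_allomorph
    return name
```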

7.8 Order class (internal code O)

The order class of an affix is a number indicating its position relative to other morphemes. Prefixes should be assigned negative numbers and suffixes should be assigned positive numbers. Infixes should be assigned order class values appropriate to where they can appear in the word relative to the prefixes and suffixes.

If the order class field is omitted, then a default value of zero (0) is assigned to the affix. Order class values must be between -32767 and 32767.

Order classes are used only by tests in the analysis data file. They are needed only if appropriate tests are written to take advantage of them.

The order class field is used only in affix type dictionary entries: prefix, infix, and suffix. Roots always have an implicit order class of zero.
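
As a sketch of the default and the permitted range, the following Python function (with the hypothetical name `order_class`) returns zero when the field is missing and rejects values outside the range given above. How AMPLE itself reports an out-of-range value is not stated here, so the exception is an assumption.

```python
def order_class(field_text=None):
    """Return the order class of an affix; a missing field defaults to 0."""
    if field_text is None:
        return 0
    value = int(field_text.split()[0])
    if not -32767 <= value <= 32767:
        raise ValueError(f"order class out of range: {value}")
    return value
```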

7.9 Morpheme property (internal code P)

This field contains one or more morpheme properties. These properties must have been defined by a \mp field in the analysis data file. A morpheme property is inherited by all allomorphs of the morpheme.

The morpheme property field is optional, and may be repeated. If multiple properties apply to a morpheme, they may be given all in a single field or each in a separate field.

Morpheme properties typically indicate a characteristic of the morpheme which conditions the occurrence of allomorphs of an adjacent morpheme. Morpheme properties are used in tests defined in the analysis data file and in morpheme environment constraints.

The morpheme property field is used in all types of dictionary entries: prefix, infix, suffix, and root.

7.10 Morpheme type (internal code T)

In a unified dictionary, the type of an entry is determined by the first letter following the morpheme type field code: p or P for prefixes, i or I for infixes, s or S for suffixes, and r or R for roots. The morpheme type field is not needed for root entries because the entry type defaults to root.

The morpheme type field is used only in unified dictionary files, since the morpheme type is otherwise implicit.

7.11 Underlying Form (internal code U)

The underlying form field contains information for writing to \u fields in the output analysis file. If a mapping from a dictionary field code to internal code U is not defined in the dictionary code table file, then this field effectively does not exist.

Only one underlying form field is allowed in each dictionary entry. If an entry has more than one underlying form field, then the first one is used and the others trigger an error message.

If a particular record in a dictionary file does not have an underlying form field, but does use an "elsewhere" field (see section 7.3 Elsewhere Allomorph (internal code E)), then AMPLE uses the elsewhere entry for the underlying form. If an entry has neither an underlying form field nor an elsewhere field, AMPLE assumes that the underlying form is null and will output a zero (0) for the underlying form.
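
The fallback order just described can be written as a three-step choice. This Python sketch uses hypothetical names and represents a missing field as None.

```python
def underlying_form(u_field=None, elsewhere=None):
    """Pick the underlying form: U field, else elsewhere field, else "0"."""
    if u_field is not None:
        return u_field
    if elsewhere is not None:
        return elsewhere
    return "0"          # null underlying form is written as a zero
```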

The underlying form field is used in all types of dictionary entries: prefix, infix, suffix, and root.

7.12 Morpheme Co-occurrence Constraint (internal code Z)

See section 4.1.16 Morpheme Co-occurrence Constraint: \mcc, for a description of morpheme co-occurrence constraint fields in the analysis data file. These fields can also occur in dictionary entries. This is appropriate only if the constraint is about that morpheme.

One difference between morpheme co-occurrence constraints in the analysis data file and those found in dictionary entries is that the field code in the dictionary file is not necessarily \mcc. The primary difference is that morpheme co-occurrence constraints found in a dictionary entry are stored with the dictionary entry in memory, and those found in the analysis data file are stored together in one long list. If a constraint applies to more than one morpheme, it must be put in the analysis data file to work properly.

The morpheme co-occurrence constraint field is optional. If more than one constraint applies to the morpheme, as many of these fields as desired may be included.

The morpheme co-occurrence constraint field is used in all types of dictionary entries: prefix, infix, suffix, and root.

7.13 Do not load (internal code !)

When a "do not load" field is included in a record, AMPLE ignores the record altogether. This makes it possible to include records in the dictionary for linguistic purposes, while not needlessly taking up memory space if the dictionary is used for some other purpose.

The "do not load" field is used in all types of dictionary entries: prefix, infix, suffix, and root.

8. Text Input Control File

This chapter describes the expected characteristics of an input text file, and the options offered for describing these characteristics by a text input control file.(1)

8.1 Input text files

Text input control files define a simple model of input text files. They are plain text files with two types of embedded format markers.

A primary format marker consists of one or more contiguous characters beginning with a special flag character. The default character initiating format markers is the backslash (\). Thus, each of the following would be recognized as a format marker and would not be processed by the program:
```
\
\p
\sp
\begin{enumerate}
\very-long.and;muddled/format*marker,to#be$sure
```
Note that format markers cannot have a space or tab embedded in them; the first space or tab encountered terminates the format marker. One final note: the format character under discussion here applies only to the input text files which are to be processed. It has absolutely nothing to do with the use of backslash (\) to flag field codes in control files such as the text input control file.
A secondary type of marker consists of a flag character followed by a single character from a list of known values. This secondary flag character must be different than the primary flag character. Its default value is the vertical bar (|), causing this type of format marker to be frequently called a bar code. The following could be valid (secondary) format markers and would not be processed by the program:
```
|b
|i
|r
```

Consider the following two lines of input text:


\bgoodbye\r
|bgoodbye|r

Using the default definitions of format markers, the first line is considered to be a single format marker, and provides nothing which the program should try to parse. The second line, however, contains two format markers, |b and |r, and the word goodbye, which would be processed by the program.
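
The distinction between the two kinds of marker can be illustrated with a small scanner. The Python sketch below is not AMPLE's code: `split_markers` is a hypothetical name, it handles a single line, and it assumes that a bar character followed by an unlisted character is ordinary text.

```python
DEFAULT_BARCODES = "bdefhijmrsuvyz"

def split_markers(line, primary="\\", barchar="|", barcodes=DEFAULT_BARCODES):
    """Return a list of ("marker", text) and ("text", text) pieces."""
    pieces, text, i = [], "", 0

    def flush():
        nonlocal text
        if text:
            pieces.append(("text", text))
            text = ""

    while i < len(line):
        ch = line[i]
        nxt = line[i + 1:i + 2]
        if ch == primary:
            # A primary marker runs up to the next space or tab.
            j = i + 1
            while j < len(line) and line[j] not in " \t":
                j += 1
            flush()
            pieces.append(("marker", line[i:j]))
            i = j
        elif barchar and ch == barchar and nxt and nxt in barcodes:
            # A bar code is exactly two characters long.
            flush()
            pieces.append(("marker", line[i:i + 2]))
            i += 2
        else:
            text += ch
            i += 1
    flush()
    return pieces
```

Applied to the two sample lines, this sketch returns one marker for the first line, and the pieces |b, goodbye, and |r for the second.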

The primary format markers serve to divide the text into fields. See section 8.7 Fields to Exclude: \excl and section 8.9 Fields to Include: \incl for details on how these fields are used. There is no requirement that the format markers be at the beginning of a line as with the field codes used in AMPLE control files.

8.2 Ambiguity Marker Character: \ambig

The \ambig field defines the character used to mark ambiguities and failures in the analysis output file. For example, to use the hash mark (#), the text input control file would include:


\ambig  #

This would cause an ambiguous analysis to be output as follows:


\a #3#< N0 kay >#< V1 ka > IMP#< V1 ka > INF#

It makes sense to use the \ambig field only once in the text input control file. If multiple \ambig fields do occur in the file, the value given in the first one is used. If the text input control file does not have an \ambig field, the percent sign (%) is used.

The first printing character following the \ambig field code is used as the ambiguity marker. The character currently being used to mark comments cannot be assigned to also mark ambiguities in the output file. Thus, the vertical bar (|) cannot normally be used as the ambiguity marker. Logically, this field should be in the analysis data file rather than the text input control file since it affects output instead of input. Nevertheless, compatibility demands that it stays this way.
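
Judging only from the sample output shown above, an ambiguous analysis is written as the marker, the number of analyses, and then each analysis set off by the marker. The Python sketch below reproduces that layout; it is inferred from the single example, uses the hypothetical name `mark_ambiguity`, and does not cover analysis failures or unambiguous words.

```python
def mark_ambiguity(analyses, marker="%"):
    """Lay out two or more competing analyses in the style of the example."""
    return marker + str(len(analyses)) + marker + marker.join(analyses) + marker

print(mark_ambiguity(["< N0 kay >", "< V1 ka > IMP", "< V1 ka > INF"], "#"))
# #3#< N0 kay >#< V1 ka > IMP#< V1 ka > INF#
```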

8.3 Bar code format marker character: \barchar

The \barchar field defines the character that begins a two-character secondary format marker. For example, if this type of format marker begins with the dollar sign ($), the following would be placed in the text input control file:


\barchar  $

An empty \barchar field in the text input control file prevents any bar code format markers from being recognized. Thus, the following field effectively turns off special treatment of this style of format marking (assuming the | is marking comments):


\barchar       | no bar character

It makes sense to use the \barchar field only once in the text input control file. If multiple \barchar fields do occur in the file, the value given in the first one is used.

The first printing character following the \barchar field code is used as the bar code format marker. The character currently being used to mark comments cannot be assigned to also flag format markers in input text files. Thus, the default value (|) cannot normally be explicitly defined (since \barchar | is treated as \barchar followed only by a comment), so it must be taken as given.

8.4 Bar Code Format Code Characters: \barcodes

In conjunction with the special format marking character discussed in the previous section, the \barcodes field defines the individual characters used within bar codes. These characters may be separated by spaces or lumped together. Thus, the following two fields are equivalent:


\barcodes    abcdefg         | lumped together
\barcodes    a b c d e f g   | separated

If more than one \barcodes field is provided in the text input control file, the combination of all characters defined in all such fields is used. No check is made for repeated characters: the previous example would be accepted without complaint despite the redundancy of the second line.

The default value for the bar codes is bdefhijmrsuvyz. Therefore, if the text input control file contains neither a \barchar nor a \barcodes field, the following bar codes are considered to be formatting information by AMPLE: |b, |d, |e, |f, |h, |i, |j, |m, |r, |s, |u, |v, |y, and |z. These are exactly the codes recognized by the SIL Manuscripter program that was in vogue when the concept of a text input control file was originally developed.

8.5 Text Orthography Change: \ch

An orthography change is defined by the \ch field code followed by the actual orthography change. Any number of orthography changes may be defined in the text input control file. The output of each change serves as the input to the following change. That is, each change is applied as many times as necessary to an input word before the next change from the text input control file is applied.

8.5.1 Basic changes

To substitute one string of characters for another, these must be made known to the program in a change. (The technical term for this sort of change is a production, but we will simply call them changes.) In the simplest case, a change is given in three parts: (1) the field code \ch must be given at the extreme left margin to indicate that this line contains a change; (2) the match string is the string for which the program must search; and (3) the substitution string is the replacement for the match string, wherever it is found.

The beginning and end of the match and substitution strings must be marked. The first printing character following \ch (with at least one space or tab between) is used as the delimiter for that line. The match string is taken as whatever lies between the first and second occurrences of the delimiter on the line and the substitution string is whatever lies between the third and fourth occurrences. For example, the following lines indicate the change of hi to bye, where the delimiters are the double quote mark ("), the single quote mark ('), the period (.), and the at sign (@).

\ch "hi" "bye"
\ch 'hi' 'bye'
\ch .hi. .bye.
\ch @hi@ @bye@

Throughout this document, we use the double quote mark as the delimiter unless there is some reason to do otherwise.
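
The delimiter rule can be sketched in a few lines of Python. This is an illustration under the rule as stated, not AMPLE's parser; `parse_change` is a hypothetical name, and the sketch handles only changes in which both strings are delimited.

```python
def parse_change(line):
    """Return (match, substitution) from a basic change line."""
    body = line.split(None, 1)[1]    # text after the field code
    delim = body[0]                  # first printing character is the delimiter
    parts = body.split(delim)
    # parts[1] lies between the 1st and 2nd delimiter,
    # parts[3] between the 3rd and 4th.
    if len(parts) < 5:
        raise ValueError("expected four occurrences of the delimiter")
    return parts[1], parts[3]
```

Because only the delimiter positions matter, anything written between the two strings, or after the second one, is ignored by this sketch.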

Change tables follow these conventions:

Any characters (other than the delimiter) may be placed between the match and substitution strings. This allows various notations to symbolize the change. For example, the following are equivalent:
```
\ch "thou" "you"
\ch "thou" to "you"
\ch "thou" > "you"
\ch "thou" --> "you"
\ch "thou" becomes "you"
```
Comments included after the substitution string are initiated by a vertical bar (|), or whatever is indicated as the comment character by means of the -c option when AMPLE is started. The following lines illustrate the use of comments:
```
\ch "qeki" "qiki" | for cases like wawqeki
\ch "thou" "you"  | for modern English
```
A change can be ignored temporarily by turning it into a comment field. This is done either by placing an unrecognized field code in front of the normal \ch, or by placing the comment character (|) in front of it. For example, only the first of the following three lines would effect a change:
```
\ch "nb" "mp"
\no \ch "np" "np"
|\ch "mb" "nb"
```

The changes in the text input control file are applied as an ordered set of changes. The first change is applied to the entire word by searching from left to right for any matching strings and, upon finding any, replacing them with the substitution string. After the first change has been applied to the entire word, then the next change is applied, and so on. Thus, each change applies to the result of all prior changes. When all the changes have been applied, the resulting word is returned. For example, suppose we have the following changes:

\ch "aib" > "ayb"
\ch "yb"  > "yp"

Consider the effect these have on the word paiba. The first changes i to y, yielding payba; the second changes b to p, to yield paypa. (This would be better than the single change of aib to ayp if there were sources of yb other than the output of the first rule.)
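
Ordered application of unconstrained changes can be imitated with plain string replacement. The Python sketch below is an approximation for illustration only; `apply_changes` is a hypothetical name.

```python
def apply_changes(word, changes):
    """Apply each (match, substitution) pair, in order, to the whole word."""
    for match, substitution in changes:
        # str.replace scans left to right and replaces every occurrence.
        word = word.replace(match, substitution)
    return word

changes = [("aib", "ayb"), ("yb", "yp")]
print(apply_changes("paiba", changes))   # paypa
```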

The way in which change tables are applied allows certain tricks. For example, suppose that for Quechua, we wish to change hw to f, so that hwista becomes fista and hwis becomes fis. However, we do not wish to change the sequence shw or chw to sf or cf (respectively). This could be done by the following sequence of changes. (Note, @ and $ are not otherwise used in the orthography.)

\ch "shw" > "@"     | (1)
\ch "chw" > "$"      | (2)
\ch "hw"  > "f"      | (3)
\ch "@"   > "shw"   | (4)
\ch "$"   > "chw"    | (5)

Lines (1) and (2) protect shw and chw by changing them to distinguished symbols. This clears the way for the change of hw to f in (3). Then lines (4) and (5) restore @ and $ to shw and chw, respectively. (An alternative, simpler way to do this is discussed in the next section.)

8.5.2 Environmentally constrained changes

It is possible to impose string environment constraints (SECs) on changes in the orthography change tables. The syntax of SECs is described in detail in section .

For example, suppose we wish to change the mid vowels (e and o) to high vowels (i and u respectively) immediately before and after q. This could be done with the following changes:

\ch "o" "u"  / _ q  / q _
\ch "e" "i"  / _ q  / q _

This is not entirely a hypothetical example; some Quechua practical orthographies write the mid vowels e and o. However, in the environment of /q/ these could be considered phonemically high vowels /i/ and /u/. Changing the mid vowels to high upon loading texts has the advantage that--for cases like upun "he drinks" and upoq "the one who drinks"--the root needs to be represented internally only as upu "drink". But note, because of Spanish loans, it is not possible to change all cases of e to i and o to u. The changes must be conditioned.

In reality, the regressive vowel-lowering effect of /q/ can pass over various intervening consonants, including /y/, /w/, /l/, /ll/, /r/, /m/, /n/, and /n~/. For example, /ullq/ becomes ollq, /irq/ becomes erq, and so on. Rather than list each of these cases as a separate constraint, it is convenient to define a class (which we label +resonant) and use this class to simplify the SEC. Note that the string class must be defined (with the \scl field code) before it is used in a constraint.

\scl +resonant y w l ll r m n n~
\ch "o" "u" / q _ / _ ([+resonant]) q
\ch "e" "i" / q _ / _ ([+resonant]) q

This says that the mid vowels become high vowels after /q/ and before /q/, possibly with an intervening /y/, /w/, /l/, /ll/, /r/, /m/, /n/, or /n~/.

Consider the problem posed for Quechua in the previous section, that of changing hw to f. An alternative is to condition the change so that it does not apply adjacent to a member of the string class Affric which contains s and c.

\scl Affric c s
\ch "hw" "f" / [Affric] ~_
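
For readers more familiar with regular expressions, the effect of this negated constraint resembles a negative lookbehind. The Python sketch below is only an analogy under that reading of the constraint; it is not how AMPLE evaluates constraints, and `hw_to_f` is a hypothetical name.

```python
import re

def hw_to_f(word):
    # (?<![cs]) is a negative lookbehind: the character before "hw"
    # must not be c or s, the members of the string class Affric.
    return re.sub(r"(?<![cs])hw", "f", word)

print(hw_to_f("hwista"))   # fista
```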

It is sometimes convenient to make certain changes only at word boundaries, that is, to change a sequence of characters only if they initiate or terminate the word. This conditioning is easily expressed, as shown in the following examples.

\ch "this" "that"           | anywhere in the word
\ch "this" "that"  / # _    | only if word initial
\ch "this" "that"  /   _ #  | only if word final
\ch "this" "that"  / # _ #  | only if entire word
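
The four cases above correspond to anchoring the match at the start of the word, the end of the word, both, or neither. The Python sketch below mimics this with regular expression anchors; the name `change_at_boundary` and its keyword parameters are hypothetical.

```python
import re

def change_at_boundary(word, match, substitution, initial=False, final=False):
    """Replace match, optionally only word-initially and/or word-finally."""
    pattern = ("^" if initial else "") + re.escape(match) + (r"\Z" if final else "")
    # A function is used as the replacement so that backslashes in the
    # substitution string are taken literally.
    return re.sub(pattern, lambda m: substitution, word)
```

With neither flag set, every occurrence in the word is changed; with both set, the change applies only when the match string is the entire word.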

8.5.3 Using text orthography changes

The purpose of orthography change is to convert text from an external orthography to an internal representation more suitable for morphological analysis. In many cases this is unnecessary, the practical orthography being completely adequate as the internal representation. In other cases, the practical orthography is an inconvenience that can be circumvented by converting to a more phonemic representation.

Let us take a simple example from Latin. In the Latin orthography, the nominative singular masculine of the word "king" is rex. However, phonemically, this is really /reks/; /rek/ is the root meaning king and the /s/ is an inflectional suffix. If the program is to recover such an analysis, then it is necessary to convert the x of the external, practical orthography into ks internally. This can be done by including the following orthography change in the text input control file:

\ch  "x"  "ks"

In this, x is the match string and ks is the substitution string, as discussed in section . Whenever x is found, ks is substituted for it.

Let us consider next an example from Huallaga Quechua. The practical orthography currently represents long vowels by doubling the vowel. For example, what is written as kaa is /ka:/ "I am", where the length (represented by a colon) is the morpheme meaning "first person subject". Other examples, such as upoo /upu:/ "I drink" and upichee /upi-chi-:/ "I extinguish", motivate us to convert all long vowels into a vowel followed by a colon. The following changes do this:

\ch  "aa"  "a:"
\ch  "ee"  "i:"
\ch  "ii"  "i:"
\ch  "oo"  "u:"
\ch  "uu"  "u:"

Note that the long high vowels (i and u) have become mid vowels (e and o respectively); consequently, the vowel in the substitution string is not necessarily the same as that of the match string.

What is the utility of these changes? In the lexicon, the morphemes can be represented in their phonemic forms; they do not have to be represented in all their orthographic variants. For example, the first person subject morpheme can be represented simply as a colon (-:), rather than as -a in cases like kaa, as -o in cases like qoo, and as -e as in cases like upichee. Further, the verb "drink" can be represented as upu and the causative suffix (in upichee) can be represented as -chi; these are the forms these morphemes have in other (nonlowered) environments.

As the next example, let us suppose that we are analyzing Spanish, and that we wish to work internally with k rather than c (before a, o, and u) and qu (before i and e). (Of course, this is probably not the only change we would want to make.) Consider the following changes:

\ch  "ca"  "ka"
\ch  "co"  "ko"
\ch  "cu"  "ku"
\ch  "qu"  "k"

The first three handle c and the last handles qu. By virtue of including the vowel after c, we avoid changing ch to kh. There are other ways to achieve the same effect. One way exploits the fact that each change is applied to the output of all previous changes. Thus, we could first protect ch by changing it to some distinguished character (say @), then changing c to k, and then restoring @ to ch:

\ch  "ch"  "@"
\ch  "c"  "k"
\ch  "@"  "ch"
\ch  "qu"  "k"

Another approach conditions the change by the adjacent characters. The changes could be rewritten as

\ch  "c"  "k"  / _a  / _o  / _u  | only before a, o, or u
\ch  "qu"  "k"                   | in all cases

The first change says, "change c to k when followed by a, o, or u." (This would, for example, change como to komo, but would not affect chal.) The syntax of such conditions is exactly that used in string environment constraints; see section .

8.5.4 Where orthography changes apply

Input orthography changes are made when the text being processed may be written in a practical orthography. Rather than requiring that it be converted as a prerequisite to running the program, it is possible to have the program convert the orthography as it loads and before it processes each word.

The changes loaded from the text input control file are applied after all the text is converted to lower case (and the information about upper and lower case, along with information about format marking, punctuation, and white space, has been put to one side). Consequently, the match strings of these orthography changes should be all lower case; any change that has an uppercase character in the match string will never apply.

8.5.5 A sample orthography change table

We include here the entire orthography input change table for Caquinte (a language of Peru). There are basically four changes that need to be made: (1) nasals, which in the practical orthography reflect their assimilation to the point of articulation of a following noncontinuant, must be changed into an unspecified nasal, represented by N; (2) c and qu are changed to k; (3) j is changed to h; and (4) gu is changed to g before i and e.

\ch  "mp"  "Np"     | for unspecified nasals
\ch  "nch" "Nch"
\ch  "nc"  "Nk"
\ch  "nqu" "Nk"
\ch  "nt"  "Nt"

\ch  "ch"  "@"     | to protect ch
\ch  "c"   "k"      | other c's to k
\ch  "@"   "ch"    | to restore ch
\ch  "qu"  "k"

\ch  "j"   "h"

\ch  "gue" "ge"
\ch  "gui" "gi"

This change table can be simplified by the judicious use of string environment constraints:

\ch  "m"  >  "N"  / _p
\ch  "n"  >  "N"  / _c  / _t  / _qu

\ch  "c"  >  "k"  / _~h
\ch  "qu" >  "k"

\ch  "j"  >  "h"

\ch  "gu" >  "g"  / _e  / _i

As suggested by the preceding examples, the text orthography change table is composed of all the \ch fields found in the text input control file. These may appear anywhere in the file relative to the other fields. It is recommended that all the orthography changes be placed together in one section of the text input control file, rather than being mixed in with other fields.
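
To check that the two Caquinte tables behave alike, they can be mimicked in Python. This is a rough sketch, not AMPLE's matching algorithm: the environments are rendered as regular expression lookaheads, which is an assumption about their meaning, and the test strings are invented letter sequences rather than attested Caquinte words.

```python
import re

# First table: plain replacements applied in order.
PLAIN = [
    ("mp", "Np"), ("nch", "Nch"), ("nc", "Nk"), ("nqu", "Nk"), ("nt", "Nt"),
    ("ch", "@"), ("c", "k"), ("@", "ch"), ("qu", "k"),
    ("j", "h"),
    ("gue", "ge"), ("gui", "gi"),
]

# Second table: each environment written as a lookahead (assumed equivalent).
CONDITIONED = [
    (r"m(?=p)", "N"),        # / _p
    (r"n(?=c|t|qu)", "N"),   # / _c  / _t  / _qu
    (r"c(?!h)", "k"),        # / _~h
    (r"qu", "k"),
    (r"j", "h"),
    (r"gu(?=[ei])", "g"),    # / _e  / _i
]


def apply_plain(word):
    for match, replacement in PLAIN:
        word = word.replace(match, replacement)
    return word


def apply_conditioned(word):
    for pattern, replacement in CONDITIONED:
        word = re.sub(pattern, replacement, word)
    return word


# Invented test strings that exercise each change.
SAMPLES = ["ampa", "anca", "anchi", "anqui", "anta", "guejo", "chaqui"]

for sample in SAMPLES:
    print(sample, apply_plain(sample), apply_conditioned(sample))
```

On these samples both functions give the same result, which is the sense in which the second table is a simplification of the first.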

8.5.6 Syntax of Orthography Changes

This section presents a grammatical description of the syntax of orthography changes in BNF notation. These changes are found either in the dictionary orthography change table file or in the text input control file (see section 6.1 Dictionary Orthography Change: \ch).


 1a. <orthochange>  ::= <basic_change>
 1b.                    <basic_change> <constraints>

 2a. <basic_change> ::= <quote><quote> <quote><string><quote>
 2b.                    <quote><string><quote> <quote><quote>
 2c.                    <quote><string><quote> <quote><string><quote>

 3.  <quote>        ::= any printing character not used in either
                        the "from" string or the "to" string

 4.  <string>       ::= one or more characters other than the quote
                        character used by this orthography change

 5a. <constraints>  ::= <change_envir>
 5b.                    <change_envir> <constraints>

 6a. <change_envir> ::= <marker> <leftside> <envbar> <rightside>
 6b.                    <marker> <leftside> <envbar>
 6c.                    <marker> <envbar> <rightside>

 7a. <leftside>   ::= <side>
 7b.                  <boundary>
 7c.                  <boundary> <side>

 8a. <rightside>  ::= <side>
 8b.                  <boundary>
 8c.                  <side> <boundary>

 9a. <side>       ::= <item>
 9b.                  <item> <side>
 9c.                  <item> ... <side>

10a. <item>       ::= <piece>
10b.                  ( <piece> )

11a. <piece>      ::= ~ <piece>
11b.                  <literal>
11c.                  [ <literal> ]

12.  <marker>     ::= /
                      +/

13.  <envbar>     ::= _
                      ~_

14.  <boundary>   ::= #
                      ~#

15.  <literal>    ::= one or more contiguous characters

8.5.6.1 Comments on selected BNF rules

2.

The same <quote> character must be used at both the beginning and the end of both the "from" string and the "to" string.

3.

The double quote (") and single quote (') characters are most often used.

7-8.

Note that what can appear to the left of the environment bar is a mirror image of what can appear to the right.

9c.

An ellipsis (...) indicates a possible break in contiguity.

10b.

Something enclosed in parentheses is optional.

11a.

A tilde (~) reverses the desirability of an element, causing the constraint to fail if it is found rather than fail if it is not found.

11c.

A literal enclosed in square brackets must be the name of a string class defined by a \scl field in the analysis data file, or earlier in the dictionary orthography change file.

12.

A +/ is usually used for morpheme environment constraints, but may be used for change environment constraints in \ch fields in the dictionary orthography change table file.

13.

A tilde attached to the environment bar (~_) inverts the sense of the constraint as a whole.

14b.

The boundary marker preceded by a tilde (~#) indicates that it must not be a word boundary.

15.

The special characters used by environment constraints can be included in a literal only if they are immediately preceded by a backslash:


\+  \/  \#  \~  \[  \]  \(  \)  \.  \_  \\
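
When environment constraints are generated by a script, literals need this escaping applied. The following Python helper is a minimal sketch based only on the list of special characters above; the name escape_literal is hypothetical:

```python
# The characters listed above as needing a preceding backslash.
SPECIAL_CHARACTERS = set("+/#~[]()._\\")


def escape_literal(text):
    """Prefix each special character in text with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARACTERS else ch for ch in text)


print(escape_literal("a.b"))
print(escape_literal("n~"))
```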

8.6 Decomposition Separation Character: \dsc

The \dsc field defines the character used to separate the morphemes in the decomposition field of the output analysis file. For example, to use the equal sign (=), the text input control file would include:


\dsc  =

This would cause a decomposition field to be output as follows:


\d %3%kay%ka=y%ka=y%

It makes sense to use the \dsc field only once in the text input control file. If multiple \dsc fields do occur in the file, the value given in the first one is used. If the text input control file does not have an \dsc field, a dash (-) is used.

The first printing character following the \dsc field code is used as the morpheme decomposition separator character. The same character cannot be used both for separating decomposed morphemes in the analysis output file and for marking comments in the input control files. Thus, one normally cannot use the vertical bar (|) as the decomposition separation character.

Logically, this field should be in the analysis data file rather than the text input control file since it affects output instead of input. Nevertheless, compatibility demands that it stays this way.

8.7 Fields to Exclude: \excl

The \excl field excludes one or more fields from processing. For example, to have the program ignore everything in \co and \id fields, the following line is included in the text input control file:


\excl  \co  \id      | ignore these fields

If more than one \excl field is found in the text input control file, the contents of each field are added to the overall list of text fields to exclude. This list is initially empty, and stays empty unless the text input control file contains an \excl field. Thus, no text fields are excluded from processing by default.

If the text input control file contains \excl fields, then only those text fields are not processed. Every word in every text field not mentioned explicitly in an \excl field will be processed.

Note that every text field in the input text files is processed unless the text input control file contains either an \excl or an \incl field. One or the other is used to limit processing, but never both.

8.8 Primary format marker character: \format

The \format field designates a single character to flag the beginning of a primary format marker. For example, if the format markers in the text files begin with the at sign (@), the following would be placed in the text input control file:


\format  @

This would be used, for example, if the text contained format markers like the following:


@
@p
@sp
@make(Article)
@very-long.and;muddled/format*marker,to#be$sure

If a \format field occurs in the text input control file without a following character to serve for flagging format markers, then the program will not recognize any format markers and will try to parse everything other than punctuation characters.

It makes sense to use the \format field only once in the text input control file. If multiple \format fields do occur in the file, the value given in the first one is used.

The first printing character following the \format field code is used to flag format markers. The character currently used to mark comments cannot be assigned to also flag format markers. Thus, the vertical bar (|) cannot normally be used to flag format markers.

8.9 Fields to Include: \incl

The \incl field explicitly includes one or more text fields for processing, excluding all other fields. For instance, to process everything in \txt and \qt fields, but ignore everything else, the following line is placed in the text input control file:


\incl  \txt  \qt      | process these fields

If more than one \incl field is found in the text input control file, the contents of each field are added to the overall list of text fields to process. This list is initially empty, and stays empty unless the text input control file contains an \incl field.

If the text input control file contains \incl fields, then only those text fields are processed. Every word in every text field not mentioned explicitly in an \incl field will not be processed.
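
The combined effect of \excl and \incl can be summarized as a small decision function. This Python sketch is illustrative only; should_process is a hypothetical name, and raising an error when both lists are given is an assumption, since this manual says only that the two are never used together:

```python
def should_process(field, included=(), excluded=()):
    """Decide whether a text field is processed.

    included: field codes collected from all \\incl fields
    excluded: field codes collected from all \\excl fields
    """
    if included and excluded:
        # Assumption: treated as an error here; AMPLE's behavior is not stated.
        raise ValueError("use either \\incl or \\excl, never both")
    if included:
        return field in included
    if excluded:
        return field not in excluded
    return True  # with neither list, every text field is processed


print(should_process("\\txt", included=["\\txt", "\\qt"]))
print(should_process("\\co", excluded=["\\co", "\\id"]))
print(should_process("\\anything"))
```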

8.10 Lowercase/uppercase character pairs: \luwfc

To break a text into words, the program needs to know which characters are used to form words. It always assumes that the letters A through Z and a through z are used as word formation characters. If the orthography of the language the user is working in uses any other characters that have lowercase and uppercase forms, these must be given in a \luwfc field in the text input control file.

The \luwfc field defines pairs of characters; the first member of each pair is a lowercase character and the second is the corresponding uppercase character. Several such pairs may be placed in the field or they may be placed in separate fields. Whitespace may be interspersed freely. For example, the following three examples are equivalent:


\luwfc  éÉ ñÑ


\luwfc  éÉ      | e with acute accent
\luwfc  ñÑ      | enyee


\luwfc  é É  ñ Ñ

Note that comments can be used as well (just as they can in any AMPLE control file). This means that the comment character cannot be designated as a word formation character. If the orthography includes the vertical bar (|), then a different comment character must be defined with the `-c' command line option when AMPLE is initiated; see section 2.1 AMPLE Command Options.

The \luwfc field can be entered anywhere in the text input control file, although a natural place would be before the \wfc (word formation character) field.

Any standard alphabetic character (that is, a through z or A through Z) in the \luwfc field will override the standard lower-upper case pairing. For example, the following will treat X as the upper case equivalent of z:


\luwfc z X

Note that Z will still have z as its lower-case equivalent in this case.

The \luwfc field is allowed to map multiple lower case characters to the same upper case character, and vice versa. This is needed for languages that do not mark tone on upper case letters.

8.11 Multibyte lowercase/uppercase character pairs: \luwfcs

The \luwfcs field extends the character pair definitions of the \luwfc field to multibyte character sequences. Like the \luwfc field, the \luwfcs field defines pairs of characters; the first member of each pair is a multibyte lowercase character and the second is the corresponding multibyte uppercase character. Several such pairs may be placed in the field or they may be placed in separate fields. Whitespace separates the members of each pair, and the pairs from each other. For example, the following three examples are equivalent:


\luwfcs  e' E` n~ N^ ç C&


\luwfcs  e' E`      | e with acute accent
\luwfcs  n~ N^      | enyee
\luwfcs  ç  C&      | c cedilla


\luwfcs  e' E`
         n~ N^
         ç  C&

Also note that there is no requirement that the lowercase form be the same length (number of bytes) as the uppercase form. The examples shown above are only one or two bytes (character codes) in length, but there is no limit placed on the length of a multibyte character.

The \luwfcs field can be entered anywhere in the text input control file. \luwfcs fields may be mixed with \luwfc fields in the same file.

Any standard alphabetic character (that is, a through z or A through Z) in the \luwfcs field will override the standard lower-upper case pairing. For example, the following will treat X as the upper case equivalent of z:


\luwfcs z X

Note that Z will still have z as its lowercase equivalent in this case.

The \luwfcs field is allowed to map multiple multibyte lowercase characters to the same multibyte uppercase character, and vice versa. This may be useful in some situations, but it introduces an element of ambiguity into the decapitalization and recapitalization processes. If ambiguous capitalization is supported, then for the previous example, z will have both X and Z as uppercase equivalents, and X will have both x and z as lowercase equivalents.

8.12 Maximum number of decapitalizations: \maxdecap

The \maxdecap field sets the maximum number of different decapitalizations allowed. Since the \luwfc field can map several lowercase characters onto a single uppercase character, a word with uppercase characters can (logically) generate a number of alternatives when decapitalized. This is especially true of words that are entirely capitalized to begin with. The default limit is 100.
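
The way alternatives multiply, and the role of a limit, can be seen in the following Python sketch. It is a simplified illustration rather than AMPLE's algorithm: it handles only single characters, takes the lowercase/uppercase pairs as an explicit list, and the name decapitalizations is hypothetical.

```python
from itertools import islice, product


def decapitalizations(word, pairs, maxdecap=100):
    """List possible lowercase forms of word, at most maxdecap of them.

    pairs: (lowercase, uppercase) tuples, as a \\luwfc field would define.
    """
    lowers = {}
    for lower, upper in pairs:
        lowers.setdefault(upper, []).append(lower)
    # Each character contributes its possible lowercase forms, or itself.
    choices = [lowers.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in islice(product(*choices), maxdecap)]


# Two lowercase vowels that share one uppercase form (tone not marked on capitals).
PAIRS = [("a", "A"), ("á", "A")]
print(decapitalizations("AA", PAIRS))
print(decapitalizations("AA", PAIRS, maxdecap=2))
```

A word of n such ambiguous capitals yields 2 to the power n candidates in this sketch, which is why an upper bound is useful for words written entirely in capitals.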

8.13 Prevent Any Decapitalization: \nocap

The usual behavior is to normalize input words to lowercase. The program remembers the case of the word as one of four possibilities:

all uppercase
all lowercase
only the first letter uppercase
mixed uppercase and lowercase

However, not all orthographies use the concept of capitalization. To help deal with these, the field code \nocap disables all case normalization if it appears anywhere in the text input control file.

8.14 Prevent Decapitalization of Individual Characters: \noincap

The handling of mixed uppercase and lowercase is limited in utility, and sometimes causes more problems than it solves. For this reason, the \noincap field code turns off mixed case decapitalization. The program would still decapitalize words that are entirely capitalized and words that begin with a capital letter.

8.15 String class: \scl

Each \scl field defines a single string class. Any number of \scl fields may appear in the file. The only restriction is that a string class must be defined before it is used.

For example, the first two lines of the simpler Caquinte example above could be given as follows:

\scl  -bilabial  c t qu
\ch  "m"  >  "N"  / _ p
\ch  "n"  >  "N"  / _ [-bilabial]

The string class definition could be in another control file: string classes defined elsewhere can be used in the text input control file as well.

If no \scl fields appear in the text input control file, then AMPLE does not allow any string classes in text input orthography change environment constraints unless they are defined in the analysis data file or the dictionary orthography changes file.

8.16 Caseless word formation characters: \wfc

To break a text into words, the program needs to know which characters are used to form words. It always assumes that the letters A through Z and a through z are used as word formation characters. If the orthography of the language the user is working in uses any characters that do not have different lowercase and uppercase forms, these must be given in a \wfc field in the text input control file.

For example, English uses an apostrophe character (') that could be considered a word formation character. This information is provided by the following example:


\wfc  '    | needed for words like don't

Notice that the characters in the \wfc field may be separated by spaces, although this is not required. If more than one \wfc field occurs in the text input control file, the program uses the combination of all characters defined in all such fields as word formation characters.
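
The effect of adding a word formation character can be illustrated with a short Python sketch. It is not how AMPLE tokenizes text internally; it simply treats A-Z, a-z, and any extra characters supplied as the only characters that may appear in a word. The name split_words is hypothetical.

```python
import re


def split_words(text, wfc=""):
    """Split text into words made of A-Z, a-z, and the characters in wfc.

    wfc: the characters of all \\wfc fields; spaces between them are ignored.
    """
    extra = "".join(re.escape(ch) for ch in wfc if not ch.isspace())
    return re.findall("[A-Za-z" + extra + "]+", text)


print(split_words("don't stop"))       # apostrophe splits the word
print(split_words("don't stop", "'"))  # apostrophe is part of the word
```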

The comment character cannot be designated as a word formation character. If the orthography includes the vertical bar (|), then a different comment character must be defined with the `-c' command line option when AMPLE is initiated; see section 2.1 AMPLE Command Options.

8.17 Multibyte caseless word formation characters: \wfcs

The \wfcs field allows multibyte characters to be defined as "caseless" word formation characters. It has the same relationship to \wfc that \luwfcs has to \luwfc. The multibyte word formation characters are separated from each other by whitespace.

8.18 A sample text input control file

The following is the complete text input control file for Huallaga Quechua (a language of Peru):

\id HGTEXT.CTL - for Huallaga Quechua, 25-May-88

\co         WORD FORMATION CHARACTERS

\wfc  ' ~

\co         FIELDS TO EXCLUDE

\excl  \id            | identification fields

\co         ORTHOGRAPHY CHANGES

\ch  "aa" > "a:"      | for long vowels
\ch  "ee" > "i:"
\ch  "ii" > "i:"
\ch  "oo" > "u:"
\ch  "uu" > "u:"
\ch  "qeki" > "qiki"  | for cases like wawqeki
\ch  "~n" > "n~"      | for typos
| for Spanish loans like hwista
\scl sib s c          | sibilants
\ch  "hw" > "f"  / ~[sib]_

9. Output Analysis Files

Analysis files are record oriented standard format files. This means that the files are divided into records, each representing a single word in the original input text file, and records are divided into fields. An analysis file contains at least one record, and may contain a large number of records. Each record contains one or more fields. Each field occupies at least one line, and is marked by a field code at the beginning of the line. A field code begins with a backslash character (\) followed by one or more letters.

9.1 Analysis file fields

This section describes the possible fields in an analysis file. The only field that is guaranteed to exist is the analysis (\a) field. All other fields are either data dependent or optional.

9.1.1 Analysis field: \a

The analysis field (\a) starts each record of an analysis file. It has the following form:


\a PFX IFX PFX < CAT root CAT root > SFX IFX SFX

where PFX is a prefix morphname, IFX is an infix morphname, SFX is a suffix morphname, CAT is a root category, and root is a root gloss or etymology. In the simplest case, an analysis field would look like this:


\a < CAT root >

where CAT is a root category and root is a root gloss or etymology.

The \rd field in the analysis data file can replace the characters used to bracket the root category and gloss/etymology; see section 4.1.24 Root Delimiter Characters: \rd. The dictionary field code mapped to M in the dictionary codes file controls the affix and default root morphnames; see section 7.7 Morphname (internal code M). If the `-g' command line option is given, the output analysis file contains glosses from the root dictionary marked by the field code mapped to G in the dictionary codes file; see section 2.1 AMPLE Command Options and section 7.5 Root Gloss (internal code G).

9.1.2 Decomposition field: \d

The morpheme decomposition field (\d) follows the analysis field. It has the following form:


\d anti-dis-establish-ment-arian-ism-s

where the hyphens separate the individual morphemes in the surface form of the word.

The \dsc field in the text input control file can replace the hyphen with another character for separating the morphemes; see section 8.6 Decomposition Separation Character: \dsc.

The morpheme decomposition field is optional. It is enabled either by a `-w d' command line option (see section 2.1 AMPLE Command Options), or by an interactive query.

9.1.3 Category field: \cat

The category field (\cat) provides rudimentary category information. This may be useful for sentence level parsing. It has the following form:


\cat CAT

where CAT is the word category. A more complex example is


\cat C0 C1/C0=C2=C2/C1=C1/C1

where C0 is the proposed word category, C1/C0 is a prefix category pair, C2 is a root category, and C2/C1 and C1/C1 are suffix category pairs. The equal signs (=) serve to separate the category information of the individual morphemes.

The \cat field of the analysis data file controls whether the category field is written to the output analysis file; see the description of the \cat field among the analysis data file fields.

If there are multiple analyses, there will be multiple categories in the output, separated by ambiguity markers.

9.1.4 Properties field: \p

The properties field (\p) contains the names of any allomorph or morpheme properties found in the analysis of the word. It has the form:


\p ==prop1 prop2=prop3=

where prop1, prop2, and prop3 are property names. The equal signs (=) serve to separate the property information of the individual morphemes. Note that morphemes may have more than one property, with the names separated by spaces, or no properties at all.

By default, the properties field is written to the output analysis file. The `-w 0' command option, or any `-w' option that does not include `p' in its argument, disables the properties field.

9.1.5 Feature Descriptors field: \fd

The feature descriptor field (\fd) contains the feature names associated with each morpheme in the analysis. It has the following form:


\fd ==feat1 feat2=feat3=

where feat1, feat2, and feat3 are feature descriptors. The equal signs (=) serve to separate the feature descriptors of the individual morphemes. Note that morphemes may have more than one feature descriptor, with the names separated by spaces, or no feature descriptors at all.

The dictionary field code mapped to F in the dictionary code table file controls whether feature descriptors are written to the output analysis file; if this mapping is not defined, then the \fd field is not written. See section 7.4 Feature Descriptor (internal code F).

If there are multiple analyses, there will be multiple feature sets in the output, separated by ambiguity markers.

9.1.6 Underlying form field: \u

The underlying form field (\u) is similar to the decomposition field except that it shows underlying forms instead of surface forms. It looks like this:


\u a-para-a-i-ri-me

where the hyphens separate the individual morphemes.

The \dsc field in the text input control file can replace the hyphen with another character for separating the morphemes; see section 8.6 Decomposition Separation Character: \dsc.

The dictionary field code mapped to U in the dictionary code table file controls whether underlying forms are written to the output analysis file; if this mapping is not defined, then the \u field is not written. See section 7.11 Underlying Form (internal code U).

9.1.7 Word field: \w

The original word field (\w) contains the original input word as it looks before decapitalization and orthography changes. It looks like this:


\w The

Note that this is a gratuitous change from earlier versions of AMPLE and KTEXT, which wrote the decapitalized form.

The original word field is optional. It is enabled either by a `-w w' command line option (see section 2.1 AMPLE Command Options), or by an interactive query.

9.1.8 Formatting field: \f

The format information field (\f) records any formatting codes or punctuation that appeared in the input text file before the word. It looks like this:


\f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
        \\c 5\n\n
        \\s

where backslashes (\) in the input text are doubled, newlines are represented by \n, and additional lines in the field start with a tab character.
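
A program reading analysis files may need to recover the original formatting text from a \f field. The following Python sketch reverses the conventions just described. It assumes that a continuation line is introduced by a line break followed by exactly one tab and that no other escapes occur; decode_format_field is a hypothetical name.

```python
import re


def decode_format_field(value):
    """Undo the escaping used in the \\f field.

    value: the field contents after the field code, possibly spanning
    several physical lines.
    """
    # Join continuation lines: drop each line break and the tab after it.
    joined = value.replace("\n\t", "")

    def unescape(match):
        # "\n" stands for a newline; a doubled backslash stands for one.
        return "\n" if match.group(1) == "n" else match.group(1)

    return re.sub(r"\\(.)", unescape, joined)


field = "\\\\c 5\\n\\n\n\t\\\\s"
print(repr(decode_format_field(field)))
```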

The format information field is written to the output analysis file whenever it is needed, that is, whenever formatting codes or punctuation exist before words.

9.1.9 Capitalization field: \c

The capitalization field (\c) records any capitalization of the input word. It looks like this:


\c 1

where the number following the field code has one of these values:

1: the first (or only) letter of the word is capitalized
2: all letters of the word are capitalized
4-32767: some letters of the word are capitalized and some are not

Note that the third form is of limited utility, but still exists because of words like the author's last name.

The capitalization field is written to the output analysis file whenever any of the letters in the word are capitalized; see section 8.13 Prevent Any Decapitalization: \nocap and section 8.14 Prevent Decapitalization of Individual Characters: \noincap.
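
The classification can be sketched in Python as follows. This is an illustration only: it uses Python's own notion of upper and lower case rather than the pairs defined by \luwfc, and it reports mixed capitalization with a label instead of a number, because the encoding of the values from 4 upward is not described here. The name capitalization_class is hypothetical.

```python
def capitalization_class(word):
    """Return 1, 2, "mixed", or None (no capitals, so no \\c field)."""
    letters = [ch for ch in word if ch.isalpha()]
    uppers = [ch.isupper() for ch in letters]
    if not any(uppers):
        return None
    if len(letters) > 1 and all(uppers):
        return 2          # all letters capitalized
    if uppers[0] and not any(uppers[1:]):
        return 1          # first (or only) letter capitalized
    return "mixed"        # corresponds to the values 4-32767


for word in ("The", "TA", "A", "the", "McConnel"):
    print(word, capitalization_class(word))
```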

9.1.10 Nonalphabetic field: \n

The nonalphabetic field (\n) records any trailing punctuation, bar code (see section 8.4 Bar Code Format Code Characters: \barcodes), or whitespace characters. It looks like this:


\n |r.\n

where newlines are represented by \n. The nonalphabetic field ends with the last whitespace character immediately following the word.

The nonalphabetic field is written to the output analysis file whenever the word is followed by anything other than a single space character. This includes the case when a word ends a file with nothing following it.

9.2 Ambiguous analyses

The previous section assumed that only one analysis is produced for each word. This is not always possible since words in isolation are frequently ambiguous. Multiple analyses are handled by writing each analysis field in parallel, with the number of analyses at the beginning of each output field. For example,


\a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
\d %2%imaika-Npa-ni%imaika-Npani%
\cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
\p %2%==%=%
\fd %2%==%=%
\u %2%imaika-Npa-ni%imaika-Npani%
\w Imaicampani
\f \\v124 
\c 1
\n \n

where the percent sign (%) separates the different analyses in each field. Note that only those fields which contain analysis information are marked for ambiguity. The other fields (\w, \f, \c, and \n) are the same regardless of the number of analyses.

The \ambig field in the text input control file can replace the percent sign with another character for separating the analyses; see section 8.2 Ambiguity Marker Character: \ambig.

9.3 Analysis failures

The previous sections assumed that words are successfully analyzed. This does not always happen. Analysis failures are marked the same way as multiple analyses, but with zero (0) for the ambiguity count. For example,


\a %0%ta%
\d %0%ta%
\cat %0%%
\p %0%%
\fd %0%%
\u %0%%
\w TA
\f \\v 12 |b
\c 2
\n |r\n

Note that only the \a and \d fields contain any information, and those both have the original word as a place holder. The other analysis fields (\cat, \p, \fd, and \u) are marked for failure, but otherwise left empty.

The \ambig field in the text input control file can replace the percent sign with another character for marking analysis failures and ambiguities; see section 8.2 Ambiguity Marker Character: \ambig.
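
Programs that read analysis files have to handle single analyses, ambiguities, and failures uniformly. The following Python sketch shows one way to do so; it assumes the ambiguity marker never occurs inside an analysis, and parse_analysis_field is a hypothetical name.

```python
def parse_analysis_field(value, marker="%"):
    """Split the contents of an analysis field into (count, analyses).

    A value without a leading marker is a single unambiguous analysis.
    A count of 0 signals an analysis failure.
    """
    value = value.strip()
    if not value.startswith(marker):
        return 1, [value]
    parts = value.split(marker)
    # parts[0] is empty, parts[1] is the count, parts[-1] is empty.
    return int(parts[1]), parts[2:-1]


print(parse_analysis_field("%2%imaika-Npa-ni%imaika-Npani%"))
print(parse_analysis_field("%0%ta%"))
print(parse_analysis_field("%0%%"))
```

If the \ambig field selects a different character, that character would be passed as the marker argument.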

Bibliography

Weber, David J., H. Andrew Black, and Stephen R. McConnel. 1988. AMPLE: a tool for exploring morphology. Occasional Publications in Academic Computing No. 12. Dallas, TX: Summer Institute of Linguistics.
Weber, David J., H. Andrew Black, Stephen R. McConnel, and Alan Buseman. 1990. STAMP: a tool for dialect adaptation. Occasional Publications in Academic Computing No. 15. Dallas, TX: Summer Institute of Linguistics.