This is Info file anadiff.info, produced by Makeinfo version 1.68 from
the input file anadiff.txi.

                       ANADIFF Reference Manual
                       comparing analysis files
                             version 1.0b6
                             October 1997

                          by Stephen McConnel

                 Copyright (C) 2000 SIL International

Published by:
Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.

Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.

The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.


File: anadiff.info,  Node: Top,  Next: Introduction,  Prev: (dir),  Up: (dir)

This is the reference manual for the ANADIFF program.

* Menu:

* Introduction::                Why ANADIFF was written
* Running ANADIFF::             How to use ANADIFF
* Analysis files::              Description of input to ANADIFF


File: anadiff.info,  Node: Introduction,  Next: Running ANADIFF,  Prev: Top,  Up: Top

Introduction to the ANADIFF program
***********************************

Comparing the output from two different runs of programs like AMPLE or
KTEXT can be quite difficult.  Two analysis files may differ only in
the order of ambiguous analyses, or in the number of analyses.  Either
kind of difference is hard to interpret with simple file comparison
utilities like `fc.exe' or `diff'.  That is why ANADIFF was written.


File: anadiff.info,  Node: Running ANADIFF,  Next: Analysis files,  Prev: Introduction,  Up: Top

Running ANADIFF
***************

Command line options
--------------------

The ANADIFF program uses an old-fashioned command line interface
following the convention of options starting with a dash character
(`-').  The available options are listed below in alphabetical order.
Those options which require an argument have the argument type
following the option letter.

`-a character'
     selects the character that marks ambiguities and failures in the
     analysis files.  (The default is `%'.)

`-q'
     tells ANADIFF to operate *quietly* without the normal output
     showing the differences.

`-v'
     tells ANADIFF to operate *verbosely* with extra output to the
     screen.

ANADIFF expects the names of two analysis files following any options
on the command line.  The examples below illustrate this.
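
As a minimal sketch of how the options above combine, the following
Python function assembles an ANADIFF command line.  The function name
and its parameters are hypothetical; only the `-a', `-q', and `-v'
options and the two trailing file names come from this manual.  It
builds the argument list without running the program.

```python
def build_anadiff_command(first, second, ambiguity_char=None,
                          quiet=False, verbose=False, program="anadiff"):
    """Return an argument list for running ANADIFF on two analysis files."""
    args = [program]
    if ambiguity_char is not None:
        if len(ambiguity_char) != 1:
            raise ValueError("-a takes a single character")
        args += ["-a", ambiguity_char]
    if quiet:
        args.append("-q")
    if verbose:
        args.append("-v")
    # The two analysis file names follow any options.
    return args + [first, second]


print(build_anadiff_command("a.aaa", "a.ana", ambiguity_char="|", quiet=True))
```

The resulting list could be handed to `subprocess.run' from the Python
standard library.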

Examples
--------

ANADIFF normally writes a summary when it has finished comparing the
two analysis files.  For example, comparing two identical analysis
files would look like this:

     C> anadiff a.aaa a.ana
     a.aaa and a.ana are the same (78 of 78 are identical)


Comparing two analysis files that are logically, but not literally,
identical produces similar output:

     C> anadiff b.aaa b.ana
     b.aaa and b.ana are the same (70 of 78 are identical)


As would be expected, more output is generated when the two analysis
files actually are different.  The following example illustrates four
types of differences:

  1. Different morph names in a single analysis

  2. Different property names in a single analysis

  3. A double versus a single analysis

  4. Different morph names in the second of two analyses

Note that ANADIFF displays analyses somewhat differently than they
appear in the analysis files.

     C> anadiff c.aaa c.ana
     1. diff:  \a   < N0 *hirka > LOC
                    < N0 *hirka > LOCATIVE
               \d   hirka-chaw
               \p   =
               \w   HIRKACHAW
               \f   \\id HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
                        \\c 5\n\n
                        \\s
               \c   2
     
     2.        \a   < V2 *yatra > CAUS 3
               \d   yacha-chi-n
        diff:  \p   =Mlowers=foreshortens
                    =Morphlowers=foreshortens
               \w   YACHACHIN
               \c   2
               \n   \n
     
     3.        \a   < N0 *runa > PLUR
               \d   runa-kuna
               \p   =
        diff:  \a   < N0 *runa > plural
                    [missing in c.ana]
        diff:  \d   runa-kuna
                    [missing in c.ana]
        diff:  \p   =
                    [missing in c.ana]
               \w   runakuna
     
     4.        \a   < V1 *n~awpa > 3 COND
               \d   n~awpa-n-man
               \p   =foreshortens=
        diff:  \a   < N0 *n~awpa > 3P GOAL
                    < N0 *n~awpa > 3P GOALIE
               \d   n~awpa-n-man
               \p   ==
               \w   n~awpanman
     
        c.aaa and c.ana differ 4 times (63 of 78 are identical)


Batch file example
------------------

When it exits, ANADIFF returns the number of actual differences that it
found as its exit status.  This allows it to be used in batch files for
regression tests.
For example, consider the following MS-DOS batch file for testing
different versions of AMPLE on a specific set of data:

     if "%1" == "" goto done
     %1 -m -f aetest.cmd -i aetest.txt -o aetest.ana >aetest.log
     anadiff aetest.aaa aetest.ana >aetest.dif
     if errorlevel 1 goto done
     del aetest.dif
     del aetest.ana
     del aetest.log
     :done


(The `%1' is an MS-DOS batch variable corresponding to the first
command line argument following the name of the batch file in the
command.)  This batch file runs the given version of AMPLE on a
specific set of data, and then compares the output file (`aetest.ana')
to the output from a previous run (`aetest.aaa').  If the output is the
same, except possibly for the order of ambiguous analyses, then the new
output and log files are deleted.  The same is accomplished by the
following Unix shell script.

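     #!/bin/sh
     # Run the given AMPLE executable, then compare with the saved output.
     if [ $# -gt 0 ]; then
         "$1" -m -f aetest.cmd -i aetest.txt -o aetest.ana >aetest.log
         # anadiff exits with 0 when it found no actual differences
         if anadiff aetest.aaa aetest.ana >aetest.dif; then
             rm aetest.dif aetest.ana aetest.log
         fi
     fi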



File: anadiff.info,  Node: Analysis files,  Prev: Running ANADIFF,  Up: Top

Input Analysis Files
********************

Analysis files are "record oriented standard format files".  This means
that the files are divided into records, each representing a single
word in the original input text file, and records are divided into
fields.  An analysis file contains at least one record, and may contain
a large number of records.  Each record contains one or more fields.
Each field occupies at least one line, and is marked by a "field code"
at the beginning of the line.  A field code begins with a backslash
character (`\') followed by one or more letters.
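
The record and field layout described above can be read with a few
lines of Python.  This is only a sketch, not the code ANADIFF itself
uses: the function name is hypothetical, and it assumes that a line
which does not begin with a backslash continues the previous field.

```python
def parse_records(text):
    """Split analysis-file text into records of (field code, value) pairs."""
    records = []
    for line in text.split("\n"):
        if line.startswith("\\"):
            code, _, value = line.partition(" ")
            # The \a field starts each record.
            if code == "\\a" or not records:
                records.append([])
            records[-1].append((code, value))
        elif records and line.strip():
            # Assumption: a line without a field code continues the
            # previous field (for example, a long \f field).
            last_code, last_value = records[-1][-1]
            records[-1][-1] = (last_code, last_value + "\n" + line)
    return records


sample = "\\a < N0 *hirka > LOC\n\\d hirka-chaw\n\\p =\n\\w HIRKACHAW\n\\c 2\n"
for code, value in parse_records(sample)[0]:
    print(code, value)
```

Keeping each record as a list of pairs preserves the order of the
fields as they appear in the file.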

* Menu:

* Analysis file fields::            Description of each type of field
* Ambiguous analyses::              How ambiguous analyses are marked
* Analysis failures::               How analysis failures are marked


File: anadiff.info,  Node: Analysis file fields,  Next: Ambiguous analyses,  Prev: Analysis files,  Up: Analysis files

Analysis file fields
====================

This section describes the possible fields in an analysis file.  The
only field that is guaranteed to exist is the analysis (`\a') field.
All other fields are either data dependent or optional.

* Menu:

* \a::          Analysis
* \d::          Decomposition (surface forms)
* \cat::        Category (possible word, morpheme)
* \p::          Properties
* \fd::         Feature Descriptors
* \u::          Underlying forms (decomposition)
* \w::          Word (before decapitalization and orthography changes)
* \f::          Formatting (junk before the word)
* \c::          Capitalization flag
* \n::          Nonalphabetic (junk after the word)


File: anadiff.info,  Node: \a,  Next: \d,  Prev: Analysis file fields,  Up: Analysis file fields

Analysis field: \a
------------------

The analysis field (`\a') starts each record of an analysis file.  It
has the following form:

     \a PFX IFX PFX < CAT root CAT root > SFX IFX SFX

where `PFX' is a prefix morphname, `IFX' is an infix morphname, `SFX'
is a suffix morphname, `CAT' is a root category, and `root' is a root
gloss or etymology.  In the simplest case, an analysis field would look
like this:

     \a < CAT root >

that is, a single root with no prefixes, infixes, or suffixes.
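
The form above can be taken apart as in the following Python sketch.
The function name and the keys of the returned dictionary are
hypothetical.  The sketch assumes that a single unambiguous analysis is
passed in and that root glosses contain no spaces.  Since position
alone does not distinguish an infix from a prefix or suffix, the
morphnames are simply grouped by the side of the root on which they
appear.

```python
def parse_analysis(value):
    """Split one \\a value into morphnames before the roots, the
    (category, root) pairs, and morphnames after the roots."""
    before, _, rest = value.partition("<")
    inside, _, after = rest.partition(">")
    tokens = inside.split()
    if not tokens or len(tokens) % 2:
        raise ValueError("expected CAT/root pairs between < and >: %r" % value)
    # Assumption: categories and roots alternate and contain no spaces.
    roots = list(zip(tokens[0::2], tokens[1::2]))
    return {"before_root": before.split(), "roots": roots,
            "after_root": after.split()}


print(parse_analysis("< V2 *yatra > CAUS 3"))
```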


File: anadiff.info,  Node: \d,  Next: \cat,  Prev: \a,  Up: Analysis file fields

Decomposition field: \d
-----------------------

The morpheme decomposition field (`\d') follows the analysis field.  It
has the following form:

     \d anti-dis-establish-ment-arian-ism-s

where the hyphens separate the individual morphemes in the surface form
of the word.


File: anadiff.info,  Node: \cat,  Next: \p,  Prev: \d,  Up: Analysis file fields

Category field: \cat
--------------------

The category field (`\cat') provides rudimentary category information.
This may be useful for sentence level parsing.  It has the following
form:

     \cat CAT

where `CAT' is the word category.

If there are multiple analyses, there will be multiple categories in
the output, separated by ambiguity markers.


File: anadiff.info,  Node: \p,  Next: \fd,  Prev: \cat,  Up: Analysis file fields

Properties field: \p
--------------------

The properties field (`\p') contains the names of any allomorph or
morpheme properties found in the analysis of the word.  It has the form:

     \p ==prop1 prop2=prop3=

where `prop1', `prop2', and `prop3' are property names.  The equal
signs (`=') serve to separate the property information of the
individual morphemes.  Note that morphemes may have more than one
property, with the names separated by spaces, or no properties at all.
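
Judging from the examples in this manual, the equal signs act as
separators, so a word with three morphemes has two equal signs in its
`\p' field.  Under that assumption, the following Python sketch pairs
each morpheme of a `\d' field with its properties.  The function names
are hypothetical, and the sketch assumes that hyphens occur only
between morphemes.

```python
def split_properties(value):
    """Return one list of property names per morpheme of a \\p value."""
    return [group.split() for group in value.split("=")]


def morphemes_with_properties(decomposition, properties):
    """Pair each surface morpheme of \\d with its properties from \\p."""
    morphs = decomposition.split("-")
    groups = split_properties(properties)
    if len(morphs) != len(groups):
        raise ValueError("morpheme and property counts differ")
    return list(zip(morphs, groups))


print(morphemes_with_properties("yacha-chi-n", "=Mlowers=foreshortens"))
```

The same splitting would apply to a `\fd' field, which uses the same
layout.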


File: anadiff.info,  Node: \fd,  Next: \u,  Prev: \p,  Up: Analysis file fields

Feature Descriptors field: \fd
------------------------------

The feature descriptor field (`\fd') contains the feature names
associated with each morpheme in the analysis.  It has the following
form:

     \fd ==feat1 feat2=feat3=

where `feat1', `feat2', and `feat3' are feature descriptors.  The equal
signs (`=') serve to separate the feature descriptors of the individual
morphemes.  Note that morphemes may have more than one feature
descriptor, with the names separated by spaces, or no feature
descriptors at all.

If there are multiple analyses, there will be multiple feature sets in
the output, separated by ambiguity markers.


File: anadiff.info,  Node: \u,  Next: \w,  Prev: \fd,  Up: Analysis file fields

Underlying form field: \u
-------------------------

The underlying form field (`\u') is similar to the decomposition field
except that it shows underlying forms instead of surface forms.  It
looks like this:

     \u a-para-a-i-ri-me

where the hyphens separate the individual morphemes.


File: anadiff.info,  Node: \w,  Next: \f,  Prev: \u,  Up: Analysis file fields

Word field: \w
--------------

The original word field (`\w') contains the original input word as it
looks before decapitalization and orthography changes.  It looks like
this:

     \w The

Note that this is a gratuitous change from earlier versions of AMPLE
and KTEXT, which wrote the decapitalized form.


File: anadiff.info,  Node: \f,  Next: \c,  Prev: \w,  Up: Analysis file fields

Formatting field: \f
--------------------

The format information field (`\f') records any formatting codes or
punctuation that appeared in the input text file before the word.  It
looks like this:

     \f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
             \\c 5\n\n
             \\s


where backslashes (`\') in the input text are doubled, newlines are
represented by `\n', and additional lines in the field start with a tab
character.
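
The escaping just described can be undone as in this Python sketch,
which recovers the original text that preceded the word.  The function
name is hypothetical, and the sketch assumes that a physical line break
followed by a tab only continues the field and carries no content of
its own.

```python
import re


def decode_format_field(value):
    """Recover the original text from a \\f field value."""
    # Assumption: a line break plus tab merely continues the field.
    joined = value.replace("\n\t", "")
    # Undo both escapes in a single left-to-right pass, so that a
    # doubled backslash followed by "n" is not mistaken for a newline.
    return re.sub(r"\\([\\n])",
                  lambda m: "\n" if m.group(1) == "n" else "\\", joined)


raw = "\\\\id HGMT05.SFM\\n\n\t\\\\c 5\\n\\n\n\t\\\\s"
print(decode_format_field(raw))
```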

The format information field is written to the output analysis file
whenever it is needed, that is, whenever formatting codes or
punctuation exist before words.


File: anadiff.info,  Node: \c,  Next: \n,  Prev: \f,  Up: Analysis file fields

Capitalization field: \c
------------------------

The capitalization field (`\c') records any capitalization of the input
word.  It looks like this:

     \c 1

where the number following the field code has one of these values:

`1'
     the first (or only) letter of the word is capitalized

`2'
     all letters of the word are capitalized

`4-32767'
     some letters of the word are capitalized and some are not

Note that the third form is of limited utility, but still exists
because of words like the author's last name (McConnel).

The capitalization field is written to the output analysis file
whenever any of the letters in the word are capitalized.
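
As an illustration of the first two values, the following Python
sketch reapplies a capitalization code to a decapitalized word.  The
function name is hypothetical.  This manual does not say how the
values from 4 to 32767 encode mixed capitalization, so the sketch
declines to guess and reports those as not handled.

```python
def recapitalize(word, code):
    """Apply a \\c capitalization code to a decapitalized word."""
    if code == 1:
        return word[:1].upper() + word[1:]
    if code == 2:
        return word.upper()
    if 4 <= code <= 32767:
        # The encoding of mixed capitalization is not described here.
        raise NotImplementedError("mixed capitalization is not decoded here")
    raise ValueError("unexpected capitalization code: %r" % code)


print(recapitalize("the", 1))
print(recapitalize("hirkachaw", 2))
```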


File: anadiff.info,  Node: \n,  Prev: \c,  Up: Analysis file fields

Nonalphabetic field: \n
-----------------------

The nonalphabetic field (`\n') records any trailing punctuation, bar
codes, or whitespace characters.  It looks like this:

     \n |r.\n

where newlines are represented by `\n'.  The nonalphabetic field ends
with the last whitespace character immediately following the word.

The nonalphabetic field is written to the output analysis file whenever
the word is followed by anything other than a single space character.
This includes the case when a word ends a file with nothing following
it.


File: anadiff.info,  Node: Ambiguous analyses,  Next: Analysis failures,  Prev: Analysis file fields,  Up: Analysis files

Ambiguous analyses
==================

The previous section assumed that only one analysis is produced for
each word.  This is not always possible since words in isolation are
frequently ambiguous.  Multiple analyses are handled by writing each
analysis field in parallel, with the number of analyses at the
beginning of each output field.  For example,

     \a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
     \d %2%imaika-Npa-ni%imaika-Npani%
     \cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
     \p %2%==%=%
     \fd %2%==%=%
     \u %2%imaika-Npa-ni%imaika-Npani%
     \w Imaicampani
     \f \\v124
     \c 1
     \n \n


where the percent sign (`%') separates the different analyses in each
field.  Note that only those fields which contain analysis information
are marked for ambiguity.  The other fields (`\w', `\f', `\c', and
`\n') are the same regardless of the number of analyses.
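
The marking shown above can be split apart as in the following Python
sketch, which also compares two field values without regard to the
order of their analyses.  This is only an illustration and not
necessarily how ANADIFF itself compares files.  The function names are
hypothetical, and the sketch assumes that the ambiguity marker never
occurs inside an analysis.

```python
def split_ambiguous(value, marker="%"):
    """Return (count, analyses) for a possibly ambiguous field value."""
    if not value.startswith(marker):
        return 1, [value]          # a single, unmarked analysis
    parts = value.split(marker)
    # parts looks like ['', '2', first, second, ''] for a count of 2.
    return int(parts[1]), parts[2:-1]


def same_analyses(value_a, value_b, marker="%"):
    """True when two field values hold the same analyses in any order."""
    return sorted(split_ambiguous(value_a, marker)[1]) == sorted(
        split_ambiguous(value_b, marker)[1])


field = "%2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%"
print(split_ambiguous(field))
```

The `marker' parameter plays the same role as the character given to
the `-a' command line option.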


File: anadiff.info,  Node: Analysis failures,  Prev: Ambiguous analyses,  Up: Analysis files

Analysis failures
=================

The previous sections assumed that words are successfully analyzed.
This does not always happen.  Analysis failures are marked the same way
as multiple analyses, but with zero (`0') for the ambiguity count.  For
example,

     \a %0%ta%
     \d %0%ta%
     \cat %0%%
     \p %0%%
     \fd %0%%
     \u %0%%
     \w TA
     \f \\v 12 |b
     \c 2
     \n |r\n


Note that only the `\a' and `\d' fields contain any information, and
those both have the original word as a placeholder.  The other
analysis fields (`\cat', `\p', `\fd', and `\u') are marked for failure,
but otherwise left empty.
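
Using the count at the beginning of the `\a' field, a script can sort
words into failures, ambiguous analyses, and unique analyses.  The
following Python sketch shows one way to do this.  The function names
and the three labels are hypothetical, and a marked field with a count
of one is treated as a unique analysis.

```python
from collections import Counter


def classify(analysis_value, marker="%"):
    """Label an \\a value as 'failure', 'ambiguous', or 'unique'."""
    if not analysis_value.startswith(marker):
        return "unique"
    count = int(analysis_value.split(marker)[1])
    if count == 0:
        return "failure"
    return "unique" if count == 1 else "ambiguous"


def summarize(analysis_values, marker="%"):
    """Count how many \\a values fall under each label."""
    return Counter(classify(value, marker) for value in analysis_values)


print(summarize(["%0%ta%", "< N0 *runa > PLUR",
                 "%2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%"]))
```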



Tag Table:
Node: Top746
Node: Introduction1063
Node: Running ANADIFF1568
Node: Analysis files6008
Node: Analysis file fields6898
Node: \a7711
Node: \d8319
Node: \cat8680
Node: \p9118
Node: \fd9684
Node: \u10407
Node: \w10780
Node: \f11170
Node: \c11877
Node: \n12610
Node: Ambiguous analyses13224
Node: Analysis failures14239

End Tag Table