This is Info file anadiff.info, produced by Makeinfo version 1.68 from
the input file anadiff.txi.

ANADIFF Reference Manual

comparing analysis files

version 1.0b6

October 1997

by Stephen McConnel

Copyright (C) 2000 SIL International

Published by:

Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.

Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.

The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.


File: anadiff.info, Node: Top, Next: Introduction, Prev: (dir), Up: (dir)

This is the reference manual for the ANADIFF program.

* Menu:

* Introduction::        Why ANADIFF was written
* Running ANADIFF::     How to use ANADIFF
* Analysis files::      Description of input to ANADIFF


File: anadiff.info, Node: Introduction, Next: Running ANADIFF, Prev: Top, Up: Top

Introduction to the ANADIFF program
***********************************

Comparing the output from two different runs of programs like AMPLE or
KTEXT can be quite difficult.  The analysis files can differ only in
the order of ambiguous analyses, or in the number of analyses.  Either
situation is difficult to analyze with simple file comparison utilities
like `fc.exe' or `diff'.  That is why ANADIFF was written.


File: anadiff.info, Node: Running ANADIFF, Next: Analysis files, Prev: Introduction, Up: Top

Running ANADIFF
***************

Command line options
--------------------

The ANADIFF program uses an old-fashioned command line interface
following the convention of options starting with a dash character
(`-').  The available options are listed below in alphabetical order.
Those options which require an argument have the argument type
following the option letter.

`-a character'
     selects the character that marks ambiguities and failures in the
     analysis files.  (The default is `%'.)

`-q'
     tells ANADIFF to operate *quietly* without the normal output
     showing the differences.

`-v'
     tells ANADIFF to operate *verbosely* with extra output to the
     screen.

ANADIFF expects the names of two analysis files following any options
on the command line.  The examples below illustrate this.

Examples
--------

ANADIFF normally writes a summary when it has finished comparing the
two analysis files.  For example, comparing two identical analysis
files would look like this:

     C> anadiff a.aaa a.ana
     a.aaa and a.ana are the same (78 of 78 are identical)

Comparing two analysis files that are logically, but not literally,
identical produces similar output:

     C> anadiff b.aaa b.ana
     b.aaa and b.ana are the same (70 of 78 are identical)

As would be expected, more output is generated when the two analysis
files actually are different.  The following example illustrates four
types of differences:

  1. Different morph names in a single analysis

  2. Different property names in a single analysis

  3. A double versus a single analysis

  4. Different morph names in the second of two analyses

Note that ANADIFF displays analyses somewhat differently than they
appear in the analysis files.

     C> anadiff c.aaa c.ana
     1.
     diff: \a  < N0 *hirka > LOC
               < N0 *hirka > LOCATIVE
           \d  hirka-chaw
           \p  =
           \w  HIRKACHAW
           \f  \\id HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
               \\c 5\n\n
               \\s
           \c  2
     2.
           \a  < V2 *yatra > CAUS 3
           \d  yacha-chi-n
     diff: \p  =Mlowers=foreshortens
               =Morphlowers=foreshortens
           \w  YACHACHIN
           \c  2
           \n  \n
     3.
           \a  < N0 *runa > PLUR
           \d  runa-kuna
           \p  =
     diff: \a  < N0 *runa > plural
               [missing in c.ana]
     diff: \d  runa-kuna
               [missing in c.ana]
     diff: \p  =
               [missing in c.ana]
           \w  runakuna
     4.
           \a  < V1 *n~awpa > 3 COND
           \d  n~awpa-n-man
           \p  =foreshortens=
     diff: \a  < N0 *n~awpa > 3P GOAL
               < N0 *n~awpa > 3P GOALIE
           \d  n~awpa-n-man
           \p  ==
           \w  n~awpanman
     c.aaa and c.ana differ 4 times (63 of 78 are identical)

Batch file example
------------------

ANADIFF returns the number of actual differences that it found when it
exits.
This allows it to be used in batch files for regression tests.  For
example, consider the following MS-DOS batch file for testing different
versions of AMPLE on a specific set of data:

     if "%1" == "" goto done
     %1 -m -f aetest.cmd -i aetest.txt -o aetest.ana >aetest.log
     anadiff aetest.aaa aetest.ana >aetest.dif
     rem errorlevel is 1 or more when ANADIFF found differences
     if errorlevel 1 goto done
     del aetest.dif
     del aetest.ana
     del aetest.log
     :done

(The `%1' is an MS-DOS batch variable corresponding to the first
command line argument following the name of the batch file in the
command.)  This batch file runs the given version of AMPLE on a
specific set of data, and then compares the output file (`aetest.ana')
to the output from a previous run (`aetest.aaa').  If the output is the
same, except possibly for the order of ambiguous analyses, then the new
output and log files are deleted.  The same is accomplished by the
following Unix shell script.

     #! /bin/sh
     if [ $# -gt 0 ]; then
         $1 -m -f aetest.cmd -i aetest.txt -o aetest.ana >aetest.log
         # anadiff exits with 0 only when it found no differences
         if anadiff aetest.aaa aetest.ana >aetest.dif; then
             rm aetest.dif aetest.ana aetest.log
         fi
     fi


File: anadiff.info, Node: Analysis files, Prev: Running ANADIFF, Up: Top

Input Analysis Files
********************

Analysis files are "record oriented standard format files".  This means
that the files are divided into records, each representing a single
word in the original input text file, and records are divided into
fields.  An analysis file contains at least one record, and may contain
a large number of records.  Each record contains one or more fields.
Each field occupies at least one line, and is marked by a "field code"
at the beginning of the line.  A field code begins with a backslash
character (`\'), followed by one or more letters.
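
As a rough illustration of this layout, the following Python sketch
splits the text of an analysis file into records and fields.  It is not
part of ANADIFF: the names `FIELD_RE' and `parse_records' are
hypothetical, and the sketch assumes that each record starts with an
`\a' field and that any nonblank line without a field code continues
the preceding field.

```python
import re

# A field code: a backslash followed by one or more letters at the
# start of a line, optionally followed by whitespace and the value.
FIELD_RE = re.compile(r"^\\([A-Za-z]+)(?:[ \t]+(.*))?$")


def parse_records(text):
    """Return a list of records; each record is a list of (code, value)."""
    records = []
    current = None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            code, value = match.group(1), match.group(2) or ""
            # Assumption: an \a field always opens a new record.
            if code == "a" or current is None:
                current = []
                records.append(current)
            current.append((code, value))
        elif current and line.strip():
            # Assumption: a nonblank line without a field code is a
            # continuation of the most recent field.
            code, value = current[-1]
            current[-1] = (code, value + "\n" + line.strip())
    return records


sample = "\n".join([
    r"\a < N0 *runa > PLUR",
    r"\d runa-kuna",
    r"\p =",
    r"\w runakuna",
    "",
    r"\a < V2 *yatra > CAUS 3",
    r"\d yacha-chi-n",
    r"\w YACHACHIN",
    r"\c 2",
])

records = parse_records(sample)
for number, record in enumerate(records, start=1):
    print(number, [code for code, _ in record])
```

Keeping each record as an ordered list of pairs, rather than a
dictionary, preserves the order in which the fields appear in the file.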

* Menu:

* Analysis file fields::        Description of each type of field
* Ambiguous analyses::          How ambiguous analyses are marked
* Analysis failures::           How analysis failures are marked


File: anadiff.info, Node: Analysis file fields, Next: Ambiguous analyses, Prev: Analysis files, Up: Analysis files

Analysis file fields
====================

This section describes the possible fields in an analysis file.  The
only field that is guaranteed to exist is the analysis (`\a') field.
All other fields are either data dependent or optional.

* Menu:

* \a::          Analysis
* \d::          Decomposition (surface forms)
* \cat::        Category (possible word, morpheme)
* \p::          Properties
* \fd::         Feature Descriptors
* \u::          Underlying forms (decomposition)
* \w::          Word (before decapitalization and orthography changes)
* \f::          Formatting (junk before the word)
* \c::          Capitalization flag
* \n::          Nonalphabetic (junk after the word)


File: anadiff.info, Node: \a, Next: \d, Prev: Analysis file fields, Up: Analysis file fields

Analysis field: \a
------------------

The analysis field (`\a') starts each record of an analysis file.  It
has the following form:

     \a PFX IFX PFX < CAT root CAT root > SFX IFX SFX

where `PFX' is a prefix morphname, `IFX' is an infix morphname, `SFX'
is a suffix morphname, `CAT' is a root category, and `root' is a root
gloss or etymology.  In the simplest case, an analysis field would look
like this:

     \a < CAT root >

where `CAT' is a root category and `root' is a root gloss or etymology.


File: anadiff.info, Node: \d, Next: \cat, Prev: \a, Up: Analysis file fields

Decomposition field: \d
-----------------------

The morpheme decomposition field (`\d') follows the analysis field.  It
has the following form:

     \d anti-dis-establish-ment-arian-ism-s

where the hyphens separate the individual morphemes in the surface form
of the word.

File: anadiff.info, Node: \cat, Next: \p, Prev: \d, Up: Analysis file fields

Category field: \cat
--------------------

The category field (`\cat') provides rudimentary category information.
This may be useful for sentence level parsing.  It has the following
form:

     \cat CAT

where `CAT' is the word category.

If there are multiple analyses, there will be multiple categories in
the output, separated by ambiguity markers.


File: anadiff.info, Node: \p, Next: \fd, Prev: \cat, Up: Analysis file fields

Properties field: \p
--------------------

The properties field (`\p') contains the names of any allomorph or
morpheme properties found in the analysis of the word.  It has the
form:

     \p ==prop1 prop2=prop3=

where `prop1', `prop2', and `prop3' are property names.  The equal
signs (`=') serve to separate the property information of the
individual morphemes.  Note that morphemes may have more than one
property, with the names separated by spaces, or no properties at all.


File: anadiff.info, Node: \fd, Next: \u, Prev: \p, Up: Analysis file fields

Feature Descriptors field: \fd
------------------------------

The feature descriptor field (`\fd') contains the feature names
associated with each morpheme in the analysis.  It has the following
form:

     \fd ==feat1 feat2=feat3=

where `feat1', `feat2', and `feat3' are feature descriptors.  The equal
signs (`=') serve to separate the feature descriptors of the individual
morphemes.  Note that morphemes may have more than one feature
descriptor, with the names separated by spaces, or no feature
descriptors at all.

If there are multiple analyses, there will be multiple feature sets in
the output, separated by ambiguity markers.


File: anadiff.info, Node: \u, Next: \w, Prev: \fd, Up: Analysis file fields

Underlying form field: \u
-------------------------

The underlying form field (`\u') is similar to the decomposition field
except that it shows underlying forms instead of surface forms.
It looks like this:

     \u a-para-a-i-ri-me

where the hyphens separate the individual morphemes.


File: anadiff.info, Node: \w, Next: \f, Prev: \u, Up: Analysis file fields

Word field: \w
--------------

The original word field (`\w') contains the original input word as it
looks before decapitalization and orthography changes.  It looks like
this:

     \w The

Note that this is a gratuitous change from earlier versions of AMPLE
and KTEXT, which wrote the decapitalized form.


File: anadiff.info, Node: \f, Next: \c, Prev: \w, Up: Analysis file fields

Formatting field: \f
--------------------

The format information field (`\f') records any formatting codes or
punctuation that appeared in the input text file before the word.  It
looks like this:

     \f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
             \\c 5\n\n
             \\s

where backslashes (`\') in the input text are doubled, newlines are
represented by `\n', and additional lines in the field start with a tab
character.

The format information field is written to the output analysis file
whenever it is needed, that is, whenever formatting codes or
punctuation exist before words.


File: anadiff.info, Node: \c, Next: \n, Prev: \f, Up: Analysis file fields

Capitalization field: \c
------------------------

The capitalization field (`\c') records any capitalization of the input
word.  It looks like this:

     \c 1

where the number following the field code has one of these values:

`1'
     the first (or only) letter of the word is capitalized

`2'
     all letters of the word are capitalized

`4-32767'
     some letters of the word are capitalized and some are not

Note that the third form is of limited utility, but still exists
because of words like the author's last name.

The capitalization field is written to the output analysis file
whenever any of the letters in the word are capitalized.

File: anadiff.info, Node: \n, Prev: \c, Up: Analysis file fields

Nonalphabetic field: \n
-----------------------

The nonalphabetic field (`\n') records any trailing punctuation, bar
codes, or whitespace characters.  It looks like this:

     \n |r.\n

where newlines are represented by `\n'.  The nonalphabetic field ends
with the last whitespace character immediately following the word.

The nonalphabetic field is written to the output analysis file whenever
the word is followed by anything other than a single space character.
This includes the case when a word ends a file with nothing following
it.


File: anadiff.info, Node: Ambiguous analyses, Next: Analysis failures, Prev: Analysis file fields, Up: Analysis files

Ambiguous analyses
==================

The previous section assumed that only one analysis is produced for
each word.  This is not always possible since words in isolation are
frequently ambiguous.  Multiple analyses are handled by writing each
analysis field in parallel, with the number of analyses at the
beginning of each output field.  For example,

     \a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
     \d %2%imaika-Npa-ni%imaika-Npani%
     \cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
     \p %2%==%=%
     \fd %2%==%=%
     \u %2%imaika-Npa-ni%imaika-Npani%
     \w Imaicampani
     \f \\v124
     \c 1
     \n \n

where the percent sign (`%') separates the different analyses in each
field.  Note that only those fields which contain analysis information
are marked for ambiguity.  The other fields (`\w', `\f', `\c', and
`\n') are the same regardless of the number of analyses.


File: anadiff.info, Node: Analysis failures, Prev: Ambiguous analyses, Up: Analysis files

Analysis failures
=================

The previous sections assumed that words are successfully analyzed.
This does not always happen.  Analysis failures are marked the same way
as multiple analyses, but with zero (`0') for the ambiguity count.
For example,

     \a %0%ta%
     \d %0%ta%
     \cat %0%%
     \p %0%%
     \fd %0%%
     \u %0%%
     \w TA
     \f \\v 12 |b
     \c 2
     \n |r\n

Note that only the `\a' and `\d' fields contain any information, and
those both have the original word as a placeholder.  The other analysis
fields (`\cat', `\p', `\fd', and `\u') are marked for failure, but
otherwise left empty.