ANADIFF Reference Manual
comparing analysis files
version 1.0b6
October 1997
by Stephen McConnel

Copyright (C) 2000 SIL International

Published by:

Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.

Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.

The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.

Introduction to the ANADIFF program
***********************************

Comparing the output from two different runs of programs like AMPLE or
KTEXT can be quite difficult. The analysis files can differ only in the
order of ambiguous analyses, or in the number of analyses. Either
situation is difficult to analyze with simple file comparison utilities
like `fc.exe' or `diff'. That is why ANADIFF was written.

Running ANADIFF
***************

Command line options
--------------------

The ANADIFF program uses an old-fashioned command line interface
following the convention of options starting with a dash character
(`-'). The available options are listed below in alphabetical order.
Those options which require an argument have the argument type
following the option letter.

`-a character'
     selects the character that marks ambiguities and failures in the
     analysis files. (The default is `%'.)

`-q'
     tells ANADIFF to operate *quietly* without the normal output
     showing the differences.

`-v'
     tells ANADIFF to operate *verbosely* with extra output to the
     screen.

ANADIFF expects the names of two analysis files following any options
on the command line. The examples below illustrate this.

Examples
--------

ANADIFF normally writes a summary when it has finished comparing the
two analysis files.
For example, comparing two identical analysis files would look like
this:

     C> anadiff a.aaa a.ana
     a.aaa and a.ana are the same (78 of 78 are identical)

Comparing two analysis files that are logically, but not literally,
identical produces similar output:

     C> anadiff b.aaa b.ana
     b.aaa and b.ana are the same (70 of 78 are identical)

As would be expected, more output is generated when the two analysis
files actually are different. The following example illustrates four
types of differences:

  1. Different morph names in a single analysis

  2. Different property names in a single analysis

  3. A double versus a single analysis

  4. Different morph names in the second of two analyses

Note that ANADIFF displays analyses somewhat differently than they
appear in the analysis files.

     C> anadiff c.aaa c.ana
     1.
     diff:   \a < N0 *hirka > LOC
             < N0 *hirka > LOCATIVE
             \d hirka-chaw
             \p =
             \w HIRKACHAW
             \f \\id HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
                     \\c 5\n\n
                     \\s
             \c 2
     2.
             \a < V2 *yatra > CAUS 3
             \d yacha-chi-n
     diff:   \p =Mlowers=foreshortens
             =Morphlowers=foreshortens
             \w YACHACHIN
             \c 2
             \n \n
     3.
             \a < N0 *runa > PLUR
             \d runa-kuna
             \p =
     diff:   \a < N0 *runa > plural [missing in c.ana]
     diff:   \d runa-kuna [missing in c.ana]
     diff:   \p = [missing in c.ana]
             \w runakuna
     4.
             \a < V1 *n~awpa > 3 COND
             \d n~awpa-n-man
             \p =foreshortens=
     diff:   \a < N0 *n~awpa > 3P GOAL
             < N0 *n~awpa > 3P GOALIE
             \d n~awpa-n-man
             \p ==
             \w n~awpanman
     c.aaa and c.ana differ 4 times (63 of 78 are identical)

Batch file example
------------------

ANADIFF returns the number of actual differences that it found when it
exits. This allows it to be used in batch files for regression tests.
For example, consider the following MS-DOS batch file for testing
different versions of AMPLE on a specific set of data:

     if "%1" == "" goto done
     %1 -m -f aetest.cmd -i aetest.txt -o aetest.ana >aetest.log
     anadiff aetest.aaa aetest.ana >aetest.dif
     if errorlevel 1 goto done
     del aetest.dif
     del aetest.ana
     del aetest.log
     :done

(The `%1' is an MS-DOS batch variable corresponding to the first
command line argument following the name of the batch file in the
command.) This batch file runs the given version of AMPLE on a specific
set of data, and then compares the output file (`aetest.ana') to the
output from a previous run (`aetest.aaa'). If the output is the same,
except possibly for the order of ambiguous analyses, then the new
output and log files are deleted.

The same is accomplished by the following Unix shell script.

     #! /bin/sh
     if [ $# -gt 0 ]; then
         $1 -m -f aetest.cmd -i aetest.txt -o aetest.ana >aetest.log
         if (anadiff aetest.aaa aetest.ana >aetest.dif) then
             rm aetest.dif aetest.ana aetest.log
         fi
     fi

Input Analysis Files
********************

Analysis files are "record oriented standard format files". This means
that the files are divided into records, each representing a single
word in the original input text file, and records are divided into
fields. An analysis file contains at least one record, and may contain
a large number of records. Each record contains one or more fields.
Each field occupies at least one line, and is marked by a "field code"
at the beginning of the line. A field code begins with a backslash
character (`\'), followed by one or more letters.

Analysis file fields
====================

This section describes the possible fields in an analysis file. The
only field that is guaranteed to exist is the analysis (`\a') field.
All other fields are either data dependent or optional.

Analysis field: \a
------------------

The analysis field (`\a') starts each record of an analysis file.
It has the following form:

     \a PFX IFX PFX < CAT root CAT root > SFX IFX SFX

where `PFX' is a prefix morphname, `IFX' is an infix morphname, `SFX'
is a suffix morphname, `CAT' is a root category, and `root' is a root
gloss or etymology. In the simplest case, an analysis field would look
like this:

     \a < CAT root >

where `CAT' is a root category and `root' is a root gloss or etymology.

Decomposition field: \d
-----------------------

The morpheme decomposition field (`\d') follows the analysis field. It
has the following form:

     \d anti-dis-establish-ment-arian-ism-s

where the hyphens separate the individual morphemes in the surface form
of the word.

Category field: \cat
--------------------

The category field (`\cat') provides rudimentary category information.
This may be useful for sentence level parsing. It has the following
form:

     \cat CAT

where `CAT' is the word category. If there are multiple analyses, there
will be multiple categories in the output, separated by ambiguity
markers.

Properties field: \p
--------------------

The properties field (`\p') contains the names of any allomorph or
morpheme properties found in the analysis of the word. It has the form:

     \p ==prop1 prop2=prop3=

where `prop1', `prop2', and `prop3' are property names. The equal signs
(`=') serve to separate the property information of the individual
morphemes. Note that morphemes may have more than one property, with
the names separated by spaces, or no properties at all.

Feature Descriptors field: \fd
------------------------------

The feature descriptor field (`\fd') contains the feature names
associated with each morpheme in the analysis. It has the following
form:

     \fd ==feat1 feat2=feat3=

where `feat1', `feat2', and `feat3' are feature descriptors. The equal
signs (`=') serve to separate the feature descriptors of the individual
morphemes. Note that morphemes may have more than one feature
descriptor, with the names separated by spaces, or no feature
descriptors at all.
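
The way the equal signs divide a `\p' or `\fd' value among the
morphemes can be sketched in a few lines of Python. This is only an
illustration, not code taken from ANADIFF: the function name
`split_morpheme_names' is hypothetical, and the sketch assumes the
value holds a single analysis with no ambiguity markers.

```python
def split_morpheme_names(value):
    """Split a single-analysis property or feature value into per-morpheme lists.

    Hypothetical helper: each '=' separates two morphemes, so a value with
    N equal signs yields N + 1 lists, any of which may be empty.
    """
    return [segment.split() for segment in value.strip().split("=")]


# The form shown in the manual: four separators, hence five morphemes.
print(split_morpheme_names("==prop1 prop2=prop3="))
```

Under that reading, the value `=Mlowers=foreshortens' from the earlier
ANADIFF example yields three lists, matching the three morphemes of
`yacha-chi-n'.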
If there are multiple analyses, there will be multiple feature sets in
the output, separated by ambiguity markers.

Underlying form field: \u
-------------------------

The underlying form field (`\u') is similar to the decomposition field
except that it shows underlying forms instead of surface forms. It
looks like this:

     \u a-para-a-i-ri-me

where the hyphens separate the individual morphemes.

Word field: \w
--------------

The original word field (`\w') contains the original input word as it
looks before decapitalization and orthography changes. It looks like
this:

     \w The

Note that this is a gratuitous change from earlier versions of AMPLE
and KTEXT, which wrote the decapitalized form.

Formatting field: \f
--------------------

The format information field (`\f') records any formatting codes or
punctuation that appeared in the input text file before the word. It
looks like this:

     \f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
             \\c 5\n\n
             \\s

where backslashes (`\') in the input text are doubled, newlines are
represented by `\n', and additional lines in the field start with a tab
character. The format information field is written to the output
analysis file whenever it is needed, that is, whenever formatting codes
or punctuation exist before words.

Capitalization field: \c
------------------------

The capitalization field (`\c') records any capitalization of the input
word. It looks like this:

     \c 1

where the number following the field code has one of these values:

`1'
     the first (or only) letter of the word is capitalized

`2'
     all letters of the word are capitalized

`4-32767'
     some letters of the word are capitalized and some are not

Note that the third form is of limited utility, but still exists
because of words like the author's last name. The capitalization field
is written to the output analysis file whenever any of the letters in
the word are capitalized.
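
A program that reads analysis files might classify the `\c' value as
in the following Python sketch. The function name
`describe_capitalization' is hypothetical. The manual does not say how
the values from 4 to 32767 encode which letters are capitalized, so
the sketch only reports that the capitalization is mixed, and it
treats any value outside the documented ranges as an error.

```python
def describe_capitalization(value):
    """Classify the number that follows a capitalization field code.

    Hypothetical helper covering only the three documented cases.
    """
    code = int(value)
    if code == 1:
        return "first letter capitalized"
    if code == 2:
        return "all letters capitalized"
    if 4 <= code <= 32767:
        # The meaning of the individual values is not documented here.
        return "mixed capitalization"
    raise ValueError("undocumented capitalization code: %d" % code)


print(describe_capitalization("1"))
```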
Nonalphabetic field: \n
-----------------------

The nonalphabetic field (`\n') records any trailing punctuation, bar
codes, or whitespace characters. It looks like this:

     \n |r.\n

where newlines are represented by `\n'. The nonalphabetic field ends
with the last whitespace character immediately following the word. The
nonalphabetic field is written to the output analysis file whenever the
word is followed by anything other than a single space character. This
includes the case when a word ends a file with nothing following it.

Ambiguous analyses
==================

The previous section assumed that only one analysis is produced for
each word. This is not always possible since words in isolation are
frequently ambiguous. Multiple analyses are handled by writing each
analysis field in parallel, with the number of analyses at the
beginning of each output field. For example,

     \a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
     \d %2%imaika-Npa-ni%imaika-Npani%
     \cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
     \p %2%==%=%
     \fd %2%==%=%
     \u %2%imaika-Npa-ni%imaika-Npani%
     \w Imaicampani
     \f \\v124
     \c 1
     \n \n

where the percent sign (`%') separates the different analyses in each
field. Note that only those fields which contain analysis information
are marked for ambiguity. The other fields (`\w', `\f', `\c', and `\n')
are the same regardless of the number of analyses.

Analysis failures
=================

The previous sections assumed that words are successfully analyzed.
This does not always happen. Analysis failures are marked the same way
as multiple analyses, but with zero (`0') for the ambiguity count. For
example,

     \a %0%ta%
     \d %0%ta%
     \cat %0%%
     \p %0%%
     \fd %0%%
     \u %0%%
     \w TA
     \f \\v 12 |b
     \c 2
     \n |r\n

Note that only the `\a' and `\d' fields contain any information, and
those both have the original word as a place holder. The other analysis
fields (`\cat', `\p', `\fd', and `\u') are marked for failure, but
otherwise left empty.
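
The marking scheme for ambiguities and failures can be read back with
a short Python sketch such as the one below. It is an illustration
only and is not ANADIFF's actual comparison algorithm, which this
manual does not describe. The names `parse_marked_field' and
`same_up_to_order' are hypothetical, and the sketch assumes that the
marker character never occurs inside an analysis.

```python
def parse_marked_field(value, marker="%"):
    """Return (count, analyses) for a field value such as '%2%a%b%'.

    A value that does not start with the marker is treated as one analysis.
    A count of 0 indicates an analysis failure.
    """
    value = value.strip()
    if not value.startswith(marker):
        return 1, [value]
    parts = value.split(marker)
    # parts[0] is the empty text before the first marker and parts[1] is
    # the count; a trailing marker leaves one empty string at the end.
    count = int(parts[1])
    analyses = parts[2:-1] if parts[-1] == "" else parts[2:]
    return count, analyses


def same_up_to_order(value_a, value_b, marker="%"):
    """True when two field values hold the same analyses in any order."""
    count_a, analyses_a = parse_marked_field(value_a, marker)
    count_b, analyses_b = parse_marked_field(value_b, marker)
    return count_a == count_b and sorted(analyses_a) == sorted(analyses_b)


print(parse_marked_field("%2%imaika-Npa-ni%imaika-Npani%"))
print(same_up_to_order("%2%a%b%", "%2%b%a%"))
```

The `marker' parameter plays the role of the `-a' command line option,
for files that use a character other than `%'.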