Date: Fri, 19 May 1995 10:36:14 -0500
From: Evan.Antworth@sil.org (Evan L. Antworth)
Subject: [14] Englex update
To: pc-parse@sil.org
Reply-to: PC-PARSE@sil.org
MIME-version: 1.0
Comments: PC-PARSE Mailing List

I have just released an update for Englex, numbered 2.0b4. It fixes a number of errors and makes several improvements (I hope!).

The biggest change is that Englex now handles words with lexical capitalization, e.g. proper nouns. I am distinguishing lexical capitalization from orthographic capitalization. For words such as "September" and "France", capitalization is part of their lexical form. Thus "May", the month, can be distinguished from "may", the modal. Also, many acronyms have lexical capitalization, such as IBM and NATO. Orthographic capitalization refers to sentence-initial capitalization or use of all-caps for emphasis. Englex does not handle these cases; they must be handled with preprocessing.

To effect this change, I have added the 26 upper case letters to the ALPHABET declared in the rules file (english.rul). I have been reluctant to do this for fear that adding another 26 feasible pairs would seriously degrade processing speed. However, I finally realized that lexical upper case letters almost never undergo any of the orthographic rules; so the upper case pairs have not been added to the rules. The result is only about a 10% increase in processing time. I think the considerable gain in descriptive power and processing accuracy is worth it. The lexicon entries in proper.lex and abbrev.lex (mainly) have also been changed to use lexical capitalization.

I also added . (period) to the alphabet since it is used as a lexical letter in many abbreviations, e.g. U.S.A. The file abbrev.lex now uses lexical forms including periods. Note that this has nothing to do with handling period as a punctuation character (full stop at the end of a sentence).

I created a new lexicon file named natural.lex which contains less common flora and fauna terms, etc.
The file can be optionally loaded to save memory. See the file english.lex.

I've also made various fixes and changes to the grammar file (english.grm) including the following.

The features "lemma" and "lemma_pos" are now (more accurately) "root" and "root_pos".

I'm trying to handle cooccurrence restrictions on affixes and roots. For example, the word "copies" would return two parses: copy+s (the expected one) and co+pie+s (if you can have a copresident, why not a copie?). In the general case you want to permit the prefix co+ to occur on nouns, but would like to block howlers like this. So I am using specific features to do it:

    LET deg2 be = + ;co+
    LET ~deg2 be = - ;co+

The lexical entry for co+ has the feature [deg2 +] while specific nouns such as "pie" have the feature [deg2 -], thus blocking their unification. Note that while I have created some of the structure for handling such cooccurrence restrictions, I have not found all of them! Send me your examples of parses you think should be blocked.

Here is a point on which I would like to hear some discussion: how to handle participles. For example, the -en participle form of verbs can function as an adjective, as in "cooked vegetables"; it even takes adjectival affixes: "uncooked vegetables". To handle this, Englex produces both a verb (-en form) and an adjective parse of such words. Also, an -ing verb is parsed as a verb, an adjective, and a noun. This makes for a lot of parses. If the main use of Englex is to provide morphological information to a syntactic parser (such as P-PATR), is this the desired behavior? In this new update of Englex, I have added a feature constraint that blocks the adjective and noun parses of -en and -ing forms just in the case where they do NOT bear any derivational affixes. For example, "cooked" will be parsed as a verb (-en form), but "uncooked" as an adjective. You can reverse this behavior by finding the "Word = Stem" rule and commenting out the constraint " = -".
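
For readers who want to see why the [deg2 +] / [deg2 -] clash described above blocks co+pie, here is a minimal Python sketch of flat feature unification. It is only an illustration and is not PC-KIMMO code: the function names (unify, prefix_attaches), the "pos" feature, and the entry for "president" are hypothetical; only the deg2 values for co+ and "pie" come from this message.

```python
from typing import Dict, Optional

Features = Dict[str, str]


def unify(a: Features, b: Features) -> Optional[Features]:
    """Unify two flat feature structures; return None on a value clash."""
    result = dict(a)
    for name, value in b.items():
        if name in result and result[name] != value:
            return None  # conflicting values, so unification fails
        result[name] = value
    return result


def prefix_attaches(prefix: Features, root: Features) -> bool:
    """A prefix may attach to a root only if their features unify."""
    return unify(prefix, root) is not None


# Hypothetical, simplified entries. Only the deg2 values for co+ and "pie"
# are taken from the message; everything else is assumed for illustration.
PREFIX_CO: Features = {"deg2": "+"}
NOUN_PIE: Features = {"pos": "N", "deg2": "-"}
NOUN_PRESIDENT: Features = {"pos": "N"}  # left unmarked for deg2

if __name__ == "__main__":
    print("co+pie allowed:", prefix_attaches(PREFIX_CO, NOUN_PIE))
    print("co+president allowed:", prefix_attaches(PREFIX_CO, NOUN_PRESIDENT))
```

Under these assumptions a noun that is unmarked for deg2 still accepts co+, so only the nouns explicitly marked [deg2 -] are blocked.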

Below is information on how to obtain the Englex update.

We also have an AppleEvents version of PC-KIMMO (AEKimmo) that supports AppleScript. We haven't released it yet. Is anyone interested in it?

We are also close to releasing a new program using the PC-KIMMO parser. Called KTagger, it can be used to parse words and produce a tagged output in formats customized by the user. Watch this list!

--Evan

Evan Antworth                   | e-mail: evan.antworth@sil.org
Academic Computing Department   | phone: 214-709-3346
Summer Institute of Linguistics | fax: 214-709-3363
7500 W. Camp Wisdom Road
Dallas, TX 75236

-----------------------------------------

For more information on Englex (including on-line documentation), connect to our Web server or Gopher server at these URLs:

    gopher://gopher.sil.org/11/gopher_root/pc-kimmo/v2/englex/
    http://www.sil.org/pckimmo/v2/doc/englex.html

Englex is directly available from these URLs:

    MS-DOS and Windows: ftp://ftp.sil.org/data/pc-kimmo/dos/engl20b4.zip
    Macintosh:          ftp://ftp.sil.org/data/pc-kimmo/mac/englex20b4.sea_hqx
    UNIX:               ftp://ftp.sil.org/data/pc-kimmo/unix/englex20b4.tar_z

Englex can also be retrieved via e-mail. Send a message to MAILSERV@SIL.ORG consisting of these two lines only:

    HELP
    INDEX