Getting Started

Congratulations. We are nearly finished. On Linux, the finite-state applications you see in this
directory run in an xterm window. They cannot be started by double clicking the icons.

In what follows, when we say to "enter" a command, that means to type the command at the command-line prompt and press the Enter (or Return) key.

If you copied the for-linux folder into your home directory, open a Terminal window and enter the command

cd ~/for-linux

to go into the directory. Then enter the command

ls -l

to make sure you have arrived at the right place. You should see this GettingStarted file and the five applications: xfst, lexc, lookup, tokenize, and twolc. The display should look something like this:

-rw------- 1 myname staff 13418 Mar 11 18:22 GettingStartedLinux.html
-rwx------ 1 myname staff 499248 Aug 11 2002 lexc
-rwx------ 1 myname staff 371432 Jun 13 2002 lookup
-rwx------ 1 myname staff 221688 Jun 13 2002 tokenize
-rwx------ 1 myname staff 424652 Jul 20 2002 twolc
-rwx------ 1 myname staff 686608 Sep 27 18:52 xfst

The -rwx------ signature at the beginning of the line indicates that you have full rights to the file (read, write, and execute). Sometimes file permissions are not preserved when files are copied. If you see something other than -rwx------ at the beginning of the last five lines, enter the command

chmod 700 *

to fix the permissions. You (and only you) should be able to read, write, move, or execute these files.
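If you would rather not read the listing by eye, the check can be sketched as a small shell function. This is only an illustration: the name check_programs is our own invention, and the sketch assumes a Bourne-style shell (sh or bash) and that your current directory is the one holding the five programs.

```shell
# Hypothetical helper: report any of the five programs that is missing
# or not executable in the current directory.
check_programs() {
  rc=0
  for prog in xfst lexc lookup tokenize twolc; do
    if [ ! -x "./$prog" ]; then
      echo "not executable: $prog"
      rc=1
    fi
  done
  return $rc
}
```

Entering check_programs after defining it prints nothing when all five programs are executable; otherwise it names the ones that still need the chmod command.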

If you now issue the command

./xfst

(note the period and the slash), the xfst application will start and you will see the xfst[0]: prompt. To try out a simple command, you can type

read regex a b c;

to make your first network. This is just a test; you can quit immediately with the command exit.

To make the programs accessible from any xterm window, they should be copied to a directory that is on your "path", a list of directories that Unix searches to find the executable programs for you. Enter the command

echo $path

to see what directories you have on your path.

If you are an experienced Unix user, you already know what to do: move the programs to some existing directory such as ~/bin that is on your path, enter the command rehash, and follow it by the command which xfst just to make sure you did everything right. The installation is finished and you can skip the rest of this section.

If you are a novice Linux user, now is the time to learn a couple of tricks. First make sure that your current working directory is ~/for-linux; if you copied the for-linux folder to your home directory, then enter the command

cd ~/for-linux

If you don't have a ~/bin directory (~/ stands for the path from the top of the file system into your home directory), we recommend that you make one and move the programs there. To do that, enter the commands

mkdir ~/bin
mv * ~/bin/

The mkdir command creates the folder bin in your home directory. If you already have a ~/bin directory, the mkdir command will tell you so, and do nothing. The mv command moves all the files in your current ~/for-linux directory into the ~/bin directory. To verify that all went well, enter the command

ls -l ~/bin/.

You should see the five programs in their new location.

The next step is a little tricky. If you already had a ~/bin directory, there is a chance that it is already on your path. If that is the case, entering the command

echo $path

should show something terminating in ~/bin or /home/myname/bin in its output (different Linux installations may store the user home directories in different locations). If so, enter the commands

rehash
which xfst

to make sure everything is installed correctly.

If the ~/bin directory does not show up in your path, we need to put it there. When the xterm application starts, the shell running in it (csh or tcsh) looks for a file called .cshrc (the dot is part of the name) in your home directory. Check first if you have such a file by doing ls ~/.cshrc. If the file exists, bring it up in a text editor and add the line

set path = (~/bin $path)

and save the .cshrc file. If the file does not yet exist, type the following three lines

cat > ~/.cshrc
set path = (~/bin $path)
^D

where ^D stands for control-D. You now have a ~/.cshrc file that adds ~/bin to your path in every newly launched xterm application.

To verify that everything is OK, enter the command

source ~/.cshrc

to add the ~/bin directory to the path in the current xterm, followed by the commands

rehash
which xfst

If the which command comes back with a path to the location where xfst was installed, all is well and you are done. You can now launch xfst with the command xfst in any new xterm window. If xfst works, the other four applications, lexc, lookup, tokenize, and twolc, will also launch properly.
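The path commands above (echo $path, set path, rehash, the .cshrc file) belong to the csh family of shells. If your Terminal runs a Bourne-style shell such as bash instead, the same steps might look like the sketch below. The function name add_to_path is our own, and we assume the programs were moved to the bin directory in your home directory.

```shell
# Sketch for Bourne-style shells (sh, bash); not for csh or tcsh.
# add_to_path is a hypothetical helper name, not part of the software.

# Prepend a directory to PATH unless it is already listed there.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already on the path; leave PATH unchanged
    *) PATH="$1:$PATH" ;;
  esac
  export PATH
}

add_to_path "$HOME/bin"          # assumption: the programs live in ~/bin
hash -r                          # forget remembered command locations (counterpart of rehash)
command -v xfst || echo "xfst not found on PATH"
```

In such a shell, echo $PATH (upper case) shows the current path. To make the change permanent in bash, the line export PATH="$HOME/bin:$PATH" would typically go into the ~/.bashrc file rather than ~/.cshrc.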

If you have followed the above instructions and, at some later date, wish to uninstall the software, you can do it in any Terminal window on your machine with the commands

cd ~/bin
rm lexc lookup tokenize twolc xfst

History

The Xerox finite-state software has a long history going back to the 1980s. The basic finite-state calculus and the maintenance routines such as determinization and minimization were originally implemented by Ronald M. Kaplan in Xerox Interlisp (Medley) with help from Martin Kay and John Maxwell. The system was then re-implemented and improved in C by Lauri Karttunen and Todd Yampol around 1990 based on Karttunen's 1988 Common Lisp version, which included important contributions from Jan Pedersen, Atty Mullins, and Doug Cutting. Around the same time, Ken Beesley and Lauri Karttunen re-implemented the compiler for Kimmo Koskenniemi's two-level rule formalism (twolc) and Karttunen and Yampol wrote the lexicon compiler (lexc) that became the basic tool for creating lexical transducers for a succession of Xerox enterprises: DDS, XSoft, Inxight.

In 1993, Xerox established a European research center in Grenoble, France, first called RXRC (Rank Xerox Research Centre) and later XRCE (Xerox Research Centre Europe). The maintenance and development of the C-version of the finite-state code moved from Palo Alto to Grenoble when the Grenoble center was established. The enrichment of the calculus with replace-rule expressions is the work of Karttunen and André Kempe, similar to but more versatile and efficient than the compilation algorithm in Kaplan and Kay's 1994 paper. The xfst interface and the two runtime applications (tokenize, lookup) were written at XRCE. The primary XRCE contributors are Pasi Tapanainen, André Kempe, Tamás Gaál, Hervé Poirier, Caroline Privault, and Jean-Marc Coursimault.

In practical use at Xerox, the replace rules of xfst have superseded Koskenniemi's two-level formalism, which was the dominant paradigm in the early 1990s. Most Xerox developers now use lexc to create lexicon-like finite-state transducers and xfst to write rules; the twolc language is falling out of use.

The twolc compiler is included on the software CD but is not documented in the book Finite State Morphology. If you are planning to use twolc or want to know about two-level rules, please read the chapter entitled Two-Level Compiler in the doc folder.

Known Issues

The software on this CD dates back to the summer of 2002. It has been used extensively by many developers at XRCE, Parc, and Inxight. As in any complex piece of software, there are undoubtedly some errors and misfeatures in the code, but we are not aware of any serious bugs. However, there are two limitations that the user should be aware of:

Because of their Unix origins, all the applications assume that lines in text input files end with the Unix newline character "\n". Input files that terminate lines with "\r" (Macintosh) or with "\r\n" (Windows, DOS) cannot be processed with the CD versions of the software. If you have a source file created on a Macintosh, you can replace the end-of-line characters in Unix with the command

tr "\r" "\n" < inputfile > outputfile

The command

tr -d "\r" < inputfile > outputfile

converts a Windows/DOS document into Unix format.
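If you are not sure which convention a file uses, the two tr commands can be wrapped in a small script that looks at the file first. The following is a minimal sketch for a Bourne-style shell; the function names eol_style and to_unix are our own, and the classification is deliberately crude: a file with no "\r" at all counts as Unix, a file with "\r" but no "\n" counts as Macintosh, and anything else is treated as Windows/DOS.

```shell
# Hypothetical helper: classify a text file by its end-of-line convention.
eol_style() {
  cr=$(tr -cd '\r' < "$1" | wc -c)   # number of carriage returns
  lf=$(tr -cd '\n' < "$1" | wc -c)   # number of newlines
  if [ "$cr" -eq 0 ]; then
    echo unix
  elif [ "$lf" -eq 0 ]; then
    echo mac
  else
    echo dos
  fi
}

# Hypothetical helper: write a Unix-format copy of file $1 to file $2.
to_unix() {
  case "$(eol_style "$1")" in
    mac) tr '\r' '\n' < "$1" > "$2" ;;   # Macintosh: turn each \r into \n
    dos) tr -d '\r' < "$1" > "$2" ;;     # Windows/DOS: drop the \r of each \r\n
    *)   cat "$1" > "$2" ;;              # already Unix: plain copy
  esac
}
```

With these definitions, to_unix inputfile outputfile picks the appropriate conversion. The input and output must be different files, just as with the tr commands themselves.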

Only ISO-8859-1 (= Latin-1) character encoding is supported by the CD versions of the software. 16-bit Unicode characters (UCS-2) are handled internally, but they cannot be entered directly as input. For example, the Hebrew letter Alef can be represented as "\u05D0" in a regular expression, where "\u" indicates that the following four hexadecimal digits encode a Unicode symbol. However, the symbol will not be printed as the proper Hebrew character even if the computer has a Hebrew font installed.
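Looking up the four hexadecimal digits by hand is tedious. On a system whose terminal handles UTF-8, they can be computed with standard tools, as in the sketch below. The function name u_escapes is our own; the sketch assumes that iconv and od are installed, that the argument is UTF-8 text, and that every character fits in 16 bits (a character beyond that range would come out as two escapes, a surrogate pair, which is not what the text above describes). Every character of the argument is escaped, so it is best applied only to the non-Latin-1 characters.

```shell
# Hypothetical helper: print each character of a UTF-8 string as \uXXXX.
u_escapes() {
  printf '%s' "$1" |
    iconv -f UTF-8 -t UTF-16BE |   # two bytes per 16-bit character, no byte-order mark
    od -An -v -tx1 |               # dump the bytes as hexadecimal pairs
    tr -d ' \n' |                  # join the pairs into one string of digits
    tr 'a-f' 'A-F' |               # upper-case the hexadecimal digits
    sed 's/..../\\u&/g'            # put \u in front of every group of four
}
```

Called with the letter Alef as its argument, the function prints the escape \u05D0 given in the example above.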

In the near future, we will make available new versions of xfst, lexc, lookup, and tokenize that are aware of different end-of-line conventions and are able to process UTF-8 encoded Unicode files. Please check out the book web site, http://www.fsmbook.com, for updates.