SYNTAXSCAPE
A syntactic parser for the Internet
Juno R Suk (junowhoim@yahoo.com)
__________________________________________________________________

TUTORIAL CONTENTS

   PURPOSE
   STARTING THE PROGRAM
   DESCRIPTION OF GRAPHICAL USER INTERFACE
   LOADING AND SAVING A GRAMMAR
   LOADING AND SAVING A CORPUS
   GRAPHICAL VIEW
   PARSE INTERFACE
   GRAMMAR EDITOR
   LEXICON EDITOR
   CORPUS EDITOR
   HISTORY CACHE
   HYPERLINKING
   PREFERENCE EDITOR
__________________________________________________________________

PURPOSE

This program serves, first and foremost, as a didactic tool for visually demonstrating the concepts of syntactic parsing and context-free grammars. Through an intuitive graphical user interface, students can experiment with creating grammars and lexicons, and with entering or retrieving sentences to be parsed against the grammar and lexicon they have created.

The program provides functions for:

   - loading, editing, and saving grammars and lexicons
   - retrieving corpora through a URL, the local file system, or a history cache
   - editing the corpus, with the corpus automatically saved in its current state to the history cache
   - selectively parsing the contents of the corpus
   - viewing and printing the syntactic parse trees in one of two available graphical views
__________________________________________________________________

STARTING THE PROGRAM

You can start the program in one of two ways. The easiest is to use the provided shell scripts. The other is to type the java command line yourself. The syntax for starting the program is

   java -D_APP_HOME_DIR= -D_DEBUG= Synscape

An example of this syntax is found in the included shell/batch scripts:

   (for UNIX)    sscape  testrun
   (for MS-DOS)  sscape.bat  testrun.bat
__________________________________________________________________

DESCRIPTION OF GRAPHICAL USER INTERFACE

The GUI main screen is divided into five parts. From top to bottom, they are:

   1. Menubar
   2. Main Operations Panel
   3. URL Input Panel
   4. Graphical Canvas
   5. Parse Interface

The menubar provides access to many functions that are also available elsewhere in the interface. It additionally contains items such as loading and saving a grammar and viewing this tutorial.

The main operations panel provides access to some of the more commonly used functions: loading a corpus from a local file or from the history cache, opening the editors for modifying the grammar, lexicon, and corpus, and opening the preference window for configuring program options.

The URL input panel provides a textfield in which to type a new URL from which to load a corpus.

The graphical canvas shows the syntactic parse. The view is available in either box or tree format, selected through an option in the parse interface.

The parse interface includes the aforementioned view option, a print function for printing the current graphical canvas, and arrows for moving between parses and between sentences. Reparsing the current sentence is also an option. At the bottom of this interface is a status bar that keeps you up to date on some important information: current sentence, current parse, current view, maximum depth allowed on parse trees, and a current action message box.
__________________________________________________________________

LOADING AND SAVING A GRAMMAR

A default grammar is normally loaded from a default file when the program starts. You can instead specify an alternate grammar at start-up through command-line arguments to the java command (see above, STARTING THE PROGRAM), or load a new grammar later through the File menu in the menubar.

Note: The grammar file name should end in the extension ".grammar". The default grammar folder is APP_HOME_DIR/GRAMMARS

The grammar is saved automatically on two events:

   1. User loads a new grammar
   2. User shuts down the program

If you wish to save the grammar under a different name, select "Save Grammar" under the File menu. This saves the file under the name and directory of your choice. The extension ".grammar" is appended automatically if you did not type it in the file dialog.
__________________________________________________________________

LOADING AND SAVING A CORPUS

You have four ways of loading a corpus.

   1. Specify an initial corpus URL in the "java" command line (see above, STARTING THE PROGRAM).
   2. Enter a URL into the URL: textfield and press
   3. Click on the "Local" icon and select a file from your local file system.
   4. Click on the "History" icon and select one of the previously viewed pages.

Currently, corpora are assumed to be in plain text form and are parsed as such. HTML files may be loaded as a corpus, but HTML handling is minimal, and the current filter will probably let many nonsensical tags and phrases show up as sentences.

The corpus is saved automatically by the program on the following events:

   1. User requests a new corpus
   2. User shuts down the program
__________________________________________________________________

GRAPHICAL VIEW

This is the area where the syntactically parsed sentence is shown graphically. The sentence can be displayed in three ways.

   1. Lexeme Only
      This view is shown automatically when the item selected in the parse list is the top entry, which is the unparsed form of the sentence.

   2. Tree View
      This is the default view. It shows each token as a node in the tree and draws the branches connecting it to its parent and children.

   3. Box View
      This view, selected by clicking the Tree/Box view toggle button on the parse interface, replaces the nodes and branches with nested boxes. Each child token's box lies entirely inside its parent's box; conversely, a token's box encloses the boxes of all its children.
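The containment rule of the Box View can be illustrated in text form, with one bracket pair standing in for each box. The following is only a sketch of the idea, not the program's rendering code: the classes BoxNode and BoxViewSketch and the bracket notation are hypothetical and do not come from SyntaxScape.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical parse-tree node for illustration; not a SyntaxScape class.
class BoxNode {
    final String label;
    final List<BoxNode> children;

    BoxNode(String label, BoxNode... children) {
        this.label = label;
        this.children = Arrays.asList(children);
    }
}

class BoxViewSketch {
    // Each node becomes one bracket pair (its "box"), and the boxes of all
    // of its children are written inside that pair.
    static String toBoxes(BoxNode node) {
        StringBuilder sb = new StringBuilder("[").append(node.label);
        for (BoxNode child : node.children) {
            sb.append(' ').append(toBoxes(child));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        BoxNode tree = new BoxNode("S",
            new BoxNode("NP", new BoxNode("N", new BoxNode("dogs"))),
            new BoxNode("VP", new BoxNode("V", new BoxNode("bark"))));
        // prints [S [NP [N [dogs]]] [VP [V [bark]]]]
        System.out.println(toBoxes(tree));
    }
}
```

Reading the output from the outside in corresponds to reading the Tree View from the root down: the same structure, drawn as enclosure instead of branches.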
__________________________________________________________________

PARSE INTERFACE

The parse interface helps you navigate through the individual sentences and their parses.

The Next Sentence and Previous Sentence buttons move you through all the sentences currently loaded in the corpus.

The Next Parse and Previous Parse buttons move you through all the parses currently available for the selected sentence. A parse can also be selected directly by clicking on the appropriate item in the list right below this bar.

The Reparse button parses the currently selected sentence. By default, sentences are not parsed when they are loaded, so you must parse them either with this button or with the Parse button in the Corpus Editor.

The Tree/Box View button toggles between the two views described in the Graphical View section. If the selected item is the top item on the parse list, no Tree or Box view is available, since the top item is always the original, unparsed sentence.

The Print button sends the current parse tree image to the printer. Note that the printed image will probably differ from the one shown in the program, because the image is recalculated to fit the dimensions of the printed page. Sometimes the recalculated image looks very different, though the structure is retained.
__________________________________________________________________

GRAMMAR EDITOR

Gives a listing of all the grammar definitions in the currently loaded grammar. You can modify this list by adding, updating, or deleting definitions. The operations are intuitive, but be careful that all inputs are valid and intended: any change in the grammar will be saved to disk.
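To make the idea of a grammar definition concrete, here is a minimal sketch of a context-free grammar whose rules are written over parts of speech, paired with a small lexicon, and a routine that counts the parses of a sentence up to a maximum tree depth. Everything in it is an assumption made for illustration: the class CfgSketch, its methods, the sample rules, and the counting strategy are hypothetical, and neither SyntaxScape's parser nor its ".grammar" file format is described in this tutorial.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory grammar; not SyntaxScape's own data structure.
class CfgSketch {
    // Grammar definitions: left-hand symbol -> alternative right-hand sides.
    final Map<String, List<List<String>>> rules = new HashMap<>();
    // Lexicon: word -> parts of speech. Rules mention only the POS, never the word.
    final Map<String, Set<String>> lexicon = new HashMap<>();

    void addRule(String lhs, String... rhs) {
        rules.computeIfAbsent(lhs, k -> new ArrayList<>()).add(Arrays.asList(rhs));
    }

    void addWord(String word, String... pos) {
        lexicon.computeIfAbsent(word, k -> new HashSet<>()).addAll(Arrays.asList(pos));
    }

    // Counts the parse trees rooted at symbol that cover words[from, to) and
    // are at most maxDepth levels deep (words themselves are not counted as a level).
    int countParses(String symbol, String[] words, int from, int to, int maxDepth) {
        if (maxDepth <= 0 || from >= to) return 0;
        int count = 0;
        Set<String> pos = lexicon.get(words[from]);
        if (to - from == 1 && pos != null && pos.contains(symbol)) count++;
        for (List<String> rhs : rules.getOrDefault(symbol, Collections.<List<String>>emptyList())) {
            count += countSequence(rhs, 0, words, from, to, maxDepth - 1);
        }
        return count;
    }

    // Counts the ways to divide words[from, to) among rhs[index..], with every
    // symbol covering at least one word.
    private int countSequence(List<String> rhs, int index, String[] words,
                              int from, int to, int maxDepth) {
        if (index == rhs.size()) return from == to ? 1 : 0;
        int remaining = rhs.size() - index - 1; // later symbols need one word each
        int count = 0;
        for (int mid = from + 1; mid <= to - remaining; mid++) {
            int left = countParses(rhs.get(index), words, from, mid, maxDepth);
            if (left > 0) count += left * countSequence(rhs, index + 1, words, mid, to, maxDepth);
        }
        return count;
    }

    // A tiny sample grammar and lexicon, invented for this sketch.
    static CfgSketch example() {
        CfgSketch g = new CfgSketch();
        g.addRule("S", "NP", "VP");
        g.addRule("NP", "Det", "N");
        g.addRule("VP", "V", "NP");
        g.addWord("the", "Det");
        g.addWord("dog", "N");
        g.addWord("cat", "N");
        g.addWord("saw", "V", "N");
        return g;
    }

    public static void main(String[] args) {
        String[] words = {"the", "dog", "saw", "the", "cat"};
        CfgSketch g = example();
        System.out.println(g.countParses("S", words, 0, words.length, 10)); // prints 1
        System.out.println(g.countParses("S", words, 0, words.length, 3));  // prints 0
    }
}
```

The second call returns no parses because the only tree for this sentence is four levels deep (S, VP, NP, Det), which is one way a maximum-depth setting can hide a parse. The depth limit also guarantees termination for rules that refer back to themselves over the same span.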
__________________________________________________________________

LEXICON EDITOR

Gives a listing of all the lexemes in the lexicon section of the current context-free grammar. To the right is a list of parts of speech (POS), which is updated to show the parts of speech associated with the lexeme selected on the left. You can add, delete, and update lexemes, and add or delete parts of speech for the currently selected lexeme.

As with the Grammar Editor, be careful that all inputs are valid and intended: any change in the grammar will be saved to disk.

One more thing to note: if you add a word to the lexicon, remember to add at least one POS to it as well. Because the grammar definitions are written in terms of parts of speech and not the words themselves, a word with no POS associated with it will be lost.
__________________________________________________________________

CORPUS EDITOR

Provides a listing of all the sentences in the currently loaded corpus. Clicking on a sentence makes the graphical view and parse interface jump to that sentence. You can also update or delete a sentence in the corpus, or request a parse of the sentence (for example, after modifying the grammar or lexicon, to see the effect of the change).
__________________________________________________________________

HISTORY CACHE

The history cache is similar to the drop-down URL menu available in the Netscape and Microsoft IE browsers. The program saves retrieved corpora to a cache directory and lets you retrieve them quickly through the history cache. A cached file is also loaded automatically when you choose a local file, or type a URL into the URL field, that matches one already in the cache. Previous parses are recorded in these cached files, so sentences in files retrieved through the cache need not be parsed again.
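The cache behaviour can be pictured as a lookup table from URL to saved corpus, with an optional limit on the number of entries (the limit is a program preference). The sketch below is an illustration under stated assumptions, not the program's implementation: the class HistoryCacheSketch is hypothetical, it keeps corpus text in memory instead of in a cache directory, and the choice to discard the least recently used entry first is an assumption, since the tutorial does not say which file is deleted when the limit is reached.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded history cache; not SyntaxScape's own class.
class HistoryCacheSketch {
    // Assumed marker value meaning "never delete anything from the cache".
    static final int UNLIMITED = -1;

    private final LinkedHashMap<String, String> entries;

    HistoryCacheSketch(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Called after each insertion; returning true drops the eldest entry.
                return maxEntries != UNLIMITED && size() > maxEntries;
            }
        };
    }

    // Saves (or replaces) the corpus text stored for a URL.
    void store(String url, String corpusText) {
        entries.put(url, corpusText);
    }

    // Returns the cached corpus for the URL, or null if it is not cached.
    String lookup(String url) {
        return entries.get(url);
    }

    int size() {
        return entries.size();
    }

    void clear() {
        entries.clear();
    }

    public static void main(String[] args) {
        HistoryCacheSketch cache = new HistoryCacheSketch(2);
        cache.store("http://example.org/a.txt", "First corpus.");
        cache.store("http://example.org/b.txt", "Second corpus.");
        cache.lookup("http://example.org/a.txt");              // a.txt is now the most recent
        cache.store("http://example.org/c.txt", "Third corpus."); // evicts b.txt
        System.out.println(cache.lookup("http://example.org/b.txt")); // prints null
    }
}
```

Passing HistoryCacheSketch.UNLIMITED to the constructor models the "indefinite" setting, in which no entry is ever removed automatically.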
The number of files allowed in the cache can be set through the Preference Editor. It can also be set to indefinite there, if you do not want any of the files in the cache to be deleted. The option to clear all items in the cache is available through the History window.
__________________________________________________________________

HYPERLINKING

As an additional enhancement to the graphical view, clicking on a node in the visual graph selects the corresponding element in the list of either the Grammar Editor or the Lexicon Editor. If you click on a terminal node (a lexeme) or the parent of a terminal node (a part of speech), the Lexicon Editor is brought to the front (if visible) with these items selected. If you click on any other node, the Grammar Editor comes to the front with the appropriate grammar definition selected.
__________________________________________________________________

PREFERENCE EDITOR

This configuration window lets you modify the default values associated with different parts of the program. You can specify which grammar file or corpus URL to load upon program start, and whether the graphical view fits its output to user-specified dimensions or automatically resizes to fit the tree rendered at the default font size, node spacing, node padding, and so on. The size of the history cache, and whether the program limits the saved corpus files to this number or keeps them indefinitely, can be set here. The maximum depth of the parse tree can also be specified.
__________________________________________________________________