This section discusses the regular expressions of FSA as well as each of the built-in regular expression operators.
The following global variables are important for regular expressions:
regex_cache
pred_module
Regular expressions are defined as Prolog terms, and therefore Prolog syntax applies. For detailed information on this, cf. the Prolog manual. The brackets () can always be used to express the desired grouping. The order of precedence of operators is as follows:
: /
..
+ * ^
& -
o x xx
! #
Operators with the same precedence are interpreted left-to-right. For example, the expression
a..z* - b* & c..d*
is interpreted as:
(((a..z)*) - (b*)) & ((c..d)*)
Syntax restrictions
These are all due to the use of Prolog syntax. The benefit of using Prolog syntax is that I don't need to implement a parser, and you have flexibility (by using your own operator definitions). However, a few limitations are inherited as well. Here are a few rules of thumb:
Capitals can be used in regular expressions in the Tk entry field by putting them between quotes:
'A'..'Z'
At the Regex prompt (after fsa -r) you can use:
'A'..'Z'
At the command-interpreter you can use:
2 |: -r 'A'..'Z'
As part of Expr in the fsa -r Expr command, use the following (this depends on the shell you are using; this example works for Bourne sh, csh, and bash):
fsa -r "'A'..'Z'"
Use space between operators. Use space before and after a question mark (?). Don't use the dot '.' or the vertical bar '|' as (part of) a new operator. Similarly, avoid using the comma ',', the ';', and '->' as (part of) a regular expression operator. Nor is it a good idea to use ':-'. Operators can be escaped using (), but they hardly ever have to be (e.g. the following works, even if o is the binary composition operator!):
fsa -r 'o o o'
Brackets can be used for grouping as well:
fsa -r 'o o o o (o o o)'
The regular expression compiler provides detailed information on the computation time and the size of the resulting automata for certain regular expression operators, namely for those operators Op for which the predicate
bb_get(fsa_rx_spy:Op,on).
succeeds. So you can set a spy-point on the operator concat by the directive:
?- bb_put(fsa_rx_spy:concat,on).
The special operator spy(Expr) is equivalent to Expr except that it has an associated spy-point.
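For example, the following puts a spy-point on the first operand of the intersection only (a minimal illustration; what is reported depends on the debug level):
fsa -r 'spy([a,b]*) & {a,b,c}*'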
Using the -aux[file] command line option, or the AuxFile button of the TK Widget, you can load auxiliary regular expression operators. The file should be a Prolog source file (either .pl or .ql). It will be loaded into module *fsa_regex_aux*. The syntax of regular expressions can be used in this file (in fact it must be used; beware if the file also contains ordinary Prolog code!).
Two relations are important: macro/2 and rx/2.
The first relation is usually defined by unit clauses. It simply states that the regular expression in the first argument is an abbreviation for the regular expression in the second argument. For example:
macro(vowels,{a,e,i,u,o}).
Such macros can be parameterized using Prolog variables, e.g.:
macro(brz(Expr),determinize(reverse(determinize(reverse(Expr))))).
The relation rx/2 can be used for more complicated operations (operations that are cumbersome or impossible to define in terms of simpler regular expression operators). It defines a relation between the regular expression in the first argument and the finite automaton in the second argument. It is often useful to be able to call the regular expression compiler recursively. This should be done through the predicate fsa_regex:rx/2. The following is equivalent to the first example of macro/2 above:
rx( vowels, Fa) :-
fsa_regex:rx({a,e,i,u,o}, Fa).
Consult the Examples directory, for instance the MohriSproat96, Karttunen95, Karttunen96, Karttunen98, GerdemannVannoord, and Queens directories, for some extensive illustrations.
Suppose you want to use the definition of a replace operator from some file replace.pl in your analysis of Dutch phonology. In the latter file you can import the definitions from replace.pl by adding the following directives somewhere near the top:
:- ensure_loaded(replace). %% loads replace.pl
:- multifile macro/2.
:- multifile rx/2.
This only works if the multifile declarations are also present in the file you are importing, i.e. in this example the file replace.pl should also contain the directives
:- multifile macro/2.
:- multifile rx/2.
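A minimal sketch of such an importing file might then look as follows (the macro bodies, and in particular the replace/3 operator, are hypothetical here; which operators are available depends on what replace.pl actually defines):
:- ensure_loaded(replace). %% loads replace.pl
:- multifile macro/2.
:- multifile rx/2.
macro(vowels,{a,e,i,u,o}).
%% hypothetical rule using an assumed replace/3 operator from replace.pl
macro(final_devoicing,replace(d:t, ? , [])).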
The set of one-symbol strings over the universal alphabet, i.e. ? can be read as `any symbol whatsoever'. It uses the true/1 predicate declaration from the current predicate module. If that declaration is not defined then no compilation for this operator is possible, and an error occurs.
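For instance, the following denotes all strings of length three whose middle symbol is an a:
fsa -r '[ ? , a, ? ]'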
Applies minimization to the result of compiling Expr. There are a number of related expressions depending on which minimization algorithm is to be used.
mb uses the algorithm due to Brzozowski; mh uses the algorithm by Hopcroft (as described in Aho, Hopcroft and Ullman, 1974).
If Expr is a transducer then it is temporarily treated as a recognizer over pairs of symbols (using the fsa_frozen predicate module).
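For example, the following two commands should yield the same minimal automaton; only the minimization algorithm used differs:
fsa -r 'mb({[a,b],[a,b,c]}*)'
fsa -r 'mh({[a,b],[a,b,c]}*)'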
Set of strings denoted by A, but moreover the subset construction determinization algorithm is applied to ensure that the automaton is deterministic. The algorithm can be specified as the second argument. There are several variants of the algorithm, which differ with respect to the treatment of epsilon transitions:
per_graph: first construct efree automaton (jumps taken into account on target side of transitions and on start states)
per_inverted_graph: first construct efree automaton (jumps taken into account on source side of transitions and on final states)
per_reachable_graph: as per_inverted_graph, but maintains accessibility
per_co_reachable_graph: as per_graph, but maintains co-accessibility
per_subset: compute transitive closure of jumps on the fly for each subset
per_state: compute transitive closure of jumps on the fly for each state
These variants and some interesting experimental observations are described in a paper I presented at the FSMNLP 98 workshop in Ankara. The paper is available from http://www.let.rug.nl/~vannoord/papers/. An improved version of the paper will be published in Computational Linguistics.
By default the algorithm is chosen by a simple heuristic based on the number of states and number of jumps of the input automaton. If A is a transducer then it is temporarily treated as a recognizer over pairs of symbols.
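For instance, assuming the variant can be given as the second argument as described above, the following applies the per_subset treatment of jumps to a random automaton (which typically contains jumps; cf. the random operator below):
fsa -r 'determinize(random(20,10,0.1,0.1),per_subset)'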
Constructs an epsilon-free automaton for the automaton created for E. The first variant is faster; the second and third algorithms yield smaller automata by only taking into account states reachable from the start state, resp. from which a final state is reachable. If E is a transducer then it is temporarily treated as a recognizer over pairs of symbols.
The complement of the language denoted by E. E must be a recognizer.
Set of strings denoted by A minus those given by B. A and B must be recognizers.
The language consisting of all strings that have an instance of E as a sub-string: [? *, E, ? *]. Note that the result is a minimal automaton. Since the definition of this operator depends on the ?/0 operator it is only defined if the current predicate module provides a definition of true/1. E can be either a recognizer or a transducer.
The set of pairs denoted by E, but moreover the determinization algorithm for transducers by Mohri, cf. also Roche and Schabes, is applied to E. NB: this is only guaranteed to terminate if in fact E can be determinized in the appropriate sense. The implementation currently does not check for this. Refer to the Examples/RocheSchabes97 directory for an experimental implementation of that check and various related algorithms. Also note that the outputs associated with final states that result from the construction are represented by new transitions with epsilon input. E is (coerced into) a transducer.
representation of sequential transducers
Note that in FSA subsequential transducers are represented as ordinary transducers. This implies in particular that instead of output symbols associated with final states, we have a separate transition over epsilon input and final output to a new final state. Similarly, automata which require an initial output to be associated with the start state will give rise to an extra transition from a new start state with epsilon input and the required start output.
types of transducers
The predicates all take a flag which indicates the type of transducer. This flag is passed on to a number of predicates in the transducer module. For each type of transducer, the transducer module must define the predicates:
zero(Type,Val).
plus(Type,Val0,Val1,Sum).
minus(Type,Val0,Val1,Diff).
minimum(Type,Val0,Val1,MinVal).
minimum_only(Type,YesNo).
Refer to the transducer module for an overview of the types of transducer currently supported.
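As an illustration only, clauses for a hypothetical numeric type my_cost might look as follows (the names of the actually supported types, and the intended value of minimum_only/2, should be checked in the transducer module):
zero(my_cost,0).
plus(my_cost,V0,V1,Sum) :- Sum is V0+V1.
minus(my_cost,V0,V1,Diff) :- Diff is V0-V1.
minimum(my_cost,V0,V1,Min) :- Min is min(V0,V1).
minimum_only(my_cost,yes). %% assumed value: only minimal values are of interest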
The treatment of identity constraints over predicate pairs is especially tricky. This is mostly hidden in the predicate module declaration of determinize/2 preds. Funny things to watch for:
certain non-functional automata become t_deterministic:
t_minimize([a x class([b,d]),d*])
whereas FSA5 would loop over
t_minimize([ a x {b,d}, d*])
This is actually quite useful.
delayed identity constraints: predicates apply on the input side before the transition containing the target of the identity constraint is encountered. For example:
t_minimize({[a:b,?,?,?,?,?,b],[a:c,?,?,?,?,?,c]})
This is especially nasty if the number of question marks does not match up:
t_minimize({[a:b,?,?,?,?,?,b],[a:c,?,?,?,c]})
the opposite occurs as well: sometimes we have to output a symbol satisfying a certain predicate which must be identical to an input symbol that is yet to be encountered! This currently works. Try for instance:
t_minimize([a:b,class(a..f)])
I think this is quite spectacular.
The set of pairs denoted by E, but moreover the minimization algorithm for transducers by Mohri is applied to E. NB: this is only guaranteed to terminate if in fact E can be determinized in the appropriate sense. The implementation currently does not check for this. Refer to the Examples/RocheSchabes97 directory for an experimental implementation of that check and various related algorithms. Also note that the outputs associated with initial and final states that result from the construction are represented by new transitions with epsilon input. E is (coerced into) a transducer.
The set of pairs denoted by E, but moreover the determinization algorithm for string to weight transducers by Mohri is applied to E. Thus, E must denote a string-to-weight transducer, i.e., all output symbols must be numbers. NB: this is only guaranteed to terminate if in fact E can be determinized in the appropriate sense. E is a transducer where all output symbols are numbers.
The set of pairs denoted by E, but moreover the minimization algorithm for string to weight transducers by Mohri is applied to E. Thus, E must denote a string-to-weight transducer, i.e., all output symbols must be numbers. NB: this is only guaranteed to terminate if in fact E can be determinized in the appropriate sense. E is a transducer where all output symbols are numbers.
Creates a perfect hash of minimal size for the words found in ListOfAtoms. This uses an inefficient algorithm, intended for didactic purposes only. More efficient algorithms are implemented in module fsa_dict but are not available as regular expression operators.
Creates a perfect hash for the words found in ListOfAtoms (you would normally apply the w_minimize operator to the result). This uses an inefficient algorithm, intended for didactic purposes only. More efficient algorithms are implemented in module fsa_dict but are not available as regular expression operators.
A and B are symbols; the result is a transducer mapping an A to a B.
The set of pairs (A0,B0) such that A0 is in A and B0 is in B. Both A and B must describe recognizers.
The set of pairs (A0,B0) such that A0 is in A and B0 is in B; moreover, the strings A0 and B0 must be of the same length. Both A and B must be recognizers.
Sym is a symbol. This denotes the language consisting of that symbol. Can be used to overwrite the special meaning of some symbols. For instance, escape(?) can be used to denote a literal question mark. Sym should be a ground Prolog term, and it is passed through the predicate regex_notation_to_predicate of the current predicate module.
S and T are one-character atoms or integers. In the first case, denotes the set of symbols from S up to T in ASCII coding. For instance a..e is equivalent to {a,b,c,d,e}. If S and T are integers, represents the set of integers in that interval: for instance 8..11 is equivalent to {8,9,10,11}.
Identical to set(Expr), except that Expr must be one of:
a list of symbols and intervals of symbols
an interval of symbols
a symbol
After expansion of the intervals, the resulting list of atomics is passed through PredModule:class_to_pred, typically ensuring that the resulting set of transitions is smaller. This operator is similar to the character classes found in UNIX-style regular expressions. For instance:
class([a..z,0..9])
Similar to class/1, except that in this case the complement of the symbols defined by class(Expr) is used. Similar to the negated character classes found in UNIX-style regular expressions, such as [^aeiou]. For example:
negated_class(['<1','<2','1>','2>']).
Ensures that all states in the automaton for A are co-accessible, i.e. for each state there is a path to a final state.
Ensures that all states in the automaton for A are co-accessible, i.e. for each state there is a path to a final state.
This operator ensures that for each state s in the automaton of A there is a path from a start state to s.
This operator ensures that for each state s in the automaton of A there is a path from a start state to s.
Adds transitions and a sink state such that the transition table is total, i.e. there is a transition for every symbol from every state. If A is a transducer then it is temporarily treated as a recognizer over pairs of symbols.
Strings from A interspersed with substrings from B. For instance, ignore([a,a,a],c) contains all strings over the alphabet {a,c} which contain exactly three a's. Both A and B must be recognizers.
{} denotes the empty language.
Union of the languages denoted by E1,..,En. As a special case, '{}' is the empty language, i.e. a language without any strings. Note that the result is a minimal automaton. E1 .. En can be either transducers or recognizers. If one of them is a transducer, then all of the others are coerced into transducers as well.
[] denotes the empty string (or equivalently the language solely consisting of the empty string).
The concatenation of the languages denoted by E1, E2, .. En. As a special case, [] is the language solely consisting of the empty string. Note that the result is a minimal automaton. E1 .. En can be either recognizers or transducers. If one of them is a transducer then all of the others are coerced into transducers as well.
Kleene closure (zero or more concatenations) of the language denoted by E. Note that the result is a minimal automaton. E can be either a recognizer or a transducer.
Kleene plus (one or more concatenations) of the language denoted by E. Note that the result is a minimal automaton. E can be either a recognizer or a transducer.
Union of E with the empty string, i.e. a string from E occurs optionally. The result is a minimal automaton. E can be either a recognizer or a transducer.
The intersection of the languages denoted by A and B. Produces a minimal automaton. A and B must be recognizers.
The set of pairs (A,C) such that (A,B) is in E0 and (B,C) is in E1. Both E0 and E1 are (coerced into) transducers. Note that the result is a minimal automaton.
Note that if E0 and E1 are not both same-length transducers, the resulting transducer will often give rise to `spurious' results in the sense that for a given input the same output is produced several times. See the paper by Pereira and Riley, 1996, for some suggestions to repair this. Obviously, in cases where you can determinize the transducer (with t_determinize) the spurious ambiguities will disappear as well.
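As a minimal illustration of composition itself, the following composes two single-pair transducers; the result maps a to c:
fsa -r '(a:b) o (b:c)'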
The binary form is equivalent to Set* & Expr; the ternary form is equivalent to DomSet* o Expr o RanSet*. You are advised to use the class/1 operator for the specification of Set, but this is not required. For example, the following two expressions define the same language, but the latter expression typically results in a smaller automaton (this ultimately depends on the predicate module):
sigma(a..z,Expr)
sigma(class(a..z),Expr)
Set and Expr must both describe recognizers.
Is equivalent to Set* o Expr. You are advised to use the class/1 operator for the specification of Set, but this is not required. Set must be a recognizer; Expr is (coerced into) a transducer.
Is equivalent to Expr o Set*. You are advised to use the class/1 operator for the specification of Set, but this is not required. Set must be a recognizer; Expr is (coerced into) a transducer.
The set of strings F such that the reversal of F is in E.
The set of pairs B:A such that A:B is in E. If E is a recognizer, then it is converted to its identity transducer.
The set of pairs A:A such that A is in E.
The set of strings A such that A:B is in E.
The set of strings B such that A:B is in E.
The cleanup operator attempts to pack several transitions into one. For instance, assume there are two transitions from state p to q over the predicates p1 and p2 respectively. If p3 is a predicate which is true just in case either p1 or p2 is true, then we replace the two transitions by one transition over predicate p3.
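For instance, assuming the operator is written cleanup/1, the following might pack the two transitions arising from the union into a single transition over class([a..z]) (whether this actually happens depends on the predicate module):
fsa -r 'cleanup({class([a..m]),class([n..z])})'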
E is supposed to be a transducer. The result will be all pairs allowed by E and furthermore all pairs (x,y) such that (x',y) is in E and x' can be formed by substituting one symbol in x.
E is supposed to be a transducer. The result will be all pairs allowed by E and furthermore all pairs (x,y) such that (x',y) is in E and x' can be formed by deleting one symbol in x.
E is supposed to be a transducer. The result will be all pairs allowed by E and furthermore all pairs (x,y) such that (x',y) is in E and x' can be formed by deleting one symbol in x.
Denotes the string Atom, as a concatenation of its individual characters. For instance word(regular) is equivalent to [r,e,g,u,l,a,r].
Converts the automaton defined by Expr into an automaton using the pred_module declarations found in NewModule. Note that this is possible only in case the newer module is at least as expressive as the old one. For instance, you can convert an automaton with the fsa_frozen predicate module into an automaton with the fsa_preds predicate module, but not vice versa. The binary operator is for recognizers, the ternary operator for letter transducers.
Fa is already a finite automaton in the appropriate format.
This denotes the finite-automaton read from file X.
Equivalent to Expr, but sets a spy-point on the compilation of Expr. This implies that, for debug level 1 or higher, the CPU time required to compile Expr is reported, as well as the size of the resulting automaton.
Equivalent to Expr, but caches the result of compiling Expr if the flag regex_cache is set to *selective*. If that flag has the value off then there is no caching. If the value is on then the regular expression compiler caches every sub-computation.
A random automaton is constructed with the number of states specified in the first argument and the number of symbols specified in the second argument. The desired density of the automaton is given in the third argument, whereas the final argument is the jump density. E.g. random(20,10,0.1,0.1) will be an automaton with 20 states, 10 symbols, approximately 400 transitions and 40 jumps.