This page details the purpose, aims and current status of my suite of tools for investigating Recursive Adaptive Grammars (or RAGs). Visit the sourceforge summary page for this project.
Currently, most programming languages are described using context-free grammars (CFGs). However, the CFG formalism is not powerful enough to express many features that are desirable in a useful and practical language. Therefore, the compilation process must necessarily include extensions to this formalism, and comprises several stages of analysis (lexical, syntactic, semantic) before the translation to object code can be carried out.
The recursive adaptive grammar (RAG) formalism ^{}[1] is one which intuitively extends the familiar concept of context-free grammars to naturally incorporate context-sensitivity and make them Turing-powerful. Context-dependent language features can easily be expressed using RAGs; this work presents an algorithm for parsing languages described by [a subset of] recursive adaptive grammars, showing how this naturally combines syntactic and semantic analysis. Furthermore, built into the RAG formalism is the identity of the domains of syntatic, metasyntactic and sematic values and the association of semantic values with nodes of the parse tree; it is demonstrated how this provides a mechanism for translation. Finally, the one-step computational mechanism of RAGs allows all these stages to be combined into a single process.
The broad aims of this project are to develop a method of parsing that is based on RAGs, and to investigate the capabilities and usefulness of RAGs in defining and implementing programming languages. Some more concrete and (hopefully) achievable goals are listed below:
Recursive adaptive grammars (RAGs), like context-free grammars (CFGs), are a formalism that allow you to specify languages. Unlike CFGs though, they are capable of describing context-dependent features, i.e. constraints on a section of a program which depend upon what is present (or not) in another part of the program. Such features are essential for any programming language which is to be of use in real-world situations.
As an example, consider the constraint (present in many modern programming languages) that a variable must be declared before it is assigned a value. A code snippet using java illustrates this:
public void foo()
{
int x;
x = 5;
System.out.println(x);
}
The above code is valid java; the code below is not, because the variable x is not declared:
public void foo()
{
x = 5;
System.out.println(x);
}
This declaration/assignment feature is context-dependent because the validity of the assignment statement x = 5;
depends upon the presence of a declaration statement int x;
occuring before it at some point in the program.
The following gives a short summary of the key features of RAGs appropriate to understanding the scope and depth of this project. For a complete description and formal treatment, please see John Shutt’s Master’s Thesis^{}[1].
RAGs contain certain elements that are analogous to those of context-free grammars, and the similarity in structure can easily be seen. However, there are fundamental, and essential, differences. The combined syntactic, meta-syntactic and semantic domains of recursive adaptive grammar, G, are represented by elements called Answers whose domain is denoted A_{G}. These are analogous to the symbols of a CFG. Like the symbols of a context-free grammar, answers may be terminal, or non-terminal. The domain of terminal answers, T_{G}, is a subset of A_{G}. In string RAGs (the subset of RAGs around which this project is built), T_{G} is simply the domain of strings over a given alphabet, plus the empty string, λ. Unlike in CFGs, however, any answer may represent either syntax, or play a meta-syntactic role (in CFGs only terminals may represent syntax, and only non-terminals play meta-syntactic roles). Another fundamental difference between RAG answers and the symbols of context-free grammars is that, whereas CFG symbols are simply that, RAG answers are constructed from operators which can take arguments (formally, A_{G} is a one-sorted algebra), and therein lies their power. Essentially, this allows them to act as functions which map answers representing syntax onto answers that represent semantic values. Operators of zero arity are called constant, and those with an arity greater than or equal to one are called non-constant operators.
The second component of RAGs are rules. Again, these are equivalent to CFG rules in that they facilitate the derivation process: that of transforming, or mapping, a metasyntatic value into a syntactic one (parsing is the inverse of this process).
An unbound rule is a construction of the form:
⟨v_{0},e_{0}⟩ → t_{0} ⟨e_{1},v_{1}⟩ t_{1} … ⟨e_{n-1},v_{n-1}⟩ t_{n-1} ⟨e_{n},v_{n}⟩ t_{n}n ≥ 0
The v_{k} are distinct variables, which can take on answers as values. The t_{k} are answers (usually, but not necessarily terminal). The e_{k} are polynomials in the variables v_{0},…v_{n}, i.e. expressions denoting answers that may contain variables. The left hand component of each of the ordered pairs, ⟨v_{0},e_{0}⟩ and all ⟨e_{n},v_{n}⟩, plays a meta-syntactic role, and the right hand component represents the semantic value that is attached to that meta-syntactic element. Both the left and right sides of the rule can be called Configurations, and belong to a domain called the configuration algebra C_{G}.
In the formal machinery of RAGs, the domain of [unbound] rules is related to the domain of answers by means of a rule function. In practice, this means that each rule form is associated with a particular answer. In the notation laid down by Shutt^{}[1] this association is written as in the following example:
S:
⟨v_{0},λ⟩ → λ
⟨v_{0},λ⟩ → a ⟨v_{0},v_{1}⟩ b
fig. 1
Figure 1. defines a completely self-contained RAG, G, where T_{G} = {a, b}*, the set of non-terminal operators is {S}, and the start symbol is S. This grammar generates the language L = {a^{n}b^{n}, n ≥ 0} and the semantic value associated with each terminal string is λ: G is said to predicate L.
When rule forms are associated with an answer in this manner, the first variable of each rule, v_{0}, is bound to the associated answer (in this case S) creating partially bound rules:
(1)⟨S,λ⟩ → λ
(2)⟨S,λ⟩ → a ⟨S,v_{1}⟩ b
A fully bound rule is one in which all the variables have been replaced by, or bound to answers. Technically, variables can be bound to any answer in A_{G} whatsoever, however only certain bindings will lead to the derivation of a terminal string. For example, in rule (2) above, we could bind the variable v_{1} to the answer S, creating the bound rule:
⟨S,λ⟩ → a ⟨S,S⟩ b
However, if this rule were to be used in a derivation, it would constitue a dead end since the pair ⟨S,S⟩ cannot appear on the left hand side of any rule. In this example, the only useful binding for v_{1} is λ, yielding the bound rule:
⟨S,λ⟩ → a ⟨S,λ⟩ b
The domain of bound rules of a grammar G, is denoted β(G). By convention, rule associations are not specified for terminals and so any symbols appearing in a RAG definition which do not have a rule set specified are terminal operators. In reality, every RAG contains the implicit rule form, quantified over all terminals:
∀ t ∈ T_{G} t: ⟨v_{0},λ⟩ → t
A third, and key, ingredient of the RAG formalism which has no equivalent in CFGs is the Query operator, written as ‘a : b’. This is a binary operator which denotes the semantic value associated by meta-syntactic value a with syntactic value b. What this means is that, if there is some derivation which transforms a pair ⟨a, v_{0}⟩ into the terminal string b, then a : b denotes the semantic value, v_{0}, that is synthesized during that derivation. It is this operator that endows RAGs with their ‘recursive’ nature since, on encoutering a query term, the parsing algorithm will recursively call itself to evaluate the term.
The derivation relation is the final piece of the RAG model. It is herein that the actual computational power of a recursive adaptive grammar lies. It is similar in nature to the derivation relation of CFGs, which acts on strings of CFG symbols to transform one into another according to the production rules of the grammar. However, the derivation relation of RAGs is extended to handle the newly introduced Query operator.
The derivation step relation ⇒ is a binary relation on Configurations (i.e. expressions made up of pairs, queries and answers) that specifies which of these configurations may follow on from which others in a derivation. For the purposes of this summary, the derivation step relation can be thought of as being constructed from three parts. The first is based upon the rules of the RAG to which it is being applied, and stipulates that if there is a [bound] rule that transforms a pair into some configuration c, then under the derivation step relation that pair evaluates to c. More concisely:
If (⟨a_{1},a_{2}⟩ → c) ∈ β(G), then (⟨a_{1},a_{2}⟩, c) ∈ ⇒ or ⟨a_{1},a_{2}⟩ ⇒ c.
The second part deals with queries, and simply formalises the nature of the query operator that was described above: If, under the grammar G, the meta-syntactic element of a query can generate the answer that is the syntactic element of that query, then under the derivation step relation that query will evaluate to the semantic value synthesized. More concisely:
If ⟨a_{1},a_{3}⟩ ⇒ ... ⇒ a_{2}, then (a_{1} : a_{2}) ⇒ a_{3}.
This definition of the derivation relation clearly illustrates the recursive nature that the query operator confers upon the process of expression evaluation in RAGs: the behaviour of the derivation process as applied to queries is defined in terms of itself!
The third part of the derivation relation construction is a formality that allows it to be applied to elements of a configuration individually. Remember that the domains to which the elements of RAG expressions belong are one-sorted algebras, constructed from operators. The final axiom that the derivation relation must satisfy stipulates that the derivation step relation can act on each argument of an operator individually:
Suppose σ is an operator of arity n ≥ 1, c_{1}, ... ,c_{n} are configurations and c_{k} ⇒ c′ for some 1 ≤ k ≤ n. Let c′_{j} = c′ if j = k, and c_{j} otherwise. Then σ(c_{1}, ... ,c_{n}) ⇒ σ(c′_{1}, ... ,c′_{n})
By far the most common operator that the above construction will be applied to is the [binary] concatenation operator,⋅. By way of example, consider the RAG rule form that was defined above. The right hand side of the rule was said to be a configuration, but it is one comprised of the concatenation of other configurations (alternate answers and pairs). The above axiom means that in such a configuration, the derivation can proceed one pair or query at a time.
Consider the following RAG, where Z is any alphabet, and the starting symbol is S:
S:
⟨v_{0},v_{3}⟩ →
⟨W,v_{1}⟩
⟨v_{1},v_{2}⟩
⟨v_{1},v_{3}⟩
W:
⟨v_{0},λ⟩ → λ
⟨v_{0},v_{1}v_{2}⟩ →
⟨C,v_{1}⟩
⟨v_{0},v_{1}⟩
C:
∀z ∈ Z ⟨v_{0},z⟩ → z
Nonterminal constant C generates any single letter in the alphabet, Z, and associates it with a semantic value which is the letter generated. Building on this, nonterminal constant W generates any string w ∈ Z*, and associates with a semantic value that is the string generated. Finally, at the top level, nonterminal constant S uses W to generate an arbitrary string in Z* (this is the first pair on the right-hand side of the rule form belonging to S). The generated string is then bound to v_{1}. This binding allows the second and third pairs of the rule form belonging to S to duplicate the string. These pairs also assign the semantic value λ to these duplicated strings (thanks to the implicit rule form for terminals mentioned above). The remaining variables v_{2} and v_{2} are therefore bound to λ. The overall effect is that this RAG generates the language {www | w ∈ Z*}.
For simplicity’s sake, let us use Z = {a,b,c}. Then, a possible derivation sequence might proceed as follows:
⟨S, λ⟩
⇒ ⟨W, abc⟩⟨abc, λ⟩⟨abc, λ⟩
⇒ ⟨C, a⟩⟨W, bc⟩⟨abc, λ⟩⟨abc, λ⟩
⇒ ⟨C, a⟩⟨C, b⟩⟨W, c⟩⟨abc, λ⟩⟨abc, λ⟩
⇒ ⟨C, a⟩⟨C, b⟩⟨C, c⟩⟨W, λ⟩⟨abc, λ⟩⟨abc, λ⟩
⇒ a⟨C, b⟩⟨C, c⟩⟨W, λ⟩⟨abc, λ⟩⟨abc, λ⟩
⇒ ab⟨C, c⟩⟨W, λ⟩⟨abc, λ⟩⟨abc, λ⟩
⇒ abc⟨W, λ⟩⟨abc, λ⟩⟨abc, λ⟩
⇒ abcλ⟨abc, λ⟩⟨abc, λ⟩
⇒ abcabc⟨abc, λ⟩
⇒ abcabcabc
This string, and indeed the entire language L = {www | w ∈ Z*}, cannot be generated by any context-free grammar: immediately, RAGs have proved themselves a more powerful formalism than CFGs, while at the same time retaining the same intuitive and conceptually simple structure. In this case, the result has been facilitated by the RAG's ability to take a semantic value synthesized by one section of the derivation and use it as meta-syntax to drive the derivation in a different section. This example has been taken from Shutt’s Thesis^{}[1], but the second rule for the non-terminal W has been rearranged so that the grammar is not left-recursive, as the parsing algorithm cannot handle such RAGs.
Note that this example is a rather simple one, and one that does not demonstrate the use of the query operator at all. Such an example is too extensive for this summary. To see an example of the query operator in use, please refer to the Thesis^{[1]} §4.3.
This section outlines the algorithm that has been devised to perform RAG-based parsing. This algorithm is the original work of the author and, to the best of the author’s knowledge, is a novel piece of work which has not been previously described in the literature.
The heart of the RAG parsing algorithm is a pruning breadth-first search of the tree of possible derivations for a given grammar. The algorithm maintains a working set of configurations which match the input currently consumed. This set is initially populated with the start symbol of the particular grammar which is begin using to parse the input. At each iteration, the algorithm uses the grammar to generate possible derivations from the configurations in the working set, until each configuration begins with a terminal. The terminal at the head of each candidate configuration is compared with the next terminal in the input, and any configurations which don't match are discarded. Ones which do survive to the next iteration. At the end of each iteration, the head terminal of each surviving configuration is consumed, along with the current input token.
An important aspect of the parsing algorithm is the way in which it handles variables, and the binding of these variables to answers. In the example derivation, each step of the derivation used fully bound rules from the outset; in effect, the resulting terminal string had been ‘chosen’ in the first derivation step, when we chose to bind the variable v_{1} to the answer abc. The parsing algorithm, however, defers binding variables until the last possible moment. This is accomplished by a forking process, which creates a new candidate configuration wherever a variable may be bound to more than one answer. To use the RAG example above, if the parsing algorithm were to encounter the pair ⟨W, v_{1}⟩, it would first look up which rule forms can be applied. It would find that there are two possibilities, and so fork this configuration into two new ones. In one, the variable is bound to λ, and in the other it is bound to the polynomial v_{1}v_{2}, the variables of which will be bound to answers at some later point in the parsing process. Note that the variable v_{1} in the original pair, and the variable v_{1} in the new polynomial do not denote the same variable, they are distinct. The variables have been written as they appear in the rule forms in order that the reader can easily follow the derivation process; in the parsing algorithm however, when a derivation step is applied, new variables are created.
The algorithm is presented below. There are also a few definitions that will be useful:
The algorithm works as follows:
Firstly, (1) initialises the candidate set of sentential forms that can potentially generate the input.
The starting point is a pair consisting of the meta-syntax of the input query, and a new variable, v, which is also the result of
the sentential form.
Next, in steps (2) - (7), the algorithm enters its main processing loop,
which is repeated until the input is consumed or until there are no candidate sentential forms left. At the start of each
iteration, empty sentential forms are removed from the candidate set, as they cannot match the remaning input. Next, in steps
(3) and (4), an inner loop processes the candidate set until an answer is
generated at the head of each sentential form. In (3), the fork procedure
examines each sentential form in the candidate set, applying the appropriate rules of the grammar to the head element of each form.
If more than one rule can be applied to the head element of any sentential form, then that form is copied (or
cloned) the appropriated number of times, and each of the applicable rule forms applied to a separate
one. The application of a rule form involves replacing the pair element in question with the configuration of the matching rule
form and binding the variable of the pair being transformed to the result expression of that rule form. It is also during the
fork procedure that the parsing algorithm calls itself recursively: if it encounters a query at the
head of any given sentential form, it will parse the query, and apply the results to the sentential form in the same way that rule
forms are applied. After each round of rule applications, any λ terminals that have been generated at the head of a
sentential form are removed, since in the terminal algebra we have the identities λ ⋅ t = t and t ⋅ λ = t.
In other words, we can effectively ignore any λ's when matching input.
When no more rules can be applied to any of the sentential forms in the candidate set, the algorithm is in a position to compare the candidates with the input string. Any sentential forms whose head element does not match the next token of the input is discarded in (5). This leaves in the candidate set only sentential forms that continue to match the input. Step (6) advances the sentential forms in the candidate set by removing the head element of each form, and (7) removes the left-most token from the input string, since it has now been processed.
When all the input has been consumed (or the candidate set is empty), the algorithm exits the main processing loop. At this point, the algorithm processes the candidate set one more time to remove trailing λ's from any of the surviving sentential forms. Finally, step (10) discards any sentential forms from the candidate set that are not empty. This is because they cannot match the input, as they must generate further ouput. The algorithm then returns a set containing the results of each surviving sentential form. Any variables in the results are guaranteed to be bound to an answer at this point, since each iteration of the processing loop ensures that an answer is generated before it moves onto the next. Note that it is entirely possible for the candidate set, and therefore the result set to be empty. Indeed, this may be an intended function of elements of the grammar (see the example outlined in Shutt’s thesis^{[1]} §4.3).
The various helper functions used by the parse algorithm are defined below:
The head-set function constructs a set containing all the head elements of the sentential forms in the set which is passed as its argument.
Similarly, the result-set function constructs a set containing all the results of the sentential forms in the set which is passed as its argument.
The advance() function simply removes the head of each of the sentential forms in the set which is passed as its argument. The λ-advance() function is a little more specialised in that it will remove the head element of each sentential form if that element is the terminal λ; if it is not, the sentential form is left alone.
The set of partially bound rule forms belonging to a grammar G for a given answer a is denoted by partial-β_{G}(a). A partially bound rule is a rule form in which the initial variable v_{0} has been bound to the rule’s associated answer (see §3.i).
If p is a pair ⟨e, v⟩, where e is a polynomial expression and v is a variable, then expr(p) = e and val(p) = v.
A rule is essentially a sentential form, but one that is associated with with a particular answer. If r is a rule of the form ⟨v_{0},e_{0}⟩ → t_{0} ⟨e_{1},v_{1}⟩ t_{1} … ⟨e_{n-1},v_{n-1}⟩ t_{n-1} ⟨e_{n},v_{n}⟩ t_{n}, then its equivalent is a sentential form [t_{0} ⟨e_{1},v_{1}⟩ t_{1} … ⟨e_{n-1},v_{n-1}⟩ t_{n-1} ⟨e_{n},v_{n}⟩ t_{n}, e_{0}] with the variable v_{0} bound to the rule’s associated answer. The set of variables referenced in a sentential form or rule is denoted by vars(S). Note also that the variable set of a rule form can contain a special collection of variables which correspond to the arguments of the [non-constant] answer with which it is associated. This argument set is denoted args(S), and is a proper subset of vars(S).
As introduced above, the fork function applies all matching rules to the sentential forms in the set passed as its argument. The newvar() function generates a new, previously un-referenced variable. The clone() function creates a copy of a sentential form, in which the variable set of the original form is mapped onto a set of newly generated variables.
The normalise function used in the definition above makes sure that the head of the sentential form currently being forked only contains a single single term as its expression. This is required since all RAG rule forms can only be associated with a single answer. The exact behaviour of RAGs in regards to the concatenation operator is discussed in §6.3.2 of Shutt’s thesis^{[1]}, but for practical purposes it amounts to the following rule form:
∀ a_{1},a_{2} ∈ A_{G} a_{1}⋅a_{2}: ⟨v_{0}, v_{1}v_{2}⟩ → ⟨a_{1}, v_{1}⟩ ⟨a_{2}, v_{2}⟩
which is implicit in every RAG.
Since the input to the algorithm is a query, an input program can be prcocessed by first constructing a query (S : input) from the input string and the start symbol S_{G} of a grammar G, and then passing it to the parsing algorithm. This gives us the auxiliary parsing function for G:
note: since the parsing algorithm searches through the leftmost derivations of its candidate sentential forms until it generates an answer, it can only carry out parsing on grammars that do not contain left-recursion in their rule forms. Left-recursive rule forms would cause the parsing algorithm to loop indefinitely as it tries to apply the same rule form over and over again. The grammar must also be left-to-right in order to ensure that all variables referenced in a given expression have been bound to answers before the expression is used to drive rule applications.
Quinn Tyler Jackson has presented a similarly powerful formalism, called the §-calculus, along with a language accepting automaton called a PDA-T, which can accept languages generated by this calculus. He has shown that this formalism provides practical benefits in the areas of context-sensitive and semantic parsing. His 2002 paper, "Some Theoretical and Practical Results in Context-Sensitive and Adaptive Parsing", can be found at the following URL: http://www.iscid.org/papers/Jackson_AdaptiveParsing_093002.pdf
Wikipedia page on adaptive grammars - http://en.wikipedia.org/wiki/Adaptive_grammar
Wikipedia page on extensible programming - http://en.wikipedia.org/wiki/Extensible_programming_language
[1] Shutt, John N., Recursive Adaptable Grammars, Master’s Thesis, Worcester Polytechnic Institute, 1993. (16 December 2003 emended revision.) Online version: http://web.cs.wpi.edu/~jshutt/thesis/top.html