Tuesday, September 25, 2007

Executable Grammars

Way back in January, I described a parser combinator library I’d built in Smalltalk. Since then, we’ve moved from Smalltalk to Newspeak, and refined the library in interesting ways.

The original parser combinator library has been a great success, but grammars built with it were still polluted by two solution-space artifacts. One is the need to use self sends, as in

id := self letter, [(self letter | [self digit]) star]

In Newspeak, self sends can be implicit, and so this problem goes away. We could write

id:: letter, [(letter | [digit]) star]

The other problem is the use of blocks. We’d much rather have written

id:: letter, (letter | digit) star

The solution is to have the framework pre-initialize all slots representing productions with a forward reference parser that acts as a stand-in for the real parser that will be computed later.

When the tree of parsers representing the grammar has been computed, it will contain stand-ins exactly in those places where forward references occurred due to mutual recursion. When such a parser gets invoked, it looks up the real parser (now stored in the corresponding slot) and forwards all parsing requests to it.

Our parsers are now entirely unpolluted by solution-space boilerplate, so I feel justified in calling them executable grammars. They really are executable specifications, that can be shared among all the tools that need access to a language’s grammar.

Below is a small but complete grammar in Newspeak:

class ExampleGrammar1 = ExecutableGrammar (
|
digit = self charBetween: $0 and: $9.
letter = (self charBetween: $a and: $z) | (self charBetween: $A and: $Z).
id = letter, (letter | digit) star.
identifier = tokenFor: id.
hat = tokenFromChar: $^.
expression = identifier.
returnStatement = hat, expression.
|
) ()


If you want to understand all the details, check out this paper; if you can, you might also look at my JAOO 2007 talk, which also speculates on how we can make things look even nicer, e.g.:

class ExampleGrammar3 = ExecutableGrammar (
|
digit = $0 - $9.
letter = ($a - $z) | ($A - $Z).
id = letter, (letter | digit) star.
identifier = tokenFor: id.
expression = identifier.
returnStatement = $^, expression.
|
)()