A place to be (re)educated in Newspeak

Tuesday, September 25, 2007

Executable Grammars

Way back in January, I described a parser combinator library I’d built in Smalltalk. Since then, we’ve moved from Smalltalk to Newspeak, and refined the library in interesting ways.

The original parser combinator library has been a great success, but grammars built with it were still polluted by two solution-space artifacts. One is the need to use self sends, as in

id := self letter, [(self letter | [self digit]) star]

In Newspeak, self sends can be implicit, and so this problem goes away. We could write

id:: letter, [(letter | [digit]) star]

The other problem is the use of blocks. We’d much rather have written

id:: letter, (letter | digit) star

The solution is to have the framework pre-initialize all slots representing productions with a forward reference parser that acts as a stand-in for the real parser that will be computed later.

When the tree of parsers representing the grammar has been computed, it will contain stand-ins exactly in those places where forward references occurred due to mutual recursion. When such a parser gets invoked, it looks up the real parser (now stored in the corresponding slot) and forwards all parsing requests to it.

Our parsers are now entirely unpolluted by solution-space boilerplate, so I feel justified in calling them executable grammars. They really are executable specifications, that can be shared among all the tools that need access to a language’s grammar.

Below is a small but complete grammar in Newspeak:

class ExampleGrammar1 = ExecutableGrammar (
|
digit = self charBetween: $0 and: $9.
letter = (self charBetween: $a and: $z) | (self charBetween: $A and: $Z).
id = letter, (letter | digit) star.
identifier = tokenFor: id.
hat = tokenFromChar: $^.
expression = identifier.
returnStatement = hat, expression.
|
) ()


If you want to understand all the details, check out this paper; if you can, you might also look at my JAOO 2007 talk, which also speculates on how we can make things look even nicer, e.g.:

class ExampleGrammar3 = ExecutableGrammar (
|
digit = $0 - $9.
letter = ($a - $z) | ($A - $Z).
id = letter, (letter | digit) star.
identifier = tokenFor: id.
expression = identifier.
returnStatement = $^, expression.
|
)()

4 comments:

Lex said...

By-name parameters would help you. Then, the argument to | would get turned into a block automatically.

By-name parameters are easy to efficiently implement for statically typed languages, and Scala includes them. I do not know how efficient they can be for dynamically typed languages. The issue is that unless you are careful, normal method calls can get more expensive even if they don't use by-name parameters.

Late Binding of Execution Order

Gilad Bracha said...

Hi Lex,

Thanks for your comment. I am of course well aware of Scala's technique of using the type system to coerce arguments into closures as needed.

Without a mandatory static type system, it is hard to do this. As you know, I don't favor mandatory type systems, and Newspeak doesn't (and won't) have one. Eventually, it wll have a pluggable type system, but that wouldn't resolve this issue.

The technique I described solves the problem very nicely, so it's no longer an issue. We no longer need to use blocks instead of parsers. In addition, we can factor our grammars out nicely, keeping parsing and processing separately. This is really hard to do with a mandatory static type system, which is why no one did it before.

James said...

Nice work once again Gilad.

But, you don't really want character minus "-"
to generate a range, do you?

I mean it's odd of $0 - $9 means something so different to 0 - 9.

So please at least consider say -- for range, so you can write $0 -- $9 to get a character range, and then 0 -- 9 for an integer range...

Gilad Bracha said...

James,

Well, $0 - $9 doesn't bother me at all. Context is key here: when you use regular expressions, do you find the use of "-" to denote a range a problem?
The same would apply in this case. It would only work in the context of a module which has extended Character in this way.