Saturday, January 06, 2007

Parser Combinators

Many excellent ideas from functional programming apply to object oriented programming (functions are objects too, you know). Parser combinators in particular are a technique with a long history in the functional programming community.


This is a blog post, so I’m not going to give a proper bibliography of the idea; suffice it to say that it goes back decades, and that Phil Wadler and Erik Meijer, among others, have done important work in this area. I myself was inspired to look into the topic by Martin Odersky’s Scala tutorial.


I’ve built a simple parser combinator library in Smalltalk, and it seems to work out very nicely. I thought it would be good to write about it here, coming at the topic from an OO perspective.


So, what are parser combinators exactly? The basic idea is to view the operators of BNF (or regular expressions for that matter) as methods that operate on objects representing productions of a grammar. Each such object is a parser that accepts the language specified by a particular production. The results of the method invocations are also such parsers. The operations are called combinators for rather esoteric technical reasons (and to intimidate unwashed illiterates wherever they might lurk).


To make this concrete, let’s look at a fairly standard rule for identifiers:


id -> letter (letter | digit)*



Using my combinator library in Smalltalk, one defines a subclass of CombinatorialParser, and inside it one writes


id := self letter, [(self letter | [self digit]) star]



Here, letter is a parser that accepts a single letter; digit is a parser that accepts a single digit. Both are obtained by invoking a method on self (this, for those unfortunates who don’t program in Smalltalk). The subexpression self letter | [self digit] invokes the method | on the parser that accepts a letter, with an argument that accepts a digit (ignore the square brackets for a moment). The result will be a parser that accepts either a letter or a digit.



tangent: No Virginia, Smalltalk does not have operator overloading. It simply allows method names using non-alphanumeric characters. These are always infix and all have the same fixed precedence. How can it be so simple? It’s called minimalism, and it’s not for everyone. Like Mies van der Rohe vs. Rococo.



The only detail I’ve glossed over is the brackets. The brackets denote closures, so that [self digit] is a closure that when applied, will yield a parser that accepts a digit. Why do I do this? Because grammar rules are often mutually recursive. If a production A is used in a production B and vice versa, one of them needs to be defined first (say, A), at which point the other (B) is not yet defined and yet must be referenced. Wrapping the reference to the other production in a closure delays its evaluation and gets round this problem. In a lazy language like Haskell this is not an issue - which is one key reason Haskell is very good at defining DSLs. However, Smalltalk’s closure syntax is very lightweight (lighter than lambdas in most functional languages!) so this is not a big deal. And Smalltalk’s infix binary methods and postfix unary methods give a very pleasing result overall.
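
To make the mutual recursion concrete, here is a sketch of a pair of productions that refer to each other; the token parsers lparen, rparen and plus are hypothetical stand-ins (real ones would be defined with the tokenizing machinery described below):


"term must be defined first; its forward reference to expr is wrapped in a
 block, so the reference is not evaluated until expr has been defined"
term := (self lparen, [self expr], [self rparen]) | [self id].
expr := self term, [(self plus, [self term]) star].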


Returning to the identifier production, we then invoke the method star on the result of the alternation



(self letter | [self digit]) star



which yields a parser that accepts zero or more occurrences of either a letter or a digit.



In a syntax more people understand, this would look something like:


(1) letter().or(new DelayedParser() { public Parser value() { return digit(); } }).star()


If Java had closures, it might look like this:


(2) letter().or({=> digit()}).star()


This is better, but either way, the goal of writing an executable grammar tends to get lost in the noise. Nevertheless, it seems most people prefer (1), and the vast majority of the rest prefer (2) over the “bizarre” Smalltalk syntax. Who knows what darkness lies in the hearts of men.



We pass this parser as an argument to the method “,”, which we invoke on letter. The “,” method is the sequencing combinator (which is implicit in BNF). It returns a parser that first accepts the language of the receiver (target, in Javanese) and then accepts the language of its argument. In this case, this means the result accepts a single letter, followed by zero or more occurrences of either a letter or a digit, just as we’d expect. Finally, we assign this result to id, which will now represent the production for identifiers. Other rules can use it by invoking its accessor (i.e., self id).
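
By the way, to actually use a parser one needs some way of running it over input. The details don’t matter here, so the following is just an illustrative sketch; the selector parse: and the class name MyGrammar are placeholders for the sake of the example, not a commitment to an API:


"a sketch: assume parse: runs the receiver over a read stream and answers the parse result"
grammar := MyGrammar new. "our subclass of CombinatorialParser"
grammar id parse: 'x10' readStream.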


The example also shows that this approach to parsing covers both lexers and parsers.
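
For instance, a rule for natural number literals is just another production, built from the very same combinators used above:


"one or more digits: a digit followed by zero or more digits"
number := self digit, [self digit star].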


The lack of distinction between lexing and parsing is a bit of a problem. Traditionally, we rely on a lexer to tokenize the input. As it does that, it does away with whitespace (ignoring languages that make whitespace significant) and comments. This is easily dealt with by defining a new operator, tokenFor:, that takes a parser p and returns a new parser that skips any leading whitespace and comments and then accepts whatever p accepts. This parser can also attach start and end source indices to the result, which is very handy when integrating a parser into an IDE. From the point of view of higher level grammar productions, it’s useful to refer to a production, identifier, that produces such tokenized results:



identifier := self tokenFor: self id.
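
The other tokens follow the same pattern. Assuming a primitive char: that accepts one given character (the selector is illustrative, purely for the sake of the example), the hat token used below could be defined as:


"char: here is a hypothetical single-character parser"
hat := self tokenFor: (self char: $^).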



We would naturally do this for all the tokens in our language, and then define the syntactical grammar without concern for whitespace or comments, just as we would in traditional BNF. As an example, here’s the rule for the return statement in Smalltalk:



returnStatement := self hat, [self expression].



Ok, so you can define a parser pretty much by writing down the grammar. However, just accepting a language isn’t all that useful. Typically, you need to produce an AST as a result. To address this, we introduce a new operator, wrapper:. The result of this operator is a parser that accepts the same language as the receiver. However, the result it produces from parsing differs. Instead of returning the tokens parsed, it processes these tokens using a closure which it takes as its sole parameter. The closure accepts the output of the parse as input (one argument for each element of the sequence) and yields some result - typically an abstract syntax tree.


returnStatement := self hat, [self expression]
     wrapper: [:r :e | ReturnStatAST new expr: e; start: r start; end: e end].



The grammar production is still clearly separated, with the AST generation on a separate line. However, I would prefer to leave the grammar pristine. That’s easy - put all the AST generation code in a subclass, where the grammar production accessors are overridden, like so:


returnStatement
    ^super returnStatement
        wrapper: [:r :e | ReturnStatAST new expr: e; start: r start; end: e end].



This is nice, for example, if you want to parse the same language and feed it to different back ends that each accept their own AST; or if you need to use the parser for a different purpose, like syntax coloring, but want to share the grammar. Another nice thing about this approach is that one can factor out language extensions very cleanly (especially if you can use mixins). It's one of the benefits of embedding a DSL in a general purpose language - your DSL inherits all the features of the host language. In this case, it inherits inheritance.
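
For instance, a dialect that adds a new expression form can be a small subclass of the grammar that overrides a single production accessor (lambdaExpression is purely hypothetical here):


expression
    "extend the base expression grammar with one new alternative"
    ^super expression | [self lambdaExpression]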


So what’s not to like? Well, one could imagine more efficient approaches to parsing. In Smalltalk, one usually parses one method at a time, and methods tend to be short. Even though I'm using Squeak, which isn't all that fast, and even though a method is parsed on every keystroke to do syntax coloring, performance is perfectly acceptable. For large methods, the delay can be noticeable, though. However, there are ways to tune things to improve performance. We have our methods ...


Another problem is left recursion, as in:



expr -> expr + expr | expr * expr | id



In such cases one has to refactor the grammar. I don’t see this as a big deal, and in principle, the parser could refactor itself dynamically to solve the problem; this is one of the things that one can do relatively easily in Smalltalk but that tends to be harder in other languages.
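
For the rule above, the standard refactoring replaces the left recursion with iteration:


expr -> id ((+ | *) id)*


which accepts the same strings. In combinator form this might read as follows, where plus and times are hypothetical token parsers for + and *:


"a sketch of the refactored, iteration-based production"
expr := self identifier, [((self plus | [self times]), [self identifier]) star].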


In summary, parser combinators are really cool. They work out beautifully in Smalltalk. I had a blast implementing them and using them. Most important, they are a great example of how object orientation and functional programming synergize. If you want to learn more, there’s a lot of literature out there, mainly in the Haskell world.

17 comments:

  1. Happy New Year, Gilad!

    I encountered JParsec recently. It's a Java library that also does parser combinators. It also supports improved notations in Ruby and Groovy.

    The concept of parser combinators is pretty interesting and is certainly a refreshing change compared with the usual parser generators we've used so far.

  2. Hi Guillaume,

    Happy new year to you too!

    I wasn't aware that someone had done this in Java. That's cool - and looks a lot like what I expected. For example, there is indeed a class ParserEval that is analogous to the class DelayedParser I gave in my example.

    It would be really nice to see how that framework works out with closures.

    Of course, in Java you have to statically typecheck all this, which can lead to complications. I experimented with typechecking my framework in Strongtalk. One difficulty is dealing with variable arity (which is why there are a host of Map interfaces in JParsec, but only one wrapper: combinator in my framework).

    It seems there's a Ruby Parsec as well. That probably works out way better than in Java. And Ruby desperately needs a parser with that power to cope with its own syntax.

    This looks pretty cool. It seems like it would be very debuggable, since the production rules are Smalltalk and not a separate grammar language like SmaCC or the AT ParserCompiler (in VisualWorks). One of the big problems I had was not being able to use the debugger directly on the production rules. Is your code available somewhere?

  4. Mike,

    It is true that since the grammar is ordinary Smalltalk code, you can use the debugger and what-have-you. Bugs are errors in the grammar/specification. Some have to do with the limitations of the system (like when you accidentally have left recursion) but most are true specification bugs.

    I'm afraid the code is not available at this time. One of the few disadvantages of getting paid to do this kind of stuff is that you can't just give it away. Big companies have lawyers, policies etc. and I don't know when/if I'll be able to release any of this. I just wanted to draw people's attention to the idea. It really isn't that hard to do.

  5. Gilad,

    Thanks! This was a most interesting post.

    I wonder if both Parsec and the Smalltalk parser library you describe are exactly equivalent, in terms of parsing power and general structure, to what Henry Baker describes as META in this paper. It seems to be the case to me, but I may easily be missing something important, and would appreciate your opinion.

    E.g. the identifier parser from your post would look like this in a small parsing library I wrote for my own use in Lisp (after Baker, but with syntax that appealed to me and with the automatic building of the AST):

    (defun parse-id ()
      (parse (sequence
               letter
               (maybe-some (or letter digit)))))

  6. Anatoly,

    Thanks for the pointer to Henry Baker's paper. He is always fun to read. I'm sure his code is much more efficient - I haven't done any optimizations yet.

    I don't expect any of these schemes to be exactly equivalent to each other, despite the similarities. The literature discusses the variations. For example, monadic parser combinators can handle context sensitive languages, while "arrow" ones cannot.

    Since I'm working in an imperative language, one can use shared state to influence the behavior of the parsers, and get context sensitivity. However, I haven't done this - the combinators are purely functional and I like that. If I have a need, I think I can easily go beyond that.

  7. In my own parser combinator library, I added a custom fixpoint combinator for parsers, that checks to make sure that it has made progress before calling a parser recursively. If it hasn't made progress, it fails instead. This means you can write left-recursive grammars and the parser won't go into an infinite loop.

  8. Neel,

    Thanks for the suggestion. It's a cool idea, and maybe the nicest one of several possible solutions. What kind of library have you got (i.e., what language, any publications etc.)?

    Comments: if the fix combinator has to be invoked explicitly, there is always the risk you'll forget to use it. You'll discover this pretty quickly in most cases of course. Just a small nuisance.

    It also makes parsers stateful. Not necessarily a big deal. Overall, it's an attractive way to go.

    What about handling left recursion by running all branches in an OR expression concurrently? You would terminate the other branches when the first one completes. This requires additional support from the stream (you need to split the stream to allow concurrent access, with each process having an independent position in the stream). I'm working on this angle for our parser combinator here at work. I've also got a concurrency framework that makes this kind of branching much more natural, which I suppose is why I thought of this approach. Unfortunately, I too have work related issues with distributing the code.

  10. Stephen,

    Thanks for the suggestion.
    The idea of running alternation in parallel occurs in the literature. However, using that to deal with left recursion is a twist I haven't seen (but I am no expert on parser combinators). My combinators are purely functional (so far) and so this is a possibility. As a practical matter, it depends how much real parallelism you have at your disposal, I suppose. I wouldn't use this technique in my current setting, but maybe in the future.

  11. This comment has been removed by the author.

  12. To see a spectacular demonstration of parser combinators (in the form of XML tree transformers), check out Harmony.

  13. For a parser combinator library with left recursion and fixpoint please see this code sample.

  14. A nice trick to parse left-recursive expression grammars is to parse them using "right-recursion", and then do a left fold on the resulting expression list, using the operator (addition, multiplication, etc) semantics.

    In this way you effectively recover left-recursion in your grammar, without having to directly use it in your combinators!

    This is encoded by the pChainL combinator in Doaitse Swierstra's parser combinators. For more info take a look at page 11 of the manual. The home page of these combinators is here.

    It's nice to see that parser combinators in Smalltalk are rather concise, even taking into account the delaying of evaluation.

  15. I found this entry very interesting, more for the way you describe Smalltalk syntax than for parser combinators.

    I'm going to start studying ST

    thanks! :)

  16. Hi, unfortunately I have only just found this site, and am a bit behind. I apologize if we missed referring to some of your work in our paper that we recently presented at PADL '08. The paper describes our parser combinators which accommodate left recursive ambiguous grammars in polynomial time (using monadic memoization) and which generate polynomial-sized representations of the possibly exponential number of parse trees for highly ambiguous input. The combinators are implemented in Haskell.

    More can be found at our X-SAIGA research project site:

    http://cs.uwindsor.ca/~hafiz/proHome.html

    Regards,

    Richard Frost

  17. Hi Richard,

    Thanks for getting in touch; I was not aware of your work either. Since your response refers to my earliest public post on the subject, I don't know if you noticed the later post:

    http://gbracha.blogspot.com/2007/09/executable-grammars.html

    or the paper (listed at the top of http://bracha.org/Site/Papers.html). We're very happy with the approach we've taken, but have done little to optimize its performance so far. It's on my agenda; I hope to return to this topic, and I'll be looking at your work as well. Thanks for the pointer!
