Wednesday, December 09, 2009

Chased by One’s Own Tail

Functional programming languages generally allow you to program without resorting to explicit loops; you can use recursion instead. It’s more elegant and higher-level. In particular, it lets one code certain nonterminating computations (such as state machines) as a set of mutually recursive calls.
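
For instance, here is a minimal sketch in Java (the class and method names are mine, purely illustrative) of a two-state machine written as mutually recursive calls; the last action of each state is to invoke the next, so every transition is a tail call:

    class StateMachine {
      // Accepts strings matching (ab)*: stateA expects an 'a',
      // stateB expects a 'b'. Every transition is a tail call, so
      // without tail call support each step consumes a stack frame
      // and a long enough input overflows the stack.
      static boolean stateA(String s, int i) {
        if (i == s.length()) return true;               // accept
        return s.charAt(i) == 'a' && stateB(s, i + 1);  // tail call
      }
      static boolean stateB(String s, int i) {
        if (i == s.length()) return false;              // dangling 'a'
        return s.charAt(i) == 'b' && stateA(s, i + 1);  // tail call
      }
    }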

In order to support this reliably, implementations convert tail recursion into iteration. As a result, the activation frames for the recursive calls never come into existence, preventing nasty stack overflows.
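
The conversion is mechanical. A sketch in Java (illustrative only - Java itself performs no such conversion):

    // Tail-recursive: the call is the last thing the method does,
    // so an implementation with proper tail calls can reuse the
    // current frame...
    static long sum(long n, long acc) {
      if (n == 0) return acc;
      return sum(n - 1, acc + n);   // tail call: no work remains here
    }

    // ...which amounts to exactly this loop. In Java, sum(n, 0)
    // overflows for large n, while sumLoop(n) computes the same
    // value in constant stack space.
    static long sumLoop(long n) {
      long acc = 0;
      while (n != 0) { acc += n; n--; }
      return acc;
    }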

I recently came across a truly brilliant post by Guy Steele arguing that tail recursion is actually essential to object oriented programming. Go read it, as well as the terrific essay by William Cook that it references.

And yet, object oriented languages usually don’t do tail call optimization (TCO for short). What’s their excuse?

One excuse, especially in the Java world, is security. Specifically, stack frames carry information about protection domains that the classic J2SE model depends upon. Since I am not (and have never been) a fan of that particular model, I’ll dispense with that argument. There are better ways to get security.

A more important excuse is debugging. Stack traces are very useful for debugging, of course, and if stack frames are optimized away via tail recursion, key information is lost.
I myself want complete stack traces, but I also want to write tail-recursive programs and know that they will reliably work.

One possibility is that upon stack overflow, we compress the stack - eliminating any tail-recursive frame that is at its return point. This should be controllable via debugging flags. The excess frames are eliminated, but only when strictly necessary. Programs that work well without tail call optimization won’t be affected. So if you think a reliable stack trace is more important than tail call optimization, you have no need for concern.
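
To make the idea concrete, here is a minimal sketch, assuming (hypothetically) that the runtime models activations as a linked list of Frame objects and can tell when a frame's only remaining work is to return its callee's result:

    final class Frame {
      final String method;
      Frame caller;
      boolean atReturnPoint;  // nothing left to do but pass the result on
      Frame(String method, Frame caller) {
        this.method = method;
        this.caller = caller;
      }
    }

    // On overflow, splice out every frame that is merely waiting to
    // relay a result; the remaining frames are relinked past them.
    static void compress(Frame top) {
      for (Frame f = top; f != null; f = f.caller) {
        while (f.caller != null && f.caller.atReturnPoint) {
          f.caller = f.caller.caller;
        }
      }
    }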

Programs that rely on tail calls might suffer performance loss in this mode, but they will still run - and be easier to debug. If this is a problem, one can specify a different configuration which optimizes tail calls away. So if you favor tail call optimization over stack traces, you need not be alarmed either.

Now let us change perspective slightly. Keep in mind that the stack compression suggested above is really just a particular form of garbage collection.

It’s good to recall that the whole notion of the call stack is simply an optimization. One can allocate activation records on the heap and dispense with a call “stack” altogether. If you did that, you could make each activation point at its caller. Tail-recursive calls might produce activations that pointed not at the caller, but at the caller of the first activation of the tail-recursive function. All intermediate frames would be subject to garbage collection.
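
Concretely, a sketch of such heap-allocated activations (all names hypothetical):

    final class Activation {
      final String method;
      final Activation returnTo;  // where the result will be delivered
      Activation(String method, Activation returnTo) {
        this.method = method;
        this.returnTo = returnTo;
      }
    }

    // An ordinary call answers to its caller; a tail call answers to
    // whatever the caller would have answered to, so the caller's
    // activation becomes garbage once nothing else references it.
    static Activation call(Activation caller, String method) {
      return new Activation(method, caller);
    }
    static Activation tailCall(Activation caller, String method) {
      return new Activation(method, caller.returnTo);
    }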

In the above model, aggressive garbage collection corresponds to traditional TCO. On the other hand, the idea of "collecting" the stack only when it runs out gives us as much debugging information as a conventional non-TCO implementation. We only lose information in situations that simply don't work without TCO.

Now let's pursue the analogy between stack frames and heap storage further. We’d like to have all our past states in the heap, so we could examine them with a time traveling debugger, but that’s usually too costly. If we are lucky enough to have a time traveling debugger at all, we must configure how much of the past is available.

The past history of the call stack (i.e., what tail recursive calls took place) is not fundamentally different. If we are chasing our own tail, we can go back in time and see it chasing us. So my ideal development environment would allow me to recover stack frames that had been eliminated by TCO if I really wanted to.

As usual, it’s easier to see the common abstraction if one considers everything, including activations, as an object.

I therefore argue that:

  1. Support for tail recursion should be required by the language specification. This is essential: if tail-recursive programs only run on certain implementations, one cannot write portable code that relies on tail recursion.
  2. How tail recursion is supported is an implementation detail. Whether tail-recursive activations never come into being, or exist on the stack or the heap and get collected by one policy or another, is up to the implementation.
  3. During development, the IDE should let you choose a policy for retaining computation history. In particular, it should allow you to retain as much of the history of your computation as possible - on the stack and the heap (e.g., time traveling debugging, full stack traces, etc.).
I’ve endeavored to support this outlook in Newspeak - but only partially so far.
The Newspeak spec states that method and closure activations are objects. A called activation is passed a reference to its continuation activation (not its continuation!), which is either its calling activation or, if there is no further code to execute in the caller (i.e., the call is a tail call), the continuation activation of the caller.
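
To illustrate the convention (this is a model of the spec's rule in plain Java, not actual Newspeak, reusing the hypothetical Activation class sketched earlier):

    // What a call site hands the callee's new activation: the calling
    // activation itself for an ordinary call, or the caller's own
    // continuation activation when the call is a tail call, since the
    // caller then has no code left to run.
    static Activation activate(Activation caller, String method, boolean isTailCall) {
      Activation continuationActivation = isTailCall ? caller.returnTo : caller;
      return new Activation(method, continuationActivation);
    }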

This meets requirements (1) and (2). Alas, the implementation does not yet comply (so forget about (3)). I hope we can correct this in time.

We need not compromise on either debugging or tail recursion. We do need to seek out common abstractions (like objects) and insist on better tools and languages rather than settling for the dubious services of the mainstream.

12 comments:

  1. Guy Steele's example (from Cook) of symbolic set composition and calculation is a red herring. The exact same argument outlined therein for TCO falls apart when you consider a symbolic set in the shape of a tree of Unions, rather than a linked list of Adjoins. In that case, TCO isn't sufficient to save you from stack overflow; you need allocation on the heap to save your place in the tree, in case you need to descend into the other branch when checking for inclusion (see the sketch at the end of this comment).

    TCO is an optimization predicated on the pragmatics of creating fast languages for current CPU architectures, the cache benefits of the stack, etc. If you're willing to do the necessary contortions to turn return-side state into accumulator parameters, you can write functional code which won't overflow the stack. But that doesn't make it much more natural.

    To the untrained eye, a return expression which expects (and requires) a tail call looks pretty similar to one that doesn't. Loops are simply more readable. Clojure's recur form seems preferable to me, because it makes the expectation explicit and lets the compiler ensure that the requirements are met - of course, it's not a general tail call, only suitable for tail recursion and thereby loops.

    I guess stack compression / explosion out onto the heap on overflow would be the ideal, but difficult, implementation.
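
    To sketch the Adjoin/Union point in Java (hypothetical classes, loosely following Cook's symbolic sets):

      abstract class SymSet { abstract boolean contains(int x); }

      class Empty extends SymSet {
        boolean contains(int x) { return false; }
      }

      // A chain of Adjoins: contains() ends in a tail call, so with
      // proper tail calls any length of chain runs in constant space.
      class Adjoin extends SymSet {
        final int elem; final SymSet rest;
        Adjoin(int elem, SymSet rest) { this.elem = elem; this.rest = rest; }
        boolean contains(int x) {
          return x == elem || rest.contains(x);  // tail position
        }
      }

      // A tree of Unions: the first recursive call must remember where
      // it is so it can try the other branch - NOT a tail call. TCO
      // alone cannot bound the space; you need stack or heap
      // proportional to the tree depth.
      class Union extends SymSet {
        final SymSet left, right;
        Union(SymSet left, SymSet right) { this.left = left; this.right = right; }
        boolean contains(int x) {
          return left.contains(x)     // not a tail call: must resume here
              || right.contains(x);   // tail position
        }
      }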

  2. MIT-Scheme has an interesting implementation of TCO, which retains "eliminated" stack frames in a fixed-size ring buffer for debugging, etc. Joe Marshall describes this in more detail.

  3. I thought tail calls couldn't be done in OO because of late binding, but I was confusing tail calls and tail recursion, and missed the fact that there are multiple optimizations there.

    Still, in OO languages, tail recursion can only be completely optimized into iteration (by reusing the current activation) when the compiler is certain that the called activation is of the same shape as its caller… i.e. in Smalltalk, for self sends of a monomorphic message (or a bit of luck with type inference).

  4. Damien, I believe that tail call optimization should be possible in Smalltalk VMs that keep type information harvested from PICs when JITting the interpreted bytecode.

    For example, in Strongtalk the JIT compiler retains type information and includes type tests in the generated code. This should allow the recursive call optimization to be extended to tail calls, the typical approach being to include an uncommon branch which forces frame deoptimization in the case of a type test failure.

    A similar optimization should be possible in the upcoming Cog JIT VM for Squeak, though that would probably be a second or third release item. I believe that the initial version will not do any inlining.

  5. I'm not keen on the idea of dynamically enabling TCO at stack overflow events. This exception is not the only problem of recursion w/o TCO; your app may suffer significant performance degradation because the activation stacks thrash CPU caches, especially if you (or an "ergonomic" VM) allocate large stacks to prevent overflows. If you mix that with heavy multithreading, you may even hit swapping costs that make your app orders of magnitude slower.

    [I remember the early days of Java, when we didn't have java.nio, so servers resorted to the thread-per-connection model. The performance and scalability of server apps was highly dependent on stack size, and that was without ANY recursion. Today we're once again seeing a surge in thread counts for different reasons. CPU core / hardware thread counts will keep increasing faster than memory sizes, and much faster than memory bandwidth.]

    A better idea IMO is: enable TCO by default, but disable it (via deoptimization) for methods that have their activation frames walked either by the exception-filling routine or by a debugger. The security framework(*) should provide some hint to the JVM, or use some internal/alternate API when walking stacks for security checks, so as NOT to deopt TCO - otherwise, these checks would eventually kill most TCO in the program. And this deoptimization should be optional, because it's highly debatable whether the missing frames are any good, either in exception dumps or in a debugger. Now if you have some app that dumps full exception traces frequently, you lose the TCO, but then you should fix the app anyway. So, the only compromise of this idea is that you don't get a full stack trace in the first exception that includes a given method; you must wait for the next exception.

    BTW we already have compromises in exception dumps. Very long dumps (esp. with chained exceptions) are often abbreviated by the VM; in some VMs like IBM's (at least with the default configuration for WebSphere), you don't get line numbers for optimized methods. All production installations of WebSphere use this configuration; speaking as a developer who supports a few apps on this platform, yeah, sometimes this makes my life a bit harder... but it's not a big deal either, considering that a bad bug in production always needs to be reproduced in the development environment, and when you do that you get your full exception info if you need it.

    (*) Readers should check Guy Steele's final comments: "The paper 'A Tail-Recursive Machine with Stack Inspection' by John Clements and Matthias Felleisen demonstrates that correct implementation of tail calls is not incompatible with security mechanisms that rely on stack inspection." So I call this a strawman.

    P.S.: My first serious programming, in Z80 Assembly, was chock-full of "tail calls" even though I had zero CS education and didn't even know how to use recursion. It was just obvious that, if the last operation in a routine R1 was an unconditional CALL R2, it was simpler and more efficient to just end R1 with JMP R2 (provided that I'd first pop any temp data pushed by R1 onto the stack - this would be necessary between a CALL and RET, too). There was no compromise in debugging convenience or anything. Today, I'm totally amazed that incredibly advanced optimizers like HotSpot won't do this simple, still highly rewarding trick.

  6. On terminology: I think it's important to speak of "proper tail calls". Tail recursion is only a subset of tail calling, and "optimization" implies this is a nice-to-have that affects only performance, like constant folding or loop unrolling. Tail calling is more like garbage collection or bignums: they don't affect the behavior of programs running on an abstract machine without any limits, but they strongly affect programs running on real (even if virtual) machines.

    Barry: Loops may be more readable, but they are less writable: you need to have two different ways to decompose functions. With proper tail calling, you need only one. And TCO long predates modern CPU architectures: it goes back to about 1974 in the Scheme community, and about 1980 in the Prolog community.

    Damien: OO languages conflict with proper tail calling only if you insist that each stack frame is a boxed object. If you use formless stack frames (which you can object-ify later if you need to), then all you have to do is reuse the upper part of the stack.

    Osvaldo: It's quite true that using too big a stack can be problematic. Chicken Scheme uses the C stack for all calls with no attempt at tail-calling, allocating all objects on the stack as we go. When the stack fills, any live objects are evicted to the heap, and all the existing stack frames are discarded using longjmp().

    In essence, then, tail-calling is amortized. Too small a stack, and you thrash too much in the garbage collector; too big, and you thrash too much paging it in and out. Chicken used to try to decide at config time what the best stack size was; it now simply uses a fixed stack size of 64K. For stack-trace purposes, Chicken preserves and reports the last N calls using a fixed-sized buffer.

  7. John - In programming language idioms, readability trumps writability. I wonder if cognitive dissonance stopped you from spotting the relative absurdity in what you wrote.

    I'm also pretty sure TCO predates Scheme, as it's an obvious optimization in assembly.

    The CPU point relates to the justification for having a contiguous stack at all, though, rather than doing everything in a GC heap. GC these days is efficient enough, and we have enough spare memory, to make it feasible in many cases. But memory is so slow relative to CPU that the cacheability of the stack has become the new limiting factor preventing heap allocation of frames. The CPU architecture point doesn't really relate to TCO.

  8. Barry: Agree with your comment to John, but readability is in the eye of the beholder. I confess I find Lisps in particular difficult to read - but then, my problem is not recursion instead of loops, rather other things (like higher-order function composition). Indeed, I tend to find recursive algorithms natural and easy to read. You can already do recursion in ANY modern OOPL (you just may have perf issues), and adding tail calls won't remove the looping constructs of the language, so you can just keep using them when you think explicit iteration is easier to read/write. Not to mention (once again) that tail calls are not useful only for recursion; they are great for ANY method that ends by invoking another method (and returning its result, if there is any).

    In fact, Guy Steele's post just skims over the advantages of TCO for OOPLs; he focuses on core paradigm issues of abstraction, but there are other advantages. OO programs tend to have too many tiny methods - like getters/setters, tons of delegation used by many design patterns, or simply the result of proper modularization (each class and method focusing on a single, minimal responsibility). The result is bloated, slow code. Modern compilers solve this mostly by heavy inlining, but this comes with a significant space cost, especially for methods of "average" size (short enough to inline, but long enough that inlining costs significantly more code space than it saves in calling protocol). TCO is often a good replacement: it produces a nice performance improvement, admittedly much less than inlining, but OTOH it's a simpler optimization. It costs less memory and JIT time; it can be used much more often, as callee size is irrelevant; and as a consequence of those advantages, it can be used in a low-optimizing JIT mode before there's any profiling info to justify expensive optimizations.

  9. Barry: Readability is all in what you are used to, and I, like Osvaldo, find recursive functions highly readable. Scheme does have specialized syntax for looping; as a matter of implementation, it is immediately rewritten to use tail calls.

    Tail-calling optimization may indeed have been a common assembler trick (though nobody told me about it in the late 70s, when I was writing a lot of assembler code), but Scheme is, as far as anyone knows, the first language to require proper tail calls. This distinction is fundamental to the point of this post, which is why its use of "TCO" is regrettable.

  10. @John: I very much agree with all your comments. I should have been more careful to distinguish tail recursion, tail calls, optimization etc.

    @Barry: Guy's point is valid, and deep, independently of any specific example. If one is to rely exclusively on procedural abstraction (as in solipsistic OO, as Wm. Cook presents it), then we should make procedural abstraction competitive with other abstractions - in particular, with abstractions that allow you to express computations unbounded in time yet bounded in space, such as loops.

    Proper tail calls are needed for that, unless one decides to cheat (as we often do), say by presenting control structures as magically defined procedures (look into how while loops are defined in any Smalltalk system). One can cheat for a very long time though.
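
    For instance, here is the cheat made explicit (illustrative Java rather than Smalltalk): a while "loop" defined as an ordinary method, which runs in bounded space only given proper tail calls:

      import java.util.function.BooleanSupplier;

      class Loops {
        static void whileTrue(BooleanSupplier cond, Runnable body) {
          if (!cond.getAsBoolean()) return;
          body.run();
          whileTrue(cond, body);  // tail call: without proper tail
                                  // calls, each iteration eats a frame
        }
        // e.g.: int[] i = {0}; whileTrue(() -> i[0] < 10, () -> i[0]++);
      }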

  11. You're absolutely right; eliminating the call stack makes debugging unreasonably hard. I recently spent a lot of time chasing down a bug in Erjang [Erlang on the JVM] in which all I had available was a call to erlang:error "throw exception" on the "top" of the stack. There was no indication of who called that error function, and thus of where the error even occurred.

    I know that in Erlang, you generally use tracing to handle such cases; so once you see the problem, you enable tracing and that lets you go back and see calls/returns. I don't have tracing in Erjang [yet], so it took a while to guess where to look for the problem.

  12. In a system that discards all but one of a given activation (or avoids allocating a new activation when it finds it already has one around that it can reuse), I wonder if it would be useful to add to the activation a counter of the number of duplicates that had been discarded or never created?

    That way, at least a stack trace would give an idea of what was missing.

    If foo() calls bar(), which tail-recursively calls itself 6 times, and then calls baz(), which throws, the stack trace with TCO but no counters would look like:

    exception NumberFormatException
    at baz
    at bar
    at foo

    With counters, it could produce:

    exception NumberFormatException
    at baz
    at bar * 7
    at foo

    Cold comfort for the programmer who wanted to go back and examine the stack frame of the third call to bar() in the debugger, but an improvement, nonetheless.
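
    A minimal sketch of the counter idea (hypothetical runtime code, in Java):

      final class Frame {
        final String method;
        final Frame caller;
        int occurrences = 1;  // this frame plus any discarded duplicates
        Frame(String method, Frame caller) {
          this.method = method;
          this.caller = caller;
        }
        String traceLine() {
          return occurrences == 1 ? "at " + method
                                  : "at " + method + " * " + occurrences;
        }
      }

      // A tail call to the same method reuses the frame and bumps the
      // counter; a tail call to a different method replaces the frame.
      static Frame tailCall(Frame current, String target) {
        if (current.method.equals(target)) {
          current.occurrences++;
          return current;
        }
        return new Frame(target, current.caller);
      }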
