A place to be (re)educated in Newspeak

Sunday, August 31, 2008

Foreign functions, VM primitives and Mirrors

An issue that crops up in systems based on a virtual machine is: what are the primitives provided by the VM and how are they represented?

One answer would be that those are simply the instructions constituting the virtual machine language (often referred to as byte codes). However, one typically finds that there are some operations that do not fit this mold. An example would be the defineClass() method, whose job is to take a class definition in JVML (Java Virtual Machine Language) and install it into the running JVM. Another would be the getClass() method that every Java object supports.

These operations cannot be expressed directly by the high level programming languages running on the VM, and no machine instruction is provided for them either. Instead, the VM provides a procedural interface. So while the Java platform exposes getClass(), defineClass() and the like, behind the scenes these Java methods invoke a VM primitive to do their job.

Why aren’t primitives supported by their own, dedicated virtual machine language instructions? One reason is there are typically too many of them, and giving each an instruction might disrupt the instruction set architecture (because you might need too many bits for opcodes, for example). It’s also useful to have an open ended set of primitives, rather than hardwiring them in the instruction set.

You won’t find much discussion of VM primitives in the Java world. Java provides no distinct mechanism for calling VM primitives. Instead, primitives are treated as native methods (aka foreign functions) and called using that mechanism. Indeed, in Java there is no distinction between a foreign function and a VM primitive: a VM primitive is a foreign function implemented by the VM.
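One way to see this from inside Java is to ask reflection whether a method is declared native. The sketch below is only an illustration: the class name NativeCheck and its helper are made up for this example, and the result depends on the class library in use (on OpenJDK-derived libraries, Object.getClass() is declared native while Object.toString() is ordinary Java code).

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical helper, not part of any library.
class NativeCheck {
    // Returns true if the public, no-argument method `name` on `owner`
    // carries the `native` modifier.
    static boolean isNative(Class<?> owner, String name) throws NoSuchMethodException {
        Method m = owner.getMethod(name);
        return Modifier.isNative(m.getModifiers());
    }

    public static void main(String[] args) throws Exception {
        // A VM primitive exposed through the native-method mechanism.
        System.out.println("Object.getClass native? " + isNative(Object.class, "getClass"));
        // An ordinary method written in Java, for contrast.
        System.out.println("Object.toString native? " + isNative(Object.class, "toString"));
    }
}
```

Nothing in the modifier says whether the method is a VM primitive or a call into an arbitrary C library; the language-level view is the same for both.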

On its face, this seems reasonable. The JVM is typically implemented in a foreign language (usually C or C++) and it can expose any desired primitives as C functions that can then be accessed as native methods. It is very tempting to use one common mechanism for both purposes.

One of the goals of this post is to explain why this is wrong, and why foreign functions and VM primitives differ and should be treated differently.

Curiously, while Smalltalk defines no standardized FFI (Foreign Function Interface), the original specification defines a standard set of VM primitives. Part of the reason is historical: Smalltalk was in a sense the native language on the systems where it originated. Hence there was no need for an FFI (just as no one ever talks about an FFI in C), and hence primitives could not be defined in terms of an FFI and had to be thought of distinctly.

However, the distinction is useful regardless. Calling a foreign function requires marshaling of data crossing the interface. This raises issues of different data formats, calling conventions, garbage collection etc. Calling a VM primitive is much simpler: the VM knows all there is to know about the management of data passed between it and the higher level language.

The set of primitives is moreover small and under the control of the VM implementor. The set of foreign functions is unbounded and needs to be extended routinely by application programmers. So the two have different usability requirements.

Finally, the primitives may not be written in a foreign language at all, but in the same language in a separate layer.

So, I’d argue that in general one needs both an FFI and a notion of VM primitives (as in, to take a random example, Strongtalk). Moreover, I would base an FFI on VM primitives rather than the other way around. That is, a foreign call is implemented by a particular primitive (call-foreign-function).

Consider that native methods in Java are implemented with VM support; the JVM’s method representation marks native methods specially, and the method invocation instructions handle native calls accordingly.

The Smalltalk blue book’s handling of primitives is similar; primitive methods are marked specially and handled as needed by the method invocation (send) instructions.

It might be good to have one instruction, invokeprimitive, dedicated to calling primitives. Each primitive would have an identifying code, and one assumes that the set of primitives would never exceed some predetermined size (8 bits?). That would keep the control of the VM entirely within the instruction set.
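To make the idea concrete, here is a minimal sketch of a toy stack interpreter with a single invokeprimitive instruction whose 8-bit operand indexes a primitive table. Everything here is an assumption made for illustration: the opcode values, the TinyVM and Primitive names, and the two sample primitives do not correspond to any real JVM or Smalltalk byte code set.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy VM; opcodes and primitive numbers are invented.
class TinyVM {
    static final int PUSH = 0x01;            // PUSH <byte>: push a small constant
    static final int INVOKEPRIMITIVE = 0x02; // INVOKEPRIMITIVE <index>: run primitives[index]
    static final int HALT = 0xFF;            // stop and return the top of the stack

    interface Primitive {
        void run(Deque<Integer> stack);
    }

    // An 8-bit operand bounds the table at 256 entries.
    private final Primitive[] primitives = new Primitive[256];

    TinyVM() {
        primitives[0] = s -> s.push(s.pop() + s.pop()); // primitive 0: integer add
        primitives[1] = s -> s.push(-s.pop());          // primitive 1: integer negate
    }

    int run(byte[] code) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;
        while (true) {
            int op = code[pc++] & 0xFF;
            switch (op) {
                case PUSH:
                    stack.push((int) code[pc++]);
                    break;
                case INVOKEPRIMITIVE: {
                    // The only path from byte code into the primitive set.
                    int index = code[pc++] & 0xFF;
                    Primitive p = primitives[index];
                    if (p == null) {
                        throw new IllegalStateException("unknown primitive " + index);
                    }
                    p.run(stack);
                    break;
                }
                case HALT:
                    return stack.pop();
                default:
                    throw new IllegalStateException("unknown opcode " + op);
            }
        }
    }
}
```

With this layout, adding a primitive means filling a table slot rather than allocating a new opcode, and a verifier only has to look at one instruction to find every primitive call.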

It is good to have a standardized set of VM primitives, as Smalltalk-80 did. It makes the interface between the VM and built in libraries cleaner, so these libraries can be portable. We discussed doing this for the JVM about nine or ten years ago, but it never went anywhere.

If primitives aren’t just FFI calls, how does one invoke them at the language level? Smalltalk has a special syntax for them, but I believe this is a mistake. In Newspeak, we view a primitive call as a message send to the VM. So it is natural to reify the VM via a VM mirror that supports messages corresponding to all the primitives.

A nice thing about using a mirror in this way is that access to primitives is now controlled by a capability (the VM mirror), so the standard object-capability architecture handles access to primitives just like anything else.

To get this to really work reliably, the low level mirror system must prohibit installation of primitive methods by compilers etc.

Another desirable property of this scheme is that you can emulate the primitives in a regular object for purposes of testing, profiling or whatever. It's all a natural outgrowth of using objects and message passing throughout.
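A rough sketch of the shape of this arrangement, written in Java rather than Newspeak and so only an approximation (Java is not an object-capability language). All names here (VMMirror, HostVMMirror, RecordingVMMirror, Stopwatch) and the choice of two sample primitives are hypothetical; the point is only that library code reaches primitives through an object it was handed, and that an ordinary object can stand in for that mirror.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mirror: one message per primitive.
interface VMMirror {
    long currentTimeMillis();
    int identityHash(Object o);
}

// Stands in for the mirror supplied by the real VM; here it delegates to the host platform.
class HostVMMirror implements VMMirror {
    public long currentTimeMillis() { return System.currentTimeMillis(); }
    public int identityHash(Object o) { return System.identityHashCode(o); }
}

// An ordinary object emulating the primitives: it records each call
// and returns deterministic values, which is handy for tests and profiling.
class RecordingVMMirror implements VMMirror {
    final List<String> calls = new ArrayList<>();
    private long clock = 1000;

    public long currentTimeMillis() {
        calls.add("currentTimeMillis");
        return clock++;
    }

    public int identityHash(Object o) {
        calls.add("identityHash");
        return 42;
    }
}

// Library code can only use primitives through the mirror it was given.
class Stopwatch {
    private final VMMirror vm;
    private final long start;

    Stopwatch(VMMirror vm) {
        this.vm = vm;
        this.start = vm.currentTimeMillis();
    }

    long elapsed() {
        return vm.currentTimeMillis() - start;
    }
}
```

Code that is never handed a VMMirror has no way to send it messages, and Stopwatch cannot tell whether it is talking to the host-backed mirror or the recording one.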

10 comments:

Patrick said...

Gilad--

If I understand it correctly, in HotSpot there are a number of functions which are treated as "intrinsic", which I believe are declared (from Java) to be "native", but whose implementation is actually part of the VM itself (i.e. there is no call leaving the VM boundary). IIRC, this is done primarily for performance--are you including these in your discussion as well?

Thanks
Patrick

gcorriga said...

This comment has been removed by a blog administrator.

Gilad Bracha said...

Giovanni,
What the VM interfaces to is the language *implementation* - which is not the same as the language itself.

The VM already has to do this - the VM typically knows about Object, Class, Process etc. So making VMMirror a special object that is part of the VM implementation is no big deal.

Much more importantly, it gives programmers access to the VM with a clean abstraction they are used to - an object - that fits in with how everything else is handled in Newspeak.

Steve said...

Looked at from the standpoint of the Smalltalk programmer, the idea of putting the primitives together in a single object has some attractions. For one thing, it is clear where such behaviour resides.

OTOH primitives often implement quite different behaviours - eg. create an instance of some class, perform garbage collection, evaluate a block. It is not clear that these behaviours belong on the same object, other than due to being primitives.

By that measure the VMMirror becomes more of a library object; essentially a bucket for disparate behaviours that happen to be primitives.

I'm not sure I feel completely comfortable with that.

Ryan said...

I think putting primitives into an object/objects is a really great idea because the more uniformly the objects/messages metaphor is applied, the more powerful it becomes. Perhaps in objects, primitives will be more accessible to programmers who haven't learned the finer points of the platform as deeply, and some creative uses might be found.

Gilad Bracha said...

Steve,

What happens in practice is that whatever library methods are now implemented as primitives are instead implemented as calls on VMMirror. So for most programmers, nothing changes.

What has changed is that the language is no longer burdened by an extra misfeature, and access to primitives is controlled the same way as everything else.

So security, or testing, or browsing (to take 3 prominent examples) do not have to handle primitives as a special case.

Steve said...

So how are you proposing that primitives be added to the VMMirror? Is there any high-level language coding involved at all to do so, or would creating and declaring the primitive in the VM itself be enough?

What about primitive failure code? Typically that can be fairly low-level as well - such as recovering from an object allocation failure - and may well involve the calling of other primitives, which would now take the form of sending other messages to the VMMirror. It would be convenient to put such primitive failure code in methods on the VMMirror as well, at least in some cases.

Should VMMirror be a special object or a special class? Should its class be subclassable? Is it, in effect, a normal object in the high level language or is it intrinsically different as the embodiment of the interface to the VM?

Gilad Bracha said...

Steve,

Adding the primitive to the VM is all that would be required (whether that's high level language or not depends on what the VM is written in).

Primitives take an error handling closure as a parameter, much like Strongtalk. These are passed by the caller, written in Newspeak, and if they call other primitives, so be it.

From the perspective of the user, VMMirror is an object. Of course it has a class, but you can't really do anything with it.

A general comment: the motivation for this arrangement is to make the system work better for its users, not to make it particularly convenient for VM implementors.

Steve said...

I only used the phrase "high-level language", since you had mentioned Java in the original post, and, leaving aside discussions of what constitutes a high-level language, I didn't necessarily want to limit the scope of the discussion to Smalltalk, though obviously that is where my interests lie.

At least in Strongtalk we can make some progress towards this goal fairly easily. By converting all of the existing primitives to a procedural form where the object that was the receiver of the message is now merely an argument to the primitive call on the VMMirror, we can at least move all of the current primitives to a VMMirror object.

One wrinkle that would need to be ironed out is that the compiler doesn't appear to support passing of error-handling closures held in variables as arguments to primitives, though possibly it would if they were explicitly typed.

In the first instance I would probably keep wrapper methods for the current syntax of primitive, since that will keep the first step simpler.

As an optional second step we could replace this VMMirror with a special purpose object provided by the VM.

It is the special-purpose nature of this object that concerns me, though. If it appears to the Smalltalk programmer as an ordinary object that just happens to invoke a primitive when a message is sent to it, then what would stop a programmer from attempting to mix in this behaviour into their own classes? Of course we could detect source references to VMMirror and generate bytecode to invoke the relevant primitive, but since the source for the compiler is written in Smalltalk, we have only deferred the problem to a different part of the image. Further, that would defeat simulation of primitive execution in Smalltalk code, which would be a nice feature to have.

I'm not necessarily averse to the idea. I'm just not that keen on introducing a special object that has to be treated differently not just by the VM, but also by elements of Smalltalk code.

I'm completely on board with making things better for the users of the VM at the expense of making them harder for the VM maker. OTOH I'm against making them harder for the VM maker if it doesn't provide extra value to the user. I'm not yet convinced that going further than moving the primitives to a VMMirror object is worth the extra effort involved.

Gilad Bracha said...

Steve,

This is getting rather detailed, so perhaps we should take it offline.

I certainly don't see a pressing need to change Strongtalk so the primitive calling syntax is gone. It would, as you suggest, suffice to wrap them in methods in a VMMirror class written in Smalltalk, and provide an accessor for an instance of it as part of the Newspeak platform object.

If one wants to really secure the system, there are many issues involved (the byte code format not least among them).