Room 101: Serialization Killer

Way back in October 2009, I threatened to write a post about how serialization can serve as a binary format. The moment of reckoning has arrived.

Object serialization is probably most widely known due to Java serialization, but of course has a long history before that. Modula-2+ supported pickling long before Java, for example, as did Smalltalk systems.

Java serialization serializes objects in a most un-object-oriented way: it separates the object’s data from its behavior. Only the data is actually serialized. The object’s behavior (namely its class) is represented symbolically (as fully qualified class names; more on that later). During deserialization, the symbolic class information is used to reconstruct the classes of objects.

The problem is that this only works properly when both serializer and deserializer agree on the interpretation of the symbolic class information. For example, when two VMs running identical versions of the code communicate via RMI (the original use of Java serialization).

If the code in the deserializer differs from that in the serializer, as is very often the case (say, when one wants to load old serialized data) problems arise. The serialized data may not describe instances of the class on the deserializing side at all, because the private representation of the class may have changed.

Tangent: Java serialization introduced an extra-linguistic mechanism for creating instances, that was not considered as part of the language design, which only foresaw objects being created via constructor calls. This too is problematic. What if the invariants imposed by the constructor change over time?

To deal with these problems, one may opt to store data using a more stable schema than the in-memory representation (e.g., a database, an agreed external data format etc.).

Alternatively, one can add conversion routines that map old representations into new ones. This requires identifying the version of the object’s class (aka the serialVersionUID) when serializing an object. This approach is problematic however. Each change of representation requires a new version number, and a new conversion routine. These must be in place before the objects are serialized.

The reliance on class names is also an issue. What of anonymous classes? This is a problem in any case, but aggravated due to the reliance on names.

Tangent: the serialization team was, however, perfectly justified in assuming every class had a well defined name. They were working with Java 1.0, before the introduction of inner classes. Likewise, the inner class team was working on a system without serialization. No one saw the conflict until after the release combining the two - when it was far too late to do much about it.

In contrast, if one actually serializes the objects rather than just the data (that is, one serializes the data and the behavior), the serialized objects are much more self contained. (at some point one still wants to cut things off, but at stable APIs like Object).

If you want to bring old objects up to date, you must convert them; but:

You don’t have to; they work exactly as they did the day they were serialized, just like a mummy come to life.
You can add the conversion after the fact, at any time; for example, you can deserialize and then convert. The only requirement is that the necessary information is available via the object’s public API.

The serialization team didn’t have the option of serializing the classes in the manner just described. The Java byte code verifier makes that impossible. The verifier imposes a nominal type system, which means you cannot have two classes with the same name running in the same class loader.

Tangent: The wonders of byte code verification probably deserve a post of their own. For now, just note this as another example of the kind of difficult-to-foresee interactions that occur between seemingly unrelated parts of a complex system.

Assume we have a system where we can serialize objects including their behavior. Can we use the serialization format as a binary format for code? Specifically, can we use serialized class objects as our binary format?

In Newspeak, top level classes, also known as module declarations, are stateless. Hence the serialized form of these class objects is stateless as well, fulfilling a key requirement for a binary code representation.

Module declarations have no external dependencies, so we needn’t serialize a great tangle of objects as is often the case with object serialization.

Tangent: This is what a module is supposed to be: something that can be built independently! This also means that you can load modules declarations in any order. I note with glee that this runs against the entire tradition of ADTs as the basis for software modularity.

Entire applications can be represented this way as well - it’s simply a matter of creating an object that ties together the various module declarations used by the application. This source form of this object acts like your makefile, and its serialized form is analogous to an executable (or a JAR or whatever). To make this more concrete:

A Newspeak application is an object conforming to a standard API. This API consists of a single method, main:args:.

class BraveNewWorldExplorerApp fileBrowserClass: fb = (

| BraveNewWorldExplorer = fb. |

)(

“MAIN METHOD”

public main: platform args: argv = (

| fn |

fn:: argv at: 1 ifAbsent:[ 'C:/Users'].

platform hopscotch core HopscotchWindow openSubject:

((BraveNewWorldExplorer usingPlatform: platform) FileSubject onModel: fn

)))

You need not understand every detail here; what is important is the following:

An serialized instance of BraveNewWorldExplorerApp acts as our binary. The Newspeak runtime loads such a serialized object, deserializes it, and invokes its main:args: method. The latter invocation is very similar to what a JVM does when it loads the main class of a program and calls its main() method, or for that matter, what C does with the main() function.

The method is invoked with two parameters (here we differ from the mainstream). The second of these represents any (command line) arguments to the program, just like argv in a C program. What is different is the first argument, platform, which represents the Newspeak platform. The precise meaning of the expression inside the method is relatively unimportant. What matters is that we use the platform argument in two places: first, to instantiate the file browser module, so that it can make use of platform code; and second to access the GUI (platform HopscotchFramework).

In this case, the application instance is created with a single parameter, the module declaration for the file browser.
More complex applications tie several modules together; in that case, the app module would be instantiated with a series of parameters, one for each module declaration required by the application.

To make it easy for developers, our IDE uses a standard convention for instantiating and serializing application objects. If a top level class has a class method packageUsing:, the IDE will assume the class represents an application, and allows us to create a deployable app with the push of a button.

public packageUsing: ideNamespace = (

^BraveNewWorldExplorerApp fileBrowserClass: ideNamespace BraveNewWorldExplorer

)

The IDE will call the class method, passing it a namespace object as a parameter. The method can use that namespace to look up any available module declarations that it needs to gather into the application, and compute an application object that references them all. This application object is then serialized. This packaging process is somewhat analogous to constructing a JAR file.

Semi-tangent: We also allow you to output more common/mundane deployment formats like Windows executables. Likewise, MacOS apps or Linux rpms can (and likely will) be added; a small matter of programming. Most interesting, and still in flight, deployment as web pages.

We have serialized/deserialized applications, such as the compiler and the IDE, into binary objects a few hundred kilobytes in size. However, this isn’t our standard modus operandi yet. Right now, we are still slowly untangling ourselves from the Squeak environment.

What then is the moral of the story? Well, one moral is that a running application can be thought of as an object, combining state and behavior; moreover, classical binary formats like a.out can be thought of as serialized objects.

Why is this profitable? Because we can cover more ground with less concepts, and less implementation effort. For example, rather than class files, JAR files and serialized objects, we can do with serialized objects alone. Moreover, we can do better with this one mechanism than we did with the other three combined. Less is more. And that is the moral of many stories.

Room 101

Tuesday, February 23, 2010

Serialization Killer

No comments: