Room 101: 02/01/2010

Way back in October 2009, I threatened to write a post about how serialization can serve as a binary format. The moment of reckoning has arrived.

Object serialization is probably most widely known due to Java serialization, but of course has a long history before that. Modula-2+ supported pickling long before Java, for example, as did Smalltalk systems.

Java serialization serializes objects in a most un-object-oriented way: it separates the object’s data from its behavior. Only the data is actually serialized. The object’s behavior (namely its class) is represented symbolically (as fully qualified class names; more on that later). During deserialization, the symbolic class information is used to reconstruct the classes of objects.

The problem is that this only works properly when both serializer and deserializer agree on the interpretation of the symbolic class information. For example, when two VMs running identical versions of the code communicate via RMI (the original use of Java serialization).

If the code in the deserializer differs from that in the serializer, as is very often the case (say, when one wants to load old serialized data) problems arise. The serialized data may not describe instances of the class on the deserializing side at all, because the private representation of the class may have changed.

Tangent: Java serialization introduced an extra-linguistic mechanism for creating instances, that was not considered as part of the language design, which only foresaw objects being created via constructor calls. This too is problematic. What if the invariants imposed by the constructor change over time?

To deal with these problems, one may opt to store data using a more stable schema than the in-memory representation (e.g., a database, an agreed external data format etc.).

Alternatively, one can add conversion routines that map old representations into new ones. This requires identifying the version of the object’s class (aka the serialVersionUID) when serializing an object. This approach is problematic however. Each change of representation requires a new version number, and a new conversion routine. These must be in place before the objects are serialized.

The reliance on class names is also an issue. What of anonymous classes? This is a problem in any case, but aggravated due to the reliance on names.

Tangent: the serialization team was, however, perfectly justified in assuming every class had a well defined name. They were working with Java 1.0, before the introduction of inner classes. Likewise, the inner class team was working on a system without serialization. No one saw the conflict until after the release combining the two - when it was far too late to do much about it.

In contrast, if one actually serializes the objects rather than just the data (that is, one serializes the data and the behavior), the serialized objects are much more self contained. (at some point one still wants to cut things off, but at stable APIs like Object).

If you want to bring old objects up to date, you must convert them; but:

You don’t have to; they work exactly as they did the day they were serialized, just like a mummy come to life.
You can add the conversion after the fact, at any time; for example, you can deserialize and then convert. The only requirement is that the necessary information is available via the object’s public API.

The serialization team didn’t have the option of serializing the classes in the manner just described. The Java byte code verifier makes that impossible. The verifier imposes a nominal type system, which means you cannot have two classes with the same name running in the same class loader.

Tangent: The wonders of byte code verification probably deserve a post of their own. For now, just note this as another example of the kind of difficult-to-foresee interactions that occur between seemingly unrelated parts of a complex system.

Assume we have a system where we can serialize objects including their behavior. Can we use the serialization format as a binary format for code? Specifically, can we use serialized class objects as our binary format?

In Newspeak, top level classes, also known as module declarations, are stateless. Hence the serialized form of these class objects is stateless as well, fulfilling a key requirement for a binary code representation.

Module declarations have no external dependencies, so we needn’t serialize a great tangle of objects as is often the case with object serialization.

Tangent: This is what a module is supposed to be: something that can be built independently! This also means that you can load modules declarations in any order. I note with glee that this runs against the entire tradition of ADTs as the basis for software modularity.

Entire applications can be represented this way as well - it’s simply a matter of creating an object that ties together the various module declarations used by the application. This source form of this object acts like your makefile, and its serialized form is analogous to an executable (or a JAR or whatever). To make this more concrete:

A Newspeak application is an object conforming to a standard API. This API consists of a single method, main:args:.

class BraveNewWorldExplorerApp fileBrowserClass: fb = (

| BraveNewWorldExplorer = fb. |

)(

“MAIN METHOD”

public main: platform args: argv = (

| fn |

fn:: argv at: 1 ifAbsent:[ 'C:/Users'].

platform hopscotch core HopscotchWindow openSubject:

((BraveNewWorldExplorer usingPlatform: platform) FileSubject onModel: fn

)))

You need not understand every detail here; what is important is the following:

An serialized instance of BraveNewWorldExplorerApp acts as our binary. The Newspeak runtime loads such a serialized object, deserializes it, and invokes its main:args: method. The latter invocation is very similar to what a JVM does when it loads the main class of a program and calls its main() method, or for that matter, what C does with the main() function.

The method is invoked with two parameters (here we differ from the mainstream). The second of these represents any (command line) arguments to the program, just like argv in a C program. What is different is the first argument, platform, which represents the Newspeak platform. The precise meaning of the expression inside the method is relatively unimportant. What matters is that we use the platform argument in two places: first, to instantiate the file browser module, so that it can make use of platform code; and second to access the GUI (platform HopscotchFramework).

In this case, the application instance is created with a single parameter, the module declaration for the file browser.
More complex applications tie several modules together; in that case, the app module would be instantiated with a series of parameters, one for each module declaration required by the application.

To make it easy for developers, our IDE uses a standard convention for instantiating and serializing application objects. If a top level class has a class method packageUsing:, the IDE will assume the class represents an application, and allows us to create a deployable app with the push of a button.

public packageUsing: ideNamespace = (

^BraveNewWorldExplorerApp fileBrowserClass: ideNamespace BraveNewWorldExplorer

)

The IDE will call the class method, passing it a namespace object as a parameter. The method can use that namespace to look up any available module declarations that it needs to gather into the application, and compute an application object that references them all. This application object is then serialized. This packaging process is somewhat analogous to constructing a JAR file.

Semi-tangent: We also allow you to output more common/mundane deployment formats like Windows executables. Likewise, MacOS apps or Linux rpms can (and likely will) be added; a small matter of programming. Most interesting, and still in flight, deployment as web pages.

We have serialized/deserialized applications, such as the compiler and the IDE, into binary objects a few hundred kilobytes in size. However, this isn’t our standard modus operandi yet. Right now, we are still slowly untangling ourselves from the Squeak environment.

What then is the moral of the story? Well, one moral is that a running application can be thought of as an object, combining state and behavior; moreover, classical binary formats like a.out can be thought of as serialized objects.

Why is this profitable? Because we can cover more ground with less concepts, and less implementation effort. For example, rather than class files, JAR files and serialized objects, we can do with serialized objects alone. Moreover, we can do better with this one mechanism than we did with the other three combined. Less is more. And that is the moral of many stories.

Files are extremely important in current computing experience. Much too important. Files should be put in their place; they should be put away.

There are two aspects to this: the user experience, and the programmer experience. These are connected. Let’s start with the user experience.

Users see a hierarchical file system (HFS), with directories represented as folders. The idea of an HFS goes way back. The folder was popularized by Apple - first with the Apple Lisa , the ill-fated precursor of the mac, and then with the mac itself.

Historical Tangent: The desktop metaphor comes from Xerox PARC. I know some of that history is controversial, but one thing Steve Jobs did NOT see at the legendary 1979 Smalltalk demo was a folder. Smalltalk had no file system to put folders in. To be fair though, Smalltalk had a category hierarchy navigated via a multi-pane browser much like the file browsers we see in MacOS today. The folder came later, with the Xerox Star (1981 or so).

David Gelernter has said that computers are turning us into filing clerks. Sadly, his attempt to fix this was a commercial failure, but his point is well taken. We have seen attempts to improve the situation - things like Apple’s spotlight and Google desktop search - but this is only a transition.

Vista was supposed to have a database as a file system. This is where we’re going. Web apps don’t have access to the file system. Instead we see mechanisms like persistent object stores and/or databases. Future computers will abstract away the underlying file system - just like the iPhone and iPad. Jobs gave us the folder (i.e., the graphical/UI metaphor for the HFS) and Jobs taketh away.

This trend is driven in part by an attempt to improve the user experience, but there are also other considerations. One of these is security - and better security is also better user experience. Ultimately, it is about control: If you don’t have a file system, it becomes harder for you to download content from unauthorized sources. This is also good for security, and in a perverse way, for the user experience. And it’s also good for software service providers.

Tangent: This is closely tied to my previous post regarding the trend toward software services that run on restricted clients.

Which brings us to the programmer experience. File APIs will disappear from client platforms (as in the web browser). So programmers will become accustomed to working with persistent object stores as provided by HTML 5, Air etc. And as they do, they will get less attached to files as code representation as well.

Most programmers today have a deep attachment to files as a program representation. Take the common convention of representing the Java package hierarchy as a directory hierarchy, with individual classes as files within those directories.

Unlike C or C++, there is nothing in the Java programming language that requires this. Java wisely dispensed with header files, include directives and the C preprocessor culture. This is a great help in fighting bloat, inordinately long compilation times, platform dependencies etc.

A Java program consists of packages, which in turn consist of compilation units. There are no files to be found. And yet, the convention of using directories as a proxy for the package hierarchy persists.

Of course, it’s not just Java programmers. Programmers in almost any language waste their time fretting over files. The only significant exception is (bien sur!) Smalltalk (and its relatives).

Files are an artifact that has nothing to do with the algorithms your program uses, or its data structures, or the problem the program is trying to solve. You don’t need to know how your code is scattered among files anymore than you need to know what disk sector it’s on. Worrying about it is just unnecessary cognitive load. Programmers need not be filing clerks either.

With modern IDEs, one can easily view the structure of the program instead. In fact, the IDE can load your Java program that much faster if it doesn’t use the standard convention. You can still export your code in files for transport or storage but that is pretty much the only use for them.

I suspect these comments will spur a heated response. Most programmers have used the file system as a surrogate IDE for so long that they find it hard to break old habits and imagine a cleaner, simpler way of doing business. But do note that I am not arguing for the Smalltalk image model here - I’ve discussed its strengths and weaknesses elsewhere.

What I am saying is that your data - including but not limited to program code - should be viewed in a structured, searchable, semantically meaningful form. Not text files, not byte streams, but (collections of) objects.

As file systems disappear from the user experience, and from client APIs, newer generations of coders will be increasingly open to the idea of storing their code in something more like a database or object store. It will take time, and better tooling (especially IDEs and source control systems) but it will happen.

Room 101

Tuesday, February 23, 2010

Serialization Killer

Monday, February 01, 2010

Nail Files

About Me

Blog Archive