A place to be (re)educated in Newspeak

Monday, February 01, 2010

Nail Files

Files are extremely important in the current computing experience. Much too important. Files should be put in their place; they should be put away.

There are two aspects to this: the user experience, and the programmer experience. These are connected. Let’s start with the user experience.

Users see a hierarchical file system (HFS), with directories represented as folders. The idea of an HFS goes way back. The folder was popularized by Apple - first with the Apple Lisa, the ill-fated precursor of the Mac, and then with the Mac itself.

Historical Tangent: The desktop metaphor comes from Xerox PARC. I know some of that history is controversial, but one thing Steve Jobs did NOT see at the legendary 1979 Smalltalk demo was a folder. Smalltalk had no file system to put folders in. To be fair though, Smalltalk had a category hierarchy navigated via a multi-pane browser much like the file browsers we see in MacOS today. The folder came later, with the Xerox Star (1981 or so).

David Gelernter has said that computers are turning us into filing clerks. Sadly, his attempt to fix this was a commercial failure, but his point is well taken. We have seen attempts to improve the situation - things like Apple’s Spotlight and Google Desktop Search - but this is only a transition.

Vista was supposed to have a database as a file system. This is where we’re going. Web apps don’t have access to the file system. Instead we see mechanisms like persistent object stores and/or databases. Future computers will abstract away the underlying file system - just like the iPhone and iPad. Jobs gave us the folder (i.e., the graphical/UI metaphor for the HFS) and Jobs taketh away.

This trend is driven in part by an attempt to improve the user experience, but there are also other considerations. One of these is security - and better security is also better user experience. Ultimately, it is about control: if you don’t have a file system, it becomes harder for you to download content from unauthorized sources. That too is good for security and, in a perverse way, for the user experience. And it’s also good for software service providers.

Tangent: This is closely tied to my previous post regarding the trend toward software services that run on restricted clients.

Which brings us to the programmer experience. File APIs will disappear from client platforms (as in the web browser). So programmers will become accustomed to working with persistent object stores as provided by HTML5, AIR, etc. And as they do, they will get less attached to files as a code representation as well.

Most programmers today have a deep attachment to files as a program representation. Take the common convention of representing the Java package hierarchy as a directory hierarchy, with individual classes as files within those directories.

Unlike C or C++, there is nothing in the Java programming language that requires this. Java wisely dispensed with header files, include directives and the C preprocessor culture. This is a great help in fighting bloat, inordinately long compilation times, platform dependencies etc.

A Java program consists of packages, which in turn consist of compilation units. There are no files to be found. And yet, the convention of using directories as a proxy for the package hierarchy persists.
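
To make that concrete, here is a minimal sketch (the package and class names below are hypothetical, not taken from any real project): a compilation unit names its package in its own source text, and nothing in that declaration refers to a file or a folder.

// Hypothetical compilation unit for illustration. The package is declared
// in the source itself; no directory is mentioned anywhere.
package com.example.geometry;

public class Point {
    private final double x, y;

    public Point(double x, double y) {
        this.x = x;
        this.y = y;
    }

    // Euclidean distance to another point.
    public double distanceTo(Point other) {
        double dx = x - other.x;
        double dy = y - other.y;
        return Math.sqrt(dx * dx + dy * dy);
    }
}

Compile this with javac Point.java and it works wherever the file happens to sit; the com/example/geometry/ directory layout only matters when tools go looking for sources or classes by name (the classpath, -sourcepath, and so on).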

Of course, it’s not just Java programmers. Programmers in almost any language waste their time fretting over files. The only significant exception is (bien sûr!) Smalltalk (and its relatives).

Files are an artifact that has nothing to do with the algorithms your program uses, or its data structures, or the problem the program is trying to solve. You don’t need to know how your code is scattered among files any more than you need to know what disk sector it’s on. Worrying about it is just unnecessary cognitive load. Programmers need not be filing clerks either.

With modern IDEs, one can easily view the structure of the program instead. In fact, the IDE can load your Java program that much faster if it doesn’t use the standard convention. You can still export your code in files for transport or storage but that is pretty much the only use for them.

I suspect these comments will spur a heated response. Most programmers have used the file system as a surrogate IDE for so long that they find it hard to break old habits and imagine a cleaner, simpler way of doing business. But do note that I am not arguing for the Smalltalk image model here - I’ve discussed its strengths and weaknesses elsewhere.

What I am saying is that your data - including but not limited to program code - should be viewed in a structured, searchable, semantically meaningful form. Not text files, not byte streams, but (collections of) objects.

As file systems disappear from the user experience, and from client APIs, newer generations of coders will be increasingly open to the idea of storing their code in something more like a database or object store. It will take time, and better tooling (especially IDEs and source control systems) but it will happen.

38 comments:

Neal Gafter said...

The Java programming language is moving toward giving more meaning to the file system, not less. The modularity extension enforces, at compile time, protection at module boundaries. But the compiler determines what module a Java source file belongs to not by something appearing in its syntax, but by its placement in the file system.

On the other hand, you probably didn't expect Java to lead the way toward sanity.

Eric Arseneau said...

AMEN!

Eric Arseneau said...

Neal, that is rather unfortunate :( There is a good reason for this separation of file system and compiled code in Java, as it is the ClassLoader's responsibility to resolve a class. And ClassLoaders do NOT have to use files. Does this mean that the module system will not work with this ClassLoader semantic model?

Patrick Mueller said...

I more or less agree with the argument here; however, there are some nice aspects to "files" and "directories".

I think the biggest problem w/r/t programming languages is the reliance on file systems that our current crop of SCM systems has. Makes me kinda wonder if what'll happen is we'll see files/directories continue to serve as shareable artifacts, but once they've been sucked into your programming system, you'll never see the file parts of them again ... until you want to share again.

John Cowan said...

Historical correction:

Xerox's claim to have invented the folder is firm. The 1983 Lisa was the first Apple computer with a GUI (including folders), but the Xerox Star was first sold in 1981, and it definitely had folders. I was not on the inside for this, but the closest thing to it: at one time I was the only commercial third-party Mesa programmer between Boston and Baltimore, Mesa being the "missing link between Pascal and Modula-2" that was used in Star programming.

I was also the first person to release third-party freeware for the Star. I still miss the little app I wrote that started up Star's Notepad-analogue with a hierarchical list of directories and files when you dropped a folder or file drawer (disk/network device) onto it. You could rearrange the files however you wanted, and when you closed the editor, the folder hierarchy would be rearranged to suit.

There were a lot of common elements between the Lisa and the Star, most notably the fact that you had to boot a different OS altogether (Tajo/XDE on the Star, Workshop on the Lisa) to do development. That meant, at least on the Star, that every time the debugger stopped your program on a breakpoint, it had to swap memory images, taking about three minutes to do so!

So Jobs did not give us the folder, and (given how the concept has proliferated) Jobs is most unlikely to take it away either. Hierarchies don't have to be rigid to be useful.

Osvaldo Doederlein said...

Java's moved away from the HFS in devices - see JavaME/MIDP's RMS, 10 years old now. But the initial innovation was probably the browser with its cookies, and server-side storage of anything too big for cookies. (Ironically, hierarchical pathnames live on in your browser's URLs, even more so with RESTful URLs.)

For end users, the HFS does indeed tend to go away as computers become more like other electronic appliances. But for programmers, the HFS must live on, even if in a more restricted form: applications will just be sandboxed so they only have access to a sub-tree of the whole HFS. Special APIs like HTML5's, RMS, etc., are hideous because they are gratuitous platform/language variations, make it harder to reuse code and tools (grep, anyone?), etc.

So I think the real trend here (for developers) is not the move away from the HFS - in fact we just keep asking for more (ZFS, BtrFS, etc.); we want+need control, we want+need performance. (Microsoft's WinFS failed because it was bloated and slow [I've tested the beta]; perhaps it was too early for a fully relational/transactional FS.) No, the real trend is that applications' user space is becoming severely more restricted, with harder and harder rules imposed by sandboxing technologies (managed runtimes, virtualization, Unix jails, Google NaCl - anything that does the job). And this is good for developers, as long as every stupid device/platform doesn't impose completely different APIs and concepts for such basic tasks as storing some data or opening a socket. Just make things managed, and require security certificates or something similar to unlock access to protected resources and APIs.
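
As a rough sketch of that sandboxed sub-tree idea (the directory and file names below are made up for illustration, and real sandboxes live in the platform, not in application code), confining an application can be as simple as canonicalizing every requested path and checking that it stays under the sandbox root:

import java.io.File;
import java.io.IOException;

// Illustrative only: a tiny path check that confines file access to one
// sub-tree of the file system.
public class Sandbox {
    private final String root;

    public Sandbox(File rootDir) throws IOException {
        this.root = rootDir.getCanonicalPath() + File.separator;
    }

    // Resolve a relative path inside the sandbox, rejecting escapes like "../".
    public File resolve(String relativePath) throws IOException {
        File candidate = new File(root, relativePath);
        if (!candidate.getCanonicalPath().startsWith(root)) {
            throw new SecurityException("Path escapes sandbox: " + relativePath);
        }
        return candidate;
    }

    public static void main(String[] args) throws IOException {
        Sandbox sandbox = new Sandbox(new File("/tmp/app-data"));   // hypothetical root
        System.out.println(sandbox.resolve("settings/prefs.txt"));  // stays inside: allowed
        try {
            sandbox.resolve("../../etc/passwd");                    // escapes the sandbox
        } catch (SecurityException expected) {
            System.out.println("Rejected: " + expected.getMessage());
        }
    }
}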

Now wrt the rigid package=directory convention, yeah Java sucks... [but - historically - compared to what? I always loved that I can cd to a Java project's root source folder and compile it trivially, without a makefile just to find the sources]. This is an obsolete facility now that we have IDEs, Ant, Maven; and few projects above HelloWorld complexity have a single source tree (I routinely work with ~150 projects in a single workspace, with a maze of dependencies). OK, just IDEs, as I mostly hate the latter two.

And it's worth noting that the Eclipse IDE has picked up part of the Smalltalk spirit (expected, given its VisualAge/Java origins). Inside the Eclipse workspace, the HFS-based source files are just the "official" representation of code to the outside world (programmer, command-line tools, SCM, etc.). Eclipse builds a ton of metadata about these sources, to allow efficient and powerful features like incremental compilation, browsing, Local History, etc. Part of this metadata is kept in memory; part is persisted in the workspace's .metadata folder in proprietary formats that bear no resemblance to Java's HFS-based conventions. In fact the workspace folder is very similar to a Smalltalk image, except that it's not a single monolithic file, and it can't store frozen JVM states (e.g. to continue your debugging session another day).

Gilad Bracha said...

John:

I stand corrected. I had forgotten about the Xerox Star (and never knew much anyway). I'll fix the post.

I agree hierarchies are very important - but they are not very useful for managing storage on a PC anymore. There's just too much of it.

Developers will continue to use such machines - but the vast majority of people won't, and this will affect developer perceptions as well.

Gilad Bracha said...

Patrick: I think we will see files used as a lowest common denominator for sharing. I agree that SCMs are the biggest problem (and I keep wanting to write about that).

Your IDE can hide all this from you though, so programmers need never care how code is represented as files.

Gilad Bracha said...

Osvaldo:

The fact that Eclipse goes to the trouble of pretending to view things through files is part of what I am railing against.

I agree with most of what you've said BTW. The comments on files in this post are closely tied to the overall trend toward more restricted platforms.

What this means is that the personal computer, as we know it, will become a high-end, exotic professional tool - as rare as large-format cameras, as opposed to point-and-shoot cameras or even SLRs.

Fred Blasdel said...

Osvaldo: There is no such thing as a "RESTful URL" — Resources, Requests, and Responses can be RESTful, but the URL strings are completely irrelevant — they might as well be completely opaque, because you should be finding them in the responses, not constructing them yourself.

Putting focus on human-readable URLs as a part of an API is one of the biggest and most pernicious REST anti-patterns.

Mike Milinkovich said...

An historical footnote: There was another way of doing Smalltalk development that wasn't solely image based. That was the ENVY/Manager repository from OTI. I notice a few names on the comments that knew it well :-)

That same repository technology was used to underpin the IBM VisualAge for Java product, an early but modestly successful Java IDE that basically did the sorts of things you're discussing in your post. It truly was a modern Java IDE, one which used an object-based repository under the hood.

The reactions to VA/Java were pretty interesting at the time. A lot of developers loved it, but many hated it. Some really disliked the fact that they couldn't get to "their" files. That experience led, I believe, in some small part to Eclipse being file-based.

The lesson is that moving away from files would be very hard. Not because either the technology or the user experience of files is better, but because of cultural inertia.

Gilad Bracha said...

Mike:

I am all too painfully aware of cultural inertia on this topic, and in programming in general. My point is not just that using files this way is inane - that much I've been saying for 20 years; my point is that we can finally see light at the end of the tunnel.

Users won't see files anymore - they'll search, just as today you use URLs and bookmarks less and less - you just search for stuff. And developers will eventually get it too.

Buddy Casino said...

"(collections of) objects".

I think I get what you mean, but do you know what's gonna happen? The structure that holds all these nifty searchable, semantically meaningful objects will be corrupted one day.

We all know this to be true, because in the olden days, when we were still using local (non-web-based) email programs, that's what would happen to our large Outlook PST file (or whatever it was called in your mail reader; I know it happened to Thunderbird too).

We couldn't easily manipulate its content, like adding, copying or removing things, since the only interface to that object store was the mail program itself, or maybe some third party tool that was designed to fix said issues.

And this is what I like about files: you can handle them without having to use any API whatsoever.

All the necessary abstractions are built into the OS. And not only in the crappy one that I use - in fact, almost every OS. I don't have to fire up my IDE or have any sort of tool installed; I don't need to write a query or make API calls; the OS can just do it.

If a running program has a file-based import/export interface, I can easily see which files got imported and which didn't. I don't need to attach a debugger to the process or sift through heaps of logging messages.

File systems alone are pretty basic, but that is their main strength.

The good news is that they can be easily enhanced. The OS should know how to extract text from PDFs and DOCs to make them searchable; you don't actually have to get it as wrong as Microsoft did.

You can add meaning to files by having metadata associated with them, to know what program can edit them and so on.

And your point about performance: thousands of small separate files (e.g. Java) slow things down, true. But in five years we will all have solid state storage, and the latencies associated with mechanical disks will be nothing but a faint memory that makes us shudder in hindsight.


If a new kind of universal storage format arises, I would welcome that, because the lowest common denominator we have today is FAT32 (which sucks). But it had better have built-in OS-level support from the start and not require any special tooling, or it will fail.

Sorry for the long rant, I had to let that one out.

Osvaldo Doederlein said...

Gilad(+Neal): What is your concrete suggestion? I have used a bit of ST (and a lot of VA/Java), and the "image" has its own problems. It's brittle, fragile. An opaque, binary, typically proprietary and hideously complex blob. If it's corrupt, you can't fix it. If the IDE/VM has no connector to a specific SCM, you can't use that SCM (ENVY was great for its time, but was also lock-in). As Mike remembers wrt VisualAge, the image was a love/hate thing. Both options (files/images) being imperfect, in the end the victory went to the option that was more open, simple, interoperable and robust.

Today, perhaps we could just stuff all code (and even objects, for an ST-like "live image" experience?) into a good embeddable/lightweight RDBMS, with a standard data dictionary for the core artifacts that many diverse tools must manipulate, and then extended tables for tool-specific features. This might be a best-of-both-worlds solution.
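
A hedged sketch of what that might look like (the schema, the table name, and the use of an in-memory HSQLDB driver via JDBC are my own assumptions, not a proposal from the thread): code artifacts become rows that any tool can query.

import java.sql.*;

// Illustrative sketch: storing compilation units as rows in an embedded
// relational database instead of as files. Assumes an embedded JDBC driver
// (here HSQLDB, in-memory) is on the classpath; the schema is hypothetical.
public class CodeStore {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:codestore", "SA", "");
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE TABLE compilation_unit ("
                       + " id INTEGER PRIMARY KEY,"
                       + " package_name VARCHAR(255),"
                       + " unit_name VARCHAR(255),"
                       + " source LONGVARCHAR)");

            stmt.execute("INSERT INTO compilation_unit VALUES "
                       + "(1, 'com.example.geometry', 'Point', 'public class Point { /* ... */ }')");

            // Any tool -- IDE, SCM, search -- can now query the code structurally.
            try (ResultSet rs = stmt.executeQuery(
                     "SELECT package_name, unit_name FROM compilation_unit ORDER BY package_name")) {
                while (rs.next()) {
                    System.out.println(rs.getString("package_name") + "." + rs.getString("unit_name"));
                }
            }
        }
    }
}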

Osvaldo Doederlein said...

Fred: I know the "RESTful URL" is just a convention, but it's a very popular one. In fact this convention came from Ruby on Rails, and it's being increasingly adopted by other web frameworks. Perhaps the focus is not human-readability but rather developers' love for hierarchies... they surely make things easy to organize and find. The Tree is such a lovely, no-brainer general-purpose data structure, as any XML devotee will remind us. ;-)

Gilad Bracha said...

Mike:

There's more in your comment than I can respond to, but, FWIW:

a. File systems get corrupted too.
b. Once you could tinker with electronics; it's hard to get inside a chip. Same with automobile engines. No user serviceable parts. Fact of life.

Buddy Casino said...

@Osvaldo: this is not a coincidence; organizing things hierarchically is actually embedded in the human brain. Our social structures - governments, corporations, communities - are organized that way; it is just natural for humans.
Unfortunately I forgot to tag the URL so I can't find the source for that claim, sorry.

Neal Gafter said...

@Osvaldo: My suggestion is to avoid encoding programming language semantics in the layout of the files in the file system. I don't really care if you use the file system to store the bags of bits that are the source files. I object to the layout in the file system being part of the language.

Gilad Bracha said...

Osvaldo:

A concrete proposal is a bit more than these margins can contain :-). But some random related thoughts:

a. Your suggestion may be fine.
b. Strongtalk is a Smalltalk w/o an image. Your program is preserved in both source and binary forms (admittedly a specific format), and recovering from crashes is usually a snap.
c. Your IDE is welcome to spit out files on an ongoing basis as a backup. I just don't want to see them.
d. SCMs are a pain. Some level of customized connector is inevitable.
e. Ultimately, I want the platform to be a generalized SCM and object database. This is what "objects as software services" advocates.

Osvaldo Doederlein said...

@Michael: No need for that lost URL; I can track evidence e.g. to the Sankhya (the Hindu metaphysical theory behind Yoga and other traditions). It contains elaborate, hierarchical descriptions of the cosmos and the human mind/body.

Alex Buckley said...

@Neal: It is true that the Java programming language does not associate a compilation unit with a module. The association is left up to the host system. File system layout may or may not factor into the host system's decision.

The reason for this approach is that modules are not just access control for a group of packages, i.e. are not just "super packages". Modules' true utility comes from the visibility information present in what must be a central specification, e.g. an OSGi manifest. Visibility has always been up to the host system; witness "observability" in the JLS.

The accessibility aspect of modules is secondary. It is possible to put module identification in individual Java compilation units, but it is not practical. If you work out a way to implement 'internal' accessibility in .NET without putting assembly information in individual compilation units, let us know!

Anonymous said...

I think the success of the file system is due to its conceptual simplicity and flexibility. See UNIX: everything is a file. Security is also rather simple in a file system: files have permissions attached, which is again a conceptually simple system.

The APIs like HTML5's that I have seen so far do not provide this flexibility, and they certainly don't provide any security. What keeps you from executing code that has been downloaded into some object store? HTML5 in particular doesn't even support hierarchies; objects are simply stored in a hash map.

Regarding the "higher abstraction level" above simple byte streams - yes and no. The problem is IMHO that there is no single data model that fits all. Files have the flexibility of storing anything you want, and if you want a relational data model, you simply use sqlite and store that in a file.

These newer APIs so far support tree-structured data entries (no graphs!), but there is nothing supporting, e.g., meaningful queries over these tree items, which is pretty bad. You don't even have the standard file system metadata.

So, I think we are certainly not there yet. I don't know of a really convincing programming model for data storage. I think the storing trees thing is a step in the right direction, but without meaningful queries it's not really helping all that much.

Though this is all of course unrelated to the question of whether programming languages should express their compilation units as files, which is clearly a bad idea.

Osvaldo Doederlein said...

@Neal, @Alex: Indeed, the JLS leaves package organization as a host-specific decision and only mentions the package=directory model as an example. (It's even funny to read this ancient part of the JLS: the example uses "gls" as a root package/folder for classes from Guy Steele, etc... nobody uses this convention today for personal projects (well, unless there are TLDs like gls).)

Even the restriction that a public top-level type live in a source file with an identical name is optional for the host system ("may"). From my reading of JLS 7.6, it's fine to write a compiler that allows "public class A {...} public class B {...}" all in a single file named "C.osvaldo". (The ".java" extension is specifically documented as not mandatory.) The classfile format only enforces the filename == top-level-type restriction, and IMHO that's a mistake, but for other reasons (efficiency of constant pool encodings, etc.). The classfile's package == folder restriction is arguably just an implementation detail of classloaders; I could write a valid classloader that does not enforce this in any way. Packaging/distribution formats like ZIP/JAR/Pack200 are mostly non-normative; only details like manifests are specified.
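
To illustrate (this is hypothetical; stock javac chooses to enforce the optional restriction and rejects it), a conforming but more permissive host system could accept a single file, named anything at all, containing several public top-level types:

// Hypothetical contents of a file named "C.osvaldo". The JLS leaves the
// file-name restriction up to the host system, so a permissive compiler
// could accept this; standard javac will not.
package demo;

public class A {
    public String greet() { return "hello from A"; }
}

public class B {
    public String greet() { return "hello from B"; }
}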

OTOH, the tradition of the behavior of javac and all standard classloaders and other tools is terribly strong. But we could change javac to relax these restrictions; this would break many pieces of the larger Java toolchain, but who cares? Next month we'd have updates of most of the important stuff like Ant, etc.; new JDK releases always break many of these tools with an updated classfile format that requires new low-level parsers.

Osvaldo Doederlein said...

@martin: In HTML5, I bet many people will just use strings like "this/path/name" to emulate the missing hierarchy, which is the best solution whenever one needs more than a handful of file^H^H^H^Hresources. ;-) The JavaFX 1.2 I/O API (javafx.io.Resource/Storage) is an interesting tradeoff: it does support hierarchical pathnames and even absolute pathnames (within the sandbox access restrictions for applets/JAWS), but at the same time it's abstract/high-level enough not to depend on full-blown filesystem services (so it's portable to JavaME). Still, its Resource objects look like files (named things that can be read/written with the standard java.io InputStream/OutputStream APIs), so it doesn't suck as much as MIDP's RMS even though it's actually less featured (no indexed-records layer atop the files). This (FX, not RMS) is the right model for ANY platform-level API: just give me simple, hierarchical files that are binary blobs; if I need anything better, like a SQL or other kind of database, I can implement that using files as storage.
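
For what it's worth, here is a small Java sketch of that path-emulation trick over a flat store (the class and key names are made up; an in-memory map stands in for whatever flat storage API the platform actually offers):

import java.util.*;

// Illustration only: faking a hierarchy on top of a flat key/value store
// by using slash-separated keys. The TreeMap stands in for the platform's
// flat storage API.
public class FlatHierarchy {
    private final NavigableMap<String, byte[]> store = new TreeMap<>();

    public void put(String path, byte[] data) {
        store.put(path, data);                       // e.g. "app/config/settings"
    }

    public byte[] get(String path) {
        return store.get(path);
    }

    // "List a directory" by collecting every key under the given prefix.
    public List<String> list(String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        List<String> children = new ArrayList<>();
        for (String key : store.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) break;      // past the end of the "directory"
            children.add(key);
        }
        return children;
    }

    public static void main(String[] args) {
        FlatHierarchy fs = new FlatHierarchy();
        fs.put("app/config/settings", "theme=dark".getBytes());
        fs.put("app/data/cache", new byte[0]);
        System.out.println(fs.list("app"));          // [app/config/settings, app/data/cache]
    }
}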

Neal Gafter said...

@AlexBuckley: Indeed, one could implement 'internal' accessibility in .NET by identifying the assembly in individual compilation units.

Alex Buckley said...

@Neal: One could. Yet they didn't. Why not?

Neal Gafter said...

@Alex: one doesn't need a reason not to do something. There was a great deal of history constraining the solutions available, which are different from the constraints on Java.

Alex Buckley said...

@Neal: So .NET doesn't need a reason to keep assembly identification out of compilation units; but Java does need a reason to keep module identification out of compilation units? Something isn't right. I do not believe the constraints on .NET were different from those currently faced by Java. Perhaps you could expand on them?

skrishnamachari said...

This harks back to my 2002 presentation on a database as backend storage for the CAD program development I was involved in. The entire team saw it as heretical to move away from DWG and DXF to a DB table structure. I see strong reasons beyond the oft-quoted ease, efficiency, speed, and security: the extended capability of distributed per-object locking, and of multiple people editing complex drawings synchronously.

Just to extend this thought down to a simple application level capability.

Neal Gafter said...

@Alex: It is too late to change .NET's handling of source assemblies, so there is no point attempting to redesign them, no matter how much the redesign might improve on them. It isn't too late to change Java's handling of modules. You can assume that .NET's solution is worth emulating. In that case, the compiler would define an assembly as the set of sources being compiled at once, and produce a binary assembly (jam file, presumably) directly instead of producing .class files. Or you can avoid that assumption and reason from first principles. You don't appear to have done either.

To add to that, one of the things that distinguishes the .NET software world from the Java software world is that Java has, traditionally, been based on the principle that the sources rule. There are a few exceptions - for example, *-imports don't explicitly name the things that they import, and you need to go to the target to see what you got. It is perfectly reasonable to develop Java code with emacs or vi. C#, on the other hand, was designed assuming that you write and read code with the assistance of tools. C#'s "using" directive is equivalent to Java's *-import. It is possible, but painful, to import names singly.

Many decisions that differ between Java and C# reflect these differing principles. It isn't reasonable to blindly emulate features of C# that were designed based on different principles.

Gilad Bracha said...

Neal:

"Java has, traditionally, been based on the principle that the sources rule"

Your sense of humor is so refined. Binary compatibility (JLS 13) is based on the idea of running programs that correspond to no legal source whatsoever.

Unknown said...

@Neal:
"But the compiler determines what module a Java source file belongs to not by something appearing in its syntax, but by its placement in the file system."
"I object to the layout in the file system being part of the language."

My understanding is that this is a convention followed by javac and not part of the language.
Is this not the case?

I assume the javac of JDK7 could just avoid such a file-location check and use "something appearing in [the] syntax".

Neal Gafter said...

@Tasos: javac can't use something appearing in the syntax unless the Java Language Specification describes that as part of the Java language, and I think that would be a good thing. But that does not appear to be what's happening.

Unknown said...

@Neal
Are you saying that javac can't use the package declaration in the source files?
I remember listening to Alex Buckley on a JavaPosse interview state that the JLS doesn't define any mechanisms and leaves it to the "host system".
I have checked the JLS chapter on packages and I haven't found Alex's comment to be incorrect.
Am I misunderstanding you and we talk about different things?

Neal Gafter said...

@Tasos: The package declaration says what package a source file is in, which is a completely different thing from what module it is in.

While it is true that the host system determines what packages are visible, that is generally done on the basis of what source and class files you give the compiler. If a given source file is visible, its meaning is defined by the language specification. The host system does not determine which classes are in which packages - that is defined by the source.

I would hope a similar mechanism would be used for modules as is used for packages (if the compiler is given the sources or class files, etc., then they're visible), but the draft specs for modules have the host system determine not only visibility but also which module each source file (and class file!) belongs to, and therefore what the meaning of the source (and class) file is.

Unknown said...

Slides 15 and 16 describe the issues around module membership in source: openjdk.java.net/projects/jigsaw/doc/ModulesAndJavac.pdf

@Neal: What are your counter-arguments to the ones in the slides? Or the benefits/issues of an alternative approach?

Neal Gafter said...

@Tasos: The benefits of module declarations in the source are to the human readers of the code. By contrast, the "points" in slide 15 are about tools, not humans. A programming language should be designed for the benefit of the human programmers.

Suggesting that there is a benefit in a module system being able to override what appears in the source makes no sense. That seems to be the basis for not supporting module declarations. It also makes no sense to say that the system is "agnostic" to module declarations appearing in the source if there is no source syntax for them, and any source syntax could be overridden and not enforced at runtime. The logic is impeccable - because the current module specification ignores any module declaration in source, they might as well not be supported. But they should not be ignored in source.

As for the disadvantages on slide 16: most of them don't seem to be disadvantages. "Makes easy case hard + hard case easy" could be considered a disadvantage, but adding module declarations to the sources actually makes the easy case easy too.

Norbert Hartl said...

Gilad,

I agree with your article. But I don't really care whether it is files or a database. The strange thing to me is that we take it for granted that volatile object memory and persistence have to be two separate things, and that this distinction is natural.

To me it is not. Whatever kind of object memory I choose, I want all data to be just in there, kept with proper metadata attached. This is the reason why I'm doing most of my work with GemStone. And I'm really interested in every LOOM-like architecture that will appear in the near future. Using filesystems is just one solution to the shortcomings of main memory in computers. The design in which persisting is a manual act reminds me of manual memory management. I want a garbage collector that manages my memory usage, and I want something like LOOM that manages storage automatically.

With the advent of SSDs on the market, we get the impression that the gap between main memory and disk spindles can be narrowed. If we imagine that there is no gap, and all of main memory is persisted continuously, who will argue about files? If there is no gap, we don't have to deal with files, and source code changes will be more like changesets than source code file diffs. What a wonderful world!