Ignorance Compounded

Wednesday, May 6, 2009

I Don't Quite Get Dependency Injection

Here's how you write a program that prints out "hello world" in ordinary Java:


public class Hello {
  public static void main(String[] args) {
    System.out.println("Hello world!");
  }
}

I'm reading Spring In Action, and this excellently written book starts out by writing "hello world" in Spring-enabled Java. It's about ten times as long; I won't reproduce it here. The main benefit is that the "Hello world" string is in an XML file, instead of in a Java file, so that it's easier to change.

Or something. Personally, I'd rather edit Java than XML. In my experience as a developer, it's actually much easier to write cryptic bugs in XML (or more broadly, in configuration code of any sort) than in Java. This is because the rules of Java (or other formal languages, such as C, C++, heck, even Perl) are more standardized, better defined, easier to test, and better documented than those of tool-specific configuration languages such as Spring's. If I change a character string in Java, I can predict quite well what will happen, and I can watch it happen in the debugger. If I change a string in an XML configuration file, I have no way to know who is reading that file or what they are doing with it. It's magic.
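
For comparison, the core idea that a framework like Spring automates, handing an object its collaborators rather than letting it create them, can be sketched in plain Java with no container at all. This is not the book's example and it uses no Spring API; the names GreetingService, FixedGreetingService, Greeter, and HelloInjected are made up for illustration.

```java
// Hypothetical names throughout; this is plain Java, not Spring's API.
interface GreetingService {
  String greeting();
}

// The "dependency": where the text comes from is decided outside the caller.
class FixedGreetingService implements GreetingService {
  private final String text;

  FixedGreetingService(String text) {
    this.text = text;
  }

  @Override
  public String greeting() {
    return text;
  }
}

// The caller receives its dependency through the constructor
// instead of constructing it (or hard-coding the string) itself.
class Greeter {
  private final GreetingService service;

  Greeter(GreetingService service) {
    this.service = service;
  }

  String greet() {
    return service.greeting();
  }
}

class HelloInjected {
  public static void main(String[] args) {
    // This wiring step is the part a container would read from configuration.
    Greeter greeter = new Greeter(new FixedGreetingService("Hello world!"));
    System.out.println(greeter.greet());
  }
}
```

Here the wiring is one line of Java in main(), visible to the compiler and the debugger; roughly speaking, a container moves that line into a configuration file.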

In the 1970s, when I first learned to program, the field was predominantly procedural: a program was a bunch of instructions to be followed one after another. The instructions might involve reading data from files and acting on those data, but the data did not modify the program itself; in fact we believed at the time that it was very poor form to let a program be self-modifying, because it made it hard to understand how it would behave, so we were very reluctant to consider the data files to be part of the program.

Thirty years later, it feels to me like the software industry has moved very strongly toward configuration-based programming. We want our procedural (Java) code to do less and less, and instead we want to control behavior by way of increasingly complex configuration. A program I'm currently working on has more configuration files than it does Java files, and they are spread over more directories on the hard drive. The procedural part is all in one language: Java. The configuration part is in EHCache, Hibernate, Spring, Maven, JDBC, and Log4J, each with its own cryptic syntax that is subject to change with every version and that is documented in bits and pieces on web forums and in spottily available books.

As so often, I find myself wondering if the emperor has any clothes. Is it really easier to program this way? Or is it just different?

I wonder whether, instead, we should focus on figuring out what is "hard" about procedural programming, and work on making that easier, within the confines of a formally defined, easy-to-read language. For instance, IDEs could easily help with changing dependencies - in fact, an IDE could look at all the pieces of a program, determine all the external dependencies, and present them as a single view, allowing the programmer to substitute equivalent components.

Monday, April 6, 2009

Improving Maven

I'm still at loggerheads with Maven. There are some specific problems that bother me and that I think could be improved while preserving its basic Maven-ness. Many of the problems arise when developing more than one component at a time, i.e., when depending on SNAPSHOT versions. In fact a SNAPSHOT dependency is a fairly good indicator of misery; but some small improvements in Maven could reduce that pain point by a lot.

1. Maven knows that source produces artifacts, but it doesn't know that artifacts come from source. When Maven checks dependencies, it looks in local and remote repositories, but it doesn't know enough to build (or re-build) an artifact from source.

If Maven artifacts (at least locally deployed ones) had a backpointer to the location of the source project that produced them, then it would be possible to check dependencies against source. For instance, if project B depends on A, and I touch project A's code and then rebuild B, project A should also get rebuilt and installed. Similarly, if artifact A came from local source, then it almost certainly should NOT get replaced by an "updated" artifact from a remote repository, even if the remote artifact is newer; rather, local source code should always get honored. Extra points for a "mvn svn:update" command or the like, that would transitively sync the version control system to the latest code in all upstream projects.

2. SNAPSHOTs need to be versioned. When you're collaboratively working on two projects, and an API between them changes, the downstream build is broken until the upstream project gets refreshed. But right now that happens in a nondeterministic, asynchronous way: to Maven, all SNAPSHOTs are identical until, around midnight or so, it decides to refresh. Basing refreshes on an update interval is like filling up your car's gas tank every Friday: it's either too soon or too late. This needs to be deterministic. What I really want from SNAPSHOT is the idea of a fine-grained version number, that I will throw away upon release. It could be as simple as letting me say 1.3.0-SNAPSHOT-002, instead of just 1.3.0-SNAPSHOT.
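
To show what "deterministic" could mean here, the following is a rough sketch of comparing two numbered snapshots. It is purely illustrative: Maven does not parse versions this way, and the class NumberedSnapshot, its parse() and isNewerThan() methods, and the rule that a missing suffix counts as 0 are all assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical: illustrates the proposed "1.3.0-SNAPSHOT-002" scheme only.
final class NumberedSnapshot {
  // Base version, the literal "-SNAPSHOT", then an optional numeric suffix.
  private static final Pattern FORMAT = Pattern.compile("(.+)-SNAPSHOT(?:-(\\d+))?");

  final String base;   // e.g. "1.3.0"
  final int sequence;  // e.g. 2; assumed to be 0 when no suffix is given

  private NumberedSnapshot(String base, int sequence) {
    this.base = base;
    this.sequence = sequence;
  }

  static NumberedSnapshot parse(String version) {
    Matcher m = FORMAT.matcher(version);
    if (!m.matches()) {
      throw new IllegalArgumentException("not a SNAPSHOT version: " + version);
    }
    String suffix = m.group(2);
    return new NumberedSnapshot(m.group(1), suffix == null ? 0 : Integer.parseInt(suffix));
  }

  // A refresh decision based on the version itself, not on the time of day.
  boolean isNewerThan(NumberedSnapshot other) {
    if (!base.equals(other.base)) {
      throw new IllegalArgumentException("different base versions: " + base + " vs " + other.base);
    }
    return sequence > other.sequence;
  }
}
```

With something like this, a downstream build that asks for 1.3.0-SNAPSHOT-002 either has it or doesn't; there is no update interval to wait out.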

3. Maven assumes that the internet is fast and reliable. It is neither, as anyone who works from coffeeshops and airports knows all too well. When Maven fails to get a network connection, or the network dies midway or times out, it needs to be able to roll back to a known and working state. Among other things, this means that updates need to be atomic across projects, or at least they need to be nondestructive. It also means that basic help should not rely on a network connection. Maven should not attempt to update plug-ins or projects during a 'clean' or 'help' operation.

I've got other problems with Maven - for instance, I think XML is nasty to work with, and I think that "convention over configuration" translates to "doesn't play well with others." Those things are harder to address while still keeping it Maven. But if the above three improvements were made, I think no one who loved Maven would be harmed, and a lot of other folks would be helped.

Thursday, April 2, 2009

Eclipse awards

I spend most of my professional life feeling ignorant of one thing or another, so in the few areas where I do know at least a little bit I try to help out. For that reason I post pretty often to the eclipse.newcomer newsgroup. I'm proud to have been a finalist for the Eclipse Newcomer Evangelist award for the second year in a row.

Wednesday, March 11, 2009

Sharing Data

I recently discovered Joe Armstrong's post "Why OO Sucks". I didn't agree with much of what he said, but I was struck by one of his claims, "functions and data structures belong in totally different worlds." This is of course the antithesis of the OO (object-oriented) programming philosophy, which holds that you should lump data together with the sets of rules and actions that apply to it.

I disagree. Data is only useful if it has integrity, and it only has integrity if there are rules that govern how it is read and changed. "Rules" is just another word for "code", so this argues that code should be tightly associated with data.

An "object" in software jargon is just a way to expose data while still wrapping it in a decent amount of clothing. Within an application, objects make it safer to work with data, because you can ensure that no matter what you do the data is still valid.
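
As a small illustration of rules travelling with data, here is a sketch in Java. The class CartItem and its rule (a quantity must be at least 1) are hypothetical, chosen only to show the shape of the idea.

```java
// Hypothetical example: the rule "quantity is at least 1" lives next to the data it protects.
final class CartItem {
  private final String sku;
  private int quantity;

  CartItem(String sku, int quantity) {
    if (sku == null || sku.isEmpty()) {
      throw new IllegalArgumentException("sku is required");
    }
    this.sku = sku;
    setQuantity(quantity);
  }

  String getSku() {
    return sku;
  }

  int getQuantity() {
    return quantity;
  }

  // The only way to change the quantity, so the rule cannot be bypassed.
  void setQuantity(int quantity) {
    if (quantity < 1) {
      throw new IllegalArgumentException("quantity must be at least 1, got " + quantity);
    }
    this.quantity = quantity;
  }
}
```

Because the field is private and every change goes through setQuantity(), a rejected update leaves the item exactly as it was.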

The problem with objects is that they're not easily shared between applications. They're very ephemeral; they live only as long as they're contained in a running program, and they're tightly coupled to all the details of that particular program. In Java every object is an instance of a particular class, and every class is associated with the classloader that produced it, and classloaders in turn are associated with a single instance of a single application. If you try to put an object into a different application, it looks around for its classloader and, not finding one it recognizes, gets scared and shy.

The usual solution to this is to convert the object into raw data (often some sort of text representation, like XML) for long enough to transport it to a different application, and then in that application a new object is created by reading in the raw data and associating it with a hopefully compatible class from a hopefully compatible classloader. This is slow, expensive, and inaccurate. For instance, there's no way to guarantee that the classes are truly identical, so if an object moves from application A to B and back again, it might come back in an illegal state. Also, it often requires the programmer to write a lot of code to spell out how to read and write the object.
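
For readers who haven't done this by hand, here is a minimal sketch of that round trip using Java's built-in serialization rather than XML. The Cart and RoundTrip classes are hypothetical, and everything happens inside one JVM, so the class on both ends is trivially the same; between two real applications that is exactly the part nobody can guarantee.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical payload class, standing in for whatever object is being shared.
class Cart implements Serializable {
  private static final long serialVersionUID = 1L;

  final String owner;
  final int itemCount;

  Cart(String owner, int itemCount) {
    this.owner = owner;
    this.itemCount = itemCount;
  }
}

class RoundTrip {

  // Object -> raw bytes, the form in which it would travel to another application.
  static byte[] toBytes(Serializable object) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
      out.writeObject(object);
    }
    return buffer.toByteArray();
  }

  // Raw bytes -> a brand-new object, built from whichever Cart class the reader can load.
  static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return in.readObject();
    }
  }

  // Convenience wrapper that performs both steps in a single JVM.
  static Cart copy(Cart original) {
    try {
      return (Cart) fromBytes(toBytes(original));
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("round trip failed", e);
    }
  }

  public static void main(String[] args) {
    Cart original = new Cart("alice", 3);
    Cart copy = copy(original);
    System.out.println("same object? " + (copy == original));
    System.out.println(copy.owner + " has " + copy.itemCount + " items");
  }
}
```

Note that what comes back is a different object that merely carries the same field values, and even this simple case needs a Serializable marker and a serialVersionUID to keep the two ends in agreement.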

Terracotta is a way to spread the work of a software application across a large number of computers. We allow objects to move freely from one computer to the next. Under the covers we do still convert the objects to raw data and back, but we do it in a way that is quite efficient and transparent to the application programmer. We're great at moving objects from computer to computer in a single application. Up until the latest release, however, we weren't much good at moving objects between different applications. Now we are.

The basic idea here is that even though the classloaders for two different applications are different, as long as they both contain a definition of the class being shared (and the other classes that it in turn needs to access), that's good enough. The computer has no way of knowing whether that's true; but the programmer does. So, we let the programmer tell Terracotta which applications are allowed to share classes with each other. The configuration feature is called "app-groups", and I'd point to the documentation in this post, but it's not up on the web site quite yet. It's quite simple to use; you just define an app-groups element in the Terracotta configuration file, give it a name, and inside it you list all the applications that you want to be able to share objects with each other.

A typical use case would be if you've got a user-facing application and also an administrative application. Imagine, for instance, a merchant site that lets users build up a shopping cart. Using Terracotta you might avoid storing that shopping cart in your central database, to reduce database load; instead, you'd keep it as transient data, getting session scalability and server failover from the Terracotta system instead. But suppose you want to let a sales agent view a customer's shopping cart, to make recommendations or fix problems. How can you share the transient shopping cart data between the customer-facing application and the agent-facing administrative application? One idea is to keep the list of shopping carts as a shared root in both applications, and then place both applications in the same app-group in the Terracotta configuration. No database required; transient data is still transient; no custom serialization code or data format definitions required. Just transparent sharing of objects between two otherwise different applications that both happen to include the same Java class definitions.

There are still some caveats, of course. One ugly one is that the different applications have to be running in different Java virtual machines. That is, you can't have a single application server instance and deploy both applications to it. That's for internal technical reasons that we hope to eliminate in a future release. For now, you'd have to put the sales-agent application on a separate app server instance (although it could be running on the same physical computer). Another caveat is that you can't have multiple overlapping groups (like, A can share with B and B with C but not A with C), and you can't restrict sharing to only certain objects or roots; it's application-wide. Caveats notwithstanding, I think it's a powerful new feature, and it'll be interesting to see what new uses of Terracotta this enables.

Monday, March 9, 2009

Circus Contraption

As long as we're talking about things I'm proud of, let me mention the new show that I'm doing sound for, Circus Contraption. I did the sound system design and installation, and the overall sound design, and I'm sharing the night-to-night mixing duties with two other sound guys. If you happen to be in Seattle on a Friday, Saturday, or Sunday in the next couple months, try to see the show! It's the real thing. The sword swallower actually swallows the sword, it's not just stage magic.

Doing live sound is very, very different from writing software. You do not get a chance to fix bugs, and you cannot take things slowly or stop to have a design discussion. Every night is different: the performers sing and play softer or louder, there are more or fewer people (read: sound-absorbing sacks of water) in the audience, equipment that worked the last night breaks this night, someone trips over a wire or forgets to turn on their microphone. You do what you can and move on. Frankly, I'd probably be a better software developer if I treated software more like live sound.

Wednesday, March 4, 2009

Terracotta

I write software for Terracotta, which is an open-source company. I love working on open source code, in part because what I do is not a secret - I can tell my geeky friends about the cool problems that I wrestle with. (I also work on Eclipse, another open source project.)

Paradoxically, though, it seems like in the open source world it is often very hard to talk about who my customers are. Partly that's because we don't always know - anyone can download the product for free. But also it's because our paying customers don't always want their competition to know how they succeed, and we of course need to honor their confidentiality.

So I'm really pleased that Terracotta has lately been getting some great press about one of our important customers, Sabre Holdings. They're perhaps most commonly known for one facet of their business, Travelocity. Sabre is huge - according to one article, "On any given day, Sabre's servers have to be able to handle up to half a billion transactions a day and a peak volume that can go up to 32,000 transactions per second."

How do they get that kind of volume, and the reliability that has to go with it? Answer: they run their mission-critical, high-volume stuff on Terracotta, my software. Yes, I'm proud :-)

Thursday, February 12, 2009

What should code comments do?

Below I've posted some code I just had to look at. I've got nothing against this code; it's a nice clean class, simple, I'm not aware of any bugs in it.

It's easy to figure out what this code does, just by looking at it. It takes a slash-delimited string ending in "war", like the one in main(), and deletes the third token if it contains only decimal digits.

But WHY? What problem does this class solve? What is Geronimo, and why is the string "war" important?

I can't help but think that someone discovered the need for this code the hard way, after time spent looking at Geronimo code or documentation, talking with peers, perhaps after fixing a bug report from the field. All that information has now been lost.

Perhaps the need for this applied only to a particular version of Geronimo. Perhaps it only turns up in a peculiar use case. Perhaps the original developer's understanding was flawed and this code is never actually needed. There's no way to know, and anyone who encounters this code in the future will have to try to figure out how not to break it. Very likely, it actually does do something important but it's not covered in the test suite, and any breakage will be discovered as a regression in the field, when some user tries to update to the latest product version and their application no longer runs.

It's like a post in the middle of the living room: you figure it's probably supporting some weight above, but how do you know? So you can't remodel the room, because the second floor might collapse. But maybe the builder put it there because they were planning on a hot tub on the floor above, where now you've got a walk-in closet. Now you've got to hire a structural engineer to do the same calculations again, because the original rationale has been lost.

Well-written code shouldn't need to explain what it does. But it should explain why it does it. What other options were considered? In what situations is the code necessary?


public class GeronimoLoaderNaming {

  public static String adjustName(String name) {
    if (name != null && name.endsWith("war")) {
      String[] parts = name.split("/", -1);
      if (parts.length != 4) { throw new RuntimeException("unknown format: " + name + ", # parts = " + parts.length); }

      if ("war".equals(parts[3]) && parts[2].matches("^\\d+$")) {
        name = name.replaceAll(parts[2], "");
      }
    }

    return name;
  }

  public static void main(String args[]) {
    String name = "Geronimo.default/simplesession/1164587457359/war";
    System.err.println(adjustName(name));
  }

}
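
With the rationale missing, about the best a later maintainer can do is pin down the current behavior before touching anything. A rough sketch of such a check follows. The class name GeronimoLoaderNamingCheck and the chosen inputs are assumptions, and the expectations record only what the listing above does today, not what Geronimo actually requires.

```java
import java.util.Objects;

// Characterization sketch: records what adjustName() does today, not what it should do.
class GeronimoLoaderNamingCheck {

  // Copied from the listing above so that this sketch compiles on its own.
  static String adjustName(String name) {
    if (name != null && name.endsWith("war")) {
      String[] parts = name.split("/", -1);
      if (parts.length != 4) {
        throw new RuntimeException("unknown format: " + name + ", # parts = " + parts.length);
      }
      if ("war".equals(parts[3]) && parts[2].matches("^\\d+$")) {
        name = name.replaceAll(parts[2], "");
      }
    }
    return name;
  }

  static void check(String input, String expected) {
    String actual = adjustName(input);
    if (!Objects.equals(expected, actual)) {
      throw new AssertionError("for " + input + ": expected " + expected + " but got " + actual);
    }
  }

  public static void main(String[] args) {
    // A numeric third token is removed; both surrounding slashes stay behind.
    check("Geronimo.default/simplesession/1164587457359/war",
          "Geronimo.default/simplesession//war");
    // A non-numeric third token is left alone.
    check("Geronimo.default/simplesession/abc/war",
          "Geronimo.default/simplesession/abc/war");
    // Names that do not end in "war", and null, pass through untouched.
    check("Geronimo.default/simplesession/1164587457359/ear",
          "Geronimo.default/simplesession/1164587457359/ear");
    check(null, null);
    System.out.println("current behavior matches the recorded expectations");
  }
}
```

A check like this still can't say why the digits are removed, or whether the leftover double slash is intended; only a comment from someone who knew could do that.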