Wednesday, March 11, 2009

Sharing Data

I recently discovered Joe Armstrong's post "Why OO Sucks". I didn't agree with much of what he said, but I was struck by one of his claims, "functions and data structures belong in totally different worlds." This is of course the antithesis of the OO (object-oriented) programming philosophy, which holds that you should lump data together with the sets of rules and actions that apply to it.

I disagree. Data is only useful if it has integrity, and it only has integrity if there are rules that govern how it is read and changed. "Rules" is just another word for "code", so this argues that code should be tightly associated with data.

An "object" in software jargon is just a way to expose data while still wrapping it in a decent amount of clothing. Within an application, objects make it safer to work with data, because you can ensure that no matter what you do the data is still valid.

The problem with objects is that they're not easily shared between applications. They're very ephemeral; they live only as long as they're contained in a running program, and they're tightly coupled to all the details of that particular program. In Java every object is an instance of a particular class, and every class is associated with the classloader that produced it, and classloaders in turn are associated with a single instance of a single application. If you try to put an object into a different application, it looks around for its classloader and, not finding one it recognizes, gets scared and shy.

The usual solution to this is to convert the object into raw data (often some sort of text representation, like XML) for long enough to transport it to a different application, and then in that application a new object is created by reading in the raw data and associating it with a hopefully compatible class from a hopefully compatible classloader. This is slow, expensive, and inaccurate. For instance, there's no way to guarantee that the classes are truly identical, so if an object moves from application A to B and back again, it might come back in an illegal state. Also, it often requires the programmer to write a lot of code to spell out how to read and write the object.

Terracotta is a way to spread the work of a software application across a large number of computers. We allow objects to move freely from one computer to the next. Under the covers we do still convert the objects to raw data and back, but we do it in a way that is quite efficient and transparent to the application programmer. We're great at moving objects from computer to computer in a single application. Up until the latest release, however, we weren't much good at moving objects between different applications. Now we are.

The basic idea here is that even though the classloaders for two different applications are different, as long as they both contain a definition of the class being shared (and the other classes that it in turn needs to access), that's good enough. The computer has no way of knowing whether that's true; but the programmer does. So, we let the programmer tell Terracotta which applications are allowed to share classes with each other. The configuration feature is called "app-groups", and I'd point to the documentation in this post, but it's not up on the web site quite yet. It's quite simple to use; you just define an app-groups element in the Terracotta configuration file, give it a name, and inside it you list all the applications that you want to be able to share objects with each other.

A typical use case would be if you've got a user-facing application and also an administrative application. Imagine, for instance, a merchant site, that lets users build up a shopping cart. Using Terracotta you might avoid storing that shopping cart in your central database, to reduce database load; instead, you'd keep it as transient data, getting session scalability and server failover from the Terracotta system instead. But suppose you want to let a sales agent view a customer's shopping cart, to make recommendations or fix problems. How can you share the transient shopping cart data between the customer-facing application and the agent-facing administrative application? One idea is to keep the list of shopping carts as a shared root in both applications, and then place both applications in the same app-group with the Terracotta configuration. No database required; transient data is still transient; no custom serialization code or data format definitions required. Just transparent sharing of objects between two otherwise different applications that both happen to include the same Java class definitions.

There are still some caveats, of course. One ugly one is that the different applications have to be running in different Java virtual machines. That is, you can't have a single application server instance and deploy both applications to it. That's for internal technical reasons that we hope to eliminate in a future release. For now, you'd have to put the sales-agent application on a separate app server instance (although it could be running on the same physical computer). Another caveat is that you can't have multiple overlapping groups (like, A can share with B and B with C but not A with C), and you can't restrict sharing to only certain objects or roots, it's application-wide. Caveats notwithstanding, I think it's a powerful new feature, and it'll be interesting to see what new uses of Terracotta this enables.

No comments: