Monday, October 27, 2008

Maven Sucks

Tools like Maven make me want to quit the software industry before I die of frustration. I've wasted basically the last two days screwing around with problems related to Maven. In a nutshell, Maven does not provide a consistent or reliable build environment.

Modern software is not written from scratch; it's assembled from myriad components from other vendors, in much the same way that car manufacturers buy their parts (seats, brakes, wiper motors, radios) from other companies. And just like a car seat is itself made of springs and fabric and metal bits that come from yet other manufacturers, each software component may itself be assembled from other components.

Maven is a tool to help manage all the little pieces that fit together into a software product - to manage the fact that my product consists of components X, Y, and Z, which in turn require the presence of P, Q, R, and S, which in turn require A, B, and C; to fetch those components from wherever they come from if they're not already on my computer; and to make sure that the versions of these different components are all compatible.

But it sucks. Maven wants to update things without me asking, so that if I run a test twice in a row I don't necessarily get the same results. Maven wants to download things without me asking, so a task that took 15 seconds the last time might take a few minutes, or fail utterly (if I lose my network connection), leaving me in an unknown and inconsistent state where I can't build at all. And most frustratingly, Maven is itself highly modular: pieces of it aren't downloaded until they're needed, so in the event of network trouble it can again fail utterly. Networks are not yet reliable enough for that to be okay.

Further, Maven seems to do a crappy job of understanding and resolving or reporting version conflicts. I'm extending some test code that's supposed to exercise version 3.2.5 of product HappyWidget. The test code therefore tells Maven that it needs HappyWidget version 3.2.5. But the test code also needs OtherThing version 2.7.1, which needs FussyBit 2.0.1, which needs HappyWidget 3.1.2. Whoops! This is a version conflict. It would have been easy to fix, except that Maven didn't tell me there was a problem; it just silently deployed HappyWidget 3.1.2. So all along this test was actually testing HappyWidget 3.1.2. What's really scary is that when I uninstalled and then reinstalled everything, without explicitly making any changes in my dependencies, the behavior changed - now it's deploying 3.2.5, which is nice for my test but not so nice for the corresponding code that is supposed to test HappyWidget 3.1.2.

The first requirement of a build tool is that it should behave predictably. Maven fails.

It's late and I'm not going to spend more time saying all the hateful things that Maven has done to me in the last two days. I will blog in the future about what I think the solution to this should be.

Thursday, October 23, 2008

Synchronization defines structure

Consider the following code snippets, from some Aspectwerkz code:

public synchronized ClassInfo[] getInterfaces() {
    if (m_interfaces == null) {
        m_interfaces = new ClassInfo[interfaces.length];
        // etc.
    }
    return m_interfaces;
}

/**
 * Returns the super class.
 *
 * @return the super class
 */
public ClassInfo getSuperclass() {
    if (m_superClass == null) {
        m_superClass = m_classInfoRepository.getClassInfo(superclass.getName());
        // etc.
    }
    return m_superClass;
}


Notice that the first method is synchronized, and the second is not. How come? Is this a bug in Aspectwerkz? Both methods seem to require synchronization, because the "check something and then conditionally change it" pattern is otherwise unsafe.
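To be concrete about what can go wrong, here's a minimal made-up sketch of that check-then-act pattern with no synchronization at all (the class, field, and types are mine, not AspectWerkz's):

// Hypothetical illustration, not AspectWerkz code: unsynchronized
// check-then-act lazy initialization.
class LazyInterfaces {
    private Object[] m_interfaces;   // no lock, not volatile

    public Object[] getInterfaces() {
        // Two threads can both observe null here...
        if (m_interfaces == null) {
            // ...and both run the initialization, each publishing its own
            // array; worse, with no happens-before relationship a second
            // thread has no guarantee of seeing a fully constructed array.
            m_interfaces = new Object[] { "expensive result" };
        }
        return m_interfaces;
    }
}
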

My inclination is to say that it's just a bug. But it might not be; there might be some external reason that the second method does not need to be synchronized here. For instance, it might always be called from within another synchronized block (though the fact that it's got public access scope makes this hard to enforce).

The point here is that synchronization (almost by definition) implies a particular call structure: to correctly synchronize a particular body of data, you need to know how that data will be accessed, by whom, in what possible sequences of events. You can't just put the "synchronized" keyword in front of every method, because over-synchronization leads to deadlock; you can't just synchronize every method that changes state, because you won't get the right visibility guarantees. You have to actually know what the code is doing, to correctly synchronize it.

This is a huge problem for two reasons. First, while you're coding, you're changing structure, so it's hard to keep up; thus, synchronization bugs creep in. In the above example, it's possible that the second method was originally private (and always known to be called from within some other synchronized block), and then someone changed it to be public without updating the synchronization. Second, it makes it much harder to change code locally: you have to understand the overall behavior of the code in more detail than would otherwise be needed.

Which brings me to the main point: unlike a lot of code, synchronization is not self-documenting. It is simply too fragile and opaque. I cannot look at the above code and figure out what pattern the developer had in mind, what assumptions s/he was making. When maintaining code, I want to preserve assumptions or else systematically and thoroughly change them. I can't do that if I can't even discern them.

As a side note, isn't that Javadoc something special? Really, "returns the superclass" is easy to figure out from the method name. What I need to know from this method's documentation are things like "under what circumstances might it return a null?" and "are there initialization requirements before this method can safely be called?".

Wednesday, October 22, 2008

I Blame My Tools

Computer science was one of the things I studied a long time ago in school. We learned how well-chosen algorithms can operate on well-chosen data structures to achieve powerful results. The work is highly conceptual and rather intuitive - like solving differential equations, it relies more on a flash of inspiration, of being able to see the problem in the right way, than on methodically "turning the crank". There are no good algorithms to generate good code.

That may be a fair description of computer science, but it describes only a very small part of the day-to-day work of computer programming. The reality of computer programming, at least for me, is that most of my time is spent wrestling with tools and technologies that don't do what they're supposed to. Metaphorically speaking, I don't get to envision graceful bridges and soaring skyscrapers; instead I futz around with a load of concrete that won't set, a lumber delivery that's delayed till next week, and an extension cord that doesn't reach from the outlet to my power saw.

I'm trying to do a little bit of programming work today. I use the Eclipse programming editor. Somehow the shortcut I use to start multiple instances of Eclipse on the Mac got turned back into a plain text file - I have no idea how. I got that sorted after half an hour or so. Now I want to build my project but I can't because someone added a dependency on another module of code that I don't have. I downloaded that, and built it, but in so doing I triggered some sort of version check and now it's complaining that my version of Maven, the build tool, is impermissibly out of date. (That Maven versions matter at all is a sign that Maven is trying to do way too much.) So now I need to download and install a new version of Maven. This is what my day has been like, all day long.

I know, it's a poor craftsman who blames his tools. But I blame them anyway.

Follow-up: my command-line tar utility (the Mac equivalent of "unzip") won't recognize the format of the Maven download file. Finder won't let me copy the files to the directory they need to go in; I don't have permissions. On my own machine.

Follow-up #2: I used sudo to let me copy the files. I used diff to see if the settings file had changed. It shows me that every line is different in some mysterious way that is not evident from looking at the files - perhaps the line endings changed between Windows and Unix style? Anyway, ignoring that, I then futzed around trying to change my old symbolic link for Maven to point to the new copy. That took a bunch of googling because I don't know how to create and delete links on Unix. All this is just so that I can run the build tool, to build the project that the changes I'm supposed to be working on will affect. I haven't even begun to actually do the work I'm supposed to be doing. It's quarter till 5.

Friday, October 10, 2008

Synchronization and Relativity

Thinking about state as a way of reasoning about synchronization seems like a good approach. But the problem is, the concerns I have about synchronization are often about execution: will it deadlock? Will there be lock contention? Reasoning purely about state leads me to write programs where the data is always correct but nothing can actually complete. Both state and execution are important, and they're kind of like matter and energy, seemingly unrelated concepts.

Einstein managed to show that energy and matter were actually two sides of the same coin: that a certain amount of matter was, in fact, equivalent to a certain amount of energy, if you chose the right units. Particle physicists took this and ran with it, coming up with the idea of symmetry breaking and explaining the circumstances it takes for the coin to flip. I need someone to do the same for multithreading. I want a theory of multithreading relativity that explains how given multithreading constructs act on execution and state, and what the symmetries are.

For instance, if you protect state with a mutex (in Java, a "synchronized()" block), then you can debug deadlocks by asking the system what threads own what locks. But you can also protect state by waiting for a notification event; this is a common way to implement a reader/writer pattern. The result is similar: the state is protected at the expense of some thread being blocked. But when a system deadlocks because every thread is waiting for a notification, there's no way to ask the system which thread was supposed to send it.
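To make the comparison concrete, here's a minimal made-up sketch (not from any real codebase) of the same "don't read the result until it's ready" rule expressed both ways:

// Hypothetical sketch: the same guarded state, protected by a monitor
// and by a wait()/notify() handshake.
class Guarded {
    private final Object lock = new Object();
    private String result;           // guarded by 'lock'
    private boolean ready = false;   // guarded by 'lock'

    // Mutex side: a thread blocked here shows up in a thread dump as
    // waiting on 'lock', and the dump tells you which thread owns it.
    public void setResult(String r) {
        synchronized (lock) {
            result = r;
            ready = true;
            lock.notifyAll();   // wake any readers parked below
        }
    }

    // Notification side: a thread parked in wait() isn't "owned" by
    // anyone; nothing records which thread was supposed to notify it.
    public String getResult() throws InterruptedException {
        synchronized (lock) {
            while (!ready) {
                lock.wait();
            }
            return result;
        }
    }
}
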

You'll never make a program hang by removing a synchronized(), but you might make a program hang by removing a notify(). On the other hand, you'll never make a program corrupt data by removing a notify(), but you might make a program corrupt data by removing a synchronized(). Is this a real symmetry?

Similarly, it's possible (but maybe not algorithmically possible) to look at code and see at least the hallmarks of deadlockability: for instance, code that takes nested locks out of order. Is there an equivalent analysis for code based on wait() and notify()?
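For the lock half of that question, the hallmark really is visible right in the source; a made-up sketch:

// Hypothetical sketch of the classic hallmark: nested locks taken in
// different orders by different code paths.
class Accounts {
    private final Object lockA = new Object();
    private final Object lockB = new Object();

    void transferAtoB() {
        synchronized (lockA) {           // takes A, then B
            synchronized (lockB) {
                // move money...
            }
        }
    }

    void transferBtoA() {
        synchronized (lockB) {           // takes B, then A - deadlock risk
            synchronized (lockA) {
                // move money...
            }
        }
    }
}

I can't draw the equivalent picture for code built on wait() and notify(), which is exactly the question.
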

In electrical engineering, it's possible to take any circuit based on voltage sources and impedances, and convert it (using the Thevenin and Norton theorems) to a different but equivalent circuit based on current sources and impedances, that will behave just the same to an outside observer. Doing this is often very useful for understanding how a circuit works. Is this possible for synchronization? Given some code implemented with certain synchronization tools, is it always possible to reimplement that code using different synchronization tools, such that the behavior will be the same? What do the rules of that conversion look like, and what will I learn about the underlying synchronization pattern by doing this? Exactly what "behavior" will prove to be invariant?

Thursday, October 9, 2008

Synchronization is Hard

I read a newspaper story about a neuropsychologist who had a stroke. She recounted trying to call 911, but not being able to figure out which digit was which on the phone, or what the steps were to make a phone call. She knew what the right tool was, but she'd lost the cognitive tools to use it. All the while, being a neuropsychologist, she was aware of what was going on, and even somewhat fascinated by it, but also aware that her life depended on doing a seemingly simple thing that she nonetheless could not quite grok.

Synchronization is like this for me. At least I know I'm not alone - some of the smartest people I know have a hard time thinking about synchronization problems, and my industry is littered with bugs due to incorrect synchronization. But I always feel like there's a right way to reason about these problems, and I know it's there but I don't know what it is and I can't even quite articulate why it is that I can't think clearly about it. My hope is that one day I'll GET IT and then I won't be able to remember why I couldn't figure it out before.

I do know some wrong ways to think about synchronization, though. Any time I am reasoning about synchronization and I find myself thinking "okay, if two threads come in here at the same time...", I am about to make a mistake or go down a rat hole. This is how books always present the topic, but intuitively I think it's wrong - I don't believe you can think correctly about synchronization by thinking about execution.

Instead, I think it's probably better to reason in terms of state. "What states could this object be in when this variable is evaluated?" "If I modify the state, how will other threads discover the modification?"

Today I spent a couple hours trying, along with some people who are pretty good at these things, to come up with a good pattern for lazy initialization when the initialization routine is not trusted (e.g., when it might try to call back into the object being initialized). The real moral of the experience is twofold: first, we each wrote routines that we thought were good and that were promptly found wrong by the others; second, although I think we did end up with two valid solutions, I'm not sure how to PROVE that they're valid.
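I won't reproduce the solutions here (partly because I can't prove them), but here's a hedged sketch, with made-up names, of the hazard itself:

// Hypothetical sketch of the hazard: the initializer is allowed to call
// back into the object that is still being initialized.
class LazyThing {
    private Object value;

    public synchronized Object get() {
        if (value == null) {
            value = initialize();   // untrusted; see below
        }
        return value;
    }

    private Object initialize() {
        // If this callback re-enters get() on the same thread, the
        // reentrant monitor lets it straight through while 'value' is
        // still null - infinite recursion. If it instead hands work to
        // another thread that needs get(), that thread blocks on the
        // monitor we already hold - deadlock.
        return get();   // deliberately pathological
    }
}
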

Wednesday, October 8, 2008

Hibernate locking

To continue the last topic: I'm working on a program that lets examinations be taken online. It uses Hibernate to get stuff from a database - for example, we get a list of possible answers to an exam question. The list is implemented by Hibernate, so that it doesn't actually read the database until the first time someone tries to access the contents of the list. But that fact is invisible to our code - to us it just looks like an ordinary Java list.

Now, like the ordinary Java collections, the Hibernate-provided collections don't have any built-in synchronization. If you tried to read the list from two threads at a time, it might try to initialize the collection twice, or not at all, or it might just crash.

This is not a problem for most uses of Hibernate. Generally any given Hibernate collection is only accessed by one execution thread, even though there might be zillions of Hibernate collections all referencing the same table in the database.

But in our application, we use the same collection from a lot of threads - perhaps tens of thousands, distributed over a cluster of machines with Terracotta. We never modify it, we just read it. Except for the very first time it's accessed. But there's no way to tell, from the outside, whether it's the first time - so we have to treat it as if it might be, every time.

This pattern does not work well. We could wrap the collections inside "synchronized" blocks, like the java.util.Collections$SynchronizedWhatever classes do; but that means that every time any thread tries to read an entry in the list, it has to wait for every other thread to get out of the way first, just because once upon a time one of those threads did the initialization.
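Roughly, that approach looks like this - a hypothetical sketch using the standard wrapper, not our real code:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the blunt fix: one monitor in front of the list.
class SynchronizedWrapperSketch {
    private final List<String> answers =
            Collections.synchronizedList(Arrays.asList("A", "B", "C"));

    int answerCount() {
        // Correct, but every reader takes the same lock on every call,
        // even though the list never changes after it's first loaded.
        return answers.size();
    }
}
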

Like I said in the last entry: replacing the implementation without changing the interface is powerful, but it means there's no way to know whether an operation is actually a read or a write. Locking is one reason why a caller cares about implementation.

The solution is to change the Hibernate code so that it does its own locking, using a read/write lock. Within a single method, it can take a read lock to figure out whether initialization is needed; if it is, then it gives up the read lock and takes a write lock to do the initialization. The first time through, a write lock will be taken, but (nearly) every time thereafter, it'll only need a read lock, which means that no thread will ever have to wait. In practice this is very effective. In one performance test we saw a roughly 200x speedup: latencies went from 4 seconds to 20 milliseconds.

The unfortunate part is that the code gets messier. This nice code:

int size() {
    if (!initialized) {
        initialize();
        initialized = true;
    }
    return size;
}

Becomes:

int size() {
    readLock.lock();
    try {
        if (initialized) {
            return size;
        }
    } finally {
        readLock.unlock();
    }
    writeLock.lock();
    try {
        if (!initialized) {
            initialize();
            initialized = true;
        }
        return size;
    } finally {
        writeLock.unlock();
    }
}

Amidst all the locking and unlocking and multiple checking, it's hard to see what's actually being done by the method. Which gets back to my first post.

Monday, October 6, 2008

The two R's



Computer software consists of a lot of instructions that read and write information to memory. The basic idea hasn't changed since World War II. Imagine a gazillion toggle switches, each of which can be flipped up or down. That's the memory, and then there's a processor that runs instructions, what we call the "code". The instructions are like "go look at switch 3,452,125 and see if it's flipped up. Then go look at switch 35,289 and see if it's flipped up. If they're both flipped up, then go find switch 278,311 and flip it down." And so forth. Some of the switches do things, like turn on a pixel on the computer screen. Others are just there to remember. We have nice tools so that we don't actually have to use numbers for the individual switches when we write the instructions, but under the covers that's exactly what's going on.

I work at a company called Terracotta Technology. We make software that connects many computers together so that they can solve problems bigger than one computer could handle. We make a sort of virtual computer, that other people's programs can execute on. We fool the programs that run on Terracotta into thinking they're running on a normal computer. A program thinks it's flipping a switch on its own computer, when actually it might be on some other computer.

So, we care a lot about when the programs try to read from memory and when they try to write to it, because we have to intercept all those operations. If all they want to do is read, we don't have to do as much work. Flipping a switch, in our world, that's real work.

This idea of replacing what's under the covers without changing how it looks to the software that's using it is not original - it comes up over and over again in software, in fact it's probably the single biggest idea the industry ever had. Hibernate is another product that does something like this. Hibernate takes reads and writes to memory, and supports them with reads and writes to a database, which is more reliable and persistent and searchable. A programmer could just write instructions to talk directly to the database, but Hibernate makes it easier by hiding some of the complexity under the covers.

But the illusion breaks down. The picture up above is what happens when you ask Hibernate how many things are in a list. Asking how many things are in a list shouldn't change it, right? That's common sense. But asking Hibernate how many things are in a list might change memory, because it might have to go fetch the information from the database and then save it in memory.

So if you want to figure out whether an operation is a "read" or a "write" or both, you need to know who's responsible for performing it. And that's something that can change on the fly, because we're so good at replacing what's under the covers.

But why do we care about the distinction anyway? Are reading and writing really the only way to compute? Why should this implementational distinction matter to a programmer?

Is this the path to functional programming?

Ignorance is the mother of invention

I spend too much of my time trying to figure out what pieces of software do.

In 2008, generations after we started programming, it is still almost always easier to write an okay prototype yourself, than to use an existing component that someone else wrote. This is nuts. This is why software engineers like me still get paid as much as we do. I appreciate the pay, but it's holding back the industry.

The reason for this is that with very rare exceptions, we engineers suck at saying what the code we write does, and for that matter we also suck at coming up with code that does things that can be succinctly described. The rare exceptions have names like Josh Bloch.

Good software engineers excel at looking at other engineers' implementations - at the program code that they wrote - and figuring out what it does. (I am not very good at this, and I admire it in my coworkers.) That is because it's the only way to survive in the industry; there is no other way to figure out what a component does.

Imagine that, before going to the loo, you needed to trace the plumbing to make sure that it actually exited to a sewer rather than the drinking fountain. Imagine that before starting a rental car you needed to trace the ignition wiring to make sure that turning the key clockwise wouldn't break the timing belt. This is the state of the software industry.

When one piece of software calls another piece, it needs to give it some data, and then it expects some data back. The important things that it needs to know include: what is the range of data that it can safely pass in? What is the range of data that might be returned? Will anything change as a result of the call? Is it okay to call again, before the first answer comes back? If the rules are broken, how bad are the consequences?

The tools used in the mainstream software industry do not answer ANY of these questions. Instead we have "comments," which are bits of text written by the software engineer, hopefully describing the situation. This does not work, because comments (a) are not required; (b) are written by software engineers, who are often not very good writers; (c) do not have any validation; (d) are often not updated when the code is updated.
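Here's a made-up example of the comment approach at its best - every one of those questions answered in Javadoc for a hypothetical method - and even then nothing checks that a word of it is true:

interface UserDirectory {
    /**
     * Looks up the display name for a user id.
     *
     * Range in:     id must be positive; zero or negative throws IllegalArgumentException.
     * Range out:    never returns null; returns "" if the user is unknown.
     * Side effects: none; this call does not modify any state.
     * Reentrancy:   safe to call from multiple threads concurrently.
     * On misuse:    an invalid id fails fast with an exception; nothing is corrupted.
     */
    String displayNameFor(long id);
}
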

To continue with the loo analogy, it's as if the plumber put a sticky note on the toilet. The plumber assumes that you know the things that were obvious to him and focuses on the details you might not know, so the note says something like "sewer line is made of cast iron," rather than what you really need to know, "flushing toilet will cause contents to be safely sent elsewhere." But it doesn't matter, because at some point someone else went into the basement and replaced the cast iron with PVC, without realizing that there was a sticky note on the toilet.

This is the most serious problem the software industry faces. Arguing about whether Java 7 should include closures is irrelevant. We need enforceable, validatable contracts that describe how software components can correctly be used.