Friday, October 22, 2010

time-delayed feedback in the workplace

The job of buildmaster rotates amongst managers. The buildmaster is primarily responsible for haranguing developers when the automated test failure rates are too high; and if they are too high for a while, the buildmaster can "lock the line", meaning that the only permitted checkins are those that ostensibly fix tests. We have some test suites that take several days to complete. Thus a bad checkin may cause test results to plunge days after the fact.

In Peter Senge's classic The Fifth Discipline, he talks about the effect of introducing a time delay into a negative feedback system. Whereas negative feedback usually stabilizes a system, negative feedback plus time delay tends to cause ever-more-violent oscillation.

Consider the following actual data:










Test Current EOD 10/21 EOD 10/20 EOD 10/19 TARGET
fast_suite 97.55% 98.77% 99.39% 99.39% 98%
slow_suite 86.43% 94.10% 83.61% 95.29% 97.5%


The fast suite returns feedback in a couple hours; the slow suite takes a few days to catch up to a changelist.

I am assured by various people that it sucks to be the buildmaster. It will continue to suck to be the buildmaster, I think, until we devise a system that is stable rather than oscillatory. A stable system is characterized by damping rather than nonlinear gain; and by feedback that is at least an order of magnitude faster than the forward phase response of the system. (It's possible to stabilize systems other ways, but this is the most general and reliable.)

To speed up the feedback loop, we could have fast suites that predict the behavior of the slow suites. Simply choosing a random subset of the tests in the slow suite, running those first, and providing interim results could achieve that.

To have damping rather than nonlinear gain, we need to remove or highly restrict the buildmaster's ability to lock the line; and instead, we need to increase the amount of pre-testing that is required in order to do a checkin. For instance, if interim results indicate a high failure rate, then new checkins should be subjected to a higher level of testing in the precheckin queue before they are allowed to actually commit.