Monday, February 17, 2014

Remote Teams and Standup

Agile standup meetings are a struggle for the teams I observe and manage, all of which have people in more than one location.  I'm not sure we're collaborating in a way that suits the work we do and the way we work.

Teleconferencing with more than three people is painful.  It's one thing to get six or ten people together in the team aisle for 10 minutes for a quick round of "what changed since yesterday."  Standing up actually makes sense, in that scenario.  But conference-room speakerphones change that dynamic radically.  Even videoconferences aren't great; the latency, poor audio quality, and limited field of vision make it much lower bandwidth.  No one who regularly uses remote conferencing technology, at least the sort my teams have access to, would call it a team-building or bonding experience.

The rules for standup are unnatural.  The standup is not supposed to be a status report; but the difference between a status report and a story update is too subtle for most people.  Unless we have great discipline, people talking one by one naturally turns into a status report, and status reports are boring and useless.  Similarly, the standup is not supposed to be a problem-solving session; once an issue requiring collaboration is identified, we are supposed to park it for later.  But engineers like to solve problems, so unless we have great discipline, talking about our challenges naturally turns into a technical problem solving session.

How could we honor our instincts?  Why do we work so hard in standup to avoid doing the natural thing, especially when that thing is beneficial?  We want collaborative problem solving, and we want to feel connected to each other and to the team's progress and problem-solving flow.  How can we structure our meetings so that what we want to do is the same as what we should be doing?

One idea might be to schedule an hour every day, immediately after standup, for small-group collaboration.  Use the standup to identify the collaboration that needs to happen, and then spend the next hour doing the collaboration, in twosomes or threesomes.  If you don't need to be involved you don't have to be; if you do, you've already got it on your schedule.  This would make the "parking lot" more real; for distributed teams, once you hang up the phone it's hard to regroup.

Another idea would be to have our webcams on full-time.  If you're at your desk, no matter what city it's in, your coworkers can see you.  Working from home?  Wear your clean pajamas.  The problem with this, again, is that our technologies aren't up to par.  Ideally I'd have a little row of real-time video thumbnails across the top of my screen, one for every coworker, and any time I wanted to ping them I could just click.  I'm not aware of a product like that.  Would someone like to build me one?

Tuesday, February 4, 2014

Specialists vs Generalists

This weekend one of the teams I manage had to work some very long hours to resolve a severe customer problem.  (I got about seven hours of sleep over three days, and I'm getting a bit too old to recover easily from that.)  One thing that made it take so long was that we had to submit database scripts to be executed on the customer's behalf; for good reason, there are a lot of restrictions and approvals necessary to do that, but my team wasn't very familiar with the process.

Thinking about it, I realized this is an example of a systemic failure, and I'm on the cause side as often as the effect side.  I'll call the problem "specialist versus generalist."

The database administrators (DBAs) see dozens of requests a day for scripts to be executed.  Executing these scripts is a core part of their job.  They know their team, and their communication channels are well established.  They know the proper approval process and they know what to expect in a properly formatted script request.  From the DBA perspective, script requests are a simple, straightforward process that has been fine-tuned and improved over months and years; and they don't understand why developers so often screw it up, when it happens so often and is so important.

There are dozens of requests a day.  But there are more than a thousand developers.  Any individual developer might go half a year or even two years without submitting a request.  When they do, it is in a situation where they've been pulled off of their normal job in order to work an emergency customer issue in production.  They're under high stress and tight time pressure: by the time a problem is escalated through ordinary support and all the way to development, the customer has probably been suffering for several days already and is hopping mad and running out of time.  The developer doesn't know what a proper script request looks like, nor who to talk to in order to find out, and all they know about the approval process is that it probably changed since the last time they did it.

The DBA is the specialist in this scenario.  The developer is the generalist.  The specialist understands the rules and the problem domain so well that s/he can't see what it looks like to the generalist, who only encounters the situation once a year.

But this is not just DBAs versus developers!  The generalist in one scenario is the specialist in another.  One of the dev teams I manage owns an infrastructure that lets other developers license the features they develop.  We are deeply knowledgeable about how licenses work and how features are defined.  From our perspective, it's hard to understand why the other developers continually screw up their definitions.  We're the specialists.  But any given developer only defines a new feature every year or so.  They're the generalists, and to them, the system is confusing and every time they try to use it something has changed.

The more I think about this pattern, the more I see it playing out all around me.  When you spend all your time providing a service, it's hard to understand that to the people who consume your service, you are just an occasional thing, and most of their attention is elsewhere.

Friday, October 4, 2013

The Power of Weakness

Make things weak, to keep things simple.

One of my teams works on a system that lets developers define licenses that enable customers to use our features.  The license definitions contain settings, such as permissions and numeric limits, that apply to various parts of the system.  A purchasable product contains one or more license definitions.  Buy the product and you get all the permissions and numeric limits its license definitions contain.

It turns out this system has to handle a lot of complexity.  For example, if you have a few different licenses that contain the same setting ("file storage", let's say), how should they be combined?  Do they get added, or does the highest value win, or do they all need to be the same?  There are use cases for each.  In some cases, there are dependencies between settings; you can't have feature B unless you also have feature A, so should the licensing system represent that dependency in some way?
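To make the combination question concrete, here's a minimal sketch of the three merge strategies described above.  The function name, setting names, and strategy labels are all hypothetical; they just illustrate that each strategy has a use case.

```python
# Hypothetical sketch of combining one setting across several licenses.
# Names and strategies are illustrative, not our actual system.

def combine_setting(name, values, strategy):
    """Combine a single setting's values drawn from multiple licenses."""
    if strategy == "sum":        # e.g. file storage pools add up
        return sum(values)
    if strategy == "max":        # e.g. the most generous limit wins
        return max(values)
    if strategy == "same":       # e.g. a mode flag that must agree everywhere
        first = values[0]
        if any(v != first for v in values):
            raise ValueError(f"conflicting values for {name}: {values}")
        return first
    raise ValueError(f"unknown strategy {strategy!r}")

licenses = [{"file_storage": 10}, {"file_storage": 25}]
values = [lic["file_storage"] for lic in licenses]
print(combine_setting("file_storage", values, "sum"))  # 35 if storage adds
print(combine_setting("file_storage", values, "max"))  # 25 if highest wins
```

The point of spelling it out is that the strategy has to be chosen per setting; there is no single rule that is right for all of them.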

We now support licenses not only for our own system, but for a number of others, such as companies we've acquired, that have their own cloud-based systems.  So our license definitions need to support settings for arbitrary other systems.

We are continually tempted to make the system very flexible and expressive.  Any license definition could, in principle, contain other license definitions; it could depend on (but not include) other definitions; it could contain settings for any part of any system.  Any setting could be specified explicitly, or have a complex system of defaults and fallbacks if it is not present in a definition.  A setting could be calculated based on other settings; for example, "file storage" could be the product of "number of users" and "storage per user", or it could be the larger of that value and some other value.  In principle we could define a whole calculus of settings and definitions.

But we know better.  We went that way in an earlier version of the system.  What we found was that when customers didn't get what they expected, it took days of dev effort and troubleshooting just to figure out what they were supposed to be getting, let alone why they weren't getting it.  Every new dev had to rediscover the complexities, often the hard way (by writing bugs).  We couldn't trust new license definitions without testing them, because there were so many unexpected emergent behaviors of the system.  The license definitions had ceased to be data, and become program code that ran against an ill-defined interpreter.

So now our license definitions are as simple and explicit as we can make them.  No clever defaults; no calculations (except a few we had to grandfather in).  They're verbose; but we have tooling to help with that.

We're wrestling right now with whether a license definition should only contain settings for a single resource pool within the system.  If we have a license that allows a specific group of users to read and write the Foo entity, should that license definition also turn on the Foo Tracking feature for the customer company (which applies to the whole company)?  Or are those two separate licenses, which can be bundled into one product?

On the one hand, forcing licenses to have just one target keeps them simpler.  On the other hand, it just forces the complexity into the product definition; now we have to start talking about hierarchies of products, product validation tools, and so forth.

I don't know the answer yet, but I know what my guiding principle will be: I'll do whatever reduces the number of possible ways to achieve the same end result.  I'll do whatever makes the system less powerful.

Sunday, September 22, 2013

Naming is Hard, But Worth It

I got a design review recently where a Quux, which has a field we'll call Foo, was supposed to get a new field called FooSource.

You don't need to know anything about the code or the rest of the design for the name to start you thinking.  I noticed that word "source".  Why should an object need to know where something comes from?

What might the values of FooSource be, in practice?  They would have to identify the possible sources of a Foo.  That is, they would contain information about the rest of the system; about the various places in the rest of the system that a Foo could come from.  If the system changed such that a previously valid source of Foos no longer existed, we'd have to go through all the Quuxes and fix the ones that pointed to the now-defunct source.

Also, up to now we've kept the module of code that Quux belongs to pretty clean; it doesn't have any code dependencies on the upstream part of the code that knows how to make a Foo.  So if we wanted to validate FooSources at compile or run time, we'd need to introduce a new dependency.  Someone who changed the code that makes Foos would have to rebuild, and republish, the Quux code.  If my team owns Quux, that means that whenever the Foo team does a release, we have to stop what we're doing and spend a few weeks making sure that we're in sync.

Eventually we concluded that FooSource should live in a separate module whose purpose is to define the overlap between Foo and Quux.  That module will have to be maintained whenever Foo changes; but it is a small and simple module whose sole purpose is to define that interaction, so it's easy to test and safe to change.  The Quux module will remain clean.
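The resulting dependency structure can be sketched roughly like this.  All the names here (Foo, Quux, the bridge module, the source identifiers) are placeholders from the design review, not real code:

```python
# Illustrative sketch of the module split we settled on.

# --- quux module: stays clean, knows nothing about where Foos come from ---
class Quux:
    def __init__(self, foo):
        self.foo = foo          # no source information stored here

# --- foo_quux_bridge module: the only place that knows both sides ---
# When a source of Foos is added or retired, only this small, easily
# tested module changes; neither the Foo code nor the Quux code does.
VALID_FOO_SOURCES = {"importer", "user_upload"}

def foo_source_is_valid(source):
    return source in VALID_FOO_SOURCES
```

The bridge is cheap to maintain precisely because defining that one interaction is its whole job.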

As a side effect, I learned that I've not done a good job communicating our high-level architecture to all my developers.  So I've got some homework to do, making and presenting better architectural documentation.

The bottom line here is that the developer who chose "FooSource" as a name did a good thing: he communicated his idea.  The idea was flawed, and we fixed that; but if it hadn't been for the clear, descriptive name, I wouldn't have realized what it was he was trying to do, and the error would not have been caught till it was too late.  It's hard to boil a complex idea down into a dozen or so characters, but it is very, very valuable.

Thursday, August 29, 2013

What is Quality Engineering?

Back in the 1990s, when waterfall was the development methodology, developers wrote code and checked it in; testers tested code, and found bugs; developers fixed the bugs; and then eventually we released the product to a limited group of customers called beta testers, and finally to the general public.

Nobody works that way any more.  Waterfall failed, and Agile took over.  Agile does not have a clear role for testers, because there is not a separate testing phase.  If you have someone different doing your testing, then you must necessarily be spending some time with code that is (at least partly) checked in but that has not been verified.  That is not Agile.

At the company I work for, teams have resolved this dilemma in many ways, most of them lousy.  Some teams don't have testers; some have automation developers; some have mini-waterfall.

We have a job title of "Quality Engineer"; people with this job are not expected to implement customer-facing features.  The absurd implication is that the people who implement customer-facing features are not quality engineers.  A software engineer who is not a quality engineer should be fired.  Quality is not something that can be applied after the coding is done.

But testers are important.  It's really hard to rigorously test your own code.  If you didn't see a gap the first time, you probably won't see it the second time.  And writing code is a creative act that takes emotional investment.  Asking someone to find the flaws in their own code is like asking a painter to critically assess the artistic relevance of their work before the paint dries on the canvas.

Pair programming is one solution; it's a lot easier to see someone else's error, or challenge someone else's shortcut.  Two sets of eyes during coding can greatly improve quality.  But the skill set of a good manual tester is different than that of a coder.  Watching a good manual tester is like watching a good hacker: the feature you thought was solid gold dissolves into a pile of bugs before your eyes.

So there is still a role for manual testing.  QE can understand the product from the customer's perspective, use it, and find out what doesn't work: essentially, act as a high-bandwidth, low-latency customer proxy.  QE in this role should be most tightly aligned with the product owner.

But manual testing is low leverage, compared to some more interesting possibilities.  There are areas where "Quality Engineering" really becomes a meaningful term.  Regrettably few companies invest in these areas.  The common characteristic of all these possibilities is that the work is internal-facing, decoupled from the product release cycle, and aimed at the development process rather than the product as such.

Predictive Fault Detection
There is a wealth of academic work, and some commercial products, dedicated to the premise that it is possible to predict before any code has been written where the bugs will be.  Bugs are not random: certain design patterns, certain APIs and technologies, certain methodological patterns are inherently buggy.  QE should be studying past results to predict future buggy patterns, steering coders away from them where possible, and advising extra attention where necessary.  QE should be like a harbor pilot, who knows where the hidden reefs are better than the ship captains can.
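One of the simplest history-based predictors just counts past bug-fix commits per file and flags the hotspots for extra attention.  This is a minimal sketch under invented data; real tools use richer signals, but the principle is the same:

```python
# Minimal sketch of history-based fault prediction: files that attracted
# the most bug-fix commits in the past get flagged for extra review.
# The commit-history format here is invented for illustration.
from collections import Counter

def hotspots(commits, top_n=2):
    """commits: list of (files_touched, was_bugfix) tuples."""
    fix_counts = Counter()
    for files, was_bugfix in commits:
        if was_bugfix:
            fix_counts.update(files)
    return [f for f, _ in fix_counts.most_common(top_n)]

history = [
    (["billing.py", "util.py"], True),
    (["billing.py"], True),
    (["ui.py"], False),
    (["parser.py", "billing.py"], True),
]
print(hotspots(history, top_n=1))  # ['billing.py']
```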

When technologies or patterns that are highly likely to provoke bugs are found, QE should propose eliminating them entirely: for example, if the company has been using a particular messaging framework but coders interacting with the framework tend to use it incorrectly and cause bugs, perhaps it is a sign that it is a bad choice of framework, even if it is otherwise performant and cool.  Or maybe it can be fixed.

Test Curation
Coders should write the majority of their own tests.  But as the codebase grows, so does the body of tests; and the test base becomes redundant and full of low-value tests.  Careful unit testing alleviates this problem because the individual tests continue to run quickly; but unit testing relies on well-modularized code, and in many enterprise situations - including at the company I work for - that is a goal we can work towards, but not a point we can start from.

So we have a vast number of slow, highly redundant tests, most of which test features that are not likely to regress.  QE should monitor the overall test base and combine tests that are too redundant, eliminate tests that provide insufficient value, and identify areas of weak coverage.  QE should understand and manage the test base as a whole, where coders tend to interact only with specific tests.
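One concrete curation heuristic: if a test's coverage is a strict subset of another test's, the smaller test adds no coverage and is a candidate for removal.  A toy sketch, with invented test names and coverage sets:

```python
# Sketch of one test-curation heuristic: flag tests whose coverage is a
# strict subset of some other test's coverage. Data here is invented.

def subsumed_tests(coverage):
    """coverage: dict mapping test name -> set of covered code units."""
    candidates = set()
    for a, cov_a in coverage.items():
        for b, cov_b in coverage.items():
            if a != b and cov_a < cov_b:   # strict subset: b covers strictly more
                candidates.add(a)
    return candidates

coverage = {
    "test_login_basic": {"auth.login"},
    "test_login_full":  {"auth.login", "auth.session"},
    "test_billing":     {"billing.charge"},
}
print(subsumed_tests(coverage))  # {'test_login_basic'}
```

Coverage subsumption is only one signal, of course; a subsumed test may still have value as documentation or as a faster failure locator, which is exactly the judgment a curator brings.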

Framework Development
Coders are generally working under time pressure to produce a customer-facing feature.  We tend to do whatever reduces the risk to on-time delivery, even if it results in accumulating technical debt.  It's often hard just to get a coder to take the extra time to refactor the shared code they are building on top of.  Most developers are not in a position where they can tell their boss they're about to spend a few months developing code that will pay off company-wide but that will not directly result in shipping the feature the team is supposed to be working on.  As a dev manager, my personnel funding is apportioned on the basis of feature need, not internal investment.

However, the payoff for having a well-maintained set of test frameworks is huge; all the more so when the maintenance isn't just a series of one-off efforts by coders who need a feature, but a proactive, intentionally designed effort by a dedicated team.  QE can serve as a pool of engineers whose job is to improve the quality and efficiency of the feature-dedicated coders.

In summary:
The term "Quality Engineer" is nothing but a euphemism, when it's used to make a tester feel important in a development methodology that doesn't have a place for testing.  Testing is important, and it doesn't need to be called something other than what it is; but it's entirely different from quality engineering.  Quality engineering should be valuable and high leverage, but it can only be so if we take it seriously, separate it from testing, and select quality engineers on the basis of relevant skill, training, and experience.

Tuesday, August 27, 2013

more

I'm managing people these days so I spend more time opining. One of my coworkers asked me to opine more publicly. So I'll start this thing back up and we'll see how long it takes to make a fool of myself. 10... 9... 8...

Friday, October 22, 2010

time-delayed feedback in the workplace

The job of buildmaster rotates amongst managers. The buildmaster is primarily responsible for haranguing developers when the automated test failure rates are too high; and if they are too high for a while, the buildmaster can "lock the line", meaning that the only permitted checkins are those that ostensibly fix tests. We have some test suites that take several days to complete. Thus a bad checkin may cause test results to plunge days after the fact.

In Peter Senge's classic The Fifth Discipline, he talks about the effect of introducing a time delay into a negative feedback system. Whereas negative feedback usually stabilizes a system, negative feedback plus time delay tends to cause ever-more-violent oscillation.

Consider the following actual data:

    Test         Current   EOD 10/21   EOD 10/20   EOD 10/19   TARGET
    fast_suite   97.55%    98.77%      99.39%      99.39%      98%
    slow_suite   86.43%    94.10%      83.61%      95.29%      97.5%

The fast suite returns feedback in a couple hours; the slow suite takes a few days to catch up to a changelist.

I am assured by various people that it sucks to be the buildmaster. It will continue to suck to be the buildmaster, I think, until we devise a system that is stable rather than oscillatory. A stable system is characterized by damping rather than nonlinear gain; and by feedback that is at least an order of magnitude faster than the forward phase response of the system. (It's possible to stabilize systems other ways, but this is the most general and reliable.)

To speed up the feedback loop, we could have fast suites that predict the behavior of the slow suites. Simply choosing a random subset of the tests in the slow suite, running those first, and providing interim results could achieve that.
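The random-subset idea is almost trivial to sketch.  The runner, test names, and sample size below are placeholders; the point is only that a sampled pass rate is available hours, not days, after a checkin:

```python
# Sketch of interim results: run a random sample of the slow suite first
# and report its pass rate as an early estimate of the full-suite result.
import random

def interim_pass_rate(slow_suite_tests, run_test, sample_size=50, seed=None):
    """Estimate the slow suite's pass rate from a random sample of its tests."""
    rng = random.Random(seed)
    sample = rng.sample(slow_suite_tests, min(sample_size, len(slow_suite_tests)))
    passed = sum(1 for t in sample if run_test(t))
    return passed / len(sample)
```

A fixed seed makes the sample reproducible for a given build, which helps when comparing the interim estimate against the eventual full-suite number.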

To have damping rather than nonlinear gain, we need to remove or highly restrict the buildmaster's ability to lock the line; and instead, we need to increase the amount of pre-testing that is required in order to do a checkin. For instance, if interim results indicate a high failure rate, then new checkins should be subjected to a higher level of testing in the precheckin queue before they are allowed to actually commit.
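Replacing the binary line-lock with graduated pre-checkin requirements might look something like this.  The thresholds and level names are purely illustrative:

```python
# Sketch of damping instead of a line lock: as the interim failure rate
# rises, each checkin must pass more pre-checkin testing before commit.
# Thresholds and level names are illustrative, not real policy.

def required_precheckin_level(interim_pass_rate):
    if interim_pass_rate >= 0.975:
        return "fast_suite"            # healthy: quick smoke tests only
    if interim_pass_rate >= 0.90:
        return "fast_suite+sample"     # elevated: add a slow-suite sample
    return "full_slow_suite"           # degraded: full run before commit

print(required_precheckin_level(0.98))   # fast_suite
print(required_precheckin_level(0.93))   # fast_suite+sample
print(required_precheckin_level(0.86))   # full_slow_suite
```

Because the gate tightens gradually as results degrade, and loosens gradually as they recover, the system pushes back proportionally instead of slamming between open and locked.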