Friday, October 4, 2013

The Power of Weakness

Make things weak, to keep things simple.

One of my teams works on a system that lets developers define licenses that enable customers to use our features.  The license definitions contain settings, such as permissions and numeric limits, that apply to various parts of the system.  A purchasable product contains one or more license definitions.  Buy the product and you get all the permissions and numeric limits its license definitions contain.

It turns out this system has to handle a lot of complexity.  For example, if you have a few different licenses that contain the same setting ("file storage", let's say), how should they be combined?  Do they get added, or does the highest value win, or do they all need to be the same?  There are use cases for each.  In some cases, there are dependencies between settings; you can't have feature B unless you also have feature A, so should the licensing system represent that dependency in some way?
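To make the combining question concrete, here is a minimal Python sketch in which every setting has to name its combination rule up front. Everything in it (`MergePolicy`, `combine`, the three policy names) is hypothetical and invented for illustration; it is not the real licensing system, just one way the three cases above could be written down.

```python
from enum import Enum


class MergePolicy(Enum):
    SUM = "sum"            # values from all licenses are added together
    MAX = "max"            # the highest value wins
    MUST_MATCH = "match"   # every license must specify the same value


def combine(values, policy):
    """Combine one setting's values from several licenses under an explicit policy."""
    if not values:
        raise ValueError("no values to combine")
    if policy is MergePolicy.SUM:
        return sum(values)
    if policy is MergePolicy.MAX:
        return max(values)
    if policy is MergePolicy.MUST_MATCH:
        if len(set(values)) != 1:
            raise ValueError(f"conflicting values: {sorted(set(values))}")
        return values[0]
    raise ValueError(f"unknown policy: {policy}")
```

With two licenses granting 5 and 10 units of "file storage", `combine([5, 10], MergePolicy.SUM)` gives 15, `MergePolicy.MAX` gives 10, and `MergePolicy.MUST_MATCH` raises an error instead of quietly picking one.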

We now support licenses not only for our own system, but for a number of others, such as companies we've acquired, that have their own cloud-based systems.  So our license definitions need to support settings for arbitrary other systems.

We are continually tempted to make the system very flexible and expressive.  Any license definition could, in principle, contain other license definitions; it could depend on (but not include) other definitions; it could contain settings for any part of any system.  Any setting could be specified explicitly, or have a complex system of defaults and fallbacks if it is not present in a definition.  A setting could be calculated based on other settings; for example, "file storage" could be the product of "number of users" and "storage per user", or it could be the larger of that value and some other value.  In principle we could define a whole calculus of settings and definitions.

But we know better.  We went that way in an earlier version of the system.  What we found was that when customers didn't get what they expected, it took days of dev effort and troubleshooting just to figure out what they were supposed to be getting, let alone why they weren't getting it.  Every new dev had to rediscover the complexities, often the hard way (by writing bugs).  We couldn't trust new license definitions without testing them, because there were so many unexpected emergent behaviors of the system.  The license definitions had ceased to be data, and become program code that ran against an ill-defined interpreter.

So now our license definitions are as simple and explicit as we can make them.  No clever defaults; no calculations (except a few we had to grandfather in).  They're verbose; but we have tooling to help with that.

We're wrestling right now with whether a license definition should only contain settings for a single resource pool within the system.  If we have a license that allows a specific group of users to read and write the Foo entity, should that license definition also turn on the Foo Tracking feature for the customer company (which applies to the whole company)?  Or are those two separate licenses, which can be bundled into one product?

On the one hand, forcing licenses to have just one target keeps them simpler.  On the other hand, it just forces the complexity into the product definition; now we have to start talking about hierarchies of products, product validation tools, and so forth.

I don't know the answer yet, but I know what my guiding principle will be: I'll do whatever reduces the number of possible ways to achieve the same end result.  I'll do whatever makes the system less powerful.

Sunday, September 22, 2013

Naming is Hard, But Worth It

I reviewed a design recently in which a Quux, which has a field we'll call Foo, was supposed to get a new field called FooSource.

You don't need to know anything about the code or the rest of the design for the name to start you thinking.  I noticed that word "source".  Why should an object need to know where something comes from?

What might the values of FooSource be, in practice?  They would have to identify the possible sources of a Foo.  That is, they would contain information about the rest of the system; about the various places in the rest of the system that a Foo could come from.  If the system changed such that a previously valid source of Foos no longer existed, we'd have to go through all the Quuxes and fix the ones that pointed to the now-defunct source.

Also, up to now we've kept the module of code that Quux belongs to pretty clean; it doesn't have any code dependencies on the upstream part of the code that knows how to make a Foo.  So if we wanted to validate FooSources at compile or run time, we'd need to introduce a new dependency.  Someone who changed the code that makes Foos would have to rebuild, and republish, the Quux code.  If my team owns Quux, that means that whenever the Foo team does a release, we have to stop what we're doing and spend a few weeks making sure that we're in sync.

Eventually we concluded that FooSource should live in a separate module whose purpose is to define the overlap between Foo and Quux.  That module will have to be maintained whenever Foo changes; but it is a small and simple module whose sole purpose is to define that interaction, so it's easy to test and safe to change.  The Quux module will remain clean.
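For illustration only, here is a rough Python sketch of what that separation might look like. The post doesn't describe the real module, so every name here (`FooSourceRegistry`, the `IMPORTED` and `GENERATED` values, the `quux_id` field) is an assumption; the only point is that Quux itself carries no knowledge of where its Foo came from.

```python
from dataclasses import dataclass
from enum import Enum


# --- quux module: knows nothing about where a Foo comes from ---
@dataclass(frozen=True)
class Quux:
    quux_id: str
    foo: str


# --- hypothetical linking module: the only place that names Foo sources ---
class FooSource(Enum):
    IMPORTED = "imported"      # made-up example values
    GENERATED = "generated"


class FooSourceRegistry:
    """Records which source produced the Foo held by each Quux."""

    def __init__(self):
        self._sources = {}

    def record(self, quux, source):
        self._sources[quux.quux_id] = source

    def source_of(self, quux):
        # Returns None when no source was ever recorded for this Quux.
        return self._sources.get(quux.quux_id)

    def quuxes_from(self, source):
        """IDs of every Quux whose Foo came from the given source."""
        return sorted(qid for qid, s in self._sources.items() if s is source)
```

If a source of Foos goes away, `quuxes_from` is the one place to ask which Quuxes are affected, and the fix happens in the linking module rather than in Quux.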

As a side effect, I learned that I've not done a good job communicating our high-level architecture to all my developers.  So I've got some homework to do, making and presenting better architectural documentation.

The bottom line here is that the developer who chose "FooSource" as a name did a good thing: he communicated his idea.  The idea was flawed, and we fixed that; but if it hadn't been for the clear, descriptive name, I wouldn't have realized what it was he was trying to do, and the error would not have been caught till it was too late.  It's hard to boil a complex idea down into a dozen or so characters, but it is very, very valuable.

Thursday, August 29, 2013

What is Quality Engineering?

Back in the 1990s, when waterfall was the development methodology, developers wrote code and checked it in; testers tested code, and found bugs; developers fixed the bugs; and then eventually we released the product to a limited group of customers called beta testers, and finally to the general public.

Nobody works that way any more.  Waterfall failed, and Agile took over.  Agile does not have a clear role for testers, because there is not a separate testing phase.  If you have someone different doing your testing, then you must necessarily be spending some time with code that is (at least partly) checked in but that has not been verified.  That is not Agile.

At the company I work for, teams have resolved this dilemma in many ways, most of them lousy.  Some teams don't have testers; some have automation developers; some have mini-waterfall.

We have a job title of "Quality Engineer"; people with this job are not expected to implement customer-facing features.  The absurd implication is that the people who implement customer-facing features are not quality engineers.  A software engineer who is not a quality engineer should be fired.  Quality is not something that can be applied after the coding is done.

But testers are important.  It's really hard to rigorously test your own code.  If you didn't see a gap the first time, you probably won't see it the second time.  And writing code is a creative act that takes emotional investment.  Asking someone to find the flaws in their own code is like asking a painter to critically assess the artistic relevance of their work before the paint dries on the canvas.

Pair programming is one solution; it's a lot easier to see someone else's error, or challenge someone else's shortcut.  Two sets of eyes during coding can greatly improve quality.  But the skill set of a good manual tester is different from that of a coder.  Watching a good manual tester is like watching a good hacker: the feature you thought was solid gold dissolves into a pile of bugs before your eyes.

So there is still a role for manual testing.  QE can understand the product from the customer's perspective, use it, and find out what doesn't work: essentially, act as a high-bandwidth, low-latency customer proxy.  QE in this role should be most tightly aligned with the product owner.

But manual testing is low leverage, compared to some more interesting possibilities.  There are areas where "Quality Engineering" really becomes a meaningful term.  Regrettably few companies invest in these areas.  The common characteristic of all these possibilities is that the work is internal-facing, decoupled from the product release cycle, and aimed at the development process rather than the product as such.

Predictive Fault Detection
There is a wealth of academic work, and some commercial products, dedicated to the premise that it is possible to predict, before any code has been written, where the bugs will be.  Bugs are not random: certain design patterns, certain APIs and technologies, certain methodological patterns are inherently buggy.  QE should be studying past results to predict future buggy patterns, steering coders away from them where possible, and advising extra attention where necessary.  QE should be like a harbor pilot, who knows where the hidden reefs are better than the ship captains can.

When technologies or patterns that are highly likely to provoke bugs are found, QE should propose eliminating them entirely: for example, if the company has been using a particular messaging framework but coders interacting with the framework tend to use it incorrectly and cause bugs, perhaps it is a sign that it is a bad choice of framework, even if it is otherwise performant and cool.  Or maybe it can be fixed.

Test Curation
Coders should write the majority of their own tests.  But as the codebase grows, so does the body of tests; and the test base becomes redundant and full of low-value tests.  Careful unit testing alleviates this problem because the individual tests continue to run quickly; but unit testing relies on well-modularized code, and in many enterprise situations - including at the company I work for - this is a goal that we can work towards but it is not a point we can start from.

So we have a vast number of slow, highly redundant tests, most of which test features that are not likely to regress.  QE should monitor the overall test base and combine tests that are too redundant, eliminate tests that provide insufficient value, and identify areas of weak coverage.  QE should understand and manage the test base as a whole, where coders tend to interact only with specific tests.
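As a very rough sketch of what a first pass at this could look like, assume we already have, for each test, the set of code locations it exercises (from whatever coverage tool is in use). The function names and the data shape below are assumptions made for illustration. Note that "covers a subset of the same lines" is only a crude proxy for redundancy: two tests can touch the same lines and assert different things, so the output is a list of candidates for a human to review, not a delete list.

```python
def find_redundant_tests(coverage):
    """Map each test whose coverage is a subset of a kept test's to that test.

    `coverage` maps a test name to the set of code locations it exercises.
    """
    redundant = {}
    kept = []
    # Visit the broadest tests first, so a test is only ever flagged
    # against a test that is itself being kept.
    names = sorted(coverage, key=lambda n: (-len(coverage[n]), n))
    for name in names:
        cover = next((k for k in kept if coverage[name] <= coverage[k]), None)
        if cover is None:
            kept.append(name)
        else:
            redundant[name] = cover
    return redundant


def uncovered(all_locations, coverage):
    """Locations that no test touches: candidate areas of weak coverage."""
    covered = set().union(*coverage.values())
    return set(all_locations) - covered
```

For example, if `test_login` exercises `{"auth.py:10", "auth.py:11"}` and `test_login_again` exercises only `{"auth.py:10"}`, the second is reported as a candidate for merging into the first.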

Framework Development
Coders are generally working under time pressure to produce a customer-facing feature.  We tend to do whatever reduces the risk to on-time delivery, even if it results in accumulating technical debt.  It's often hard just to get a coder to take the extra time to refactor the shared code they are building on top of.  Most developers are not in a position where they can tell their boss they're about to spend a few months developing code that will pay off company-wide but that will not directly result in shipping the feature the team is supposed to be working on.  My personnel funding as a dev manager is apportioned on the basis of feature need, not internal investment.

However, the payoff for having a well-maintained set of test frameworks is huge; all the more so when the maintenance isn't just a series of one-off efforts by coders who need a feature, but a proactive, intentionally designed effort by a dedicated team.  QE can serve as a pool of engineers whose job is to improve the quality and efficiency of the feature-dedicated coders.

In summary:
The term "Quality Engineer" is nothing but a euphemism, when it's used to make a tester feel important in a development methodology that doesn't have a place for testing.  Testing is important, and it doesn't need to be called something other than what it is; but it's entirely different from quality engineering.  Quality engineering should be valuable and high leverage, but it can only be so if we take it seriously, separate it from testing, and select quality engineers on the basis of relevant skill, training, and experience.

Tuesday, August 27, 2013

more

I'm managing people these days, so I spend more time opining. One of my coworkers asked me to opine more publicly. So I'll start this thing back up and we'll see how long it takes me to make a fool of myself. 10... 9... 8...