Everybody loves to hate metrics. I get it. There are a lot of terrible metrics out there in the software development and testing world. People still propose counting commits or test cases as a measure of productivity. It’s garbage. But I also believe that measuring something can be a useful way to understand aspects of it that you couldn’t get with qualitative measures alone. I’m not going to give a defence of metrics in all cases here, but I do have a few suggestions for how to make them suck less.
1. Be very explicit about what a metric measures
Take the example of counting the number of commits a developer makes. It’s a terrible metric because commits aren’t actually a measure of productivity. While the platonic ideal of a commit is that it represents a single atomic change, the work behind one could still be anything from a single-character change to a large refactor of a highly coupled codebase.
Number of test cases run and the number of bugs found are equally bad metrics. Neither has an unambiguous way to be counted. Test cases can be broken up in all kinds of arbitrary ways to change their number. Meanwhile a single root cause might be reported as 8 different bugs across 3 application layers, either just because that’s how it manifested or because someone is incentivized to find lots of bugs.
There’s a very academic but interesting paper by Kaner & Bond all about rigorously asking what metrics actually measure. They propose a series of questions to help define metrics in a way that makes sure what you’re trying to measure is explicit. In my reading, the most important aspects of it boil down to making sure you have solid answers to the following:
- Why are you measuring this? (If you don’t have a good answer, stop here.)
- What attribute are you actually trying to understand? (e.g., productivity)
- Is the metric you’re proposing actually correlated with that attribute? If the metric goes up, does that mean the attribute improved, and vice versa?
- What assumptions are you making about how these two things are related?
In short, you want to be very explicit about why you’re looking at any particular metric, because it is very easy to track a metric that doesn’t measure what you think it does.
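The correlation question is one you can sanity-check with a few lines of code. Here’s a minimal sketch in Python, assuming you have some independent measure of the attribute to compare against. The names `commits_per_sprint` and `features_delivered` are hypothetical and the numbers are made up purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers, one entry per sprint (not real data).
commits_per_sprint = [12, 30, 18, 45, 25]   # the proposed metric
features_delivered = [3, 2, 4, 2, 3]        # a hypothetical stand-in for the attribute

r = pearson(commits_per_sprint, features_delivered)
print(f"correlation between metric and attribute: {r:.2f}")
```

With these invented numbers the coefficient comes out negative: more commits went along with fewer features delivered. A check like this only speaks to the correlation question. It says nothing about why you’re measuring or what assumptions link the two, and the stand-in for the attribute deserves the same scrutiny as the metric.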
Another important question that Kaner & Bond bring up is: what are the side effects of measuring this? That leads us to the next important piece.
2. Have a counter-metric
One of the most common complaints about metrics is that they can always be “gamed”, or manipulated. If I start counting how many bugs people log, they’ll find ways to log more bugs. If I count commits, developers will make smaller commits. At its best, a metric will motivate positive changes, but it can always be taken too far. Goodhart’s law warns us that when a measure becomes a target, it ceases to be a good measure.
If we’ve carefully thought through the side effects of making a measurement, we should know how things might go wrong. While culture plays a big role here — e.g. by making very clear that a new metric is not a measure of personal performance, actually meaning it, and having people believe you — we can be more systematic about preventing this.
A classic example of counter-metrics comes from DevOps: by pursuing more frequent small releases, a team might cut corners in testing. Less testing means they can release faster, but the quality of their product could also decrease as they release bugs more often. This is why the “Big 4” DevOps metrics have two related to speed (release frequency and lead time) and two related to stability (how many production issues there are and how long it takes to recover from them). The stability metrics are there to make sure people don’t privilege speed at all costs.
(It’s also possible that doing less testing won’t result in more bugs; it is possible to over-test, after all. Pairs of counter-metrics aren’t guaranteed to be anti-correlated.)
It’s not always trivial to have a counter-metric. Counter-metrics are themselves metrics that will have their own side effects. But for any metric you must ask: how will you know if it starts to do more harm than good?
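To make the speed-versus-stability pairing concrete, here’s a rough sketch in Python of checking a metric together with its counter-metric. Everything in it is an assumption for illustration: the `Period` type, the `counter_metric_warning` function, the 10% tolerance, and the numbers are all invented.

```python
from dataclasses import dataclass


@dataclass
class Period:
    releases_per_month: float   # speed metric: higher is better
    production_incidents: int   # counter-metric: lower is better


def counter_metric_warning(before: Period, after: Period,
                           tolerance: float = 0.10) -> bool:
    """Return True when speed improved but incidents rose by more than
    `tolerance` (relative to the earlier period)."""
    faster = after.releases_per_month > before.releases_per_month
    allowed_incidents = before.production_incidents * (1 + tolerance)
    less_stable = after.production_incidents > allowed_incidents
    return faster and less_stable


# Made-up illustrative numbers (not real data).
last_quarter = Period(releases_per_month=2, production_incidents=4)
this_quarter = Period(releases_per_month=8, production_incidents=9)

if counter_metric_warning(last_quarter, this_quarter):
    print("Releases are faster, but stability got worse. Time to look closer.")
```

A warning like this is a prompt for a conversation, not a verdict. As noted above, the two metrics aren’t guaranteed to move against each other, so a quiet check doesn’t prove nothing is being gamed either.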
3. Make them temporary
If you have a reason to measure something, there will be a reason to stop measuring it.
Generally, a metric either serves as a way to observe the effects of a change, or as an early warning system.
In the former case, once your goals have been met, think about getting rid of it. Make new goals and move on. Even if the goal hasn’t been met, examine why and re-evaluate. Avoid the temptation to make every metric a quota or target that has to be measured forever.
Any metric related to testing usually falls into this category for me; nobody should care how many test cases ran, because at the end of the day what really matters is whether the product is able to do its job for its users. You might pay attention to counting test cases because you have a hypothesis that changing that number will improve the resulting product quality. (The “how” here matters in practice, but let’s assume for a moment that you have a legitimate reason to make the hypothesis.) There are three main possibilities:
- You succeed at increasing the number of test cases run before release, and product quality improves.
- You succeed at increasing the number of test cases run before release, but product quality doesn’t improve.
- You don’t succeed at increasing the number of test cases run before release.
In all three cases, you can stop worrying about how many test cases ran. The only case for keeping it around as a metric is in the first scenario, so you can be alerted if the number regresses, but the longer you keep a metric like this around as a target, the more likely it is to start being manipulated. The effects of Goodhart’s law are guaranteed to come into play eventually.
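One way to keep a metric temporary is to record its hypothesis and a review date when you introduce it. This is only a sketch of that idea in Python: the `MetricRecord` type, its fields, the `due_for_review` function, and the date are hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MetricRecord:
    name: str
    hypothesis: str   # why we think moving this number will help
    review_by: date   # when to decide whether to keep or retire it


def due_for_review(metric: MetricRecord, today: date) -> bool:
    """Return True once the metric has reached its review date."""
    return today >= metric.review_by


# Hypothetical example record; the review date is arbitrary.
test_case_count = MetricRecord(
    name="test cases run before release",
    hypothesis="running more test cases before release improves product quality",
    review_by=date(2030, 1, 1),
)

if due_for_review(test_case_count, date.today()):
    print(f"Review '{test_case_count.name}': did the hypothesis hold?")
```

Writing the hypothesis next to the date means that when the review comes up, you’re deciding between the three outcomes above rather than keeping the number out of habit.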
Bonus tip: Know who you’re talking to
A lot of my motivation for wanting to write this is that I’m a very quantitative person by nature. I have a hard science background and I like understanding things in terms of numbers where it makes sense to do that. If you’ve ever done corporate training on communication styles, you’ve almost certainly seen a 2×2 matrix dividing us all into 4 types of people. One quarter is always some version of “logical” or “analytical”, which is where I fall. These courses teach you that, even if you’re not that kind of person yourself, you’ll have more success communicating with people like that if you can put numbers on things. If you talk to someone in the opposing quarter — usually a Hufflepuff — you should leave the numbers out. Who is looking at a metric can be just as important as the metric itself.