
The Gambler and other fallacies in Testing

I just listened to Episode 3 of the Ministry of Testing’s TestSphere Roulette podcast series, and something about the conversation irked me. The discussion was centered on the Gambler’s Fallacy card, which says:

The human tendency to perceive meaningful patterns within random data.

Specifically, it usually refers to a gambler playing a game of chance who might think that past results can tell him something about what is likely to come up next. In a game of roulette, after seeing a string of red, we might be tempted to think that black is “due” to come up next. Or, possibly, that red is on “a streak”, and therefore more likely to come up again on the next spin.

The examples on the TestSphere card, though, describe what I think are quite different scenarios:

  1. Creating tests for every bug as they’re found, so in a few years people wonder why there are tests for such obscure things.
  2. Repeatedly going back to test the same things that have broken in the past.
  3. A very small portion of your user base being very loud in app store reviews.

The conversation on the podcast focused on these three examples and how people had experienced them. It wasn’t until I pulled out the Gambler card myself and read through it again that I realized what bugged me. There was nothing wrong with what anybody said. The problem I have is that none of those examples on the card are examples of the Gambler’s fallacy at play, because bugs aren’t random data.

I suspect a lot of us have experienced some flavour of the Pareto principle in testing. It usually goes something like this: 80% of the bugs are caused by 20% of the code. I work on a very large web app and I would say most bugs by far come from either CSS visual layout or mishandling of malformed data coming from one of the APIs. If bugs arose randomly, it would be a case of the Gambler’s fallacy to believe that past CSS bugs are a predictor of more future CSS bugs. The rational belief would be that CSS bugs should arise in proportion to how much of the app is CSS. But my experience as a tester tells me that isn’t true. In fact, there’s a different TestSphere card—the History heuristic—that reflects this: “Previous versions of your product can teach you where problematic features are or where bugs are likely to turn up.” (I’m actually surprised there’s not an orange Patterns card in TestSphere for Pareto.)

This argument also applies to the user reviews example: a lot of angry reviews from a small portion of the users might be because they’re all about a significant bug that only affects that one segment of users. The Gambler’s fallacy warns that if reviews are random data, then a string of reviews from one small segment does not make the next review any more or less likely to come from that same segment. But the reverse is probably true here: a string of reviews from one small segment of your users suggests that there might be a correlation, and you should expect more reviews from that same segment in the future unless you change something.

(Sidenote: The Gambler’s fallacy sometimes doesn’t even apply that well to gambling for the same reason. There’s an interesting Mathologer video that walks through the math of seeing 60 red roulette spins out of 100. Despite what the Gambler’s fallacy might suggest, you actually should bet on red for spin 101 because it’s likely that the wheel isn’t actually random, i.e. it has a bias towards red. This is also why counting cards works.)
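(Just out of curiosity, here's a quick back-of-the-envelope check of that claim. This is my own sketch, not anything from the video, and it assumes a European wheel where P(red) = 18/37 because of the single green zero: the chance of a fair wheel producing 60 or more reds in 100 spins is small enough that a biased wheel starts to look like the better explanation.)

```python
# My own rough sketch (not from the Mathologer video): how surprising are
# 60+ reds in 100 spins if the wheel is fair? Assumes a European wheel,
# where P(red) = 18/37 because of the single green zero.
from math import comb

p_red = 18 / 37
n, k = 100, 60

# P(at least k reds in n spins) under the fair-wheel assumption,
# summing the binomial tail directly
p_at_least = sum(comb(n, i) * p_red**i * (1 - p_red)**(n - i)
                 for i in range(k, n + 1))

print(f"P(>=60 reds out of 100 | fair wheel) ~ {p_at_least:.3f}")  # roughly 0.015
```

A fair wheel would throw up a run like that only around once in every seventy sets of 100 spins, so it's not unreasonable to start suspecting the wheel rather than the luck.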

Of course, all of this raises a question: how does the Gambler’s fallacy apply to testing? For the fallacy to really apply, we need to be looking at something with random data. And aside from a few specific cases, like where your product is actually dealing with randomness, it’s hard to see scenarios where this comes up day to day. At a stretch, we might be able to say that sufficiently rare events are as good as random: one request out of a billion failing in some weird, unexpected mode, for example. Even cosmic rays can cause one-off misbehaviours! An unfortunate string of cosmic bit-flips should not necessarily be taken as evidence that your product is exceptionally prone to them. This might be a case of the accident fallacy: misbehaviours are usually caused by bugs, and cosmic rays cause misbehaviours, therefore cosmic rays are bugs. But then we’re getting into a tangent about risk tolerance and probabilities.

What the TestSphere card (and the podcast) was instead suggesting was that a bug occurring once shouldn’t be taken as evidence that it will occur again. I think Gambler’s fallacy is the wrong label to put on that idea, but it is worth considering on its own. I certainly agreed with the speakers on the podcast that it is important to prune our test suites and regularly review whether the tests they contain are valuable. However, I don’t think you can extend that to a black-and-white assertion that tests for old bugs are unnecessary. It is difficult even to say “this bug is now impossible and thus shouldn’t be tested for”, since the implementation could change to make it possible again. How likely is that? As usual, the answer is annoyingly context dependent. Deciding which tests are worth doing given finite time is one of the great arts of testing. Likely 90% of automated tests written will never catch a bug, because it is impossible to know in advance which bugs will happen again and which won’t. But it doesn’t necessarily follow that it is good strategy to reduce your test suite by a factor of ten. Nor does it follow that you shouldn’t add a test for a bug you’ve seen today.

There can also be a belief that if a test exists, it should exist. This is the is-ought fallacy. In trying to justify it, you might run afoul of the Historian’s fallacy by thinking that if someone wrote that test in the past, they must have had a good reason for doing so, and that good reason is reason to keep it. In reality, we may have more information about our product today than the testers of the past had, so we might come to a different conclusion. I’ve also seen people use the Appeal to Tradition fallacy – “we test it that way because it’s always been tested that way”. That’s not much of a test strategy either, I dare say.

At the end of the day, what I really got out of this whole discussion is that it’s great fun to read through Wikipedia’s List of Fallacies and think about all the ways in which they are misused in testing. Our job is often justifying what tests are worth doing, whether we think about it explicitly or not. It’s worth being able to recognize when our logic leaves something to be desired.
