
Going deeper on “Should we automate each negative test?”

In a recent article on the Ministry of Testing site, Mark Winteringham asks: “Should You Create Automation For Each Negative API Scenario?” In short, his answer is that which scenarios you automate will depend entirely on what risks you’re trying to mitigate. While I’m on board with the idea that each test should have a reason behind it, I would have tackled the question differently, because I think there’s a more interesting question lurking beneath the surface.

Let’s use the same example: an API that validates an email address by responding either with 200 OK or 400 Bad Request. In this context, a “positive” test scenario says that a valid email will return a 200 OK response. A negative scenario would say that an invalid email should get a 400 Bad Request response. Now we can break the question down.

Should you create automation for one negative API scenario? The answer to this is unequivocally yes, absolutely. Without at least one case of seeing that 400 response, your API could be returning 200 OK for all requests regardless of their content. Along the same lines as my claim that all tests should fail, one test doesn’t tell you much unless it shows a behaviour in contrast to some other possible behaviour.
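
To make that concrete, here is a minimal sketch of that positive/negative pair in Python (pytest plus requests). The endpoint URL and request shape are my own assumptions for illustration, not anything from Mark’s example:

```python
# A minimal sketch of one positive and one negative API test.
# The endpoint URL and payload shape are assumptions for illustration.
import requests

VALIDATE_URL = "https://api.example.com/validate-email"  # hypothetical endpoint


def test_valid_email_returns_200():
    # Positive scenario: a well-formed address is accepted.
    response = requests.post(VALIDATE_URL, json={"email": "user@example.com"})
    assert response.status_code == 200


def test_invalid_email_returns_400():
    # Negative scenario: without at least one of these, the API could be
    # returning 200 OK for every request and we would never know.
    response = requests.post(VALIDATE_URL, json={"email": "not-an-email"})
    assert response.status_code == 400
```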

Should you create automation for all negative API scenarios? The answer to this should also obviously be “no”, for the same reason that you can’t automate all positive scenarios. There are infinitely many of them.

Now, should you create automation for each negative API scenario? I’m not sure whether this is any different from the previous question, except for the fact that “each” implies (to me) being able to iterate through a finite list. This question actually can’t be answered as asked because, as Mark points out, there are infinite ways that an email could be considered invalid.

The more interesting question I would pose instead is: Which negative scenarios should we automate?

Yes, this still depends on what risks you’re interested in, but even when addressing a single risk we can add a bit more information. It is safe to assume in this example that an invalid address getting into the system would have some negative effect; otherwise this validation API wouldn’t exist at all. But even to address that single risk of only admitting valid emails, there are still infinitely many ways to test it.

At the risk of being too prescriptive, something we can likely do is break down the behaviour into two pieces:

  1. Does the API return 400 Bad Request when the email fails validation?
  2. Does the email validation function fail for all invalid email addresses?

The first question is now much simpler. Using one example of an email that we know fails validation should be sufficient to answer it. We’ve essentially reduced an infinite number of possible negative test cases into a single equivalence partition; i.e. a group of negative test cases from which any one is capable of answering our question. If you like formal math-y lingo, you might call this “reducing the cardinality” or “normalizing” the set.

The second question now says nothing about the API at all and we can hopefully tackle it as unit tests. This does stray a bit into greybox testing, but I’m not against that sort of thing if you aren’t.
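
If the validation logic is exposed as a function we can call directly, say a hypothetical is_valid_email(), the second question turns into cheap, parametrized unit tests. This is only a sketch; the module path, the function name, and the particular flavours of “invalid” are all assumptions:

```python
# A sketch of unit-level negative tests against a hypothetical validator.
# The import path and function name are assumptions for illustration.
import pytest

from myproduct.validation import is_valid_email  # hypothetical module


@pytest.mark.parametrize("invalid_email", [
    "plainaddress",            # no @ at all
    "@missing-local.example",  # missing local part
    "user@",                   # missing domain
    "user@@example.com",       # double @
    "user name@example.com",   # unescaped space
])
def test_rejects_invalid_emails(invalid_email):
    assert not is_valid_email(invalid_email)
```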

Our job isn’t done, though. This still should raise some questions:

  1. Are there other ways the API could fail?
  2. Which negative scenarios should we automate at the unit level?

Let’s take each in turn.

For the first, even if the answer is yes, we should try the same trick of reducing many variations of errors into distinct modes or equivalence partitions. You might be interested in other boundaries as well. An empty email address, or no email field at all, are both distinct from the case of an invalid email. If there are other response codes that the API could return, like 401 Unauthorized or 404 Not Found, there’s a good chance you’ll want one case for each of those. You’re unlikely to need more than that, though, unless there are multiple distinct reasons for returning the same response. You could get deep into the intricacies like invalid JSON or changing the HTTP headers, but at that point you definitely need to ask if those are risks you’re worried about enough to put time into.
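
One representative request per failure mode is usually enough. Here is a sketch of what that might look like, with the payloads and expected status codes being my assumptions about how such an API could behave:

```python
# One representative case per distinct failure mode (equivalence partition).
# Payloads and expected status codes are illustrative assumptions.
import pytest
import requests

VALIDATE_URL = "https://api.example.com/validate-email"  # hypothetical endpoint


@pytest.mark.parametrize("payload, expected_status", [
    ({"email": "not-an-email"}, 400),  # invalid email
    ({"email": ""}, 400),              # empty email
    ({}, 400),                         # no email field at all
])
def test_one_case_per_failure_mode(payload, expected_status):
    response = requests.post(VALIDATE_URL, json=payload)
    assert response.status_code == expected_status
```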

Now the second question. You’ve probably already caught that this is the same as the “interesting question” we started with. We’ve just bumped it down to a lower level of testing. At least that means we’ve made each example cheaper to test, though there are still infinitely many.

No matter what subset of the infinitely many invalid emails you pick to test, you should still be able to articulate why each version of “invalid” is different from each other version. Can you tie each one back to specific product-level risks? Probably not, in all practicality. At a product level, invalid is invalid. I doubt you’ll be able to get anybody to say that testing one invalid email mitigates a different risk than any other invalid email. Unfortunately, it doesn’t follow that you only need to test one, because whatever method your product uses to validate email addresses could still be flawed.

Remember that ideally we are working at the unit level here, so hopefully you agree that it is fair to go into whitebox testing at this point. At worst, you might end up reading RFC 5322 until you go cross-eyed trying to identify what actually makes an email valid or not. If you did, you could devise one negative example for each rule in that specification. More likely, you will find that either the product is using a much simpler definition of “valid” than the actual email specification, or it is using a 3rd party library.
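
For example, a home-grown definition is often nothing more than a single pattern. The sketch below is entirely hypothetical, but it is representative of the kind of validator that is far narrower than RFC 5322:

```python
# A hypothetical, deliberately simplistic validator of the kind many products
# actually ship: one regex, far narrower than what RFC 5322 allows.
import re

SIMPLE_EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def is_valid_email(address: str) -> bool:
    # Rejects plenty of RFC-valid addresses (quoted local parts, address
    # literals, and so on), but it is easy to reason about and to test against.
    return bool(SIMPLE_EMAIL_PATTERN.match(address))
```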

In the former case, your product team has to accept the risk of rejecting potentially valid addresses, but at least understanding the product’s definition of “invalid” will define a much narrower set of test cases. Each negative test you use should map directly to your product’s definition. You could have fun coming up with valid emails (according to RFC 5322) that your product calls invalid, or vice versa. I once had to do exactly this as a way of needling our product team into improving our home-grown definition of a valid domain name, which has its own similarly complicated RFC. If your product’s definition is changed to account for those counterexamples, they are good candidates to retain in your tests. If not, it can be helpful to keep them as examples of how your spec knowingly diverges from the official spec, but make sure the test is explicit about whether the difference is a feature or a known bug. Future generations may look at specs like that and wonder whether there was a reason you have to accept invalid (or reject valid) emails, or if it was just a case of cutting corners in your validation. That is, they need to be able to know whether it is safe to change those kinds of behaviours. (“Should you write tests for known bugs?” is a good topic for a separate discussion.)
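
One way to keep those counterexamples around is a separately labelled test that documents the divergence instead of asserting that it is desirable. This is just a sketch; the custom marker, the module path, and the specific address are all assumptions:

```python
# A sketch of documenting a known divergence from RFC 5322: the address below
# is valid per the RFC but rejected by the (hypothetical) product definition.
import pytest

from myproduct.validation import is_valid_email  # hypothetical module


@pytest.mark.known_divergence  # hypothetical custom marker: intentional, not a bug
def test_rfc_valid_quoted_local_part_is_rejected():
    """RFC 5322 allows quoted local parts; our simpler definition does not.

    This records a product decision, not a regression. If the definition of
    "valid" is ever widened, revisit this test.
    """
    assert not is_valid_email('"john..doe"@example.com')
```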

In the latter case, where the product uses a third-party library, you’re likely not interested in testing that library’s internals too deeply. Your scope of testing is now defined by “how much do we trust (or distrust) this library?” If the answer is “not at all”, then you’re back to the RFC and trying to violate each rule in turn. If the answer is “completely!” then you likely don’t need any more than a few broad examples (as long as they are all different from each other). One technique that sometimes works is to come up with the most outrageous input possible so you can say “if it knows this is valid, it probably understands much more normal input too”. The trick is being deliberate about your choice of “outrageous”.
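
As a sketch of that “outrageous input” idea, one deliberately convoluted but still-valid address can stand in for a lot of ordinary ones. The endpoint and payload shape remain hypothetical:

```python
# A deliberately convoluted but valid address: if the validator accepts this,
# it probably handles ordinary addresses too.
import requests

VALIDATE_URL = "https://api.example.com/validate-email"  # hypothetical endpoint


def test_outrageous_but_valid_email_is_accepted():
    outrageous = "very.common+tag.sorting@sub-domain.example-host.co.uk"
    response = requests.post(VALIDATE_URL, json={"email": outrageous})
    assert response.status_code == 200
```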

Finally, if you’re one of the unfortunate people who doesn’t have the option of moving scenarios down into unit tests, you’ll still have to answer these same questions at the API level anyway. My advice would still be to have one test as the canonical answer to “does an invalid email get rejected”, and a separate group of tests that are explicitly labelled as testing what it means to be an “invalid email”. Then the reason for having each set is, at least, explicit. You can still test changes to the definition of “valid” separately from changes to the API’s reaction to it.
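
At the API level, that separation could look something like the sketch below: one canonical rejection test, and a clearly labelled group for the cases that probe the definition of “invalid”. The structure is the point here; the payloads are assumptions:

```python
# A sketch of keeping the two reasons-to-test explicit at the API level.
import pytest
import requests

VALIDATE_URL = "https://api.example.com/validate-email"  # hypothetical endpoint


def test_invalid_email_is_rejected():
    """Canonical answer to: does the API reject an invalid email at all?"""
    response = requests.post(VALIDATE_URL, json={"email": "not-an-email"})
    assert response.status_code == 400


class TestDefinitionOfInvalidEmail:
    """Probes what the product means by an invalid email. These tests change
    when the definition changes, not when the API's reaction does."""

    @pytest.mark.parametrize("email", [
        "user@",                  # missing domain
        "user@@example.com",      # double @
        "user name@example.com",  # unescaped space
    ])
    def test_rejected_variants(self, email):
        response = requests.post(VALIDATE_URL, json={"email": email})
        assert response.status_code == 400
```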

I recognize that getting into the weeds of email’s RFC specifications is not likely what Mark intended with this specific example, but I think the lessons here still carry over to other features that don’t have public standards behind them. You can’t test each negative test case. You can limit the scope of what “negative” means based on the level of testing you’re in. You can keep testing at higher levels simple by building on the tests at lower levels. And, you can reduce infinite negative examples to distinct classes to test one example of each.

(Re-reading this later, I realize there might also be some subtle terminology things that change the question: what is the difference between a “test”, a “test case”, and a “scenario”, for example? I know testers love bickering about terminology but I tend to lump things together. If your preferred definitions would change the above, feel free to mentally substitute in whichever words you think I’m actually talking about.)
