skip to main content

gerg.dev

Blog | About me

Yes, I test in production

Recently a post on Reddit about a company doing tests on a live production environment sparked some conversation on the Testers.io slack channel about whether “testing in production” is a wise idea or not. One Reddit user commented saying:

Rule number 1 of testing (i.m.o): DO IT ON A NON-PRODUCTION ENVIRONMENT.
Rule number 2 of testing (i.m.o): Make sure you are NOT, I repeat, NOT on a production environment.

Three years ago I might have agreed with that. Today, I absolutely don’t. Am I crazy?

Never test in production… for some definitions of production

The original post describes a situation where some medical equipment stopped working over night. After much debugging and technical support, the cause was identified to be that the machines were remotely put into a special mode for testing by the vendor and not restored before the morning.

There’s unlikely anything controversial in saying that this wasn’t a good move on the vendor’s part. They were messing with something a customer was, or could have been, using. Without notifying them about it. Though it’s all the more egregious because of being medical equipment, any customer would be annoyed by this when they found out. But you can’t extend that in a blanket way to all kinds of production environments.

There is certainly a lesson to be learned here, but we will get more from it by being more specific. One might suggest any of:

  • Never test something your customer is already using
  • Never test in a way your customer will notice
  • Never test something your customer will notice without telling them

But I can think of counter-examples to each of these, and it boils down to a very simple observation.

If you never test x, you will miss things about x

If you never test in production, you’re robbing yourself of insights about production. You won’t necessarily miss bugs, but unless you have a test environment that mimics every aspect of production perfectly (and none of us do), there will be something that goes on in production that you won’t see.

This is what I didn’t understand four years ago when I started in this line of work. In my first testing job, we didn’t test in production, so why would we test in production? The naïveté of a newbie tester, eh?

Over time we got bit by this in many different ways. The most common category of issues were use cases happening in production that we simply didn’t anticipate. Of course, some of these you might find earlier by using production data in a test environment, but you will always be limited by your sample. A second category of issues would then arise around faulty assumptions in your test environment. It’s those little details that “shouldn’t affect testing” until they do. If you’re lucky you spot an error right away. If you’re less lucky you push a feature that quietly does nothing until somebody notices it isn’t there. If you’re really having a bad day it silently does the wrong thing.

It’s around this time that you start to catch on to the fact that you need to test new deployments, at the very least to verify that something is working “in the real world”. At this point you’re testing, to some degree, in production. Are we as far along as turning off medical equipment a customer is using? No, but we are already bending the “never test in production” rule.

This is just the beginning of a long list of reasons; iAmALittleTester has compiled a list of many similar scenarios where testing in production can provide information that you aren’t getting otherwise. All of these, though, count on the fact that you aren’t going to break anything by testing. (Maybe this is the crux: if you think that the job of a software tester is to break the software, then you likely think “testing in production” is synonymous with “breaking production”!)

What can you do safely?

One of the key differences between testing in a web or back-end server environment and hardware owned by a customer is that I can usually send requests to a server and examine the response without impacting any other requests hitting the same server. I probably can’t do that with customer hardware. The parallel would be that you probably shouldn’t use a customer’s account for your tests. You need to be careful about any state or statistics a system might be keeping track of. (Though I would raise an eyebrow at anything that needs to be aware that it is being tested. If you have “test mode” in your code, you’ve just created a mini test environment in a live program. Are you gaining any advantages from being in production in that case?)

Whether this works, and the types of things you can do, will depend on the issues you expect to find. I’m not going to try stress testing my production environment during peak traffic. If I suspect that a certain kind of request will corrupt the state of the server, I’m certainly not going to do that in production either. If my test has any chance of having a negative effect on a user, I’m not going to risk that. But on a web app, one more anonymous request should be no different from what “real” users are sending the app. And on the subject of “real” users…

I’m not just a QA, I’m also a user

This is the aspect that I’ve found most useful about testing a production environment. Much like experience is often best gained by doing, knowledge of how a product works may be best gained by using it. If you only use a product in a test environment, then you only know about how your product works in a test environment. There are lots of insights about a product that can’t come from simply using it, of course, and in some cases it isn’t realistic to expect to be able to use a product as much as the intended users. But if it possible, if you can make it possible, then it is an opportunity to see things in a different way.

When I do get to use something I’m working on like an end-user does, on some level I’m always testing it. It’s not a big leap to say that your end users are, on some level, testing your product every time they use it. If your users are testing in production, why aren’t you?


Addendum: Rosie Sherry pulled together some resources on testing in production over at the MoT Club that are definitely worth checking out. Chaos Monkey from Netflix is one really interesting way of testing your production setup that I meant to mention in this post, but was only reminded of again when I came across this thread.

About this article

Leave a Reply