Antifragile: Fail often, detect quickly, recover fast
If you’re responsible for delivering IT services, then you've probably noticed an ever-increasing demand for very high levels of availability. Services that used to be provided only to internal users are now increasingly expected to deliver value directly to external customers. This means that IT failures are now highly visible to those customers, and sometimes to the press and to competitors too.
I read reports of catastrophic IT failures nearly every week. Some of them cost huge amounts of money, or cause untold damage to the reputation of the organization involved. So what can you do to avoid finding your own organization in this situation?
The fortress approach
The old-school approach involves a herculean effort to design IT services that never fail. This is often called a “fortress approach” because, like the stone castles of old, it is intended to repel all attacks and remain standing: rigid and reliable. Fortress IT solutions can indeed be very reliable. But the trouble with a fortress is that when it does fall, the consequences tend to be catastrophic for those it was supposed to protect. And, as those of us who work in IT know all too well, however strongly we build our fortress, we can never anticipate every possible thing that might happen. We may have excellent risk management and succeed in identifying 99.9% of the things that might go wrong, but that remaining 0.1% will eventually turn up, and when it does we’re in trouble. Something unexpected will get past our defences, and recovering from a successful attack on a fortress can take a very long time.
The antifragile approach
Paradoxically, by defining success as never failing, we guarantee that somewhere along the line we are going to fail. The alternative to building a fortress is to accept that failures are going to happen, and to focus on making sure that they don’t have a significant impact when they do. The humble dandelion isn’t very rigid, and it is very easy indeed to cut off the flowers and the leaves. But this has remarkably little effect on the plants. Even if you cut them down to the ground they grow back so fast that you hardly have time to notice they were gone. And if you succeed in digging up the roots, the little seeds you missed will be sprouting in no time flat. This is the principle behind antifragile. An antifragile IT solution is designed, like the dandelion, to be able to shrug off failures. However often it fails, and however much damage it takes, it just springs back to life, as strong as when it started.
For IT, this means that we design and operate our services in the full knowledge that they are going to fail. We therefore ensure that when they do fail we know about the failure very quickly, and are in a position to recover from it very quickly. Typically, this means having the resources available to switch our services across to alternative components while the originals are repaired or replaced, and having a plan in place to let us do this fast. But there are many other things we can do to help ride through the potential impact of IT failure.
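To make the idea of detecting a failure and switching to an alternative component a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Component class, the is_healthy probe and the pick_active function are hypothetical names, and a real service would probe something like an HTTP endpoint or a heartbeat rather than a hard-coded function.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Component:
    """A hypothetical service component with a health probe."""
    name: str
    is_healthy: Callable[[], bool]  # assumed probe; real checks would do I/O


def pick_active(components: Sequence[Component]) -> Optional[Component]:
    """Return the first component whose health probe passes, or None."""
    for component in components:
        try:
            if component.is_healthy():
                return component
        except Exception:
            # A probe that raises is treated the same as a failed component.
            continue
    return None


# Illustrative data: the primary has failed, the standby is available.
primary = Component("primary", lambda: False)
standby = Component("standby", lambda: True)

active = pick_active([primary, standby])
print(active.name if active else "no healthy component")
```

The ordering of the list is the recovery plan in miniature: the service runs on the first healthy component it finds, so the failed original can be repaired or replaced without anyone having to decide where to switch.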
Finally, and most importantly, we constantly test our ability to detect and recover from different types of failure. This testing is an absolutely essential component of an antifragile approach, because a recovery strategy that hasn’t been tested is very unlikely to work. Some organizations go to extreme lengths to test their recovery procedures: they intentionally inject failures into their production environment during normal operations. This provides their staff with constant practice in detecting and recovering from failures, and helps to focus people’s attention on the need to keep the recovery mechanisms up to date. For example, Netflix created a tool called Chaos Monkey that randomly terminates virtual machines in their environment. They explain that “By frequently causing failures, we force our services to be built in a way that is more resilient.”
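The following Python sketch shows the shape of such a drill on a small simulated pool of machines. It is not Chaos Monkey and does not use any Netflix code or API; the function names, the instance names and the recover step are all hypothetical, and the "termination" only removes a name from an in-memory set.

```python
import random
from typing import Set


def terminate_random(instances: Set[str], rng: random.Random) -> str:
    """Inject a failure: remove one randomly chosen instance, return its name."""
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim


def recover(instances: Set[str], desired: int) -> None:
    """Hypothetical recovery step: add replacements until the pool is full."""
    counter = 0
    while len(instances) < desired:
        candidate = f"replacement-{counter}"
        if candidate not in instances:
            instances.add(candidate)
        counter += 1


def chaos_drill(instances: Set[str], desired: int, rounds: int, seed: int = 0) -> bool:
    """Repeatedly inject a failure, recover, and verify the pool is restored."""
    rng = random.Random(seed)
    for _ in range(rounds):
        terminate_random(instances, rng)
        recover(instances, desired)
        if len(instances) != desired:
            return False  # recovery did not restore capacity
    return True


pool = {"vm-a", "vm-b", "vm-c"}
print("drill passed:", chaos_drill(pool, desired=3, rounds=5))
```

The point of the sketch is the loop rather than the details: each round breaks something on purpose and then checks that recovery really did restore capacity, so a recovery mechanism that has quietly stopped working is found during a drill rather than during a real outage.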
How fragile are your services?
How about your services? Would you dare to run a Chaos Monkey to randomly remove infrastructure? If not, then perhaps you should carry out a thought experiment to help you identify possible improvements. Think about each component in your environment. What might happen if it suddenly failed? How would the failure be detected? Would detection be automatic or would it depend on a person? How would your service recover? Think very carefully about any steps that need to be performed manually, because manual recovery steps are often slower and less reliable than automated recovery. Automating anything that can be automated might be an excellent place to improve the availability of your services.
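If it helps to write the thought experiment down, the answers can be recorded per component and the manual steps listed automatically. The Python sketch below is one hypothetical way to do that; the ComponentReview fields and the example components are invented for illustration and are not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComponentReview:
    """Answers to the thought experiment for one component (hypothetical)."""
    name: str
    detection_automatic: bool
    recovery_automatic: bool


def manual_steps(reviews: List[ComponentReview]) -> List[str]:
    """List every place where detection or recovery depends on a person."""
    findings = []
    for review in reviews:
        if not review.detection_automatic:
            findings.append(f"{review.name}: failure detection depends on a person")
        if not review.recovery_automatic:
            findings.append(f"{review.name}: recovery needs manual steps")
    return findings


# Illustrative answers only; replace with your own components.
reviews = [
    ComponentReview("load balancer", detection_automatic=True, recovery_automatic=True),
    ComponentReview("database", detection_automatic=True, recovery_automatic=False),
    ComponentReview("batch scheduler", detection_automatic=False, recovery_automatic=False),
]

for finding in manual_steps(reviews):
    print(finding)
```

Each line of output is a candidate for automation, which makes the list a reasonable starting backlog for availability improvements.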
If you’ve ever visited an old castle, you might have noticed dandelions helping the walls to gradually crumble as their roots force out the mortar that holds them together. If you’ve built IT services that are rigid and should never fail, then maybe it’s time to think about failing a bit more often!