Stop Trying to Avoid Infrastructure Failures – Count on Them!


As companies migrate en masse from traditional data centers to the cloud, they need to adopt new mindsets and design principles along the way. Some cloud service providers give their customers nearly unlimited flexibility in designing their application’s architecture; however, that flexibility is worthless if you, as the customer, don’t use it to build a durable and reliable application.

With Flexibility Comes Responsibility

Compared to the traditional data center, the cloud is phenomenally flexible. As the customer, you are free to design the infrastructure for your application in almost any way you see fit. Your cloud provider might not even know – or care – about the nature of your application and the way you deploy it on their infrastructure. Instead of simply handing you an SLA, your provider gives you full control over the infrastructure and transfers the responsibility to you.

Think about it: instead of getting a promise from your service provider that the infrastructure will be available for some minimum amount of time per month or year, the service provider hands you full control over the infrastructure. Building a mission-critical application? Create a multi-tiered architecture distributed over three data centers in separate geographical locations. Hosting a hobby blog where availability isn’t an issue? Create a simple environment and pay much less. It’s really up to you. You have the flexibility – as well as the responsibility – to decide which infrastructure your application needs.

 

Disposable Infrastructure

Cloud computing adds new levels of abstraction between your application and the bare metal on which it runs. This allows you to design applications that don’t rely on specific resources. When designing your application to run in the cloud, you can usually avoid geographically-bound elements in your architecture entirely – no hard-coded IP addresses, storage logical units or anything similar. The infrastructure on which your application runs has no identity: if anything happens to one of your servers, for example, instead of trying to fix the problem you can simply discard that server and launch a new, healthy one in a matter of minutes. You don’t really care about the IP address, hostname or physical location of the new server – it simply joins your pool of existing resources and starts handling some of your application’s workload. With the orchestration tools provided by most cloud service providers today, this process can even be fully automated.
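
As a rough illustration, here is a minimal sketch of the “discard and replace” idea using boto3 against an AWS Auto Scaling group; the group takes care of launching the replacement. The instance ID and the assumption that your servers run inside an Auto Scaling group are hypothetical, not something every deployment will share.

import boto3

# Minimal sketch: treat a misbehaving server as disposable. Marking it
# unhealthy makes the Auto Scaling group terminate it and launch a fresh
# replacement; nobody tries to "repair" the old one.
autoscaling = boto3.client("autoscaling")

def discard_server(instance_id):
    autoscaling.set_instance_health(
        InstanceId=instance_id,            # the server we no longer care about
        HealthStatus="Unhealthy",
        ShouldRespectGracePeriod=False,    # replace it right away
    )

# Usage (the instance ID is a placeholder):
# discard_server("i-0123456789abcdef0")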

Another good idea is to prefer many small, distributed elements over a few large ones. The tools offered by cloud providers today make it very easy to distribute your traffic evenly over a large number of elements. If your application is designed properly, your customers might not even notice a failure, since the remaining servers gracefully pick up the traffic in the meantime. And even if something goes wrong and your application doesn’t handle a failure properly, only a small portion of your traffic is affected when that traffic is spread across many elements.
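
To make that arithmetic concrete, here is a small, provider-agnostic sketch of a round-robin dispatcher over a pool of ten hypothetical servers; when one of them fails, only its share of the traffic needs to be re-routed.

import itertools

# Hypothetical pool of many small servers sitting behind a load balancer.
servers = ["web-%02d" % i for i in range(10)]
healthy = set(servers) - {"web-03"}     # suppose one server has just failed

rotation = itertools.cycle(servers)

def route_request():
    # Round-robin over the pool, skipping servers that are known to be down.
    for _ in range(len(servers)):
        candidate = next(rotation)
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy servers available")

# One failed server out of ten means roughly 10% of capacity is gone;
# the other nine keep answering requests, so most customers never notice.
print(route_request())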

 

Design for Failure

Physical things are bound to fail. Servers crash, hard drives get corrupted, network cables get tripped over – you get the picture. While the traditional solution to this kind of problem is to build an ultra-durable infrastructure and try to ensure it never goes down, the more suitable approach for applications in the cloud is to build applications that expect failures and handle them gracefully. Choosing this approach frees you from constantly worrying about the availability of the physical infrastructure your application runs on, because you know that even when a failure occurs, your service won’t be affected.

Treating your infrastructure as disposable gives you the peace of mind that your application can actually survive failures. A great example of this is Netflix, which runs one of the largest cloud deployments to date. The engineers at Netflix developed a utility, Chaos Monkey, which intentionally kills random servers (yup, that’s right!) in order to guarantee that their applications can survive failures.
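
The heart of that idea fits in a few lines. The following is only a rough sketch in the same spirit – not Netflix’s actual tooling – and it assumes your servers belong to an AWS Auto Scaling group (the group name here is made up) and that boto3 is configured with the right credentials.

import random
import boto3

autoscaling = boto3.client("autoscaling")

def kill_random_instance(group_name="web-asg"):
    # Pick one random instance from the group and terminate it, then watch
    # whether the application keeps serving traffic while it is replaced.
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"][0]
    victim = random.choice(group["Instances"])["InstanceId"]
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=victim,
        ShouldDecrementDesiredCapacity=False,  # the group launches a replacement
    )
    return victim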

This may sound a little extreme at first, but when you think about it, it is actually a very good idea. Imagine that your last infrastructure failure was two years ago, and since then your code has been updated dozens of times. How confident would you be that when something eventually happens to your infrastructure, everything will behave properly? “Well, this is why we tested everything in our development environment before deploying to production”, you might think. But even if your production infrastructure and deployment procedure are 100% identical to your development environment (which is rarely the case), production is still a different environment, so it might behave differently.

 

Fail Often!

To conclude, if you are worried about your application’s availability, your best bet might be to fail often. Design your application to gracefully handle infrastructure failures, create intentional failures every once in a while, verify that your application recovers properly and learn from your mistakes when it doesn’t. It also makes more sense to initiate these failures during working hours, when your operational teams are available to handle problems, than to wait for a random failure (which tends to happen at the worst possible time) and hope that your on-call engineer will be able to solve the problem quickly.

You can’t design an infrastructure that never fails, but you can definitely design applications that handle failures properly.

Originally Posted by TekJoe 

AllCloud Expert
