Recovering Failure in AWS on a Tuesday Morning

By Will Hall

As I am sure many of you know, AWS had an Availability Zone failure within the EU-West-2 (London) region on Tuesday 25th August. But what do you do, and how, when you are a managed services company trying to ensure appropriate availability for your customers across AWS?

Tuesdays in August in the UK are usually punctuated by the drinking of tea, not by large AWS failures. Realistically, major datacentre events are not particularly common, and full outages are rarer still; when they do occur, the cause is usually either an extreme natural phenomenon or a scaled configuration error (the latter being far more likely).

All was not OK in EU-West-2

From a HeleCloud perspective, the first alert came from the SIEM (Security Information and Event Management) system we run for one of our customers, which reported degraded performance across an ElasticSearch cluster. Whilst these events are not regular, they are also not unheard of. In such cases we would normally release instances from the failing piece of infrastructure and quickly rebuild in a new location without any performance degradation. However, this event was different: we were having issues communicating via the API itself. The AWS API is the core connection between the outside world requesting resources and infrastructure and the datacentre.

Upon seeing these alerts, we immediately opened a request with AWS Support highlighting the problem and set about trying to understand the scope and complexity of the issue. Looking back, we became aware of the issue at around 02:16 PT, roughly 10 minutes after AWS did (as addressed in later status updates).

Diagnosing the issue

One of the great advantages of running Managed Services for many clients is that not only do we have experience of a wide range of different challenges, but we also have a range of different estates simultaneously running infrastructure across the whole of AWS, so we can compare across different accounts, Regions and Availability Zones to discover outages more quickly.

Whilst the Managed Services team were keeping our customers informed of the potential issues, we were also reaching out across the company to build a consensus on what the issue was and how we could mitigate it. By drawing on our knowledge across the AWS estate, we could see that the problem was focussed on EC2 and RDS instances, and further investigation showed that it was within the euw2-az2 Availability Zone (mapped to eu-west-2b in the affected accounts; AZ names are shuffled per account, so the AZ ID is the reliable identifier).
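Because AZ names like eu-west-2b map to different AZ IDs in different accounts, resolving the affected AZ ID back to each account's local name is a small but important step. A minimal sketch of that lookup is below; the zone records are illustrative stand-ins for what `ec2.describe_availability_zones()` returns via boto3 (each entry carries a `ZoneName` and a `ZoneId`), and the mapping shown is hypothetical.

```python
# Sketch: resolve which per-account AZ name corresponds to an AZ ID.
# The zone data below is illustrative; in practice it would come from
# ec2.describe_availability_zones() via boto3.

def name_for_zone_id(zones, zone_id):
    """Return the account-local ZoneName for a given ZoneId, or None."""
    for zone in zones:
        if zone["ZoneId"] == zone_id:
            return zone["ZoneName"]
    return None

# Example mapping for one hypothetical account -- the name-to-ID mapping
# differs between accounts, which is why we key on the ZoneId.
zones = [
    {"ZoneName": "eu-west-2a", "ZoneId": "euw2-az1", "State": "available"},
    {"ZoneName": "eu-west-2b", "ZoneId": "euw2-az2", "State": "available"},
    {"ZoneName": "eu-west-2c", "ZoneId": "euw2-az3", "State": "available"},
]

print(name_for_zone_id(zones, "euw2-az2"))  # -> eu-west-2b
```

Keying every cross-account comparison on the AZ ID rather than the name is what lets estates in different accounts be compared like for like.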

A good secondary check is always to look in various places on the Web to see whether others are reporting outages too; Reddit and Twitter are usually quite fast in these kinds of situations. This does not replace our own investigation, but it is another data point that we like to keep an eye on.

Beginning to Resolve the Issues

Now that we had diagnosed the issue as affecting EC2 and RDS in euw2-az2, we started mitigation efforts. With the root cause lying in both the connectivity and the performance of the infrastructure in that Availability Zone, it was not simply a question of restarting the instances so that they landed on working hardware. As you are probably aware, running in multiple Availability Zones is part of standard AWS architectural planning, and you would normally run instances in more than one Availability Zone for exactly this kind of situation. Even so, the first task was to identify all of the failing underlying infrastructure. The process was as follows:

  1. Identify anything running only in euw2-az2 (that is going to be failing now)
  2. Identify anything running partly in euw2-az2 (that is going to be underperforming)
  3. Move anything that can be moved to other Availability Zones, where possible.
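Steps 1 and 2 amount to classifying each workload by its exposure to the failing AZ. The sketch below shows one way to do that over instance records; the records and the `Service` grouping key are illustrative, standing in for what `ec2.describe_instances()` returns (where each instance carries its AZ under `Placement.AvailabilityZone`, typically alongside tags you would group on).

```python
# Sketch of steps 1 and 2: classify workloads by exposure to a failing AZ.
# Instance records are illustrative; in practice they would come from
# ec2.describe_instances() via boto3.

FAILING_AZ = "eu-west-2b"

def classify_by_az(instances, failing_az):
    """Group instances per service, then classify each service as
    'fully_affected', 'partly_affected', or 'unaffected'."""
    services = {}
    for inst in instances:
        services.setdefault(inst["Service"], []).append(inst["AvailabilityZone"])
    result = {}
    for service, azs in services.items():
        in_failing = sum(1 for az in azs if az == failing_az)
        if in_failing == len(azs):
            result[service] = "fully_affected"    # step 1: failing now
        elif in_failing > 0:
            result[service] = "partly_affected"   # step 2: underperforming
        else:
            result[service] = "unaffected"
    return result

# Hypothetical estate for illustration.
instances = [
    {"Service": "web",   "AvailabilityZone": "eu-west-2a"},
    {"Service": "web",   "AvailabilityZone": "eu-west-2b"},
    {"Service": "batch", "AvailabilityZone": "eu-west-2b"},
    {"Service": "api",   "AvailabilityZone": "eu-west-2a"},
]

print(classify_by_az(instances, FAILING_AZ))
```

The "fully affected" bucket gets attention first, since those services are already down rather than merely degraded.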

For services such as RDS, the challenge lay with database servers located in a single AZ, since multi-AZ databases had already failed over to a different Availability Zone. Having effective infrastructure as code meant we were able to move the majority of these services within a short window, and architecturally there were very few single-AZ databases. In general, we avoid running production RDS instances in a single Availability Zone for exactly this reason.
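Finding the exposed databases can be sketched as a simple filter: any non-Multi-AZ instance sitting in the failing AZ needs manual attention. The records below are illustrative stand-ins for what `rds.describe_db_instances()` returns (each entry carries a boolean `MultiAZ` field and an `AvailabilityZone`); the database identifiers are hypothetical.

```python
# Sketch: flag RDS instances that are single-AZ and therefore exposed to
# an AZ failure. Records are illustrative; in practice they would come
# from rds.describe_db_instances() via boto3.

def single_az_databases(db_instances, failing_az):
    """Return identifiers of non-Multi-AZ databases in the failing AZ."""
    return [
        db["DBInstanceIdentifier"]
        for db in db_instances
        if not db["MultiAZ"] and db["AvailabilityZone"] == failing_az
    ]

db_instances = [
    {"DBInstanceIdentifier": "orders-prod", "MultiAZ": True,
     "AvailabilityZone": "eu-west-2b"},   # fails over automatically
    {"DBInstanceIdentifier": "reports-dev", "MultiAZ": False,
     "AvailabilityZone": "eu-west-2b"},   # needs manual migration
    {"DBInstanceIdentifier": "users-prod", "MultiAZ": False,
     "AvailabilityZone": "eu-west-2a"},   # unaffected
]

print(single_az_databases(db_instances, "eu-west-2b"))  # -> ['reports-dev']
```

With infrastructure as code in place, the output of a check like this becomes the worklist for redeploying each database into a healthy AZ.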

Minimal Disruption Migration

As with any kind of issue resolution, the aim is to cause minimal disruption to end services whilst resolving the problem. Instances already within euw2-az2 were severely degraded, so it was critical to move them to another Availability Zone to recover applications and services.

We began looking to move services out of the Availability Zone within the first hour of the issue and ran team-wide discussions on how best to resolve it with minimal disruption. We stress minimal disruption because, as a managed services and professional services company, it is important that we run our customer workloads with the utmost care. Within the first two hours of disruption, we were able to resolve or mitigate 80% of the issues within the AZ.


What We Learnt from the Outage

There are a few lessons to take from the recovery of euw2-az2. Firstly, it does matter that you architect your solutions to be multi-Availability Zone (multi-AZ). There is a big difference between being multi-region (eu-west-1, eu-west-2, eu-central-1) and multi-AZ, and for the majority of production services we would not recommend anything that does not run across multiple Availability Zones.

Secondly, there is no replacement for knowledge and experience. Having a team of our best people working and probing to find the limits and extent of the issue was vital to both understanding and mitigating the problem quickly. I would suggest that when this situation arises again, you get together with other clever people (whether internally on Slack/Teams or externally on Twitter) and try to find solutions.