On Wednesday December 4th between 13:30 and 19:40 UTC our US data center experienced a prolonged period of degraded performance primarily impacting background processing.
This meant that operations such as synchronizing schedules, which usually commence within seconds, were at times delayed by a minute or more.
This was caused by a degradation in an AWS managed service that is vital to our ability to scale our capacity to match demand. We are awaiting a root cause analysis (RCA) from AWS on this, but don't feel it is overly material to our own postmortem as we've been told we could not have resolved the underlying problem ourselves.
We will update this postmortem as necessary once the RCA has been received from AWS.
Further details, lessons learned, and the actions we will be taking can be found below.
All times are rounded for clarity and are in UTC.
On Wednesday December 4th at 10:50 an additional permissions policy was added to the Amazon Elastic Kubernetes Service (EKS) clusters in our DE, US, and non-production environments. These three environments are older than the others, and so they lacked some permissions the newer environments had inherited by default. This was not affecting the operation of any of the data centers, but we wanted to bring their configuration in line after noticing the difference.
These changes were applied successfully and everything appeared to operate as normal. More than two hours later, at 13:20, we saw the first signs of an issue within our US data center.
Without the RCA from AWS, we are assuming the configuration change is somehow related, but the fact that it only affected one of the three altered environments casts some doubt on that.
EKS provides the control plane of Kubernetes, with the nodes from the worker pool communicating with it to coordinate the distribution of work and scaling activity.
At this time, the nodes and the processes running on them stopped being able to communicate with the EKS control plane as usual. This meant that processes responsible for triggering deployments to scale up could not do so, and processes that relied on obtaining leases from the control plane to elect a leader could not do that either. Most crucially, it meant that newly provisioned nodes could not fully join the cluster, as they could not provision their networking stack and signal themselves as ready to run other processes.
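To make that failure mode concrete, the sketch below (not our actual tooling; it uses the Kubernetes Python client and a hypothetical kubeconfig context name) lists the nodes a cluster knows about and whether each has reported the Ready condition, which a node only reaches once it has registered with the control plane and provisioned its networking stack.

    # Minimal sketch, not production tooling: list nodes and whether they have
    # reported Ready to the control plane. Nodes stuck joining the cluster show
    # up as NotReady, or never appear in the list at all.
    from kubernetes import client, config

    config.load_kube_config(context="us-production")  # hypothetical context name

    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"),
            "Unknown",
        )
        print(f"{node.metadata.name}: Ready={ready}")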
The first notification we received around the issue came at 13:40, and the first alert at 13:50.
At 14:00 we attempted to increase the capacity of the background processes to provide headroom, as things were not scaling dynamically, but this was unsuccessful due to the underlying issue. At 14:15 we made the scaling change more directly, provisioning as much capacity as we could.
We also attempted to add more compute capacity by adding more nodes to the cluster, but as they were unable to fully register themselves we were stuck with the capacity we had.
For context, on the previous Wednesday, background processing fluctuated between 30 and 100 replicas during this period, with the servers in the cluster also fluctuating to provide the capacity for those replicas to run.
As the issue began we were at 50 replicas; with direct intervention we were able to get to 70 replicas. The gap between the capacity we could provide and the capacity needed at peak is the source of the performance degradation. We were able to process all tasks successfully but did not have the throughput available to keep up with spikes in load as they arrived.
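For illustration, the direct intervention described above amounts to setting a deployment's replica count by hand instead of relying on the autoscaler. A minimal sketch of that, assuming a hypothetical deployment name and namespace and using the Kubernetes Python client:

    # Minimal sketch, with hypothetical deployment and namespace names: bypass
    # the autoscaler and set a deployment's replica count directly by patching
    # its scale subresource.
    from kubernetes import client, config

    config.load_kube_config()

    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name="background-workers",   # hypothetical deployment name
        namespace="production",      # hypothetical namespace
        body={"spec": {"replicas": 70}},
    )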
As we had made a change to multiple EKS clusters earlier in the day, we looked for signs of similar behavior in other environments, including those that were unchanged, but did not find any.
An incident being outside of our control, without it being part of a wider outage in a given AWS service or region, is historically rare. We spent the next two hours on activities such as undoing and redoing the change from earlier that day, manually comparing the configuration of multiple environments in case of some other drift, and the like.
At 16:00 we came together to review the situation. As part of this we realized the issue may have become noticeable to users, and that the situation was likely to worsen over the next hour as 17:00 is usually the time of peak load for our US data center. At 16:10 we opened this incident on our status page.
We decided to try and add capacity to our US data center by provisioning a new EKS cluster and working out how to scale up capacity there.
With our own diagnostic paths exhausted and our options going forward limited, we opened a ticket with AWS support at 16:55 whilst we worked on provisioning a sibling cluster.
At 17:25 AWS support requested permission to review logs, which we granted.
Work continued to provision a new EKS cluster into which we could successfully register new nodes. Work then switched to how and what we would need to deploy into the second cluster to get something that would function without causing more issues than it would solve.
At 18:45 we realized we had heard nothing from AWS for over an hour and so initiated a chat session. After some back and forth, we had confirmation at 19:09 that it was being investigated. At 19:13 we were asked to check how the cluster looked; there were signs of improvement but still issues. At 19:26 the AWS agent joined our conference call and helped us triage the lingering issues.
At 19:30 we'd been able to add additional nodes to the cluster which meant we could deal with the background processing backlog.
By 19:35 all issues within the cluster had been resolved and it further scaled up through the automatic mechanisms.
We continued to monitor whilst reverting the changes made to provide as good a service as possible throughout the incident, before returning fully to our usual configuration around an hour later.
The questions we ask ourselves in an incident retrospective are:

Could it have been identified sooner?
Could it have been resolved sooner?
Could it have been prevented?

Also, we don't want to focus too heavily on the specifics of an individual incident, instead looking for holistic improvements alongside targeted ones.
Could it have been identified sooner?
Yes. Whilst we received alerts, they were slower than we would like and pointed towards symptoms of the issue rather than the issue itself.
Could it have been resolved sooner?
Absolutely. With AWS having to resolve it, opening a ticket with them much sooner would have helped. We also suspect that opening a chat with them, rather than an email ticket, may have improved the speed of their response.
Could it have been prevented?
From our actions, we don’t believe so. The bargain you make with managed services is that if you use them correctly, they’ll work. To our current knowledge, AWS failed on their side of the bargain, which we can’t prevent without extremely significant day-to-day overhead.
This incident uncovered a flavor of infrastructure failure that is not sufficiently covered by our alerting. We’ll be reviewing and improving alerts within this area of our stack to guide our future selves more rapidly to the root cause of similar issues in future.
On a similar note, we found our diagnostic tools to be weak in this area, and relied on ad-hoc knowledge more than we would like. In concert with the review of alerts, we’ll improve our playbooks and scripts for assessing the situation when such alerts are triggered.
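As an example of the direction we mean, and only as a sketch rather than the alert we will actually ship, a check like the one below fires on nodes stuck in a NotReady state, which points at the cluster itself rather than at downstream symptoms such as queue latency:

    # Sketch of a cluster-level check, not our final alerting rule: flag any
    # node that has been NotReady for more than a few minutes, pointing at the
    # control plane rather than at downstream symptoms.
    from datetime import datetime, timedelta, timezone

    from kubernetes import client, config

    THRESHOLD = timedelta(minutes=5)  # hypothetical threshold

    config.load_kube_config()
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        for cond in node.status.conditions:
            if cond.type == "Ready" and cond.status != "True":
                stuck_for = datetime.now(timezone.utc) - cond.last_transition_time
                if stuck_for > THRESHOLD:
                    print(f"ALERT: {node.metadata.name} NotReady for {stuck_for}")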
Finally, we are documenting guidance around when and how support tickets should be raised with AWS to reduce the number of ad-hoc decisions we have to make on this front.
If you have any further questions, please contact us at support@cronofy.com.