On October 20th, AWS's US-East-1 region experienced a significant incident impacting multiple underlying services.
As Cronofy's US data center is hosted in this AWS region, its operation was severely impacted. None of Cronofy's other data centers were affected at any time.
The main impact was felt between 07:15 and 09:15 UTC, when many critical components struggled to communicate with AWS services. Performance was severely degraded throughout this period.
We saw some recovery from 09:15 UTC, with the backlog of high-priority tasks completed by 09:30 UTC; these are the tasks most likely to have a user-noticeable impact.
Some issues remained, mainly the inability to provision additional servers, which was widely reported by other AWS customers.
However, we had sufficient capacity to clear the backlog of lower-priority tasks by 10:20 UTC. Performance of the US data center was back to normal from this time, and we continued to monitor.
Throughout the day we ran with an altered configuration to obtain as much capacity as we could from AWS and provide as smooth a service as possible.
Our ability to provision additional capacity was restored at 19:15 UTC.
The AWS incident was resolved at 22:53 UTC, and we consider this incident to be closed.
We always ask the questions:

Could this have been resolved sooner?
Could this have been identified sooner?
Could this have been prevented?

In this particular instance, our customers could have been informed sooner. Our follow-up actions are to:

Review backup StatusPage options
Review deployment pipeline setup for added resiliency in such situations