Major outage in US data center

Incident Report for Cronofy

Postmortem

Summary

On October 20th, AWS's us-east-1 region experienced a significant incident impacting multiple underlying services.

This led to the operation of Cronofy's US data center, which is hosted in this AWS region, being severely impacted. None of Cronofy's other data centers were affected at any time.

The main impact was felt between 07:15 and 09:15 UTC, when many critical components struggled to communicate with AWS services. Performance was severely degraded throughout this period.

We saw some recovery from 09:15 UTC, with the backlog of high-priority tasks completed by 09:30 UTC; these are the tasks most likely to have a user-noticeable impact.

Some issues remained, mainly an inability to provision additional servers, which was widely reported by other AWS customers.

However, we had sufficient capacity to clear the backlog of lower-priority tasks by 10:20 UTC. Performance of the US data center was back to normal from this time, and we continued to monitor.

Throughout the day we ran with an altered configuration to obtain as much capacity as we could from AWS to provide as smooth a service as possible.
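
For illustration only: we have not published the exact configuration change, but assuming the affected capacity sits behind an EC2 Auto Scaling group (the group name below is hypothetical), the kind of adjustment that holds on to existing capacity during such an event looks broadly like the following Python (boto3) sketch.

    # Illustrative sketch only - not Cronofy's actual change. Assumes an EC2 Auto
    # Scaling group (name is hypothetical) and valid AWS credentials for boto3.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    GROUP = "cronofy-us-workers"  # hypothetical Auto Scaling group name

    # Stop the group from terminating or replacing the instances it already has.
    autoscaling.suspend_processes(
        AutoScalingGroupName=GROUP,
        ScalingProcesses=["Terminate", "ReplaceUnhealthy", "AZRebalance"],
    )

    # Mark every running instance as protected from scale-in.
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[GROUP]
    )["AutoScalingGroups"][0]
    instance_ids = [i["InstanceId"] for i in group["Instances"]]
    if instance_ids:
        autoscaling.set_instance_protection(
            InstanceIds=instance_ids,
            AutoScalingGroupName=GROUP,
            ProtectedFromScaleIn=True,
        )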

Our ability to provision additional capacity was restored at 19:15 UTC.

The AWS incident was resolved at 22:53 UTC, and we consider this incident to be closed.

Postmortem

We always ask the questions:

  • Could this have been resolved sooner?

    • No, we were very much at the mercy of AWS as they worked to resolve the issue and remove throttling as their services started to come back to life. The lack of access to the AWS console also made it difficult for our team to make meaningful changes to mitigate the impact at the time.
    • As a business we operate our data centers completely independently of each other. This was a decision we made in order to ensure our customers’ data remains exactly where they choose to store it, and it removes any concerns around cross-region data transfers. As a result, we did not have the option to fall back to another AWS region at the time of the us-east-1 outage, which could have provided increased availability at the height of the incident. We believe our original decision not to perform any cross-region data transfers is still in the best interests of our customers, and we do not believe changing this to provide extra resiliency is a viable option.
  • Could this have been identified sooner?

    • No, our PagerDuty alerts came through 10 minutes before AWS announced the incident, and an internal incident channel was opened shortly afterwards.
    • Our platform, however, showed enough signs of life that our external monitoring services were not reporting downtime.
    • Our customers could have been informed sooner in this particular instance:

      • Our third-party StatusPage (status.cronofy.com) was not available during the worst of the incident, only returning at 09:50 UTC. Due to the AWS issues, we were also unable to update our DNS records to point to a different location.
      • We had to make use of improvised channels (LinkedIn, X, Facebook) until we could set up a temporary page at status.cronofy.com at 08:45 UTC. This resulted in delays in informing our customers of the issues we were facing.
  • Could this have been prevented?

    • No, this issue was due to an AWS incident and was outside our control.
    • The previously mentioned Data Residency choices meant that falling back to a different region was not an option.

Actions

  • Review our external StatusPage to ensure we have the best setup for informing customers of ongoing incidents (one possible approach is sketched after this list).
  • Review backup StatusPage options:

    • Update our internal Incident Playbook to list backup notification platform options.
  • Review deployment pipeline setup for added resiliency in such situations.
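
As an illustration of one option the StatusPage review could consider (a sketch, not a committed design): health-check-driven DNS failover would let status.cronofy.com switch to a standby page automatically, with no manual DNS change needed mid-incident, because Route 53 health checks and failover are served from its data plane rather than its control plane. The Python (boto3) sketch below assumes the zone is hosted in Route 53; the hosted zone ID and both target hostnames are hypothetical.

    # Illustrative sketch only - assumes status.cronofy.com is served from a
    # Route 53 hosted zone; the zone ID and both targets are hypothetical.
    import boto3

    route53 = boto3.client("route53")
    ZONE_ID = "Z0000000000000"                  # hypothetical hosted zone ID
    PRIMARY = "primary-statuspage.example.net"  # hypothetical primary status page host
    BACKUP = "backup-statuspage.example.net"    # hypothetical standby hosted elsewhere

    # Health check against the primary status page.
    health_check_id = route53.create_health_check(
        CallerReference="status-page-failover-1",
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": PRIMARY,
            "ResourcePath": "/",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )["HealthCheck"]["Id"]

    # PRIMARY/SECONDARY failover records: if the health check fails, DNS answers
    # switch to the standby automatically, without a manual record change.
    route53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "status.cronofy.com", "Type": "CNAME", "TTL": 60,
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "HealthCheckId": health_check_id,
                "ResourceRecords": [{"Value": PRIMARY}],
            }},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "status.cronofy.com", "Type": "CNAME", "TTL": 60,
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": BACKUP}],
            }},
        ]},
    )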

Posted Oct 22, 2025 - 09:36 UTC

Resolved

On October 20th, AWS's us-east-1 region experienced a significant incident impacting multiple underlying services.

This led to the operation of Cronofy's US data center, which is hosted in this AWS region, being severely impacted. None of Cronofy's other data centers were affected at any time.

The main impact was felt between 07:15 and 09:15 UTC, when many critical components struggled to communicate with AWS services. Performance was severely degraded throughout this period.

We saw some recovery from 09:15 UTC, with the backlog of high-priority tasks completed by 09:30 UTC; these are the tasks most likely to have a user-noticeable impact.

Some issues remained, mainly an inability to provision additional servers, which was widely reported by other AWS customers.

However, we had sufficient capacity to clear the backlog of lower-priority tasks by 10:20 UTC. Performance of the US data center was back to normal from this time, and we continued to monitor.

Throughout the day we ran with an altered configuration to obtain as much capacity as we could from AWS to provide as smooth a service as possible.

Our ability to provision additional capacity was restored at 19:15 UTC.

While the underlying AWS incident is still open, we are considering this incident to be resolved as we have been able to resume normal operations.

A postmortem of the incident will take place and be attached to this incident in the next 48 hours.

If you have any queries in the interim, please contact us at support@cronofy.com.
Posted Oct 20, 2025 - 21:58 UTC

Update

AWS continue to work towards a full resolution.

We remain unable to reliably provision additional capacity in our US data center. However, since the previous message, we have managed to increase the available capacity.

Performance of the US data center continues to be normal, and we will continue to monitor.
Posted Oct 20, 2025 - 13:40 UTC

Update

Since around 07:15 UTC our US data center has been severely impacted by an ongoing incident in AWS us-east-1, where it is hosted: https://health.aws.amazon.com/health/status

At 09:15 UTC we started seeing signs of recovery. As of 09:30 UTC, all high-priority tasks had been cleared; these are the tasks with the most visible user impact. Lower-priority tasks were cleared by 10:20 UTC.

AWS continue to work towards full resolution.

We remain unable to reliably provision additional capacity in this data center. We have taken steps to ensure we retain the capacity we already have.

Performance of the US data center is now back to normal, and we continue to monitor whilst the underlying AWS incident is active.
Posted Oct 20, 2025 - 10:29 UTC

Monitoring

Since around 07:15 UTC our US data center has been severely impacted by an ongoing incident in AWS us-east-1, where it is hosted: https://health.aws.amazon.com/health/status

AWS have identified a potential root cause for the issue and are working on multiple parallel paths to accelerate recovery.

At 09:20 UTC we started seeing signs of recovery. As of 09:30 UTC, all high-priority tasks had been cleared; these are the tasks with the most visible user impact.

We are unable to scale additional capacity to clear lower-priority tasks at present; these are background processes, such as polling calendars (Apple) and housekeeping-style tasks. This backlog is being processed, but not as quickly as we would like.

All other data centers are fully operational.
Posted Oct 20, 2025 - 09:52 UTC
This incident affected: Scheduler, API, Background Processing, and Developer Dashboard.