On October 23rd, between 10:11 and 10:18 UTC, we experienced a partial outage in our Canadian data center.
During this time, customers may have received server errors or experienced increased response latency when using the scheduler or our API. Background processing was not impacted during this incident.
At 10:09 UTC, we released a change that increased how much work each individual web server process could handle at a time. This inadvertently caused those processes to hit their resource limits after a short period, at which point they were restarted. This cycle kept repeating until we reverted the change.
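To illustrate the failure mode, here is a minimal sketch of how raising per-process concurrency can push a process over an enforced resource limit. The memory limit, baseline, and per-request figures below are invented for the example and are not our real production values.

```python
# Hypothetical illustration of the failure mode. The memory limit, baseline,
# and per-request figures are invented for this sketch, not real values.

MEMORY_LIMIT_MB = 1024   # resource limit enforced on each web server process
BASELINE_MB = 300        # memory a process uses with no requests in flight
PER_REQUEST_MB = 40      # additional memory per concurrently handled request


def peak_memory_mb(concurrent_requests: int) -> int:
    """Rough peak memory for a process handling this many requests at once."""
    return BASELINE_MB + concurrent_requests * PER_REQUEST_MB


for concurrency in (8, 16, 32):
    peak = peak_memory_mb(concurrency)
    status = ("over the limit: process restarted"
              if peak > MEMORY_LIMIT_MB else "within the limit")
    print(f"concurrency={concurrency:>2}  peak={peak:>4} MB  ({status})")
```

In this model, doubling per-process concurrency from 16 to 32 is enough to exceed the limit, so each process is killed and restarted shortly after it warms up, which matches the restart loop described above.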
The first errors occurred at 10:11 UTC, and our engineers were alerted at 10:14. At 10:17 we reverted the change, and by 10:19 UTC the server errors had stopped and response latency had returned to normal.
We always ask ourselves the same questions:

Could it have been resolved sooner?
Could it have been identified sooner?
Could it have been prevented?

Could it have been prevented? Yes.
Rather than increasing how much work each individual web server process can handle, we could have increased the number of processes running. This is known as scaling horizontally, rather than scaling vertically.
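For comparison, here is a sketch of the two options using the same invented memory model as above; the process counts and concurrency figures are also invented and do not reflect our real configuration. Scaling horizontally adds processes while keeping each one's workload, and therefore its memory footprint, where it was.

```python
# Hypothetical comparison of the two scaling options, using the same invented
# memory model as the sketch above. Not our real configuration.

MEMORY_LIMIT_MB = 1024
BASELINE_MB = 300
PER_REQUEST_MB = 40


def per_process_memory_mb(concurrency: int) -> int:
    """Rough peak memory for one process at the given concurrency."""
    return BASELINE_MB + concurrency * PER_REQUEST_MB


def fleet_capacity(processes: int, concurrency: int) -> int:
    """Total requests the whole fleet can handle at once."""
    return processes * concurrency


options = {
    # Vertical: same number of processes, each doing more work.
    "vertical":   {"processes": 4, "concurrency": 32},
    # Horizontal: more processes, each doing the same amount of work as before.
    "horizontal": {"processes": 8, "concurrency": 16},
}

for name, cfg in options.items():
    mem = per_process_memory_mb(cfg["concurrency"])
    cap = fleet_capacity(**cfg)
    safe = "within limit" if mem <= MEMORY_LIMIT_MB else "exceeds limit"
    print(f"{name:<10} capacity={cap:>3}  per-process memory={mem:>4} MB ({safe})")
```

In this model both options reach the same total capacity, but only the horizontal one keeps each process under its resource limit.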
We felt that we could have been quicker to communicate about the issue on our status page. To help with this in the future, we will: