On Wednesday, October 2nd between 00:56 and 01:04 UTC, an increasing number of requests to app.cronofy.com and api.cronofy.com failed entirely or timed out while being processed.
The root cause was our primary database being unable to process requests in a timely fashion. The subsequent back pressure caused dependent services to time out, resulting in an outage.
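To illustrate the failure mode, here is a minimal sketch of bounding slow statements with a server-side timeout, assuming PostgreSQL and the psycopg2 driver purely for illustration; the connection string, timeout value, and query are hypothetical rather than our production configuration. Bounded statements turn database congestion into fast, explicit errors instead of back pressure that cascades into request timeouts.

```python
# Illustrative sketch only: assumes PostgreSQL via psycopg2; the DSN,
# timeout value, and query are hypothetical.
import psycopg2
from psycopg2 import errors

conn = psycopg2.connect(
    "postgresql://localhost/app",         # hypothetical connection string
    options="-c statement_timeout=2000",  # cancel any statement after 2s
)

def fetch_event(event_id):
    """Fail fast under database congestion instead of queueing behind it."""
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM events WHERE id = %s", (event_id,))
            return cur.fetchone()
    except errors.QueryCanceled:
        conn.rollback()  # clear the aborted transaction
        return None      # a quick, explicit failure the caller can handle
```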
00:56
- Primary database begins showing signs of congestion.
00:57
- Timeouts start being reported by monitoring.
00:59
- Initial alerting thresholds are breached. On-call engineer is notified.
01:00
- Investigation begins. app.cronofy.com and api.cronofy.com return timeout statuses.
01:01
- Additional alert thresholds for API response times and HTTP statuses are breached.
01:04
- Confirmation of performance degradation.
- Database congestion clears.
- Last timeout statuses are returned. app.cronofy.com and api.cronofy.com return healthy statuses.
01:05
- Rate limits hit for some clients as failed requests are retried in bulk (see the retry sketch after this timeline).
01:09
- Engineer confirms resumption of service.
01:10 - 02:40
- Ongoing investigation and monitoring.
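The 01:05 rate limiting is worth a note for API clients. Retrying failed requests with exponential backoff and jitter spreads the retry load out rather than replaying every failed request at once the moment service resumes. A minimal sketch in Python (illustrative only, not one of our SDKs; the attempt count and backoff ceiling are hypothetical):

```python
# Illustrative client-side retry sketch; the attempt count and backoff
# ceiling are hypothetical.
import random
import time

import requests

def get_with_backoff(url, headers=None, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            # Retry only on rate limiting (429) and server errors (5xx).
            if response.status_code != 429 and response.status_code < 500:
                return response
        except requests.RequestException:
            pass  # treat timeouts/connection errors like a server error
        # Full jitter: sleep a random amount up to 2^attempt seconds.
        time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError(f"{url} still failing after {max_attempts} attempts")
```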
We ask three primary questions in our retrospective: what happened, why it happened, and what we are doing to prevent it from happening again.
While this issue resolved itself before engineer intervention was required, it arguably could have done so sooner.
The initial identification of issues routing requests from our load balancers to our servers, while correct, ultimately proved to be a symptom rather than the root cause. Our services did effectively self-heal, but we have identified areas for improvement that should enable them to avoid needing to do so in future.
This incident also highlighted gaps in our monitoring. Closing them would have enabled us to take action before timeouts began to be returned, and would have made identifying the root cause a simpler task.
We’re going to be spending some time re-working and improving our database monitoring to address the areas we’ve identified. These are largely one-in-a-million events but, at the volume of events we process, that’s more frequent than we feel is acceptable.
We’ll be adding additional telemetry and improving our handling of database statements. This will enable us to notice negative performance trends well before they become an issue, and to be even more robust when they do.
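As a minimal sketch of the kind of telemetry we mean, assuming PostgreSQL and a statsd-style metrics client purely for illustration (the DSN, metric name, and sampling approach are hypothetical):

```python
# Illustrative telemetry sketch: assumes PostgreSQL and the `statsd`
# Python package; the DSN and metric name are hypothetical.
import psycopg2
import statsd

metrics = statsd.StatsClient("localhost", 8125)
conn = psycopg2.connect("postgresql://localhost/app")  # hypothetical DSN
conn.autocommit = True

def sample_statement_age():
    """Record the age of the oldest in-flight statement as a gauge."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT COALESCE(MAX(EXTRACT(EPOCH FROM now() - query_start)), 0)
            FROM pg_stat_activity
            WHERE state = 'active'
            """
        )
        oldest_seconds = cur.fetchone()[0]
    metrics.gauge("db.active_statement_age_max", float(oldest_seconds))
```

Graphed over time, a gauge like this makes a slow upward trend in statement age visible well before any request-level timeout fires, which is the kind of early signal described above.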
We’re also adding a new section to our playbook covering additional actions for similar scenarios, to aid in speeding up our response.