On Saturday, 23rd July 2022, we experienced a 12-minute outage in our US data center between 17:29 and 17:41 UTC.
During this time, our API at api.cronofy.com and our web application at app.cronofy.com were not reachable. Any requests made during the outage are likely to have failed to connect, or to have received a 500-range status code rather than being handled successfully. Our web application hosts the developer dashboard, Scheduler, Real-Time Scheduling pages, and end-user authorization flows. Our background processing of jobs, such as calendar synchronization, was not affected.
Cronofy records all API calls in an API request table before processing them. The outage was triggered when the database locked this table. Because requests could not be written to the table, API requests began to queue up and time out and, once the queue was full, were rejected outright. This, in turn, caused our infrastructure to mark the affected servers as unhealthy and take them out of service.
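For illustration only, lock contention of this kind is usually visible in PostgreSQL's pg_stat_activity view. A diagnostic query along these lines (a sketch, not taken from our runbooks) lists the sessions waiting on locks and the sessions blocking them:

```sql
-- Sketch: sessions currently waiting on a lock, and the PIDs blocking them
-- (requires PostgreSQL 9.6+ for pg_blocking_pids).
SELECT pid,
       pg_blocking_pids(pid)    AS blocked_by,
       wait_event_type,
       wait_event,
       now() - query_start      AS waiting_for,
       left(query, 80)          AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;
```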
We experienced a very similar incident in February 2021. Since that incident, we have performed major version upgrades to our PostgreSQL clusters, and we had thought those upgrades had fixed this issue, as we had not had a recurrence for a long time. It is now clear that the major version upgrades have, unfortunately, not fixed this particular issue.
To help prevent this issue from happening again, we will be making changes to how data is stored within our PostgreSQL cluster.
All times are UTC on Saturday, 23rd July 2022, and are approximate for clarity.
17:29 App and API requests began to fail
17:31 The on-call engineer is alerted to the App and API being unresponsive
17:35 Attempts to mitigate the issue are made, including launching more servers. These result in temporary improvements but do not fix the issue.
17:37 The initial alerts clear as our mitigation attempts temporarily restore connectivity.
17:38 New alerts are raised for the app and API being unresponsive
17:39 Incident channel created, and other engineers come online to help
17:41 This incident is created. While this is being done, telemetry shows that API and app requests are being processed again.
17:52 Incident status is changed to monitoring, and we continue to investigate the root cause.
18:47 Incident status is resolved
The actions for this incident fall into two categories: what we can do straight away, and what we can do in the medium to long term.
To improve the performance of database queries, we use several indexes within our PostgreSQL clusters; these help locate data quickly and efficiently. This locking issue appears to occur when these indexes are being updated and the database enters a state where it is waiting for some operations to resolve. We are therefore going to review which indexes are actively used and determine whether any can safely be removed or consolidated. Reducing the number of indexes that need updating reduces the chance of the issue recurring.
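As a rough illustration of what such a review can start from (an assumed approach, not our exact process), PostgreSQL's statistics views expose how often each index has been used, so indexes that are never scanned become candidates for removal:

```sql
-- Sketch: indexes that have never been scanned since statistics were last
-- reset, largest first. Table names come from the live schema; nothing here
-- reflects Cronofy's actual schema.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan     AS scans,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Results like these still need human review, since an index may exist to support rare but important queries or constraints.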
We are also going to look at whether we can improve our alerts to help us identify the root cause of this type of issue faster, and give our on-call engineers a clearer signal that this is the root cause. While we currently don’t have a way of resolving the issue directly (the database eventually resolves the locks itself), this will help us provide clearer messaging and investigate faster.
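One possible signal for such an alert (a hypothetical example, not our actual monitoring configuration) is simply the number of database sessions waiting on locks, sampled periodically and alerted on when it stays elevated:

```sql
-- Sketch: a single metric a monitoring agent could sample; a sustained
-- non-zero value points on-call engineers at lock contention specifically.
SELECT count(*) AS sessions_waiting_on_locks
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```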
In the medium to long term, we will review the storage of API and app requests and determine whether PostgreSQL is the correct storage technology. This is likely to lead to re-architecting how we store some types of data to ensure our service is robust in the future.
If you have any further questions, please contact us at support@cronofy.com.