On Saturday, 23rd July 2022, we experienced a 12-minute outage in our US data center between 17:29 and 17:41 UTC.
During this time, our API at api.cronofy.com and our web application at app.cronofy.com were not reachable. Any requests made during the outage are likely to have failed to connect, or to have received a 500-range status code rather than being handled successfully. Our web application hosts the developer dashboard, Scheduler, Real-Time Scheduling pages, and end-user authorization flows. Our background processing of jobs, such as calendar synchronization, was not affected.
Cronofy records all API calls in an API request table before processing them. The outage was triggered when the database locked this table. Because requests could not be written to the table, API requests began to queue up and time out and, once the queue was full, were rejected outright. This, in turn, caused our infrastructure to mark the affected servers as unhealthy and take them out of service.
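For illustration only, lock contention of this kind is usually visible in PostgreSQL's pg_stat_activity view. A diagnostic query along these lines (a sketch, not taken from our runbooks) lists the sessions waiting on locks and the sessions blocking them:

```sql
-- Sketch: sessions currently waiting on a lock, and the PIDs blocking them
-- (requires PostgreSQL 9.6+ for pg_blocking_pids).
SELECT pid,
       pg_blocking_pids(pid)    AS blocked_by,
       wait_event_type,
       wait_event,
       now() - query_start      AS waiting_for,
       left(query, 80)          AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;
```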
We experienced a very similar incident in February 2021. Since that incident, we have performed major version upgrades to our PostgreSQL clusters, and we had thought those upgrades had fixed this issue, as we had not had a recurrence for a long time. It is now clear that the major version upgrades have, unfortunately, not fixed this particular issue.
To help prevent this issue from happening again, we will be making changes to how data is stored within our PostgreSQL cluster.
All times are UTC on Saturday, 23rd July 2022, and are approximate for clarity.
17:29 App and API requests began to fail
17:31 The on-call engineer is alerted to the App and API being unresponsive
17:35 Attempts to mitigate the issue are made, including launching more servers. These result in temporary improvements but do not fix the issue.
17:37 The initial alerts clear as our mitigation attempts temporarily restore connectivity.
17:38 New alerts are raised for the app and API being unresponsive
17:39 Incident channel created, and other engineers come online to help
17:41 This incident is created. While this is being done, telemetry shows that API and app requests are being processed again.
17:52 Incident status is changed to monitoring, and we continue to investigate the root cause.
18:47 Incident status is resolved
The actions for this incident fall into two categories: what we can do straight away, and what we can do in the medium to long term.
To improve the performance of database queries, we use several indexes within our PostgreSQL clusters; these help locate data quickly and efficiently. This locking issue appears to occur when these indexes are being updated and the database enters a state where it is waiting for some operations to resolve. We are therefore going to review which indexes are actively used and determine whether any can safely be removed or consolidated. Reducing the number of indexes that need updating reduces the chance of the issue recurring.
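As a rough illustration of what such a review can start from (an assumed approach, not our exact process), PostgreSQL's statistics views expose how often each index has been used, so indexes that are never scanned become candidates for removal:

```sql
-- Sketch: indexes that have never been scanned since statistics were last
-- reset, largest first. Table names come from the live schema; nothing here
-- reflects Cronofy's actual schema.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan     AS scans,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Results like these still need human review, since an index may exist to support rare but important queries or constraints.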
We are also going to look at whether we can improve our alerts to help us identify the root cause of this type of issue faster, and give our on-call engineers a clearer signal that this is the root cause. While we currently don’t have a way of resolving the issue directly (the database eventually resolves the locks itself), this will help us provide clearer messaging and investigate faster.
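One possible signal for such an alert (a hypothetical example, not our actual monitoring configuration) is simply the number of database sessions waiting on locks, sampled periodically and alerted on when it stays elevated:

```sql
-- Sketch: a single metric a monitoring agent could sample; a sustained
-- non-zero value points on-call engineers at lock contention specifically.
SELECT count(*) AS sessions_waiting_on_locks
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```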
In the medium to long term, we will review the storage of API and app requests and determine whether PostgreSQL is the correct storage technology. This is likely to lead to re-architecting how we store some types of data to ensure our service is robust in the future.
If you have any further questions, please contact us at support@cronofy.com.