On Wednesday, October 2nd between 00:56 and 01:04 UTC, an increasing number of requests to app.cronofy.com and api.cronofy.com failed entirely or timed out while being processed.
The root cause was our primary database being unable to process requests in a timely fashion. The subsequent back pressure caused dependent services to time out, resulting in an outage.
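To illustrate the failure mode, here is a minimal sketch of bounding slow statements with a server-side timeout, assuming PostgreSQL and the psycopg2 driver purely for illustration; the connection string, timeout value, and query are hypothetical rather than our production configuration. Bounded statements turn database congestion into fast, explicit errors instead of back pressure that cascades into request timeouts.

```python
# Illustrative sketch only: assumes PostgreSQL via psycopg2; the DSN,
# timeout value, and query are hypothetical.
import psycopg2
from psycopg2 import errors

conn = psycopg2.connect(
    "postgresql://localhost/app",         # hypothetical connection string
    options="-c statement_timeout=2000",  # cancel any statement after 2s
)

def fetch_event(event_id):
    """Fail fast under database congestion instead of queueing behind it."""
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM events WHERE id = %s", (event_id,))
            return cur.fetchone()
    except errors.QueryCanceled:
        conn.rollback()  # clear the aborted transaction
        return None      # a quick, explicit failure the caller can handle
```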
00:56
- Primary database begins showing signs of congestion.
00:57
- Timeouts start being reported by monitoring.
00:59
- Initial alerting thresholds are breached. On-call engineer is notified.
01:00
- Investigation begins. app.cronofy.com and api.cronofy.com return timeout statuses.
01:01
- Additional alert thresholds for API response times and HTTP statuses are breached.
01:04
- Confirmation of performance degradation.
- Database congestion clears.
- Last timeout statuses are returned. app.cronofy.com and api.cronofy.com return healthy statuses.
01:05
- Rate limits hit for some clients as failed requests are retried in bulk (see the retry sketch after this timeline).
01:09
- Engineer confirms resumption of service.
01:10 - 02:40
- Ongoing investigation and monitoring.
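The 01:05 rate limiting is worth a note for API clients. Retrying failed requests with exponential backoff and jitter spreads the retry load out rather than replaying every failed request at once the moment service resumes. A minimal sketch in Python (illustrative only, not one of our SDKs; the attempt count and backoff ceiling are hypothetical):

```python
# Illustrative client-side retry sketch; the attempt count and backoff
# ceiling are hypothetical.
import random
import time

import requests

def get_with_backoff(url, headers=None, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            # Retry only on rate limiting (429) and server errors (5xx).
            if response.status_code != 429 and response.status_code < 500:
                return response
        except requests.RequestException:
            pass  # treat timeouts/connection errors like a server error
        # Full jitter: sleep a random amount up to 2^attempt seconds.
        time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError(f"{url} still failing after {max_attempts} attempts")
```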
We ask three primary questions in our retrospective: what happened, why it happened, and what we are doing to prevent it from happening again.
While this issue resolved itself before engineer intervention was required, it arguably could have done so sooner.
The initial identification of issues routing requests from our load balancers to our servers, while correct, ultimately proved to be a symptom rather than the root cause. Our services did effectively self-heal, but we have identified areas for improvement that should enable them to avoid needing to do so in future.
This incident also highlighted gaps in our monitoring. Closing them would have enabled us to take action before timeouts began to be returned, and would have made identifying the root cause a simpler task.
We’re going to be spending some time re-working and improving our database monitoring to address the areas we’ve identified. These are largely one-in-a-million events but, at the volume of events we process, that’s more frequent than we feel is acceptable.
We’ll be adding additional telemetry and improving our handling of database statements. This will enable us to notice negative performance trends well before they become an issue, and to be even more robust when they do.
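As a minimal sketch of the kind of telemetry we mean, assuming PostgreSQL and a statsd-style metrics client purely for illustration (the DSN, metric name, and sampling approach are hypothetical):

```python
# Illustrative telemetry sketch: assumes PostgreSQL and the `statsd`
# Python package; the DSN and metric name are hypothetical.
import psycopg2
import statsd

metrics = statsd.StatsClient("localhost", 8125)
conn = psycopg2.connect("postgresql://localhost/app")  # hypothetical DSN
conn.autocommit = True

def sample_statement_age():
    """Record the age of the oldest in-flight statement as a gauge."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT COALESCE(MAX(EXTRACT(EPOCH FROM now() - query_start)), 0)
            FROM pg_stat_activity
            WHERE state = 'active'
            """
        )
        oldest_seconds = cur.fetchone()[0]
    metrics.gauge("db.active_statement_age_max", float(oldest_seconds))
```

Graphed over time, a gauge like this makes a slow upward trend in statement age visible well before any request-level timeout fires, which is the kind of early signal described above.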
We’re also adding a new section to our playbook covering additional actions for similar scenarios, to aid in speeding up our response.