Degraded US data center performance
Incident Report for Cronofy
Postmortem

Summary

Performance was degraded in the US data center for several hours yesterday (16:30-20:00 UTC).

We did not open an incident at the time, as we should have done. Some of our instrumentation has been found to be lacking, which led us to underestimate the level of the degradation. We will be resolving this, as well as taking steps to resolve the root cause of the degradation.

Background

At 16:30 (all times UTC) there were some early signs of our primary database becoming a bottleneck. By 16:45 there were signs of slower response times and a small backlog of work within our queuing system. At this point we were alerted to a potential problem.

Our initial actions were to stop any non-essential tasks from running whilst we assessed the situation.

At 17:06, backpressure led to a significant number of requests being rejected. At this point, if not before, we should have opened an incident but did not.

Other than that minute, response times were slower than normal, but with a 99th percentile of under 5 seconds. Whilst the queues had a backlog, it was not growing, and by our estimates tasks would be processed within 30 seconds.

Performance then held stable, if degraded, until the end of the incident at 20:00.

Lessons learned

Our incident creation rules don’t err on the side of creation

No single measure, other than potentially the requests being rejected at 17:06, was a red flag, but several amber flags persisted for a significant period of time. Multiple measures going amber, and individually staying amber for that long, should have triggered the creation of an incident but did not.

Our understanding of queue latency is insufficient

Several customers contacted us to query a degradation in performance that exceeded our estimates. Investigation has determined that our estimate of queue latency was inaccurate, to the point where an accurate measure of it alone would have triggered the creation of an incident.

Our primary database had insufficient headroom

We had been planning an upgrade for the near future, but we had clearly not planned it soon enough.

Actions

Revisit the rules for incident creation

We don’t want to be in the situation where our customers are telling us about incidents. Therefore, our internal rules will be revisited to make them more quantitative, removing any “feeling” from the process; to be multiplicative, so that many ambers mean a red; and to include a time component, so that something that is amber for a prolonged period becomes red.
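
As an illustration, the following is a minimal sketch of what such a rule could look like. The metric structure, thresholds, and time windows are hypothetical and are not our actual alerting configuration.

    from dataclasses import dataclass

    @dataclass
    class MetricState:
        name: str
        status: str                # "green", "amber", or "red"
        amber_for_minutes: int     # how long the metric has been amber (0 if not)

    # Illustrative thresholds only.
    AMBERS_FOR_RED = 3             # this many simultaneous ambers count as a red
    AMBER_ESCALATION_MINUTES = 15  # amber for this long counts as a red

    def should_open_incident(metrics):
        # Any single red flag opens an incident immediately.
        if any(m.status == "red" for m in metrics):
            return True
        ambers = [m for m in metrics if m.status == "amber"]
        # A prolonged amber is treated as a red.
        if any(m.amber_for_minutes >= AMBER_ESCALATION_MINUTES for m in ambers):
            return True
        # Many simultaneous ambers are treated as a red.
        return len(ambers) >= AMBERS_FOR_RED

The intent is that no individual judgment call is needed: the combination and duration of amber signals is enough to open an incident on its own.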

Better instrumentation of our queues

Queue depth has proved to be insufficient as a measure for us to understand the overall performance of our background processing. We will make changes in order to collect more detailed data about our queue health and build alerting around it.

In the meantime, we are going to reduce our threshold for what we consider an acceptable queue depth.
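
As an illustration of the direction we are taking, here is a minimal sketch of measuring queue latency directly by stamping each job with its enqueue time, rather than inferring latency from queue depth. The function names and payloads are hypothetical and do not reflect our actual implementation.

    import time

    def enqueue(work_queue, payload):
        # Stamp each job with its enqueue time so its waiting time can be
        # measured when a worker picks it up.
        work_queue.put({"payload": payload, "enqueued_at": time.time()})

    def dequeue_and_record(work_queue, record_latency_seconds):
        job = work_queue.get()
        # Queue latency is how long this job actually waited, regardless of
        # how many other jobs were in the queue at the time.
        record_latency_seconds(time.time() - job["enqueued_at"])
        return job["payload"]

    # Example usage with the standard library queue:
    import queue
    q = queue.Queue()
    enqueue(q, "sync-calendar-123")
    dequeue_and_record(q, lambda seconds: print(f"queue latency: {seconds:.3f}s"))

Measuring the wait time of real jobs gives a per-job latency figure we can alert on, rather than relying on depth as a proxy.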

Upgrade our primary database

Whilst several factors meant we did not communicate this incident well, the root cause of the incident was the need to upgrade our primary database.

This has already been done and we are monitoring it closely to ensure the upgrade has provided the necessary headroom to avoid a similar incident.

Posted Jan 29, 2019 - 15:44 UTC

Resolved
From 16:30 to 20:00 UTC, the performance of the US data center was degraded, primarily affecting the performance of the synchronization process.

Postmortem to follow.
Posted Jan 28, 2019 - 16:30 UTC