Performance was degraded in the US data center for several hours yesterday (16:30-20:00 UTC).
We did not open an incident at the time, as we should have done. Some of our instrumentation was found to be lacking, which led us to underestimate the level of degradation. We will be resolving this, as well as taking steps to resolve the root cause of the degradation.
At 16:30 (all times UTC) there were some early signs of our primary database becoming a bottleneck. By 16:45 there were signs of slower response times and a small backlog of work within our queuing system. At this point we were alerted to a potential problem.
Our initial actions were to stop any non-essential tasks from running whilst we assessed the situation.
At 17:06 backpressure led to a significant number of requests being rejected. At this point, if not before, we should have opened an incident but did not.
Other than during that minute, response times were slower than normal but the 99th percentile stayed under 5 seconds. Whilst the queues had a backlog, it was not growing and, by our estimates, tasks would be processed within 30 seconds.
Performance then held stable, if degraded, until the end of the incident at 20:00.
No single measure, other than potentially the requests being rejected at 17:06, was a red flag but there were several amber flags for a significant period of time. The fact that multiple measures went amber and that individually they stayed amber for a significant period should have triggered the creation of an incident but didn’t.
Several customers contacted us querying a degradation in performance that exceeded our estimates. Investigation of this has determined that our estimation of queue latency was inaccurate, to the point where that measure alone would have triggered the creation of an incident.
We had been planning an upgrade in the near future, but we had obviously not planned it soon enough.
We don’t want to be in a situation where our customers are telling us about incidents. Our internal rules will therefore be revisited to make them more quantitative, removing any “feeling” from the process; to be multiplicative, so that many ambers mean a red; and to include a time component, so that something that is amber for a prolonged period becomes red.
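As an illustration only, rules of this shape could be sketched as below. The names (Status, Measure, should_open_incident) and both thresholds are hypothetical, chosen for the example; they are not our actual rules or values.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    GREEN = 0
    AMBER = 1
    RED = 2


@dataclass
class Measure:
    name: str
    status: Status
    minutes_in_status: float  # how long the measure has held this status


# Hypothetical thresholds, for illustration only.
MAX_CONCURRENT_AMBERS = 2  # this many ambers at once count as a red
MAX_AMBER_MINUTES = 30     # an amber lasting this long counts as a red


def should_open_incident(measures):
    """Return True if the measures, taken together, warrant an incident."""
    # Any single red measure is enough on its own.
    if any(m.status == Status.RED for m in measures):
        return True
    ambers = [m for m in measures if m.status == Status.AMBER]
    # Multiplicative rule: many ambers mean a red.
    if len(ambers) >= MAX_CONCURRENT_AMBERS:
        return True
    # Time component: a prolonged amber becomes a red.
    return any(m.minutes_in_status >= MAX_AMBER_MINUTES for m in ambers)
```

Under these assumed thresholds, two measures going amber together would trigger an incident immediately, and a single amber would trigger one after 30 minutes, without anyone having to judge how bad things "feel".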
Queue depth has proved to be insufficient as a measure for us to understand the overall performance of our background processing. We will make changes in order to collect more detailed data about our queue health and build alerting around it.
In the meantime, we are going to lower the queue depth that we treat as acceptable.
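One candidate for the more detailed queue data is how long tasks actually wait, rather than how many are waiting. The sketch below is a minimal example that assumes each task records the time it was enqueued and that completed tasks report their wait time; the function names are hypothetical and this is not a description of our queuing system.

```python
import math


def oldest_task_age(enqueued_at, now):
    """Age in seconds of the longest-waiting task, or 0.0 for an empty queue."""
    return max((now - t for t in enqueued_at), default=0.0)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Enqueue timestamps (seconds) of tasks still waiting, and the current time.
waiting = [100.0, 130.0, 155.0]
print(oldest_task_age(waiting, now=160.0))  # 60.0

# Wait times (seconds) reported by recently completed tasks.
waits = [1, 2, 3, 4, 100]
print(percentile(waits, 50))  # 3
print(percentile(waits, 99))  # 100
```

In this example the queue holds only three tasks, which looks healthy by depth alone, yet one task has been waiting for a minute and the 99th percentile wait is 100 seconds. Measures like these can be alerted on directly instead of being estimated from depth.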
Whilst several failings meant we did not communicate this incident well, its root cause was that our primary database needed to be upgraded.
This has already been done and we are monitoring it closely to ensure the upgrade has provided the necessary headroom to avoid a similar incident.