US API degradation
Incident Report for Cronofy
Postmortem

On Wednesday April 12th 2023, between 03:16 and 05:35 UTC (2 hours 19 minutes), Cronofy experienced an issue that caused 17% of the API traffic to our US environments to fail, returning an HTTP 500 error to the API caller.

The underlying cause was that our system failed to heal following a database failover.

An incorrectly configured alarm and a gap in our playbook caused the issue to last longer than it should have.

In line with our principles, we are publishing a public postmortem to describe what happened, why it impacted users, and what we will do to prevent it from happening in the future.

Timeline

Times are from Wednesday April 12th 2023, in UTC and rounded for clarity.

At 03:16 the primary node that writes to the database cluster failed. The secondary node was promoted and the cluster failed over automatically, recovering as planned. This incident alerted us via our pager service, triggering an investigation, which started at 03:20.

After the database failover, all visible metrics pointed towards a successful automated resolution of the issue. Our core metrics looked healthy, and we marked the incident as recovered without manual intervention at 03:30.

It is normal for some of our incidents to be resolved in this way. The on-call engineer makes a decision based on available information as to whether an incident room is necessary to handle an alarm.

When a failover occurs, the original failed node is rebooted and then becomes the read-only partner in the cluster, whilst the existing read-only secondary node is promoted to primary. Some of the connections from the web application nodes were still connected to the node that had become the read-only secondary, but were treating it as if it were the primary writable node. This led to failures in some of the actions taking place in the API.
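As a minimal sketch of this failure mode, assuming a PostgreSQL cluster accessed from Python via psycopg2 (for illustration only, not a description of our production code): a pooled connection opened before the failover can still point at a node that is now a read-only replica, and checking pg_is_in_recovery() when a connection is checked out would catch the stale connection before a write fails.

    # Minimal sketch for illustration: assumes PostgreSQL accessed via a
    # psycopg2 connection pool; the endpoint and helper names are hypothetical.
    from psycopg2 import pool

    db_pool = pool.SimpleConnectionPool(
        minconn=1, maxconn=10,
        dsn="postgresql://cluster-endpoint.example/app",  # hypothetical endpoint
    )

    def node_is_read_only(conn):
        # pg_is_in_recovery() returns true when the connected node is a
        # replica, i.e. read-only.
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            return cur.fetchone()[0]

    def checkout_writable_connection():
        # Discard connections still pointed at a node that became a read-only
        # replica during the failover; a fresh dial of the cluster endpoint
        # should resolve to the new primary. (A production version would
        # bound the retries.)
        while True:
            conn = db_pool.getconn()
            if not node_is_read_only(conn):
                return conn
            db_pool.putconn(conn, close=True)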

At this time, our monitoring system was not alerting: although the metric measuring HTTP 500 responses was reporting correctly, the alarm was misconfigured. When the alarm received no data, it treated this the same as 0 errors. As a result, no further alarms alerted us to the degradation of the service, and this reinforced the belief that the service was healthy again.
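As a minimal sketch of the kind of misconfiguration involved, assuming an AWS CloudWatch-style alarm managed with boto3 (used here purely as an example, not a description of our actual monitoring setup): whether missing data points are treated as healthy or as breaching is a single setting on the alarm.

    # Minimal sketch for illustration: assumes an AWS CloudWatch alarm managed
    # via boto3; the alarm, namespace, and metric names are hypothetical.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="api-http-500-errors",
        Namespace="ExampleApp",
        MetricName="Http500Count",
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=5,
        Threshold=10,
        ComparisonOperator="GreaterThanThreshold",
        # The pitfall: "notBreaching" treats missing data points as within
        # threshold (the same as zero errors), so the alarm can stay green
        # while the service is degraded. "breaching" treats missing data as
        # an alarm condition instead.
        TreatMissingData="breaching",
    )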

At 05:00 an automated notification was posted into our internal Slack monitoring channel to show that two health metrics had not fully recovered. This wasn’t an alarm-level notification, so it did not re-trigger the on-call pager service.

At 05:30 an engineer reviewed the notification in Slack and inspected the health metrics. The increased level of HTTP 500 errors being returned by the Cronofy API was identified. The incident was reopened and the investigation restarted. Our site reliability team was reactivated to triage the issue.

At 05:35 Cronofy took the decision to replace all application instances in the US environment. An automated redeployment was triggered. This reset all the nodes’ connections to the database cluster, flushing out the connections to the read-only node, and returned API error responses to normal levels by 05:37.

Retrospective

We ask three primary questions in our retrospective:

  • Could we have resolved it sooner?
  • Could we have identified it sooner?
  • Could we have prevented it?

The root cause of this issue is that our systems did not correctly self-heal when a database failover occurred. Although the database failover was the event that triggered the system to fail, it is a rare, expected, and unavoidable event. The database cluster correctly recovered and was back in an operating state within a few minutes.

Another significant factor in the severity of the impact was the robustness of Cronofy’s response to the issue. This was the first occurrence of a database failover outside of core working hours. During working hours, multiple engineers would see the error and respond to it, each checking a different set of metrics or systems. This collaborative approach was not correctly translated into the guidance available in the incident response playbook, resulting in an incomplete set of checks taking place once the database failover had completed.

The incident could certainly have been resolved sooner by forcing the application nodes to reconnect to the database. This reconnection should trigger automatically when a database failover occurs, and the incident response team should also have enough information available to know when a wider issue is still ongoing.

The failure to identify that the impact of the issue had not been fully resolved is what prolonged the incident, and this is linked to several factors:

  • Confirmation bias from errors trending downwards, combined with an early-hours wake-up, caused the on-call engineer to miss the elevated error rate.
  • A misconfigured alarm added to the confirmation bias by not making it clear that the issue was ongoing.
  • A reliance on tacit group knowledge instead of explicitly documented steps for database failovers meant that the on-call engineer did not know about the additional validation and checks that would have identified the issue sooner.

Actions to be taken

We are disappointed with the length of time it took us to resolve this incident. Multiple smaller failures led to this incident having a higher impact. A few different checks, technical changes, or alarms could have mitigated this error and prevented the longer outage.

  • We are improving the way our applications self-heal in the event of a database failover. Errors occurred indicating that the system was trying to write to the wrong part of the database cluster, but these errors did not cause the system to reset its connections (a sketch of this kind of reset follows this list).
  • We will update our guidance for our internal teams on the actions to take when a database failover occurs. This will include a more precise checklist, and specific metrics to review.
  • We are going through all our alarms to ensure that they are looking at the right data, at the right time.
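As referenced in the first action above, here is a minimal sketch of the reactive side of that self-healing, again assuming PostgreSQL and psycopg2 for illustration rather than describing our actual implementation: when a write fails because the connection is pointed at a read-only node, the connection is discarded rather than reused.

    # Minimal sketch for illustration: assumes PostgreSQL via psycopg2 and the
    # hypothetical db_pool / checkout_writable_connection helpers sketched
    # earlier in this postmortem.
    from psycopg2 import errors

    def execute_write(sql, params=()):
        conn = checkout_writable_connection()
        try:
            with conn, conn.cursor() as cur:  # commits on success, rolls back on error
                cur.execute(sql, params)
        except errors.ReadOnlySqlTransaction:
            # The node behind this connection has become a read-only replica:
            # close the connection so a fresh one is dialled to the new
            # primary, then let the caller retry the write.
            db_pool.putconn(conn, close=True)
            raise
        else:
            db_pool.putconn(conn)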

Further questions?

If you have any further questions, please contact us at support@cronofy.com

Posted Apr 13, 2023 - 15:33 UTC

Resolved
Our US data center experienced a failover of the primary database at around 03:15 UTC. From this point, around 20% of API requests encountered an error resulting in a 500 response; for specific operations, such as creating events, the failure rate was 80%.

This lasted until around 05:30 UTC (around 2h15m later), when the elevated errors were recognized and all processes were restarted. Errors then returned to normal levels a few minutes later.

From an initial investigation, we believe "write" connections were attempting to write to a node which had become a replica as part of the failover, and those writes then failed. Some processes restarted automatically, but the API processes appear not to have done so, nor did they restart in the face of intermittent errors. That issue was cleared when we made all the processes restart, and service then returned to normal.

We will be conducting a full postmortem of this event and will post that against this incident by the end of the week.
Posted Apr 12, 2023 - 03:00 UTC