Increased error rates
Incident Report for Cronofy
Postmortem

On Monday 21st August between 09:09 and 09:20 UTC, all API calls creating or deleting events failed. Users of the Scheduler were unaffected, as operations were retried automatically after 09:20 UTC.

This outage was caused by a bug in a change to our API request journalling, which records each API request received by Cronofy.
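
For illustration only, the sketch below shows what a request-journalling layer can look like in general terms; the names, WSGI-style structure, and storage mechanism are assumptions for the example and not Cronofy's implementation.

    import time

    class RequestJournal:
        """Hypothetical WSGI-style middleware that records each incoming API request."""

        def __init__(self, app, journal_store):
            self.app = app
            self.journal_store = journal_store  # e.g. an in-memory list or a queue

        def __call__(self, environ, start_response):
            # Record the request before handing it to the application.
            self.journal_store.append({
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "received_at": time.time(),
            })
            return self.app(environ, start_response)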

Timeline

At 09:04 a deployment was triggered including an update to our API request journal.

At 09:09 the deployment began rolling out, and the change came into force. Seconds later, an alert was triggered and engineers began investigating.

At 09:11 an additional alarm triggered for our Site Reliability team, informing them of an increase in the number of failing API requests.

At 09:15, with many more alerts firing, we triggered a further deployment to revert the change.

At 09:19 all deployments reverting the change completed, and the last error was observed.

Retrospective

We ask three primary questions in our retrospective:

  • Could we have resolved it sooner?
  • Could we have identified it sooner?
  • Could we have prevented it?

After identification, the issue was resolved in approximately 4 minutes. We believe our automated deployment pipeline strikes a good balance between speed and robustness, so we see no significant improvement to be made here.

The change had been highlighted as one in a risky area and had passed code review. Due to the anticipated risk, an engineer was actively checking for errors after the deployment. It took around 6 minutes from the first error being seen to making the call to revert the change. Given the severity of the issue, this was too long and we have taken action to avoid this in future.

The change was made in a critical area of our platform, one that has recently been under development. Manual testing was performed against our staging environment but failed to exercise the affected path.

Our reviews focussed too heavily on the intended change in behaviour and missed the unintended side effects that led to this issue. Our automated tests for this area were not as comprehensive as we thought, and did not detect the bug either.

Actions

Automated tests in this area have been reviewed and expanded to provide more certainty when making changes. This will prevent such changes from passing review.
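
As a hedged illustration of the kind of coverage involved, the test below reuses the hypothetical RequestJournal sketch from earlier and checks that an event request is journalled; the path and status code are assumptions and the test is not taken from Cronofy's suite.

    def test_event_requests_are_journalled():
        # In-memory journal standing in for the real store.
        journal = []

        def app(environ, start_response):
            start_response("202 Accepted", [("Content-Type", "application/json")])
            return [b"{}"]

        wrapped = RequestJournal(app, journal)
        environ = {"REQUEST_METHOD": "POST", "PATH_INFO": "/v1/events"}
        wrapped(environ, lambda status, headers: None)

        assert len(journal) == 1
        assert journal[0]["path"] == "/v1/events"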

We’ve strengthened guard clauses in this area to produce more descriptive errors earlier, should a similar mistake be made in future. This will both prevent such changes from passing review and, in the worst case, aid faster identification of issues.
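
A minimal sketch of the kind of guard clause meant here, using hypothetical field names; the real checks and error types are internal to Cronofy.

    def journal_request(journal_store, entry):
        # Hypothetical guard clause: reject a malformed journal entry with a
        # descriptive error at the point of the mistake, rather than letting a
        # downstream component be overwhelmed later.
        required = ("method", "path", "received_at")
        missing = [field for field in required if field not in entry]
        if missing:
            raise ValueError(
                "journal entry missing required fields: " + ", ".join(missing)
            )
        journal_store.append(entry)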

We’ve altered our playbook for deploying high-risk code changes to recommend that at least two engineers are present and monitoring errors and telemetry. This will improve our chances of identifying issues sooner.

Posted Aug 29, 2023 - 13:57 UTC

Resolved
A change deployed at 09:08 UTC introduced a bug that led to an internal component being overwhelmed and dropping API requests. Our alerting spotted this and we rolled back, completing at 09:19 UTC.

We will be conducting a post-mortem to understand why this wasn’t caught before release and to improve our QA processes.
Posted Aug 21, 2023 - 10:13 UTC
Investigating
We’re investigating an increased error rate in all data centers; the error rate has returned to normal, but we’re investigating any side effects.
Posted Aug 21, 2023 - 09:31 UTC
This incident affected: API.