Background processing degraded
Incident Report for Cronofy
Postmortem

On Monday April 22nd between 11:00 and 13:30 UTC our background processing services suffered a major performance degradation, meaning background work was delayed by around 2 hours in some cases.

This impacted operations such as synchronizing schedules to push events into calendars and to update people's availability.

A change in our software's dependencies led to our background processors pulling work from queues but not processing that work as expected. This left work messages stuck in a state where the queues believed they were being worked on, and so did not allow other background processors to perform the work instead.

For a subset of the background processing during this period we had to wait for a configured timeout of 2 hours to expire, at which point the background work messages became available again, and the backlog was cleared. Full service was resumed to all data centers, including processing any delayed messages, by 13:30 UTC.

Further details, lessons learned, and further actions we will be taking can be found below.

Timeline

All times are in UTC and rounded for clarity

On Monday April 22nd at 10:55 a change was merged which incorporated some minor version changes in dependencies that we use to interact with AWS services. This was to facilitate work against an AWS service we were not previously using.

This change in dependencies interacted with a dependency that had not changed, such that our calls to fetch work messages from AWS Simple Queue Service (SQS) reported no messages when in fact messages had been returned. This meant that messages were being processed as far as AWS SQS was concerned (in-flight), but our application code never saw them in order to process them.
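For illustration, the minimal Python sketch below (using boto3; the queue URL and handler are placeholders, not our actual worker code) shows why this failure mode is silent from the application's point of view: a response whose messages have been lost by a lower layer looks identical to an empty queue.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue"  # placeholder


def handle(message):
    """Hypothetical stand-in for the real background work handler."""
    print("processing", message["MessageId"])


def poll_once():
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    # If a lower-level dependency mis-parses the response, "Messages" is simply
    # absent -- indistinguishable from an empty queue -- even though SQS has
    # already marked the delivered messages as in-flight. No exception is
    # raised, so no error-handling path is triggered.
    for message in response.get("Messages", []):
        handle(message)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```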

This change went live from 10:58, with the first alert as a result of the unexpected behavior being triggered at 11:12. The bad change was reverted at 11:15 and fully removed by 11:20.

This meant that background work queued between 10:58 and 11:20 was stuck in limbo where AWS SQS thought it was being processed.

For our data centers in Australia, Canada, the UK, and Singapore, regular service was resumed at this point. New messages could be received and processed, and we only had to wait for the messages in limbo to be released by AWS SQS before they could be processed.

In our German and US data centers we had hit SQS's hard limit of 120,000 messages considered "in flight" for our high priority queue. This meant that we were unable to read from those queues, but were still allowed to write to them. Once we realised and understood this issue, we released a change to route all new messages to other queues and avoid the problem. This was in place at 12:00.
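For reference, the in-flight count that hit this limit is exposed by SQS as a queue attribute. The boto3 sketch below (placeholder queue URL and threshold, not our production tooling) shows how it can be observed.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/high-priority"  # placeholder

attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["ApproximateNumberOfMessagesNotVisible"],
)
in_flight = int(attrs["Attributes"]["ApproximateNumberOfMessagesNotVisible"])

# Standard SQS queues allow roughly 120,000 in-flight messages; beyond that,
# receives start failing while sends continue to succeed.
if in_flight > 100_000:  # placeholder threshold
    print(f"High in-flight count: {in_flight}; consider routing new work to another queue")
```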

Whilst we were able to make changes to remove the initial problem and avoid the effects of the secondary problem caused by hitting the hard limit, the status of the individual work messages was outside of our control. AWS SQS does not have a way to force in-flight messages back onto the queue, which is the operation we needed to resolve the issue.

We looked for alternatives, but work messages in this state aren't accessible in any way via the AWS APIs. Instead we had to wait for the configured timeout to expire, which would release the messages again.
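For context, returning an in-flight message to the queue early requires the ChangeMessageVisibility API, which in turn requires the receipt handle returned by ReceiveMessage; without those handles the messages cannot be targeted. A boto3 sketch (hypothetical function name, placeholder parameters) of what that call looks like when a handle is available:

```python
import boto3

sqs = boto3.client("sqs")


def release_message(queue_url: str, receipt_handle: str) -> None:
    """Return an in-flight message to the queue immediately.

    Only possible when the ReceiptHandle from the original ReceiveMessage
    response is available.
    """
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=0,  # 0 seconds: the message becomes visible again at once
    )
```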

We took more direct control over capacity throughout this incident, including preparing additional capacity for the backlog of work messages being released.

Once the work messages became visible after reaching their two hour timeout, we were able to process them successfully. Full service was resumed to all data centers, including processing of any delayed messages, by 13:30 UTC.

We then reverted the changes applied during the incident to help handle it, returning everything to its regular configuration.

Retrospective

The questions we ask ourselves in an incident retrospective are:

  • Could it have been identified sooner?
  • Could it have been resolved sooner?
  • Could it have been prevented?

Also, we don't want to focus too heavily on the specifics of an individual incident, but instead look for holistic improvements alongside targeted ones.

Could it have been identified sooner?

For something with this significant an impact, taking 12 minutes to alert us was too slow.

Halving the time to alert would have significantly reduced the impact of this incident, potentially avoiding the second-order issue experienced in our German and US data centers.

The false-negative nature of the behavior meant that other safeguards were not triggered. Cronofy's code was not mishandling or ignoring an error; the silent failure meant our application code was unaware of a problem.

Could it have been resolved sooner?

The key constraint on the resolution of the incident was the "in flight" timeout we had configured for the respective queues.

We don't want to rush such a change to a critical part of our infrastructure, but our initial analysis suggests a timeout of 15-30 minutes is likely reasonable and would have made a significant difference to the time to full service recovery.

Could it have been prevented?

As the cause was a change deployed by ourselves rather than an external factor, undoubtedly.

In hindsight, anything touching AWS-related dependencies must always be tested in our staging environment, and this change was not. Doing so would likely have led to the issue being noticed before it was deployed at all.

Actions

We will be creating additional alerts around metrics that went well outside their normal bounds during this incident and would have drawn our attention much sooner.
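As an illustration of the kind of alert we mean (names and thresholds below are placeholders, not our actual configuration), a CloudWatch alarm can be placed on the age of the oldest message in a queue, a metric that would have grown far beyond normal during this incident.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="high-priority-queue-oldest-message-age",  # placeholder
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "high-priority"}],  # placeholder
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=300,  # seconds; alert if the oldest message is more than 5 minutes old
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[],  # e.g. an SNS topic ARN that pages the on-call engineer
)
```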

We will be reducing the timeouts configured on our AWS SQS queues so that the time messages are considered "in-flight" without any other interaction aligns more closely with observed background processing execution times.
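Concretely, this "in flight" window is the queue's VisibilityTimeout attribute. A sketch of the change using boto3 (placeholder queue URL and value, not necessarily the final setting we will choose):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/high-priority"  # placeholder

sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={"VisibilityTimeout": "1800"},  # seconds; 30 minutes instead of 7200 (2 hours)
)
```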

We are changing how we reference AWS-related dependencies to make them more explicit, and to carry a warning that full testing must be performed in our staging environment first. We will also be adding the AWS dependencies to our quarterly patching cycle to keep them current, reducing the possibility of such cross-version incompatibilities.

Further questions?

If you have any further questions, please contact us at support@cronofy.com

Posted Apr 23, 2024 - 17:32 UTC

Resolved
Background task processing has remained normal and the incident is resolved.

Around 11:00 UTC, a change was deployed to production which inadvertently broke the background processing.

We received our first alert at 11:12 UTC. We reverted the offending change at 11:15 UTC, but this did not restore background processing completely.

Multiple changes were made to minimize the impact of the issue, which restored most functionality for newly created jobs by 11:54 UTC. This left a backlog of work queued for processing between 11:00-11:54 UTC stuck in an unprocessable state.

The stuck backlog of work was due to requeue at around 13:00 UTC, when its configured timeout of 2 hours would be reached. We attempted to find a way to process this work sooner but were unsuccessful. As anticipated, the work became available from around 13:00 UTC and was processed over about half an hour, completing by 13:30 UTC.

This fully restored normal operations.

A postmortem of the incident will take place and be attached to this incident in the next 48 hours. If you have any queries in the interim, please contact us at support@cronofy.com.
Posted Apr 22, 2024 - 13:54 UTC
Monitoring
The backlog of jobs has been processed and we are monitoring to verify that everything remains normal.
Posted Apr 22, 2024 - 13:29 UTC
Update
New work continues to process as normal, and the stuck jobs are expected to be processed in the next hour.
Posted Apr 22, 2024 - 12:57 UTC
Identified
Capacity has been increased but the stuck jobs are still not processing due to an issue with the queue. We are working to find a way to process these jobs. New jobs are avoiding the stuck queue and are processing as expected.
Posted Apr 22, 2024 - 12:32 UTC
Update
We have mitigated most of the impact for background processing but have jobs stuck in US & DE. We are working to further increase capacity and execute these jobs.
Posted Apr 22, 2024 - 12:04 UTC
Update
Processing has recovered in all data centers apart from US & DE, which have only partly recovered. We are working to mitigate the issue in these two data centers.
Posted Apr 22, 2024 - 11:41 UTC
Investigating
We have seen background processing degrade in all data centers after a recent deployment. We have reverted the change and are investigating the cause.
Posted Apr 22, 2024 - 11:23 UTC
This incident affected: Background Processing.