US data center unresponsive

Incident Report for Cronofy

Postmortem

What happened

On Friday 5th February 2021, our US data center was unresponsive for nearly three minutes, from 16:53:45 UTC to 16:56:40 UTC.

During this time, both our API at api.cronofy.com and our web application at app.cronofy.com were unreachable. Our web application hosts the developer dashboard, Scheduler application, Real-Time Scheduling pages, and end-user authorization flows. Our background processing of jobs such as calendar synchronization was not affected.

We first became aware of the issue at 16:55:07 UTC and began investigating. However, the data center recovered automatically without direct action from us.
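As a quick check on the timeline, the outage duration and the delay before we became aware of it can be worked out from the three timestamps quoted above. This is a minimal sketch; the helper name `utc` is ours and purely illustrative.

```python
from datetime import datetime, timezone


def utc(hms: str) -> datetime:
    """Parse an HH:MM:SS time on 5 February 2021 as a UTC datetime (illustrative helper)."""
    parsed = datetime.strptime(f"2021-02-05 {hms}", "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


# Timestamps taken from the report text.
outage_start = utc("16:53:45")
first_aware = utc("16:55:07")
outage_end = utc("16:56:40")

outage_seconds = int((outage_end - outage_start).total_seconds())
detection_seconds = int((first_aware - outage_start).total_seconds())

print(f"Outage lasted {outage_seconds} s")            # 175 s, i.e. 2 min 55 s
print(f"Detected {detection_seconds} s after start")  # 82 s, i.e. 1 min 22 s
```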

Our investigation

We identified that the symptoms of the issue matched those of an incident on November 23, 2020. This time, we were able to identify the underlying cause with much more certainty.

The root cause was determined to be a slowdown in the database table responsible for logging developers' API calls to us. As calls backed up, the application eventually stopped serving requests entirely. Our API and web applications are hosted together, and as a result both were affected by the downtime despite the underlying issue being specific to our API.
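The way a slow logging write can stall request serving can be illustrated with a toy capacity model. This is only a sketch under stated assumptions: it assumes a fixed pool of workers that each write the log row synchronously before taking the next request, and every number (worker count, timings, arrival rate) is hypothetical rather than taken from our systems.

```python
def backlog_after(seconds: float, arrival_rate: float, workers: int,
                  handler_time: float, log_write_time: float) -> float:
    """Requests left waiting after `seconds` in a simple steady-state model.

    Assumes each worker handles one request at a time and performs the
    log write synchronously as part of that request (an assumption).
    """
    # Requests per second the whole pool can complete.
    capacity = workers / (handler_time + log_write_time)
    # The backlog only grows when arrivals outpace capacity.
    return max(0.0, (arrival_rate - capacity) * seconds)


# Hypothetical figures: 20 workers, 200 requests/s arriving, 45 ms of handler work.
healthy = backlog_after(60, arrival_rate=200, workers=20,
                        handler_time=0.045, log_write_time=0.005)
slowed = backlog_after(60, arrival_rate=200, workers=20,
                       handler_time=0.045, log_write_time=0.455)

print(f"Backlog after 60 s with 5 ms log writes:   {healthy:.0f}")
print(f"Backlog after 60 s with 455 ms log writes: {slowed:.0f}")
```

In this model the pool keeps up comfortably while log writes are fast, but once each write takes hundreds of milliseconds the pool's capacity drops below the arrival rate and waiting requests accumulate quickly, which is the general shape of what we observed.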

AWS RDS Performance Insights identified an automatically scheduled housekeeping operation as being responsible for a disproportionate load on the database table, at an already busy time.

What we're doing

Today, we are making changes to our database configuration for the relevant table. The effect of this will be that individual housekeeping operations will happen more often, but will be smaller in scope, faster, and more predictable in their timing. We've previously made the same change to other tables that have similar access patterns and higher size and throughput, so we are confident in its effect.
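The trade-off between frequency and size of housekeeping runs can be sketched with a simple threshold model. This report does not name the database engine or the settings changed; the sketch below borrows the trigger formula used by PostgreSQL's autovacuum (base threshold plus scale factor times row count) purely as a stand-in, and the table size, change rate, and tuned values are all hypothetical.

```python
def rows_per_run(table_rows: int, base_threshold: int, scale_factor: float) -> float:
    """Changed rows that accumulate before a housekeeping run is triggered."""
    return base_threshold + scale_factor * table_rows


def runs_per_day(changed_rows_per_day: int, table_rows: int,
                 base_threshold: int, scale_factor: float) -> float:
    """Approximate number of housekeeping runs triggered per day."""
    return changed_rows_per_day / rows_per_run(table_rows, base_threshold, scale_factor)


# Hypothetical table: 100 million rows, 10 million changed rows per day.
TABLE_ROWS = 100_000_000
CHANGED_PER_DAY = 10_000_000

# Table-size-relative trigger (50 and 0.2 are PostgreSQL's autovacuum defaults).
default_rows = rows_per_run(TABLE_ROWS, base_threshold=50, scale_factor=0.2)
default_runs = runs_per_day(CHANGED_PER_DAY, TABLE_ROWS, 50, 0.2)

# Hypothetical per-table override with a much smaller scale factor.
tuned_rows = rows_per_run(TABLE_ROWS, base_threshold=50, scale_factor=0.001)
tuned_runs = runs_per_day(CHANGED_PER_DAY, TABLE_ROWS, 50, 0.001)

print(f"default: {default_rows:,.0f} rows per run, {default_runs:.1f} runs/day")
print(f"tuned:   {tuned_rows:,.0f} rows per run, {tuned_runs:.1f} runs/day")
```

Under these assumed figures, lowering the scale factor turns one very large run every couple of days into many small runs each day, which matches the intent described above: more frequent, but smaller and more predictable, housekeeping.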

In the near term, we will change the deployment of our applications, so that the API and web applications run more independently. This will help lessen the impact of any future issues of this nature.

Posted Feb 09, 2021 - 12:27 UTC

Resolved

This outage appears very similar in nature to the previous US outage on Nov 23, 2020. Details of that incident and its postmortem can be found here: https://status.cronofy.com/incidents/01syy96xwvpy

We will conduct a full root cause analysis and publish the outcomes in the coming days.

In the meantime, if you have further questions, please email support@cronofy.com.

Posted Feb 05, 2021 - 17:32 UTC

Monitoring

The US data center began showing signs of the API being inaccessible at 16:53:45 and was fully accessible again from 16:56:40 (all times UTC).

Indications are that the primary database in this data center experienced an event that started at this time and ended around 16:56:30, with external access recovering a few seconds later.

All metrics have returned to normal levels since this time.

Posted Feb 05, 2021 - 17:19 UTC

Investigating

The US data center was temporarily unreachable. It has recovered automatically and we are investigating the cause.

Posted Feb 05, 2021 - 16:58 UTC

This incident affected: API and Background Processing.