On Monday 23rd November 2020, our US data center was unavailable for approximately 2m30s from 12:54:15 to 12:56:45 UTC.
Any requests made to the US data center during this time are likely to have failed to connect or to have received a 500-range status code rather than being handled successfully. Full service resumed after this window, with no signs of degradation since.
Usually in such situations we are able to identify and share a definitive root cause, but in this case we have not yet been able to reach an answer we're fully satisfied with. In lieu of that, and in adherence with our principles, we are sharing the known symptoms and how they resulted in the US data center being unavailable for this period.
Every API call made is recorded to a journal. This allows us to surface calls on the developer dashboard, and also to offload as much work as possible to background processes so our API can respond quickly. This journal is stored in an AWS RDS Aurora Postgres database. The leading symptom of the outage is that writes to this journal began taking significantly longer than normal, to the point of timing out, as of 12:54:15.
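The pattern described above can be sketched in miniature. This is an illustrative model, not Cronofy's actual code: an in-memory queue stands in for the Aurora Postgres journal table, and the names (record_call, drain_journal) are hypothetical.

```python
import queue

# In-memory stand-in for the journal table; in production this would
# be a write to the Aurora Postgres database.
JOURNAL = queue.Queue()

def record_call(endpoint, status):
    """The API request path does only this single journal write;
    it does not wait on any downstream processing."""
    JOURNAL.put({"endpoint": endpoint, "status": status})

def drain_journal(handler):
    """Background worker: consumes journal entries to do per-call work
    (e.g. surfacing calls on a dashboard) off the request path."""
    while not JOURNAL.empty():
        handler(JOURNAL.get())

# An API handler records two calls, then a background pass consumes them.
record_call("/v1/events", 200)
record_call("/v1/calendars", 200)

seen = []
drain_journal(seen.append)
```

The key property is that the request only pays for one journal write, which is why that write suddenly timing out was enough to stall request handling across the board.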
This significant increase in the time taken to handle API requests produced back-pressure on the web servers, eventually leading them to take themselves offline altogether roughly 100 seconds later, and shortly afterwards to bring themselves back online.
This may sound undesirable, but our belief is that this behavior shortened the incident: it effectively removed all load from the journal, which allowed it to return to normal operation. Our concern is that we cannot definitively prove this.
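The take-itself-offline behavior is, in essence, a health check that trips under sustained failure. A minimal sketch, assuming a sliding window of recent journal-write outcomes (the class and thresholds here are illustrative, not our production implementation):

```python
from collections import deque

class HealthCheck:
    """Illustrative: a server reports itself unhealthy (removing itself
    from rotation) once journal-write timeouts dominate a recent window,
    then reports healthy again once writes succeed."""

    def __init__(self, window=5, max_failures=3):
        self.recent = deque(maxlen=window)  # True = write timed out
        self.max_failures = max_failures

    def observe(self, timed_out):
        self.recent.append(timed_out)

    def healthy(self):
        return sum(self.recent) < self.max_failures

hc = HealthCheck()
for _ in range(3):
    hc.observe(True)                 # journal writes timing out
took_itself_offline = not hc.healthy()

for _ in range(5):
    hc.observe(False)                # load shed; writes succeed again
back_online = hc.healthy()
```

The side effect this models is exactly what we observed: while the servers were out of rotation, the journal received no load, giving it room to recover before they returned.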
AWS RDS Performance Insights highlighted the issues with the journal, pointing to a large amount of time spent waiting on LWLock buffer_content. Guidance on this type of problem suggests reviewing updates and foreign key constraints. However, neither is relevant: the journal is, aside from housekeeping, append-only, so it performs no updates, and, in the interests of performance, it has no foreign key constraints.
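Performance Insights arrives at this kind of attribution by sampling active database sessions and grouping the samples by wait event. A toy aggregation over fabricated sample data (purely for illustration) shows how a spike like ours surfaces as the dominant event:

```python
from collections import Counter

# Fabricated wait-event samples, standing in for what Performance
# Insights collects by polling active sessions once per second.
samples = (
    ["LWLock:buffer_content"] * 40
    + ["IO:DataFileRead"] * 5
    + ["CPU"] * 5
)

by_event = Counter(samples)
dominant, count = by_event.most_common(1)[0]
# A session spends most of its sampled time waiting on the
# buffer_content lightweight lock, so that event dominates the chart.
```

This is why the signal was so clear even though the underlying cause was not: the sampling tells you where time was spent, not why the lock was contended in the first place.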
So whilst we know which component caused the outage, we do not understand what led it to behave abnormally, and therefore what direct steps we can take to address the root cause.
It's deeply unsatisfying for us all not to have a definitive "this was caused by X and so we have done Y" for an outage, but in this situation I'm satisfied that we have done sufficient investigation, and that we are taking steps to reduce the likelihood and impact of a similar problem in the future.
If you have further questions, please get in touch via support@cronofy.com
Garry Shutler, CTO and co-founder