On Monday 23rd November 2020, our US data center was unavailable for approximately 2m30s from 12:54:15 to 12:56:45 UTC.
Any requests made to the US data center during this time are likely to have failed to connect or to have received a 500-range status code rather than being handled successfully. Full service resumed after this window, with no signs of degradation since.
Usually in such situations we are able to identify and share a definitive root cause, but in this case we have not yet been able to reach an answer we're fully satisfied with. In lieu of that, and in adherence with our principles, we are sharing the known symptoms and how they resulted in the US data center being unavailable for this period.
Every API call made is recorded to a journal. This allows us to surface calls on the developer dashboard, and also to offload as much work as possible to background processes so our API can respond quickly. This journal is stored in an AWS RDS Aurora Postgres database. The leading symptom of the outage is that writes to this journal began taking significantly longer than normal, to the point of timing out, as of 12:54:15.
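The pattern described above can be sketched in miniature. This is an illustrative model, not Cronofy's actual code: an in-memory queue stands in for the Aurora Postgres journal table, and the names (record_call, drain_journal) are hypothetical.

```python
import queue

# In-memory stand-in for the journal table; in production this would
# be a write to the Aurora Postgres database.
JOURNAL = queue.Queue()

def record_call(endpoint, status):
    """The API request path does only this single journal write;
    it does not wait on any downstream processing."""
    JOURNAL.put({"endpoint": endpoint, "status": status})

def drain_journal(handler):
    """Background worker: consumes journal entries to do per-call work
    (e.g. surfacing calls on a dashboard) off the request path."""
    while not JOURNAL.empty():
        handler(JOURNAL.get())

# An API handler records two calls, then a background pass consumes them.
record_call("/v1/events", 200)
record_call("/v1/calendars", 200)

seen = []
drain_journal(seen.append)
```

The key property is that the request only pays for one journal write, which is why that write suddenly timing out was enough to stall request handling across the board.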
This significant increase in the time taken to handle API requests produced back-pressure on the web servers, eventually leading them to take themselves offline altogether roughly 100 seconds later, and shortly afterwards to bring themselves back online.
This may sound undesirable, but our belief is that this behavior shortened the incident: it effectively removed all load from the journal, which allowed it to return to normal operation. Our concern is that we cannot definitively prove this.
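The take-itself-offline behavior is, in essence, a health check that trips under sustained failure. A minimal sketch, assuming a sliding window of recent journal-write outcomes (the class and thresholds here are illustrative, not our production implementation):

```python
from collections import deque

class HealthCheck:
    """Illustrative: a server reports itself unhealthy (removing itself
    from rotation) once journal-write timeouts dominate a recent window,
    then reports healthy again once writes succeed."""

    def __init__(self, window=5, max_failures=3):
        self.recent = deque(maxlen=window)  # True = write timed out
        self.max_failures = max_failures

    def observe(self, timed_out):
        self.recent.append(timed_out)

    def healthy(self):
        return sum(self.recent) < self.max_failures

hc = HealthCheck()
for _ in range(3):
    hc.observe(True)                 # journal writes timing out
took_itself_offline = not hc.healthy()

for _ in range(5):
    hc.observe(False)                # load shed; writes succeed again
back_online = hc.healthy()
```

The side effect this models is exactly what we observed: while the servers were out of rotation, the journal received no load, giving it room to recover before they returned.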
AWS RDS Performance Insights highlighted the issues with the journal, pointing to a large amount of time spent waiting on LWLock buffer_content. Guidance on this type of problem suggests reviewing updates and foreign key constraints. However, neither is relevant: the journal is, aside from housekeeping, append-only, so it performs no updates, and, in the interests of performance, it has no foreign key constraints.
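Performance Insights arrives at this kind of attribution by sampling active database sessions and grouping the samples by wait event. A toy aggregation over fabricated sample data (purely for illustration) shows how a spike like ours surfaces as the dominant event:

```python
from collections import Counter

# Fabricated wait-event samples, standing in for what Performance
# Insights collects by polling active sessions once per second.
samples = (
    ["LWLock:buffer_content"] * 40
    + ["IO:DataFileRead"] * 5
    + ["CPU"] * 5
)

by_event = Counter(samples)
dominant, count = by_event.most_common(1)[0]
# A session spends most of its sampled time waiting on the
# buffer_content lightweight lock, so that event dominates the chart.
```

This is why the signal was so clear even though the underlying cause was not: the sampling tells you where time was spent, not why the lock was contended in the first place.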
So whilst we know which component caused the outage, we do not understand what led it to behave abnormally, and therefore what direct steps we can take to address the root cause.
It's deeply unsatisfying for us all not to have a definitive "this was caused by X and so we have done Y" for an outage, but in this situation I'm satisfied that we have done sufficient investigation, and that we are taking steps to reduce the likelihood and impact of a similar problem in the future.
If you have further questions, please get in touch via support@cronofy.com
Garry Shutler, CTO and co-founder