On November 18th at 16:47 UTC we received several errors indicating that our database cluster had encountered a replication sync issue. This triggered an automated recovery mechanism, which blocked most communication with the database for roughly 90 seconds.
The errors indicate that the root cause was an issue with the storage layer the database relies on. No data was lost during the recovery, but customers will have seen our APIs respond with HTTP 500 errors while the database was unreachable.
We posted a public incident page as soon as we had investigated and confirmed that there were no ongoing issues.
We have added alerting on these specific errors to provide more thorough monitoring coverage. We will also shortly announce a database upgrade, planned prior to this incident, which will include stability improvements for our database technology.
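For readers curious what this kind of check can look like, below is a minimal sketch of a replica-lag alert. It assumes a PostgreSQL-style replica (the incident report does not name our specific database technology), a hypothetical 30-second threshold, and a placeholder send_alert helper; it is illustrative only, not our actual monitoring configuration.

```python
# Minimal sketch of a replica-lag check.
# Assumptions (not from the incident report): a PostgreSQL replica,
# a 30-second lag threshold, and a placeholder send_alert() hook.
import psycopg2

LAG_THRESHOLD_SECONDS = 30  # hypothetical threshold


def send_alert(message: str) -> None:
    # Placeholder: wire this up to your paging/alerting system.
    print(f"ALERT: {message}")


def check_replica_lag(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # On a PostgreSQL replica, this returns how far the replica's
        # replay position is behind, in seconds (NULL on a primary).
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
        )
        lag = cur.fetchone()[0]
        if lag is not None and lag > LAG_THRESHOLD_SECONDS:
            send_alert(f"Replica lag is {lag:.0f}s (threshold {LAG_THRESHOLD_SECONDS}s)")
```

In practice a check like this would run on a schedule or be expressed as a rule in a metrics/alerting system; the point is simply that replica lag is a measurable signal that can page someone before client-facing errors appear.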
16:47 - We get the first database log message relating to replication
16:54 - Database replica lag begins to climb
16:55 - We start to see more of the replication errors
16:56 - Internal alerting highlights increased client and server errors
16:57 - We receive a message stating that the database replica was disconnected and reconnected automatically to resolve the issue
16:58 - Database replica lag returns to normal
17:14 - Internal incident channel opened for investigation
17:37 - Public Status Page incident opened
17:37 - Monitored the database to make sure error levels had returned to normal
18:33 - Status Page incident closed
Could it have been identified sooner?
Yes
Could it have been resolved sooner?
No
Could it have been prevented?
No