Increased API error rates in our US data center

Incident Report for Cronofy

Postmortem

On November 18th at 16:47 UTC we received several errors indicating that our database cluster had encountered a sync issue. This ultimately triggered an automated recovery mechanism, resulting in roughly 90 seconds during which most communication with the database was blocked.

The errors indicate that the root cause was an issue with the storage layer that the database relies on. No data was lost during the recovery, but customers will have seen our APIs respond with HTTP 500 errors while the database was unreachable.

We posted an incident page as soon as we had investigated and understood that there were no ongoing issues.

We have set up further alerting on these specific errors to provide more thorough monitoring coverage going forward. We will also shortly be announcing a database upgrade, planned prior to this incident, which will include stability improvements for our database technology.
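
As an illustration of the kind of alerting described above, here is a minimal sketch that turns the specific replication error message into a log-based metric with an alarm on it. It assumes the database logs are shipped to AWS CloudWatch Logs and managed via boto3, which is an assumption; this report does not name a monitoring stack, and the log group name, filter pattern, and SNS topic ARN are placeholders.

    # Minimal sketch: alert whenever the replication error message appears in the
    # database log group. CloudWatch Logs / boto3 are assumptions, and the log
    # group, filter pattern, and SNS topic ARN below are placeholders.
    import boto3

    logs = boto3.client("logs")
    cloudwatch = boto3.client("cloudwatch")

    # Turn matching log lines into a countable metric.
    logs.put_metric_filter(
        logGroupName="/production/database/errors",    # placeholder
        filterName="replication-sync-errors",
        filterPattern='"replication" "disconnected"',  # placeholder pattern
        metricTransformations=[{
            "metricName": "ReplicationErrorCount",
            "metricNamespace": "Database/Health",
            "metricValue": "1",
            "defaultValue": 0.0,
        }],
    )

    # Notify the on-call channel if the metric is non-zero in any one-minute window.
    cloudwatch.put_metric_alarm(
        AlarmName="database-replication-errors",
        Namespace="Database/Health",
        MetricName="ReplicationErrorCount",
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder
    )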

Timeline

  • 16:47 - We receive the first database log message relating to replication
  • 16:54 - Database replica lag begins to climb (see the lag-check sketch after this timeline)
  • 16:55 - We start to see more of the replication errors
  • 16:56 - Internal alerting highlights increased client and server errors
  • 16:57 - We receive a message stating that the database replica has been disconnected and reconnected, automatically resolving the issue
  • 16:58 - Database replica lag returns to normal
  • 17:14 - Internal incident channel opened for investigation
  • 17:37 - Public Status Page incident opened
  • 17:37 - We monitor the database to confirm error levels have returned to normal
  • 18:33 - Status Page incident closed
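
The timeline refers to replica lag climbing and then returning to normal. As a minimal sketch of how that lag can be measured, the snippet below queries a streaming replica directly. It assumes a PostgreSQL replica reachable via psycopg2 (the report does not name the database technology), and the connection string and threshold are placeholders.

    # Minimal sketch of a replica-lag check. PostgreSQL and psycopg2 are
    # assumptions; the DSN and threshold are placeholders.
    import psycopg2

    LAG_THRESHOLD_SECONDS = 30  # placeholder threshold

    def replica_lag_seconds(dsn: str) -> float:
        """Return how many seconds the replica's replay position trails the primary."""
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                # now() minus the commit timestamp of the last transaction replayed
                # on this replica; an approximation of lag that is NULL (treated as
                # 0 here) if nothing has been replayed yet.
                cur.execute(
                    "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
                )
                (lag,) = cur.fetchone()
                return float(lag or 0.0)

    if __name__ == "__main__":
        lag = replica_lag_seconds("host=replica.example.internal dbname=app")  # placeholder DSN
        status = "ALERT" if lag > LAG_THRESHOLD_SECONDS else "OK"
        print(f"{status}: replica lag is {lag:.0f}s")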

Retrospective

  • Could it have been identified sooner?

    • Yes

      • The database replication error is rare enough that it warranted a notification. Had we been alerted to it, we would’ve known sooner that there was an issue.
  • Could it have been resolved sooner?

    • No

      • Even if we had known about this earlier, we couldn’t have done anything beyond what the automated recovery already did to resolve the issue.
  • Could it have been prevented?

    • No

      • With our current understanding, the issue appears to relate to the database’s storage layer, which is outside of our control. Once it had started, there was no action we could’ve taken to resolve it, as it essentially self-healed.

Actions

  • Set up more alerting for specific database replication errors.
  • Conduct further internal reviews to solidify our understanding of the above errors.
Posted Nov 19, 2025 - 15:48 UTC

Resolved

Between 16:55 UTC and 17:03 UTC we observed increased API error rates in our US data center. During this time customers may have received 401 Unauthorized or 500 Internal Server Error responses. Since that time error rates have returned to normal levels.

A postmortem of the incident will take place and be attached to this incident in the coming days.
Posted Nov 18, 2025 - 18:33 UTC

Update

Error rates remain at normal levels, and we are continuing to monitor.
Posted Nov 18, 2025 - 18:18 UTC

Monitoring

Between 16:55 UTC and 17:03 UTC we observed increased API error rates in our US data center.

Error rates returned to normal by 17:03 UTC. We are monitoring and investigating the root cause.
Posted Nov 18, 2025 - 17:37 UTC