German data centre outage

Incident Report for Cronofy

Postmortem

All,

We’re sorry that you experienced issues with Cronofy services as a result of this incident.

This outage was caused by EC2/ASG issues within the AWS eu-central-1 region. As a result, nodes could not spin up to handle an increase in traffic. Eventually, all resources were consumed, causing unresponsiveness and 5xx errors. Normal service was restored when Amazon resolved the issue.

Cronofy has conducted a full postmortem and identified remediation steps. These include improving monitoring of the Cronofy infrastructure and adding infrastructure to fall back to in the event of a recurrence.

Please contact our support team with any further queries at support@cronofy.com.

Thanks

Karl Bagci, Head of Operations

Posted Nov 14, 2019 - 14:45 UTC

Resolved

The underlying AWS incident has been marked resolved and we have not seen any problems since then.

Posted Nov 12, 2019 - 14:30 UTC

Update

Service seems stable but AWS are still reporting issues (see https://status.aws.amazon.com/).

Until AWS close their incidents we will monitor in case of regression.

Posted Nov 12, 2019 - 11:51 UTC

Monitoring

There seems to have been a general issue with provisioning additional servers within AWS Frankfurt (eu-central-1), as several mechanisms we tried were failing. This could be related to the network errors reported by AWS, but that is not confirmed by their status updates.

These appear to be working now, at least partially, and sufficient capacity has been added to resume normal service and also catch up with the backlog of work.

We are continuing to monitor and have changed configuration to ensure we are overprovisioned whilst we understand what is happening.

Posted Nov 12, 2019 - 09:59 UTC

Update

Additional servers failed to be added as load increased. This may be related to some network issues reported by Amazon in the eu-central-1 region, but we are still investigating.

We're taking steps to restore normal service whilst continuing our investigation.

Posted Nov 12, 2019 - 09:25 UTC

Investigating

We are currently investigating this issue.

Posted Nov 12, 2019 - 08:32 UTC

This incident affected: API, Background Processing, and Developer Dashboard.