Public Link Issues

Incident Report for Cronofy

Postmortem

Between 08:43 and 13:23 UTC on Thursday May 22nd 2025, attempts to book times via the Public Links feature of the Scheduler failed. Visitors to Public Links would instead have seen an erroneous message that the Public Link was disabled.

The root cause was the development of a new feature on top of Public Links: normal booking flows were erroneously interpreted as making use of the new feature, and then failed a validation check that should not have been applied to them.
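This report does not include the code involved. As a purely illustrative sketch of this class of bug and one way to guard against it, the example below only applies the new feature's validation when a request explicitly opts in to that feature. Every name here (`BookingRequest`, `new_feature_token`, `handle_booking`) is hypothetical and not taken from the Scheduler codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BookingRequest:
    # Hypothetical request shape; the real one is not described in this report.
    slot: str
    # Only set when the (hypothetical) new feature is actually in use.
    new_feature_token: Optional[str] = None


def uses_new_feature(request: BookingRequest) -> bool:
    # Explicit opt-in: a plain booking carries no token, so it is never
    # mistaken for a new-feature booking.
    return request.new_feature_token is not None


def validate_new_feature(request: BookingRequest) -> bool:
    # Stand-in for the validation check that should apply to the new feature only.
    return request.new_feature_token == "valid"


def handle_booking(request: BookingRequest) -> str:
    # The validation runs only for requests that opted in to the new feature.
    if uses_new_feature(request) and not validate_new_feature(request):
        return "disabled"
    return "booked"
```

Under these assumptions, a vanilla booking such as `handle_booking(BookingRequest(slot="10:00"))` never reaches the new validation, which is the behaviour a regression test for the normal flow would pin down.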

As a result, 23 attempted bookings were not accepted.

During the incident, we fixed the root cause with a patch.

Following the incident, we contacted all affected owners of the impacted Public Links.

Timeline

All times are on May 22nd 2025.

  • 08:43 UTC - A change is merged which causes Public Link bookings to fail with a “disabled” message
  • 12:50 UTC - The issue is raised internally as a result of our own testing of the product
  • 12:57 UTC - The problematic code is identified, but we are unable to confidently issue a simple rollback since other changes to the area had since been introduced
  • 13:18 UTC - A fix to the code is written and reviewed
  • 13:23 UTC - The fix is approved and merged
  • 13:30-16:00 UTC - Work is undertaken to identify all affected Public Links
  • 16:29 UTC - All affected customers are notified

Retrospective

We always ask the questions:

  • Could the issue have been resolved sooner?
  • Could the issue have been identified sooner?
  • Could the issue have been prevented?

In this case, we feel our resolution time was reasonably good. In hindsight, we might have released the fix around 15 minutes sooner had we acted more confidently and started on a patch while the internal debate and investigation over whether we could roll back was still under way.

We are not happy with our speed of identification, nor with the fact that this issue could have been prevented quite easily.

On identification: the issue was live for around 4 hours before being noticed internally. The failure manifested as users being routed to an otherwise normal “disabled” page, and bookings not being made. It’s harder to alert on things not happening, particularly given the usage rate of this feature and the natural peaks and troughs of daily activity.

We did see an area for improvement in that we were using the normal “disabled” page as a catch-all for a few other error cases; by adding telemetry around these different cases, we can positively identify unusual behaviour separately from the normal “disabled” case.
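A minimal sketch of that idea, assuming an in-process counter stands in for whatever telemetry system is actually used: each route to the “disabled” page records why it was taken, so the expected case can be separated from the error cases. The reason names and function names are hypothetical.

```python
from collections import Counter
from enum import Enum


class FallbackReason(Enum):
    # Hypothetical reasons; the real error cases are not listed in this report.
    LINK_DISABLED = "link_disabled"          # the normal, expected case
    VALIDATION_FAILED = "validation_failed"  # an error case sharing the same page
    UNKNOWN_ERROR = "unknown_error"          # any other catch-all case


# Stand-in for a real metrics client.
fallback_counts: Counter = Counter()


def render_disabled_page(reason: FallbackReason) -> str:
    # Record the reason before falling back, so the cause is not lost.
    fallback_counts[reason] += 1
    return "This Public Link is disabled."


def unexpected_fallbacks() -> int:
    # Everything except the genuinely disabled case is worth alerting on.
    return sum(
        count
        for reason, count in fallback_counts.items()
        if reason is not FallbackReason.LINK_DISABLED
    )
```

The visitor sees the same page in every case; only the recorded reason differs, which is what allows an alert on `unexpected_fallbacks()` without relying on noticing that bookings have stopped.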

We also saw possibilities to improve our playbook for monitoring usage of new features. In this case, had we configured a trigger with a shorter duration and period, we would have seen unexpected activity for the unreleased new feature. This would have led us to investigate and notice the issue sooner.
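As a rough illustration of such a trigger, the sketch below counts events inside a sliding window and fires once a threshold is reached. It models only a window length and a threshold, not the configuration of any particular monitoring product, and the class and parameter names are hypothetical.

```python
from collections import deque


class WindowTrigger:
    """Fires when at least `threshold` events fall within the last `period_seconds`."""

    def __init__(self, period_seconds: float, threshold: int) -> None:
        self.period_seconds = period_seconds
        self.threshold = threshold
        self._events: deque = deque()

    def record(self, timestamp: float) -> bool:
        # Add the new event, then drop events that have aged out of the window.
        self._events.append(timestamp)
        while self._events and self._events[0] <= timestamp - self.period_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

For a feature that has not been released, any usage at all is unexpected, so a threshold of 1 over a short window, for example `WindowTrigger(period_seconds=300, threshold=1)`, would fire on the first event.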

Prevention is the place with the clearest room for improvement. We had focused our manual testing on the new feature being developed, and failed to test the vanilla case of Public Link bookings, which was touched by the changed code.

We use both automated and manual tests during feature development. The nature of the underlying issue, and the interaction of multiple steps of the booking flow needed to cause it, made it less likely that our automated test suite would catch it. However, we failed to catch the erroneous code at the code review stage, and failed to manually test the critical path adjacent to the changes being made.

Actions

We are going to manually test critical paths in our system more strictly when making changes adjacent to those areas, to ensure there aren’t unintended side effects from in-development features.

We are going to add increased monitoring of the error cases in the affected area, before they are sent to any fallback pages, so that anomalous activity and behaviour are positively identifiable and trigger a visible alert.

We are going to review our playbook for new feature usage telemetry to add better guidance for engineers to set up triggers that are more visible by default.

Posted May 23, 2025 - 14:12 UTC

Resolved

We have reverted the change and confirmed that Public Links are now creating events as expected.

A postmortem of the incident will take place and be attached to this incident within the next 3 working days.

If you have any queries in the interim, please contact us at support@cronofy.com
Posted May 22, 2025 - 13:31 UTC

Monitoring

We have reverted the broken code. The revert is deploying at present, and we are monitoring to find out how many requests may have been affected.
Posted May 22, 2025 - 13:25 UTC

Identified

We have identified that a recent change has caused Public Links to error. We are reverting this imminently.
Posted May 22, 2025 - 13:08 UTC
This incident affected: Scheduler.