Events being removed from Apple calendars

Incident Report for Cronofy

Postmortem

Between Saturday February 1st 2020 and Tuesday February 4th 2020, we were falsely reporting events as being cancelled for some Apple calendar users.

The underlying cause was that Apple altered how they responded to one type of request we make in such a way that we incorrectly treated some events as being deleted. We believe this change happened on or slightly before Saturday February 1st 2020.

Based on our investigation this affected up to 5% of Apple calendars. The relatively low volume coupled with the subtlety in the change of the data received meant the issue slipped past our monitoring and alerting. It was the arrival of similar support tickets that alerted us to the problem.

Once identified, the problem was resolved by 14:00 UTC on Tuesday February 4th 2020, and all potentially affected calendars had been resynchronized an hour later, by 15:00 UTC.

Further details, lessons learned, and further actions we will be taking can be found below.

Timeline

On Monday February 3rd we received two support tickets informing us of events being erroneously cancelled in Apple calendars. These were escalated to engineering on Tuesday February 4th as they had not been picked up on the Monday.

As the support tickets reported common symptoms, they were investigated together. This investigation started at around 10:15 UTC on Tuesday.

Around 11:00 UTC the engineer investigating began to suspect the most likely cause was a systemic issue rather than a user error or a problem with an integration.

Five minutes later a third similar support ticket arrived, and the call was made to open an incident so that we could communicate the issue, and our progress against it, widely.

As part of this we created an internal video conference and the investigating engineer was joined by another engineer to assist with the investigation and resolution.

During such incidents the priorities are:

Understand the problem
Stop things getting worse
Restore full functionality
Repair as much damage as possible

Our initial investigation revealed no particular pattern to the accounts affected or the timing of the events affected, other than the deletions occurring on or after Saturday and the calendars being Apple calendars.

We reviewed our synchronization engine for recent changes, as the first thing to eliminate is a recent regression in behavior. This bore no fruit, so we delved into verifying whether events flagged as deleted still existed on Apple's servers.

Finding they did exist, we investigated how we could be treating them as having been deleted. This led to an inspection of the point at which we process responses from Apple's calendar servers. Our hypothesis had shifted towards Apple's responses having changed in some way.

We found that Apple were responding strangely to a particular type of request, one that fetches multiple events in a single request. This is commonly done as part of a synchronization process: a CalDAV server tells you the URLs of events that have changed, and you then make a request to get the full details of those events. Specifically, this is a CALDAV:calendar-multiget REPORT.

We make CalDAV requests to get the details of multiple events at a time for the sake of efficiency. For example we'll ask Apple's calendar servers:

Send the details for event1, event2, and event3 in Julia's calendar

In response, we expect something like:

In Julia's calendar: event1 is on Monday, event2 does not exist, event3 is on Thursday.
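On the wire, a request like this is a CALDAV:calendar-multiget REPORT as defined in RFC 4791, and the answer is a 207 Multi-Status body with one response element per requested URL. The Python sketch below is illustrative only: the calendar paths, helper names, and sample response are assumptions made for this example, not our production code or Apple's literal output.

```python
import xml.etree.ElementTree as ET

DAV = "DAV:"
CALDAV = "urn:ietf:params:xml:ns:caldav"
ET.register_namespace("D", DAV)
ET.register_namespace("C", CALDAV)


def build_multiget_body(hrefs):
    """Build the XML body of a calendar-multiget REPORT for the given event URLs."""
    root = ET.Element(f"{{{CALDAV}}}calendar-multiget")
    prop = ET.SubElement(root, f"{{{DAV}}}prop")
    ET.SubElement(prop, f"{{{DAV}}}getetag")
    ET.SubElement(prop, f"{{{CALDAV}}}calendar-data")
    for href in hrefs:
        ET.SubElement(root, f"{{{DAV}}}href").text = href
    return ET.tostring(root, encoding="unicode")


def parse_multistatus(xml_text):
    """Map each href in a Multi-Status body to 'found' or 'missing'."""
    states = {}
    for response in ET.fromstring(xml_text).findall(f"{{{DAV}}}response"):
        href = response.findtext(f"{{{DAV}}}href")
        # A missing resource carries a 404 status directly on its response element.
        status = response.findtext(f"{{{DAV}}}status") or ""
        states[href] = "missing" if " 404 " in status else "found"
    return states


# Hypothetical event URLs in Julia's calendar.
HREFS = [
    "/calendars/julia/event1.ics",
    "/calendars/julia/event2.ics",
    "/calendars/julia/event3.ics",
]

# A hand-written example of the kind of response we expect (calendar data omitted).
SAMPLE_RESPONSE = """<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/calendars/julia/event1.ics</D:href>
    <D:propstat>
      <D:prop><D:getetag>"etag-1"</D:getetag></D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
  <D:response>
    <D:href>/calendars/julia/event2.ics</D:href>
    <D:status>HTTP/1.1 404 Not Found</D:status>
  </D:response>
  <D:response>
    <D:href>/calendars/julia/event3.ics</D:href>
    <D:propstat>
      <D:prop><D:getetag>"etag-3"</D:getetag></D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""

print(build_multiget_body(HREFS))
print(parse_multistatus(SAMPLE_RESPONSE))
```

The important property is that every URL we ask about gets its own answer, including an explicit "not found" for event2.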

The change in behavior we observed is that, rather than telling us the details for all events, Apple's calendar server stopped telling us about any further events once it encountered one that no longer exists.

So in our example, when asking:

Send the details for event1, event2, and event3 in Julia's calendar

We are now receiving:

In Julia's calendar: event1 is on Monday.

So even though event3 exists, we are no longer being told about it. This change in behavior violated an assumption of ours about how CalDAV servers behave in a way we were not set up to detect, which meant we incorrectly believed event3 no longer existed.
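To make the failure mode concrete, here is a deliberately simplified Python sketch. It assumes a rule along the lines of "any requested event not reported as found has been deleted"; the function name and the data shapes are hypothetical and only stand in for the assumption described above.

```python
def infer_deleted(requested, response):
    """Flawed rule: anything not reported as found is assumed deleted."""
    return [event for event in requested if response.get(event) != "found"]


requested = ["event1", "event2", "event3"]

# What we expect: an explicit answer for every event we asked about.
expected = {"event1": "found", "event2": "missing", "event3": "found"}

# What we started receiving: nothing after the first event that no longer exists.
truncated = {"event1": "found"}

print(infer_deleted(requested, expected))   # ['event2']
print(infer_deleted(requested, truncated))  # ['event2', 'event3']
```

With the truncated response, silence about event3 is indistinguishable from a deletion, which is how an existing event ends up reported as cancelled.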

When we understood the source of the problem we were quickly able to both stop things getting worse and restore full functionality.

The fix put in place was to verify the answer received from the server contained an answer for each event we asked about, and if not to fall back to checking each event individually. This highlighted any ambiguous responses from Apple, and took a brute force approach in this situation to provide higher certainty about the state of someone's calendar. This was in place by 14:00 UTC.
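The shape of this fix can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code we deployed: `resolve_states`, `truncating_fetch_many`, `fetch_one`, and the in-memory `SERVER` are all hypothetical stand-ins for the real CalDAV requests.

```python
def resolve_states(requested, fetch_many, fetch_one):
    """Return a state for every requested event, never inferring from silence."""
    states = fetch_many(requested)
    unanswered = [event for event in requested if event not in states]
    if not unanswered:
        return states
    # Ambiguous response: brute force, checking each event individually.
    return {event: fetch_one(event) for event in requested}


# Hypothetical server state: event2 has been deleted, event1 and event3 exist.
SERVER = {"event1", "event3"}


def truncating_fetch_many(events):
    """Simulates the changed behavior: stop answering at the first missing event."""
    answers = {}
    for event in events:
        if event not in SERVER:
            break
        answers[event] = "found"
    return answers


def fetch_one(event):
    """Simulates a request for a single event, which gets a definite answer."""
    return "found" if event in SERVER else "missing"


states = resolve_states(["event1", "event2", "event3"], truncating_fetch_many, fetch_one)
print(states)  # {'event1': 'found', 'event2': 'missing', 'event3': 'found'}
```

The trade-off is extra requests whenever a response is incomplete, in exchange for never treating a missing answer as a deletion.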

To repair the damage done, we checked all Apple calendars for signs of this issue. We did this by finding all Apple calendars where it appeared a user had deleted an event in the past 7 days. Not all these calendars would have been affected, but it narrowed things down sufficiently to make for a quicker resolution for those who were.

We chose 7 days as the reports we had did not include any deletions happening before Saturday February 1st, and due to the nature of the outcome we felt that we would have received reports earlier had the change in behavior happened significantly before then.

With the candidates identified, we performed a full synchronization of their calendars and checked for events being modified in that window of time to gauge the potential impact. This process of identification and then synchronization took around an hour from 14:00-15:00 UTC.
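The candidate selection described above amounts to a simple time-window filter. The Python sketch below assumes a hypothetical list of (calendar ID, deletion time) records; the record shape, the calendar IDs, and the function name are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def candidate_calendars(deletions, now, window_days=7):
    """Return the calendars with at least one apparent deletion inside the window."""
    cutoff = now - timedelta(days=window_days)
    return sorted({calendar_id for calendar_id, deleted_at in deletions if deleted_at >= cutoff})


# Hypothetical records of apparent deletions, as (calendar ID, time in UTC).
deletions = [
    ("cal-a", datetime(2020, 2, 2, 9, 30, tzinfo=timezone.utc)),
    ("cal-b", datetime(2020, 1, 20, 12, 0, tzinfo=timezone.utc)),
    ("cal-a", datetime(2020, 2, 3, 17, 45, tzinfo=timezone.utc)),
    ("cal-c", datetime(2020, 1, 28, 15, 0, tzinfo=timezone.utc)),
]

now = datetime(2020, 2, 4, 14, 0, tzinfo=timezone.utc)
print(candidate_calendars(deletions, now))  # ['cal-a', 'cal-c']
```

A filter like this over-selects on purpose: a calendar with a genuine deletion is still a candidate, and the full synchronization that follows is what establishes the true state.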

We continued to monitor the situation and were happy the problem was resolved by 16:00 UTC and closed the incident shortly after.

Retrospective

The primary questions we had in our retrospective were:

Could we have detected this before our customers?
Could we have elevated this sooner?

The root cause of this incident was a change in Apple's responses which does not follow the CalDAV specification, combined with our integration assuming the specification would be followed.

Users deleting events is not, in and of itself, an unexpected action.

Had Apple's responses reported an error we would have noticed sooner, but it was a somewhat silent failure. The volume of events affected meant that the deletions would fall within a margin of error, and so we would reasonably assume the events themselves had actually been deleted.

We do not believe our integration was unreasonable, but it was not as hardened as it could be, in part because Apple have been one of the most reliable providers we integrate with over the years.

We would have wanted to investigate this problem sooner, but with no alerts being triggered and a low volume of tickets, that did not happen until the following day. In hindsight this was not correct, but we feel it was not unreasonable.

Actions

We fell beneath our own high standards in how long it took us to identify and resolve this incident.

There was no single thing which could be pointed to as a cause for the delay. Nonetheless, there are several actions we will be taking to reduce the chances of something similar happening in future:

We will be raising concerns in a more visible way internally, using our shared channel when a second ticket of a similar nature is received, to encourage eyes on problems sooner
We will be reviewing our CalDAV integration code for places where we can be more assertive about the type of response we are expecting, helping to flag similar problems at source
We will be reviewing our monitoring to see if additional measures could be put in place to recognize a deviation such as this

There are also actions we will be taking directly related to this incident:

We will be filing a bug with Apple to encourage them to revert the change in response on their side
We will be improving our fix to be less "brute force" in approach

Further questions?

If you have any further questions, please contact us at support@cronofy.com

Posted Feb 05, 2020 - 16:17 UTC

Resolved

We have fully resynchronized all Apple calendars that may have been affected. This will have reinstated events that were cancelled in error.

A post mortem of the incident will now take place and be attached to this incident.

If you have any queries in the interim, please contact us at support@cronofy.com.

Posted Feb 04, 2020 - 16:12 UTC

Monitoring

We believe we have resolved the cause of the false cancellations and are continuing to monitor.

In parallel, we are looking to fully resynchronize all Apple calendars that may have been affected. Falsely cancelled events will be reinstated by this process.

Posted Feb 04, 2020 - 14:15 UTC

Update

Our Engineering team continue to work to eliminate the impact this incident is having on events in Apple calendars. We've deployed changes to further mitigate the impact.

We will continue to work to fully remediate the incident, and will keep you updated as we progress.

Posted Feb 04, 2020 - 13:52 UTC

Identified

Our investigations have identified that Apple have changed how they respond to certain requests. This change resulted in the deletion of events from Apple calendars.

We've identified a potential remediation step to prevent further impact of this incident. The change we're making stops the incident impacting other events, but may affect synchronization with Apple calendars in the short term.

We'll continue to work on this incident in order to restore full service. We will continue to provide updates as we progress.

Posted Feb 04, 2020 - 12:44 UTC

Update

Our Engineering team are still investigating the issue. We'll have another update on the incident before 13:00 GMT.

Posted Feb 04, 2020 - 11:58 UTC

Investigating

Cronofy are currently investigating reports of events being removed from Apple calendars.

We will provide another update before 12:00 GMT.

Posted Feb 04, 2020 - 11:16 UTC

This incident affected: Major Calendar Providers (Apple).