Between Saturday February 1st 2020 and Tuesday February 4th 2020, we were falsely reporting events as being cancelled for some Apple calendar users.
The underlying cause was that Apple altered how they responded to one type of request we make in such a way that we incorrectly treated some events as being deleted. We believe this change happened on or slightly before Saturday February 1st 2020.
Based on our investigation this affected up to 5% of Apple calendars. The relatively low volume, coupled with the subtle change in the data we received, meant the issue slipped past our monitoring and alerting. It was the arrival of similar support tickets that alerted us to the problem.
Once identified, the problem was resolved by Tuesday February 4th 2020 at 14:00 UTC, and all potentially affected calendars resynchronized an hour later by 15:00 UTC.
Further details, lessons learned, and the actions we will be taking can be found below.
On Monday February 3rd we received two support tickets informing us of events being erroneously cancelled in Apple calendars. These were escalated to engineering on Tuesday February 4th as they had not been picked up on the Monday.
As the support tickets reported common symptoms they were investigated together, starting at around 10:15 UTC on Tuesday.
Around 11:00 UTC the engineer investigating began to suspect the most likely cause was a systemic issue rather than a user error or a problem with an integration.
Five minutes later a third similar support ticket arrived, and the call was made to open an incident so we could communicate the issue, and our progress against it, widely.
As part of this we created an internal video conference and the investigating engineer was joined by another engineer to assist with the investigation and resolution.
During such incidents the priorities are:
Our initial investigation revealed no particular pattern to the accounts affected or the timing of the events affected, other than the events being on or after Saturday and in Apple calendars.
We reviewed our synchronization engine for recent changes, as the first thing to eliminate is a recent regression in behavior. This bore no fruit, so we delved into verifying whether events flagged as deleted still existed on Apple's servers.
Finding they did exist, we investigated how we could be treating them as having been deleted. This led to an inspection of the point at which we process responses from Apple's calendar servers. Our hypothesis had shifted towards Apple's responses having changed in some way.
We found that Apple were responding strangely to a particular type of request that fetches multiple events at once. This is commonly done as part of a synchronization process: a CalDAV server tells you the URLs of the events that have changed, and you then make a request to get the full details of those events. Specifically, this is a CALDAV:calendar-multiget REPORT.
We make CalDAV requests to get the details of multiple events at a time for the sake of efficiency. For example we'll ask Apple's calendar servers:
Send the details for event1, event2, and event3 in Julia's calendar
In response, we expect something like:
In Julia's calendar: event1 is on Monday, event2 does not exist, event3 is on Thursday.
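In CalDAV terms, that reply is a WebDAV multistatus body in which each requested event gets its own response element and status. As a rough illustration (the XML below is hand-written to match the example, not a captured Apple response, and the paths are invented), such a body can be parsed with nothing more than Python's standard library:

```python
# Minimal sketch: parse a spec-compliant calendar-multiget response and
# map each event URL to the status the server reported for it. A missing
# event comes back with an explicit 404, rather than being omitted.
import xml.etree.ElementTree as ET

NS = {"d": "DAV:"}

COMPLIANT_RESPONSE = """<?xml version="1.0" encoding="utf-8"?>
<d:multistatus xmlns:d="DAV:">
  <d:response>
    <d:href>/calendars/julia/event1.ics</d:href>
    <d:propstat><d:status>HTTP/1.1 200 OK</d:status></d:propstat>
  </d:response>
  <d:response>
    <d:href>/calendars/julia/event2.ics</d:href>
    <d:status>HTTP/1.1 404 Not Found</d:status>
  </d:response>
  <d:response>
    <d:href>/calendars/julia/event3.ics</d:href>
    <d:propstat><d:status>HTTP/1.1 200 OK</d:status></d:propstat>
  </d:response>
</d:multistatus>"""

def statuses_by_href(body):
    """Map each event URL in a multistatus body to its reported status."""
    root = ET.fromstring(body)
    result = {}
    for response in root.findall("d:response", NS):
        href = response.find("d:href", NS).text
        status = response.find(".//d:status", NS).text
        result[href] = status
    return result

print(statuses_by_href(COMPLIANT_RESPONSE))
```

The important property here is that every URL we asked about appears in the answer, so "the server said 404" and "the server said nothing" are distinguishable.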
The change in behavior we observed is that, rather than telling us the details for all events, Apple's calendar server stopped reporting events once it encountered one that no longer exists.
So in our example, when asking:
Send the details for event1, event2, and event3 in Julia's calendar
We are now receiving:
In Julia's calendar: event1 is on Monday.
So even though event3 exists, we are no longer being told about it. This change in behavior violated an assumption of ours about how CalDAV servers behave, in a way we were not set up to detect, and meant we incorrectly believed event3 no longer existed.
When we understood the source of the problem we were quickly able to both stop things getting worse and restore full functionality.
The fix put in place was to verify that the response received from the server contained an answer for each event we asked about, and if not, to fall back to checking each event individually. This highlighted any ambiguous responses from Apple, and took a brute-force approach in that situation to provide higher certainty about the state of someone's calendar. This was in place by 14:00 UTC.
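The safeguard described above can be sketched as follows. This is a simplified illustration, not our actual synchronization code, and the helper names (`fetch_multiget`, `fetch_single`) are hypothetical:

```python
# Sketch of the fallback: any event the multiget response does not
# answer for is treated as ambiguous, not deleted, and is re-checked
# with an individual request.
def synchronize_events(requested_hrefs, fetch_multiget, fetch_single):
    """Return a mapping of href -> event details (None means deleted)."""
    answered = fetch_multiget(requested_hrefs)  # href -> details or None
    results = dict(answered)
    # Hrefs the server stayed silent about are ambiguous, not deletions.
    unanswered = [h for h in requested_hrefs if h not in answered]
    for href in unanswered:
        # Brute force: one request per ambiguous event, for certainty.
        results[href] = fetch_single(href)
    return results
```

With a compliant server the fallback never triggers, so the efficiency of batched requests is kept; it only costs extra requests when a response is ambiguous.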
To repair the damage done, we checked all Apple calendars for signs of this issue. We did this by finding all Apple calendars where it appeared a user had deleted an event in the past 7 days. Not all these calendars would have been affected, but it narrowed things down sufficiently to make for a quicker resolution for those who were.
We chose 7 days as the reports we had did not include any deletions happening before Saturday February 1st, and due to the nature of the outcome we felt that we would have received reports earlier had the change in behavior happened significantly before then.
With the candidates identified, we performed a full synchronization of their calendars and checked for events being modified in that window of time to gauge the potential impact. This process of identification and then synchronization took around an hour from 14:00-15:00 UTC.
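The identification step amounts to a filter over calendars. A minimal sketch, assuming a hypothetical record shape of `(calendar_id, provider, last_deletion_at)` rather than our actual data model:

```python
# Sketch: select Apple calendars where an event appeared to be deleted
# within the last 7 days, as candidates for full resynchronization.
from datetime import datetime, timedelta, timezone

def candidates_for_resync(calendars, now=None):
    """calendars: iterable of (calendar_id, provider, last_deletion_at)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=7)
    return [
        cal_id
        for cal_id, provider, last_deletion_at in calendars
        if provider == "apple"
        and last_deletion_at is not None
        and last_deletion_at >= cutoff
    ]
```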
We continued to monitor the situation, were satisfied the problem was resolved by 16:00 UTC, and closed the incident shortly after.
The primary questions we had in our retrospective were:
The root cause of this incident was a change in Apple's responses that does not follow the CalDAV specification, combined with our integration assuming the specification would be followed.
Users deleting events is not, in and of itself, an unexpected action.
Had Apple's responses reported an error we would have noticed sooner, but this was a somewhat silent failure. The volume of events affected meant the apparent deletions fell within a margin of error, so we reasonably assumed the events had actually been deleted.
We do not believe our integration was unreasonable, but it was not as hardened as it could have been, in part because Apple have been one of the most reliable providers we integrate with over the years.
We would have wanted to investigate this problem sooner, but with no alerts being triggered and only a low volume of tickets, that did not happen until the following day. In hindsight this was the wrong call, but we feel it was not an unreasonable one.
We fell beneath our own high standards in how long it took us to identify and resolve this incident.
There was no single thing that could be pointed to as the cause of the delay; nonetheless, there are several actions we will be taking to reduce the chances of something similar happening in the future:
There are also actions we will be taking directly related to this incident:
If you have any further questions, please contact us at email@example.com