On October 23rd, between 10:11 and 10:18 UTC, we experienced a partial outage in our Canadian data center.
During this time, customers may have received server errors or experienced increased response latency when using the scheduler or our API. Background processing was not impacted during this incident.
At 10:09 UTC, we released a change that increased how much work each individual web server process could handle at a time. This inadvertently caused those processes to hit their resource limits after a short period, at which point they were restarted. This cycle kept repeating until we reverted the change.
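To illustrate the failure mode, here is a minimal sketch of how raising per-process concurrency can push a process over an enforced resource limit. The memory limit, baseline, and per-request figures below are invented for the example and are not our real production values.

```python
# Hypothetical illustration of the failure mode. The memory limit, baseline,
# and per-request figures are invented for this sketch, not real values.

MEMORY_LIMIT_MB = 1024   # resource limit enforced on each web server process
BASELINE_MB = 300        # memory a process uses with no requests in flight
PER_REQUEST_MB = 40      # additional memory per concurrently handled request


def peak_memory_mb(concurrent_requests: int) -> int:
    """Rough peak memory for a process handling this many requests at once."""
    return BASELINE_MB + concurrent_requests * PER_REQUEST_MB


for concurrency in (8, 16, 32):
    peak = peak_memory_mb(concurrency)
    status = ("over the limit: process restarted"
              if peak > MEMORY_LIMIT_MB else "within the limit")
    print(f"concurrency={concurrency:>2}  peak={peak:>4} MB  ({status})")
```

In this model, doubling per-process concurrency from 16 to 32 is enough to exceed the limit, so each process is killed and restarted shortly after it warms up, which matches the restart loop described above.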
The first errors occurred at 10:11 UTC, and our engineers were alerted at 10:14. At 10:17 we reverted the change, and by 10:19 UTC the server errors had stopped and response latency had returned to normal.
We always ask ourselves the same questions:

Could it have been resolved sooner?
Could it have been identified sooner?
Could it have been prevented?

Could it have been prevented? Yes.
Rather than increasing how much work each individual web server process can handle, we could have increased the number of processes running. This is known as scaling horizontally, rather than scaling vertically.
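For comparison, here is a sketch of the two options using the same invented memory model as above; the process counts and concurrency figures are also invented and do not reflect our real configuration. Scaling horizontally adds processes while keeping each one's workload, and therefore its memory footprint, where it was.

```python
# Hypothetical comparison of the two scaling options, using the same invented
# memory model as the sketch above. Not our real configuration.

MEMORY_LIMIT_MB = 1024
BASELINE_MB = 300
PER_REQUEST_MB = 40


def per_process_memory_mb(concurrency: int) -> int:
    """Rough peak memory for one process at the given concurrency."""
    return BASELINE_MB + concurrency * PER_REQUEST_MB


def fleet_capacity(processes: int, concurrency: int) -> int:
    """Total requests the whole fleet can handle at once."""
    return processes * concurrency


options = {
    # Vertical: same number of processes, each doing more work.
    "vertical":   {"processes": 4, "concurrency": 32},
    # Horizontal: more processes, each doing the same amount of work as before.
    "horizontal": {"processes": 8, "concurrency": 16},
}

for name, cfg in options.items():
    mem = per_process_memory_mb(cfg["concurrency"])
    cap = fleet_capacity(**cfg)
    safe = "within limit" if mem <= MEMORY_LIMIT_MB else "exceeds limit"
    print(f"{name:<10} capacity={cap:>3}  per-process memory={mem:>4} MB ({safe})")
```

In this model both options reach the same total capacity, but only the horizontal one keeps each process under its resource limit.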
We felt that we could have been quicker to communicate about the issue on our status page. To help with this in the future, we will: