Root Cause Analysis:
August 3 & 4, 2017 – Okta Performance Issue
Problem description & Impact
On Thursday, August 3, 2017, at approximately 4:55am PDT, Okta experienced a performance issue in US Cell 2 in which end users and administrators encountered sporadic slowness while attempting to access interactive user pages within their Okta tenant. During the event, an average of 10% of incoming requests to US Cell 2 experienced latency until 8:05am PDT, with peaks of up to 40% of requests experiencing latency between 4:55am and 6:30am PDT and between 7:55am and 8:05am PDT.
On Friday, August 4, 2017, Okta experienced a similar issue in US Cell 2 beginning at 5:30am PDT, resulting in a similar user experience. During this event, an average of 20% of incoming requests to US Cell 2 experienced intermittent latency until 6:00am PDT, with a peak of 25% of requests.
Okta has identified the root cause of the performance issues as a recently deployed performance optimization, which was intended to improve the speed of processing certain endpoint requests. The optimization had the unintended side effect of using more database CPU resources than expected during periods of high load in US Cell 2, which resulted in slow system responsiveness during these peak load times.
The performance optimization had been thoroughly tested and did not exhibit performance issues during development/testing or subsequent deployment to Okta’s Preview or Production Cell environments. However, the peak traffic pattern in US Cell 2 surfaced the unexpected performance profile.
Mitigating Steps & Corrective Actions
On Thursday, August 3rd, at approximately 4:57am PDT, and Friday, August 4th, at approximately 5:30am PDT, Okta’s monitoring identified and alerted on a spike in database CPU utilization and corresponding interactive user session latency. Okta began mitigation steps by routing traffic away from the affected server tier and adding web application server capacity to handle the increased load until it subsided.
The process and performance learnings gained during the August 3rd event directly improved Okta’s ability to respond more quickly during the performance issue on August 4th and, subsequently, to minimize the risk of future occurrences.
Following the events of August 3rd & 4th, Okta has taken the following action to prevent further performance issues and improve our response to similar issues in the future.