Root Cause Analysis:
March 2, 2017 – Okta Performance Issue
Problem Description & Impact:
On Thursday, March 2, 2017, at approximately 1:40pm PST, Okta experienced an issue in US Cell 2 in which up to 25% of HTTP requests were delayed or timed out. Most users who retried their request were successful. The sporadic latency improved over time as Okta worked to resolve the issue. The issue was completely resolved by 2:21pm PST, approximately 41 minutes after it began.
At approximately 1:38pm PST, Okta was preparing for the upcoming production release, 2017.05, scheduled for Thursday evening. A pre-release script was executed that contained outdated IP address configurations for a subset of the Proxy Servers in rotation. The script did not fully account for the recent per-cell Proxy Server (“Router”) enhancements deployed by Okta. As a result, two edge Proxy Servers were removed from the rotation because of the IP address changes while they were still accepting traffic.
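As an illustration of the kind of pre-flight check that could flag outdated addresses before a script changes the rotation, the following is a minimal Python sketch. It is not Okta's actual tooling: the function name find_stale_addresses, the name-to-IP dictionaries, and the addresses themselves (taken from the reserved documentation ranges) are all hypothetical.

```python
from ipaddress import ip_address


def find_stale_addresses(script_config, live_rotation):
    """Return script entries whose IP does not match the live rotation.

    Both arguments map a proxy name to an IP address string. This shape is
    an assumption made for illustration, not a real configuration format.
    """
    stale = {}
    for name, configured in script_config.items():
        live = live_rotation.get(name)
        # An entry is stale if the proxy is unknown to the live rotation
        # or if its address has changed since the script was written.
        if live is None or ip_address(configured) != ip_address(live):
            stale[name] = (configured, live)
    return stale


# Hypothetical data: two proxies changed address after the script was written.
script_config = {
    "proxy-a": "192.0.2.10",
    "proxy-b": "192.0.2.11",
    "proxy-c": "192.0.2.12",
}
live_rotation = {
    "proxy-a": "192.0.2.10",
    "proxy-b": "198.51.100.21",
    "proxy-c": "198.51.100.22",
}

stale = find_stale_addresses(script_config, live_rotation)
if stale:
    # Stop before touching the rotation rather than acting on old addresses.
    print("Refusing to run: stale addresses for", sorted(stale))
```

In this sketch the check runs before any change is made, so a mismatch stops the script instead of altering the rotation based on addresses that are no longer current.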
Okta monitoring detected elevated error rates shortly after the pre-release script removed the Proxy Servers from the rotation. Okta began triage and mitigation steps to bypass the impacted Proxy Servers and removed references to them from DNS. Once the root cause was identified and request failures had stabilized, the Proxy Server configuration was corrected and the impacted servers were returned to service.
Okta has taken the following actions in response to this service disruption: