Root Cause Analysis:
Minor Service Disruption
October 25, 2017
On Wednesday, October 25th, 2017, Okta experienced a service disruption in US Cell 2 between 6:00am and 6:18am PDT. During this window, system response times increased, and a subset of US Cell 2 requests encountered timeouts or failures. Error rates in US Cell 2 averaged around 4% and peaked just below 10% for one minute at 6:15am. These errors intermittently caused the Okta home page to load incompletely and, in some cases, caused SSO to time out for SWA apps when an end user clicked them. There was no impact on customers accessing Okta via APIs, or on end users signing in to O365 or other SAML applications.
Problem description & Impact
At roughly 6:00am PDT, Okta’s proactive monitoring alerted us to a growing number of HTTP 502 errors reported on our load balancer tier. Upon further investigation, we discovered a hardware failure affecting a read replica database in US Cell 2, which is used to provide increased capacity within the cell. The hardware failure caused a subset of Web Application tier servers to fail to an offline state, which resulted in HTTP 502 errors being returned for in-flight end-user requests during the failover.
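As an illustration of the kind of check that can flag a growing share of HTTP 502 responses, the sketch below tracks the error rate over a sliding window of recent requests. It is a minimal example under stated assumptions, not a description of Okta's monitoring: the class name `ErrorRateMonitor`, the window size, and the threshold are all hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of HTTP 502 responses among the most recent requests."""

    def __init__(self, window_size=100, threshold=0.04):
        # window_size and threshold are illustrative values, not Okta's settings.
        self.statuses = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code):
        # The deque drops the oldest status once the window is full.
        self.statuses.append(status_code)

    def error_rate(self):
        if not self.statuses:
            return 0.0
        errors = sum(1 for status in self.statuses if status == 502)
        return errors / len(self.statuses)

    def should_alert(self):
        return self.error_rate() >= self.threshold
```

A caller would invoke `record` for each response seen at the load balancer and check `should_alert` periodically. A count-based window keeps the sketch short; a time-based window would be closer to a per-minute error rate.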
Mitigating Steps & Corrective Actions
By 6:20am PDT, the Okta operations team had responded to the problem and routed impacted traffic to alternate Web Application tier servers. After this mitigation, the HTTP 502 errors stopped. Okta subsequently returned the offline Web Application tier servers to service by 7:05am PDT, and the incident was closed.
In response to this event, Okta is implementing configuration changes to prevent similar Web Application tier server failures following this type of database failure. Okta is also looking to review and enhance logging so that end-user experience and impact are captured more clearly when SWA or plugin failures occur.
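One general way an application server can tolerate the loss of a read replica is to fall back to another database for reads instead of going offline. The sketch below shows that pattern only as an illustration; it does not describe the configuration changes Okta is making. `ReplicaUnavailable`, `read_with_fallback`, and the `replica` and `primary` callables are hypothetical names introduced for this example.

```python
class ReplicaUnavailable(Exception):
    """Raised by a hypothetical replica client when the replica cannot be reached."""


def read_with_fallback(query, replica, primary):
    """Run a read on the replica; fall back to the primary if the replica fails.

    `replica` and `primary` are hypothetical callables that take a query
    and return rows. Neither represents an actual Okta component.
    """
    try:
        return replica(query)
    except ReplicaUnavailable:
        # Degrade to the primary instead of taking the web server offline.
        return primary(query)
```

The trade-off in this pattern is that the primary absorbs the replica's read load during the failure, so it needs enough spare capacity for the fallback to be safe.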