Root Cause Analysis
Intermittent session failures – 5/23/2017 - 5/30/2017
Problem Description & Impact:
On 5/23/2017 between approximately 8:20pm and 10:50pm PDT, customers in US Cell 1 may have experienced intermittent failures when making API calls against the sessions endpoint, interactive login failures, and unexpected user session resets. Additional occurrences of this behavior were subsequently observed on 5/28/2017 at approximately 7:12pm PDT and on 5/29/2017 between 10:42pm and 10:55pm PDT. There were no further recurrences, and the issue was fully resolved on 5/30/2017 at 5:40pm PDT.
Okta determined that the root cause of the sporadic session errors was a configuration error deployed within our message queueing infrastructure. Specifically, a server was incorrectly referenced in the pool of servers used for brokering user session requests. Because the message queueing infrastructure is distributed and redundant, the errors were limited to approximately 15% of session requests within US Cell 1 during the events.
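As an illustration of why a single incorrect pool entry produces a partial rather than total failure rate, the following minimal sketch distributes requests round-robin across a pool that contains one invalid reference. The pool size, server names, and round-robin policy are assumptions chosen for illustration only; the actual pool composition and load-distribution method are not described in this report.

```python
from itertools import cycle, islice


def simulate_round_robin(pool, valid_servers, num_requests):
    """Return the fraction of requests routed to a pool entry that is
    not in valid_servers, assuming strict round-robin distribution."""
    failures = sum(
        1
        for server in islice(cycle(pool), num_requests)
        if server not in valid_servers
    )
    return failures / num_requests


# Hypothetical pool: six correct brokers plus one incorrectly referenced entry.
valid = {f"mq-{i}" for i in range(1, 7)}
pool = sorted(valid) + ["mq-stale"]

rate = simulate_round_robin(pool, valid, 7000)
print(f"{rate:.1%}")
```

Under these assumptions, one bad entry out of seven yields a failure rate of roughly one in seven requests, which is in the same range as the approximately 15% observed.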
Mitigation steps and future preventative measures:
On 5/23/2017 at 10:50pm PDT, initial investigation found that a single web application server was encountering an elevated number of invalid session errors. The server was removed from service to mitigate any impact, and a root-cause investigation was initiated. Following removal of the suspected web application server, monitoring showed an immediate reduction in session errors. During the initial triage of the event, customer impact was thought to be negligible and Okta did not initiate a Trust Event alert.
While isolation and root-cause investigation continued, Okta experienced two similar events, on 5/28/2017 (resolved at 7:12pm PDT) and on 5/29/2017 (resolved at 10:55pm PDT). In both cases, the impacted servers were identified and removed from service to mitigate any user impact. Additional logging was deployed during the last occurrence to aid in identifying the root cause.
On 5/30/2017 at 5:40pm PDT, Okta deployed the corrected message queue node configuration to resolve the issue and added monitoring to identify and resolve similar spikes in session errors in the future. Okta has also implemented additional configuration/deployment monitoring to prevent similar configuration errors from occurring in the future.
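One way a configuration check of this kind could work is to compare each pool's members against an authoritative server inventory before deployment. The sketch below is a hypothetical example, not a description of Okta's tooling; the function name, pool names, and inventory format are all assumptions.

```python
def find_unknown_pool_members(pool_config, inventory):
    """Return a dict mapping each pool name to the members that do not
    appear in the inventory. Pools with no unknown members are omitted."""
    unknown = {}
    for pool_name, members in pool_config.items():
        missing = [m for m in members if m not in inventory]
        if missing:
            unknown[pool_name] = missing
    return unknown


# Hypothetical inventory and pool configuration.
inventory = {"mq-1", "mq-2", "mq-3"}
pool_config = {
    "session-brokers": ["mq-1", "mq-2", "mq-stale"],
    "event-brokers": ["mq-3"],
}

problems = find_unknown_pool_members(pool_config, inventory)
if problems:
    # A deployment pipeline could block the rollout at this point.
    print("Unknown pool members:", problems)
```

A check like this would flag an incorrectly referenced server before the configuration reaches production, rather than relying on session error monitoring to detect it afterward.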