Root Cause Analysis:
April 24th, 2018 – Service Disruption
Problem Description & Impact
On Tuesday, April 24th, 2018, beginning at approximately 7:53am PDT, Okta experienced a service disruption in US Cell 2. End users in US Cell 2 attempting to reach Okta received HTTP 500 errors, and customer integrations attempting to reach Okta API endpoints in US Cell 2 encountered elevated response times and errors.
The most significant impact occurred during the 14 minutes between 7:53am and 8:07am. As mitigation actions took effect, the success rate began improving at 8:07am, and by 8:25am approximately 80% of requests were being serviced successfully. Service returned to normal in US Cell 2 at 8:48am PDT.
The service disruption was traced to a combination of a procedural error and an automation bug, both of which occurred while applying security patches to infrastructure within US Cell 2. The procedural error directed the patching effort to the wrong cache cluster: the manual review and approval procedure failed to prevent a high-usage cluster from being selected as the target. The patching was intended for an alternate, lower-usage cluster, where applying the patch would not have resulted in customer impact. The automation bug, in the script used to apply the patch, caused the change to be applied across the servers in the cluster in a problematic sequence. Consequently, the targeted cluster was unable to fail over properly across nodes.
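To illustrate the kind of sequencing that keeps failover possible during patching, here is a minimal sketch of a rolling patch that changes one node at a time and halts as soon as a patched node fails its health check. It is an illustration only, not Okta's patching script: the function name `rolling_patch` and the `apply_patch` and `is_healthy` callables are hypothetical placeholders for whatever tooling actually performs and verifies the change.

```python
from typing import Callable, Iterable, List


def rolling_patch(
    nodes: Iterable[str],
    apply_patch: Callable[[str], None],   # hypothetical: patches a single node
    is_healthy: Callable[[str], bool],    # hypothetical: post-patch health probe
) -> List[str]:
    """Patch nodes strictly one at a time.

    Returns the nodes that were patched and verified healthy.
    Raises RuntimeError on the first node that is unhealthy after patching,
    leaving every remaining node untouched so the cluster can still fail over.
    """
    patched: List[str] = []
    for node in nodes:
        apply_patch(node)
        if not is_healthy(node):
            raise RuntimeError(
                f"{node} unhealthy after patch; halting with "
                f"{len(patched)} node(s) completed"
            )
        patched.append(node)
    return patched
```

Because the loop never starts on the next node until the previous one is verified, at most one node is out of service at any moment, and a failed patch stops the rollout instead of spreading across the cluster.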
Mitigation Steps and Future Preventative Measures
At 7:53am PDT, Okta's proactive system monitoring alerted the Okta technical operations team to the error condition. To stabilize the situation, application tier nodes were recycled to flush request queues, resolve resource contention, and redistribute load among available cache nodes. By 8:48am PDT, US Cell 2 returned to normal performance levels.
To prevent recurrence, Okta has undertaken the following actions:
Okta immediately implemented additional safeguards to review and confirm operational patching procedures. Additional review and validation steps (such as go / no-go criteria-based checklists) have been added to the workflow for approving and scheduling any forthcoming operational patching change requests.
Okta has also prioritized a deep technical assessment to strengthen the patching automation for cache clusters. The technical assessment, currently underway, is scheduled to continue through May, and will include additional testing and code improvements to the patching script for cache clusters. Additionally, a thorough architecture review and extensive performance testing will be conducted to ensure that a cell can continue operating during cache infrastructure failures even when cache cluster failover is disrupted.
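As an illustration of the criteria-based go / no-go checks mentioned among the actions above, the following sketch evaluates a patching change request against a few explicit criteria and returns the reasons it should be blocked. Everything in it is assumed for the example: the `ChangeRequest` fields, the cluster names, and the thresholds are hypothetical and do not describe Okta's actual workflow or tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeRequest:
    # All fields are hypothetical, chosen only to illustrate the idea.
    target_cluster: str             # cluster the automation will patch
    approved_cluster: str           # cluster named in the approved plan
    target_requests_per_sec: float  # current load on the target cluster
    reviewer_count: int             # number of independent approvals


def go_no_go(
    cr: ChangeRequest,
    max_requests_per_sec: float = 100.0,  # assumed load ceiling for patching
    min_reviewers: int = 2,               # assumed approval requirement
) -> List[str]:
    """Return the list of blocking reasons; an empty list means 'go'."""
    reasons: List[str] = []
    if cr.target_cluster != cr.approved_cluster:
        reasons.append(
            f"target {cr.target_cluster!r} differs from approved "
            f"{cr.approved_cluster!r}"
        )
    if cr.target_requests_per_sec > max_requests_per_sec:
        reasons.append(
            f"target load {cr.target_requests_per_sec} req/s exceeds "
            f"limit {max_requests_per_sec}"
        )
    if cr.reviewer_count < min_reviewers:
        reasons.append(
            f"{cr.reviewer_count} reviewer(s), need at least {min_reviewers}"
        )
    return reasons


# Usage: a request aimed at a busy, unapproved cluster is blocked.
request = ChangeRequest("cache-a", "cache-b", 5000.0, 1)
for reason in go_no_go(request):
    print("NO-GO:", reason)
```

Returning every failed criterion, instead of stopping at the first, gives reviewers the full picture in one pass, and checking the load on the target cluster catches a mis-targeted change even if the cluster name was approved in error.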