Root Cause Analysis
June 6th, 2018
Problem Description and Impact
On June 6th, beginning at 5:12am PDT, Okta experienced elevated CPU utilization on application servers across all cells, triggering automated alerts. Given the number of servers with increased CPU utilization, customer impact was a possibility, and Okta proactively posted an alert to the https://trust.okta.com page. The increase in processing levels, however, was sustainable by the infrastructure, and customer impact remained negligible throughout the incident, which was resolved at 5:59am PDT.
Okta identified that the increased CPU utilization on application servers was the result of an error within an operation script related to edge service protection, which had recently been deployed.
Mitigating Steps and Future Preventive Measures
Okta responded to the alerts of increased CPU utilization and took mitigating actions to reduce resource utilization across the affected cells. Once the offending script was identified and terminated, service performance returned to normal.
Following the mitigating actions to resolve the incident, Okta also made process changes that add oversight steps to its deployment methodologies to prevent such incidents from occurring in the future.