Root Cause Analysis - Service Alert - 06/06/2018
Published: Jun 10, 2018 - Updated: Jun 22, 2018

Problem Description and Impact  

On June 6th, beginning at 5:12am PDT, Okta experienced elevated CPU utilization on application servers across all cells, which triggered automated alerts. Given the number of servers affected, customer impact was possible, so Okta proactively posted a service alert to its status page. However, the infrastructure was able to sustain the increased load, and customer impact remained negligible for the entire duration of the incident, which was resolved at 5:59am PDT (47 minutes after it began).

Root Cause  

Okta determined that the increased CPU utilization on application servers was caused by an error in a recently deployed operational script related to edge service protection.

Mitigating Steps and Future Preventive Measures

Okta responded to the increased CPU utilization alerts and took steps to reduce resource utilization across the affected cells. Once the offending script was identified and terminated, service performance returned to normal.

After resolving the incident, Okta also changed its deployment process to add further oversight steps, to prevent similar incidents from occurring in the future.