Root Cause Analysis - March 2 2017 Performance Issue
Published: Mar 15, 2017   -   Updated: Mar 15, 2017
Root Cause Analysis:
March 2, 2017 – Okta Performance Issue
 
Problem Description & Impact: 

On Thursday, March 2, 2017, at approximately 1:40pm PST, Okta experienced an issue in US Cell 2 in which up to 25% of HTTP requests experienced latency or timed out. Most users who retried their request were successful. The sporadic latency improved over time as Okta worked on a fix, and the issue was fully resolved by 2:21pm PST.

Root Cause: 

At approximately 1:38pm PST, Okta was preparing for the upcoming production release, 2017.05, scheduled for Thursday evening. A pre-release script was executed that contained outdated IP address configurations for a subset of the Proxy Servers in rotation. The script did not fully account for the per-cell Proxy Server (“Router”) enhancements Okta had recently deployed. As a result, two edge Proxy Servers were removed from the rotation because of the IP address changes while they were still accepting traffic.
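The details of Okta's tooling are not public, but the failure mode can be illustrated with a small, purely hypothetical sketch: a script whose hard-coded IP map has drifted from the addresses the per-cell routers actually use will act on hosts that are still serving traffic. The host names, addresses, and data structures below are illustrative only.

```python
# Purely illustrative sketch of the failure mode described above; the host
# names, IP addresses, and data structures are hypothetical, not Okta's.

# IP-to-host mapping baked into the pre-release script (stale).
script_proxy_map = {
    "10.0.1.10": "proxy-edge-1",
    "10.0.1.11": "proxy-edge-2",
}

# Addresses the proxies actually hold after the per-cell "Router" changes.
live_proxy_addresses = {
    "proxy-edge-1": "10.0.2.20",
    "proxy-edge-2": "10.0.2.21",
}

def stale_entries(script_map, live_addresses):
    """Return hosts whose live address no longer matches the script's map."""
    configured = set(script_map.values())
    return [
        host
        for host, ip in live_addresses.items()
        if host in configured and ip not in script_map
    ]

# Every live proxy looks "wrong" to the script, so a rotation update driven by
# the stale map pulls hosts that are still accepting traffic.
print(stale_entries(script_proxy_map, live_proxy_addresses))
# -> ['proxy-edge-1', 'proxy-edge-2']
```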

Mitigation Steps: 

Okta monitoring detected elevated error rates shortly after the pre-release script removed the Proxy Servers from the rotation. Okta began triage and mitigation steps to bypass the impacted Proxy Servers and removed their references from DNS. Once the root cause was identified and request failures had stabilized, the Proxy Server configuration was corrected and the impacted servers were returned to service.
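Okta does not describe its DNS tooling, but the bypass step can be sketched conceptually: remove the impacted proxies' addresses from a round-robin record set so new requests are directed only at healthy hosts. The addresses and helper below are hypothetical.

```python
# Conceptual sketch of the DNS bypass step; the addresses and helper are
# hypothetical and do not reflect Okta's actual tooling or provider API.

# Round-robin A records currently published for the edge proxies.
current_a_records = ["10.0.2.20", "10.0.2.21", "10.0.2.22", "10.0.2.23"]

# Addresses of the impacted proxies identified during triage.
impacted = {"10.0.2.20", "10.0.2.21"}

def bypass(records, bad_addresses):
    """Return the record set with the impacted addresses removed."""
    remaining = [ip for ip in records if ip not in bad_addresses]
    if not remaining:
        raise RuntimeError("refusing to empty the record set entirely")
    return remaining

# The reduced set would then be pushed to the DNS provider (API call omitted),
# steering new requests away from the impacted proxies while they are repaired.
print(bypass(current_a_records, impacted))
# -> ['10.0.2.22', '10.0.2.23']
```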

Okta has taken the following actions in response to this service disruption:
  1. Okta is implementing additional QA test scripts to catch automation / deployment conditions that could lead to incorrectly mapped IP addresses.
  2. Okta is updating the impacted pre-release scripts to prevent incorrect IP address reassignments.
  3. Okta is adding safeguards within our automation tools to block the restart of new machines when the automation scripts detect an IP address conflict or mismatch (see the sketch after this list).
  4. Okta is implementing additional alerting within our automation / deployment tools when conflicts or mismatches are detected.
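The following is a hypothetical sketch of the kind of guard items 3 and 4 describe: before restarting a machine, compare the address the automation intends to apply with what is actually live, and block the restart and raise an alert on any mismatch. The names and functions are illustrative, not Okta's implementation.

```python
# Hypothetical pre-flight guard: block restarts and alert when the automation's
# planned IP assignments conflict with the live state. Illustrative only.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("preflight")

def preflight_ip_check(planned, live):
    """Return True only if every planned IP assignment matches the live state."""
    mismatches = {
        host: (ip, live.get(host))
        for host, ip in planned.items()
        if live.get(host) != ip
    }
    for host, (expected, actual) in mismatches.items():
        # Item 4: surface every conflict/mismatch to the alerting pipeline.
        log.warning("IP mismatch on %s: script expects %s, live address is %s",
                    host, expected, actual)
    return not mismatches

planned_assignments = {"proxy-edge-1": "10.0.1.10", "proxy-edge-2": "10.0.2.21"}
live_addresses = {"proxy-edge-1": "10.0.2.20", "proxy-edge-2": "10.0.2.21"}

if not preflight_ip_check(planned_assignments, live_addresses):
    # Item 3: refuse to restart machines until the configuration is reconciled.
    raise SystemExit("Aborting restart: IP address conflict detected")
```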
