Root Cause Analysis
July 11, 2017
Slow System Performance in US Cell 2
Problem Description & Impact:
On Tuesday, July 11, 2017 at 6:05 am PDT, Okta experienced a performance issue in US Cell 2 in which, at its peak, 40% of requests experienced slow response times. Customers may also have experienced intermittent AD real-time sync errors during this period. Office365 requests remained unaffected for the duration of the incident. The issue was resolved and service returned to normal at approximately 6:30 am PDT.
Okta has determined that the root cause of the performance issue was a database driver bug that prevented database connections from being released properly after spikes in database utilization. Because these connections were not released, web application server response times increased.
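To illustrate the general failure mode described above, the following is a minimal sketch of how unreleased connections can exhaust a fixed-size pool and leave later requests waiting. It is not Okta's implementation or the actual driver code: the `TinyPool` class, the `serve_requests` helper, the pool size, and the timeout are all hypothetical, and the sketch leaks on every request rather than only after a utilization spike.

```python
import queue


class TinyPool:
    """Hypothetical fixed-size connection pool, for illustration only."""

    def __init__(self, size):
        self._free = queue.Queue()
        for conn_id in range(size):
            self._free.put(conn_id)

    def acquire(self, timeout):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        return self._free.get(timeout=timeout)

    def release(self, conn_id):
        self._free.put(conn_id)


def serve_requests(pool, count, leaky, timeout=0.01):
    """Serve `count` requests in sequence and return how many timed out
    while waiting for a connection."""
    timed_out = 0
    for _ in range(count):
        try:
            conn = pool.acquire(timeout=timeout)
        except queue.Empty:
            # No free connection: the request stalls for the full timeout.
            timed_out += 1
            continue
        # ... the database query would run here ...
        if not leaky:
            pool.release(conn)  # A leaky path skips this step.
    return timed_out


healthy = serve_requests(TinyPool(5), 20, leaky=False)
leaking = serve_requests(TinyPool(5), 20, leaky=True)
print(f"timeouts with release: {healthy}, without release: {leaking}")
```

In this sketch, once the five pooled connections have been handed out and never returned, every subsequent request waits for the full timeout before failing, which is the kind of behavior that surfaces as increased response times at the web application tier.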
Mitigation steps and future preventative measures:
At 6:05 am PDT, Okta identified an increase in latency and error rates within US Cell 2. Following initial assessment, Okta began mitigation by routing traffic away from the affected server tier and clearing the error condition before returning the affected servers to service.
The following actions have been completed or are scheduled for completion to prevent this issue from occurring again in the future.
- Okta has deployed a series of database configuration changes to prevent the database connection issue from recurring until the database driver bug-fix is deployed.
- Okta is currently testing the database driver bug-fix and will deploy the fix as soon as it is certified by our engineering team.
- Okta has identified and will deploy additional server management automation to improve our ability to quickly mitigate any similar issues in the future.