Root Cause Analysis:
August 21, 2017 - Service Degradation
Problem Description & Impact:
On August 21, 2017, between 9:40am and 10:30am PDT, Okta experienced a service degradation in US Cell 5 in which a small number of interactive user-sessions resulted in incomplete page loads post authentication. Full traffic analysis during the incident window indicates approximately 1% of interactive user authentications were encountering a HTTP 404 response when loading several page elements following authentication rather than users’ Okta dashboards loading successfully.
Mitigation steps and future preventative measures:
Immediately following root cause determination, the recently updated infrastructure was patched to include the newest versions of static content. This action was completed and normal functionality resumed at 10:30 PDT for all user requests.
In order to prevent future recurrence of this issue additional automated post deployment validation checks are being implemented to ensure the correct static contact resides on the source nodes. In addition, the sensitivity of monitoring for this condition has been increased to trigger on a smaller number of 404s generated when retrieving static content