Incident description

Crowd uses Active Directory (AD) as the back end to authenticate users. Crowd connects to AD over an SSL channel (LDAPS, port 636), and the connection is secured by certificates held in the Java keystore (cacerts) on the Crowd server. Authentication fails if the certificate has expired or is incorrect. In this incident, the certificates on the domain controllers changed, which broke communication between Crowd and the domain controllers and left Dashboard unable to authenticate users.

The certificates on the domain controllers changed because the IT team patched and upgraded the subordinate PKI server, which is the certificate-issuing authority for all Microsoft Windows machines. This triggered an automatic change of certificates even though the previous certificates had not yet expired. This is not expected behaviour after a patching window, so it should not happen every time servers are patched.

Incident severity: CRITICAL

Data loss: NO

Timeline


Time (CET)

18 Jun, 23:16

Issue reported by OC.

19 Jun, 07:35

Picked up by Michael H the following morning.

19 Jun, 08:30

Service restored by temporarily turning off SSL while further investigation was carried out. The initial investigation suggested that the certificate had expired, but this later turned out not to be the case.

19 Jun, 10:30

Further investigation carried out to avoid such failures in future.

19 Jun, 16:08

Actual cause of the failure identified: the certificates had been changed automatically as a result of IT patching.

20 Jun, 09:30

Proposal discussed between IT and SWD to avoid such failures in future.

Total downtime: 9 hours 14 minutes (18 Jun, 23:16 to 19 Jun, 08:30).

Proposed Solution

1) Any PKI patches on the domain controllers should go through change control and be communicated to DevOps by IT, so that preventive measures can be taken in a timely manner.

STATUS: IT agreed

2) In this case the certificate had not expired, but an expired certificate would cause a similar failure. As a proactive measure, automated Nagios checks can be introduced:

  •  "check_keystore": checks the expiry date of the certificate and issues a warning (email alert) if the certificate expires within 30 days.
    STATUS - This check has been in place since 20 Jun 2018, 11:30.

  •  "check_ssl_connection": checks that an SSL connection can be established from Crowd to each of the AD servers. An alert email is sent when the connection fails. At that point the connection is already broken, but we will know sooner and can react more quickly. (There is a Java class, "SSLPoke.class", that can be used for this.)
    STATUS - To be discussed with Konstantin to plan implementation.
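
The two checks above could be sketched roughly as follows. This is only an illustration and not the implemented plugin: it is written in Python rather than using SSLPoke, the function names expiry_status, check_ssl_connection and main are hypothetical, and it assumes the issuing CA certificate has been exported to a PEM file (Python cannot read the Java cacerts keystore directly). The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone

# Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def expiry_status(not_after, now, warn_days=30):
    """Classify a certificate expiry date against the warning window."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return CRITICAL, "certificate expired on {:%Y-%m-%d}".format(not_after)
    if remaining <= timedelta(days=warn_days):
        return WARNING, "certificate expires in {} days".format(remaining.days)
    return OK, "certificate valid for {} more days".format(remaining.days)


def check_ssl_connection(host, port=636, cafile=None, timeout=10.0):
    """Attempt a verified TLS handshake, then check the peer certificate's expiry."""
    # cafile is an assumed PEM export of the issuing CA; None uses the system trust store.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except OSError as exc:  # covers refused connections, timeouts and ssl.SSLError
        return CRITICAL, "cannot establish SSL connection to {}:{} ({})".format(host, port, exc)
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
    return expiry_status(expiry, datetime.now(timezone.utc))


def main(argv):
    """Hypothetical command line: <script> <ad-host> [<ca-bundle.pem>]."""
    if len(argv) < 2:
        print("UNKNOWN - usage: {} <ad-host> [<ca-bundle.pem>]".format(argv[0]))
        return UNKNOWN
    cafile = argv[2] if len(argv) > 2 else None
    code, message = check_ssl_connection(argv[1], cafile=cafile)
    print("{} - {}".format(LABELS[code], message))
    return code
```

A Nagios command definition would run the script once per AD server and finish with sys.exit(main(sys.argv)), so that the exit code drives the alert. Because the handshake here validates against a PEM file rather than the JVM trust store, it only approximates what Crowd itself sees. A check run with the same JVM and cacerts as Crowd, such as SSLPoke, stays closer to the real failure mode.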

3) Crowd authentication has proved unreliable on various occasions, whereas federated login has proved more reliable and works successfully with various operational services. As a long-term solution, authentication for the Dashboard application should be changed from Crowd to federated login.

STATUS: Planning will depend on the decision on Dashboard V 3.0.
