Incident description

Crowd uses Active Directory (AD) as the back end to authenticate users. Crowd connects to AD over an SSL channel (LDAPS, port 636) and trusts the AD certificates through entries in the Java keystore (cacerts) on the Crowd servers. Authentication fails if a certificate has expired or is not the expected one. In this incident, a certificate change on the AD side broke communication between Crowd and AD, which left Dashboard unable to authenticate users.

The certificates on AD changed because the IT team patched and upgraded the subordinate PKI server, which is the certificate-issuing authority for all Microsoft Windows machines. This triggered an automatic certificate change even though the previous certificates had not yet expired. This is not expected behaviour after a patching window, so it should not happen every time the IT team patches servers.

Incident severity: CRITICAL

Data loss: NO

Timeline


Time (CET)

18 Jun, 23:16

Issue reported by OC

19 Jun, 07:35

Picked up by Michael H the following morning

19 Jun, 08:30

Service restored by temporarily turning off SSL. The initial investigation suggested that the certificate had expired, but this later turned out not to be the case.

19 Jun, 10:30

Further investigations were carried out to avoid such failures in future

19 Jun, 16:08

Actual cause of the failure identified: the certificates were changed automatically as a result of IT patching.

20 Jun, 09:30

Proposal was discussed between IT and SWD to avoid such failures in future.

20 Jun, 11:30

Part one of the proposed Nagios checks (check_keystore) implemented.

20 Jun, 16:30

New certs provided by IT installed on the Crowd servers. SSL switched back on (Crowd ↔ AD).

Total downtime: 09:14 hours (18 Jun, 23:16 to 19 Jun, 08:30).

Proposed Solution

1) Any PKI patches on the domain controllers should go through change control. IT should communicate the changes to DevOps so that preventive measures can be taken in a timely manner.

STATUS: IT agreed

2) In this particular case the certificate had not expired, but an expired certificate would cause a similar failure. As a proactive measure, some automated Nagios checks can be introduced:

  •  "check_keystore": checks the expiry date of the cert and issues a warning (email alert) if the cert expires within 30 days.
    STATUS - This check has been in place since 20 Jun 2018, 11:30.

  •  "check_ssl_connection": checks that an SSL connection can be established from Crowd to each of the AD servers. An alert email is sent when the connection fails. At that point the connection is already broken, but we will know sooner and can react more quickly. (There is a Java class, "SSLPoke.class", that can be used for this.)
    STATUS - Needs to be discussed with Konstantin to plan implementation.
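
The two checks above could be sketched in Java roughly as follows. This is a minimal illustration, not the implemented Nagios plugin: the class name AdCertChecks, its method names, the command-line arguments, the 5-second timeout, and the choice to report an already expired cert as CRITICAL are all assumptions made for the example. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL). The handshake part does the same kind of test as SSLPoke: it only succeeds if the JVM trust store accepts the certificate chain presented by the server.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

/** Hypothetical sketch of check_keystore and check_ssl_connection. */
final class AdCertChecks {
    // Nagios plugin exit codes.
    static final int OK = 0, WARNING = 1, CRITICAL = 2;

    /** CRITICAL if already expired (assumption), WARNING if expiring within warnDays, else OK. */
    static int classifyExpiry(Instant notAfter, Instant now, long warnDays) {
        if (!notAfter.isAfter(now)) {
            return CRITICAL;
        }
        if (!notAfter.isAfter(now.plus(Duration.ofDays(warnDays)))) {
            return WARNING;
        }
        return OK;
    }

    /** Returns the worst status over all X.509 certificates in the keystore. */
    static int checkKeystore(String path, char[] password, long warnDays) throws Exception {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = new FileInputStream(path)) {
            ks.load(in, password);
        }
        int worst = OK;
        Instant now = Instant.now();
        for (String alias : Collections.list(ks.aliases())) {
            Certificate cert = ks.getCertificate(alias);
            if (cert instanceof X509Certificate) {
                Instant notAfter = ((X509Certificate) cert).getNotAfter().toInstant();
                int status = classifyExpiry(notAfter, now, warnDays);
                if (status != OK) {
                    System.out.println(alias + " expires " + notAfter);
                }
                worst = Math.max(worst, status);
            }
        }
        return worst;
    }

    /**
     * True if a TLS handshake succeeds using the JVM's default trust store.
     * Validates the certificate chain only; no hostname verification is done.
     */
    static boolean canHandshake(String host, int port, int timeoutMillis) {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis);
            socket.startHandshake();
            return true;
        } catch (IOException e) {
            System.out.println(host + ":" + port + " handshake failed: " + e.getMessage());
            return false;
        }
    }

    // Hypothetical usage: java AdCertChecks <keystore> <password> <adHost>...
    public static void main(String[] args) throws Exception {
        int status = checkKeystore(args[0], args[1].toCharArray(), 30);
        for (int i = 2; i < args.length; i++) {
            if (!canHandshake(args[i], 636, 5000)) {
                status = CRITICAL;
            }
        }
        System.exit(status);
    }
}
```

The handshake result is only meaningful when the check runs with the same JVM trust store (cacerts) that Crowd uses. Because cacerts also holds public root CAs, a real check would probably restrict the expiry test to the aliases of the AD certificates.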

3) Crowd authentication has proved unreliable on various occasions, whereas federated login has proved more reliable and works successfully with various operational services. As a long-term solution, authentication for the Dashboard application should be changed from Crowd to federated login.

STATUS: Planning will depend on the decision on Dashboard V 3.0
