Comment: new SSL certs installed on crowd.

Incident description

Crowd uses Active Directory (AD) as the back end to authenticate users. The check between Crowd and AD runs over an SSL channel (LDAPS, port 636) and is secured by certificates in the Java keystore (cacerts) on Crowd. Authentication fails if the certificate has expired or is wrong. In this case, communication between the AD domain controllers and Crowd broke because the certificates on the domain controllers had changed, which left Dashboard unable to authenticate.

The certificates on the AD domain controllers changed because the IT team patched and upgraded the subordinate PKI server, which is the certificate-issuing authority for all Microsoft Windows boxes. This triggered an automatic change of certificates even though the previous certificates had not yet expired. This is not expected behavior after a patching window, so it should not happen every time the IT team patches servers.
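Because the failure came from a changed certificate rather than an expired one, it can help to see exactly which certificate a domain controller presents on port 636. The following is a minimal sketch, not the procedure used during the incident: the function names are illustrative, and it assumes the fingerprint will be compared by hand against the relevant entry in the Crowd keystore. Validation is deliberately disabled so that the certificate can be retrieved even when it is not trusted.

```python
import hashlib
import socket
import ssl


def fingerprint_sha256(der_bytes: bytes) -> str:
    """Return an upper-case, colon-separated SHA-256 fingerprint of a DER certificate."""
    digest = hashlib.sha256(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fetch_presented_cert(host: str, port: int = 636, timeout: float = 5.0) -> bytes:
    """Fetch the certificate the server presents over TLS, in DER form.

    Validation is switched off on purpose: the goal is to inspect the
    certificate, not to trust it.
    """
    context = ssl.create_default_context()
    context.check_hostname = False       # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)
```

Calling `fingerprint_sha256(fetch_presented_cert("dc01.example.internal"))` (a hypothetical host name) returns a fingerprint string. If it differs from the one recorded when the keystore was last updated, the certificate has changed.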

Incident severity: CRITICAL

...

Time (CET)
18 Jun, 23:16

Issue reported by OC
19 Jun, 07:35

Picked up by Michael H the following morning

19 Jun, 08:30

Service restored by temporarily turning off SSL while further investigation was carried out. Initial investigation suggested the certificate had expired, but this later turned out not to be the case.

19 Jun, 10:30

Further investigation was carried out to find ways of avoiding such failures in future.

19 Jun, 16:08

Actual cause of the failure identified: the certificates were changed automatically as a result of IT patching.

20 Jun, 09:30

Proposal was discussed between IT and SWD to avoid such failures in future.

20 Jun, 11:30

Part one of the proposed solution, the Nagios check, was implemented.

20 Jun, 16:30

New certs provided by IT were installed on the Crowd servers. SSL was switched back on (Crowd ↔ AD).

Total downtime: 09:14 hours.
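The timeline mentions a Nagios check but does not say what it covers. One plausible shape, sketched here purely as an assumption, is a plugin that compares the fingerprint of the certificate presented on port 636 with a pinned value and reports through the standard Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). All function names and the pinning approach are hypothetical.

```python
import hashlib
import socket
import ssl

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plugin exit codes


def presented_fingerprint(host, port=636, timeout=5.0):
    """Return the SHA-256 hex fingerprint of the certificate the host presents."""
    context = ssl.create_default_context()
    context.check_hostname = False       # inspect only, do not validate
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return hashlib.sha256(tls.getpeercert(binary_form=True)).hexdigest()


def normalise(fingerprint):
    """Strip colons and whitespace and lower-case, so formats compare equal."""
    return fingerprint.replace(":", "").strip().lower()


def evaluate(pinned, presented):
    """Compare pinned and presented fingerprints; return (exit_code, message)."""
    if normalise(pinned) == normalise(presented):
        return OK, "OK - certificate fingerprint matches pinned value"
    return CRITICAL, "CRITICAL - certificate fingerprint changed: " + normalise(presented)


def run_check(host, pinned, fetch=presented_fingerprint):
    """Run the whole check; connection problems map to UNKNOWN."""
    try:
        presented = fetch(host)
    except OSError as exc:  # covers DNS failures, timeouts and ssl.SSLError
        return UNKNOWN, "UNKNOWN - could not fetch certificate from %s: %s" % (host, exc)
    return evaluate(pinned, presented)
```

A wrapper script would print the message and call `sys.exit()` with the returned code. A check of this kind would have flagged the certificate change itself, independently of whether the old certificate had expired.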

...

1) Any PKI patches on domain controllers should go through change control, and IT should communicate the changes to DevOps so that preventive measures can be taken in a timely manner.

...