Production Dashboard Outage 2019-07-27

Incident description

Incident severity: CRITICAL

Data loss: NO

Time (CET)
18 Jun, 23:16	Issue Reported by OC
19 Jun, 07:35	Picked up by Michael H the following morning
19 Jun, 08:30	Fixed by turning off SSL temporarily to restore the service. Initial investigation revealed certificate has expired but later turned out that wasn't the case.
19 Jun, 10:30	Further investigations were carried out to avoid such failures in future
19 Jun, 16:08	The actual cause identified for the failure - due to IT patching certificates were automatically changed.
20 Jun, 09:30	Proposal was discussed between IT and SWD to avoid such failures in future.
20 Jun, 11:30	Part one of Nagios check in the proposed solution implemented
20 Jun, 16.30	New certs provided by IT installed on crowd servers. SSL switched back on (crowd ↔ AD).

Total downtime: 09:14 hours.