
Incident description

Crowd uses AD as the back end to authenticate users. The check between Crowd and AD happens over an SSL channel (LDAPS, port 636) and is secured by certificates in the Java keystore (cacerts) on the Crowd server. Authentication fails if the certificate has expired or is wrong. In this case, communication between the domain controllers and Crowd was broken (which meant the Dashboard could not authenticate users) because of a certificate change on the domain controllers.
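For reference, both sides of this trust can be inspected manually. A rough sketch follows; the domain controller hostname and the cacerts path are placeholders and depend on the actual installation and JVM:

    # Show the certificate a domain controller presents on the LDAPS port,
    # with its subject, issuer and expiry date (dc1.example.org is a placeholder).
    echo | openssl s_client -connect dc1.example.org:636 -showcerts 2>/dev/null \
      | openssl x509 -noout -subject -issuer -enddate

    # List the CA certificates the Crowd JVM currently trusts
    # (default keystore location and password; both may differ per installation).
    keytool -list -keystore "$JAVA_HOME/jre/lib/security/cacerts" -storepass changeit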

The certificates on the domain controllers were changed because the IT team patched and upgraded the subordinate PKI server, which is the certificate-issuing authority for all Microsoft Windows boxes. This resulted in an automated change of certificates even though the previous certificates had not yet expired. This is not expected behaviour after a patching window, so it is not something that will or should happen every time we patch servers.

Incident severity: CRITICAL

Data loss: NO

Timeline


All times are CET.

18 Jun, 23:16 - Issue reported by OC.
19 Jun, 07:35 - Picked up by Michael H the following morning.
19 Jun, 08:30 - Fixed by temporarily turning off SSL to restore the service while further investigation was carried out. The initial investigation suggested the certificate had expired, but that later turned out not to be the case.
19 Jun, 10:30 - Further investigation was carried out to avoid such failures in the future.
19 Jun, 16:08 - The actual cause of the failure was identified: the certificates were changed automatically as a result of the IT patching.
20 Jun, 09:30 - A proposal to avoid such failures in the future was discussed between IT and SWD.
16:45 - DevOps confirmed that there are no backups or extra copies on VMware storage.
17:00 - Konstantin Lepikhov called Qaiser Ahmed in Slack; no response.
17:00 - Dick Visser confirmed that he has backups on a server at Amsterdam university (daily backups taken directly by the VMs themselves).
18:26 - Qaiser Ahmed confirmed on the #devops channel that the whole AMS_UBUNTU folder on the VMware cluster is not backed up and there is no data left.
18:30 - Dick Visser recreated new VMs in the VMware cluster and started the restore process.
20:30 - Dick Visser restored the backup and brought all sites online.
20:45 - Konstantin Lepikhov made an official announcement on the #it and #general Slack channels about the incident and the resolution.
21:00
21:50 - Dick Visser finished the restore of filesender-prod.geant.org, with the exception of user files, as these are not backed up due to privacy issues and the fact that this is a demonstration service.

Total downtime: 5:39 hours.

Proposed Solution

Nagios checks that could help prevent authentication problems:

 

1) "check_keystore": checks the expiry date of the certificate in the keystore and issues a warning (email alert) if the certificate expires within 30 days.

 

2) "check_ssl_connection":  checks that an SSL connection can be established from crowd to each of the AD servers.   Alert email sent when the connection fails - which means the connection is already broken but we’ll know sooner and can react more quickly. (There’s a java class “SSLPoke.class that can be used for this)

 

I already have a draft check_keystore.sh script that does (1); I just need to plug it into Nagios. I haven't done (2) yet. Rough sketches of both checks are below.
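A minimal sketch of what check (1) could look like, assuming keytool and GNU date are available on the Crowd host; the keystore path and password are placeholders and the real check_keystore.sh may differ:

    #!/bin/bash
    # check_keystore.sh (sketch): warn if any certificate in the keystore
    # expires within WARN_DAYS. Keystore path and password are placeholders.
    KEYSTORE="${1:-$JAVA_HOME/jre/lib/security/cacerts}"
    STOREPASS="${2:-changeit}"
    WARN_DAYS=30

    NOW=$(date +%s)
    LIMIT=$((NOW + WARN_DAYS * 24 * 3600))
    STATUS=0

    # keytool -list -v prints one "Valid from: ... until: <date>" line per certificate.
    while read -r until_date; do
        # Skip entries whose date GNU date cannot parse (the format depends on the locale).
        expiry=$(date -d "$until_date" +%s 2>/dev/null) || continue
        if [ "$expiry" -lt "$NOW" ]; then
            echo "CRITICAL: certificate expired on $until_date"
            STATUS=2
        elif [ "$expiry" -lt "$LIMIT" ] && [ "$STATUS" -lt 2 ]; then
            echo "WARNING: certificate expires on $until_date"
            STATUS=1
        fi
    done < <(keytool -list -v -keystore "$KEYSTORE" -storepass "$STOREPASS" \
               | grep 'Valid from:' | sed 's/.*until: //')

    [ "$STATUS" -eq 0 ] && echo "OK: no certificates expire within $WARN_DAYS days"
    exit "$STATUS"

And a rough idea for check (2), wrapping the SSLPoke class mentioned above so the test uses the same JVM (and therefore the same cacerts) as Crowd. The AD host names and the directory holding SSLPoke.class are placeholders, and the sketch assumes SSLPoke exits non-zero when the handshake fails:

    #!/bin/bash
    # check_ssl_connection.sh (sketch): verify that an SSL connection can be
    # established from the Crowd host to each AD server on the LDAPS port.
    # Host list and SSLPoke location are placeholders.
    AD_SERVERS="dc1.example.org dc2.example.org"
    PORT=636
    SSLPOKE_DIR="/opt/crowd/tools"   # directory containing SSLPoke.class

    FAILED=""
    for host in $AD_SERVERS; do
        if ! java -cp "$SSLPOKE_DIR" SSLPoke "$host" "$PORT" >/dev/null 2>&1; then
            FAILED="$FAILED $host"
        fi
    done

    if [ -n "$FAILED" ]; then
        echo "CRITICAL: SSL connection failed to:$FAILED"
        exit 2
    fi
    echo "OK: SSL connection established to all AD servers on port $PORT"
    exit 0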

Lessons learned

  • Qaiser Ahmed confirmed that the whole AMS_UBUNTU folder on the VMware cluster is now backed up. We still need to test this, especially a backup restore (how it is performed and how long it takes).
  • The DevOps team will find ways to isolate the production environment and improve awareness of invasive operations within the Puppet and Ansible infrastructure.
  • The IT team should take action on backup procedures for the production environment located on the GEANT VMware cluster.
  • We need better monitoring and incident handling, especially in the interaction between stakeholders and departments (DevOps and IT/OC).
  • The monitoring that Dick Visser is responsible for did work, but the check interval could be improved slightly: the first Nagios alarm came in 20 minutes after the system went down.
  • The backups that Dick Visser is responsible for also worked, and the entire webserver could be completely restored from scratch. The RPO for this system (1 day) stems from the time it was first put into production a few years ago, when it contained much less user-contributed content and updates happened less frequently. This could be improved to something like 1 hour.