Haproxy Outage 2021-03-17

Incident description

In order to fix the automatic renewal of the OV certificates, we renewed one of our existing certificates: *.dante.net

When OV renewal started working again, the new certificate was put in place and triggered a Haproxy service restart.

A missing DNS record for dndrdc01.win.dante.org.uk caused the Haproxy service to fail to restart. The server dndrdc01 was decommissioned on 22 May 2020, but its DNS record was only cleaned up recently. We still don't know exactly when the record was removed (we don't have access to the Windows DNS server), but we know that uat-haproxy also went down this morning for the same reason.
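The failure mode described above (a backend hostname that no longer resolves) can be checked before any restart. The following is a minimal sketch, not the procedure we actually run: it assumes backends are declared on `server <name> <host>[:port]` lines, ignores IPv6 literals, and the function names `server_hosts`, `resolves` and `unresolvable_hosts` are hypothetical.

```python
import socket
from typing import Callable, Iterable, List


def server_hosts(config_lines: Iterable[str]) -> List[str]:
    """Extract the host part of 'server <name> <host>[:port] ...' lines.

    Simplified parser (assumption): comments start with '#', and the
    address is the third whitespace-separated token. IPv6 literals are
    not handled.
    """
    hosts = []
    for raw in config_lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "server":
            hosts.append(parts[2].rsplit(":", 1)[0])  # strip optional :port
    return hosts


def resolves(host: str) -> bool:
    """Return True if the system resolver can resolve the host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def unresolvable_hosts(
    config_lines: Iterable[str],
    resolver: Callable[[str], bool] = resolves,
) -> List[str]:
    """List the server hosts in the configuration that do not resolve."""
    return [h for h in server_hosts(config_lines) if not resolver(h)]
```

Run against the lines of the Haproxy configuration file, a non-empty result from `unresolvable_hosts` would be a reason to stop before restarting the service. The `resolver` parameter exists only so the check can be exercised without real DNS lookups.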

A change request was not raised because this is an unattended job that runs every night; I only ran in advance the job that would have been triggered that night anyway. The first certificate due to expire would have caused the same issue. We could not have anticipated that a DNS record would be deleted the same day.

Incident severity: CRITICAL

Data loss: NO

Timeline

Time (CET)	Event
17 Mar, 10:47	/var/log/haproxy_1.log shows the error about Haproxy being down
17 Mar, 10:55	Disabled Puppet on prod-haproxy02 and failed the connections over to it

Total downtime: 7 minutes

Proposed Solution

Using a test certificate would not protect us.

Conversely, had the renewal run unattended, we would have experienced 4 hours of downtime: the renewal happens overnight, and we would only have discovered the problem the following morning. By forcing a manual renewal we had only a few minutes of downtime.

Possible solution: do not trigger an unattended service restart.

Sensu will start alerting 30 days before expiration, and we can then run the procedure manually (replacing an expiring certificate does not need an RFC).

The downside is that we will do it during working hours, while the cron job runs overnight.
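If a restart is ever automated again, one complementary safeguard is to validate the configuration first and skip the reload when validation fails. The sketch below is an illustration under assumptions rather than our current tooling: it relies on `haproxy -c -f <file>` (configuration check mode, which exits non-zero on errors), assumes the systemd unit is named `haproxy`, and the names `run` and `safe_reload` are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command and returns its exit status.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode


def safe_reload(config_path: str, runner: Runner = run) -> bool:
    """Reload Haproxy only if the configuration passes validation.

    Returns True if the reload was issued and succeeded, False if the
    check failed (the running service is left untouched) or the reload
    itself failed.
    """
    # 'haproxy -c' only checks the configuration; it does not start the proxy.
    if runner(["haproxy", "-c", "-f", config_path]) != 0:
        return False
    # Assumption: the service is managed by systemd under the unit 'haproxy'.
    return runner(["systemctl", "reload", "haproxy"]) == 0
```

With a gate like this, a failed check leaves the running process in place with the old certificate instead of taking the service down. Whether the check catches an unresolvable server address depends on how the configuration resolves addresses at startup, so it should be tested on uat-haproxy before being relied upon.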
