Haproxy Outage 2021-03-17

Incident description

A missing DNS record for dndrdc01.win.dante.org.uk caused the Haproxy service to fail to restart (the change was carried out by IT on 22 May 2020, but the record was cleaned up recently).

In order to fix the automatic renewal of the OV certificates, we renewed one of our existing certificates: *.dante.net

When OV renewal started working again, the new certificate was put in place and triggered a Haproxy service restart.

In the event of an unattended renewal, we would have experienced 4 hours of downtime (the renewal happens overnight, and we'd have discovered this problem the morning after. Forcing a manual renewal we have only had few minutes of downtime.

Incident severity: CRITICAL

Data loss: NO

Timeline

Time (CET)
17 Mar, 10:47	/var/log/haproxy_1.log shows the error about happroxy being down
17 Mar, 10:55	disabled puppet on prod-haproxy02 and failed over the connection over it

Total downtime: 7 minutes

Proposed Solution

not available.

Using a test certificate would not protect us. The way around, the real certificate would have been changed overnight, causing at least 4 hours of downtime.

Page tree

Haproxy Outage 2021-03-17

Incident description

Timeline

Proposed Solution