Incident description

Root Cause - Missing DNS record for dndrdc01.win.dante.org.uk 

A change was implemented: in order to fix the automatic renewal of the OV certificates, we renewed one of our existing certificates, *.dante.net.

When OV renewal started working again, the new certificate was put in place and triggered an HA Proxy service restart.

However, a missing DNS record for dndrdc01.win.dante.org.uk caused the restart of the HA Proxy service to fail.
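A pre-flight check of this kind of dependency could look like the sketch below. It is only an illustration, not the procedure actually in use: the function name `unresolved_hosts` and the idea of passing in a list of backend hostnames are assumptions, and the real list would have to be taken from the HA Proxy configuration.

```python
import socket


def unresolved_hosts(hostnames, resolver=socket.getaddrinfo):
    """Return the hostnames that fail DNS resolution, in input order.

    `resolver` is injectable so the check can be exercised without real DNS.
    """
    failed = []
    for name in hostnames:
        try:
            # The port is irrelevant for a pure name-resolution check.
            resolver(name, None)
        except socket.gaierror:
            failed.append(name)
    return failed
```

If the returned list is non-empty, the restart would be skipped and the missing records reported, so the running service keeps serving traffic with its current state.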

The server dndrdc01 was decommissioned on 22 May 2020, but its DNS record was only cleaned up recently. We still don't know exactly when this record was expunged; however, a similar issue was observed on another VM this morning. This needs to be investigated further with the help of IT.

A change request was not raised because this was considered a low-risk operation, and it was difficult to foresee the failure caused by the missing DNS record.

At the same time, we could not have imagined that a DNS record would be deleted that same day.

Incident Severity

Incident severity: CRITICAL

Data loss: NO

Timeline

...

Total downtime: 7 minutes

Proposed Solution

Using a test certificate would not have protected us.

On the contrary, had the renewal run unattended, we would have experienced 4 hours of downtime: the renewal happens overnight, and we would have discovered the problem only the morning after. By forcing a manual renewal, we had only a few minutes of downtime.

Possible solution: do not trigger an unattended service restart.

Sensu will start alerting 30 days before expiration, and we can then run the procedure manually (replacing an expiring certificate will not need an RFC).
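The 30-day window can be illustrated with a short sketch. This is not the Sensu check itself; the names `days_until_expiry` and `needs_manual_renewal` are hypothetical, and the input is assumed to be a certificate `notAfter` string in the format used by Python's `ssl` module.

```python
import ssl
from datetime import datetime, timezone

ALERT_WINDOW_DAYS = 30  # the 30-day alerting window described above


def days_until_expiry(not_after, now=None):
    """Whole days between `now` and a certificate notAfter string.

    `not_after` uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Jun  1 12:00:00 2030 GMT".
    """
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def needs_manual_renewal(not_after, now=None):
    """True once the certificate is inside the alerting window."""
    return days_until_expiry(not_after, now) <= ALERT_WINDOW_DAYS
```

Passing `now` explicitly keeps the calculation deterministic, which makes it easy to check the boundary of the window.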

In future, even low-risk operations on critical services such as HA Proxy should be carried out in a planned manner and outside business hours.

Identify the root cause of the communication gap regarding the missing DNS record, with help from IT, and take appropriate action.

The downside of running the renewal manually is that we will do it during working hours, whereas the cron job runs overnight.