You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 11 Current »

Incident description

Root Cause - Missing DNS record for dndrdc01.win.dante.org.uk 

A change was implemented to fix the automatic renewal of the OV certificates, which triggered the HA Proxy service to restart. However, a missing DNS record for dndrdc01.win.dante.org.uk caused the restart of the HA Proxy service to fail.

The server dndrdc01 was decommissioned on 22 May 2020, but the record was cleaned up recently. We still don't know the exact time when this record was expunged, however, a similar issue was observed on another VM this morning. This needs to be investigated further with the help of IT.

A change request was not raised because this was considered a low-risk operation and was difficult to foresee the failure caused by the missing DNS record.

Incident Severity: CRITICAL

Data loss: NO

Timeline


Time (CET)
17 Mar, 10:47/var/log/haproxy_1.log shows the error about happroxy being down
17 Mar, 10:55

disabled puppet on prod-haproxy02 and failed over the connection over it

Total downtime: 7 minutes

Proposed Solution

In future, even the low-risk operations on critical services such as HA Proxy should be carried out in the planned manner and out of business hours.

Identify the root cause of the missing DNS record with help from IT and take appropriate action.

  • No labels