Incident description

Root Cause - Missing DNS record for dndrdc01.win.dante.org.uk 

A change was implemented to fix the automatic renewal of the OV certificates, which triggered the HA Proxy service to restart. However, a missing DNS record for dndrdc01.win.dante.org.uk caused the restart of the HA Proxy service to fail.

The server dndrdc01 was decommissioned on 22 May 2020, but its DNS record was only cleaned up recently. We still don't know exactly when this record was expunged; however, a similar issue was observed on another VM this morning. This needs to be investigated further with the help of IT.

A change request was not raised because this was considered a low-risk operation, and the failure caused by the missing DNS record was difficult to foresee.

Incident Severity: CRITICAL

Data loss: NO

Timeline


Time (CET)
17 Mar, 10:47 - /var/log/haproxy_1.log shows the error about haproxy being down
17 Mar, 10:55 - disabled puppet on prod-haproxy02 and failed the connection over to it

Total downtime: 7 minutes

Proposed Solution

In future, even low-risk operations on critical services such as HA Proxy should be carried out in a planned manner and outside business hours.

Identify the root cause of the communication gap regarding the missing DNS record with help from IT and take appropriate action.


2 Comments

  1. Hi Mandeep Saini, I think that we have an alternative, or an opportunity, here. We know that this error was reported, albeit as a warning, in UAT. I think it is worthwhile having a review amongst a quorum of admins before changes are pushed to critical services, to ensure that tests have come back positive or, at least, that there has been no behaviour change in the TEST and UAT systems before pushing from the PROD repo. It would also be worthwhile configuring a commit hook to email the quorum, or message them on Slack, when a commit to the PROD repo for critical systems has been made, so that it can be reviewed. I believe there are plugins already available for this practice if we want to follow that route and retain the agility in the configuration process.


  2. Hi Mandeep Saini, we are at an impasse on a clear resolution to this issue as we just do not have the relevant information. We do not have requests logged for DNS resolution of nodes in the backend groups for haproxy; such logging is a debug setting and should not be enabled on production Windows servers. We therefore cannot conclude whether haproxy asked for a resolution for dndrdc01 or not, nor whether haproxy received a resolution from a DNS server.
    What we can do in the future is:

    1. Alert when a node is seen as down from haproxy. This is also related to the ongoing monitoring and alerting work.
    2. Enable haproxy to start even when a resolution for a node cannot be obtained (ref: Dick Visser and Massimiliano Adamo's finding in the haproxy config docs).
    3. Enable a peer review process for changes pushed into production that includes alerts and messages found in the TEST and UAT systems before a PROD push. This should be thorough enough to find as many issues as we are aware of, but not so cumbersome as to prevent productivity.
    4. Accept that in the future we will find exceptions to the measures put in place above and allow a learning process to remediate and move forward.
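
As a complement to the measures above, a pre-restart check could flag backend hostnames that no longer resolve before haproxy is restarted. The following is a minimal sketch, not something that was in place during the incident. It assumes backends are declared on plain `server <name> <host>[:port]` lines; the function names (`backend_hosts`, `unresolvable_hosts`) are hypothetical, and IPv6 literals and `server-template` lines are not handled.

```python
import re
import socket
from typing import Callable, List

# Assumption: backends are declared as "server <name> <host>[:port] ...".
# Commented-out lines and "default-server" lines do not match this pattern.
SERVER_LINE = re.compile(r"^\s*server\s+\S+\s+([^\s:]+)(?::\d+)?")


def backend_hosts(config_text: str) -> List[str]:
    """Return the distinct hosts named on 'server' lines, in file order."""
    hosts: List[str] = []
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line)
        if match and match.group(1) not in hosts:
            hosts.append(match.group(1))
    return hosts


def resolves(host: str) -> bool:
    """True if the system resolver returns at least one address for host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def unresolvable_hosts(
    config_text: str, resolver: Callable[[str], bool] = resolves
) -> List[str]:
    """Return the backend hosts for which the resolver reports failure."""
    return [host for host in backend_hosts(config_text) if not resolver(host)]
```

A deployment step could read the haproxy configuration file, pass its text to `unresolvable_hosts`, and abort the restart if the returned list is not empty. The `resolver` parameter exists only so that the check can be exercised without real DNS lookups.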