A missing DNS record for dndrdc01.win.dante.org.uk caused the HAProxy service to fail to restart. The server dndrdc01 was decommissioned on 22 May 2020, but its DNS record was only cleaned up recently. We still don't know exactly when the record was removed (we don't have access to the Windows DNS server), but we know that uat-haproxy also went down this morning for the same reason.
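A failure of this kind could in principle be caught before the service is restarted. The following is a minimal sketch, not the procedure used here: it assumes the backends are declared with ordinary `server <name> <host>:<port>` lines, ignores IPv6 literals, and the function names `backend_hostnames` and `unresolvable` are hypothetical.

```python
import re
import socket

# Matches "server <name> <address>[:port] ..." lines in an HAProxy config.
# Commented-out lines and "default-server" lines do not match.
SERVER_LINE = re.compile(r"^\s*server\s+\S+\s+([^\s:]+)(?::\d+)?")
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def backend_hostnames(config_text):
    """Return the sorted hostnames (not literal IPv4 addresses) used by server lines."""
    hosts = set()
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line)
        if match and not IPV4.match(match.group(1)):
            hosts.add(match.group(1))
    return sorted(hosts)


def unresolvable(hosts, resolve=socket.getaddrinfo):
    """Return the hosts for which the resolver raises an error.

    The resolver is injectable so the check can be exercised without real DNS.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

Run against the configuration file before the unattended job restarts the service, a non-empty result from `unresolvable(backend_hostnames(text))` would be a reason to skip the restart and raise an alert instead.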

  • A change request was not raised because this was considered a low-risk operation, which runs as an unattended job nightly, every day. In fact, the action didn't cause any issue on its own, but it triggered another issue.
  • The first certificate that was going to expire would have caused the same issue, maybe the coming night.
  • We could not foresee that a DNS record was deleted this same day.


Incident severity: CRITICAL

...

The downside is that we will do it during working hours, while the cron job runs overnight.


At the same time, the LDAP backend kept working, because HAProxy had been relying on the hot standby for a year:

  • we need to monitor each HAProxy backend (and check why one backend is in Warning state)
  • the change in May 2020 would have brought down IDP, but HAProxy used a standby node and was able to guarantee service continuity
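The per-backend monitoring mentioned above could be sketched as follows. This is only an illustration under stated assumptions: it expects the text of HAProxy's statistics CSV (as returned by the stats socket or the stats page in CSV form) and relies on its `pxname`, `svname` and `status` columns. The function name `servers_not_up` is hypothetical, and how the CSV is fetched is left out.

```python
import csv
import io


def servers_not_up(stats_csv):
    """Return (proxy, server, status) for each server row whose status is not exactly UP."""
    # The CSV header line starts with "# "; strip it so the column names are clean.
    text = stats_csv.lstrip()
    if text.startswith("# "):
        text = text[2:]
    problems = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue  # aggregate rows, not individual servers
        # Anything other than a plain "UP" is reported, including servers
        # that are in transition or have no health check configured.
        if row["status"] != "UP":
            problems.append((row["pxname"], row["svname"], row["status"]))
    return problems
```

A check like this would have reported the decommissioned server as down from May 2020 onwards, instead of the standby silently carrying the service.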