Incident description

On 27th July 2019, the host names opsdb1.dante.net and opsdb2.dante.net could not be resolved from either of the Dashboard boxes.

Incident severity: CRITICAL

Data loss: NO
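
A failure of this kind (host names that stop resolving) could in principle be caught by a small monitoring check. The following is only a minimal sketch, not the check actually deployed: the function name `check_hosts` and its `resolve` parameter are hypothetical, and the only convention assumed is the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL).

```python
import socket

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_hosts(hostnames, resolve=socket.gethostbyname):
    """Try to resolve each host name; return (exit_code, message).

    `resolve` is injectable so the check can be exercised without
    touching real DNS. This is an illustrative helper, not part of
    any existing tooling.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    if failed:
        return CRITICAL, "CRITICAL - cannot resolve: " + ", ".join(failed)
    return OK, "OK - all %d host names resolve" % len(hostnames)
```

Run from a Dashboard box, a wrapper script would call `check_hosts(["opsdb1.dante.net", "opsdb2.dante.net"])`, print the message, and exit with the returned code so that Nagios can pick up the state.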

Timeline


Time (CET)
18 Jun, 23:16
28 Jul, 22:00

Issue reported by OC

19 Jun, 07:35

Picked up by Michael H the following morning

28 Jul, 22:30

Picked up by Robert L

28 Jul, 22:45

Fixed by updating the Dashboard application to point at prod-opsdb01.geant.net (and 02).

29 Jul, 08:10

The issue was reported to Devops for root cause analysis.

29 Jul
19 Jun, 08:30

Fixed by temporarily turning off SSL to restore the service. Initial investigation suggested that the certificate had expired, but this later turned out not to be the case.

19 Jun, 10:30

Further investigations were carried out to avoid such failures in future.

19 Jun, 16:08

The actual cause of the failure was identified: the certificates had been changed automatically as a result of IT patching.

20 Jun, 09:30

A proposal to avoid such failures in future was discussed between IT and SWD.

20 Jun
30 Jul, 11:30
Part one of the Nagios check in the proposed solution was implemented.


20 Jun, 16:30
New certs provided by IT installed on crowd servers. SSL switched back on (crowd ↔ AD).

Total downtime: 09:14 hours.

...