
Incident severity: Critical (complete service outage)

Data loss: Yes

Total duration of incident: 23 hours


Timeline

All times are in UTC

Date | Time | Description
2022-05-15 | 08:36 | AMS-LON-IPTRUNK-300G (interface ae9), start_time=2022-05-15 08:36:45, end_time=2022-05-15 08:38:20
2022-05-15 | 08:37 | The RabbitMQ cluster experienced a partition at 08:37 UTC and did not recover.
2022-05-15 | 08:38 | Dashboard sent error notifications, and the standard GUI (from inside the Géant network) showed the correlator in an error state. First-line support, however, is outside the Géant network, and the status indicators have never worked for them.
2022-05-16 | 06:19 | Josep Rivera and Will Barber alerted on #dashboard-v3-users that Dashboard was not processing traps or showing new alarms.
2022-05-16 | 07:09 | Erik Reid began investigating the situation.
2022-05-16 | 07:26 | The RabbitMQ cluster was manually restarted and normal service was restored.
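
The stated total duration can be cross-checked against the timeline with a short calculation. This is a minimal sketch: the start is taken as the 08:37 RabbitMQ partition and the end as the 07:26 manual restart, and the 2022-05-16 date for the restart is an assumption inferred from the 23-hour duration rather than something the timeline states. The function name is hypothetical.

```python
from datetime import datetime, timezone

# Timestamps from the timeline (UTC). The 2022-05-16 date of the restart is
# an assumption inferred from the stated 23-hour duration.
partition_start = datetime(2022, 5, 15, 8, 37, tzinfo=timezone.utc)
service_restored = datetime(2022, 5, 16, 7, 26, tzinfo=timezone.utc)


def incident_duration_hours(start, end):
    """Return the elapsed time between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


duration = incident_duration_hours(partition_start, service_restored)
print(f"{duration:.1f} hours")  # 22 h 49 min, which rounds to 23 hours
```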

Proposed Solution

  • The stability of the production cluster is known to be vulnerable to LON/AMS trunk issues (cf. DBOARD3-515), and a proposed cluster recovery algorithm is being tested as a workaround.
  • Testing of the new recovery algorithm's efficacy is not complete, but the new configuration is assumed to be at least no less stable under these conditions. It was therefore decided to fast-track deployment of this workaround to production. The group will confirm this decision in next Monday's project review meeting ().
  • Firewall configuration on the dashboard monitoring service will be updated so that first-line support also sees the correct system status indications from outside the Géant network.
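
For illustration only, and not part of the actions agreed above: a partition like the one in this incident could in principle be spotted by inspecting the per-node `partitions` list that the RabbitMQ management plugin reports from its `/api/nodes` endpoint. The sketch below assumes that plugin is enabled and works on already-parsed JSON, so it makes no network calls. The function name, node names, and sample data are all hypothetical.

```python
def partitioned_nodes(nodes):
    """Map each affected node name to the peers it reports being partitioned from.

    `nodes` is assumed to be the parsed JSON list from the management API's
    /api/nodes endpoint; nodes with an empty or missing list are skipped.
    """
    return {
        node["name"]: node["partitions"]
        for node in nodes
        if node.get("partitions")
    }


# Hypothetical sample shaped like an /api/nodes response (fields trimmed).
sample = [
    {"name": "rabbit@node-a", "running": True, "partitions": ["rabbit@node-c"]},
    {"name": "rabbit@node-b", "running": True, "partitions": []},
    {"name": "rabbit@node-c", "running": True, "partitions": ["rabbit@node-a"]},
]

affected = partitioned_nodes(sample)
if affected:
    print("PARTITION DETECTED:", affected)
```

A check of this kind would only help first-line support if its result is visible from outside the Géant network, which is the gap the firewall change is meant to close.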