Incident Description

Dashboard stopped processing traps from Sunday 08:37 UTC until Monday approximately 08:10 UTC

The impact of this service degradation was:


Incident severity:  Complete service outage

Data loss:

Total duration of incident: 23 hours


Timeline

All times are in UTC

DateTimeDescription

 

08:36AMS-LON-IPTRUNK-300G (interface ae9), start_time=2022-05-15 08:36:45, end_time=2022-05-15 08:38:20

 

08:37The RabbitMQ experienced a cluster at 08:37 UTC, and didn't recover

  

08:38

Dashboard sent error notifications and the standard gui (from inside the Géant network) showed correlator error state.  First-line support is however outside the Géant network and the status indicators have never worked for them.

  

07:19

Josep Rivera and Will Barber alerted on #dashboard-v3-users that Dashboard was not processing traps or showing new alarms.

 

08:09Erik Reid identified the situation and manually restarted the RabbitMQ cluster

Proposed Solution