Dashboard stopped processing traps from Sunday 08:37 UTC until approximately 08:10 UTC on Monday.
The impact of this service degradation was:
Incident severity: Complete service outage
Data loss:
Total duration of incident: approximately 23.5 hours
All times are in UTC
| Date | Time | Description |
|---|---|---|
| 2022-05-15 | 08:36 | AMS-LON-IPTRUNK-300G (interface ae9), start_time=2022-05-15 08:36:45, end_time=2022-05-15 08:38:20 |
| 2022-05-15 | 08:37 | The RabbitMQ cluster experienced a failure and did not recover. |
| 2022-05-15 | 08:38 | Dashboard sent error notifications, and the standard GUI (when accessed from inside the Géant network) showed the correlator in an error state. First-line support, however, is outside the Géant network, and the status indicators have never worked for them. |
| 2022-05-16 | 07:19 | Josep Rivera and Will Barber alerted on #dashboard-v3-users that Dashboard was not processing traps or showing new alarms. |
| 2022-05-16 | 08:09 | Erik Reid identified the situation and manually restarted the RabbitMQ cluster. |
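
The outage duration can be cross-checked against the timeline with a few lines of Python. This is only a sketch under two assumptions: that the Monday entries fall on 2022-05-16 (the day after the 2022-05-15 date recorded in the first timeline entry), and that the outage ended at the 08:09 manual restart. The variable names are illustrative.

```python
from datetime import datetime, timezone

# Timestamps from the timeline (all UTC).
# Assumption: the Monday entries are on 2022-05-16, the day after 2022-05-15.
outage_start = datetime(2022, 5, 15, 8, 37, tzinfo=timezone.utc)  # RabbitMQ cluster failure
outage_end = datetime(2022, 5, 16, 8, 9, tzinfo=timezone.utc)     # manual cluster restart

# Split the elapsed time into whole hours and remaining minutes.
duration = outage_end - outage_start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"Outage duration: {hours} h {minutes} min")  # Outage duration: 23 h 32 min
```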