Incident Description

The Dashboard stopped processing traps from Sunday 15 May 2022 at 08:37 UTC until approximately 08:10 UTC on Monday 16 May 2022.

The impact of this service degradation was:

The production Dashboard stopped processing traps, and therefore no alarms were presented to the NOC.
No alarms or notifications were generated during the outage period.

Incident severity: Complete service outage

Data loss:

Total duration of incident: approximately 23.5 hours

Timeline

All times are in UTC

Date	Time	Description
15 May 2022	08:36	AMS-LON-IPTRUNK-300G (interface ae9), start_time=2022-05-15 08:36:45, end_time=2022-05-15 08:38:20
15 May 2022	08:37	The RabbitMQ cluster failed at 08:37 UTC and did not recover.
15 May 2022	08:38	Dashboard sent error notifications, and the standard GUI (as seen from inside the Géant network) showed the correlator in an error state. However, first-line support is outside the Géant network, and the status indicators have never worked for them.
16 May 2022	07:19	Josep Rivera and Will Barber alerted on #dashboard-v3-users that Dashboard was not processing traps or showing new alarms.
16 May 2022	08:09	Erik Reid identified the situation and manually restarted the RabbitMQ cluster.
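
As a cross-check of the durations implied by the timeline, the following minimal sketch computes the time to first report and the time to restoration. The timestamps are taken from the rows above; the variable and function names are illustrative only and do not come from any Dashboard tooling.

```python
from datetime import datetime, timezone

# Timestamps from the timeline above (all UTC).
outage_start = datetime(2022, 5, 15, 8, 37, tzinfo=timezone.utc)     # RabbitMQ cluster failure
first_report = datetime(2022, 5, 16, 7, 19, tzinfo=timezone.utc)     # first report on #dashboard-v3-users
service_restored = datetime(2022, 5, 16, 8, 9, tzinfo=timezone.utc)  # manual cluster restart


def hours_minutes(delta):
    """Convert a timedelta into a (whole hours, remaining minutes) tuple."""
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


time_to_report = hours_minutes(first_report - outage_start)
time_to_restore = hours_minutes(service_restored - outage_start)

print("Time to first report: %dh %02dm" % time_to_report)
print("Time to restoration:  %dh %02dm" % time_to_restore)
```

By this calculation the outage was first reported roughly 22 hours 42 minutes after it began, and service was restored after roughly 23 hours 32 minutes.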

The production cluster is known to become unstable during LON/AMS trunk issues (cf. DBOARD3-515), and a proposed cluster recovery algorithm is being tested as a workaround.
Testing of the efficacy of the new recovery algorithm is not complete, but it is assumed that the new configuration will be at least no less stable under these conditions. It was therefore decided to fast-track deployment of this workaround to production. The group will confirm this decision in next Monday's project review meeting (23 May 2022).