...
- Open one or more of the following RabbitMQ management consoles. (Credentials are in the "GÉANT Dashboard v3" LastPass folder)
- Scroll down to the "Nodes" section
- There should be 3 rows in the table and all status icons should be green. (A red bar currently shows a deprecated node; it will be removed when possible.) The expected node names are:
- rabbit@prod-noc-alarms01
- rabbit@prod-noc-alarms02
- rabbit@prod-noc-alarms03
...
- If all 3 nodes appear in the list, but the nodes report different states when you log in to their respective administration GUIs
- follow these instructions to restart/rebootstrap the cluster
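The node check from the analysis can also be sketched on the command line. This is a minimal sketch, not the documented procedure: the helper name missing_nodes is hypothetical, and the usage line assumes shell access to a broker host with rabbitmqctl and jq installed, and that the JSON output of cluster_status carries a running_nodes list.

```shell
# Expected node names, taken from the list in the analysis.
EXPECTED_NODES="rabbit@prod-noc-alarms01 rabbit@prod-noc-alarms02 rabbit@prod-noc-alarms03"

# missing_nodes (hypothetical helper): reads running node names on stdin,
# one per line, and prints every expected node that is absent.
missing_nodes() {
    running=$(cat)
    for node in $EXPECTED_NODES; do
        # -x matches whole lines only, -F treats the name as a fixed string
        printf '%s\n' "$running" | grep -qxF "$node" || printf '%s\n' "$node"
    done
}

# Assumed usage on a broker host (rabbitmqctl and jq are assumptions):
#   rabbitmqctl cluster_status --formatter json | jq -r '.running_nodes[]' | missing_nodes
```

Empty output means all three expected nodes are running; any printed name is a node to investigate.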
Possible Cause: Alarms are not forwarded to Geant Argus
Analysis
- Open one or more of the following RabbitMQ management consoles. (Credentials are in the "GÉANT Dashboard v3" LastPass folder)
- Click on the Queues and Streams tab
- Note the dashboard.notifiers.argus queue. It should have a Running state and fewer than 20 total messages.
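The same queue check can be approximated from a shell, which is handy for scripting or monitoring. The sketch below is an assumption-laden illustration: queue_ok is a hypothetical helper, and the usage line assumes shell access to a broker host, that the queue lives in the default vhost, and that rabbitmqctl prints tab-separated columns with "running" as the healthy state.

```shell
# queue_ok (hypothetical helper): reads tab-separated "name messages state"
# lines on stdin. Succeeds only if the named queue is present, is in the
# "running" state, and holds fewer than the given number of messages.
queue_ok() {
    awk -F '\t' -v q="$1" -v max="$2" '
        $1 == q { found = 1; ok = ($3 == "running" && $2 + 0 < max) }
        END { exit (found && ok) ? 0 : 1 }
    '
}

# Assumed usage on a broker host:
#   rabbitmqctl -q list_queues name messages state | queue_ok dashboard.notifiers.argus 20
```

The helper only sets an exit status, so it can be chained with `&& echo healthy || echo "check notifiers"`. A single reading does not show whether the count is increasing; run it twice a few minutes apart to see the trend.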
Solution
- If the queue holds more than 20 messages and the count keeps increasing, the notifiers are not running properly
- Log in to the following servers via ssh
- prod-noc-alarms-ui01.geant.org
- prod-noc-alarms-ui02.geant.org
- restart the argus notifier service:
systemctl restart argus-notifier.service
- The dashboard.notifiers.argus queue should now start to empty
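The restart on both UI servers could be run from a workstation in one pass. This is only a sketch: restart_on_hosts and the DRY_RUN switch are hypothetical, and it assumes your ssh login has the rights to restart the service. By default it only prints the commands it would run.

```shell
UI_HOSTS="prod-noc-alarms-ui01.geant.org prod-noc-alarms-ui02.geant.org"

# restart_on_hosts (hypothetical helper): runs one command on each host via ssh.
# DRY_RUN defaults to 1, so commands are printed rather than executed
# unless DRY_RUN=0 is set explicitly.
restart_on_hosts() {
    cmd=$1; shift
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf 'ssh %s %s\n' "$host" "$cmd"
        else
            # Keep going if one host fails, but report it on stderr.
            ssh "$host" "$cmd" || printf 'failed on %s\n' "$host" >&2
        fi
    done
}

# $UI_HOSTS is left unquoted on purpose so it splits into one argument per host.
# Prints the two ssh commands; set DRY_RUN=0 to actually run them.
restart_on_hosts "systemctl restart argus-notifier.service" $UI_HOSTS
```

Defaulting to a dry run is a deliberate choice here, so that pasting the snippet into a terminal cannot restart a production service by accident.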
Possible Cause: Collectors have stopped working
Analysis
- Open this Correlation status dashboard
- Scroll down to the "Collectors" panel
- Check that the graph shows a nonzero rate of traps being processed
Solution
- On each of the following servers:
- prod-noc-alarms01.geant.org
- prod-noc-alarms02.geant.org
- prod-noc-alarms03.geant.org
- Log in via ssh and execute the following command:
sudo systemctl restart trap_collector
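The per-server restart can be scripted as a rolling restart that confirms the service came back before moving on. This is a sketch under assumptions: restart_and_verify and the REMOTE override are hypothetical, and stopping at the first failure is a design choice of this example, not something the procedure above requires.

```shell
ALARM_HOSTS="prod-noc-alarms01.geant.org prod-noc-alarms02.geant.org prod-noc-alarms03.geant.org"

# REMOTE is the command used to reach a host: "ssh" in real use.
# It can be overridden so the loop can be rehearsed without touching servers.
REMOTE=${REMOTE:-ssh}

# restart_and_verify (hypothetical helper): restarts a unit on each host in
# turn and stops at the first host where the unit does not come back active.
restart_and_verify() {
    unit=$1; shift
    for host in "$@"; do
        $REMOTE "$host" "sudo systemctl restart $unit" || return 1
        # is-active --quiet prints nothing and returns non-zero if not active
        $REMOTE "$host" "systemctl is-active --quiet $unit" || {
            printf '%s not active on %s\n' "$unit" "$host" >&2
            return 1
        }
        printf '%s restarted on %s\n' "$unit" "$host"
    done
}

# Real use (assumes ssh access and sudo rights on the hosts):
#   restart_and_verify trap_collector $ALARM_HOSTS
```

Restarting one host at a time keeps the other collectors running while each one is cycled, and the early return leaves the remaining hosts untouched if something goes wrong.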
Possible Cause: Correlators have stopped working
Analysis
- Open this Correlation status dashboard
- Scroll down to the "Collectors" panel
- Check that the graph shows the leader collector processing a non-zero rate of traps. The current leader can be identified by the FORWARDER with state 2 in the "Raft States" panel.
Solution
- On each of the following servers:
- prod-noc-alarms01.geant.org
- prod-noc-alarms02.geant.org
- prod-noc-alarms03.geant.org
- Log in via ssh and execute the following command:
sudo systemctl restart trap_correlator
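Before or after restarting, it can help to see the service state on all three servers at a glance. The sketch below is illustrative only: unit_states and the REMOTE override are hypothetical, and it assumes ssh access to the hosts. It does not replace the "Raft States" panel, which remains the place to identify the leader.

```shell
ALARM_HOSTS="prod-noc-alarms01.geant.org prod-noc-alarms02.geant.org prod-noc-alarms03.geant.org"

# REMOTE is the command used to reach a host: "ssh" in real use.
REMOTE=${REMOTE:-ssh}

# unit_states (hypothetical helper): prints "<host> <state>" for each host,
# where state is whatever `systemctl is-active` reports (active, inactive,
# failed, ...), or "unreachable" if nothing came back at all.
unit_states() {
    unit=$1; shift
    for host in "$@"; do
        # is-active exits non-zero for anything but "active"; the state text
        # is what matters here, so the exit status is ignored.
        state=$($REMOTE "$host" "systemctl is-active $unit" 2>/dev/null) || true
        printf '%s %s\n' "$host" "${state:-unreachable}"
    done
}

# Real use (assumes ssh access to the hosts):
#   unit_states trap_correlator $ALARM_HOSTS
```

A host that reports anything other than "active" is the first candidate for the restart command above.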
...