...
- Open one or more of the following RabbitMQ management consoles. (Credentials are in the "GÉANT Dashboard v3" LastPass folder)
- Scroll down to the "Nodes" section
- The table should contain 3 rows and all status icons should be green. (Currently a red bar shows a deprecated node; it will be removed when possible.) The expected node names are:
- rabbit@prod-noc-alarms01
- rabbit@prod-noc-alarms02
- rabbit@prod-noc-alarms03
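The same node check can be sketched on the command line. This is a minimal sketch, not an established procedure: the function name `missing_nodes` is hypothetical, and the usage comment assumes the RabbitMQ management HTTP API is reachable with the console credentials and that `curl` and `jq` are installed. The host placeholder is left unfilled on purpose.

```shell
#!/bin/sh
# Expected node names, copied from the list above.
EXPECTED_NODES="rabbit@prod-noc-alarms01 rabbit@prod-noc-alarms02 rabbit@prod-noc-alarms03"

# missing_nodes (hypothetical helper): reads the names of running nodes,
# one per line, on stdin and prints every expected node that is absent.
missing_nodes() {
    running=$(cat)
    for node in $EXPECTED_NODES; do
        # -x matches the whole line, -F treats the name as a fixed string
        printf '%s\n' "$running" | grep -qxF "$node" || printf '%s\n' "$node"
    done
}

# Example usage (assumption: placeholder host, credentials in RMQ_USER/RMQ_PASS):
#   curl -s -u "$RMQ_USER:$RMQ_PASS" "https://<management-host>/api/nodes" \
#     | jq -r '.[] | select(.running) | .name' | missing_nodes
```

Empty output means all three expected nodes are reported as running. Any name printed is a node to investigate.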
...
- If all 3 nodes appear in the list, but the state of the nodes differs between their respective administration GUIs:
- follow these instructions to restart/rebootstrap the cluster
Possible Cause: Alarms are not forwarded to Geant Argus
Analysis
- Open one or more of the following RabbitMQ management consoles. (Credentials are in the "GÉANT Dashboard v3" LastPass folder)
- Click on the Queues and Streams tab
- Note the dashboard.notifiers.argus queue. It should have a Running state and fewer than 20 total messages.
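The health rule above (Running state, fewer than 20 messages) can be expressed as a small check. This is a sketch under assumptions: `queue_healthy` is a hypothetical helper, and the usage comment assumes the management HTTP API is used to read the queue's `state` and `messages` fields. The virtual host is not stated on this page, so it is left as a placeholder.

```shell
#!/bin/sh
# queue_healthy STATE MESSAGES (hypothetical helper): succeeds only when
# the queue state is running and the total message count is below 20.
queue_healthy() {
    case $1 in
        running|Running) ;;      # the API reports lowercase, the console capitalises it
        *) return 1 ;;
    esac
    [ "$2" -lt 20 ]
}

# Example usage (assumption: placeholder host and vhost, curl and jq installed):
#   set -- $(curl -s -u "$RMQ_USER:$RMQ_PASS" \
#       "https://<management-host>/api/queues/<vhost>/dashboard.notifiers.argus" \
#       | jq -r '"\(.state) \(.messages)"')
#   queue_healthy "$1" "$2" && echo "queue OK" || echo "queue needs attention"
```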
Solution
- If the queue holds 20 or more messages and the message count keeps increasing, then the notifiers are not running properly
- log into the following servers via ssh
- prod-noc-alarms-ui01.geant.org
- prod-noc-alarms-ui02.geant.org
- restart the argus notifier service:
systemctl restart argus-notifier.service
- the dashboard.notifiers.argus queue should now start to empty
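To confirm that the queue is emptying after the restart, two message counts taken a short time apart can be compared. A minimal sketch, assuming the counts are read from the management console or its HTTP API. The helper name `queue_draining` is hypothetical, and the threshold of 20 is the one given in the Analysis step.

```shell
#!/bin/sh
# queue_draining BEFORE AFTER (hypothetical helper): succeeds when the later
# sample is already under the 20-message threshold, or is smaller than the
# earlier sample (the backlog is shrinking).
queue_draining() {
    [ "$2" -lt 20 ] || [ "$2" -lt "$1" ]
}

# Example usage with two manually noted counts:
#   queue_draining 350 120 && echo "draining" || echo "still growing or stuck"
```

If the second sample is not lower than the first and is still at 20 or above, the restart has not helped.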
Possible Cause: Collectors have stopped working
Analysis
- Open this Correlation status dashboard
- Scroll down to the "Collectors" panel
- Check that the graph shows a nonzero rate of traps being processed
Solution
- On each of the following servers:
- netprod-noc-alarms01.geant.org
- netprod-noc-alarms02.geant.org
- netprod-noc-alarms03.geant.org
- Log in via ssh and execute the following command:
sudo systemctl restart trap_collector
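The restart on all three servers can be sketched as a loop. This is an illustration only: `restart_on_hosts` and the `SSH_CMD` override are hypothetical names, and the sketch assumes key-based ssh access and sudo that does not prompt for a password (otherwise run the command interactively on each host as described above).

```shell
#!/bin/sh
# Hosts copied from the list above.
COLLECTOR_HOSTS="netprod-noc-alarms01.geant.org netprod-noc-alarms02.geant.org netprod-noc-alarms03.geant.org"

# SSH_CMD can be overridden (e.g. SSH_CMD=echo) for a dry run.
SSH_CMD=${SSH_CMD:-ssh}

# restart_on_hosts SERVICE HOST... (hypothetical helper): restarts SERVICE
# on each host in turn and stops at the first failure.
restart_on_hosts() {
    service=$1
    shift
    for host in "$@"; do
        echo "== $host: restarting $service"
        $SSH_CMD "$host" sudo systemctl restart "$service" || {
            echo "restart failed on $host" >&2
            return 1
        }
    done
}

# Example usage:
#   restart_on_hosts trap_collector $COLLECTOR_HOSTS
```

Running with `SSH_CMD=echo` prints the commands that would be executed without contacting any server.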
Possible Cause: Correlators have stopped working
Analysis
- Open this Correlation status dashboard
- Scroll down to the "Collectors" panel
- Check that the graph shows the leader collector processing a non-zero rate of traps. The current leader can be identified by the FORWARDER with state 2 in the "Raft States" panel.
Solution
- On each of the following servers:
- netprod-noc-alarms01.geant.org
- netprod-noc-alarms02.geant.org
- netprod-noc-alarms03.geant.org
- Log in via ssh and execute the following command:
sudo systemctl restart trap_correlator
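After the restart, it may help to confirm that the service is active on every server. The sketch below relies on `systemctl is-active`, which prints `active` for a running unit. The helper name `inactive_hosts` and the `SSH_CMD` override are hypothetical, and key-based ssh access is assumed.

```shell
#!/bin/sh
# Hosts copied from the list above.
CORRELATOR_HOSTS="netprod-noc-alarms01.geant.org netprod-noc-alarms02.geant.org netprod-noc-alarms03.geant.org"

# SSH_CMD can be overridden with a stub for testing.
SSH_CMD=${SSH_CMD:-ssh}

# inactive_hosts SERVICE HOST... (hypothetical helper): prints each host
# on which `systemctl is-active SERVICE` does not report "active".
inactive_hosts() {
    service=$1
    shift
    for host in "$@"; do
        # is-active exits non-zero for inactive units, so ignore the exit code
        status=$($SSH_CMD "$host" systemctl is-active "$service" 2>/dev/null) || true
        [ "$status" = "active" ] || printf '%s\n' "$host"
    done
}

# Example usage:
#   inactive_hosts trap_correlator $CORRELATOR_HOSTS
```

Empty output means the service reports as active on all listed hosts. An unreachable host is also printed, since no `active` status comes back from it.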
In case production operation isn't restored quickly ...
If the production environment can't be recovered quickly, please use the UAT environment temporarily. The UAT environment continually processes the same traps as production and uses the same IMS instance, so it should be usable while production operation is being restored. To access the UAT environment GUI, please navigate to one of the following:
...