...

  1. Open one or more of the following RabbitMQ management consoles (credentials are in the "GÉANT Dashboard v3" LastPass folder).
  2. Scroll down to the "Nodes" section.
  3. The table should contain 3 rows, and all status icons should be green. (Currently a red bar shows a deprecated node; it will be removed when possible.) The expected node names are:
    • rabbit@prod-noc-alarms01
    • rabbit@prod-noc-alarms02
    • rabbit@prod-noc-alarms03
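
The same check can be scripted against the RabbitMQ management HTTP API, whose `GET /api/nodes` endpoint returns a JSON list with a `name` and a `running` flag for each node. The following is a minimal sketch rather than an existing tool: the function name `check_nodes` is hypothetical, and it only inspects an already-decoded response, so it makes no assumption about the console URL or the credentials.

```python
# Expected cluster members, taken from the list above.
EXPECTED_NODES = {
    "rabbit@prod-noc-alarms01",
    "rabbit@prod-noc-alarms02",
    "rabbit@prod-noc-alarms03",
}


def check_nodes(nodes, expected=EXPECTED_NODES):
    """Return a list of problems found in a decoded /api/nodes response.

    `nodes` is the JSON list returned by the API; each item is assumed
    to carry a "name" string and a "running" boolean.
    """
    problems = []
    seen = {node["name"] for node in nodes if "name" in node}
    # Nodes that should be in the cluster but are not listed at all.
    for name in sorted(expected - seen):
        problems.append(f"missing node: {name}")
    # Nodes that are listed but not expected (e.g. a deprecated node).
    for name in sorted(seen - expected):
        problems.append(f"unexpected node: {name}")
    # Expected nodes that are listed but not running.
    for node in nodes:
        if node.get("name") in expected and not node.get("running", False):
            problems.append(f"node not running: {node['name']}")
    return problems
```

An empty result corresponds to the "3 rows, all green" condition. While the deprecated node is still listed, this sketch would report it as an unexpected node.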

...

  1. If all 3 nodes appear in the list, but the state of the nodes differs when logging into their respective administration GUIs

Possible Cause: Alarms are not forwarded to Geant Argus

Analysis
  1. Open one or more of the following RabbitMQ management consoles (credentials are in the "GÉANT Dashboard v3" LastPass folder).
  2. Click on the Queues and Streams tab.
  3. Note the dashboard.notifiers.argus queue. It should be in the Running state and hold fewer than 20 messages in total.
Solution
  1. If the queue holds 20 or more messages and the message count keeps increasing, the notifiers are not running properly.
  2. Log into the following servers via ssh:
    • prod-noc-alarms-ui01.geant.org
    • prod-noc-alarms-ui02.geant.org
  3. Restart the argus notifier service:
    • systemctl restart argus-notifier.service
  4. The dashboard.notifiers.argus queue should now start to empty.
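
The queue conditions in this section can also be expressed as two small checks. This is a hedged sketch: it assumes the queue object comes from the RabbitMQ management HTTP API (`GET /api/queues/{vhost}/{name}`), where `state` is reported as `"running"` and `messages` is the total message count. The function names are hypothetical.

```python
MAX_MESSAGES = 20  # threshold from the analysis step above


def queue_is_healthy(queue, max_messages=MAX_MESSAGES):
    """True if the queue is running and holds fewer than max_messages."""
    return (
        queue.get("state") == "running"
        and queue.get("messages", 0) < max_messages
    )


def backlog_is_growing(samples):
    """True if the message count strictly increases across the samples.

    `samples` is a list of queue objects fetched at successive times,
    oldest first. At least two samples are needed to see a trend.
    """
    counts = [sample.get("messages", 0) for sample in samples]
    return len(counts) >= 2 and all(a < b for a, b in zip(counts, counts[1:]))
```

A queue that fails `queue_is_healthy` and for which `backlog_is_growing` holds over a few samples matches the case where the notifier restart is needed. After the restart, later samples should show the count falling.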

Possible Cause: Collectors have stopped working

Analysis

  1. Open this Correlation status dashboard
  2. Scroll down to the "Collectors" panel
  3. Check that the graph shows a non-zero rate of traps being processed.

Solution

  1. On each of the following servers:
    • netprod-noc-alarms01.geant.org
    • netprod-noc-alarms02.geant.org
    • netprod-noc-alarms03.geant.org
  2. Log in via ssh and execute the following command:
    • sudo systemctl restart trap_collector
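
When the restart has to be repeated on all three servers, the steps above can be sketched as a short loop. This is an illustration only: it assumes key-based ssh access and non-interactive sudo on each host, and the helper names `restart_command` and `restart_all` are hypothetical.

```python
import subprocess

# Hosts taken from the list above.
COLLECTOR_HOSTS = [
    "netprod-noc-alarms01.geant.org",
    "netprod-noc-alarms02.geant.org",
    "netprod-noc-alarms03.geant.org",
]


def restart_command(host, service="trap_collector"):
    """Build the ssh argument list that restarts `service` on `host`."""
    return ["ssh", host, "sudo", "systemctl", "restart", service]


def restart_all(hosts=COLLECTOR_HOSTS, service="trap_collector",
                run=subprocess.run):
    """Restart the service on each host in turn; return the hosts that failed.

    `run` defaults to subprocess.run and can be replaced with a stub
    to rehearse the procedure without touching any server.
    """
    failed = []
    for host in hosts:
        result = run(restart_command(host, service), check=False)
        if result.returncode != 0:
            failed.append(host)
    return failed
```

Calling `restart_all()` returns an empty list when every restart command exits with status 0. Passing `service="trap_correlator"` would issue the correlator restart command on the same hosts.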

Possible Cause: Correlators have stopped working

Analysis

  1. Open this Correlation status dashboard
  2. Scroll down to the "Collectors" panel
  3. Check that the graph shows the leader collector processing a non-zero rate of traps.  The current leader can be identified by the FORWARDER with state 2 in the "Raft States" panel.

Solution

  1. On each of the following servers:
    • netprod-noc-alarms01.geant.org
    • netprod-noc-alarms02.geant.org
    • netprod-noc-alarms03.geant.org
  2. Log in via ssh and execute the following command:
    • sudo systemctl restart trap_correlator

In case production operation isn't restored quickly ...

If the production environment can't be recovered quickly, use the UAT environment temporarily. The UAT environment continually processes the same traps as production and uses the same IMS instance, so it should be usable while production operation is being restored. To access the UAT environment GUI, navigate to one of the following:


...