Dashboard Correlator
High-Level Process Flows
JMS Messaging Entry Point
Incoming JMS messages to be processed by the Correlator application are received by one of the following 3 classes:
- MessagingEntryPoint (the entry point for all process-related messages)
- SystemSettingChangeEntryPoint (the entry point for all notifications about changes to system settings that affect correlation, e.g. the correlation window)
- NetworkMapReloader (receives only one type of message: a notification to reload the network map)
The <jms:listener-container> beans in correlator.xml map a particular JMS destination (queue or topic) to a "handling" method in one of these classes (example shown below):
<jms:listener-container connection-factory="connectionFactory" destination-type="topic" concurrency="1"> <!-- Concurrency=1 for topics -->
    <jms:listener destination="#{esbProperties.finalisedSettingQueue}" ref="systemSettingChangeEntryPoint" method="receiveCorrelationWindowChange"/>
</jms:listener-container>
There is one listener container for queues, and one for topics.
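On the Java side, a listener declared with a method attribute is typically a plain object whose named method receives the converted message payload. The following is a minimal sketch of what such a handler could look like. The class name suffix, the payload format (a plain number of milliseconds as text) and the default value are assumptions made for illustration, not details taken from the Correlator source.

```java
/**
 * Hypothetical sketch of a handler targeted by a <jms:listener> entry.
 * The payload format is an assumption: the window length in milliseconds, as text.
 */
final class SystemSettingChangeEntryPointSketch {

    // volatile so that event handling threads see the latest value without locking
    private volatile long correlationWindowMillis = 20_000L; // assumed default

    /** Method name matches the method attribute of the listener definition. */
    public void receiveCorrelationWindowChange(String payload) {
        long parsed = Long.parseLong(payload.trim());
        if (parsed <= 0) {
            throw new IllegalArgumentException("Correlation window must be positive: " + parsed);
        }
        correlationWindowMillis = parsed;
    }

    long correlationWindowMillis() {
        return correlationWindowMillis;
    }
}
```

Keeping the handler free of JMS types means it can be exercised directly, without a broker.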
Main Event Processing Flows
Listed below are the high-level event processing flows for down/up event handling and correlation. This is only an outline of the overall algorithm (a guide to what these parts of the application do, rather than a detailed explanation).
Note that correlation is handled in a cross-cutting thread rather than in the event handling threads themselves, which keeps the concurrent algorithm cleaner. The event processing threads are as short-running as possible and can run in parallel, and they do not share mutable state with one another. They do, however, share mutable state with the correlation thread (see the concurrency document for more detail on the high-level concurrency approach).
DOWN Event
- Receives trap
- Finds trap source
- For DOWN event,
  - Registers DOWN event on trap source
  - Updates trap source status and other trap source state
  - Adds this trap source to a sorted concurrent collection of trap-sources-with-registered-events
  - Schedules correlation for some time in the future (defined by the current correlation interval, usually around 20 seconds)
  - Stores the correlation task for this event in a concurrent map (so it can be cancelled if this event is correlated by another event's correlation task)
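The DOWN-event steps above can be sketched with standard java.util.concurrent types. This is only an illustration under stated assumptions: trap sources are represented by string ids, events by sequence numbers, and all class, field and method names are hypothetical rather than taken from the Correlator code.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of DOWN-event handling; all names are illustrative. */
final class DownEventSketch {
    // Events registered per trap source (stand-in for state held on the trap source).
    final Map<String, Queue<Long>> eventsBySource = new ConcurrentHashMap<>();
    // Sorted concurrent collection of trap sources with registered events.
    final ConcurrentSkipListSet<String> sourcesWithEvents = new ConcurrentSkipListSet<>();
    // Correlation task per event, kept so another event's correlation can cancel it.
    final Map<Long, ScheduledFuture<?>> correlationTasks = new ConcurrentHashMap<>();

    private final ScheduledExecutorService scheduler;
    private final long correlationIntervalMillis;

    DownEventSketch(ScheduledExecutorService scheduler, long correlationIntervalMillis) {
        this.scheduler = scheduler;
        this.correlationIntervalMillis = correlationIntervalMillis;
    }

    /** Registers the event, then schedules its correlation and remembers the task. */
    void onDownEvent(long eventSequence, String trapSourceId, Runnable correlation) {
        eventsBySource
            .computeIfAbsent(trapSourceId, id -> new ConcurrentLinkedQueue<>())
            .add(eventSequence);
        sourcesWithEvents.add(trapSourceId);
        ScheduledFuture<?> task =
            scheduler.schedule(correlation, correlationIntervalMillis, TimeUnit.MILLISECONDS);
        correlationTasks.put(eventSequence, task);
    }

    /** Cancels a pending correlation; returns false if none was pending for the event. */
    boolean cancelCorrelation(long eventSequence) {
        ScheduledFuture<?> task = correlationTasks.remove(eventSequence);
        return task != null && task.cancel(false);
    }
}
```

The handler does no correlation work itself, which is what keeps the event handling thread short-running.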
UP Event
- For UP event,
  - If there are currently registered events,
    - Registers UP event on trap source
    - Updates trap source status and other trap source state
  - If there are no currently registered events,
    - Retrieves last down event sequence number from trap source
    - Finds the DB down alarm entity corresponding to this sequence number
    - Creates a new up alarm entity corresponding to this event
    - Assigns the up alarm entity to the down alarm entity
    - Checks to see if the correlated alarm to which this down alarm entity is linked needs clearing, and clears it if it does
    - Checks to see if the coalesced alarm of which this correlated alarm forms a part needs "group clearing":
      - clears it if it does
      - sends a message to the OTRS client to update the state of the ticket linked to this coalesced alarm to RESTORED
    - Sends a message to the webapp to refresh the display of all logged-in browsers (to display any clearance changes)
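The two UP-event branches can be sketched as follows. This is a simplified, single-threaded illustration: the in-memory map stands in for the database lookup, the clearing rule (clear once every linked down alarm has an up alarm) is an assumption, and all type and method names are hypothetical. Group clearing, the OTRS message and the webapp refresh are only marked by a comment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the UP-event branches; not the Correlator's actual classes. */
final class UpEventSketch {
    static final class DownAlarm {
        Long upSequence;          // set once an up alarm has been assigned
        CorrelatedAlarm parent;   // correlated alarm this down alarm is linked to
    }

    static final class CorrelatedAlarm {
        final List<DownAlarm> downAlarms = new ArrayList<>();
        boolean cleared;
    }

    static final class TrapSource {
        final List<String> registeredEvents = new ArrayList<>();
        long lastDownSequence;
        String status = "DOWN";
    }

    // Stand-in for the database lookup of down alarms by sequence number.
    final Map<Long, DownAlarm> downAlarmsBySequence = new HashMap<>();

    /** Returns true if the UP event was applied directly to stored alarms. */
    boolean onUpEvent(TrapSource source, long upSequence) {
        if (!source.registeredEvents.isEmpty()) {
            // A correlation is pending: register the UP event and leave it to that task.
            source.registeredEvents.add("UP:" + upSequence);
            source.status = "UP";
            return false;
        }
        DownAlarm down = downAlarmsBySequence.get(source.lastDownSequence);
        if (down == null) {
            return false; // no matching down alarm; this case is not covered by the outline
        }
        down.upSequence = upSequence; // assign the up alarm to the down alarm
        CorrelatedAlarm correlated = down.parent;
        // Assumed rule: clear when every linked down alarm has an up alarm.
        if (correlated != null
                && correlated.downAlarms.stream().allMatch(d -> d.upSequence != null)) {
            correlated.cleared = true;
        }
        // Group clearing, the OTRS update and the webapp refresh would follow here.
        return true;
    }
}
```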
Correlation
- Acquires network map write lock
- Removes the trap source of the event whose correlation thread this is, and all related trap sources, from trap-sources-with-registered-events
- Creates all source alarms and a single correlated alarm from the events registered on these trap sources:
  - This includes any registered UP events which came in during the correlation window
- Deregisters all events from the related trap sources
- Cancels all correlation tasks for these events (all except the correlation task currently running), since they have now been correlated
- Checks to see if the traps which resulted in this correlated alarm being generated were caused by maintenance in whole or in part:
  - If all affected services are under maintenance,
    - Automatically acknowledges the alarm
    - Adds a comment saying "Planned maintenance"
  - If only some affected services are under maintenance,
    - Removes the services under maintenance from the alarm description
    - Adds a comment saying "The following services are under maintenance: ____"
- Checks for any rules which catch this correlated alarm, and executes them if there are any (affects severity)
- Schedules execution of the following services at the end of the "clearance window":
  - Short term check
  - Coalesce
  - OTRS service
- Releases write lock
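The locking, clean-up and maintenance steps of the correlation task can be sketched as below. This is a sketch under assumptions, not the actual implementation: correlation tasks are keyed by trap source id here for brevity (the outline keys them by event), alarm creation and rule execution are omitted, and all names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Collectors;

/** Hypothetical sketch of the correlation task; all names are illustrative. */
final class CorrelationSketch {
    final ReentrantReadWriteLock networkMapLock = new ReentrantReadWriteLock();
    final ConcurrentSkipListSet<String> sourcesWithEvents = new ConcurrentSkipListSet<>();
    // Simplification: one pending correlation task per trap source id.
    final Map<String, Future<?>> tasksBySource = new ConcurrentHashMap<>();

    /** Runs on the correlation thread scheduled for ownSource. */
    Optional<String> correlate(String ownSource, Set<String> relatedSources,
                               List<String> affectedServices, Set<String> underMaintenance) {
        networkMapLock.writeLock().lock();
        try {
            sourcesWithEvents.remove(ownSource);
            sourcesWithEvents.removeAll(relatedSources);
            // Source alarms and the correlated alarm would be created here.
            for (String source : relatedSources) {
                Future<?> other = tasksBySource.remove(source);
                if (other != null) {
                    other.cancel(false); // already correlated by this task
                }
            }
            tasksBySource.remove(ownSource); // the running task is not cancelled
            return maintenanceComment(affectedServices, underMaintenance);
        } finally {
            networkMapLock.writeLock().unlock(); // released even if a step throws
        }
    }

    /** Empty if no affected service is under maintenance, otherwise the comment text. */
    static Optional<String> maintenanceComment(List<String> affected, Set<String> underMaintenance) {
        List<String> inMaintenance = affected.stream()
            .filter(underMaintenance::contains)
            .collect(Collectors.toList());
        if (inMaintenance.isEmpty()) {
            return Optional.empty();
        }
        if (inMaintenance.size() == affected.size()) {
            return Optional.of("Planned maintenance");
        }
        return Optional.of("The following services are under maintenance: "
            + String.join(", ", inMaintenance));
    }
}
```

Releasing the lock in a finally block ensures a failure in any intermediate step cannot leave the network map locked.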