Project Overview

Project Name: Argus Dashboard
Purpose: Evaluate whether the web-based Argus application is suitable as a dashboard for monitoring our Network & Alerts system.

Current State Assessment

System A (As-Is)

Overview: Existing Internal Dashboard (Java Application) (Geant) (Closed-Source)

Strengths:

Aggregates alarms from various sources such as Infinera, NCC, Juniper, IMS, etc.
Intuitive and familiar to the consumers of the service.
Proven reliability (tried and tested).
Has been through years of development, feature requests and refinements.
Complex filtering capabilities.
Blacklisting.
Different viewing modes (large-screen mode, PC mode).
Supports alerts with different states and prioritisation levels.
Provides an API for integration.
Retains alarm history for reporting purposes.
Complete ownership and control of the source code

Weaknesses:

Difficulty in updating due to a legacy Java code base.
Potential security vulnerabilities due to its legacy nature.
Limited compatibility with modern tools developed over time.
Aesthetic limitations compared to modern HTML5-based GUIs.
Complex build pipeline.

System B (As-Is)

Overview: Argus (Sikt) (Open-Source)

Strengths:

Looks great: a modern UI with an HTML frontend that is easy to develop and extend, and for which help is easy to find online.
Built on a solid, well-known technology stack (Django backend).
Potential to use partially managed development, freeing up internal resources.
Developed within the family of networks
Source code is available
Potential for customisable screens based on user permissions
Easier to extend functionality (e.g. could short-lived alarms be integrated?)
Plug-in Architecture
Enjoyable development experience.

Weaknesses:

Ontological alignment. Our “Alarms” and Argus “Incidents” are slightly different, so we need to explore the consequences of this (e.g. we want “Alarms” to be displayed that don’t have incident tickets in the ticketing system)
Django requires internal technical expertise or even ‘Django devs’
Lack of Alarm states (our “phases”)
Absence of alarm history or searchability.
Only one line of acknowledgment (We have first and second line requirements)
Inability to drill down into issues.
No blacklists.
Filtering not as comprehensive.
Won’t naturally coalesce or correlate (integrations required).
Flapping not addressed (for future consideration).
Prioritisation not handled.

Desired Future State

Overview:
We recognise and appreciate the mission of the Sikt team. A common tool could be used among NRENs for alert aggregation, adhering to ITIL best practices and standards.

Argus has positioned itself as a promising candidate for an alert aggregation tool by adopting an open-source approach and actively promoting its usage and availability at networking conferences.

We don’t want 'a fork' of Argus, but would strongly prefer a unified system that can accommodate extended use cases. Our rough impression is that the UI “skin” would be a relatively straightforward part to develop on its own, but the fundamental Argus backend use case differs – the main discussion points will be to decide if it’s feasible to have a common backend and/or pluggable architecture that can accommodate both applications.

Argus meets some of our requirements, but it currently misses others that it needs in order to be considered fit for purpose. To be a complete replacement for existing tools it must also achieve at least the following:

Alert complexity: alarms need to have 'states' for severity and prioritisation. These are vital for recognising the importance of incidents.
Deep diving: tools for drilling down into alerts, and for coalescing and correlating issues for remediation.
History: vital for reporting on the availability and utilisation of services, and a major requirement of our stakeholders.
Integration with our existing or new workflows: APIs and other services such as ticketing. Third-party systems can also change (e.g. OTRS > Avanti), so we need to be able to pivot quickly when necessary.
Simple to develop new features at the request of users (co-invention with the NOC and SOC teams).

We cover these a little more extensively in the Gap Analysis below. From this document we will raise an RFI to give the development team an opportunity to give feedback on the feasibility of filling the currently identified gaps, which are not considered exhaustive.

Gap Analysis

Functional Gaps

Feature/Functionality: Alarm States

⁃ Current State (System A): Alarm states are complex and can also be coloured or flashing. The first and second line support teams are trained to recognise these at a glance.
⁃ Current State (System B): Alarms appear to have only one fixed state.

Desired Future State: Alarms can have states, for example flashing if new, or different colours to indicate whether they are pending or urgent.
Gap: Argus currently cannot represent multiple alarm states (e.g. flashing for new, different colours for pending/urgent). Without them, the support teams cannot distinguish between alarm conditions at a glance, which may delay the response to critical issues. This could have an impact on operations and UX.
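
As a rough illustration of the desired behaviour, the sketch below maps alarm states to a display style. It is not based on the Argus code base: the state names, colours and the `style_for` helper are all assumptions made for the example, and the real set of states would have to come from the existing dashboard's "phases".

```python
from dataclasses import dataclass
from enum import Enum


class AlarmState(Enum):
    """Hypothetical alarm states; the real list would mirror our existing phases."""
    NEW = "new"
    PENDING = "pending"
    URGENT = "urgent"


@dataclass(frozen=True)
class DisplayStyle:
    colour: str
    flashing: bool


# Illustrative mapping only: the colours and flashing rules are assumptions.
STYLE_BY_STATE = {
    AlarmState.NEW: DisplayStyle(colour="white", flashing=True),
    AlarmState.PENDING: DisplayStyle(colour="amber", flashing=False),
    AlarmState.URGENT: DisplayStyle(colour="red", flashing=False),
}


def style_for(state: AlarmState) -> DisplayStyle:
    """Return how an alarm in the given state should be rendered."""
    return STYLE_BY_STATE[state]
```

Keeping the state-to-style mapping in one table would let the UI "skin" change colours without touching how states are stored.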

Feature/Functionality: Multiple stages of Acknowledgment

⁃ Current State (System A): The dashboard differentiates between acknowledgement by the 1st and by the 2nd line support teams that an alert has been recognised.
⁃ Current State (System B): Alarms have only one level of acknowledgement (‘Acked’ is a tag or status).

Desired Future State: Alarms have two levels of acknowledgement.
Gap: Argus currently supports only a single level of acknowledgement, which is a functional gap for a workflow with more than one support team. 2nd line support should not spend time on issues that 1st line has already investigated, and they also need to know whether 1st line has addressed a given issue within the time set by the SLA.
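
The two-level requirement can be sketched as a small data model. This is a minimal sketch, not how Argus stores acknowledgements: the `Alarm` class, its field names, and the rule that 2nd line can only acknowledge after 1st line are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alarm:
    """Hypothetical alarm record with separate 1st and 2nd line acknowledgements."""
    raised_at: datetime
    first_line_ack_at: Optional[datetime] = None
    second_line_ack_at: Optional[datetime] = None

    def acknowledge(self, line: int, when: datetime) -> None:
        """Record an acknowledgement for support line 1 or 2."""
        if line == 1:
            self.first_line_ack_at = when
        elif line == 2:
            # Assumed rule: 2nd line only acknowledges after 1st line has.
            if self.first_line_ack_at is None:
                raise ValueError("1st line has not acknowledged this alarm yet")
            self.second_line_ack_at = when
        else:
            raise ValueError("line must be 1 or 2")

    def first_line_within_sla(self, sla: timedelta, now: datetime) -> bool:
        """True if 1st line acknowledged before the SLA deadline, or the deadline has not passed yet."""
        deadline = self.raised_at + sla
        if self.first_line_ack_at is not None:
            return self.first_line_ack_at <= deadline
        return now <= deadline
```

Storing a timestamp per support line, rather than a single ‘Acked’ flag, is what makes the SLA check possible.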

Feature/Functionality: Correlation and Coalescing

⁃ Current State (System A): Supports coalescing and correlating issues for remediation.
⁃ Current State (System B): Lacks the ability to coalesce and correlate alerts effectively.

Desired Future State: A system capable of intelligently correlating and coalescing related alerts to provide a more consolidated and actionable view.
Gap: The absence of advanced correlation and coalescing capabilities in Argus may lead to an increased workload for support teams and hinder the efficiency of issue resolution.
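
To make "coalescing" concrete, the sketch below groups alerts that share a source and a piece of equipment. It is a deliberately simple assumption about what a coalescing key could look like; the `Alert` fields and the `coalesce` function are hypothetical and do not describe the existing dashboard's correlation logic.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Alert(NamedTuple):
    source: str      # e.g. "Juniper" or "Infinera"
    equipment: str   # hypothetical identifier of the affected device
    message: str


def coalesce(alerts: Iterable[Alert]) -> Dict[Tuple[str, str], List[Alert]]:
    """Group alerts that share the same source and equipment."""
    groups: Dict[Tuple[str, str], List[Alert]] = defaultdict(list)
    for alert in alerts:
        groups[(alert.source, alert.equipment)].append(alert)
    return dict(groups)
```

A real implementation would likely need richer keys (for example topology or service information), which is why integrations are expected to be required.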

Feature/Functionality: Status of live systems

⁃ Current State (System A): Shows a status of services (traffic lights)
⁃ Current State (System B): Lacks the ability to show a status of live services

Desired Future State: The system should show the status of the collector, classifier, correlation and inventory provider components.
Gap: Without a component showing the status of our live services, the teams cannot be confident that the alarms are up to date and not missing information.
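
One possible way to derive a traffic light per component is from the age of its last heartbeat. The sketch below assumes each component reports a heartbeat timestamp; the thresholds and function names are illustrative assumptions, not a description of how the existing dashboard computes its status.

```python
from datetime import datetime, timedelta
from typing import Dict

# Assumed thresholds; real values would be agreed with the NOC.
AMBER_AFTER = timedelta(minutes=1)
RED_AFTER = timedelta(minutes=5)


def traffic_light(last_heartbeat: datetime, now: datetime) -> str:
    """Map the age of the last heartbeat to green, amber or red."""
    age = now - last_heartbeat
    if age >= RED_AFTER:
        return "red"
    if age >= AMBER_AFTER:
        return "amber"
    return "green"


def service_status(heartbeats: Dict[str, datetime], now: datetime) -> Dict[str, str]:
    """Return a traffic light for each component, keyed by component name."""
    return {name: traffic_light(seen, now) for name, seen in heartbeats.items()}
```

This kind of small, read-only view is a candidate for an internally built plug-in.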

Feature/Functionality: Priority

⁃ Current State (System A): Alerts can be prioritised by a number
⁃ Current State (System B): Lacks the ability to prioritise. It only has severity, which is a different concept.

Desired Future State: The system should allow the support team to set a priority order on the tickets.
Gap: The absence of prioritisation makes it difficult for the NOC and the different lines of support to know which alert should take precedence.
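
The difference between severity and priority can be shown with a short sketch: severity is reported by the source system, while priority is a number set by the support team. The `Alert` fields and the convention that a lower number means more urgent are assumptions for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    description: str
    severity: int   # reported by the source system; assumed lower = more severe
    priority: int   # set by the support team; assumed 1 = handle first


def by_precedence(alerts: List[Alert]) -> List[Alert]:
    """Order alerts by operator-set priority, using severity only as a tie-breaker."""
    return sorted(alerts, key=lambda a: (a.priority, a.severity))
```

With this ordering a low-severity alert can still be placed at the top of the list if the team gives it priority 1, which severity alone cannot express.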

Technical Gaps

Integration Points: Modern Technology Stack

⁃ Current State (System A): Uses a legacy Java code base
⁃ Current State (System B): Built on a modern web stack

Desired Future State: A technology stack that aligns with modern development practices and tooling
Gap: The current dashboard’s reliance on a legacy Java code base presents a technical gap in comparison to the desired future state. This may result in challenges related to updates, security and integration with modern tooling.

Integration Points: API Flexibility

⁃ Current State (System A): Provides a well-established API for integration with other tools.
⁃ Current State (System B): Has limited API flexibility.

Desired Future State: A flexible API architecture that allows seamless integration with various tools and workflows.
Gap: The current limitations in Argus' API flexibility might pose challenges in integrating it with existing and future tools within the network infrastructure.

Data Gaps

Data Flow: History of Alarms

⁃ Current State (System A): Data retention
⁃ Current State (System B): No data retention

Desired Future State: We would like Argus to have an additional table that stores the history of alarms for reporting.
Gap: Argus does not currently keep alarms in history. We would need all alarms to be stored indefinitely. This is not just useful but a requirement for reporting.
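
A minimal sketch of such a history table is shown below using Python's built-in `sqlite3` module. The table name `alarm_history`, its columns and the helper functions are assumptions for illustration; in Argus itself this would presumably be expressed as a Django model and migration rather than raw SQL.

```python
import sqlite3
from typing import List, Tuple


def open_history_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the hypothetical alarm_history table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS alarm_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            alarm_id TEXT NOT NULL,
            state TEXT NOT NULL,
            changed_at TEXT NOT NULL
        )
        """
    )
    return conn


def record_change(conn: sqlite3.Connection, alarm_id: str, state: str, changed_at: str) -> None:
    """Append one row per state change; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO alarm_history (alarm_id, state, changed_at) VALUES (?, ?, ?)",
        (alarm_id, state, changed_at),
    )
    conn.commit()


def history_for(conn: sqlite3.Connection, alarm_id: str) -> List[Tuple[str, str]]:
    """Return (state, changed_at) rows for one alarm in insertion order."""
    return conn.execute(
        "SELECT state, changed_at FROM alarm_history WHERE alarm_id = ? ORDER BY id",
        (alarm_id,),
    ).fetchall()
```

An append-only table of state changes keeps the full lifetime of each alarm available for availability and utilisation reports.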

Data Flow: Real-Time Monitoring

⁃ Current State (System A): Supports real-time monitoring and updates.
⁃ Current State (System B): Lacks real-time monitoring capabilities.

Desired Future State: Real-time monitoring features to ensure timely detection and response to critical network events.
Gap: The absence of real-time monitoring in Argus may result in delays in identifying and addressing urgent issues, impacting overall network responsiveness.

Recommendations

To bridge the gaps we would need to either fork the Argus project or engage with their developers about the possibility of ticketed development work. As stated, forking the Argus project would not be the desired outcome.
Validate this against our own schedule.
Break down all the tasks into granular, measurable pieces of work and deliverables that can be managed and developed in an Agile way.
Refine each task, creating acceptance criteria, so that we can be assured the task and deliverable are clearly understood by the development team.
Decide if ITIL naming conventions and standards are compatible with the existing OC workflow.
Respect that the Argus team would be building the tool to be multi-purpose, and identify where we might need to compromise or use plug-ins to meet internal requirements that they can’t fill.
Be involved in plugging the functional and technical gaps with some internal work for parts of the system that require integration, e.g. by building plug-ins for rendering the status of services (traffic lights).
Ensure close collaboration with the Argus development team and clear communication of expectations and requirements.
Have a plan for testing and deployment as well as develop the systems and architecture for automated delivery and deployment on our systems.

Action Plan

Meet with the Argus development team by [DATE] to define ways of working and establish what can and can’t be done
Create a list of questions on which they can give feedback (RFI)
Assess whether some development can be done in-house
Build prototypes using the plug-in architecture
Assess the upgrade plan for future releases of the tool
Do they envisage a forked version? How will we upgrade the ‘core’?
Consult with all internal engineers on the viability of outsourcing some of the development
Hold a follow-up conversation to discuss the Argus team’s responses, possibly including timings
tbc (Develop the plan further once we know we can collaborate)

Risks and Mitigations

The biggest risk is not meeting deadlines expected of us by consumers and users of the service (R)
Understanding what’s viable for delivery. (M)
Create a statement of work (SOW) that underlines these deliverables (M)
Underestimating the task if we don’t utilise Argus and take all the development in-house. (R)
We don’t have to retire the previous service until it has passed A/B testing (M)
The 3rd party development team may later decide they don’t have capacity to work with us on bugfixes, feature requests, enhancements, etc. (R)
We would have to fork off the development. Do we have the Django/React skills to take over the development internally? (M)
If the solution is to fork the current state of the Argus project, what are the expectations (from all sides) regarding alignment on future improvements? (R/M?)

Conclusion

A new and modern alert aggregation platform is required by the Geant NOC/SOC first and second line support teams: one that satisfies the needs of the consumers of the service and can also be maintained by the development team.

There is currently a backlog of feature requests coming from the NOC. We must recognise the importance of balancing user needs with development team maintainability.

Any conclusions must be reached by consensus within the internal development team and the work package leaders. It should be agreed that cooperation and sharing is a proactive and viable course of action, one that allows us to focus on other development requirements such as testing, automating deployments, architecture and integration with other services.
