This is a new category of article that falls under the "RARE software architecture" special blog series. As its name implies, it deals with topics related to RARE/freeRouter software and monitoring.

Requirements

  • Basic Linux/Unix knowledge
  • Service provider networking knowledge

Overview

In Greek mythology, Prometheus is a Titan credited with the creation of mankind and with stealing fire from the gods to give it to humans. In the RARE context, Prometheus is the software from the prometheus.io project. It became very popular in the IT industry as it is very simple to implement and configure while providing a great number of metrics without impacting application performance. It is heavily used in microservices environments such as Docker and Kubernetes. The mythological reference gives us an indication of how Prometheus operates: at a constant rate, the Prometheus metric collector (or server) steals metrics from the Prometheus agents. All the stolen metrics are then consolidated in a time series database, ready to be queried for proper visualization.

Before going further, allow me a brief digression by sharing a small anecdote that led to this ongoing work on network monitoring for RARE. As mentioned previously, our focus is to give the RARE/freeRouter solution the ability to be monitored in an operational environment. In that context, we started with the implementation of a lightweight SNMP stack that provided relevant results via SNMP tools like LibreNMS. This is great for organisations that do not want to invest time in anything but SNMP.

However, we felt a lack of flexibility due to SNMP's inherent structure, and we needed more versatile and instant monitoring capabilities. More importantly, the need arose to export an arbitrary number of metric types from the control plane in a more flexible way. How could metrics such as the number of IPv4/IPv6 routes, IPv4 BGP prefixes, IPv6 BGP prefixes, platform JVM memory, etc. be shared without too much hassle?

After some internal discussion, I just said: "I'm not a monitoring expert, but we have tools like ELK, PROMETHEUS and GRAFANA in the NMaaS catalog … Shouldn't we consider using these?"

The answer was: "Let's give it a try and fire up a Prometheus and a Grafana instance from the NMaaS platform!"

Some hacking at the control plane code level was initiated, and after a few hours the freeRouter lead developer came up with a solution and said: "Let me introduce you to the freeRouter Prometheus agent."

And thanks to the great support of the NMaaS team, within a few minutes and a few points and clicks (it took longer than expected, as I'm not good with GUIs), we were able to test this agent.

Why is this important, you might ask? Simply because, with Prometheus' simplicity and low resource overhead, we get full control plane metrics visibility!

As a side note, this is not a replacement for INT/Telemetry/NetFlow/IPFIX, which provide different types of data that are not at the same scale…
People working with INT/Telemetry/NetFlow/IPFIX talk about a "data lake" or "data deluge", which is accurate if you think about the complexity of solving a gigantic producer/consumer data problem. This requires the relevant IT infrastructure in order to process all of the data provided by these protocols at the NREN scale.

In our case, we are just focusing on exposing CONTROL PLANE METRICS at the network element level. We simply monitor and ensure router operation by using Prometheus metrics.

While the above may be true, the number of metrics exported from a Prometheus target can be very high. Fine tuning might be necessary in order to make sure that all exposed metrics are really needed for network monitoring purposes, as this explosion of metrics exposure can add unnecessary workload at the control plane level.

Again, kudos to the NMaaS team who made this happen so that we could test this on the P4 LAB with ZERO effort.

Article objective

In this article, we will present the freeRouter and Prometheus integration, and as an example we will implement one of the 22 Grafana dashboards that we developed and published here. In the rest of the article we will assume that you are running one or more freeRouter nodes.

Diagram

[ #001 ] - Cookbook

The first step is to implement a Prometheus server. Using NMaaS, this is pretty much instantaneous. However, if you plan to deploy Prometheus on another platform, just follow the installation guide here.

Once deployed, you can push the following prometheus.yaml config:

global:
  scrape_interval: 15s
  evaluation_interval: 30s
alerting:
  alertmanagers:
    - static_configs:
      - targets:
rule_files:
scrape_configs:
  - job_name: 'router'
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
    - targets: ['192.168.0.1:9001','192.168.0.2:9001']
      labels:

In this configuration, we assume that we have 2 freeRouters configured as above (192.168.0.1:9001 and 192.168.0.2:9001); in the Prometheus world these are called targets:

  • each target is interrogated, or "scraped", every "scrape_interval", which is 15s here
  • the main job is called "router"
  • metrics_path is "/metrics", so the scraped URL is "http://192.168.0.1:9001/metrics"

Note that this has to be deployed only once for all of your routers. However, each time you would like to add a new router, you have to add a new target to the "targets" YAML list.
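
For example, adding a third router (the 192.168.0.3 address below is purely illustrative) only requires extending that list and reloading or restarting the Prometheus server:

    static_configs:
    - targets: ['192.168.0.1:9001','192.168.0.2:9001','192.168.0.3:9001']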


In this example, let's focus on interface metrics. Please note that this configuration should be deployed on each freeRouter, and connectivity should be available between all targets and the Prometheus server.

  • The objective is to tell the freeRouter control plane to expose hardware and software interface counter metrics using the sensor object.
  • You have 2 types of sensors:
    • Universal sensor: a sensor definition that you can cut/paste anywhere
    • User-specific sensor: a sensor definition that you need to adjust depending on the freeRouter configuration implemented by the user
!-------------------------------------------------------------------------------
! Example of a universal sensor:
! It can be copy/pasted as is.
!-------------------------------------------------------------------------------
!
sensor ifaces-hw
path interfaces-hw/interface/counter
prefix freertr-ifaces
key name interfaces-hw/interface
command sho inter hwsumm
prepend iface_hw_byte_
name 0 ifc=
replace \. _
column 1 name st
column 1 replace admin -1
column 1 replace down 0
column 1 replace up 1
column 2 name tx
column 3 name rx
column 4 name dr
.
exit
!
!-------------------------------------------------------------------------------
! Example of a sensor you need to adjust:
! You need to adapt your BGP process number
! (here, replace 65535 with your BGP process number)
!-------------------------------------------------------------------------------
!
sensor bgp4peer
path bgp4/peer/peer
prefix freertr-bgp4peer
key name bgp4/peer
command sho ipv4 bgp 65535 summ
prepend bgp4_peer_
name 0 peer=
replace \. _
column 2 name state
column 2 replace false 0
column 2 replace true 1
column 3 name learn
column 4 name advert
.
exit
!

So this basically means:

  • From the freeRouter CLI, issue the following command:
sho inter hwsumm
interface   state  tx          rx          drop
hairpin41   up     67404       0           0
hairpin42   up     153134      0           0
sdn1        up     412319805   1057514903  1152305
sdn2        up     1038840147  407307558   202
sdn3        admin  0           0           0
sdn4        admin  0           0           0
sdn5        admin  0           0           0
sdn6        admin  0           0           0
sdn998      up     9154        0           0
sdn999      up     199178      262939      0
tunnel1965  up     0           9122896     0 
  • prepend to the metric name: "iface_hw_byte_"
  • column 0 will carry the Prometheus label ifc=
  • replace all dots "." with "_" (so interface bundle1.123 becomes bundle1_123)
  • column 1 defines a metric name: "iface_hw_byte_" concatenated with "st" => "iface_hw_byte_st", which is essentially the interface status
  • if the column 1 "state" value is admin/down/up, we associate the value -1/0/1
  • column 2 defines a metric name: "iface_hw_byte_" concatenated with "tx" => "iface_hw_byte_tx", which is essentially the interface transmitted bytes counter
  • column 3 defines a metric name: "iface_hw_byte_" concatenated with "rx" => "iface_hw_byte_rx", which is essentially the interface received bytes counter
  • column 4 defines a metric name: "iface_hw_byte_" concatenated with "dr" => "iface_hw_byte_dr", which is essentially the interface dropped bytes counter


  • Then you need to bind the configured sensors to the Prometheus server:
!-------------------------------------------------------------------------------
! Example of Prometheus agent configuration
! And sensor bindings
!-------------------------------------------------------------------------------
!
server prometheus pr
 sensor ifaces-hw
 sensor bgp4peer
 interface <prometheus_agent_interface_binding>
 vrf <prometheus_agent_vrf_binding>
 exit
!


If you followed this correctly, the same lines are then repeated for the software interface counter metrics.
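
As a sketch only, the software counter sensor mirrors the hardware one: same columns, but with the "iface_sw_byte_" prepend and the software counter show command. The sensor name, paths and show command below are assumptions by analogy; please refer to the published configurations mentioned in the next paragraph for the authoritative definition.

!-------------------------------------------------------------------------------
! Sketch of the software counter sensor
! (names, paths and the show command are assumptions; check the published configs)
!-------------------------------------------------------------------------------
!
sensor ifaces-sw
path interfaces-sw/interface/counter
prefix freertr-ifaces
key name interfaces-sw/interface
command <software counter equivalent of "sho inter hwsumm">
prepend iface_sw_byte_
name 0 ifc=
replace \. _
column 1 name st
column 1 replace admin -1
column 1 replace down 0
column 1 replace up 1
column 2 name tx
column 3 name rx
column 4 name dr
.
exit
!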

You can view the Prometheus configurations for the various Grafana dashboards here. Feel free to study these Prometheus configurations and activate them as you see fit, depending on your requirements. The set of dashboards is not exhaustive and is by no means absolute. Feel free to submit additional dashboards! We would gladly add them to the current list of freeRouter dashboards.


After this definition at the freeRouter level, you should have:

4 metrics related to hardware counters 

  • iface_hw_byte_st
  • iface_hw_byte_tx
  • iface_hw_byte_rx
  • iface_hw_byte_dr

4 metrics related to software counters

  • iface_sw_byte_st
  • iface_sw_byte_tx
  • iface_sw_byte_rx
  • iface_sw_byte_dr

That is a total of 8 metrics.
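
To make this concrete, here is the kind of output you could expect from the agent's "/metrics" URL for a single interface. The values are taken from the "sho inter hwsumm" output above purely for illustration, the exact formatting may differ, and labels such as "instance" and "job" are only added by Prometheus at scrape time:

iface_hw_byte_st{ifc="sdn1"} 1
iface_hw_byte_tx{ifc="sdn1"} 412319805
iface_hw_byte_rx{ifc="sdn1"} 1057514903
iface_hw_byte_dr{ifc="sdn1"} 1152305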


From that point, you can check via the Prometheus console:

check the "Targets" menu drop-down selection and verify that your routers are being scraped

You should then be able to use the PromQL query field in order to check that you can retrieve the metrics we defined above.
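
For instance, the following PromQL expressions (the "ifc" value is taken from the earlier example output) should each return one time series per interface and per target:

# interface status for every interface on every scraped router
iface_hw_byte_st

# transmitted bytes counter, restricted to one interface
iface_hw_byte_tx{ifc="sdn1"}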


For metric visualisation, we will use Grafana. Therefore:

  • install Grafana from the official web site.
  • once installed, configure Prometheus as a Grafana data source:
  • fill in all the Prometheus server information
  • check that the data source is defined correctly by clicking the "Save & test" button

At that point, your Grafana and Prometheus instances are correctly bound.
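
If you prefer configuration files over point and click, note that Grafana can also pick up the data source from a provisioning file. The sketch below assumes a standard Grafana installation; the file location and the Prometheus URL are examples to adapt to your setup:

# e.g. /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.example.org:9090    # your Prometheus server URL
    isDefault: true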

  • now you need to import the "RARE/freeRouter interface bytes" dashboard

  • download the freeRouter interface bytes dashboard here

  • import the dashboard via its ID, upload the downloaded JSON file, or paste it into the JSON panel
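
As an alternative to the GUI import (useful if, like me, you are not good with GUIs), the downloaded JSON can also be pushed through the Grafana HTTP API; the host, credentials and file name below are placeholders:

# wrap the dashboard JSON as expected by the /api/dashboards/db endpoint and push it
jq '{dashboard: del(.id), overwrite: true}' freerouter-interface-bytes.json \
  | curl -s -u admin:admin -H "Content-Type: application/json" \
         -X POST http://grafana.example.org:3000/api/dashboards/db -d @-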

And Voila ! 

In order to immediately see the graph, zoom in to a 5m period with a 5s refresh rate, and you should automagically see the interface bytes TX/RX on all interfaces for each target.
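
Under the hood, panels of this kind typically graph the per-second rate of the byte counters rather than the raw counters; the published dashboard may use slightly different expressions, but the idea is along these lines:

# received and transmitted throughput in bytes per second, averaged over 5 minutes
rate(iface_hw_byte_rx[5m])
rate(iface_hw_byte_tx[5m])

# multiply by 8 for bits per second
rate(iface_hw_byte_tx[5m]) * 8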

Discussion

This example related to interface metrics is universal, as the metrics are yielded at the freeRouter level through a generic CLI command.

However, some metrics cannot be retrieved through a generic command: they are tied to the specificities of your network, such as the AS number, the IGP process number, the VRF name, etc.

Let me give you a couple of examples.

The link-state IGP sensors below assume an "ospf 1" process, but in your network context you could just as well have deployed, say, "isis 2200" (2200 is the RENATER AS number), in which case the show commands need to be adjusted accordingly:

sensor lsigp4int
path lsigp4int/peer/peer
prefix freertr-lsigp4int
key name lsigp4int/peer
command sho ipv4 ospf 1 interface
prepend lsigp4_iface_
name 0 proto="ospf1",ifc=
replace \. _
column 1 name neighbors
.
exit

sensor lsigp4peer
path lsigp4peer/peer/peer
prefix freertr-lsigp4peer
key name lsigp4peer/peer
command sho ipv4 ospf 1 topology 0 | inc reach
prepend lsigp4_peers_
name 0 proto="ospf1",node=
replace \. _
column 2 name reachable
column 2 replace false 0
column 2 replace true 1
column 3 name neighbors
.
exit

sensor lsigp4perf
path lsigp4perf/peer/peer
prefix freertr-lsigp4perf
key name lsigp4perf/peer
command sho ipv4 ospf 1 spf 0 | inc reachable|fill|calc|run
prepend lsigp4_perf_
labels proto="ospf1"
skip 0
column 1 name val
.
exit

sensor lsigp6int
path lsigp6int/peer/peer
prefix freertr-lsigp6int
key name lsigp6int/peer
command sho ipv6 ospf 1 interface
prepend lsigp6_iface_
name 0 proto="ospf1",ifc=
replace \. _
column 1 name neighbors
.
exit

sensor lsigp6peer
path lsigp6peer/peer/peer
prefix freertr-lsigp6peer
key name lsigp6peer/peer
command sho ipv6 ospf 1 topology 0 | inc reach
prepend lsigp6_peers_
name 0 proto="ospf1",node=
replace \. _
replace \/ _
column 2 name reachable
column 2 replace false 0
column 2 replace true 1
column 3 name neighbors
.
exit

sensor lsigp6perf
path lsigp6perf/peer/peer
prefix freertr-lsigp6perf
key name lsigp6perf/peer
command sho ipv6 ospf 1 spf 0 | inc reachable|fill|calc|run
prepend lsigp6_perf_
labels proto="ospf1"
skip 0
column 1 name val
.
exit

sensor lsigp4metric
path lsigp4metric/peer/peer
prefix freertr-lsigp4metric
prepend lsigp4_metric_
command show ipv4 ospf 1 metric
name 0 proto="ospf1",ifc=
key name lsigp4metric/peer
replace \. _
column 4 name metric
.
exit

sensor lsigp6metric
path lsigp6metric/peer/peer
prefix freertr-lsigp6metric
prepend lsigp6_metric_
command show ipv6 ospf 1 metric
name 0 proto="ospf1",ifc=
key name lsigp6metric/peer
replace \. _
column 4 name metric
.
exit

The BGP sensors follow the same pattern: here, 65535 must be replaced with your own BGP process number.

sensor bgp4peer
path bgp4/peer/peer
prefix freertr-bgp4peer
key name bgp4/peer
command sho ipv4 bgp 65535 summ
prepend bgp4_peer_
name 0 peer=
replace \. _
column 2 name state
column 2 replace false 0
column 2 replace true 1
column 3 name learn
column 4 name advert
.
exit

sensor bgp4perf
path bgp4/perf/perf
prefix freertr-bgp4perf
key name bgp4/perf
command sho ipv4 bgp 65535 best | exc last
prepend bgp4_perf_
replace \s _
column 1 name val
.
exit

sensor bgp6peer
path bgp6/peer/peer
prefix freertr-bgp6peer
key name bgp6/peer
command sho ipv6 bgp 65535 summ
prepend bgp6_peer_
name 0 peer=
replace \: _
column 2 name state
column 2 replace false 0
column 2 replace true 1
column 3 name learn
column 4 name advert
.
exit

sensor bgp6perf
path bgp6/perf/perf
prefix freertr-bgp6perf
key name bgp6/perf
command sho ipv6 bgp 65535 best | exc last
prepend bgp6_perf_
replace \s _
column 1 name val
.
exit

Finally, the LDP sensors below assume a VRF named "inet"; adjust the show commands if your VRF is named differently.

sensor ldp4nul
path ldp4nul/peer/peer
prefix freertr-ldp4nul
key name ldp4nul/peer
command sho ipv4 ldp inet nulled-summary
prepend ldp4null_
name 3 ip=
skip 2
replace \. _
column 0 name prefix_learn
column 1 name prefix_advert
column 2 name prefix_nulled
.
exit

sensor ldp6nul
path ldp6nul/peer/peer
prefix freertr-ldp6nul
key name ldp6nul/peer
command sho ipv6 ldp inet nulled-summary
prepend ldp6null_
name 3 ip=
skip 2
replace \: _
column 0 name prefix_learn
column 1 name prefix_advert
column 2 name prefix_nulled
.
exit



Conclusion

In this 1st article, you were presented with the freeRouter/Prometheus integration, using interface metrics and a Grafana dashboard as a concrete example.

In the Prometheus philosophy, the user should normally do only a minimum of configuration tweaking: ultimately, they should only have to enable a metric, or disable it if the scrape cost is too high. However, in the freeRouter/Prometheus integration process, you can see that some metrics are issued using specific $variables (VRF, BGP/IGP process number, ...), which makes it impossible to maintain this universality. From the network operator's point of view, however, this should not be a showstopper. On the contrary, being able to alter these commands via $variables is a powerful choice.

Remember that in the freeRouter philosophy you can have multiple VRFs, multiple IGPs and multiple BGP processes! (Which is not the case for all routing platforms.)

Last but not least, this Prometheus agent was developed quickly for one reason: all the objects at the control plane level were already well structured in table form, as previously described in this article. Implementing this table row/column logic in order to derive Prometheus metrics was therefore technically possible without too much hassle.