Date/time (UTC)

Action

Actor

09:17

Ashley Brown noticed that the most recent Inventory Provider update had failed and asked about it on the dashboard users Slack channel. Erik Reid investigated.

Erik Reid 

09:31

Erik Reid noticed the following error in the production Inventory Provider logs and asked Sam Roberts on Slack to investigate:

Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: 2023-11-17 09:14:12,473 - inventory_provider.tasks.worker (605) - ERROR - unhandled exception loading srx2.ch.office.geant.net info: ncclient timed out while waiting for an rpc reply.
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: Traceback (most recent call last):
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/inventory_provider/tasks/worker.py", line 576, in reload_router_config_chorded
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: hostname, update_callback=self.log_warning)
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/inventory_provider/tasks/worker.py", line 650, in retrieve_and_persist_interface_info
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: interface_info_str = juniper.get_interface_info_for_router(hostname, InventoryTask.config["ssh"])
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/inventory_provider/juniper.py", line 495, in get_interface_info_for_router
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: reply = _raw_rpc(router, etree.Element('get-interface-information'))
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/inventory_provider/juniper.py", line 151, in _raw_rpc
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: obj = router.rpc(command)
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/ncclient/manager.py", line 251, in execute
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: huge_tree=self._huge_tree).request(*args, **kwds)
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/ncclient/operations/third_party/juniper/rpc.py", line 52, in request
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: return self._request(rpc)
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/ncclient/operations/rpc.py", line 381, in _request
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: raise TimeoutExpiredError('ncclient timed out while waiting for an rpc reply.')
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: ncclient.operations.errors.TimeoutExpiredError: ncclient timed out while waiting for an rpc reply.


Erik Reid 

09:41

Bjarke Madsen asked on Slack if anyone had information about Sensu check failure notifications.  Erik Reid shared the critical error info:

2023-11-17 09:29:05,546 - brian_monitoring_checks.inventory - DEBUG - using inventory base api url: http://localhost:8080
2023-11-17 09:29:05,560 - brian_monitoring_checks.influx - DEBUG - select count(egress), count(ingress), count(errorsOut), count(errorsIn), count(egressv6), count(ingressv6) from interface_rates where time > now() - 20m group by hostname, interface_name
2023-11-17 09:29:05,832 - brian_monitoring_checks.influx - DEBUG - closing influx session
2023-11-17 09:29:09,736 - brian_monitoring_checks.check_counters - ERROR - no ingress/egress data in the last 20m for these routers: {'rt1.sof.bg.geant.net', 'mx1.bud.hu.geant.net', 'rt1.rig.lv.geant.net', 'qfx.fra.de.geant.net', 'mx1.dub2.ie.geant.net', 'rt1.buc.ro.geant.net', 'rt1.bra.sk.geant.net', 'rt1.kau.lt.geant.net', 'rt1.mil2.it.geant.net', 'srx1.ch.office.geant.net', 'rt2.rig.lv.geant.net', 'rt2.chi.md.geant.net', 'rt1.chi.md.geant.net', 'rt2.tar.ee.geant.net', 'rt2.bra.sk.geant.net', 'mx1.par.fr.geant.net', 'rt1.cor.ie.geant.net', 'qfx.lon2.uk.geant.net', 'mx1.fra.de.geant.net', 'rt1.mar.fr.geant.net', 'mx1.poz.pl.geant.net', 'qfx.par.fr.geant.net', 'rt2.cor.ie.geant.net', 'rt1.lju.si.geant.net', 'mx1.dub.ie.geant.net', 'mx2.ath.gr.geant.net', 'srx1.am.office.geant.net', 'mx2.lis.pt.geant.net', 'mx2.zag.hr.geant.net', 'rt2.kau.lt.geant.net', 'mx1.lon.uk.geant.net', 'rt2.kie.ua.geant.net', 'mx1.gen.ch.geant.net', 'rt1.ams.nl.geant.net', 'rt1.pra.cz.geant.net', 'mx1.ams.nl.geant.net', 'srx2.ch.office.geant.net', 'srx2.am.office.geant.net', 'rt1.fra.de.geant.net', 'rt2.ams.nl.geant.net', 'mx1.buc.ro.geant.net', 'mx1.ath2.gr.geant.net', 'mx1.ham.de.geant.net', 'rt1.por.pt.geant.net', 'mx1.mad.es.geant.net', 'mx1.vie.at.geant.net', 'rt1.kie.ua.geant.net', 'mx1.sof.bg.geant.net', 'rt2.bru.be.geant.net', 'rt1.bil.es.geant.net', 'rt1.tar.ee.geant.net', 'rt1.ham.de.geant.net', 'rt1.bru.be.geant.net', 'mx1.lon2.uk.geant.net'}
2023-11-17 09:29:09,740 - brian_monitoring_checks.check_counters - DEBUG - check returned with status: 2


Bjarke Madsen 

09:44

Bjarke Madsen noticed that the Kapacitor spike removal process was failing because the Inventory Provider /poller/speeds API was returning errors:

Sensu has detected a problem with this host.

Notification Type: PROBLEM
Host: prod-service-proxy-monitoring02.geant.org
State: DOWN
Check: brian-kapacitor-tasks
Occurrences: 1
Date/Time: 2023-11-16 20:20:48
Info: task "remove_spikes_gwsd_rates" has been restarted (was not executing)
task "interface_rates" is executing
task "multicast_rates" is executing
task "remove_spikes_dscp32_rates" has been restarted (was not executing)
task "gwsd_rates" is executing
task "remove_spikes_multicast_rates" has been restarted (was not executing)
task "service_enrichment" is executing
task "service_enrichment_lambda" is executing
task "remove_spikes_interface_rates" has been restarted (was not executing)
task "dscp32_rates" is executing



09:50

The Inventory Provider update that occurred on (TT#2023111334002463) included the code changes that were failing.  It was decided to roll this back.

09:58

The Inventory Provider was rolled back in production and the data processing pipeline functionality was restored.

10:13

The team decided there were 2 issues:

  • the ncclient connection error causing the Inventory Provider update to fail: this was a transient error, caused by network connection issues (this is considered a part of DBOARD3-822)
  • the data processing pipeline failure caused by the /poller/speeds API errors
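
One common way to tolerate a transient error like the first issue above is to retry the call a few times before giving up. The following is only a minimal sketch under stated assumptions, not the Inventory Provider's implementation: `TransientError` is a hypothetical stand-in for an exception such as ncclient's `TimeoutExpiredError`, and `call_with_retries` and its parameters are illustrative names.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a transient failure (e.g. an RPC timeout)."""


def call_with_retries(func, attempts=3, delay_seconds=1.0,
                      retry_on=(TransientError,), sleep=time.sleep):
    """Call func(), retrying on the given exception types.

    Waits delay_seconds * attempt_number between attempts and re-raises
    the last error once all attempts are used up.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds * attempt)
    raise last_error
```

A caller would wrap the router query in a zero-argument function and pass it in. Whether retrying is appropriate depends on how long a failed attempt takes, since each retry extends the overall update.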

10:28

Sam Roberts found that the failure occurred when the /poller/speeds processor computed the aggregate speed for ae6 on mx2.zag.hr.geant.net.

12:17

Sam Roberts found that computing the aggregate speed for mx2.zag.hr.geant.net/ae6 failed because the Inventory cache data included et-4/0/1, et-4/0/2 and et-4/0/2.0. A logical interface in the member list was unexpected, and the processing failed when parsing its name.

Sam Roberts heard from Robert Latta that the OC had been testing on this interface, but the details weren't clear.
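
The parsing failure described above can be illustrated with a sketch that skips logical interfaces when summing the member speeds of an aggregate. This is not the code from the Inventory Provider or from the fix. The function names, the name pattern and the speed values are assumptions made for illustration, based only on the interface names listed in the timeline.

```python
import re

# Assumed naming pattern: a physical interface looks like "et-4/0/1",
# while a logical interface carries a unit suffix, as in "et-4/0/2.0".
PHYSICAL_NAME = re.compile(r"^[a-z]+-\d+/\d+/\d+$")


def is_physical(name):
    """Return True if the name has no logical unit suffix."""
    return PHYSICAL_NAME.match(name) is not None


def aggregate_speed(members, speeds):
    """Sum the speeds of the physical members, ignoring logical ones."""
    return sum(speeds[name] for name in members if is_physical(name))


# Member list as reported for ae6; the speeds (in Gbit/s) are hypothetical.
members = ["et-4/0/1", "et-4/0/2", "et-4/0/2.0"]
speeds = {"et-4/0/1": 100, "et-4/0/2": 100}
total = aggregate_speed(members, speeds)
```

Filtering before the lookup means an unexpected logical member no longer raises an error. A stricter variant could log each skipped name so that unusual configurations remain visible.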

12:30

Bjarke Madsen attempted to restore GWS Direct rates for the outage timespan, but an error in a command caused data beyond the outage duration to be modified, rendering the data temporarily unavailable.


15:02

Sam Roberts prepared an MR for Inventory Provider to fix both of the issues above.


16:03

Ashley Brown explained to Robert Latta and Erik Reid the test configuration that was enabled on mx2.zag.hr.geant.net. The details are described in DBOARD3-833.


21 15:30

Bjarke Madsen restored availability of GWS Direct rates and copied the data missing for the outage duration over from UAT.


13:30

Bjarke Madsen restored (interface/scid) rates by copying from UAT to production.

