
Incident


  • Update the Digital Services Managers about the incident in the Slack channel
  • Create a channel in Slack with the name of the incident

1. Summary 

Write a summary of the incident in a few sentences. Include what happened, why, the severity of the incident, and how long the impact lasted. 

 

EXAMPLE: 

Between {time range of incident, e.g., 15:45 and 16:35} on {DATE}, {NUMBER} users encountered {EVENT SYMPTOMS}.  

The event was triggered by a {CHANGE} at {TIME OF CHANGE THAT CAUSED THE EVENT}.  

The {CHANGE} contained {DESCRIPTION OF OR REASON FOR THE CHANGE}, such as a change in code to update a system.  

A bug in this code caused {DESCRIPTION OF THE PROBLEM}.  

The event was detected by {MONITORING SYSTEM}. The team started working on the event by {RESOLUTION ACTIONS TAKEN}. 

This {SEVERITY LEVEL} incident affected {X%} of users. 

There was further impact, as noted by the {e.g., NUMBER OF SUPPORT TICKETS SUBMITTED, SOCIAL MEDIA MENTIONS, CALLS TO ACCOUNT MANAGERS} raised in relation to this incident.  

 

2. Leadup 

Describe the sequence of events that led to the incident, for example, previous changes that introduced bugs that had not yet been detected. 

 

EXAMPLE: 

At {16:00} on {MM/DD/YY}, ({AMOUNT OF TIME BEFORE CUSTOMER IMPACT, e.g. 10 days before the incident in question}), a change was introduced to {PRODUCT OR SERVICE}  in order to {THE CHANGES THAT LED TO THE INCIDENT}.  

This change resulted in {DESCRIPTION OF THE IMPACT OF THE CHANGE}. 

 

3. Fault 

Describe how the change that was implemented didn't work as expected. If available, attach screenshots of relevant data visualisations that illustrate the fault. 

4. Impact 

Describe how the incident impacted internal and external users during the incident. Include how many support cases were raised. 

EXAMPLE: 

For {XX hrs XX minutes} between {XX:XX UTC and XX:XX UTC} on {MM/DD/YY}, our users experienced this incident: {SUMMARY OF INCIDENT}. 

This incident affected {XX} customers (X% OF {SYSTEM OR SERVICE} USERS), who experienced {DESCRIPTION OF SYMPTOMS}. 

{XX NUMBER OF SUPPORT TICKETS AND XX NUMBER OF SOCIAL MEDIA POSTS} were submitted. 

5. Detection 

When did the team detect the incident? How did they know it was happening? How could we improve time-to-detection? Consider: How would we have cut that time by half? 

EXAMPLE: 

This incident was detected when the {ALERT TYPE} was triggered and {TEAM/PERSON} were paged.  

Next, {SECONDARY PERSON} was paged, because {FIRST PERSON} didn't own the service writing to the disk, delaying the response by {XX MINUTES/HOURS}. 

{DESCRIBE THE IMPROVEMENT} will be set up by {TEAM OWNER OF THE IMPROVEMENT} so that {EXPECTED IMPROVEMENT}. 

6. Response 

Who responded to the incident? When did they respond, and what did they do? Note any delays or obstacles to responding. 

EXAMPLE: 

After receiving a page at {XX:XX UTC}, {ON-CALL ENGINEER} came online at {XX:XX UTC} in {SYSTEM WHERE INCIDENT INFO IS CAPTURED}.  

This engineer did not have a background in the {AFFECTED SYSTEM}, so a second alert was sent at {XX:XX UTC} to the {ESCALATIONS ON-CALL ENGINEER}, who came into the room at {XX:XX UTC}. 

7. Recovery 

Describe how the service was restored and the incident was deemed over. Detail how the service was successfully restored, and how you knew what steps you needed to take to get to recovery.  

Depending on the scenario, consider these questions: How could you improve time to mitigation? How could you have cut that time by half? 

 

EXAMPLE: 

We used a three-pronged approach to the recovery of the system:  

 

{DESCRIBE THE ACTION THAT MITIGATED THE ISSUE, WHY IT WAS TAKEN, AND THE OUTCOME}  

 

  • Increasing the size of the BuildEng EC2 ASG to increase the number of nodes available to support the workload and reduce the likelihood of scheduling on oversubscribed nodes
  • Disabling the Escalator autoscaler to prevent the cluster from aggressively scaling down
  • Reverting the Build Engineering scheduler to the previous version

8. Timeline 

Detail the incident timeline.  

Include any notable lead-up events, any starts of activity, the first known impact, and escalations. Note any decisions or changes made, and when the incident ended, along with any post-impact events of note. 

Summary 

Polled BRIAN data failed to be saved from approximately 20:00 UTC on Thursday evening until service was restored at approximately 10:30 UTC on Friday morning.

This is a low/medium severity incident since 1) the system was still online and user interaction was not affected, and 2) the data loss was temporary.

Leadup 

On Wednesday there was a scheduled update of the Inventory Provider (cf. TT#2023111334002463).

On Thursday the NOC changed the configuration of et-4/0/2 on mx2.zag.hr.geant.net for testing.  This configuration change triggered a previously unknown issue in the Inventory Provider's netconf configuration parsing, which caused a fatal error in the BRIAN client consuming the new Inventory Provider api endpoint and, in turn, caused the data processing pipeline to fail.

Fault 

The Inventory Provider update that was deployed on Wednesday included code changes which used the list of bundled LAG member interfaces to compute aggregate speeds for the /poller/speeds api.

On Thursday the NOC disabled LAG aggregation on et-4/0/2 on mx2.zag.hr.geant.net and created a logical VLAN unit, et-4/0/2.0.  The netconf configuration parsing logic of the Inventory Provider only handled interface-level active/inactive flags.  But the NOC testing configuration 1) left the interface enabled and only disabled LAG aggregation, and 2) created a logical VLAN unit on the physical interface.

The behavior of the Inventory Provider is to set the LAG interface of logical interfaces to that of the physical interface, as long as both are active.  This has always been the behavior, as it makes sense for the use case of alarm correlation.  But in the NOC configuration only the LAG configuration was marked as inactive, not the full interface, and the Inventory Provider did not contain logic to handle this case.

The new Inventory Provider changes to the /poller/speeds api failed when a logical unit was present in the list of bundled LAG physical interfaces, and this api is required for the BRIAN spike removal processor to work.  When the spike removal filter failed, the BRIAN data processing pipeline stopped processing new data.
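To make the failure mode concrete, here is a minimal sketch of the kind of aggregate-speed calculation involved. This is hypothetical illustration code, not the actual Inventory Provider /poller/speeds implementation; the interface names come from this incident, but the function names and speed values are assumptions.

import re

# Physical interface names look like "et-4/0/2"; logical units append ".<unit>".
PHYSICAL_IFC = re.compile(r'^(?P<iftype>[a-z]+)-\d+/\d+/\d+$')

SPEED_MBPS = {'et': 100_000, 'xe': 10_000, 'ge': 1_000}  # illustrative values


def member_speed(name):
    match = PHYSICAL_IFC.match(name)
    if match is None:
        # A logical unit such as "et-4/0/2.0" does not match and fails here.
        raise ValueError('unexpected interface name: %s' % name)
    return SPEED_MBPS[match.group('iftype')]


def aggregate_speed(lag_members):
    return sum(member_speed(name) for name in lag_members)


# Before the NOC test change, ae6 on mx2.zag.hr.geant.net listed only
# physical members and the calculation succeeded:
print(aggregate_speed(['et-4/0/1', 'et-4/0/2']))

# After the change, the cached member list also contained the logical unit
# et-4/0/2.0, and the calculation failed:
try:
    aggregate_speed(['et-4/0/1', 'et-4/0/2', 'et-4/0/2.0'])
except ValueError as exc:
    print(exc)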



Impact 

BRIAN failed to save polled data from approximately 20:00 UTC on Thursday evening until service was restored at approximately 10:30 UTC on Friday morning.

The UAT environment continued to properly save data during this time, so the lost production environment data can be recovered from UAT.  So far no users have informed us that they have noticed any issues.  Therefore the impact is relatively small, and also temporary.


Detection 

At the start of the workday on Friday morning, Bjarke Madsen and Erik Reid were alerted by email notifications from Sensu about BRIAN-specific check failures.

Response 

Bjarke Madsen asked for information about the Sensu email notifications he was receiving.  Erik Reid noticed that the notifications indicated no interface rate information was being stored in Influx, and Bjarke Madsen manually confirmed this was the case.  Robert Latta noticed errors in the logs of the UAT and Test Inventory Providers.  Bjarke Madsen made the connection between this and the UDF spike removal Kapacitor process, which had crashed and caused the data processing pipeline to stop processing data.


 

Recovery

Data collection was restored by Bjarke Madsen rolling back the recent Inventory Provider update in the production environment (from version 0.111 back to 0.110).

Bjarke Madsen confirmed that the UAT data processing pipeline had fortunately not failed for the affected time period and the data could therefore be copied from there into the production system.  This was done on <TBD>.
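For reference, below is a minimal sketch of one way the affected window could be copied from the UAT InfluxDB into production using the influxdb Python client. The measurement name (interface_rates) and the approximate time window come from this incident; the hostnames, database name and batching are assumptions, and this is not necessarily the procedure that was actually used.

from influxdb import InfluxDBClient

# Hypothetical hosts and database name; adjust to the real environment.
uat = InfluxDBClient(host='uat-influx.example.geant.org', database='brian')
prod = InfluxDBClient(host='prod-influx.example.geant.org', database='brian')

# Affected window: approximately 20:00 UTC Thursday to 10:30 UTC Friday.
query = ("select * from interface_rates "
         "where time >= '2023-11-16T20:00:00Z' and time <= '2023-11-17T10:30:00Z' "
         "group by hostname, interface_name")

points = []
for (measurement, tags), rows in uat.query(query).items():
    for row in rows:
        timestamp = row.pop('time')
        # Simplification: treat every non-null returned column as a field.
        fields = {k: v for k, v in row.items() if v is not None}
        points.append({
            'measurement': measurement,
            'tags': tags,
            'time': timestamp,
            'fields': fields,
        })

prod.write_points(points, batch_size=5000)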


Timeline 


Date/time (UTC)

Action

Actor

09:17

Ashley Brown noticed that the most recent Inventory Provider update had failed and asked about it on the dashboard users Slack channel.  Erik Reid investigated.

Erik Reid 

09:31

Erik Reid noticed the following error in the production Inventory Provider logs and asked Sam Roberts on Slack to investigate:

Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: 2023-11-17 09:14:12,473 - inventory_provider.tasks.worker (605) - ERROR - unhandled exception loading srx2.ch.office.geant.net info: ncclient timed out while waiting for an rpc reply.
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: Traceback (most recent call last):
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/inventory_provider/tasks/worker.py", line 576, in reload_router_config_chorded
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: hostname, update_callback=self.log_warning)
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/inventory_provider/tasks/worker.py", line 650, in retrieve_and_persist_interface_info
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: interface_info_str = juniper.get_interface_info_for_router(hostname, InventoryTask.config["ssh"])
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/inventory_provider/juniper.py", line 495, in get_interface_info_for_router
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: reply = _raw_rpc(router, etree.Element('get-interface-information'))
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/inventory_provider/juniper.py", line 151, in _raw_rpc
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: obj = router.rpc(command)
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/ncclient/manager.py", line 251, in execute
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: huge_tree=self._huge_tree).request(*args, **kwds)
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/ncclient/operations/third_party/juniper/rpc.py", line 52, in request
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: return self._request(rpc)
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: File "/home/inventory/venv/lib64/python3.6/site-packages/ncclient/operations/rpc.py", line 381, in _request
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: raise TimeoutExpiredError('ncclient timed out while waiting for an rpc reply.')
Nov 17 09:14:12 prod-inventory-provider01 celery[4072]: ncclient.operations.errors.TimeoutExpiredError: ncclient timed out while waiting for an rpc reply.


Erik Reid 

09:41

Bjarke Madsen asked on Slack if anyone had information about Sensu check failure notifications.  Erik Reid shared the critical error info:

2023-11-17 09:29:05,546 - brian_monitoring_checks.inventory - DEBUG - using inventory base api url: http://localhost:8080
2023-11-17 09:29:05,560 - brian_monitoring_checks.influx - DEBUG - select count(egress), count(ingress), count(errorsOut), count(errorsIn), count(egressv6), count(ingressv6) from interface_rates where time > now() - 20m group by hostname, interface_name
2023-11-17 09:29:05,832 - brian_monitoring_checks.influx - DEBUG - closing influx session
2023-11-17 09:29:09,736 - brian_monitoring_checks.check_counters - ERROR - no ingress/egress data in the last 20m for these routers: {'rt1.sof.bg.geant.net', 'mx1.bud.hu.geant.net', 'rt1.rig.lv.geant.net', 'qfx.fra.de.geant.net', 'mx1.dub2.ie.geant.net', 'rt1.buc.ro.geant.net', 'rt1.bra.sk.geant.net', 'rt1.kau.lt.geant.net', 'rt1.mil2.it.geant.net', 'srx1.ch.office.geant.net', 'rt2.rig.lv.geant.net', 'rt2.chi.md.geant.net', 'rt1.chi.md.geant.net', 'rt2.tar.ee.geant.net', 'rt2.bra.sk.geant.net', 'mx1.par.fr.geant.net', 'rt1.cor.ie.geant.net', 'qfx.lon2.uk.geant.net', 'mx1.fra.de.geant.net', 'rt1.mar.fr.geant.net', 'mx1.poz.pl.geant.net', 'qfx.par.fr.geant.net', 'rt2.cor.ie.geant.net', 'rt1.lju.si.geant.net', 'mx1.dub.ie.geant.net', 'mx2.ath.gr.geant.net', 'srx1.am.office.geant.net', 'mx2.lis.pt.geant.net', 'mx2.zag.hr.geant.net', 'rt2.kau.lt.geant.net', 'mx1.lon.uk.geant.net', 'rt2.kie.ua.geant.net', 'mx1.gen.ch.geant.net', 'rt1.ams.nl.geant.net', 'rt1.pra.cz.geant.net', 'mx1.ams.nl.geant.net', 'srx2.ch.office.geant.net', 'srx2.am.office.geant.net', 'rt1.fra.de.geant.net', 'rt2.ams.nl.geant.net', 'mx1.buc.ro.geant.net', 'mx1.ath2.gr.geant.net', 'mx1.ham.de.geant.net', 'rt1.por.pt.geant.net', 'mx1.mad.es.geant.net', 'mx1.vie.at.geant.net', 'rt1.kie.ua.geant.net', 'mx1.sof.bg.geant.net', 'rt2.bru.be.geant.net', 'rt1.bil.es.geant.net', 'rt1.tar.ee.geant.net', 'rt1.ham.de.geant.net', 'rt1.bru.be.geant.net', 'mx1.lon2.uk.geant.net'}
2023-11-17 09:29:09,740 - brian_monitoring_checks.check_counters - DEBUG - check returned with status: 2


Bjarke Madsen 

09:44

Bjarke Madsen noticed that the Kapacitor spike removal process was failing because the Inventory Provider /poller/speeds api was returning errors:

Sensu has detected a problem with this host.

Notification Type: PROBLEM
Host: prod-service-proxy-monitoring02.geant.org
State: DOWN
Check: brian-kapacitor-tasks
Occurrences: 1
Date/Time: 2023-11-16 20:20:48
Info: task "remove_spikes_gwsd_rates" has been restarted (was not executing)
task "interface_rates" is executing
task "multicast_rates" is executing
task "remove_spikes_dscp32_rates" has been restarted (was not executing)
task "gwsd_rates" is executing
task "remove_spikes_multicast_rates" has been restarted (was not executing)
task "service_enrichment" is executing
task "service_enrichment_lambda" is executing
task "remove_spikes_interface_rates" has been restarted (was not executing)
task "dscp32_rates" is executing



09:50

The Inventory Provider update that occurred on Wednesday (TT#2023111334002463) included the code changes that were failing.  It was decided to roll this back.

09:58

The Inventory Provider was rolled back in production and the data processing pipeline functionality was restored.

10:13

The team decided there were two issues:

  • the ncclient connection error causing the Inventory Provider update to fail: this was a transient error caused by network connection issues (this is considered part of DBOARD3-822)
  • the data processing pipeline failure caused by errors from the /poller/speeds api

10:28

Sam Roberts found that the failure was being caused when the /poller/speeds processor computed the aggregate speed for ae6 on mx2.zag.hr.geant.net.

12:17

Sam Roberts found that the failure in computing the aggregate speed for mx2.zag.hr.geant.net/ae6 occurred because the Inventory cache data included et-4/0/1, et-4/0/2 and et-4/0/2.0.  A logical interface in the list was unexpected, and processing failed when parsing this name.

Sam Roberts heard from Robert Latta that the NOC had been testing on this interface, but the details weren't clear.

15:02

Sam Roberts prepared an MR for the Inventory Provider to fix both of the issues above.


16:03

Ashley Brown explained to Robert Latta and Erik Reid the test configuration that was enabled on mx2.zag.hr.geant.net.  The details are described in DBOARD3-833.











Root Cause Identification 


The Five Whys is a root cause identification technique. Here's how you can use it (a worked example for this incident follows the list): 

  • Begin with a description of the impact and ask why it occurred.  
  • Note the impact that it had.   
  • Ask why this happened, and why it had the resulting impact.  
  • Then, continue asking “why” until you arrive at a root cause. 
  • List the "whys" in your post-mortem documentation. 
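
Applied to this incident, the chain of whys (drawn from the sections above) looks like this: 

  • Why did BRIAN stop saving polled data? The data processing pipeline stopped processing new data. 
  • Why did the pipeline stop? The Kapacitor spike removal processor crashed. 
  • Why did the spike removal processor crash? The Inventory Provider /poller/speeds api was returning errors. 
  • Why was /poller/speeds returning errors? The new aggregate-speed code could not handle a logical unit (et-4/0/2.0) appearing in the list of bundled LAG physical interfaces. 
  • Why was a logical unit in that list? The NOC test configuration deactivated only the LAG configuration of et-4/0/2 and added a logical VLAN unit, a case the Inventory Provider's netconf parsing had not previously needed to handle. 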


Root Cause 

Note the final root cause of the incident: the thing identified that needs to change in order to prevent this class of incident from happening again. 

EXAMPLE: 

A bug in connection pool handling led to leaked connections under failure conditions, combined with lack of visibility into connection state. 

A specific unexpected and unhandled configuration change coincided with a piece of new code that couldn't handle the output of that change.

Backlog Check 

Review your engineering backlog to find out whether there was any unplanned work there that could have prevented this incident, or at least reduced its impact. 

A clear-eyed assessment of the backlog can shed light on past decisions around priority and risk.

Recurrence 

Now that you know the root cause, can you look back and see any other incidents that could have the same root cause? If yes, note what mitigation was attempted in those incidents and ask why this incident occurred again.

Lessons Learned 

Discuss what went well in the incident response, what could have been improved, and where there are opportunities for improvement. 

Corrective Actions 

Describe the corrective action ordered to prevent this class of incident in the future. Note who is responsible, when they must complete the work, and where that work is being tracked.