Polled BRIAN data was not saved from approximately 20:00 UTC until service was restored at approximately 10:30 UTC.
This is a low/medium severity incident since 1) the system remained online and user interaction was not affected, and 2) the data loss was temporary and recoverable.
On Wednesday there was a scheduled update of the Inventory Provider (cf. TT#2023111334002463).
On Thursday the NOC changed the configuration of et-4/0/2 on mx2.zag.hr.geant.net for testing. This configuration change triggered a previously unknown issue in the Inventory Provider netconf configuration parsing, which caused a fatal error in the BRIAN client consuming the new Inventory Provider api endpoint and stopped the data processing pipeline.
The Inventory Provider update that was deployed on Wednesday included code changes which used the list of physical interfaces bundled into each LAG to compute aggregate speeds for the new /poller/speeds api endpoint.
On Thursday the NOC disabled LAG aggregation on et-4/0/2 on mx2.zag.hr.geant.net and created a logical VLAN unit et-4/0/2.0. The netconf configuration parsing logic of the Inventory Provider only handled interface-level active/inactive flags, but the NOC testing configuration 1) left the interface enabled and only disabled LAG aggregation, and 2) created a logical vlan unit on the physical interface.
The Inventory Provider sets the LAG of a logical interface to that of its physical parent, as long as both are active. This has always been the behavior, as it makes sense for the alarm correlation use case. In the NOC configuration, however, only the LAG configuration was marked as inactive, not the full interface, and the Inventory Provider did not contain logic to handle that case.
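To make the gap concrete, here is a minimal, hypothetical sketch of this inheritance logic; it is not the actual Inventory Provider code, and the data structures and field names ("active", "lag", "lag_active") are assumptions for illustration only.

```python
# Hypothetical sketch, not the actual Inventory Provider code.

def assign_lag_to_logical_units(interfaces: dict) -> None:
    """Copy the parent's LAG onto each logical unit (e.g. et-4/0/2.0)."""
    for name, ifc in interfaces.items():
        if "." not in name:
            continue  # only logical units carry a ".<unit>" suffix
        parent = interfaces.get(name.split(".")[0])
        if parent is None:
            continue
        # Existing behaviour: inherit the LAG when both interfaces are active.
        if ifc["active"] and parent["active"]:
            ifc["lag"] = parent["lag"]
            # Missing case (the NOC test config): the interface stays active
            # while its LAG membership is deactivated.  Without a check such
            # as the commented one below, et-4/0/2.0 stays attached to ae6
            # and later appears in ae6's bundle member list.
            # if not parent.get("lag_active", True):
            #     ifc["lag"] = None


interfaces = {
    "et-4/0/2": {"active": True, "lag": "ae6", "lag_active": False},
    "et-4/0/2.0": {"active": True, "lag": None},
}
assign_lag_to_logical_units(interfaces)
print(interfaces["et-4/0/2.0"]["lag"])  # -> ae6, despite the inactive LAG config
```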
The new Inventory Provider changes to the /poller/speeds api failed when a logical unit was present in the list of bundled LAG physical interfaces, and this api is required for the BRIAN spike removal processor to work. When the spike removal filter failed, the BRIAN data processing pipeline stopped processing new data.
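A similarly hypothetical sketch of this second failure mode: an aggregate-speed computation that parses physical interface names and fails as soon as a logical unit such as et-4/0/2.0 appears in the bundle list. The regex and the prefix-to-speed mapping are illustrative assumptions, not the real /poller/speeds implementation.

```python
import re

# Hypothetical sketch, not the real /poller/speeds code; speeds in Mbit/s.
SPEED_BY_PREFIX = {"et": 100_000, "xe": 10_000, "ge": 1_000}
PHYSICAL_IFC = re.compile(r"^(et|xe|ge)-\d+/\d+/\d+$")


def aggregate_speed(bundle_members: list) -> int:
    """Sum member speeds for a LAG, assuming only physical interfaces."""
    total = 0
    for member in bundle_members:
        match = PHYSICAL_IFC.match(member)
        if match is None:
            # "et-4/0/2.0" is a logical unit, not a physical interface, so
            # parsing fails and the whole /poller/speeds response fails.
            raise ValueError(f"unexpected bundle member: {member}")
        total += SPEED_BY_PREFIX[match.group(1)]
    return total


print(aggregate_speed(["et-4/0/1", "et-4/0/2"]))                # 200000
print(aggregate_speed(["et-4/0/1", "et-4/0/2", "et-4/0/2.0"]))  # raises ValueError
```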
BRIAN failed to save polled data from approximately 20:00 UTC until service was restored at approximately 10:30 UTC.
The UAT environment continued to save data correctly during this time, so the lost production data can be recovered from UAT. So far no users have reported noticing any issues. The impact is therefore relatively small, and also temporary.
At the start of the workday on Friday morning, Bjarke Madsen and Erik Reid were alerted by email notifications from Sensu about BRIAN-specific check failures.
Bjarke Madsen asked for information about the Sensu email notifications he was receiving. Erik Reid noticed that the notifications indicated no interface rate information was being stored in Influx, and Bjarke Madsen manually confirmed this was the case. Robert Latta noticed errors in the logs of the UAT and Test Inventory Providers. Bjarke Madsen made the connection between this and the UDF spike removal Kapacitor process, which had crashed and caused the data processing pipeline to stop processing data.
Data collection was restored when Bjarke Madsen rolled back the most recent Inventory Provider update in the production environment (from version 0.111 to 0.110).
Bjarke Madsen confirmed that the UAT data processing pipeline had fortunately not failed for the affected time period and the data could therefore be copied from there into the production system. This was done on .
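For reference, a minimal sketch of the kind of copy involved, using the influxdb Python client to read the outage window from UAT and write it into production. The hostnames, database and measurement names, and timestamps are illustrative assumptions, and real data would likely need tags separated back out from fields.

```python
from influxdb import InfluxDBClient  # InfluxDB 1.x Python client

# All hostnames, database/measurement names and timestamps are illustrative.
uat = InfluxDBClient(host="influx-uat.example.org", database="brian")
prod = InfluxDBClient(host="influx-prod.example.org", database="brian")

MEASUREMENT = "interface_rates"                                # assumed name
START, STOP = "2023-01-01T20:00:00Z", "2023-01-02T10:30:00Z"   # illustrative window

result = uat.query(
    f"SELECT * FROM {MEASUREMENT} WHERE time >= '{START}' AND time < '{STOP}'"
)

points = []
for row in result.get_points(MEASUREMENT):
    timestamp = row.pop("time")
    points.append({
        "measurement": MEASUREMENT,
        "time": timestamp,
        # NB: with SELECT *, tag values come back mixed in with field values;
        # real restoration would need to split them back out into "tags".
        "fields": row,
    })

prod.write_points(points, batch_size=5000)  # write the recovered window to production
```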
An earlier attempt at data restoration was made around 12:30 UTC on , using Kapacitor to re-process counters into rates for the timespan with missing data. Counters were available in production for the outage timespan, but had not been converted to rates.
During this attempt at data restoration using Kapacitor's ability to re-process older counters (first attempted on GWS Direct data), an incorrect argument was given to the replay command that was intended to limit the processing to the outage timespan.
As a result, GWS Direct data outside the outage timespan was also modified: its tags were changed, making the data temporarily unavailable until around 15:30 UTC while the measurement was re-created with the tags restored.
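A hedged sketch of the intended, time-bounded replay is shown below; the task name and timestamps are assumptions, and the exact replay-live flags should be verified against the deployed Kapacitor version before running anything like this.

```python
import subprocess

# Hypothetical task id and illustrative timestamps; verify the exact
# "kapacitor replay-live" flags against the deployed Kapacitor version.
TASK_ID = "gwsd_rates"                  # assumed task name
START = "2023-01-01T20:00:00Z"          # illustrative outage start
STOP = "2023-01-02T10:30:00Z"           # illustrative outage end

# Replay historical data through the task, explicitly bounded by -start/-stop
# so that data outside the outage window is left untouched.
subprocess.run(
    [
        "kapacitor", "replay-live", "batch",
        "-task", TASK_ID,
        "-start", START,
        "-stop", STOP,
    ],
    check=True,
)
```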
Detail the incident timeline.
Include any notable lead-up events, any starts of activity, the first known impact, and escalations. Note any decisions or changes made, and when the incident ended, along with any post-impact events of note.
Date/time (UTC) | Action | Actor
--- | --- | ---
09:17 | Ashley Brown noticed that the most recent Inventory Provider update had failed and asked about it on the dashboard users Slack channel. Erik Reid investigated. |
09:31 | The failure was noticed in the production Inventory Provider logs and Sam Roberts was asked on Slack to investigate. |
09:41 | Bjarke Madsen asked on Slack if anyone had information about the Sensu check failure notifications. Erik Reid shared the critical error info: | Bjarke Madsen
09:44 | Bjarke Madsen noticed that the Kapacitor spike removal process was failing because the Inventory Provider /poller/speeds api was returning errors: |
09:50 | The Inventory Provider update that occurred on Wednesday (TT#2023111334002463) included the code changes that were failing. It was decided to roll this back. |
09:58 | The Inventory Provider was rolled back in production and data processing pipeline functionality was restored. |
10:13 | The team decided there were two issues: |
10:28 | Sam Roberts found that the failure occurred when the /poller/speeds processor computed the aggregate speed for ae6 on mx2.zag.hr.geant.net. |
12:17 | Sam Roberts found that the failure when computing the aggregate speed for mx2.zag.hr.geant.net/ae6 was because the Inventory cache data included et-4/0/1, et-4/0/2 and et-4/0/2.0. A logical interface in the list was unexpected and the processing failed when parsing this name. Sam Roberts heard from Robert Latta that the NOC had been testing on this interface, but the details weren't clear. |
12:30 | Bjarke Madsen attempted to restore GWS Direct rates in the outage timespan, but an error with a command caused data to be modified past the outage duration, rendering it temporarily unavailable. |
15:02 | Sam Roberts prepared an MR for the Inventory Provider to fix both of the issues above. |
16:03 | Ashley Brown explained to Robert Latta and Erik Reid the test configuration that had been enabled on mx2.zag.hr.geant.net. The details are described in |
15:30 | Bjarke Madsen restored the availability of GWS Direct rates and copied over missing data for the outage duration from UAT. |
13:30 | Bjarke Madsen restored (interface/scid) rates by copying from UAT to production. |
The Five Whys is a root cause identification technique. Here’s how you can use it:
A specific, unexpected and unhandled configuration change coincided with a piece of new code that couldn't handle the output of that change.
Review your engineering backlog to find out whether there was any unplanned work there that could have prevented this incident, or at least reduced its impact.
A clear-eyed assessment of the backlog can shed light on past decisions around priority and risk.
Now that you know the root cause, can you look back and see any other incidents that could have the same root cause? If yes, note what mitigation was attempted in those incidents and ask why this incident occurred again.
Discuss what went well in the incident response, what could have gone better, and where there are opportunities for improvement.
Describe the corrective action ordered to prevent this class of incident in the future. Note who is responsible, when they must complete the work, and where that work is being tracked.