IT was alerted to a disk space issue on the /var partition of prod-cacti02-vie-at.geant.org on Friday 6th March 2020.

Usage of the partition continued to grow fairly quickly and, on assessment, it was estimated that the partition might not last through the weekend.

A new disk was added in preparation for growing the partition onto the new disk space.

An emergency maintenance ticket was created and a member of the NOC approved the maintenance to go ahead on Friday night at 21:00, to avoid interrupting users of the system should anything go wrong.

During the addition of the disk it was found that the LVM physical volume could not be created due to an error about a missing UUID. Despite efforts to fix the issue while the machine was online, none proved successful.
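For reference, the usual procedure for extending the /var logical volume onto a newly added disk is roughly as sketched below; the device, volume group and logical volume names are assumptions for illustration, and it was the physical-volume creation step that failed with the missing-UUID error.

    # Rescan the SCSI bus so the newly added VMware disk is visible
    # (host0 is an assumption; the adapter number varies)
    echo "- - -" > /sys/class/scsi_host/host0/scan

    # Initialise the new disk as an LVM physical volume -- this is the step that
    # reported the missing UUID during the incident (confirmed with pvs)
    pvcreate /dev/sdc

    # Add the new physical volume to the existing volume group (name assumed)
    vgextend vg_system /dev/sdc

    # Grow the /var logical volume into the new space and resize the filesystem
    lvextend -l +100%FREE /dev/vg_system/lv_var
    resize2fs /dev/vg_system/lv_var    # or xfs_growfs /var if the filesystem is XFS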

We took the decision to reboot the machine in the hope that a rescan of the devices would either fix the issue or make it clearer. Unfortunately the OS would not boot and we were forced to revert to the last known good backup, taken Thursday 5th at 23:10.

After some issues with VMware not releasing information about the newly added third disk, we had to delete the VM prod-cacti02-vie-at.geant.org altogether and let the restore from backup recreate the machine. This proved successful and the machine booted.

On restoration of the service we reviewed the state of the machine and found a large number of relay logs in /var/lib/mysql. This was unusual, so we checked the MySQL replication status and found that replication from the master (prod-cacti01-fra-de.geant.org) was broken, and according to mysqld.log had been broken since 25th February, due to a mismatch in the default values of the id column in three tables into which inserts from the master were trying to write. We fixed the mismatch and replication started and continued to flow. There was a further issue with replication from the prod-cacti02-vie-at.geant.org slave to the master prod-cacti01-fra-de.geant.org: the restored backup presented an earlier binlog position than the master expected for its replication. As that was momentary, we reset the slave process on prod-cacti01-fra-de.geant.org and replication again continued from vie to fra. As no further issues were being alerted by monitoring, we closed the incident and made the NOC aware that this situation would need review on Monday morning.
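A rough sketch of the replication checks and fixes described above is shown below, using the standard mysql client. The table names come from the timeline; the exact ALTER and CHANGE MASTER statements run on the night were not captured, so those lines are placeholders rather than the actual commands.

    # On the vie slave: check why inserts from the master are failing
    # (Slave_IO_Running, Slave_SQL_Running and Last_SQL_Error are the key fields)
    mysql -e "SHOW SLAVE STATUS\G"

    # Placeholder only: the id default values in data_template_data_rra, poller_item
    # and data_input_data were reset to match the master; actual ALTERs not recorded
    # mysql cacti -e "ALTER TABLE poller_item MODIFY COLUMN id ..."

    # Restart the slave threads once the mismatch is fixed
    mysql -e "START SLAVE;"

    # On fra (also a slave of vie in this multi-master setup): reset the slave
    # process that was confused by the restored box's older binlog position;
    # re-pointing with CHANGE MASTER TO may be needed depending on the position
    mysql -e "STOP SLAVE; RESET SLAVE;"
    # mysql -e "CHANGE MASTER TO MASTER_LOG_FILE='...', MASTER_LOG_POS=...;"
    mysql -e "START SLAVE;"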

Incident severity: CRITICAL

Data loss: YES

Timeline of the events as they unfolded is below:


Date       | Time  | Notes
06/03/2020 | 11:21 | First critical alert received. Decision to review and see how quickly the partition would consume space.
06/03/2020 | 17:10 | The alert was reviewed again and the partition was found to be consuming space faster than expected.
06/03/2020 | 17:30 | Logged on and added a new disk via the VMware UI. Logged onto the server and attempted to extend the existing LVM in the usual manner. The server produced errors when the physical volume was created, with a message about a missing UUID, which pvs confirmed. Remediation attempts to retrieve the situation were unsuccessful and a reboot was requested to confirm whether a device rescan would fix the issue or provide more information.
06/03/2020 | 19:54 | Emergency ticket raised to perform a reboot at 21:00; approved by NOC.
06/03/2020 | 21:00 | The VM did not boot. Multiple fsck options were tried without success, so we were forced to restore from backup.
06/03/2020 | 21:30 | Restore from backup was requested.
06/03/2020 | 22:25 | After issues with the restore, a good VM version was restored and booted.
06/03/2020 | 22:30 | Investigated the mass of relay logs in /var/lib/mysql.
06/03/2020 | 22:33 | Logged in to MySQL on vie and reset the id default value in the data_template_data_rra, poller_item and data_input_data tables.
06/03/2020 | 22:53 | Logged into prod-cacti01-fra-de.geant.net to fix the replication break caused by the restore presenting an older binlog entry than expected.
06/03/2020 | 23:14 | Notified NOC that this fix would need review on Monday but that replication was fixed. No notification at this point that anything was broken.
09/03/2020 | 16:37 | Issue reported to Software Development: "Cacti system that has been rebuilt from a backup is broken and half the graphs aren't working."
09/03/2020 | 17:00 | Held an internal SWD call to understand the issue and decided more information was needed from stakeholders. Organised a meeting with stakeholders for Tuesday morning and informed them via email.
10/03/2020 | 10:00 | In the meeting it turned out that no one had a complete picture of the system and time was needed to review the cluster setup. Decision was taken to reconvene later in the afternoon.
10/03/2020 | 10:55 | A Slack channel was set up to discuss the findings of the review of the setup and the issue.
10/03/2020 | 12:15 | Cacti runs "unison" to perform a two-way synchronisation. Unison stopped working the first time as a consequence of the filesystem corruption, and did not work with the restored system because the two VMs were not in sync. We removed the database created by Unison and started Unison from scratch on both systems, and the sync started working again (see the sketch below the timeline).
10/03/2020 | 12:40 | Stakeholders confirmed the system is looking good; no need for the afternoon meeting.
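The Unison recovery at 12:15 amounted to discarding Unison's stale state and letting it resynchronise from scratch. A minimal sketch is below, assuming Unison's default archive location under ~/.unison and a hypothetical RRA path; the exact profile used by Cacti is not recorded in this report.

    # Run on both Cacti hosts: remove Unison's archive files so it forgets the
    # pre-restore state and treats the next run as a first-time synchronisation
    rm -f ~/.unison/ar* ~/.unison/fp*

    # Re-run the two-way sync of the RRD directory between the two hosts;
    # the host name is real, the path and options are illustrative only
    unison /var/lib/cacti/rra ssh://prod-cacti01-fra-de.geant.org//var/lib/cacti/rra -batch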

Lessons Learned:

  • Multi-master MySQL leads to split brain.
  • Only backups can save against filesystem corruption.
  • A process should be defined to deal with such outages in future, including emergency maintenance so that the community is informed, and post-change checks to make sure graphs on both systems are being updated.
  • Set up monitoring of the two-way synchronisation of RRD files (a sketch follows this list).
  • The Product Owner of Cacti should also be informed about any changes relating to the Cacti GCS plugin, as the plugin is deployed on the same VM as Cacti.
  • Even though Cacti operations are owned by IT, SWD should be kept in the loop for any tasks or issues relating to the application.
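As a starting point for the RRD synchronisation monitoring mentioned above, a minimal sketch of a check that could run from cron on the vie host is below; the peer host name is real, but the RRA path, the 15-minute threshold and the plain echo alert are all assumptions to be replaced by whatever monitoring hook is chosen.

    #!/bin/bash
    # Sketch: compare RRD file counts and freshness between the two Cacti hosts
    # and report when they drift apart. Paths and thresholds are assumptions.
    RRA_DIR=/var/lib/cacti/rra            # assumed RRA location
    PEER=prod-cacti01-fra-de.geant.org    # run from prod-cacti02-vie-at

    local_count=$(find "$RRA_DIR" -name '*.rrd' | wc -l)
    peer_count=$(ssh "$PEER" "find $RRA_DIR -name '*.rrd' | wc -l")

    local_newest=$(find "$RRA_DIR" -name '*.rrd' -printf '%T@\n' | sort -n | tail -1)
    peer_newest=$(ssh "$PEER" "find $RRA_DIR -name '*.rrd' -printf '%T@\n' | sort -n | tail -1")

    now=$(date +%s)
    # Alert if the file counts differ or the newest RRD on either side is older than 15 minutes
    if [ "$local_count" -ne "$peer_count" ] || \
       [ $((now - ${local_newest%.*})) -gt 900 ] || \
       [ $((now - ${peer_newest%.*})) -gt 900 ]; then
        echo "RRD sync check failed: $local_count local vs $peer_count peer RRD files"
    fi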


Next Steps:

  • SWD to propose a solution to avoid the multi-master split-brain issue in future so that the system can be made more robust.
  • Set up monitoring of the two-way synchronisation of RRD files.
  • During the investigation a bug relating to the Cacti GCS plugin was also identified; debug and fix the issue.