Cacti production incident - 06-03-2020

IT was alerted to a disk space issue on the /var partition of prod-cacti02-vie-at.geant.org on Friday 6th March 2020.

Usage on the partition continued to grow reasonably quickly, and on assessment it was estimated that the partition might fill before the end of the weekend.
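The kind of estimate described above can be sketched as a linear extrapolation from two usage samples. This is a minimal illustration only: the function name and the sample figures are hypothetical and were not taken from the incident, and real growth is rarely perfectly linear.

```python
from datetime import datetime, timedelta


def estimate_time_to_full(t1, used1_gb, t2, used2_gb, size_gb):
    """Linearly extrapolate two usage samples to the moment the partition fills.

    Returns None if usage is flat or shrinking between the samples.
    """
    elapsed_hours = (t2 - t1).total_seconds() / 3600
    if elapsed_hours <= 0:
        raise ValueError("t2 must be later than t1")
    rate_gb_per_hour = (used2_gb - used1_gb) / elapsed_hours
    if rate_gb_per_hour <= 0:
        return None
    hours_left = (size_gb - used2_gb) / rate_gb_per_hour
    return t2 + timedelta(hours=hours_left)


# Hypothetical samples: a 50 GB partition at 40 GB used, then 43 GB six hours later.
full_at = estimate_time_to_full(
    datetime(2020, 3, 6, 11, 0), 40.0,
    datetime(2020, 3, 6, 17, 0), 43.0,
    50.0,
)
print(full_at)  # 2020-03-07 07:00:00
```

With these invented numbers the partition grows at 0.5 GB per hour and would fill 14 hours after the second sample, which is the sort of result that would justify acting before the weekend.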

A new disk was added in preparation for growing the partition onto the new disk space.

An emergency maintenance ticket was created and a member of the NOC approved the maintenance to go ahead on Friday night at 9pm, to avoid interrupting users of the system should anything go wrong.

While adding the disk, we found that the LVM physical volume could not be created on the new disk because of an error about a missing UUID. Despite efforts to fix the issue while the machine was online, none proved successful.
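For context, extending an LVM-backed filesystem onto a new disk commonly takes three commands: pvcreate, vgextend and lvextend. The sketch below only builds those command lines for review and does not run anything. The device, volume group and logical volume names are hypothetical placeholders, not the ones used on prod-cacti02-vie-at.geant.org, and this is not necessarily the exact procedure that was followed.

```python
def lvm_extend_commands(device, volume_group, logical_volume):
    """Return the usual command sequence for growing a logical volume onto a new disk."""
    lv_path = f"/dev/{volume_group}/{logical_volume}"
    return [
        # Initialise the new disk as an LVM physical volume.
        ["pvcreate", device],
        # Add the physical volume to the existing volume group.
        ["vgextend", volume_group, device],
        # Grow the logical volume into all free extents and resize the filesystem with it.
        ["lvextend", "--resizefs", "--extents", "+100%FREE", lv_path],
    ]


# Hypothetical names, for illustration only.
commands = lvm_extend_commands("/dev/sdc", "vg_example", "lv_var")
for command in commands:
    print(" ".join(command))
```

In this incident it was the first of these steps, creating the physical volume, that reported the missing UUID.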

We took the decision to reboot the machine in the hope that a rescan of the devices would fix the issue or make clear what it was. Unfortunately the OS would not boot and we were forced to revert to the last known good backup, taken on Thursday 5th March at 23:10.

After a few issues with VMware not relinquishing information about the newly added third disk, we had to delete the VM prod-cacti02-vie-at.geant.org altogether and then let the backup replace the box. This proved more successful and the box booted.

On restoration of the service we reviewed the state of the box and found a large number of relay logs in /var/lib/mysql. This was unusual, so we checked the MySQL replication status and found that replication from the master (prod-cacti01-fra-de.geant.org) was broken and, according to mysqld.log, had been broken since 25th February. The cause was a mismatch in the default values of id in three tables that inserts from the master were trying to write to. We fixed the mismatch and replication started and continued to flow.

There was a further issue with replication in the other direction, from prod-cacti02-vie-at.geant.org to prod-cacti01-fra-de.geant.org. This was because the restored backup presented an earlier binlog number than the master expected for its replication. As the discrepancy was momentary, we reset the slave process on prod-cacti01-fra-de.geant.org and replication again continued from vie to fra.

As no further alerts were being raised, we closed the incident and made NOC aware that this situation would need review on Monday morning.
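Because the replication break went unnoticed for over a week, a periodic check of the slave threads may be worth considering. The following is a minimal sketch that parses the text output of `SHOW SLAVE STATUS\G` and reports whether both replication threads are running. The function names and the sample text are hypothetical; the sample was written for illustration and was not captured during the incident.

```python
def parse_slave_status(text):
    """Turn the 'Key: value' lines of SHOW SLAVE STATUS\\G output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing colons stay intact.
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def replication_healthy(fields):
    """Replication is healthy only when both the IO and SQL threads report Yes."""
    return (
        fields.get("Slave_IO_Running") == "Yes"
        and fields.get("Slave_SQL_Running") == "Yes"
    )


# Invented sample output, for illustration only.
SAMPLE = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
               Last_SQL_Error: (example error text)
"""

status = parse_slave_status(SAMPLE)
if not replication_healthy(status):
    print("Replication broken:", status.get("Last_SQL_Error"))
```

A stopped SQL thread with a running IO thread matches the symptom seen here: relay logs keep arriving from the master and accumulate on disk because they are never applied.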

The timeline of events as they unfolded is below:

Date	Time	Notes
06/03/2020	11:21	First critical alert received. Decision to review and see how fast the partition would consume space.
06/03/2020	17:10	Alert was reviewed again and the partition was found to be consuming space faster than expected.
06/03/2020	17:30	Logged on and added the new disk via the VMware UI. Logged onto the server and attempted to extend the existing LVM in the usual manner. The server produced errors when the physical volume was created, with a message about a missing UUID, which pvs confirmed. Attempts to remediate the situation were unsuccessful and a reboot was requested to confirm whether a device rescan would fix the issue or provide more information.
06/03/2020	19:54	Emergency ticket to perform a reboot at 21:00 was raised and approved by NOC.
06/03/2020	21:00	The VM did not boot. Multiple fsck options were tried without success, so we were forced to restore from backup.
06/03/2020	21:30	Restore from backup was requested.
06/03/2020	22:25	After issues with the restore a good VM version was restored and booted.
06/03/2020	22:30	Investigated the mass of relay logs in /var/lib/mysql
06/03/2020	22:33	Logged in to mysql on vie and reset the id default value in the data_template_data_rra, poller_item and data_input_data tables.
06/03/2020	22:53	Logged into prod-cacti01-fra-de.geant.net to fix the replication break caused by the restore showing an older binlog entry than expected.
06/03/2020	23:14	Notified NOC that this fix will need review on Monday.
