Incident description

We have a system called Foreman that inventories and collects details about the servers in our Puppet and Ansible configuration management systems. Recently Konstantin Lepikhov connected VM management to that system, which brought benefits such as better action handling and inventory data collection.

On 13.02.2018 at 14:06 CET Konstantin Lepikhov ran the housekeeping task on that system, which removes old puppetdb entries from the database that haven't been updated for more than 2 weeks. Unfortunately, he hadn't checked that this also removes all hosts in the Ansible system: that system runs only occasionally, so none of its host entries are up to date. And because those Ansible hosts were linked to VMs in the VMware cluster, Foreman removed the VMs too.
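
The failure mode described above (a staleness-based cleanup that also deletes hosts with linked VMs) can be guarded against with a dry-run step that separates the two groups before anything is removed. The following is a minimal sketch only: the `Host` record, the `compute_resource` field, the `plan_cleanup` helper, and the example host names and dates are all hypothetical and are not Foreman's actual data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Host:
    # Hypothetical record for illustration; not Foreman's real data model.
    name: str
    last_report: datetime
    compute_resource: Optional[str] = None  # set when a VM is linked to the host


def plan_cleanup(hosts, now, max_age=timedelta(weeks=2)):
    """Split stale hosts into those safe to delete and those needing review."""
    safe, needs_review = [], []
    for host in hosts:
        if now - host.last_report <= max_age:
            continue  # updated recently, keep the entry
        if host.compute_resource is None:
            safe.append(host.name)
        else:
            # Deleting this entry could also delete the linked VM.
            needs_review.append(host.name)
    return safe, needs_review


# Illustrative data only; names and dates are made up.
now = datetime(2018, 2, 13, 14, 6)
hosts = [
    Host("old-puppet-node.example.org", datetime(2018, 1, 1)),
    Host("ansible-vm.example.org", datetime(2018, 1, 10), compute_resource="vmware"),
    Host("fresh-node.example.org", datetime(2018, 2, 12)),
]
safe, needs_review = plan_cleanup(hosts, now)
print("safe to delete:", safe)
print("linked to a VM, review first:", needs_review)
```

With a split like this, the housekeeping task would only act on the first list, and the second list would be checked by a person before any removal.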

Incident severity: CRITICAL

Data loss: YES

Timeline

...


Time (CET)

15:06  Foreman triggered VM remove action.

15:35  ... contacted us about problems with wordpress1.geant.org and filesender-prod.geant.org because the Nagios monitoring that he still runs for these services from the University of Amsterdam alerted that these systems were not accessible.

15:41  ... confirmed that hosts are missing in VMware.

15:44  ... identified the issue and started investigation on the VMware cluster.

16:43  ... contacted Qaiser in Slack to confirm backup existence.

16:44  Dick Visser contacted Qaiser Ahmed on his mobile phone, no answer.

16:45  DevOps confirmed that there are no backups or extra copies on VMware storage.

17:00  ... called Qaiser Ahmed in Slack, no response.

17:00  ... confirmed that he has backups on a server at Amsterdam university (those are daily backups taken directly by the VMs themselves).

18:26  ... confirmed on the #devops channel that the whole folder called AMS_UBUNTU on the VMware cluster is not backed up and there's no data left.

18:30  ... recreated new VMs in the VMware cluster and started the backup restore.

20:30  ... restored the backup and brought all sites online.

20:45  ... made an official announcement on the #it and #general channels about the incident and that the problem was solved.

21:00  Dick Visser started the restore of filesender-prod.geant.org.

21:50  Dick Visser finished the restore of filesender-prod.geant.org, with the exception of user files, as these aren't backed up due to privacy issues and the fact that this is a demonstration service.


Current situation

All data on server wordpress1.geant.org was restored from the backup taken at midnight 12.02.2018, which means there was unrecoverable data loss for everything that was posted between 00:00 and 2pm.

All user data on server filesender-prod.geant.org (~400G) was lost and is unrecoverable, as there are no backups. However, this was a design decision for two reasons: a) this system is considered a demonstration system, and b) the volume of data (400GB) was too big for the backup system.

The AARC website was undergoing a major update of its files to index them and make them more accessible to people. One day of work was lost; if there had been a more recent backup, the damage would have been much smaller.

...

[root@foreman-test ~]# fgrep 'Removing Compute instance' /var/log/foreman/production.log
2018-02-13 14:06:32 b3687224 [app] [I] Removing Compute instance for filesender-prod.geant.org
2018-02-13 14:06:46 b3687224 [app] [I] Removing Compute instance for prod-insight.geant.org
2018-02-13 14:06:52 b3687224 [app] [I] Removing Compute instance for prod-twiki.geant.net
2018-02-13 14:07:01 b3687224 [app] [I] Removing Compute instance for test-backup.geant.net
2018-02-13 14:07:07 b3687224 [app] [I] Removing Compute instance for test-crowd.win.dante.org.uk
2018-02-13 14:07:16 b3687224 [app] [I] Removing Compute instance for uat-insight.geant.org
2018-02-13 14:07:28 b3687224 [app] [I] Removing Compute instance for wordpress1.geant.org
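
To turn log output like the above into a list of affected hosts, the lines can be parsed with a small script. This is a sketch that assumes the line layout shown in the excerpt (timestamp, request id, `[app] [I]`, message); the `removed_instances` helper is a hypothetical name, and the sample lines are copied from the log above.

```python
import re

# Pattern based on the line layout seen in the log excerpt; an assumption,
# not an official description of Foreman's log format.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ \[app\] \[I\] "
    r"Removing Compute instance for (?P<host>\S+)$"
)


def removed_instances(lines):
    """Return (timestamp, hostname) pairs for each matching log line."""
    result = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            result.append((match.group("timestamp"), match.group("host")))
    return result


sample = [
    "2018-02-13 14:06:32 b3687224 [app] [I] Removing Compute instance for filesender-prod.geant.org",
    "2018-02-13 14:07:28 b3687224 [app] [I] Removing Compute instance for wordpress1.geant.org",
    "a line that does not match",
]
for timestamp, host in removed_instances(sample):
    print(timestamp, host)
```

Lines that do not match the pattern are skipped, so the same function can be fed the whole log file rather than pre-filtered output.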

Lessons learned

Qaiser Ahmed confirmed that the whole AMS_UBUNTU folder is now backed up. We still need to test this, especially the backup restore (how it is performed and how much time it takes).

...