Incident description

We have a system called Foreman that inventories and collects details about the servers in our Puppet and Ansible configuration management systems. Recently Konstantin Lepikhov connected VM management to that system, which brought benefits such as better action handling and inventory data collection.

On 13.02.2018 at 14:06 CET Konstantin Lepikhov ran the housekeeping task on that system, which removes puppetdb entries that have not been updated for more than 2 weeks from the database. Unfortunately, he had not checked that this would also remove all hosts in the Ansible system: Ansible runs only occasionally, so its host entries were not up to date. And because those Ansible hosts were linked to VMs in the VMware cluster, Foreman removed the VMs too.
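The failure mode described above can be guarded against with a dry run that separates stale hosts still linked to a VM from those that are safe to purge. The following is a minimal sketch, not Foreman's actual housekeeping task: the `Host` record, its field names, the `plan_cleanup` helper and the example.org hostnames are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Host:
    # Hypothetical inventory record; field names are assumptions, not Foreman's schema.
    name: str
    last_report: Optional[datetime]  # None if the host has never reported
    has_vm: bool                     # True if the entry is linked to a VM


def plan_cleanup(
    hosts: List[Host],
    now: datetime,
    max_age: timedelta = timedelta(weeks=2),
) -> Tuple[List[Host], List[Host]]:
    """Split stale hosts into (safe_to_remove, needs_review) without deleting anything."""
    safe: List[Host] = []
    review: List[Host] = []
    for host in hosts:
        stale = host.last_report is None or now - host.last_report > max_age
        if not stale:
            continue
        # A stale entry that is linked to a VM is held back for manual review.
        (review if host.has_vm else safe).append(host)
    return safe, review


if __name__ == "__main__":
    now = datetime(2018, 2, 13, 14, 6)
    hosts = [
        Host("fresh.example.org", now - timedelta(days=1), has_vm=True),
        Host("stale-record.example.org", now - timedelta(days=30), has_vm=False),
        Host("stale-vm.example.org", None, has_vm=True),
    ]
    safe, review = plan_cleanup(hosts, now)
    print("safe to remove:", [h.name for h in safe])
    print("needs review:", [h.name for h in review])
```

Printing the two lists before any destructive step makes it visible when a cleanup would touch hosts that own a VM.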

Incident severity: CRITICAL

Data loss: YES

Timeline

All times are in CET.

14:06 - Foreman triggered the VM remove action.
15:35 - Dick Visser contacted us about problems with wordpress and filesender after receiving a monitoring alert that the sites were not accessible.
15:41 - Konstantin Lepikhov confirmed that the hosts were missing in VMware.
15:44 - Konstantin Lepikhov identified the issue and started an investigation on the VMware cluster.
16:43 - Konstantin Lepikhov contacted Qaiser in Slack to confirm that backups existed.
16:45 - DevOps confirmed that there were no backups or extra copies on the VMware storage.
17:00 - Konstantin Lepikhov called Qaiser Ahmed in Slack, no response.
...
17:00 - Dick Visser confirmed that he had backups on a server at Amsterdam university (daily backups taken directly by the VMs themselves).
18:26 - Qaiser Ahmed confirmed on the #devops channel that the whole folder called AMS_UBUNTU on the VMware cluster was not backed up and that no data was left.
18:30 - Dick Visser recreated the VMs in the VMware cluster and started the backup restore.
20:30 - Dick Visser restored the backups and brought all sites back online.
20:45 - Konstantin Lepikhov made an official announcement on the #it and #general channels about the incident and that the problem was solved.

TOTAL DOWNTIME: 6 hours 40min.
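
The stated total can be cross-checked against the timeline. In the minimal sketch below (the `elapsed` helper is hypothetical), the outage is measured from the first removal at 14:06: it comes to 6 h 24 min until the sites were back online at 20:30 and 6 h 39 min until the announcement at 20:45, so the 6 hours 40 min figure appears to be rounded from the latter.

```python
from datetime import datetime


def elapsed(start: str, end: str) -> str:
    """Return the time between two same-day HH:MM stamps as 'Hh MMmin'."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}min"


if __name__ == "__main__":
    print(elapsed("14:06", "20:30"))  # first removal to sites back online
    print(elapsed("14:06", "20:45"))  # first removal to the announcement
```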

Current situation

All data on server wordpress1.geant.org was restored from the backup taken at midnight 12.02.2018, which means that everything posted between 00:00 and 2pm was lost and cannot be recovered.

All user data on server filesender-prod.geant.org was lost and is unrecoverable (~400G).

The AARC website was undergoing a major update of its files to index them and make them more accessible to people. One day of work was lost; with a more recent backup the damage would have been much smaller.

Other servers were deleted as well, but those losses are insignificant.


[root@foreman-test ~]# fgrep 'Removing Compute instance' /var/log/foreman/production.log
2018-02-13 14:06:32 b3687224 [app] [I] Removing Compute instance for filesender-prod.geant.org
2018-02-13 14:06:46 b3687224 [app] [I] Removing Compute instance for prod-insight.geant.org
2018-02-13 14:06:52 b3687224 [app] [I] Removing Compute instance for prod-twiki.geant.net
2018-02-13 14:07:01 b3687224 [app] [I] Removing Compute instance for test-backup.geant.net
2018-02-13 14:07:07 b3687224 [app] [I] Removing Compute instance for test-crowd.win.dante.org.uk
2018-02-13 14:07:16 b3687224 [app] [I] Removing Compute instance for uat-insight.geant.org
2018-02-13 14:07:28 b3687224 [app] [I] Removing Compute instance for wordpress1.geant.org
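
To turn log output like the above into a list of affected hosts, the removal lines can be parsed with a regular expression. This is a minimal sketch that assumes every removal line has exactly the layout shown in the excerpt; the `removed_instances` helper and the `REMOVAL_RE` pattern are hypothetical names introduced here.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# 2018-02-13 14:06:32 b3687224 [app] [I] Removing Compute instance for wordpress1.geant.org
REMOVAL_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ \[app\] \[I\] "
    r"Removing Compute instance for (\S+)$"
)


def removed_instances(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, hostname) for every removal line, in log order."""
    found = []
    for line in lines:
        match = REMOVAL_RE.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    sample = [
        "2018-02-13 14:06:32 b3687224 [app] [I] Removing Compute instance for filesender-prod.geant.org",
        "2018-02-13 14:07:28 b3687224 [app] [I] Removing Compute instance for wordpress1.geant.org",
    ]
    for timestamp, hostname in removed_instances(sample):
        print(timestamp, hostname)
```

Lines that do not match the assumed layout are skipped, so the result should be compared against the raw log before it is relied on.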

Lessons learned

Qaiser Ahmed confirmed that the whole AMS_UBUNTU folder is now backed up. We still need to test this, especially the backup restore (how it is performed and how much time it takes).

The DevOps team will find ways to isolate the production environment and raise awareness of invasive operations within the Puppet and Ansible infrastructure.

The IT team should take action on backup procedures for the production environment located on the GEANT VMware cluster.

We need better monitoring and incident handling, especially the interaction between stakeholders and departments (DevOps and IT/OC).
