...

Time (CET)
15:06

Foreman triggered the VM remove action.

15:35

Dick Visser contacted us about problems with wordpress1.geant.org and filesender-prod.geant.org: the Nagios monitoring that he still runs for these services from the University of Amsterdam had alerted that these systems were not accessible.

15:41

Konstantin Lepikhov confirmed that the hosts were missing in VMware.

15:44

Konstantin Lepikhov identified the issue and started an investigation on the VMware cluster.

16:43

Konstantin Lepikhov contacted Qaiser Ahmed in Slack to confirm whether backups existed.

16:44

Dick Visser tried to reach Qaiser Ahmed on his mobile phone; no answer.

16:45

DevOps confirmed that there were no backups or extra copies on the VMware storage.

17:00

Konstantin Lepikhov called Qaiser Ahmed in Slack, no response.

17:00

Dick Visser confirmed that he had backups on a server at the University of Amsterdam (daily backups taken directly by the VMs themselves).

18:26

Qaiser Ahmed confirmed on the #devops channel that the whole AMS_UBUNTU folder on the VMware cluster was not backed up and that no data was left.

18:30

Dick Visser created new VMs in the VMware cluster and started the restore process.

20:30

Dick Visser restored the backup and brought all sites online.

20:45

Konstantin Lepikhov made an official announcement on the #it and #general Slack channels about the incident and the resolution.

21:00

Dick Visser started the restore of filesender-prod.geant.org.

21:50

Dick Visser finished the restore of filesender-prod.geant.org, with the exception of user files, which are not backed up because of privacy issues and the fact that this is a demonstration service.

Total downtime: 5

Current situation

All data on server wordpress1.geant.org was restored from the backup taken at midnight 2018-02-18, which means there was unrecoverable data loss for everything posted between 00:00 and 2pm.

...

[root@foreman-test ~]# fgrep 'Removing Compute instance' /var/log/foreman/production.log
2018-02-13 14:06:32 b3687224 [app] [I] Removing Compute instance for filesender-prod.geant.org
2018-02-13 14:06:46 b3687224 [app] [I] Removing Compute instance for prod-insight.geant.org
2018-02-13 14:06:52 b3687224 [app] [I] Removing Compute instance for prod-twiki.geant.net
2018-02-13 14:07:01 b3687224 [app] [I] Removing Compute instance for test-backup.geant.net
2018-02-13 14:07:07 b3687224 [app] [I] Removing Compute instance for test-crowd.win.dante.org.uk
2018-02-13 14:07:16 b3687224 [app] [I] Removing Compute instance for uat-insight.geant.org
2018-02-13 14:07:28 b3687224 [app] [I] Removing Compute instance for wordpress1.geant.org
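
The log excerpt above can be summarised programmatically when checking which hosts a bulk action touched. The following is a minimal sketch, assuming the line layout shown in the excerpt (timestamp, request id, `[app] [I]`, message); the regular expression and the helper name `parse_removals` are illustrative and not part of Foreman.

```python
import re
from datetime import datetime

# Lines copied from the production.log excerpt above.
LOG_LINES = """\
2018-02-13 14:06:32 b3687224 [app] [I] Removing Compute instance for filesender-prod.geant.org
2018-02-13 14:06:46 b3687224 [app] [I] Removing Compute instance for prod-insight.geant.org
2018-02-13 14:06:52 b3687224 [app] [I] Removing Compute instance for prod-twiki.geant.net
2018-02-13 14:07:01 b3687224 [app] [I] Removing Compute instance for test-backup.geant.net
2018-02-13 14:07:07 b3687224 [app] [I] Removing Compute instance for test-crowd.win.dante.org.uk
2018-02-13 14:07:16 b3687224 [app] [I] Removing Compute instance for uat-insight.geant.org
2018-02-13 14:07:28 b3687224 [app] [I] Removing Compute instance for wordpress1.geant.org
""".splitlines()

# Assumed layout: "<date> <time> <request id> [app] [I] Removing Compute instance for <host>"
PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<req>\S+) \[app\] \[I\] "
    r"Removing Compute instance for (?P<host>\S+)$"
)


def parse_removals(lines):
    """Return (timestamp, request id, hostname) for every removal line."""
    removals = []
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
            removals.append((ts, match["req"], match["host"]))
    return removals


removals = parse_removals(LOG_LINES)
hosts = [host for _, _, host in removals]
request_ids = {req for _, req, _ in removals}
span = removals[-1][0] - removals[0][0]

print(f"{len(hosts)} hosts removed in {span.total_seconds():.0f} s "
      f"by {len(request_ids)} request(s):")
for host in hosts:
    print(" ", host)
```

Grouping by the request id field makes it visible that all removals in the excerpt carry the same id, and the list mixes production, UAT and test hosts.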

Lessons learned

  • Qaiser Ahmed confirmed that the whole AMS_UBUNTU folder on the VMware cluster is now backed up. We still need to test this, especially restoring from backup (how it is performed and how long it takes).
  • The DevOps team will find ways to isolate the production environment and improve awareness of invasive operations within the Puppet and Ansible infrastructure.
  • The IT team should take action on backup procedures for the production environment located on the GEANT VMware cluster.
  • We need better monitoring and incident handling, especially the interaction between stakeholders and departments (DevOps and IT/OC).
  • The monitoring that Dick Visser is responsible for did work, but the check interval could be shortened: the first Nagios alarm came in 20 minutes after the systems went down.
  • The backups that Dick Visser is responsible for also worked, and the entire webserver could be completely restored from scratch. The RPO for this system (1 day) stems from the time it was first put into production a few years ago, when it contained much less user-contributed content and updates happened less frequently. This could be improved to something like 1 hour.
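
To illustrate the RPO point in the last bullet, here is a rough sketch of how the data loss window shrinks with a shorter backup interval. It assumes backups run on a fixed schedule aligned to midnight and complete instantly, and it uses the wordpress1.geant.org removal timestamp from the Foreman log as the incident time (in the log's own time zone); the function name `last_backup_before` is hypothetical.

```python
from datetime import datetime, timedelta


def last_backup_before(incident, interval):
    """Most recent scheduled backup at or before the incident.

    Assumes a fixed schedule starting at midnight of the incident day.
    """
    midnight = incident.replace(hour=0, minute=0, second=0, microsecond=0)
    completed_runs = (incident - midnight) // interval  # whole intervals elapsed
    return midnight + completed_runs * interval


# Removal time of wordpress1.geant.org as recorded in the Foreman log.
incident = datetime(2018, 2, 13, 14, 7, 28)

loss_daily = incident - last_backup_before(incident, timedelta(days=1))
loss_hourly = incident - last_backup_before(incident, timedelta(hours=1))

print("data loss window with daily backups: ", loss_daily)
print("data loss window with hourly backups:", loss_hourly)
```

The worst case for any schedule equals the backup interval itself, which is what the RPO figures of 1 day and 1 hour express.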