Production WordPress site outage 2018-02-13

Incident description

We use Foreman, a system that inventories and collects details about the servers in our Puppet and Ansible configuration management systems. Konstantin Lepikhov recently connected VM management to Foreman, which brought benefits such as better action handling and inventory data collection.

On 2018-02-13 at 14:06 CET, Konstantin Lepikhov ran a housekeeping task on that system which removes old puppetdb entries that have not been updated for more than 2 weeks. Unfortunately, he had not checked that this task also removes the hosts in the Ansible system: Ansible runs only occasionally, so none of its host entries were up to date. Because those Ansible hosts were linked to VMs in the VMware cluster, Foreman removed the VMs too.
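A pre-flight check along the following lines could surface this kind of risk before a cleanup runs. This is a minimal sketch, not Foreman's actual API: the `Host` record, its fields (`last_report`, `compute_linked`), the helper `hosts_at_risk`, and the `example.org` host names are all hypothetical, and it assumes the host inventory has already been exported to plain records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Host:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    last_report: datetime   # last time the host reported in
    compute_linked: bool    # True if the host entry is tied to a VM


def hosts_at_risk(hosts, now, max_age=timedelta(weeks=2)):
    """Return stale hosts whose removal would also delete a linked VM."""
    cutoff = now - max_age
    return [h for h in hosts if h.last_report < cutoff and h.compute_linked]


# Made-up sample data, not taken from the incident.
now = datetime(2018, 2, 13, 14, 0)
inventory = [
    Host("stale-vm.example.org", datetime(2018, 1, 10), True),         # stale, has VM
    Host("fresh-vm.example.org", datetime(2018, 2, 12), True),         # recent report
    Host("stale-physical.example.org", datetime(2018, 1, 1), False),   # stale, no VM
]

at_risk = hosts_at_risk(inventory, now)
for h in at_risk:
    print(f"WOULD DELETE VM: {h.name}")
```

Running such a dry run first, and refusing to continue when the list is non-empty, keeps the destructive step behind an explicit confirmation.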

Incident severity: CRITICAL

Data loss: YES

Timeline

Time (CET)	Event
15:06	Foreman triggered the VM removal action.
15:35	Dick Visser contacted us about problems with wordpress1.geant.org and filesender-prod.geant.org, because the Nagios monitoring that he still runs for these services from the University of Amsterdam had alerted that these systems were not accessible.
15:41	Konstantin Lepikhov confirmed that the hosts were missing in VMware.
15:44	Konstantin Lepikhov identified the issue and started investigating on the VMware cluster.
16:43	Konstantin Lepikhov contacted Qaiser in Slack to confirm backup existence
16:44	Dick Visser contacted Qaiser Ahmed on his mobile phone, no answer
16:45	DevOps confirmed that there are no backups or extra copies on VMware storage
17:00	Konstantin Lepikhov called Qaiser Ahmed in Slack, no response.
17:00	Dick Visser confirmed that he has backups on a server at the University of Amsterdam (daily backups taken directly by the VMs themselves).
18:26	Qaiser Ahmed confirmed on the #devops channel that the whole AMS_UBUNTU folder on the VMware cluster is not backed up and that no data is left.
18:30	Dick Visser created new VMs in the VMware cluster and started the restore process.
20:30	Dick Visser restored the backup and brought all sites online.
20:45	Konstantin Lepikhov made an official announcement on the #it and #general Slack channels about the incident and the resolution.
21:00	Dick Visser started restore of filesender-prod.geant.org.
21:50	Dick Visser finished the restore of filesender-prod.geant.org, with the exception of user files, which are not backed up due to privacy issues and the fact that this is a demonstration service.

Total downtime: 5:39 hours.

Current situation

All data on server wordpress1.geant.org was restored from the backup taken at midnight on 2018-02-13, which means that everything posted between 00:00 and 14:00 that day was lost and cannot be recovered.

All user data on server filesender-prod.geant.org (~400 GB) was lost, as there are no backups. However, this was a design decision, for two reasons: a) the system is considered a demonstration system, and b) the volume of data (400 GB) was too big for the backup system.

The AARC website was undergoing a major update of its files to index them and make them more accessible to people. One day of work was lost; with a more recent backup, the damage would have been much smaller.

Other servers were also deleted, but those losses are insignificant.

[root@foreman-test ~]# fgrep 'Removing Compute instance' /var/log/foreman/production.log
2018-02-13 14:06:32 b3687224 [app] [I] Removing Compute instance for filesender-prod.geant.org
2018-02-13 14:06:46 b3687224 [app] [I] Removing Compute instance for prod-insight.geant.org
2018-02-13 14:06:52 b3687224 [app] [I] Removing Compute instance for prod-twiki.geant.net
2018-02-13 14:07:01 b3687224 [app] [I] Removing Compute instance for test-backup.geant.net
2018-02-13 14:07:07 b3687224 [app] [I] Removing Compute instance for test-crowd.win.dante.org.uk
2018-02-13 14:07:16 b3687224 [app] [I] Removing Compute instance for uat-insight.geant.org
2018-02-13 14:07:28 b3687224 [app] [I] Removing Compute instance for wordpress1.geant.org
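The list of affected hosts can be extracted from such log lines mechanically. The sketch below assumes the line layout shown in the excerpt above (timestamp, request id, `[app] [I]`, message); the regular expression and the function name `removed_instances` are illustrative, not part of Foreman.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2018-02-13 14:06:32 b3687224 [app] [I] Removing Compute instance for <host>
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ \[app\] \[I\] "
    r"Removing Compute instance for (\S+)$"
)


def removed_instances(lines):
    """Return (timestamp, hostname) pairs for every removal line."""
    result = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            result.append((ts, m.group(2)))
    return result


sample = [
    "2018-02-13 14:06:32 b3687224 [app] [I] Removing Compute instance for filesender-prod.geant.org",
    "2018-02-13 14:07:28 b3687224 [app] [I] Removing Compute instance for wordpress1.geant.org",
    "this line is not a removal entry (made up for the example)",
]

removed = removed_instances(sample)
for ts, host in removed:
    print(ts.isoformat(sep=" "), host)
```

In practice the input would be the lines of the log file itself rather than a hard-coded sample.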

Lessons learned

Qaiser Ahmed confirmed that the whole AMS_UBUNTU folder on the VMware cluster is now backed up. We still need to test this, especially the backup restore (how it is performed and how long it takes).
The DevOps team will find ways to isolate the production environment and to raise awareness of invasive operations within the Puppet and Ansible infrastructure.
The IT team should take action on the backup procedures for the production environment located on the GEANT VMware cluster.
We need better monitoring and incident handling, especially the interaction between stakeholders and departments (DevOps and IT/OC).
The monitoring that Dick Visser is responsible for did work, but the check interval could be slightly improved: the first Nagios alarm came in 20 minutes after the systems went down.
The backups that Dick Visser is responsible for also worked, and the entire webserver could be completely restored from scratch. The RPO for this system (1 day) stems from the time it was first put into production a few years ago, when it contained much less user-contributed content and updates happened less frequently. This could be improved to something like 1h.
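To make the RPO point concrete: the loss window is the time between the last completed backup and the incident. The sketch below assumes backups complete exactly on schedule, starting at midnight, and uses the 14:06 deletion time from the incident description; the function name `data_loss_window` is illustrative.

```python
from datetime import datetime, timedelta


def data_loss_window(incident, backup_interval, first_backup):
    """Time between the most recent scheduled backup and the incident."""
    elapsed = incident - first_backup
    completed = elapsed // backup_interval  # whole intervals since first backup
    last_backup = first_backup + completed * backup_interval
    return incident - last_backup


incident = datetime(2018, 2, 13, 14, 6)
midnight = datetime(2018, 2, 13, 0, 0)

daily = data_loss_window(incident, timedelta(days=1), midnight)
hourly = data_loss_window(incident, timedelta(hours=1), midnight)

print(daily)   # 14:06:00
print(hourly)  # 0:06:00
```

Under these assumptions, an hourly schedule would have reduced the loss on wordpress1.geant.org from about fourteen hours of content to a few minutes.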
