Incident description

The server that runs all the WordPress sites became unreachable at 12:10:59 CET.

Incident severity: CRITICAL

Data loss: NO

Monitoring alerted: YES


Time (CET)
12:10 Apache server stopped accepting incoming requests

Chris Atherton reported on the #it channel that the site was not working correctly


Konstantin Lepikhov confirmed the issue with the wordpress1 site on the #devops channel


Dick Visser connected to the VM via the console and confirmed that the network was down (gateway not reachable)


Massimiliano Adamo restarted the network service inside the VM, after which the network came up and everything started working.


Konstantin Lepikhov announced that the problem was fixed.

Total downtime: 20 minutes.


As part of BAU and the handover of his responsibilities, Dick Visser was migrating a VM from the University of Amsterdam into the GEANT VMware cluster in Frankfurt.

The IP address allocated for this purpose was still in use by a test VM. This test VM was no longer used, so it was halted and subsequently deleted at 11:51:15.

The migration of the VM into the Frankfurt cluster was then started at 11:57 and finished at 12:01.

This went smoothly and the VM was powered up at 12:02:10. Once signed in to the console, its IP address and gateway were assigned and tested.

Once connectivity was confirmed, the VM was halted and powered up again at 12:09:48 to make sure that everything worked as expected and that the VM would come back up correctly after an unintended reboot.

According to the logs, a minute later the VM was live migrated from fra-prd-esx01 to fra-prd-esx02. This process started at 12:10:49 and finished 10 seconds later at 12:10:59.

This was done by DRS, the VMware Distributed Resource Scheduler. Its purpose is to optimise performance by distributing VMs evenly across the hypervisors. It is a standard VMware feature and runs fully automatically.

The last known-good entry in the Apache log file is from about 5 seconds before the live migration finished:

2018-02-22 12:10:54.342563 "GET /wp-content/plugins/templatesnext-toolkit/css/owl.carousel.css?ver=2.2.1 HTTP/1.1" 200 934 "" "Mozilla/5.0 (Linux; Android 5.0; Zn1 Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/ Mobile Safari/537.36"
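To make the relationship between the last good request and the DRS migration explicit, the timestamps can be compared directly. The following is a minimal sketch, not part of the original investigation. It assumes that the migration happened on the same date as the Apache log entry (the report gives only times for the migration) and that the hypervisor and VM clocks were in sync.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamp of the last known-good Apache log entry, copied from the report.
last_good = datetime.strptime("2018-02-22 12:10:54.342563", FMT)

# DRS live-migration window from the report (times in CET).
# Assumption: the date is taken from the Apache log entry, because the
# report states only the times of the migration.
drs_start = datetime(2018, 2, 22, 12, 10, 49)
drs_end = datetime(2018, 2, 22, 12, 10, 59)

# Position of the last good request relative to the migration window.
after_start = (last_good - drs_start).total_seconds()
before_end = (drs_end - last_good).total_seconds()
inside_window = drs_start <= last_good <= drs_end

print(f"after migration start: {after_start:.3f} s")
print(f"before migration end:  {before_end:.3f} s")
print(f"inside migration window: {inside_window}")
```

Under these assumptions the last successful request was served roughly 5.3 seconds after the live migration started and 4.7 seconds before it finished, so Apache was still answering requests while the migration was in progress.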

Logging in to the console of the VM showed that everything was running, but there was no network connectivity. Even the gateway was not reachable.

After Massimiliano Adamo restarted the network on the VM at 12:29:18, everything started working again.

The timing of all events indicates that powering on the newly migrated VM triggered DRS to migrate the VM to another hypervisor, and that network connectivity to the VM was lost during this process.

The reason for the loss of connectivity needs to be investigated further, because DRS moving VMs around is common practice and should not impact VMs at all.
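The timing argument above can be checked with a short calculation over the timestamps given in this report. This is an illustrative sketch only: the event labels are made up for the example, and the date is assumed to be the one in the Apache log entry.

```python
from datetime import datetime, timedelta

# Times (CET) as stated in this report. The date is an assumption taken
# from the Apache log entry; the narrative itself gives only times.
DAY = "2018-02-22"

# Event labels are hypothetical names chosen for this sketch.
EVENTS = {
    "migrated VM powered up again": "12:09:48",
    "DRS live migration started": "12:10:49",
    "DRS live migration finished": "12:10:59",
    "network restarted on the VM": "12:29:18",
}

def parse(clock: str) -> datetime:
    """Turn an HH:MM:SS clock time into a datetime on the assumed day."""
    return datetime.strptime(f"{DAY} {clock}", "%Y-%m-%d %H:%M:%S")

def gap(earlier: str, later: str) -> timedelta:
    """Return the time elapsed between two named events."""
    return parse(EVENTS[later]) - parse(EVENTS[earlier])

power_on_to_drs = gap("migrated VM powered up again", "DRS live migration started")
outage = gap("DRS live migration finished", "network restarted on the VM")

print("power-on to DRS start:", power_on_to_drs)
print("unreachable to network restart:", outage)
```

The first interval is 61 seconds, which matches the "a minute later" in the narrative. The second is 18 minutes 19 seconds, which is consistent with the reported total downtime of 20 minutes if that figure is rounded or counted up to the announcement that the problem was fixed.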


DRS migration:
