Incident description

The server that runs all the WordPress sites (wordpress1.geant.org) became unreachable at 12:10:59 CET.

Incident severity: CRITICAL

Data loss: NO

Monitoring alerted: YES

Timeline

Time (CET)
12:10

Apache server stopped accepting incoming requests

12:12

Chris Atherton reported on the #it channel that the site aac-project.eu was not working correctly

12:21

Konstantin Lepikhov confirmed the issue with wordpress1 on the #devops channel

12:23

Dick Visser connected to the VM via the console and confirmed that the network was down (gateway not reachable)

12:29

Massimiliano Adamo restarted the network service inside the VM, after which the network came up and everything started working again.

12:30

Konstantin Lepikhov announced that the problem was fixed.

Total downtime: 20 minutes.
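
The downtime figure follows from the first and last timeline entries. A minimal sketch of that arithmetic, assuming the outage is counted from the 12:10 entry to the 12:30 announcement (the variable names are illustrative, and the report gives times of day only, so no date is used):

```python
from datetime import datetime

# Timeline entries from the report (CET). Only times of day are given,
# so both values are parsed against the same default date.
FMT = "%H:%M"
outage_start = datetime.strptime("12:10", FMT)  # Apache stopped accepting requests
outage_end = datetime.strptime("12:30", FMT)    # fix announced

downtime_minutes = int((outage_end - outage_start).total_seconds() // 60)
print(f"Total downtime: {downtime_minutes} minutes")
```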

Analysis


As part of BAU and the handover of his responsibilities, Dick Visser was migrating a VM from the University of Amsterdam into the GEANT VMware cluster in Frankfurt.

...

According to the logs, a minute later the wordpress1.geant.org VM (which has IP address 83.97.92.46) was live-migrated from fra-prd-esx01 to fra-prd-esx02. This process started at 12:10:49 and finished 10 seconds later, at 12:10:59.

...

The timing of all events indicates that powering on the newly migrated VM triggered DRS to migrate the wordpress1.geant.org VM to another hypervisor. During this process, however, network connectivity to the VM was lost.
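
The timing correlation can be checked directly from the timestamps quoted in this report. The sketch below is illustrative only: it uses the migration start and end times from the logs and the time at which the server became unreachable, and the variable names are assumptions.

```python
from datetime import datetime

FMT = "%H:%M:%S"
# Timestamps as quoted in this report (CET).
migration_start = datetime.strptime("12:10:49", FMT)  # live migration began
migration_end = datetime.strptime("12:10:59", FMT)    # live migration finished
unreachable_at = datetime.strptime("12:10:59", FMT)   # server became unreachable

migration_seconds = (migration_end - migration_start).total_seconds()
gap_seconds = (unreachable_at - migration_end).total_seconds()

print(f"Migration took {migration_seconds:.0f} s")
print(f"Gap between migration end and loss of reachability: {gap_seconds:.0f} s")
```

A gap of zero seconds is consistent with the migration being the trigger, but it does not by itself explain why connectivity was lost.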

The reason for this needs to be investigated further, because DRS moving VMs around is common practice and should not normally impact VMs at all.






Logs/screendump



DRS migration:

[Image: DRS migration screenshot]