Incident description

The server that hosts all the WordPress sites (wordpress1.geant.org) became unreachable at 12:10:59 CET.

Incident severity: CRITICAL

Data loss: NO

Monitoring alerted: YES

Timeline

Time (CET)
12:10

Apache server stopped accepting incoming requests

12:12

Chris Atherton reported on the #it channel that the site aac-project.eu was not working correctly

12:21

Konstantin Lepikhov confirmed the issue with the wordpress1 server on the #devops channel

12:23

Dick Visser connected to the VM via the console and confirmed that the network was down (gateway not reachable)

12:29

Massimiliano Adamo restarted the network service inside the VM, after which the network came up and everything started working again.

12:30

Konstantin Lepikhov announced that the problem was fixed.

Total downtime: 20 minutes.
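
The downtime figure can be cross-checked against the timeline with a short calculation. The following is a minimal sketch in Python, not part of the original report: the helper name `elapsed` is hypothetical, and because the timeline gives the start and recovery times only to the minute, the `:00` seconds on those two timestamps are an assumption. The migration times are the ones quoted in the Analysis section below.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two same-day HH:MM:SS timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Timeline entries (CET); seconds are assumed to be :00 because the
# report records these two events only to the minute.
outage = elapsed("12:10:00", "12:30:00")

# Live migration window as quoted in the Analysis section.
migration = elapsed("12:10:49", "12:10:59")

print(f"Downtime: {outage.total_seconds() / 60:.0f} minutes")
print(f"Live migration: {migration.total_seconds():.0f} seconds")
```

Both results agree with the figures stated in the report (20 minutes and 10 seconds).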

Analysis


As part of BAU and the handover of his responsibilities, Dick Visser was migrating a VM from the University of Amsterdam into the GEANT VMware cluster in Frankfurt.

...

According to the logs, a minute later the wordpress1.geant.org VM (IP address 83.97.92.46) was live-migrated from fra-prd-esx01 to fra-prd-esx02. This process started at 12:10:49 and finished 10 seconds later, at 12:10:59.

...

The reason for this needs further investigation, because DRS moving VMs around is common practice and should not impact VMs at all.






Logs/screendump



DRS migration:

[Image: DRS migration log]