Incident Description

The EMS service was not running on either prod-events01.geant.org or prod-events02.geant.org, so EMS was unavailable.


The impact of this outage was:

  • Users could not access EMS
  • No emails from EMS were sent during the outage period


Incident severity: CRITICAL (complete service outage)

Data loss: NO

Total duration of incident: 2 days


Timeline

All times are in UTC

Date    Time    Description

 

06:20

According to the Puppet and systemd logs, Puppet tried to re-install EMS on prod-events01.geant.org and prod-events02.geant.org.

The pip re-install of EMS left partial EMS directories in the Python virtual environment. This behaviour is explained on the following Stack Overflow page: https://stackoverflow.com/questions/55565760/anaconda-python-site-packages-subfolders-with-tilde-in-name-what-are-they

 

08:18

A user enquired on Slack (#general) whether EMS was down

 

08:30 - 09:12

Ian Galpin investigated and resolved the issue

  • Determined that the venv for EMS was broken
  • Disabled puppet
  • Fixed virtualenv
  • Restarted services

 

09:12

Service was restored and Ian Galpin informed the user on the Slack channel 
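
The partial directories noted at 06:20 are, per the linked Stack Overflow page, entries in site-packages whose names start with a tilde, left behind when pip is interrupted part-way through replacing a package. A minimal sketch of how such leftovers could be listed is shown below. The function name and the directory passed to it are illustrative assumptions, not part of the actual fix applied during the incident.

```python
from pathlib import Path


def find_leftover_dirs(site_packages):
    """Return the sorted names of entries in site_packages that start with '~'.

    Assumption: a leading '~' marks a temporary directory that pip left
    behind after an interrupted uninstall or re-install.
    """
    root = Path(site_packages)
    return sorted(p.name for p in root.iterdir() if p.name.startswith("~"))
```

Running this against the site-packages directory of the EMS virtualenv (path not recorded in this report) would return an empty list for a clean environment; any names returned are candidates for manual inspection before Puppet is re-enabled.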

Proposed Solution

  • The primary reason Sensu did not detect the outage:
    • Sensu checks that the service is up with a standard HTTP check on the primary service hostname (https://events.geant.org), which is served by HAProxy
    • When both downstream servers (prod-events01.geant.org and prod-events02.geant.org) are unreachable, HAProxy returns a generic 'Maintenance' page
    • The 'Maintenance' page is returned with an HTTP 200 OK status, so the Sensu check marks the service as up and available
  • Massimiliano Adamo has updated HAProxy to return an HTTP error code instead of a 200 code
  • A secondary reason for not detecting the outage is that we're not monitoring the service availability on the downstream servers
    • Plans are in place to add additional monitoring and checks for all SWD services and to test the outage detection and alerting
  • Ian Galpin to determine why puppet and pip break the virtualenv
  • Mandeep Saini to present the Incident Handling Process presentation
    • Many new team members have joined since the last time the presentation was given
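
The two monitoring gaps above (a 'Maintenance' page served with 200 OK, and no checks on the downstream servers) could be covered by a check along the following lines. This is only a sketch under stated assumptions: the URL scheme and path used for the downstream hosts, the marker string looked for in the response body, and all function names are hypothetical, and this is not the monitoring that is actually planned for SWD services.

```python
import urllib.error
import urllib.request

# Downstream hosts named in this report.
BACKENDS = ["prod-events01.geant.org", "prod-events02.geant.org"]

# Assumption: the HAProxy fallback page contains this word in its body.
MAINTENANCE_MARKER = "Maintenance"


def is_healthy(status, body):
    """Healthy only if the status is 2xx AND the body is not the fallback page."""
    return 200 <= status < 300 and MAINTENANCE_MARKER not in body


def probe(url, timeout=5.0):
    """Fetch url and return (status, body); status 0 means unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return exc.code, ""
    except OSError:
        # Covers URLError, connection failures and timeouts.
        return 0, ""


def check_backends(fetch=probe):
    """Map each downstream host to True (healthy) or False.

    Assumption: each host answers on https://<host>/ directly.
    """
    return {host: is_healthy(*fetch(f"https://{host}/")) for host in BACKENDS}
```

Keeping the decision in is_healthy separate from the network call means the rule can be exercised without a live service, for example by passing a stub as the fetch argument. A check wrapping check_backends would report a failure whenever any value in the returned mapping is False.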

