Incident description

During testing of CloudBolt (an interface from which people can order self-service VMs), a configuration was applied to our configuration management (Puppet) server that changed the configuration on some nodes, pointing them to the wrong environment (test instead of production). In this configuration, Puppet sees CloudBolt as an ENC (External Node Classifier). The ENC pushes some configuration to all the servers managed by Puppet, and when an ENC is in use, the configuration cannot be switched back by other means; it must be done from CloudBolt. As a result, the services picked up their configuration from the wrong environment, which caused the outage.
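To illustrate the mechanism, here is a minimal sketch of an ENC, not the CloudBolt integration itself. Puppet calls an ENC executable with the node's certname as its only argument and reads a YAML hash from standard output; an `environment` key in that hash takes precedence over the environment configured on the agent. The lookup table, the function names and the default environment are assumptions made for this example; only the node names come from the timeline.

```python
#!/usr/bin/env python3
"""Minimal ENC sketch: print a YAML classification for one node."""
import sys

# Hypothetical lookup table. A real ENC such as CloudBolt would consult
# its own data store instead of a hard-coded dict.
NODE_ENVIRONMENTS = {
    "prod-idp01": "production",
    "prod-idp02": "production",
}


def classify(certname, default_environment="production"):
    """Return the classification for one node as a plain dict."""
    return {
        "classes": {},
        "environment": NODE_ENVIRONMENTS.get(certname, default_environment),
    }


def to_yaml(classification):
    """Render the small dict above as YAML without third-party libraries."""
    return (
        "---\n"
        "classes: {}\n"
        f"environment: {classification['environment']}\n"
    )


# Puppet invokes the script as: enc.py <certname>
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(to_yaml(classify(sys.argv[1])))
```

If the table in such a script mapped a production node to `test`, the agent on that node would fetch the test catalog on its next run, regardless of its local settings.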

Some other production services were also pointed to the test environment, but there was no loss of service because the configuration for test and production was the same.
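A mismatch like this could in principle be detected by comparing each node's assigned environment with the one its name implies. The sketch below assumes a hostname prefix convention (`prod-`, `uat-`, `test-`) that is suggested by names such as prod-idp01 but is not confirmed by this report; the function names and the sample data are hypothetical.

```python
def expected_environment(node_name):
    """Guess the intended environment from an assumed hostname prefix convention."""
    if node_name.startswith("prod-"):
        return "production"
    if node_name.startswith("uat-"):
        return "uat"
    if node_name.startswith("test-"):
        return "test"
    return None  # no convention applies; the node is skipped


def find_mismatches(assigned):
    """Return {node: (expected, actual)} for nodes in the wrong environment."""
    mismatches = {}
    for node, actual in sorted(assigned.items()):
        expected = expected_environment(node)
        if expected is not None and expected != actual:
            mismatches[node] = (expected, actual)
    return mismatches


# Illustrative data only, not taken from the incident logs.
assigned = {
    "prod-idp01": "test",
    "prod-idp02": "production",
    "test-web01": "test",
}
print(find_mismatches(assigned))
```

A check of this kind only flags nodes whose names follow the assumed convention; nodes named differently would need an explicit inventory.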

Incident severity: CRITICAL

Data loss: NO

Affected Services 

The following services were inaccessible to GEANT staff members during the outage because they all use the GEANT Staff IdP for authentication:

    • Intranet
    • GEANT wiki
    • EventR
    • WordPress sites
    • Compendium
    • FileSender
    • FoD
    • Lifesize
    • BOX

Cause

Failure to follow the change management process; this needs further investigation.

Timeline


Time (CET)
03 Aug, 12:36

Issue reported by Cristian Bandea on the Slack channel #techies

03 Aug, 13:02

Andrew Jarvis sent direct Slack messages to Dick Visser and Konstantin Lepikhov but received no response.

03 Aug, 13:25

Andrew Jarvis contacted Massimiliano Adamo, who then investigated.

03 Aug, 13:46

user-7da5d pointed to CloudBolt and found the issue (wrong Puppet environment)

03 Aug, 14:00

user-7da5d switched off the prod-idp01 VM, leaving only prod-idp02 functioning

03 Aug, 16:00

SaltStack was used to fix all the servers at once


Total downtime: 1.5 hours

Resolution

The CloudBolt changes were reverted, and all production and UAT VMs were restored to their correct environments.
