...

The following services were inaccessible to GEANT staff members during the outage because they all use the GEANT Staff IdP for authentication:

    • SharePoint (e.g. Intranet, Partner Portal)
    • GEANT wiki
    • EventR
    • WordPress sites
    • Compendium
    • Filesender
    • FoD
    • Lifesize
    • BOX
    • Sympa (lists on the prod-lists01.geant.net server)

Cause

  • Failure to follow the change management process - needs further investigation.
  • Lack of planning. Preparing an isolated environment would have taken up to a week of work, but we were asked to make this work the same day. Massimiliano Adamo and Michael Haller asked to postpone and to meet on Friday.
  • Massimiliano Adamo asked twice in the meeting whether there were objections and whether we intended to take action. It was a team decision.
  • Severe lack of knowledge of Puppet integration with CloudBolt on the part of the consultants. The consultants were repeatedly asked certain questions and either evaded them or gave wrong answers (e.g. they said that CloudBolt provides Agent Bootstrap, but it does not (star), which implies that they do not know the implementation details). The consultants were asked clearly about the environment settings.

(star) Agents must be pre-installed and configured in the image (from the CloudBolt documentation: http://docs.cloudbolt.io/configuration-managers/puppet/index.html).
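
Since the issue was eventually traced to a wrong Puppet environment, a pre-change check of the environment an agent is configured to use might look roughly like the sketch below. It is only an illustration, not the procedure used during this incident: the function names, the sample configuration, and the hostname puppet.example.org are hypothetical, and it assumes the agent reads the `environment` setting from puppet.conf, with [agent] taking precedence over [main] and "production" as the default.

```python
import configparser


def puppet_environment(conf_text, default="production"):
    """Return the environment an agent would pick up from puppet.conf text.

    Assumption: [agent] overrides [main]; if neither section sets
    'environment', Puppet falls back to 'production'.
    """
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    for section in ("agent", "main"):
        if parser.has_option(section, "environment"):
            return parser.get(section, "environment").strip()
    return default


def check_environment(conf_text, expected):
    """Return (matches, actual) for the expected environment name."""
    actual = puppet_environment(conf_text)
    return actual == expected, actual


# Hypothetical puppet.conf content, for illustration only.
SAMPLE = """
[main]
server = puppet.example.org
environment = production

[agent]
environment = test
"""

if __name__ == "__main__":
    ok, actual = check_environment(SAMPLE, "production")
    if not ok:
        print("Environment mismatch: agent would use '%s'" % actual)
```

Running a check of this kind against an image before it is attached to production nodes would flag a mismatch without applying any catalog.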


Timeline


, whom then investigated.
Time (CET)
03 Aug, 12:36

Issue reported by Cristian Bandea on the Slack channel #techies

03 Aug, 13:02

Andrew Jarvis sent direct Slack messages to Dick Visser and Konstantin Lepikhov but received no response.

03 Aug, 13:25

Massimiliano Adamo started investigating

03 Aug, 13:35

03 Aug, 13:46

user-7da5d pointed to CloudBolt and found the issue (wrong Puppet environment)

03 Aug, 14:00

user-7da5d switched off the prod-idp01 VM, leaving only prod-idp02 functioning

03 Aug, 16:00

SaltStack was used to fix all the servers at once


...