Incident Description

Users could not log on to EMS. They were presented with the login screen, which took them to the IDP selection page (as normal). After successful authentication on the IDP, they were redirected to EMS. However, instead of being logged in on EMS, they were logged out.


The reason for degradation:

  • EMS/Indico stores user sessions in Redis
  • prod-events01.geant.org and prod-events02.geant.org could not resolve the hostname of the Redis server (master.production-events-redis.service.ha.geant.net)
  • With the connection to Redis lost, Indico could not create or manage user sessions
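
The error recorded in indico.log (`Error -2 ... Name or service not known`) is a name-resolution failure, not a refused or timed-out connection to a reachable server. As a minimal sketch, assuming only the Python standard library, the two cases can be told apart as follows. The function name `diagnose` and the injectable `resolver` and `connector` parameters are hypothetical; they exist only so the example can be exercised without network access.

```python
import socket

REDIS_HOST = "master.production-events-redis.service.ha.geant.net"
REDIS_PORT = 6379


def diagnose(host, port, resolver=socket.getaddrinfo,
             connector=socket.create_connection, timeout=3.0):
    """Return 'dns_failure', 'connect_failure' or 'ok' for host:port."""
    try:
        # Step 1: name resolution. A failure here corresponds to
        # "Name or service not known" in indico.log.
        resolver(host, port)
    except socket.gaierror:
        return "dns_failure"
    try:
        # Step 2: TCP connection to the resolved name.
        conn = connector((host, port), timeout)
    except OSError:
        return "connect_failure"
    conn.close()
    return "ok"
```

Calling `diagnose(REDIS_HOST, REDIS_PORT)` on an affected host would be expected to return `'dns_failure'` while the resolver is misbehaving.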


The impact of this service degradation was:

  • Users could not manage their events, for example:
    • Editing the event
    • Accessing registration lists
    • Sending out reminder emails


Incident severity: Med (partial service degradation)

...

Total duration of incident:  15 hours


Timeline

All times are in UTC

Date   Time   Description

 

21:55:53 

First error in indico.log of redis being unavailable:

ConnectionError: Error -2 connecting to master.production-events-redis.service.ha.geant.net:6379. Name or service not known.

 

10:42

First user query about EMS login problem (Slack #general)

 

11:14

Ian Galpin identified the DNS resolution problem

[root@prod-events01 log]# ping master.production-events-redis.service.ha.geant.net
ping: master.production-events-redis.service.ha.geant.net: Name or service not known

 

12:06

Service degradation incident email sent out to product owner (Steffie Bosman)

 

12:12

Massimiliano Adamo identified a problem with PowerDNS

[root@prod-events02 ~]# host slave.production-events-redis.service.ha.geant.net
slave.production-events-redis.service.ha.geant.net has address 83.97.94.19
[root@prod-events02 ~]# host slave.production-events-redis.service.ha.geant.net
slave.production-events-redis.service.ha.geant.net has address 83.97.94.19
Host slave.production-events-redis.service.ha.geant.net not found: 3(NXDOMAIN)
[root@prod-events02 ~]# host slave.production-events-redis.service.ha.geant.net
Host slave.production-events-redis.service.ha.geant.net not found: 3(NXDOMAIN)
[root@prod-events02 ~]# host slave.production-events-redis.service.ha.geant.net
Host slave.production-events-redis.service.ha.geant.net not found: 3(NXDOMAIN)
[root@prod-events02 ~]# host slave.production-events-redis.service.ha.geant.net
Host slave.production-events-redis.service.ha.geant.net not found: 3(NXDOMAIN)
[root@prod-events02 ~]# host slave.production-events-redis.service.ha.geant.net
slave.production-events-redis.service.ha.geant.net has address 83.97.94.19
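
The transcript above shows the same name alternately resolving and returning NXDOMAIN. A single lookup can therefore miss the fault. The following is a minimal sketch, assuming the Python standard library, of how repeated lookups could be sampled and summarised. The names `sample_lookups` and `summarise`, the defaults, and the injectable `resolver` and `sleep` parameters are hypothetical and are not part of any tooling used during the incident.

```python
import socket
import time


def sample_lookups(hostname, attempts=10, delay=0.5,
                   resolver=socket.gethostbyname, sleep=time.sleep):
    """Resolve hostname repeatedly; return a list of True/False results."""
    results = []
    for i in range(attempts):
        try:
            resolver(hostname)
            results.append(True)
        except OSError:  # socket.gaierror is a subclass of OSError
            results.append(False)
        if i < attempts - 1:
            sleep(delay)  # pause between lookups, but not after the last one
    return results


def summarise(results):
    """Return (failures, attempts, failure_ratio) for a list of results."""
    failures = results.count(False)
    attempts = len(results)
    ratio = failures / attempts if attempts else 0.0
    return failures, attempts, ratio
```

A non-zero failure count for a name that should always exist points at the resolver rather than at the service behind the name.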


Consul DNS resolution seemed to work when the Consul servers were queried directly:

dig slave.production-events-redis.service.ha.geant.net @prod-consul01.geant.org -p 8600
dig slave.production-events-redis.service.ha.geant.net @prod-consul02.geant.org -p 8600
dig slave.production-events-redis.service.ha.geant.net @prod-consul03.geant.org -p 8600


 

12:30

Massimiliano Adamo resolved the PowerDNS issue by disabling the packet cache (the disable-packetcache config option):

The problem was this parameter (we are almost sure):
https://docs.powerdns.com/recursor/settings.html#disable-packetcache
It defaults to no, but it is now set to yes.

The following GitHub issue might explain the behaviour: https://github.com/PowerDNS/pdns/issues/8160

 

13:01

Service restored email sent out to product owner (Steffie Bosman)

Proposed Solution

  • Additional monitoring (Sensu checks) will be added
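
As a rough illustration of what such a check might look like, the sketch below maps DNS lookup results to the exit codes that Sensu checks conventionally use (0 OK, 1 WARNING, 2 CRITICAL). The thresholds, the function name `check_resolution`, and the choice of Python are assumptions for illustration. This is not the check that was actually deployed.

```python
import socket

OK, WARNING, CRITICAL = 0, 1, 2  # exit codes used by Sensu/Nagios-style checks


def check_resolution(hostname, attempts=5, resolver=socket.gethostbyname):
    """Return (exit_code, message) after `attempts` lookups of hostname."""
    failures = 0
    for _ in range(attempts):
        try:
            resolver(hostname)
        except OSError:  # covers socket.gaierror (name not known)
            failures += 1
    if failures == 0:
        return OK, f"OK: {hostname} resolved {attempts}/{attempts} times"
    if failures < attempts:
        # Intermittent failures, as seen during this incident.
        return WARNING, f"WARNING: {hostname} failed {failures}/{attempts} lookups"
    return CRITICAL, f"CRITICAL: {hostname} did not resolve"
```

A wrapper script would print the message and exit with the returned code. Because intermittent failures alone were enough to break logins in this incident, treating any failed lookup as CRITICAL would be a reasonable alternative to the WARNING level assumed here.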

...