...

During the investigation of the incident with the IDP, somebody restarted the sympa service on the production instance (prod-lists01.geant.net):

(All times below are UTC.)

Aug  3 12:57:35 prod-lists01 bounced[29387]: notice main::sigterm() Signal TERM received, still processing current task
Aug  3 12:57:35 prod-lists01 bounced[29387]: notice main:: Bounced exited normally due to signal
Aug  3 12:57:37 prod-lists01 archived[29380]: notice main::sigterm() Signal TERM received, still processing current task
Aug  3 12:57:37 prod-lists01 archived[29380]: notice main:: Archived exited normally due to signal
Aug  3 12:57:39 prod-lists01 bulk[29373]: notice main::sigterm() Signal TERM received, still processing current task
Aug  3 12:57:39 prod-lists01 bulk[29373]: notice main:: Bulk exited normally due to signal
Aug  3 12:57:41 prod-lists01 sympa_msg[29366]: notice main::sigterm() Signal TERM received, still processing current task
Aug  3 12:57:41 prod-lists01 sympa_msg[29366]: notice main:: Sympa/msg exited normally due to signal
Aug  3 12:57:43 prod-lists01 task_manager[29392]: notice main::sigterm() Signal TERM received, still processing current task
Aug  3 12:57:43 prod-lists01 task_manager[29392]: notice main:: Task_Manager exited normally due to signal
Aug  3 12:57:45 prod-lists01 sympa/health_check[27182]: info main:: Configuration file read, default log level 0
Aug  3 12:57:46 prod-lists01 sympa_msg[27188]: info main::_load() Configuration file read, default log level 0
Aug  3 12:57:46 prod-lists01 sympa_msg[27188]: notice main:: Starting sympa/msg daemon, PID 27190
Aug  3 12:57:46 prod-lists01 sympa_msg[27190]: notice main:: Sympa/msg 6.2.8 Started
Aug  3 12:57:46 prod-lists01 sympa_msg[27190]: err main::#226 > Conf::checkfiles#827 Cannot access cafile /etc/sympa/ca-bundle.crt

...


sympa_msg died because the ca-bundle file was missing from the /etc/sympa directory. I don't know why it is missing; for now I have created a symlink pointing to /etc/pki/tls/certs/ca-bundle.crt. Sympa remained broken in this state until Aug 6th, ~13:00 CEST. The sympa_msg process is responsible for delivering messages from the ml spool to recipients. NOTE: only message delivery to recipients was broken; everything else (ml posting, archiving, etc.) worked normally.
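A check like the following, run before restarting the service, could have caught the missing file. This is only a minimal sketch: the function name `unreadable_files` is hypothetical, and the only path taken from this report is the cafile that sympa_msg complained about. Any other files the service depends on would have to be added by hand.

```python
import os


def unreadable_files(paths):
    """Return the paths that are missing or not readable by the current user.

    os.path.isfile() follows symlinks, so a symlink to an existing bundle
    passes the check, while a dangling symlink is reported.
    """
    return [p for p in paths
            if not (os.path.isfile(p) and os.access(p, os.R_OK))]


# Assumption: this list is illustrative. Only the cafile path comes from
# the log excerpt in this report.
REQUIRED_FILES = ["/etc/sympa/ca-bundle.crt"]

for path in unreadable_files(REQUIRED_FILES):
    print(f"missing or unreadable: {path}")
```

If the loop prints anything, the restart should be postponed until the listed files are back in place.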


Incident severity: CRITICAL

Data loss: NO

Total disruption: 3 days.
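The 3-day figure can be checked against the two timestamps given above: the sympa_msg error at Aug 3 12:57:46 UTC and the recovery at roughly Aug 6 13:00 CEST. The sketch below makes two assumptions: CEST is taken as UTC+2, and the year is an arbitrary placeholder because the report does not state it (the result does not depend on it).

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2))  # Central European Summer Time, UTC+2
YEAR = 2000  # placeholder only; the report does not give the year


def outage_duration(year=YEAR):
    """Time between the sympa_msg failure and the approximate recovery."""
    start = datetime(year, 8, 3, 12, 57, 46, tzinfo=timezone.utc)  # from the log
    end = datetime(year, 8, 6, 13, 0, tzinfo=CEST)  # approximate, from the report
    return end - start


print(outage_duration())
```

This gives about 70 hours, which rounds to the 3 days stated above.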

Affected mail lists

The following mail lists were affected:

...

  • The Puppet configuration lacks many things that sympa still depends on. Strictly speaking, in its current state this Puppet configuration is not fully suitable for managing the service, because many critical files are handled manually.

Timeline

...


Resolution

We need to rewrite the existing Puppet module as soon as possible because it does not handle such dependencies properly. The work has started in the test branch, and the state there is already much better (sympa installation and config handling are fully automated, and it already runs a recent version).

...