  • The core issue seems to be related to VMware storage, and IT needs to provide a solution for monitoring the health of the VM storage
  • The HA of primary services (EMS, Map, etc.) and dependent core services (PostgreSQL etc.) needs to be improved:
    • Additional monitoring configuration or setup changes are required to more easily identify the root cause of an outage
      • If a core service goes down (e.g. no PostgreSQL primary set), that alarm will be hard to spot in the hundreds of subsequent alarms
    PETE AND MAX TO PLEASE FILL IN 
  • Postgres 
    • FAILURE: fail-over did not work because it requires an active witness, and the witness had failed because it runs on the same VM cluster as the primary DB server
    • Proposed solution:
      • add two more witness servers so we have one on every site
      • second suggestion would be to change the way the Postgres clients connect so they are cluster-aware (i.e. all nodes are listed in the connection string with the targetServerType=primary option) and connect directly to Postgres. This would remove the need for the Consul config / intermediary (fewer parts, less to break)
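
As a rough illustration of the second suggestion, the sketch below renders a multi-host connection URL in the form the PostgreSQL JDBC driver accepts, with every node listed and targetServerType=primary set. The host names, the database name and the build_jdbc_url helper are hypothetical placeholders, not taken from the current setup, and it assumes the clients use the JDBC driver (older driver versions may expect a different value for this option, so check the documentation for the version in use).

```python
from urllib.parse import urlencode


def build_jdbc_url(nodes, database, port=5432, **params):
    """Render a multi-host JDBC URL that lists every cluster node.

    Hypothetical helper for illustration only.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    # All nodes go into the URL so the driver can try each one in turn.
    hosts = ",".join(f"{host}:{port}" for host in nodes)
    # targetServerType=primary asks the driver to connect only to the
    # node that is currently the primary, skipping standbys.
    options = {"targetServerType": "primary", **params}
    return f"jdbc:postgresql://{hosts}/{database}?{urlencode(options)}"


# Hypothetical host names (one per site) and database name.
NODES = ["pg-site-a.example", "pg-site-b.example", "pg-site-c.example"]
print(build_jdbc_url(NODES, "ems"))
```

With a URL of this shape the client itself finds the primary after a fail-over, which is why the Consul lookup would no longer be needed. It does not replace the witness change, since the cluster still has to promote a new primary on its own.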