Purpose of the document

GÉANT provides a large number of services which are used by the community and internally.  These publicly exposed services have, in most cases, further dependencies on hidden services such as authentication systems and databases.  The expectation is that these services will be available 100% of the time, but in practice they occasionally fail. Given this, duplicate copies of some services exist to provide redundancy; however, while these are designed to protect against loss of data, failing over the user interface requires manual intervention, which incurs a delay until an operator carries it out (this can sometimes extend to a day or two if the outage occurs at the weekend, for instance). A further problem is that even where a service provides redundancy at the data and user-interface levels, the dependent services upon which it relies are often not redundant, which means there is nothing to fail over to and the underlying service must be repaired before the service is fully restored.

The following describes structures and facilities that will be introduced to improve service resilience and reliability in GÉANT.

Problems being addressed

A couple of examples of specific problems which have occurred in the recent past, and which the work proposed in this document aims to address, are listed here.

Dashboard

Two production dashboard servers exist, but users access a specific dashboard via a DNS CNAME, dashboard-primary.geant.net.  If the server to which the CNAME refers fails for any reason, an operator needs to manually adjust the value of the CNAME to point to the other dashboard.
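To make the failure mode concrete, the failover today amounts to an operator hand-editing and reloading a zone record along the following lines (the prod-dashboard01/02 hostnames are illustrative placeholders for the two production servers):

    ; before failover
    dashboard-primary.geant.net.  IN  CNAME  prod-dashboard01.geant.net.
    ; after prod-dashboard01 fails, an operator must edit and reload the zone
    dashboard-primary.geant.net.  IN  CNAME  prod-dashboard02.geant.net.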

Crowd authentication

Many systems, such as JIRA and Dashboard, depend on the Crowd server for user authentication and access control. It has failed occasionally in the past, preventing user access to Dashboard and JIRA.  Providing a second Crowd server which can be failed over to is relatively straightforward (indeed, the uat-crowd server is configured identically to prod-crowd), but failing over to it is still a manual task in the current infrastructure.
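The manual step exists because each client application points at a single Crowd URL in its configuration. A typical Atlassian crowd.properties looks something like the sketch below (hostname illustrative); failing over means editing crowd.server.url and restarting the application:

    application.name                jira
    application.password            <secret>
    crowd.server.url                https://prod-crowd.geant.net/crowd/services/
    session.validationinterval      2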

A number of other services, such as cacti, poller, and compendium, and generally most of the newer servers, have redundant copies deployed (as prod-somename01 and prod-somename02 production servers). Similarly, two Crowd servers exist (prod-crowd and uat-crowd) and contain identical information, but systems (such as Dashboard, JIRA, and others) are configured to use one or the other for authentication.

As outlined above, a number of services hosted by GÉANT have suffered from poor availability from time to time.

A number of customer-facing services have single points of failure or, where redundancy has been employed, require manual intervention to access a specific redundant copy: for example, an end user browses directly to either the 01 or 02 version of a service, or one of these alternative services is reached via a DNS CNAME entry which needs to be updated manually.  This is also true of infrastructure services, such as databases, authentication servers, and other dependent services.

Examples:

...

but none are set up for automatic failover, and all would suffer a loss of service if the 01 server failed.

Generally, all service deployments can be done in the context of redundant services to provide high availability.  The impact on SWD is very low.  Services should be deployed such that any copy can act as the primary.

Design goals

Provide an infrastructure which supports automatic service failover.  Service failover should be invisible to users of a service.

...

Follow-up work: identify and document remaining single points of failure.

...

Platform design

The current infrastructure tightly couples services with the servers they are deployed upon.  The following describes a setup which maintains the coupling of a service to a specific server, but uses service discovery tools together with DNS to advertise services dynamically.

Each server will run the Consul agent and include a configuration listing the services it runs and how to monitor them (to test whether they are serviceable).  This configuration should be maintained in Puppet.
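As a minimal sketch, the per-service configuration deployed to a server could be a standard Consul service definition such as the following (the dashboard service name, port, and /health endpoint are assumptions for illustration):

    {
      "service": {
        "name": "dashboard",
        "port": 443,
        "check": {
          "http": "https://localhost:443/health",
          "interval": "10s",
          "timeout": "2s"
        }
      }
    }

The check stanza is what allows Consul to decide whether this instance should be advertised.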

The Consul servers create, update, and push a DNS zone file.  This should occur as frequently as is reasonable, to minimise the chance that a DNS query for a service returns stale data.  Perhaps every 5 minutes, or more frequently?
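The pushed zone might contain one record per healthy instance of each service, for example (the service subdomain, names, and addresses are purely illustrative):

    ; generated and pushed by the Consul servers; only healthy instances appear
    dashboard.service.geant.net.   300  IN  A  10.0.1.11
    crowd.service.geant.net.       300  IN  A  10.0.2.21

A short TTL (300 seconds here) keeps client caches roughly in step with the regeneration interval.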


In this setup the Consul client agent is installed on every node.  The Consul configuration for a service is included with the service when it is deployed on a server, and the agent uses this to register the service with the Consul server.  The agent configuration includes a health check stanza which the Consul server uses to confirm the health of a particular service and thus determine whether the service is advertised for use.

Services are advertised via DNS. 
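For example, Consul's built-in DNS interface (listening on port 8600 by default) answers queries under the .consul domain and returns only healthy instances; the service name below is the hypothetical one used earlier:

    $ dig @127.0.0.1 -p 8600 dashboard.service.consul +short
    10.0.1.11

Whether clients query Consul directly or via the pushed zone file described above, a failing health check removes an instance from the DNS answers, which is what makes failover automatic and invisible to users.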

Components

Consul: provides service discovery

...