Purpose of the document

GÉANT provides a large number of services which are used by the community and internally.  These publicly exposed services have, in most cases, further dependencies on system services such as authentication systems and databases.  The expectation is that these services are available 100% of the time, but in practice they occasionally fail.  Duplicate copies of some services therefore exist to provide redundancy, but while these are designed to protect against loss of data, failing over the user interface requires manual intervention, which incurs a delay until an operator carries it out (this can sometimes extend to a day or two if, for instance, the outage occurs at the weekend).  A further problem is that even where a service is redundant at the data and user-interface level, the dependent services on which it relies often are not, which means there is nothing to fail over to and the underlying service must be repaired before the service is fully restored.

The following describes structures and facilities that will be introduced to improve service resilience and reliability in GÉANT.

Problems being addressed

A couple of examples of specific problems that have occurred in the recent past are listed here.

Dashboard

Two production dashboard servers exist, but users access a specific dashboard via a DNS CNAME, dashboard-primary.geant.net.  If the server to which the CNAME refers fails for any reason, an operator needs to manually adjust the value of the CNAME to point to the other dashboard.

Crowd authentication

Many systems, such as JIRA and Dashboard, depend on the Crowd server for user authentication and access control.  The Crowd server has failed occasionally in the past, preventing user access to Dashboard and JIRA.  Providing a second Crowd server to fail over to is relatively straightforward (indeed, the uat-crowd server is configured identically to prod-crowd), but failing over to it is still a manual task in the current infrastructure.

A number of other services, such as cacti, poller and compendium, and generally most of the newer servers, have redundant copies deployed (as prod-somename01 and prod-somename02 servers), but none are set up for automatic failover and all would suffer a loss of service if the 01 server failed.

In general, all services can be deployed as redundant instances to provide high availability, and the impact on SWD is low.  Services should be deployed such that every instance is able to act as the primary.

Design goals

Provide an infrastructure which supports automatic service failover.  Service failover should be invisible to users of a service.

Provide an infrastructure which provides service discoverability, where services:

  • auto-register when available
  • auto-deregister when no longer available

Provide a zero-downtime scheduled maintenance framework: in a system which employs redundant services, maintenance can be scheduled such that there is no loss of service and users are unaware that maintenance is ongoing.

Automate service recovery, tolerant of system failure, within the context of the service reliability infrastructure.  In other words, where possible, services should self-heal.

Monitor service availability. Services should be monitored for availability and redundancy. 

Identify and document remaining single points of failure.

Platform design

The current infrastructure tightly couples each service to the server it is deployed on.  The following describes a setup which maintains the coupling of a service to a specific server but uses service discovery tools and DNS to advertise services dynamically.

The service discovery tool is Consul, an agent developed by HashiCorp (the company also responsible for Vagrant and Terraform).

The Consul agent operates in either client or server mode.  Consul clients, running on GÉANT Linux servers, register services with the Consul server.  The Consul server then has a record of the services running on the network and where they are located.  With this information it can either act as a source for DNS lookups (via a DNS forward zone) or update the network DNS server with the information it holds about network services.
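As an illustration of the DNS forward zone option, the sketch below resolves a service name against a Consul server's DNS interface (Consul listens for DNS on port 8600 by default).  It assumes the dnspython package (2.x); the server address 10.0.0.10 and the service name dashboard are placeholders, not the actual GÉANT values.

    # Sketch: resolve a Consul-advertised service via the Consul DNS interface.
    # Assumes dnspython; 10.0.0.10 and "dashboard" are placeholders.
    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["10.0.0.10"]   # a Consul server (or local agent)
    resolver.port = 8600                   # Consul's default DNS port

    # Consul answers <service>.service.consul with healthy instances only.
    answers = resolver.resolve("dashboard.service.consul", "A")
    for record in answers:
        print(record.address)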



In this setup the Consul client agent is installed on every node.  The Consul configuration for a service is included with the service when it is deployed on a server, and the agent uses this to register the service with the Consul server.  The configuration includes a health check stanza which the Consul server uses to confirm the health of a particular service and thus determine whether the service is advertised for use.
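As a minimal sketch of such a registration (the service name, port and check URL are illustrative assumptions, not the actual GÉANT configuration), a service and its health check can be registered with the local Consul agent over its HTTP API.  In practice the same definition would normally sit in a JSON file in the agent's configuration directory so that registration happens automatically when the service is deployed.

    # Sketch: register the dashboard service, plus an HTTP health check, with
    # the local Consul agent.  Name, port and check URL are illustrative only.
    import requests

    service_definition = {
        "Name": "dashboard",
        "Port": 8080,
        "Check": {
            "HTTP": "http://127.0.0.1:8080/health",  # hypothetical health endpoint
            "Interval": "10s",                        # how often Consul runs the check
            "Timeout": "2s",
        },
    }

    # The local agent listens on 127.0.0.1:8500 by default.
    response = requests.put(
        "http://127.0.0.1:8500/v1/agent/service/register",
        json=service_definition,
    )
    response.raise_for_status()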

Services are advertised in DNS.  The Consul servers can be queried directly (via a DNS forwarder, for instance); alternatively, the Consul server can update GÉANT's DNS infrastructure by periodically pushing DNS zone files to Infoblox.  A DNS query port on a Consul server is a single point of failure, so pushing a zone file to Infoblox is the more resilient solution.
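A minimal sketch of the zone-file generation step is shown below, using the Consul HTTP API; the zone name services.geant.net is a placeholder, and the upload to Infoblox (via the Infoblox WAPI) is deliberately omitted.

    # Sketch: build DNS A records for every healthy instance of each registered
    # service, suitable for inclusion in a zone file.  Zone name is hypothetical.
    import requests

    CONSUL = "http://127.0.0.1:8500"
    ZONE = "services.geant.net"

    # All services known to the catalogue.
    services = requests.get(f"{CONSUL}/v1/catalog/services").json()

    records = []
    for name in services:
        # Only instances whose health checks are passing are advertised.
        instances = requests.get(
            f"{CONSUL}/v1/health/service/{name}", params={"passing": "true"}
        ).json()
        for instance in instances:
            address = instance["Service"]["Address"] or instance["Node"]["Address"]
            records.append(f"{name}.{ZONE}. 60 IN A {address}")

    zonefile = "\n".join(sorted(records)) + "\n"
    print(zonefile)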

Example for the dashboard service (using the image above)

In the image above there are two Linux nodes (VMs), each running dashboard.  In addition, node 1 also hosts an instance of cacti, and node 2 hosts the bodportal.  Each node runs the Consul agent in "client" mode and registers all the services it knows about with the Consul server.  The Consul server periodically assembles all registered services into a DNS zone file which it pushes to the DNS infrastructure (which in GÉANT is Infoblox).  In this example the zone file will contain two entries for dashboard and one each for cacti and bodportal.

When a web client loads dashboard, the DNS query returns one of the IPs held for dashboard; in the image above, node 2's IP is returned.

The Consul server continues to accept service registrations from Consul clients as well as actively monitoring the health of registered services.  Should the dashboard instance on node 2 fail, for instance, the Consul server will detect this and remove node 2's IP for dashboard from the DNS zone file.  The next time a client loads dashboard it will only be offered the IP of node 1.
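The effect can be seen directly from the Consul health API: asking for the passing instances of dashboard returns both addresses while both health checks pass, and only node 1's address once the node 2 instance has failed.  A hedged sketch (agent address is assumed):

    # Sketch: list the addresses currently advertised for a service, i.e. only
    # instances whose health checks pass.  After the node 2 failure described
    # above, only node 1's address would be returned for "dashboard".
    import requests

    def advertised_addresses(service, consul="http://127.0.0.1:8500"):
        entries = requests.get(
            f"{consul}/v1/health/service/{service}", params={"passing": "true"}
        ).json()
        return [e["Service"]["Address"] or e["Node"]["Address"] for e in entries]

    print(advertised_addresses("dashboard"))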

Todo - Low-level design

Components:

  • Consul: provides service discovery; client and server modes.
  • Infoblox: DNS for services; the Infoblox API is used for deploying the zone file.
  • Consul agent (client) installed on all servers which have services to be discovered.
  • A quorum of Consul servers (3 or 5 Consul servers is recommended).
  • The Consul server generates a DNS zone file which is pushed to Infoblox.  The zone file is generated often (every minute?) but only pushed to Infoblox if it has changed (the change needs to be detected, with protection against flapping; see the sketch after this list).
  • Service checks for every service include
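The following is a sketch of the "only push if changed" guard mentioned in the list above; the state file path and hold-down time are illustrative assumptions, and the Infoblox upload itself is stubbed out as a callable.

    # Sketch of the "only push if changed" guard.  The Infoblox upload (via the
    # Infoblox WAPI) is passed in as a callable and not implemented here.
    import hashlib
    import pathlib
    import time

    STATE = pathlib.Path("/var/lib/consul-dns/last_pushed.sha256")  # assumed path
    HOLD_DOWN_SECONDS = 300  # don't re-push within 5 minutes, to damp flapping

    def push_if_changed(zonefile_text, push):
        digest = hashlib.sha256(zonefile_text.encode()).hexdigest()
        if STATE.exists():
            last_digest, last_time = STATE.read_text().split()
            if digest == last_digest:
                return False      # nothing changed: no push
            if time.time() - float(last_time) < HOLD_DOWN_SECONDS:
                return False      # changed again too soon: damp flapping, retry later
        push(zonefile_text)       # e.g. upload the zone file to Infoblox
        STATE.parent.mkdir(parents=True, exist_ok=True)
        STATE.write_text(f"{digest} {time.time()}")
        return True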


