Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

The SP Proxy (https://login.terena.org/wayf) uses metadata that is generated by another host: https://pioneer.terena.org/mr.

This host polls various metadata sources from cron, using SimpleSAMLphp's metarefresh module.

...

When pioneer.terena.org/mr has finished, it uses rsync to synchronise the directory that hold holds the metadata to login.terena.org/wayf.

...

During the SimpleSAMLphp hackathon on 27 May 2015 a an attempt has been made to add some form of monitoring capabilities to metarefresh.

...

  • Write any custom Nagios checks and let the Nagios server poll the URLs. This option was rejected because the network checks should be done from the same host that does the metarefresh process. Otherwise subtle changes in network access lists, firewall, OS and library versions might yield different results.
  • Runs Run separate Nagios checks through/from the same host that does the metarefresh process. This approach was rejected because with dozens of URLs, any non-responsive URLs would significantly increase the time it takes to run the Nagios plugin. blah blah

In the end the follow approach was chosen. The metarefresh process already logs all possible errors, and also stores any Conditional GET values for each URL in a state file.

...

  • When the metarefresh config is changed, for example URLs are added/deleted/changed, the Nagios configuration needs to be changed as well, the Nagios process needs to be reloaded, and a forced check of the services needs to be scheduled.
  • The name of the service would be the src URL, but this contains lots of 'illegal' characters as far a Nagios is concerned. A first attempt to fix work around this was by stripping the path from the URL. This means that the name would be just the base URL, which makes sense to admin, but does not (yet?) contain illegal characters. However, this turned out to be a problem because metadata is polled from several sources that reside on the same base URL (http://my.org/saml/idp1, http://my.org/saml/idp2, etc). The base URL is the same in this case, which is also not allowed by Nagios. The final work-around was to append a hash of the URL to the base URL. This way it is easy to distinguish what domain/IdP has an issue, and the string will be unique for each URL
  • Because the way metarefresh is implemented, the URL state file is being overwritten for each URL, and only at the end the end will it yield the full status array. This causes a race condition with the Nagios check, it can happen that the Nagios check for a specific URL happens during the metarefresh, when there is no status information yet on that URL. A work-around was implemented by scheduling a simple cp command at the end of metarefresh cronjob. This will make sure the state file will always have all the URLs.

Because the logic of metarefresh is used, the same errors can be detected. This comes down to any connection errors, and all HTTP response codes other than 200 (OK) and 304 (Not Modified).

 

A few basic tests were done to confirm the functionality, such as having a URL return an HTTP 500 error:

 

Image Added

 

Based on the experiences, some recommendations:

 

  • Using the current metarefresh module has it's drawbacks. Notably, the race condition that exists in writing the state file is not a problem for the Conditional-GET state, is a real problem for the Nagios state file, and it not easy to fix. So it will be worth doing separate checks. 
  • blah