The SP Proxy (https://login.terena.org/wayf) uses metadata that is generated by another host: https://pioneer.terena.org/mr.
This host polls various metadata sources from cron, using SimpleSAMLphp's metarefresh module.
The reason for using two VMs is that the metarefresh module would sometimes get stuck validating XML signatures, and then the SP proxy would hang as well.
This behaviour has been fixed, but the dual VM set-up is still a good idea, so it will stay this way.
When pioneer.terena.org/mr has finished, it uses rsync to synchronise the directory that holds the metadata to login.terena.org/wayf.
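The refresh-then-sync step could be wired together roughly as follows. This is only a sketch: the refresh command, the directory and the rsync target are placeholders, and running rsync only after a successful refresh is an assumption rather than a description of the actual cron job.

```python
import subprocess


def rsync_command(source_dir, target):
    """Build an rsync call that copies the contents of source_dir to target."""
    # The trailing slash makes rsync copy the directory's contents,
    # not the directory itself.
    return ["rsync", "-a", source_dir.rstrip("/") + "/", target]


def refresh_then_sync(refresh_cmd, source_dir, target, run=subprocess.run):
    """Run the refresh command, then rsync only if the refresh exited with 0."""
    if run(refresh_cmd).returncode != 0:
        return False  # leave the copy on the proxy untouched
    return run(rsync_command(source_dir, target)).returncode == 0
```

The `run` parameter exists only so the function can be exercised without actually calling rsync.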
pioneer.terena.org/mr is currently configured to poll a couple of dozen URLs for metadata.
This process works, and it has been improved so that when refreshing/polling fails, the previous (cached) metadata is re-used instead of being discarded.
This is already an improvement. When it happens an error is logged, but noticing it requires an administrator to look at the log files, which does not happen.
So, the error goes unnoticed, and eventually the cached metadata will expire, because it has a set lifetime embedded.
At this stage IdPs will start to disappear from the metadata, until no valid entries are left.
At this point the service will be unavailable.
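The failure mode described above can be illustrated with a small sketch. It assumes that each cached entry carries its expiry as a Unix timestamp under a key called `expire`; the key name and the data layout are chosen for illustration only.

```python
import time


def unexpired(entries, now=None):
    """Return only the entries whose embedded lifetime has not run out yet."""
    now = time.time() if now is None else now
    return {
        entity_id: meta
        for entity_id, meta in entries.items()
        # Entries without an expiry are treated as never expiring.
        if meta.get("expire", float("inf")) > now
    }
```

Without a successful refresh the result shrinks as time passes and eventually becomes empty, which is the moment the service stops working.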
During the SimpleSAMLphp hackathon on 27 May 2015 an attempt was made to add some form of monitoring to metarefresh.
This can be done in many ways, although not all of them were considered:
In the end the following approach was chosen. The metarefresh process already logs all possible errors, and also stores any Conditional GET values for each URL in a state file.
By slightly adapting the metarefresh module, it was possible to create an additional state file that holds error information about the URL.
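The idea of a separate error state file can be sketched as below. This is not the code that was added to metarefresh: the JSON format, the `error`/`since` field names and the function names are all assumptions made for the example.

```python
import json
import os
import tempfile
import time


def record_result(state, url, error, now=None):
    """Update the in-memory error state for one polled URL."""
    if error is None:
        state.pop(url, None)  # a successful poll clears any earlier error
        return
    now = int(time.time()) if now is None else now
    # Keep the timestamp of the first failure while the URL keeps failing.
    since = state.get(url, {}).get("since", now)
    state[url] = {"error": error, "since": since}


def save_state(path, state):
    """Write the state atomically, so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle, indent=2, sort_keys=True)
    os.replace(tmp_path, path)
```

Only failing URLs are kept in the file, so an empty file means that the last run had no errors.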
This state file is then parsed by Nagios through check_by_ssh. This is not the best approach security-wise; it could later be improved so that the data is exposed over HTTPS and protected by a secret that the client has to supply.
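A check that Nagios could run through check_by_ssh might look roughly like this. The state file format (JSON, one entry per failing URL with an `error` field) is an assumption; only the exit codes follow the standard Nagios plugin convention.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(state):
    """Turn the error state into a (exit_code, message) pair."""
    if not state:
        return OK, "OK - no metadata source is reporting errors"
    details = "; ".join(
        f"{url}: {info['error']}" for url, info in sorted(state.items())
    )
    return CRITICAL, f"CRITICAL - {len(state)} metadata source(s) failing: {details}"


def check(path):
    """Read the (hypothetical) JSON state file and evaluate it."""
    try:
        with open(path) as handle:
            state = json.load(handle)
    except (OSError, ValueError) as exc:
        return UNKNOWN, f"UNKNOWN - cannot read state file: {exc}"
    if not isinstance(state, dict):
        return UNKNOWN, "UNKNOWN - unexpected state file format"
    return evaluate(state)
```

A wrapper script would print the message and exit with the returned code, which is what Nagios expects from a plugin.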
Several issues were encountered:
Because the logic of metarefresh is used, the same errors can be detected. This comes down to any connection errors, and all HTTP response codes other than 200 (OK) and 304 (Not Modified).
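The rule above can be written down as a short sketch. It does not reproduce the metarefresh code; the function names are made up, and the standard library's urllib is used here purely for illustration.

```python
import urllib.error
import urllib.request

ACCEPTED_STATUSES = {200, 304}  # OK and Not Modified


def describe_error(status):
    """Return None for an acceptable outcome, otherwise a short error text.

    A status of None stands for a connection error (no HTTP response at all).
    """
    if status is None:
        return "connection error"
    if status not in ACCEPTED_STATUSES:
        return f"unexpected HTTP status {status}"
    return None


def fetch_status(url, timeout=10):
    """Fetch url and return its HTTP status, or None on a connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # urllib raises for non-2xx responses, including 304
    except OSError:
        return None  # URLError, timeouts and socket errors all end up here
```

For example, `describe_error(fetch_status(url))` yields `None` for a healthy source and an error text otherwise.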
A few basic tests were done to confirm the functionality, such as having a URL return an HTTP 500 error:
Based on the experiences, some recommendations: