Low data throughput for e-VLBI project
PTS Case 2
Very Long Baseline Interferometry (VLBI) is a radio-astronomy technique that correlates data from geographically dispersed radio telescopes in order to emulate a single very large telescope. Currently this is achieved in Europe by recording the data to a local hard disk array and then mailing the array to the central Joint Institute for VLBI in Europe (JIVE) in Dwingeloo, NL. The European VLBI Network (EVN) hopes to improve this process by sending the data directly to JIVE over data networks (NRENs and GÉANT2). For this to be successful, the sustained data rate from each site back to JIVE should be at least 128 Mbps, and ideally 512 Mbps. However, early tests were not able to consistently sustain even the lower rate of 128 Mbps. The tests conducted included live data transfers and 'iperf' tests, using both TCP and UDP.
The EVN case has been complicated by the fact that the EVN relies on very specialist data processors for collecting or transmitting the astronomical data (depending on whether the transfer is via hard disk pack or the network). These devices, Linux-based PCs called Mk5A systems, are located at each radio-telescope site and at JIVE (at JIVE there are multiple units, because one is needed for each remote site involved in a given observation).
The Mk5A system was designed to be compatible with the existing data correlators, and as such the data mimics the format of the original VLBI magnetic tapes. Specifically, each VLBI data frame consists of a 20 byte header and 2480 bytes of data. There is one such data frame per track, and many tracks are recorded simultaneously. The actual make-up of the application protocol data units (PDUs) is hidden from the operators, and is handled by proprietary hardware and software. VLBI data needs to be transferred at a specific rate, which may be 16 Mbps or any power-of-two multiple of it, up to 1024 Mbps (so 16, 32, 64, 128, 256, 512 or 1024 Mbps). TCP is the normal method used for transporting VLBI data. Although it is possible to use UDP, EVN experience had been that it brought no great improvement from the end-user's point of view.
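The frame and rate figures above can be turned into a small worked calculation. The following is only a sketch: it assumes that "Mbps" means 10^6 bits per second and that the quoted data rate counts the whole 2500-byte frame (header plus data), neither of which is stated in this case description. The function names are illustrative and are not part of any Mk5A software.

```python
# Sketch of the VLBI frame/rate arithmetic described above.
# Assumptions (not confirmed by the case description):
#   - 1 Mbps = 1,000,000 bits per second
#   - the quoted rate covers header + data of every frame

HEADER_BYTES = 20
DATA_BYTES = 2480
FRAME_BYTES = HEADER_BYTES + DATA_BYTES  # 2500 bytes per frame


def valid_rates_mbps(base=16, maximum=1024):
    """Return the permitted rates: the base rate doubled repeatedly up to the maximum."""
    rates = []
    rate = base
    while rate <= maximum:
        rates.append(rate)
        rate *= 2
    return rates


def frames_per_second(rate_mbps):
    """Frames that must be moved each second, summed over all tracks, at a given rate."""
    return rate_mbps * 1_000_000 / (FRAME_BYTES * 8)


def header_overhead():
    """Fraction of each frame taken up by the header."""
    return HEADER_BYTES / FRAME_BYTES


if __name__ == "__main__":
    for rate in valid_rates_mbps():
        print(f"{rate:5d} Mbps -> {frames_per_second(rate):8.0f} frames/s")
    print(f"header overhead: {header_overhead():.1%}")
```

Under these assumptions the minimum target of 128 Mbps corresponds to 6400 frames per second, and the header accounts for under one percent of the traffic, so the frame format itself is unlikely to explain a shortfall in throughput.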
Because Mk5A systems connected back to back had shown no problems with high-speed data transfers in bench tests, the e-VLBI community suspected that the root cause of the problems was network related.
In order to investigate network performance separately from any other issue, bwctl (a wrapper program for the bandwidth measurement tool iperf) was installed on various machines across the e-VLBI network, including the Mk5As themselves. Bwctl was also available on two Linux workstations, in the GÉANT UK and IT PoPs. This meant that pure network tests could be run between EVN sites, between GÉANT sites, or between any combination of the two (it should be noted, however, that the UK-IT GÉANT route is not part of any actual e-VLBI path).
Bwctl showed that the poorest TCP performance was experienced on the path between the Torun radio telescope (in Poland) and JIVE (with traffic tests only reaching 200-300 Mbps), so this was where effort was concentrated. The PERT's investigation (led by PSNC) found the bottleneck to be caused by core network devices installed in three locations in the PIONIER network. These switches (10 GE Black Diamonds BD6808) offer two queuing regimes for the 10GE card: packet-based and flow-based. The former schedules each incoming packet to a different queue, introducing significant reordering. The latter preserves packet order, but limits the capacity of a single flow to the capacity of one queue (this is the policy implemented in the PIONIER backbone). This is very unfortunate, because even in ideal conditions (an empty network) a single flow cannot exceed 1 Gbit/s (7 Gbit/s remains unused and unavailable to that flow). The situation is even worse in the presence of Internet traffic (multiple different flows), where each queue already has some background traffic scheduled. As an example, if a single link carries 4 Gbit/s of traffic load, the average queue load is 500 Mbit/s. Each new flow will encounter congestion when its rate reaches 500 Mbit/s, even though there is still 4 Gbit/s of free bandwidth on the switch. The issue was fully described in the paper presented at the TERENA 2005 Networking Conference, "Shall we worry about Packet Reordering?", available from the TERENA website at: http://www.terena.nl/events/tnc2005/programme/presentations/show.php?pres_id=74
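The flow-based queuing arithmetic above can be reproduced with a few lines of code. This is a simplified sketch, not a model of the actual switch scheduler: the figure of eight queues is inferred from the numbers in the text (4 Gbit/s of load giving an average of 500 Mbit/s per queue, and 7 Gbit/s left unused by a single 1 Gbit/s flow), and background traffic is assumed to be spread evenly across the queues. The function names are illustrative.

```python
# Simplified model of flow-based queuing on the 10GE card.
# Assumptions (inferred from the figures in the text, not from vendor documentation):
#   - 8 queues, each able to carry 1 Gbit/s
#   - background traffic is spread evenly over the queues

QUEUE_CAPACITY_MBPS = 1000
NUM_QUEUES = 8  # inferred: 4 Gbit/s load -> 500 Mbit/s average per queue


def single_flow_ceiling(background_load_mbps):
    """Rate at which one new flow starts to see congestion in its queue."""
    average_queue_load = background_load_mbps / NUM_QUEUES
    return max(0.0, QUEUE_CAPACITY_MBPS - average_queue_load)


def aggregate_free(background_load_mbps):
    """Bandwidth still free across all queues, none of it usable by that one flow."""
    return NUM_QUEUES * QUEUE_CAPACITY_MBPS - background_load_mbps


if __name__ == "__main__":
    for load in (0, 2000, 4000, 6000):
        print(f"load {load:4d} Mbit/s: single-flow ceiling "
              f"{single_flow_ceiling(load):6.0f} Mbit/s, "
              f"free in aggregate {aggregate_free(load):4d} Mbit/s")
```

With no background traffic the single-flow ceiling is 1000 Mbit/s, and with 4000 Mbit/s of background load it falls to 500 Mbit/s while 4000 Mbit/s remains free in aggregate, matching the example in the text.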
At this stage it was clear that the most promising way forward was to make more use of advanced TCP implementations, and/or utilities such as Tsunami, which adds a reliable transfer layer on top of standard UDP. However, it was discovered that the Mk5As cannot use the newer Linux 2.6 kernel, because of problems with the required Jungo drivers. The designers of the Mk5A (based at the MIT Haystack Observatory, MA, USA) were notified and are trying to resolve this problem.
In the meantime, to work round this problem, the PERT recommended a trial deployment of an American-developed system called the Logistical Session Layer (LSL). LSL sub-divides a long TCP connection into several concatenated short connections, an approach which has been shown to improve TCP performance under packet loss. Full details of the LSL and its use in the EVN are given in Appendix B of this document.
Although the Black Diamond switches are still in place, in May PSNC commissioned a new circuit (wavelength) between Poznan and Gdansk, and thereby decreased the load on the path Torun to Poznan. As a result, there was a significant improvement for the Torun to JIVE traffic, such that now a consistent 700 Mbps plus can be achieved with iperf.
– Main.TobyRodwell - 26 Jan 2007