A worldwide, distributed Grid computing environment is currently under development for the upcoming Large Hadron Collider (LHC) at CERN, organized in several so-called Tier centres. CERN will be the source of a very high data load, originating from the particle detectors at the LHC accelerator ring, with an estimated rate of 2000 MByte/s. Part of the data processing is done in a Tier-0 centre located at CERN, which is mainly responsible for the initial processing of the data and for archiving it to tape. One tenth of the data is distributed to each of the 10 Tier-1 centres for further processing and backup purposes. The GridKa cluster (www.gridka.de), located at Forschungszentrum Karlsruhe, Germany (www.fzk.de), is one of these Tier-1 centres. Further details about the LHC computing model can be found here: http://les.home.cern.ch/les/talks/lcg%20overview%20sep05.ppt.

A 10 Gbit/s link is active between the CERN Tier-0 centre and GridKa in order to handle this high data load.
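As a rough sanity check on these figures, the sketch below converts the estimated detector rate into the average share arriving at one Tier-1 centre. It only covers the sustained average and ignores bursts, retransmissions and any other traffic on the link; the function and constant names are illustrative.

```python
# Figures taken from the text; 1 MByte is treated as 10^6 bytes.
TOTAL_RATE_MBYTE_S = 2000   # estimated data rate from the LHC detectors
TIER1_CENTRES = 10          # each Tier-1 centre receives one tenth
LINK_GBIT_S = 10            # capacity of the CERN to GridKa link


def tier1_share_gbit_s(total_mbyte_s: float, centres: int) -> float:
    """Return the average rate per Tier-1 centre in Gbit/s."""
    per_centre_mbyte_s = total_mbyte_s / centres
    return per_centre_mbyte_s * 8 / 1000  # MByte/s -> Mbit/s -> Gbit/s


share = tier1_share_gbit_s(TOTAL_RATE_MBYTE_S, TIER1_CENTRES)
print(f"Average share per Tier-1 centre: {share:.1f} Gbit/s")
print(f"Fraction of the link capacity:   {share / LINK_GBIT_S:.0%}")
```

Under these assumptions the average share is well below the link capacity, which leaves headroom for peaks and for catching up after outages.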

Network administrators who have set up a Gigabit connection (or even a 10 Gbit/s one) often find that they get far less out of it than they had expected. The performance achieved with vanilla systems frequently falls well short of what the link should deliver. The current problems of Gigabit and 10 Gbit/s Ethernet were not apparent years ago, when Fast Ethernet was released. With 10 Gbit/s Ethernet it is currently very difficult to get close to line speed, and almost impossible to fill a full-duplex link. While in the heyday of Fast Ethernet the networks were the bottleneck, the problems have since moved to the end systems, and the network is no longer likely to be the limiting factor.

The requirements of full-duplex 10 Gbps communication are simply too demanding for current end systems. Most applications in use today are based on TCP/IP. The inherent unreliability of IP is compensated by several mechanisms in TCP, at the transport layer, which ensure correct and reliable end-to-end communication. This reliability does not come for free: each TCP flow within a system crosses the Front Side Bus (FSB) up to four times, and memory is accessed up to five times.
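To give a feel for these figures, the following minimal sketch multiplies the link rate by the worst-case number of bus crossings and memory accesses quoted above. It assumes that every byte on the wire incurs the full number of crossings, which is an upper bound rather than a measurement; the function and constant names are illustrative only.

```python
# Worst-case internal traffic caused by TCP flows, using the "up to four
# FSB crossings, up to five memory accesses" figures from the text.
# Assumption: every transferred byte incurs the full number of crossings.

LINK_RATE_GBIT_S = 10   # one direction of the 10 Gbps link
DIRECTIONS = 2          # full duplex: sending and receiving at the same time
FSB_CROSSINGS = 4       # upper bound quoted in the text
MEMORY_ACCESSES = 5     # upper bound quoted in the text


def internal_load_gbit_s(link_rate_gbit_s: int, directions: int, touches: int) -> int:
    """Return the internal traffic in Gbit/s if each byte is moved `touches` times."""
    return link_rate_gbit_s * directions * touches


fsb_load = internal_load_gbit_s(LINK_RATE_GBIT_S, DIRECTIONS, FSB_CROSSINGS)
mem_load = internal_load_gbit_s(LINK_RATE_GBIT_S, DIRECTIONS, MEMORY_ACCESSES)
print(f"FSB traffic:    up to {fsb_load} Gbit/s")
print(f"Memory traffic: up to {mem_load} Gbit/s")
```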

Vendors are currently trying to reduce both the number of transfers back and forth across the FSB and the interrupt rate that a 10 Gbps system has to handle. They do this by adding offload engines (see http://www.redbooks.ibm.com/redbooks/SG245287/) and interrupt coalescing to current 10 Gbps network devices, with the aim of getting the best out of the technology found in today's end systems. The first approach implements in hardware some TCP procedures that were previously done in software, thereby eliminating one cycle and reducing the cost to two Front Side Bus transfers and three memory accesses. The second approach places each new packet into a queue for a set time rather than sending it as soon as it is ready, in the hope that more packets will be ready by the time the interval expires. All of them can then be sent with a single interrupt rather than one interrupt per packet.
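The effect of interrupt coalescing can be illustrated with a toy model. The sketch below is not how any particular driver or network card implements it; it simply assumes that the first packet starts a timer and that every packet arriving before the timer expires is covered by the same interrupt. The packet spacing and the 100 microsecond window are assumed values chosen for illustration.

```python
def count_interrupts(arrival_times_ns, coalesce_window_ns):
    """Count interrupts for packets arriving at the given times (nanoseconds).

    With a window of 0 every packet raises its own interrupt. Otherwise the
    first packet opens a window, and all packets arriving before the window
    closes share the single interrupt raised for that window.
    """
    if coalesce_window_ns <= 0:
        return len(arrival_times_ns)
    interrupts = 0
    window_end = None
    for t in sorted(arrival_times_ns):
        if window_end is None or t >= window_end:
            interrupts += 1                      # a new window opens
            window_end = t + coalesce_window_ns
    return interrupts


# Assumed traffic: 1000 packets, one every 1200 ns (roughly a 1500-byte
# frame at 10 Gbit/s), and an assumed coalescing window of 100 microseconds.
arrivals = [i * 1200 for i in range(1000)]
per_packet = count_interrupts(arrivals, 0)
coalesced = count_interrupts(arrivals, 100_000)
print(f"Interrupts without coalescing: {per_packet}")
print(f"Interrupts with coalescing:    {coalesced}")
```

The trade-off is latency: a packet may wait up to one full window before it is processed, which is why the window length is usually a tunable driver parameter.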

The various 10Gbps tests run at the Forschungszentrum Karlsruhe can be divided into two big groups:

  • a local test inside the Forschungszentrum testbed
  • tests in a Wide Area Network (WAN) environment, between Forschungszentrum Karlsruhe and CERN. These tests use the Deutsche Forschungsnetz (DFN) and Geant through a 20 ms RTT path over a shared Least Best Effort ATM MPLS tunnel. This allows DFN and Geant to stay in control, as these tests alone could easily fill up their 10 Gbps backbone, which would effectively cut off the communication of thousands of scientists across Europe.

The local environment at Forschungszentrum Karlsruhe consisted of two IBM xSeries 345 systems based on Intel Xeon processors, each equipped with a 10 Gbps LR card kindly provided by Intel. With the unmodified systems, throughput reached slightly above 1 Gbps. After modifying the default interrupt coalescing configuration of Intel's device driver, using non-standard extended (jumbo) Ethernet frames, and setting the MMRBC (maximum memory read byte count) field of the PCI-X command register to its maximum, a unidirectional single stream of slightly over 5.3 Gbps could be sent in a back-to-back transfer, an improvement of more than 400%. As the load on both IBM systems rose to 99%, no higher throughput could be achieved with these machines. The bottleneck in this case was the memory subsystem of the Xeon systems.
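One reason jumbo frames help is that they lower the number of frames, and therefore the per-packet work, needed to move the same amount of data. The sketch below compares the frame rate at 10 Gbit/s line rate for the standard 1500-byte MTU and for a 9000-byte MTU. The text does not state which jumbo frame size was used, so 9000 bytes is an assumption, as are the names used in the code.

```python
# Per-frame overhead on the wire: 14-byte Ethernet header, 4-byte FCS,
# 8-byte preamble and 12-byte inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 38

LINE_RATE_BIT_S = 10_000_000_000
STANDARD_MTU = 1500
JUMBO_MTU = 9000   # assumption: the frame size used in the tests is not stated


def frames_per_second(rate_bit_s: int, mtu_bytes: int) -> float:
    """Return how many full-sized frames per second fit into the given rate."""
    wire_bits = (mtu_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return rate_bit_s / wire_bits


standard_fps = frames_per_second(LINE_RATE_BIT_S, STANDARD_MTU)
jumbo_fps = frames_per_second(LINE_RATE_BIT_S, JUMBO_MTU)
print(f"MTU {STANDARD_MTU}: {standard_fps:,.0f} frames/s")
print(f"MTU {JUMBO_MTU}: {jumbo_fps:,.0f} frames/s")
print(f"Reduction factor: {standard_fps / jumbo_fps:.1f}x")
```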

In the WAN environment, a single Intel Itanium node at CERN and one of the already tuned IBM systems took part in the wide area tests. Both were configured in the same way. The first tests were very disappointing, as throughput did not go beyond a few megabytes per second. Once TCP SACK (selective acknowledgements, RFC 2018) and TCP timestamps (RFC 1323) were enabled, and the TCP windows were enlarged by means of sysctl parameters to match the bandwidth-delay product (BDP), throughput increased drastically, up to 5.4 Gbps. For a 10 Gbps path with this 20 ms RTT across Germany, France and Switzerland, the BDP is roughly 200 Mbit (about 25 MByte). The bottleneck was again the memory subsystem of the xSeries server. This was not unexpected, as two different architectures were brought face to face: Xeon versus Itanium.
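The bandwidth-delay product is the link rate multiplied by the round-trip time, and a single TCP stream can send at most one window of data per round trip. The sketch below works this out for a 10 Gbit/s rate and the 20 ms RTT, and shows the ceiling imposed by a default 87380-byte read buffer. Treating the whole buffer as usable window is a simplification, since the kernel reserves part of it for bookkeeping; the function names are illustrative.

```python
def bdp_bytes(bandwidth_bit_s: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes: data in flight needed to fill the path."""
    return bandwidth_bit_s * rtt_ms // 8000   # ms -> s (1000), bits -> bytes (8)


def max_throughput_bit_s(window_bytes: int, rtt_ms: int) -> int:
    """Upper bound for one TCP stream: one window of data per round trip."""
    return window_bytes * 8 * 1000 // rtt_ms


RTT_MS = 20                      # round-trip time quoted in the text
LINK_BIT_S = 10_000_000_000      # 10 Gbit/s
DEFAULT_WINDOW = 87380           # default TCP read buffer in bytes

bdp = bdp_bytes(LINK_BIT_S, RTT_MS)
ceiling = max_throughput_bit_s(DEFAULT_WINDOW, RTT_MS)
print(f"BDP at 10 Gbit/s and 20 ms: {bdp} bytes ({bdp * 8 // 10**6} Mbit)")
print(f"Ceiling with default window: {ceiling / 10**6:.1f} Mbit/s "
      f"({ceiling / 8 / 10**6:.1f} MByte/s)")
```

With the default window the ceiling is only a few megabytes per second, regardless of the link capacity, which is consistent with the disappointing first WAN tests.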

Here are the modifications to the TCP stack, made through the Linux kernel's sysctl parameters:

net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
# min/default/max TCP read buffer in bytes (default: 4096 87380 174760)
net.ipv4.tcp_rmem = 10000000 25165824 50331648
# min/default/max TCP write buffer in bytes (default: 4096 16384 131072)
net.ipv4.tcp_wmem = 10000000 25165824 50331648
# min/pressure/max TCP buffer space in pages (default: 31744 32256 32768)
net.ipv4.tcp_mem = 32768 65536 131072
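The buffer values listed above can be checked against the bandwidth-delay product of the WAN path. The following sketch parses sysctl-style lines and compares the read and write buffer sizes with a BDP of 25,000,000 bytes (10 Gbit/s times 20 ms). The 4 KiB page size used to convert tcp_mem is an assumption that holds for typical x86 systems but not necessarily for every architecture, and parse_sysctl is a hypothetical helper, not part of any library.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB memory pages
BDP_BYTES = 25_000_000    # 10 Gbit/s * 20 ms / 8

SETTINGS = """
net.ipv4.tcp_rmem = 10000000 25165824 50331648
net.ipv4.tcp_wmem = 10000000 25165824 50331648
net.ipv4.tcp_mem = 32768 65536 131072
"""


def parse_sysctl(text: str) -> dict:
    """Parse 'key = v1 v2 v3' lines into {key: [int, ...]}, skipping comments."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if "=" not in line:
            continue
        key, _, raw = line.partition("=")
        values[key.strip()] = [int(v) for v in raw.split()]
    return values


settings = parse_sysctl(SETTINGS)
for key in ("net.ipv4.tcp_rmem", "net.ipv4.tcp_wmem"):
    minimum, default, maximum = settings[key]
    print(f"{key}: default >= BDP: {default >= BDP_BYTES}, "
          f"max = {maximum / BDP_BYTES:.1f} x BDP")

low, pressure, high = settings["net.ipv4.tcp_mem"]
tcp_mem_limit_mib = high * PAGE_SIZE // 2**20
print(f"tcp_mem upper limit: {tcp_mem_limit_mib} MiB")
```

Under these assumptions the default buffer size is just above the BDP, and the maximum leaves about a factor of two in reserve.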

Related links:

Forschungszentrum Karlsruhe: http://www.fzk.de

GridKa: http://www.gridka.de

CERN: http://www.cern.ch

LHC GridComputing: http://les.home.cern.ch/les/talks/lcg%20overview%20sep05.ppt

IBM Redbook: http://www.redbooks.ibm.com/redbooks/SG245287


Marc García Martí

Bruno Hoeft

-- Main.MonicaDomingues - 24 Oct 2005

-- Main.SimonLeinen - 14 Oct 2006 (added cross-references)

  • Tests Ten Gbps: Logical topology of the Forschungszentrum /CERN 10Gbps testbed