There are many factors that affect the performance of data transfers. To achieve a high level of performance, ensuring that the DTN elements of a Science DMZ deployment are well-configured and well-tuned is of key importance.

Each use case should be examined in terms of its specific requirements. It is important to keep in mind that changes which improve performance in one data transfer scenario might degrade another, and to understand what effect these changes have on end-to-end transfer performance as a whole. For example, TCP tuning parameters for long-distance transfers generally differ from those for short-distance transfers. Specific values for such parameters are available from various sources: ESnet provides general performance advice [FASTERDATA], which includes settings suitable for certain DTN setups, e.g. Linux tuning information [ESnet]. The sections below discuss the main areas of interest, with references to further information.

 

A DTN usually mounts a file system from attached storage, whether a Storage Area Network (SAN) or a High Performance Computing (HPC) facility, and uses its network interface to transmit or receive data files. Dedicated transfer tools such as GridFTP [GridFTP], XRootd [XRootd], XDD [XDD], FDT [FDT], BBCP [BBCP] and FTS [FTS-CERN] are best installed on a DTN instance to achieve good input/output performance for data transfers.

Since DTNs are placed in a high-performance network “on ramp” at sites, as shown in Figure 1.1 above, for security reasons only software for dedicated data transfers is installed on the servers, with access allowed only from the endpoint sites rather than from general Internet traffic. Any in-network filtering is typically performed by efficient ACLs rather than by full, heavyweight stateful firewalls of the type that would protect the rest of the site network for its day-to-day business traffic.

Please note that while following the practices below for tuning DTN endpoints for a data transfer, it is also important that the local network architecture is appropriately designed.

In the following sections, we describe examples of DTN tuning for Linux DTNs. More information and detail, for example on pacing and MTU settings, can be found in [FASTERDATA2].


Networking

Various kernel parameters affect network settings. These kernel parameters can be inspected and modified using the sysctl tool or the files under /proc/sys/. Below, they are referred to using the sysctl name.
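
For example, a parameter can be read and changed at runtime with sysctl; changes made this way are lost at reboot unless they are also placed in /etc/sysctl.conf or a file under /etc/sysctl.d/. The parameter and value below are purely illustrative.

# Read the current value of a kernel parameter
sysctl net.core.rmem_max

# Set a new value at runtime (illustrative value, in bytes)
sysctl -w net.core.rmem_max=134217728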

Socket Parameters

The following are some of the settings that affect socket networking parameters for all protocols.

net.core.rmem_default

net.core.wmem_default

net.core.rmem_max

net.core.wmem_max

The rmem parameters refer to the socket receive buffer size and the wmem parameters to the send buffer size. Under TCP, a sending host will typically need to buffer enough data to fill the bandwidth-delay product (BDP) of the link, i.e. enough memory for all bits in flight; the higher the RTT of the link, the more memory is required (see the TCP Throughput Calculator [TTC] for an example of a BDP and buffer size calculator). As the names imply, the parameters ending in default set the default value used, and those ending in max set the maximum value that can be requested by a program. A privileged program can set a buffer size beyond the maximum value [socket].
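
As a worked example, a 10 Gbit/s path with a 100 ms RTT has a BDP of 10 Gbit/s × 0.1 s = 1 Gbit, i.e. 125 MB of data in flight, so socket buffers of at least that size are needed to keep the path full. The following sketch raises the socket maxima to 128 MB; the figure is illustrative rather than a recommendation for every system.

# Allow programs to request socket buffers of up to 128 MB
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728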

TCP Socket Parameters

The following parameters affect socket sizes for the Transmission Control Protocol (TCP) in particular.

net.ipv4.tcp_mem

net.ipv4.tcp_rmem

net.ipv4.tcp_wmem

The tcp_rmem and tcp_wmem parameters are similar to the socket parameters in that they influence the socket buffer size, but they are set differently. Each takes three values: a minimum, a default and a maximum. The default value overrides the value set by rmem_default and wmem_default; the maximum values do not override the settings of rmem_max and wmem_max. The minimum and maximum values set the range within which TCP can dynamically adjust the buffer size; they have no effect on the buffer size that a program can request.

The tcp_mem setting affects how TCP manages its overall memory usage. It takes three values: low, pressure and high. Below the low value, TCP stops regulating memory usage (if it was previously doing so); above the pressure value, TCP starts regulating its memory usage; the high value is the maximum amount of memory that TCP will use. These values are set not in bytes but in pages of memory, and they refer to global memory usage, not to individual sockets [tcp].

Typically, larger buffer sizes are better for higher throughput or for data transfers with a high round-trip time; however, unnecessarily high values will waste RAM.
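
For example, the TCP autotuning range can be widened to match the socket maxima suggested above; the minimum and default values here are typical Linux defaults, and all three figures are illustrative.

# Minimum, default and maximum TCP buffer sizes, in bytes
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"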

Another TCP parameter which can have a significant effect on performance is the congestion control algorithm used. The following parameters influence which algorithm is used.

net.ipv4.tcp_available_congestion_control

net.ipv4.tcp_allowed_congestion_control

net.ipv4.tcp_congestion_control

The tcp_available_congestion_control parameter is read-only and shows which algorithms are available on the system. The tcp_congestion_control parameter sets the default algorithm used; the Fasterdata site recommends htcp, but the newer TCP BBR, developed by Google and being standardised in the IETF, is certainly worth considering due to its greater resilience to packet loss. A program may instead use any of the algorithms set in tcp_allowed_congestion_control, and a privileged program can use any of the available algorithms [tcp].

A particular algorithm may only become available once the corresponding module has been loaded into the kernel. While choosing the right congestion control algorithm can improve performance, it also affects fairness: on shared links, some algorithms may negatively impact the network performance of other systems. For example, TCP BBR is more “aggressive” and may dominate flows using some traditional TCP algorithms.
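
For example, assuming the modules are named tcp_htcp and tcp_bbr as in mainline Linux kernels, an algorithm can be loaded and made the system-wide default as follows:

# Check which algorithms are currently available
sysctl net.ipv4.tcp_available_congestion_control

# Load the H-TCP module if it is not already present
modprobe tcp_htcp

# Make it the default for new connections
sysctl -w net.ipv4.tcp_congestion_control=htcp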

The ss command can be used to get detailed information about current TCP connections including what congestion control algorithm they are using:

ss --tcp --info

NIC Driver Settings

In addition to kernel parameters, network performance can be tuned by adjusting settings that affect the actual network hardware. NIC manufacturers may supply proprietary tools to adjust settings. However, some network driver settings can be modified using the ethtool command. Two important settings that should be looked at are the size of the send and receive rings and the offload features.

ethtool --show-ring INTERFACE

ethtool --show-features INTERFACE

Larger rings typically improve throughput but can have a negative impact on latency or jitter. Offloading features allow some work that would need to be done by the CPU to be done on the NIC instead. This can lead to lower CPU utilization and higher throughput.
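
A sketch of adjusting both follows, where the interface name eth0 and the ring size of 4096 are illustrative; the supported maxima are reported by the --show-ring output:

# Inspect the current and maximum ring sizes
ethtool --show-ring eth0

# Raise the receive and transmit rings (up to the reported maxima)
ethtool --set-ring eth0 rx 4096 tx 4096

# Enable an offload feature, e.g. TCP segmentation offload
ethtool --features eth0 tso on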

MTU Settings

The Maximum Transmission Unit (MTU) size can have a significant effect on throughput. For most typical Ethernet networks, the default MTU is 1500 bytes, meaning each Ethernet frame sent has a maximum size of 1500 bytes. However, some networks support larger MTUs, often up to 9000 bytes. Where such MTUs are available end to end, greater throughput can be achieved; see the TCP Throughput Calculator [TTC] to explore the effect. A higher MTU also puts less pressure on devices on the path, since it reduces the packets-per-second rate they must process.

Many NREN backbone networks support a 9000-byte MTU, and some support slightly larger values to allow for framing overhead. Where a Science DMZ network is connected directly to an NREN backbone through its site router, it may be possible to configure some or all of the Science DMZ for a higher MTU.

The example tuning for the GTS DTN setup in Appendix A uses an MTU of 8986 bytes, set with a simple ifconfig command.
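
A minimal sketch using the iproute2 ip command instead, where eth0 and the remote host are placeholders; the ping payload of 8972 bytes is 9000 bytes minus 28 bytes of IP and ICMP headers, and -M do forbids fragmentation so that a smaller-MTU hop on the path is revealed:

# Set a 9000-byte MTU on the interface
ip link set dev eth0 mtu 9000

# Verify that the full path supports the larger MTU
ping -M do -s 8972 remote-dtn.example.org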






Storage

When performing disk-to-disk transfers, tuning the storage devices can be just as important as network tuning. Storage also needs read and write performance sufficient that it does not become a bottleneck for data transfers. The type of storage device used and how it connects to the system will affect what parameters can be tuned. The following subsections include some of the areas that should be examined when deciding on what hardware to use and how to maximize performance.

Bus vs Disk Speed

It is important to look at a storage device’s internal transfer rate and not just at that of its interface: a drive may internally have a transfer rate much lower than the speed of the interface it connects to. Two common connection types for storage are SATA and SAS. There are multiple versions of both, but SATA-3 and SAS-3 are common; SATA-3 has a data throughput of 6 Gb/s and SAS-3 a throughput of 12 Gb/s. Most storage devices also transfer data sequentially faster than they handle random access, so expect a performance impact when data is accessed non-sequentially.
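
A quick first impression of a drive’s sequential read rate can be obtained with hdparm; the device name is an example and a single run is indicative only:

# Time buffered sequential reads directly from the device
hdparm -t /dev/sda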

Spinning versus Solid State Disks

Traditional spinning disks tend to be slower than solid state disks, but cheaper and available in larger capacities. Both types of disks can connect to the system via SATA or SAS. Spinning disks typically have slower internal throughput than either the SATA or SAS bus, meaning that they cannot sustain transfers at the bus speed. Some SSDs are capable of matching the bus speed, but not all.

NVMe

Solid state disks can also be attached to a system using NVMe, which means they are directly attached to the PCIe bus, substantially faster than either the SATA or SAS bus. The disadvantage of using the PCIe bus is that, compared to SATA or SAS, fewer storage devices can be connected: only so many PCIe lanes are available per CPU, and they are also needed by other hardware devices. PCIe versions 3.1 and 4 are common. Standard NVMe drives use four PCIe lanes each, giving them a potential throughput of 31.5 Gb/s with PCIe 3.1 and 63.0 Gb/s with PCIe 4.0.
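
The negotiated PCIe link of an NVMe drive can be checked with lspci; the PCI address below is a placeholder (it can be found first with lspci | grep -i nvme):

# Show the negotiated PCIe link speed and width of the device
lspci -vv -s 3b:00.0 | grep LnkSta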

RAID

When a single storage device cannot achieve the desired level of performance, multiple devices can be aggregated into one logical device. RAID allows multiple devices to be combined in various configurations. While RAID is mostly intended to increase reliability, depending on the configuration it can improve the performance and/or the reliability of the storage system. High storage throughput can be achieved by utilising multiple disks in a RAID-0 configuration, at the cost of reduced reliability.

A RAID controller is a hardware device that sits between the host and the storage devices and performs the calculations necessary to present multiple storage devices to the operating system as one logical device. Instead of using a physical controller, RAID can also be implemented in software. The Zettabyte File System (ZFS) is a file system that supports software RAID arrays [ZFS], allowing storage devices to be aggregated to increase read and write performance without dedicated hardware. However, software RAID may not achieve the same performance as a hardware implementation, and it creates additional work for the CPU.
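
As a sketch, a simple ZFS stripe (the RAID-0 analogue) can be created across two disks to aggregate their throughput; the pool and device names are examples, and a stripe has no redundancy, so losing one disk loses the whole pool:

# Create a striped pool across two devices
zpool create dtnpool /dev/sdb /dev/sdc

# Confirm the pool layout and health
zpool status dtnpool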

Remote Storage

There are multiple types of storage adapters that can be used to connect to remote storage such as InfiniBand, Fibre Channel or even Ethernet. The storage adapter selected has similar requirements to the network adapter. The throughput to the storage system should be at least as fast as the desired data transfer rate and offloading features are desirable to reduce the load on the CPU.









Architecture

Besides the storage and networking subsystems, there are a few other areas that should be considered; the underlying hardware architecture on which everything else runs is of course important. The selected CPU and memory must fit the use case and be configured properly. AMD, for example, provides a variety of guides for tuning its EPYC processors [EPYC].

Fast Cores vs Many Cores

Data transfer applications can be CPU-intensive. However, whether it is better to have a few very fast cores or many slower cores depends on whether the application is single- or multi-threaded; to complicate matters, it may also depend on the data set. Applications such as GridFTP cannot use multiple threads to transfer a single file, but can use a separate thread for each file in a data set [Globus-FAQ].
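
For example, the globus-url-copy client can transfer several files concurrently (-cc) while also opening parallel TCP streams per file (-p); the values and endpoints below are illustrative:

# Transfer a directory using 4 concurrent files and 8 streams per file
globus-url-copy -cc 4 -p 8 gsiftp://source.example.org/data/ gsiftp://dest.example.org/data/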

NUMA

Non-Uniform Memory Access (NUMA) describes systems in which memory access times differ depending on which CPU is accessing which portion of memory. How devices are connected also determines which NUMA node they are in. For the best performance, a process should run in the same NUMA node as the data it needs to access; Netflix showed that tuning NUMA parameters could substantially improve data transfer performance [NUMA]. Tools such as numactl and numad can be used to manage NUMA policies, and lspci can be used to determine which NUMA node a device is in. For best performance, it may be necessary to physically move a NIC or storage controller so that they are in the same NUMA node. For the NUMA layout to be visible to the operating system, it must be enabled in the BIOS.

The following command can be used to display NUMA nodes on a system:

numactl --hardware
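
The NUMA node of a NIC can then be read from sysfs, and a transfer tool pinned to that node; the interface name, node number and command below are illustrative:

# Show which NUMA node the NIC is attached to (-1 means no NUMA information)
cat /sys/class/net/eth0/device/numa_node

# Run a transfer tool with its CPUs and memory confined to node 1
numactl --cpunodebind=1 --membind=1 <transfer-command>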

IRQ Handling

For best performance, interrupts should be handled as close to the hardware as possible, while not overloading a single CPU. The daemon irqbalance controls how hardware interrupts are distributed over the CPU cores.
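
A sketch of inspecting and manually steering interrupts, where the IRQ number 42 and the CPU mask are illustrative; irqbalance should be stopped first, or it may override manual affinity settings:

# See how the NIC's interrupts are spread over the cores
grep eth0 /proc/interrupts

# Pin IRQ 42 to CPU core 2 (the mask is a hexadecimal bitmap; 0x4 = core 2)
echo 4 > /proc/irq/42/smp_affinity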

Power Saving

Features designed to save power when the system is not fully utilised may be desirable in production. When measuring system performance, however, they should be disabled to provide a more accurate measure of what a system is capable of. The cpupower utility controls the power management features of a CPU; the available options depend on the processor. See the cpupower documentation for more information.

Here is an example of how to change the CPU frequency governor on all CPU cores:

cpupower --cpu all frequency-set --governor performance
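
The governor and frequency limits currently in force can then be verified with:

cpupower frequency-info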