Network Adapters Demultiplexing to Multiple Threads

Over the past years, multi-threaded and multi-core machines have become dominant, not just in the high-performance arena but also on personal machines, including laptops. This has naturally led to attempts to make networking benefit from these multiple cores. On some earlier architectures, this was only possible when the system had multiple interfaces: different interfaces could be "bound" to different processors/cores. This method was used, for example, on multiprocessor SPARC machines under Solaris 9.

Recently the capacity of network interfaces has grown faster than single-core compute power, so it has become increasingly attractive to have multiple cores serve a single interface. For the outbound direction, this is relatively simple: multiple threads (running on different cores) generate packets and send them to a single adapter, which has to be able to accept packets from several cores. On modern adapters, this is supported by multiple transmit queues (multi-queue adapters).

A trickier but equally important issue is distributing the load of incoming traffic from a single interface (adapter) across multiple cores. There are several approaches to this:

Multi-Queue Network Adapters

Modern network adapters support multiple receive queues, and can distribute incoming traffic to them based on signatures including VLAN tags, source/destination MAC/IP addresses, and in some cases protocol and port numbers. Early examples are the "multithreaded" Gigabit Ethernet and 10GE adapters (nxge/Neptune) for Oracle/Sun's Solaris systems, but this feature is quickly moving into the mainstream.

On the Intel architecture, multi-queue NICs use MSI-X (the extended version of Message Signaled Interrupts) to send interrupts to multiple cores. The feature that distributes arriving packets to different CPUs based on (hashed) connection identifiers is called RSS (Receive Side Scaling) on Intel adapters, and its kernel-side support on Windows was introduced as part of the Scalable Networking Pack in Windows Server 2003 SP2.

Receive Side Scaling (RSS) or Receive Packet Steering (RPS)

This performance enhancement works as follows: Incoming packets are distributed over multiple logical CPUs (e.g. cores) based on a hash over the source and destination IP addresses and port numbers. This hashing ensures that packets from the same logical connection (e.g. TCP connection) are always handled by the same CPU/core. On some network adapters, the work of computing the hash can be outsourced to the network adapter. For example, some Intel and Myricom adapters compute a Toeplitz hash from these header fields. This has the beneficial effect of avoiding cache misses on the CPU that performs the steering - the receiving CPU will usually have to read these fields anyway.
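The hash-based CPU selection described above can be sketched in a few lines of Python. This is only an illustration, not any adapter's or kernel's implementation: the 40-byte key is arbitrary, the helper names (`toeplitz_hash`, `flow_tuple`, `select_cpu`) are made up for this example, and the final modulo step is a simplification (real adapters typically look up the low-order bits of the hash in an indirection table).

```python
import socket
import struct

# Arbitrary 40-byte key, chosen only for this illustration.
KEY = bytes(range(1, 41))


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash: for every set bit of the input (most significant
    bit first), XOR in the 32-bit window of the key that starts at that bit
    position."""
    key_bits = len(key) * 8
    if len(data) * 8 + 31 > key_bits:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                shift = key_bits - 32 - (i * 8 + bit)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def flow_tuple(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Concatenate the IPv4/TCP header fields that identify a connection."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


def select_cpu(src_ip, dst_ip, src_port, dst_port, cpus):
    """Map a flow to one of the candidate CPUs (simplified: hash modulo)."""
    h = toeplitz_hash(KEY, flow_tuple(src_ip, dst_ip, src_port, dst_port))
    return cpus[h % len(cpus)]


if __name__ == "__main__":
    cpus = [0, 1, 2, 3]
    # All packets of one connection carry the same fields, hence the same CPU.
    print(select_cpu("192.0.2.1", "198.51.100.7", 40000, 80, cpus))
```

Because the hash depends only on the header fields, every packet of a given connection lands on the same CPU, while different connections are spread pseudo-randomly over the candidate set.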

For Linux, this enhancement was contributed by Tom Herbert from Google and integrated in version 2.6.35 of the Linux kernel, which was released in August 2010. This implementation allows the administrator to configure the set of CPUs that are candidates to handle packets from a given interface via bitmasks through a sysfs interface.
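As a rough sketch of how such a bitmask might be set from a script: the kernel exposes one `rps_cpus` file per receive queue under `/sys/class/net/<device>/queues/rx-<n>/`, and expects a hexadecimal CPU mask there. The helper names below and the device name `eth0` are assumptions made for this example, and writing the file requires root privileges.

```python
def cpu_mask(cpus):
    """Return a hex CPU bitmask, as comma-separated 32-bit words
    (most significant word first)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    words = []
    while True:
        words.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if not mask:
            break
    words.reverse()
    return ",".join([format(words[0], "x")]
                    + [format(w, "08x") for w in words[1:]])


def rps_path(device, queue):
    """sysfs file holding the RPS CPU mask for one receive queue."""
    return f"/sys/class/net/{device}/queues/rx-{queue}/rps_cpus"


def set_rps_cpus(device, queue, cpus):
    """Write the mask; needs root and a kernel with RPS support."""
    with open(rps_path(device, queue), "w") as f:
        f.write(cpu_mask(cpus))


if __name__ == "__main__":
    # Show what would be written for CPUs 0-3 on the first queue of eth0.
    print(rps_path("eth0", 0), "<-", cpu_mask([0, 1, 2, 3]))
```

A mask of 0 disables RPS for that queue, so packets are processed on the CPU that took the interrupt.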

HP-UX had a software-only implementation of Receive Packet Steering under the name of Inbound Packet Scheduling (IPS).

Note that Intel has a patent application (US patent application #20090323692) on "Hashing packet contents to determine a processor".

Receive Flow Steering (RFS)

RFS extends and improves on RPS as follows: Rather than simply using the connection signature (hash on some header fields) to pseudo-randomly select a CPU, it uses a table that maps connections (via their hashes) to the CPUs on which there are processes running that have the corresponding sockets open. The actual implementation is somewhat more involved, because it has to cope with processes being rescheduled across CPUs, multiple processes on different CPUs listening on the same socket, etc.
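The core idea, a table from flow hash to the CPU where the consumer runs, with RPS-style hashing as the fallback, can be illustrated with a toy model. The class and method names are invented for this sketch, and it deliberately ignores the complications mentioned above (rescheduling, several listeners on one socket, avoiding packet reordering when a flow moves).

```python
class FlowTable:
    """Toy model of an RFS-style steering table."""

    def __init__(self, size=4096):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.mask = size - 1
        self.cpu_for_slot = [None] * size

    def note_consumer(self, flow_hash, cpu):
        """Record that a thread running on `cpu` just read from this flow's
        socket. Flows whose hashes collide share a slot."""
        self.cpu_for_slot[flow_hash & self.mask] = cpu

    def steer(self, flow_hash, rps_cpus):
        """Pick the CPU for an incoming packet: the recorded consumer CPU if
        there is one, otherwise plain hash-based (RPS-like) selection."""
        cpu = self.cpu_for_slot[flow_hash & self.mask]
        if cpu is None:
            return rps_cpus[flow_hash % len(rps_cpus)]
        return cpu


if __name__ == "__main__":
    table = FlowTable(size=8)
    print(table.steer(5, [0, 1]))   # no consumer known yet: hash-based choice
    table.note_consumer(5, 3)       # application reads the socket on CPU 3
    print(table.steer(5, [0, 1]))   # later packets follow the application
```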

The Linux implementation was also contributed by Tom Herbert and integrated in the Linux kernel as of 2.6.35 (August 2010).

For the adapter itself to steer flows in this way (rather than leaving the steering to the kernel), it needs suitable hardware support. For Intel adapters, this is called Ethernet Flow Director and is included at least on recent server adapters/chipsets such as the X520/82599EB and XL710.

`SO_INCOMING_CPU`

A new getsockopt() option SO_INCOMING_CPU has been proposed as an addition to the Linux kernel. It would allow a process to determine which logical CPU (core) a given socket returned by accept() or connect() is mapped to. This information could be used by applications to intelligently distribute load across different "worker" threads that run on different cores. This is an alternative to RFS in the sense that the application tries to use the most suitable core for the socket, whereas in RFS the system tries to steer flows to the respective core where they are consumed.
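A minimal sketch of how an application might use the option, assuming a kernel that provides it. The fallback value 49 is the constant from the generic Linux socket headers and may differ on some architectures; the helper names and the `workers_by_cpu` mapping are hypothetical.

```python
import socket

# Newer Python versions export the constant on Linux; 49 is assumed otherwise.
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def incoming_cpu(sock):
    """Return the CPU the kernel associates with this socket's incoming
    packets, or None if the option is not supported here."""
    try:
        return sock.getsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU)
    except OSError:
        return None


def pick_worker(sock, workers_by_cpu, default):
    """Hand the socket to the worker pinned to its incoming CPU, if any."""
    return workers_by_cpu.get(incoming_cpu(sock), default)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(incoming_cpu(s))  # meaningful only once packets have arrived
    s.close()
```

In a server, `pick_worker` would be called right after accept(), with one worker thread pinned to each core that receives traffic.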

References

Intel® Ethernet Flow Director (video). This is a marketing video, but it explains the issues and mechanisms clearly, concisely, and in an entertaining way. The video makes it seem as if Microsoft and Intel were the pioneers for these features; that may be true, but I (Main.SimonLeinen) have the feeling that Sun's implementations may have predated theirs. If you know more, please contact me - just curious!
Receive-Side Scaling Enhancements in Windows Server 2008, Windows Hardware Developer Central (WHDC), November 2008.
Hashing packet contents to determine a processor, Li Yadong, Tang Xinan, United States Patent Application 20090323692, December 2009.
rps: Receive packet steering, T. Herbert, lwn.net, November 2009. This is the original detailed description of the RPS patch for Linux by its author. Includes some benchmark figures showing the performance improvements.
Receive Packet Steering, J. Corbet, lwn.net, November 2009. A nice article motivating RPS and outlining its implementation.
HP-UX forum thread on RPS, in particular Rick Jones' response to the original question, May 2010.
rfs: Receive Flow Steering, T. Herbert, lwn.net, April 2010. Includes performance figures.
Receive Flow Steering, J. Edge, lwn.net, April 2010.

-- Main.SimonLeinen - 2010-08-02 - 2014-12-07
