High Performance Linux

Thursday, September 20, 2012

Linux: scaling softirq among many CPU cores

Some years ago I tested network interrupt affinity: you set ~0 as the CPU mask to balance network interrupts among all your CPU cores, and all softirq instances run in parallel. Distributing interrupts among CPU cores this way is sometimes a bad idea because of the added pressure on CPU caches and probable packet reordering. In most cases it is not recommended for servers running a TCP application (e.g. a web server). However, this ability is crucial for low-level packet applications like firewalls, routers or Anti-DDoS solutions (in the last case most of the packets must be dropped as quickly as possible), which do a lot of work in softirq. So for some time I thought that sharing softirq load between CPU cores was not a problem.

To share softirq load between CPU cores you just need to run:

    $ for irq in `grep eth0 /proc/interrupts | cut -d: -f1`; do \
        echo ffffff > /proc/irq/$irq/smp_affinity; \
      done

This (as I thought) makes the APIC distribute interrupts among all your CPUs in round-robin fashion (or probably using some cleverer technique). And this really did work in my tests.
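The value written to smp_affinity is a hexadecimal bitmap with one bit per CPU, so ffffff selects CPUs 0 to 23. As a minimal user-space sketch, the helper below builds such a string for a given number of CPUs. The name cpu_mask_hex() is hypothetical, the helper is limited to 64 CPUs, and it assumes the kernel's usual convention of comma-separated 32-bit groups for wider masks.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a mask with the low `ncpus` bits set as hex text, e.g. 24 CPUs
 * give "ffffff". Hypothetical helper, supports 1..64 CPUs only.
 * Masks wider than 32 bits are written as comma-separated 32-bit groups,
 * most significant group first.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int cpu_mask_hex(unsigned int ncpus, char *buf, size_t len)
{
    uint64_t mask;
    int n;

    assert(ncpus >= 1 && ncpus <= 64);
    mask = (ncpus == 64) ? UINT64_MAX : ((UINT64_C(1) << ncpus) - 1);

    if (mask >> 32)
        n = snprintf(buf, len, "%x,%08x",
                     (unsigned int)(mask >> 32),
                     (unsigned int)(mask & 0xffffffffu));
    else
        n = snprintf(buf, len, "%x", (unsigned int)mask);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

The resulting string is what gets echoed into /proc/irq/<irq>/smp_affinity; whether the hardware then really spreads interrupts over all selected CPUs is a separate question.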

Recently a client of ours became concerned about this ability, so I wrote a very simple test kernel module which just does more work in softirq:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

/*
 * Just eat some local CPU time and accept the packet.
 */
static unsigned int
st_hook(unsigned int hooknum, struct sk_buff *skb,
        const struct net_device *in,
        const struct net_device *out,
        int (*okfn)(struct sk_buff *))
{
    unsigned int i;

    /*
     * Burn CPU cycles. barrier() is assumed here as a minimal loop body
     * that the compiler cannot optimize away.
     */
    for (i = 0; i <= 1000 * 1000; ++i)
        barrier();

    return NF_ACCEPT;
}

static struct nf_hook_ops st_ip_ops[] __read_mostly = {
    {
        .hook = st_hook,
        .owner = THIS_MODULE,
        .pf = PF_INET,
        .hooknum = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_FIRST,
    },
};

static int __init
st_init(void)
{
    int r = nf_register_hooks(st_ip_ops, ARRAY_SIZE(st_ip_ops));

    if (r) {
        printk(KERN_ERR "%s: can't register nf hook\n", __FILE__);
        return r; /* propagate the negative errno to the module loader */
    }
    printk(KERN_ERR "%s: loaded\n", __FILE__);

    return 0;
}

static void __exit
st_exit(void)
{
    nf_unregister_hooks(st_ip_ops, ARRAY_SIZE(st_ip_ops));
    printk(KERN_ERR "%s: unloaded\n", __FILE__);
}

module_init(st_init);
module_exit(st_exit);

MODULE_LICENSE("GPL");

I loaded the system with iperf over a 1Gbps channel. And I was very confused to see that only one CPU of the 24-core machine was doing all the work while all the other CPUs were doing nothing!

To understand what's going on, let's have a look at how Linux handles incoming packets and interrupts from a network card (e.g. Intel 10 Gigabit PCI Express, whose driver lives in drivers/net/ixgbe). Softirq works in per-CPU kernel threads, ksoftirqd (kernel/softirq.c: ksoftirqd()), i.e. if you have a 4-core machine, then you have 4 ksoftirqd threads (ksoftirqd/0, ksoftirqd/1, ksoftirqd/2 and ksoftirqd/3). ksoftirqd() calls do_softirq(), which in turn calls __do_softirq(). The latter uses the softirq_vec vector to get the required handler for the current softirq type (e.g. NET_RX_SOFTIRQ for receiving or NET_TX_SOFTIRQ for sending). The next step is to call the handler's action() callback. For NET_RX_SOFTIRQ this is net_rx_action() (net/core/dev.c). net_rx_action() reads a napi_struct from the per-CPU softnet_data queue and calls its poll() callback, a NAPI callback (ixgbe_poll() in our case) which actually reads packets from the device ring queues.

The driver processes interrupts in ixgbe_intr(). This function starts NAPI by calling __napi_schedule(), which pushes the current napi_struct onto the per-CPU softnet_data->poll_list, from which net_rx_action() reads it on the same CPU. Thus softirq runs on the same core that received the hardware interrupt.
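The per-CPU hand-off described above can be modelled in a few lines of user-space C. This is only a toy sketch, not kernel code: all toy_* names, the fixed-size list and the counters are made up for illustration. The point it shows is that the "interrupt" queues work on the list of the CPU it arrived on, and the "softirq" of a CPU only ever drains its own list.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS     4
#define TOY_MAX_PENDING 8

/* Stand-in for napi_struct: counts how often its poll step ran. */
struct toy_napi {
    unsigned long polls;
};

/* Stand-in for the per-CPU softnet_data with its poll_list. */
struct toy_softnet_data {
    struct toy_napi *poll_list[TOY_MAX_PENDING];
    size_t n;
    unsigned long polled;
};

static struct toy_softnet_data softnet[TOY_NR_CPUS];

/* "Hard IRQ" on `cpu`: queue the NAPI context on that CPU's list only. */
static void toy_irq(int cpu, struct toy_napi *napi)
{
    struct toy_softnet_data *sd;

    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    sd = &softnet[cpu];
    if (sd->n < TOY_MAX_PENDING)
        sd->poll_list[sd->n++] = napi;
}

/*
 * "Softirq" on `cpu`: drain only this CPU's list and return how many
 * entries were polled. Other CPUs' lists are never touched.
 */
static size_t toy_net_rx_action(int cpu)
{
    struct toy_softnet_data *sd;
    size_t i, done;

    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    sd = &softnet[cpu];
    done = sd->n;
    for (i = 0; i < done; i++) {
        sd->poll_list[i]->polls++;  /* stands in for napi->poll() */
        sd->polled++;
    }
    sd->n = 0;
    return done;
}
```

If every toy_irq() call lands on CPU 0, then toy_net_rx_action() has work only on CPU 0, which mirrors the single busy core seen in the test.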

This way, theoretically, if hardware interrupts go to N cores, then these and only these N cores do softirq work. So I had a look at the /proc/interrupts statistics and saw that only core 0 was actually receiving interrupts from the NIC, even though I had set the ~0 mask in smp_affinity for the interrupt (actually I had an MSI-X card, so I set the mask for all of the card's interrupt vectors).
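Checking this by eye in /proc/interrupts gets tedious on a 24-core machine. As a rough sketch, the hypothetical helper below extracts the per-CPU counters from one line of that file, assuming the usual layout of an IRQ label followed by a colon, one decimal counter per CPU, and then the chip and device names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the per-CPU counters of one /proc/interrupts line of the form
 * "<irq>: <count cpu0> <count cpu1> ... <chip> <name>".
 * Stores up to `ncpus` counters in counts[] and returns how many were
 * read. Hypothetical helper, assumes the layout described above.
 */
static unsigned int parse_irq_line(const char *line, unsigned int ncpus,
                                   unsigned long *counts)
{
    const char *p = strchr(line, ':');
    unsigned int cpu = 0;

    assert(counts != NULL);
    if (!p)
        return 0;               /* not an interrupt line */
    p++;

    while (cpu < ncpus) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)
            break;              /* no digits left: chip/name columns */
        counts[cpu++] = v;
        p = end;
    }
    return cpu;
}
```

Feeding it the lines that mention the NIC and comparing counts[] between two reads shows which cores are actually taking the interrupts.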

I started googling for answers as to why on earth the interrupts were not distributed among all the cores. The first things I found were nice articles by Alexander Sandler.

According to these articles, not all hardware is actually able to spread interrupts between CPU cores. During my tests I was using IBM servers of a particular model, but this is not the case for the client: they use very different hardware. This is why I saw a nice picture in my previous tests but faced quite different behaviour on other hardware.

The good news is that Linux 2.6.35 introduced a nice feature: RPS (Receive Packet Steering). The core of the feature is get_rps_cpu() from net/core/dev.c, which computes a hash from the IP source and destination addresses of an incoming packet and, based on the hash, determines which CPU to send the packet to. netif_receive_skb() or netif_rx(), which call the function, put the packet into the appropriate per-CPU queue for further processing by softirq. So there are two important consequences:
  1. packets are processed by different CPUs (by processing I mostly mean Netfilter pre-routing hooks);
  2. it is unlikely that packets belonging to the same TCP stream are reordered (packet reordering is a well-known problem for TCP performance, see for example Beyond softnet).
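Both consequences follow from the fact that the CPU is a pure function of the flow's addresses. The user-space sketch below illustrates the idea only: toy_flow_hash() is an arbitrary mixing function, not the hash the kernel uses, and toy_pick_cpu() is a hypothetical stand-in for the selection step, limited to a 32-bit CPU mask.

```c
#include <assert.h>
#include <stdint.h>

/* Mix two IPv4 addresses. Arbitrary illustrative hash, NOT the kernel's. */
static uint32_t toy_flow_hash(uint32_t saddr, uint32_t daddr)
{
    uint32_t h = (saddr * UINT32_C(2654435761)) ^ daddr;

    h ^= h >> 16;
    h *= UINT32_C(0x85ebca6b);
    h ^= h >> 13;
    return h;
}

/*
 * Pick a CPU for a flow: take hash modulo the number of CPUs allowed by
 * cpu_mask and return the index of the corresponding set bit.
 * The same (saddr, daddr) pair always yields the same CPU.
 */
static int toy_pick_cpu(uint32_t saddr, uint32_t daddr, uint32_t cpu_mask)
{
    unsigned int nset = 0, target, cpu;

    assert(cpu_mask != 0);
    for (cpu = 0; cpu < 32; cpu++)
        nset += (cpu_mask >> cpu) & 1u;

    target = toy_flow_hash(saddr, daddr) % nset;
    for (cpu = 0; cpu < 32; cpu++) {
        if (!((cpu_mask >> cpu) & 1u))
            continue;           /* CPU not allowed by the mask */
        if (target-- == 0)
            return (int)cpu;
    }
    return -1;                  /* not reached for a non-zero mask */
}
```

Different flows spread over the CPUs allowed by the mask, while all packets of one flow keep landing on the same CPU, so their relative order is preserved.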
To enable the feature you should specify the CPU mask as follows (the adapter in this example is connected via MSI-X and has 8 tx-rx queues, so we need to update the masks for all the queues):

    $ for i in `seq 0 7`; do \
        echo fffffff > /sys/class/net/eth0/queues/rx-$i/rps_cpus ; \
      done

After running linux-2.6.35 and allowing all CPUs to process softirqs, I got the following nice picture in top:

  2238 root      20   0  411m  888  740 S  152  0.0   2:38.94 iperf
    10 root      20   0     0    0    0 R  100  0.0   0:35.44 ksoftirqd/2
    19 root      20   0     0    0    0 R  100  0.0   0:46.48 ksoftirqd/5
    22 root      20   0     0    0    0 R  100  0.0   0:29.10 ksoftirqd/6
    25 root      20   0     0    0    0 R  100  0.0   2:47.36 ksoftirqd/7
    28 root      20   0     0    0    0 R  100  0.0   0:33.73 ksoftirqd/8
    31 root      20   0     0    0    0 R  100  0.0   0:46.63 ksoftirqd/9
    40 root      20   0     0    0    0 R  100  0.0   0:45.33 ksoftirqd/12
    46 root      20   0     0    0    0 R  100  0.0   0:29.10 ksoftirqd/14
    49 root      20   0     0    0    0 R  100  0.0   0:47.35 ksoftirqd/15
    52 root      20   0     0    0    0 R  100  0.0   2:33.74 ksoftirqd/16
    55 root      20   0     0    0    0 R  100  0.0   0:46.92 ksoftirqd/17
    58 root      20   0     0    0    0 R  100  0.0   0:32.07 ksoftirqd/18
    67 root      20   0     0    0    0 R  100  0.0   0:46.63 ksoftirqd/21
    70 root      20   0     0    0    0 R  100  0.0   0:28.95 ksoftirqd/22
    73 root      20   0     0    0    0 R  100  0.0   0:45.03 ksoftirqd/23
     7 root      20   0     0    0    0 R   99  0.0   0:47.97 ksoftirqd/1
    37 root      20   0     0    0    0 R   99  0.0   2:42.29 ksoftirqd/11
    34 root      20   0     0    0    0 R   77  0.0   0:28.78 ksoftirqd/10
    64 root      20   0     0    0    0 R   76  0.0   0:30.34 ksoftirqd/20

So, as we can see, almost all of the cores are now doing softirqs.