High Performance Linux

Friday, September 21, 2012

Linux: scaling softirq among many CPU cores

Some years ago I tested network interrupt affinity: you set ~0 as the CPU mask to balance network interrupts among all your CPU cores, and you get all softirq instances running in parallel. Distributing interrupts among all CPU cores this way is sometimes a bad idea due to CPU cache pressure and probable packet reordering. In most cases it is not recommended for servers running a TCP application (e.g. a web server). However, this ability is crucial for low-level packet applications like firewalls, routers or anti-DDoS solutions (in the last case most of the packets must be dropped as quickly as possible), which do a lot of work in softirq context. So for some time I believed there was no problem sharing softirq load between CPU cores.

To share softirq load between CPU cores you just need to run

    $ for irq in `grep eth0 /proc/interrupts | cut -d: -f1`; do \
        echo ffffff > /proc/irq/$irq/smp_affinity; \
    done

This makes (as I thought) your APIC distribute interrupts among all your CPUs in round-robin fashion (or probably using some cleverer technique). And this really did work in my tests.

Recently a client of ours asked about this ability, so I wrote a very simple test kernel module which just does extra work in softirq context:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

MODULE_LICENSE("GPL");

/**
 * Just eat some local CPU time and accept the packet.
 */
static unsigned int
st_hook(unsigned int hooknum, struct sk_buff *skb,
        const struct net_device *in,
        const struct net_device *out,
        int (*okfn)(struct sk_buff *))
{
    unsigned int i;
    for (i = 0; i <= 1000 * 1000; ++i)
        skb_linearize(skb);

    return NF_ACCEPT;
}

static struct nf_hook_ops st_ip_ops[] __read_mostly = {
    {
        .hook = st_hook,
        .owner = THIS_MODULE,
        .pf = PF_INET,
        .hooknum = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_FIRST,
    },
};

static int __init
st_init(void)
{
    int r;

    /* Propagate the real error code instead of a bare 1. */
    r = nf_register_hooks(st_ip_ops, ARRAY_SIZE(st_ip_ops));
    if (r) {
        printk(KERN_ERR "%s: can't register nf hook\n",
               __FILE__);
        return r;
    }
    printk(KERN_INFO "%s: loaded\n", __FILE__);

    return 0;
}

static void __exit
st_exit(void)
{
    nf_unregister_hooks(st_ip_ops, ARRAY_SIZE(st_ip_ops));
    printk(KERN_INFO "%s: unloaded\n", __FILE__);
}

module_init(st_init);
module_exit(st_exit);

I loaded the system with iperf over a 1Gbps link. And I was very confused to see that only one CPU of the 24-core machine was doing all the work while all the other CPUs were doing nothing!

To understand what's going on, let's have a look at how Linux handles incoming packets and interrupts from the network card (e.g. an Intel 10 Gigabit PCI Express adapter, whose driver lives in drivers/net/ixgbe). Softirqs run in per-CPU kernel threads, ksoftirqd (kernel/softirq.c: ksoftirqd()), i.e. if you have a 4-core machine, then you have 4 ksoftirqd threads (ksoftirqd/0, ksoftirqd/1, ksoftirqd/2 and ksoftirqd/3). ksoftirqd() calls do_softirq(), which in turn calls __do_softirq(). The latter uses the softirq_vec vector to get the required handler for the current softirq type (e.g. NET_RX_SOFTIRQ and NET_TX_SOFTIRQ for receive and transmit softirqs respectively). The next step is to call the handler's virtual function action(); for NET_RX_SOFTIRQ, net_rx_action() (net/core/dev.c) is called here. net_rx_action() reads a napi_struct from the per-CPU softnet_data queue and calls the virtual function poll(), a NAPI callback (ixgbe_poll() in our case) which actually reads packets from the device ring queues. The driver processes interrupts in ixgbe_intr(). This function schedules NAPI via __napi_schedule(), which pushes the current napi_struct onto the per-CPU softnet_data->poll_list, from which net_rx_action() reads packets (on the same CPU). Thus a softirq runs on the same core that received the hardware interrupt.

Thus, in theory, if hardware interrupts go to N cores, then these and only these N cores do the softirq work. So I had a look at the /proc/interrupts statistics and saw that only core 0 was actually receiving interrupts from the NIC, although I had set the ~0 mask in smp_affinity for the interrupt (actually it was an MSI-X card, so I set the mask for all of the card's interrupt vectors).

I started googling for answers: why on earth were the interrupts not distributed among all the cores? The first things I found were nice articles by Alexander Sandler:

According to these articles, not all hardware is actually able to spread interrupts between CPU cores. In my earlier tests I was using IBM servers of a particular model, but that is not the case for the client, who uses very different hardware. This is why I saw such a nice picture in my previous tests but faced quite different behaviour on other hardware.

The good news is that Linux 2.6.35 introduced a nice feature, RPS (Receive Packet Steering). The core of the feature is get_rps_cpu() in net/core/dev.c, which computes a hash from the source and destination IP addresses of an incoming packet and, based on that hash, determines which CPU to send the packet to. netif_receive_skb() or netif_rx(), which call this function, put the packet into the appropriate per-CPU queue for further processing by softirq. So there are two important consequences:
  1. packets are processed by different CPUs (by processing I mostly mean Netfilter pre-routing hooks);
  2. packets belonging to the same TCP stream are unlikely to be reordered (packet reordering is a well-known problem for TCP performance; see for example Beyond Softnet).
To enable the feature you specify a CPU mask as follows (the adapter in this example is connected via MSI-X and has 8 RX-TX queues, so we need to update the masks for all the queues):

    $ for i in `seq 0 7`; do \
        echo ffffff > /sys/class/net/eth0/queues/rx-$i/rps_cpus ; \
    done

After running linux-2.6.35 and allowing all CPUs to process softirq, I got the following nice picture in top:

  2238 root      20   0  411m  888  740 S  152  0.0   2:38.94 iperf
    10 root      20   0     0    0    0 R  100  0.0   0:35.44 ksoftirqd/2
    19 root      20   0     0    0    0 R  100  0.0   0:46.48 ksoftirqd/5
    22 root      20   0     0    0    0 R  100  0.0   0:29.10 ksoftirqd/6
    25 root      20   0     0    0    0 R  100  0.0   2:47.36 ksoftirqd/7
    28 root      20   0     0    0    0 R  100  0.0   0:33.73 ksoftirqd/8
    31 root      20   0     0    0    0 R  100  0.0   0:46.63 ksoftirqd/9
    40 root      20   0     0    0    0 R  100  0.0   0:45.33 ksoftirqd/12
    46 root      20   0     0    0    0 R  100  0.0   0:29.10 ksoftirqd/14
    49 root      20   0     0    0    0 R  100  0.0   0:47.35 ksoftirqd/15
    52 root      20   0     0    0    0 R  100  0.0   2:33.74 ksoftirqd/16
    55 root      20   0     0    0    0 R  100  0.0   0:46.92 ksoftirqd/17
    58 root      20   0     0    0    0 R  100  0.0   0:32.07 ksoftirqd/18
    67 root      20   0     0    0    0 R  100  0.0   0:46.63 ksoftirqd/21
    70 root      20   0     0    0    0 R  100  0.0   0:28.95 ksoftirqd/22
    73 root      20   0     0    0    0 R  100  0.0   0:45.03 ksoftirqd/23
     7 root      20   0     0    0    0 R   99  0.0   0:47.97 ksoftirqd/1
    37 root      20   0     0    0    0 R   99  0.0   2:42.29 ksoftirqd/11
    34 root      20   0     0    0    0 R   77  0.0   0:28.78 ksoftirqd/10
    64 root      20   0     0    0    0 R   76  0.0   0:30.34 ksoftirqd/20

So, as we can see, almost all the cores are doing softirq work.

15 comments:

  1. If we'll ever meet, I owe you a huge and tasty beer :D This article saved my masters degree!!! :D

    ReplyDelete
  2. What a nice article! This is exactly what I need.
    But once I applied RPS on an 8-core machine, I started seeing the system hang with the following back trace:
    INFO: rcu_sched self-detected stall on CPU { 4} (t=5251 jiffies g=207806 c=207805 q=168)
    CPU: 4 PID: 2646 Comm: bash Tainted: P O 3.12.19-linux #15
    INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 5, t=5252 jiffies, g=207806, c=207805, q=168)
    Task dump for CPU 4:
    bash R running task 0 2646 2626 0x00000004
    Call Trace:
    [c0000000f3873910] [0000000000000001] 0x1 (unreliable)
    Call Trace:
    [c0000000f3872ee0] [c00000000000a144] .show_stack+0x168/0x278 (uable)
    [c0000000f3872fd0] [c0000000008ac730] .dump_stack+0x84/0xb0
    [c0000000f3873050] [c0000000000d18e0] .rcu_check_callbacks+0x3f8/0x868
    [c0000000f3873190] [c00000000005871c] .update_process_times+0x50/0x94
    [c0000000f3873220] [c0000000000b2500] .tick_sched_handle.isra.17+0x5c/0x7c
    [c0000000f38732b0] [c0000000000b2584] .tick_sched_timer+0x64/0xa0
    [c0000000f3873350] [c000000000079164] .__run_hrtimer+0xc0/0x250
    [c0000000f38733f0] [c00000000007a00c] .hrtimer_interrupt+0x144/0x31c
    [c0000000f3873500] [c000000000013140] .timer_interrupt+0x12c/0x270
    [c0000000f38735b0] [c00000000001d054] exc_0x900_common+0x104/0x108
    --- Exception: 901 at .smp_call_function_many+0x344/0x3d4
    LR = .smp_call_function_many+0x300/0x3d4
    [c0000000f38738a0] [c0000000000b9794] .smp_call_function_many+0x2dc/0x3d4 (unreliable)
    [c0000000f3873980] [c00000000002daac] .flush_tlb_mm+0xac/0xb4
    [c0000000f3873a20] [c00000000013fdac] .tlb_flush_mmu.part.77+0x3c/0xbc
    [c0000000f3873ab0] [c000000000140050] .tlb_finish_mmu+0x7c/0x80
    [c0000000f3873b30] [c000000000148c2c] .unmap_region+0xf4/0x144
    [c0000000f3873c60] [c00000000014b630] .do_munmap+0x27c/0x394
    [c0000000f3873d20] [c00000000014b79c] .vm_munmap+0x54/0x88
    [c0000000f3873db0] [c00000000014c774] .SyS_munmap+0x28/0x38
    [c0000000f3873e30] [c000000000000598] syscall_exit+0x0/0x8c

    ReplyDelete
    Replies
    1. Hi,

      you have obviously hit a kernel bug. Unfortunately, I can't get a clue from the call trace. Meanwhile, RPS is quite an old and stable feature, so I believe you are using an unstable kernel, buggy drivers, or custom kernel modules.

      Delete
  3. Great find, answers some questions I've had for a long time

    ReplyDelete
  4. Very nice article with a lot of explanations, you made my softirq struggle become much clearer.
    Thank you!

    ReplyDelete
  5. I'm struggling with the same issue:
    all traffic is handled by one CPU.
    I can see that the server runs softirq on all CPUs,
    but most of the time only CPU 0 processes all the network traffic:

    root@goran-1:~# ps -ef | grep irq
    root 3 2 0 Jun17 ? 00:25:38 [ksoftirqd/0]
    root 62 2 0 Jun17 ? 00:01:17 [ksoftirqd/1]
    root 67 2 0 Jun17 ? 00:00:53 [ksoftirqd/2]
    root 72 2 0 Jun17 ? 00:00:42 [ksoftirqd/3]
    root 77 2 0 Jun17 ? 00:00:38 [ksoftirqd/4]
    root 82 2 0 Jun17 ? 00:00:34 [ksoftirqd/5]
    root 87 2 0 Jun17 ? 00:00:01 [ksoftirqd/6]
    root 93 2 0 Jun17 ? 00:00:00 [ksoftirqd/7]
    root 98 2 0 Jun17 ? 00:00:01 [ksoftirqd/8]
    root 103 2 0 Jun17 ? 00:00:00 [ksoftirqd/9]
    root 108 2 0 Jun17 ? 00:00:01 [ksoftirqd/10]
    root 113 2 0 Jun17 ? 00:00:01 [ksoftirqd/11]
    root 118 2 0 Jun17 ? 00:00:09 [ksoftirqd/12]
    root 123 2 0 Jun17 ? 00:00:09 [ksoftirqd/13]
    root 128 2 0 Jun17 ? 00:00:08 [ksoftirqd/14]
    root 133 2 0 Jun17 ? 00:00:08 [ksoftirqd/15]
    root 138 2 0 Jun17 ? 00:00:05 [ksoftirqd/16]
    root 143 2 0 Jun17 ? 00:00:06 [ksoftirqd/17]
    root 148 2 0 Jun17 ? 00:00:00 [ksoftirqd/18]
    root 153 2 0 Jun17 ? 00:00:00 [ksoftirqd/19]
    root 158 2 0 Jun17 ? 00:00:00 [ksoftirqd/20]
    root 163 2 0 Jun17 ? 00:00:00 [ksoftirqd/21]
    root 168 2 0 Jun17 ? 00:00:00 [ksoftirqd/22]
    root 173 2 0 Jun17 ? 00:00:00 [ksoftirqd/23]


    How can I tweak it so that all the traffic is divided among all these processes?

    ReplyDelete
    Replies
    1. Hi Aydin,

      do you use RSS? Do you bind the NIC queues to different CPUs? What is your workload like (how many IPs and ports are present in the ingress traffic)? Also, your NIC probably allows configuring the hash function used for RSS; just check the documentation for your adapter.

      Delete
  6. Hi Alexander,

    I set the parameters as below, but still only one ksoftirqd handling interrupts, my kernel version is 3.10.0-327.el7, centos 7.2.1511, do you have any idea?

    Thanks,
    Daniel

    /sys/class/net/em1/queues
    echo ffffffff > rx-0/rps_cpus
    echo ffffffff > rx-1/rps_cpus
    echo ffffffff > rx-2/rps_cpus
    echo ffffffff > rx-3/rps_cpus
    echo ffffffff > rx-4/rps_cpus
    echo ffffffff > rx-5/rps_cpus
    echo ffffffff > rx-6/rps_cpus
    echo ffffffff > rx-7/rps_cpus
    echo ffffffff > tx-0/xps_cpus
    echo ffffffff > tx-1/xps_cpus
    echo ffffffff > tx-2/xps_cpus
    echo ffffffff > tx-3/xps_cpus
    echo ffffffff > tx-4/xps_cpus
    echo ffffffff > tx-5/xps_cpus
    echo ffffffff > tx-6/xps_cpus
    echo ffffffff > tx-7/xps_cpus

    echo 0-31 > /proc/irq/31/smp_affinity_list
    echo 0-31 > /proc/irq/32/smp_affinity_list
    echo 0-31 > /proc/irq/33/smp_affinity_list
    echo 0-31 > /proc/irq/34/smp_affinity_list
    echo 0-31 > /proc/irq/35/smp_affinity_list
    echo 0-31 > /proc/irq/36/smp_affinity_list
    echo 0-31 > /proc/irq/37/smp_affinity_list
    echo 0-31 > /proc/irq/38/smp_affinity_list
    echo 0-31 > /proc/irq/39/smp_affinity_list

    ReplyDelete
    Replies
    1. [root@jucloud176 docker-593b95aa9984e179fa5ccafa3288504e854a2f7fb806684ff3ef7d706e3f554e.scope]# grep em1 /proc/interrupts
      31: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1
      32: 53802 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 509922331 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-0
      33: 412001491 1930 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-1
      34: 9770208 382549536 0 0 0 0 0 0 4385147 0 0 0 0 0 0 0 19255469 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-2
      35: 51568 0 462987078 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-3
      36: 52754 0 0 474545928 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-4
      37: 56773 0 0 0 374690735 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-5
      38: 56621 0 0 0 0 191607518 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-6
      39: 37805 0 0 0 0 0 237601961 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-7

      Delete
    2. Hi Daniel,

      how many flows do you use in the test? For example, if you run the benchmark with only one TCP flow, then it is expected that all the packets go to the same CPU: the kernel tries to minimize inter-CPU locking and data transfers, so it does its best to deliver packets from the same flow to the same CPU.

      Delete
    3. Hi Alexander,

      Thanks for your quick response!
      I use wrk with 8 threads running, command as below:
      ./wrk -t 8 -c 1000 -d 180 --latency "http://10.18.10.89:80/success"

      31228 root 20 0 710740 19308 1144 S 38.3 0.0 0:55.68 wrk
      31231 root 20 0 710740 19308 1144 S 37.6 0.0 0:55.92 wrk
      31230 root 20 0 710740 19308 1144 S 36.6 0.0 0:55.61 wrk
      31232 root 20 0 710740 19308 1144 S 36.6 0.0 0:55.37 wrk
      31227 root 20 0 710740 19308 1144 S 36.3 0.0 0:55.31 wrk
      31226 root 20 0 710740 19308 1144 S 35.6 0.0 0:56.18 wrk
      31229 root 20 0 710740 19308 1144 S 34.7 0.0 0:55.88 wrk
      31225 root 20 0 710740 19308 1144 S 34.0 0.0 0:55.14 wrk

      So there should be 8 TCP flows, right?
      Or do I have to run wrk on different servers?

      Thanks,
      Daniel

      Delete
    4. Hi Alexander,

      I just tried running wrk with the same command on two different servers at the same time, but still only one ksoftirqd is handling interrupts.

      Delete
    5. Hi Daniel,

      each TCP connection is an independent flow, so your test has 1000 flows. However, the flows differ only in the source TCP port. Hashing in NICs isn't perfect (though some adapters allow customizing the hash function), so such a small difference between flows doesn't allow even distribution among CPUs.

      I had another look at your data. From your /proc/interrupts I dare say all 8 queues of your adapter are processed on different CPUs. You have only 8 queues, so it makes sense to expect load on 8 CPU cores. Also, your adapter should coalesce interrupts (please check this with ethtool -c), so we basically don't see huge numbers in /proc/interrupts even under heavy network load.

      Since you have only 8 queues and 32 CPUs, it makes sense to bind each queue to only one CPU, or at most 4, i.e. use masks like '2' or 'f0' for rps_cpus and '2' or '4-7' for smp_affinity_list.

      I just tried your configuration (RPS and IRQ lines bound to all the CPUs) and really got results like the following on my 12-core server:

      # top -d 10 -n 5 -bp $(echo `ps -Ao pid,comm|grep ksoftirqd|awk '{print $1}'`|sed -e 's/\s/,/g')

      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      3 root 20 0 0 0 0 R 33.2 0.0 2:17.45 ksoftirqd/0
      18 root 20 0 0 0 0 S 6.6 0.0 1:04.88 ksoftirqd/2
      38 root 20 0 0 0 0 S 6.6 0.0 0:22.84 ksoftirqd/6
      13 root 20 0 0 0 0 S 0.0 0.0 1:11.92 ksoftirqd/1
      23 root 20 0 0 0 0 S 0.0 0.0 1:04.72 ksoftirqd/3
      28 root 20 0 0 0 0 S 0.0 0.0 3:28.37 ksoftirqd/4
      33 root 20 0 0 0 0 S 0.0 0.0 0:23.96 ksoftirqd/5
      43 root 20 0 0 0 0 S 0.0 0.0 0:22.80 ksoftirqd/7
      48 root 20 0 0 0 0 S 0.0 0.0 0:03.82 ksoftirqd/8
      53 root 20 0 0 0 0 S 0.0 0.0 0:36.81 ksoftirqd/9
      58 root 20 0 0 0 0 S 0.0 0.0 0:40.25 ksoftirqd/10
      63 root 20 0 0 0 0 S 0.0 0.0 0:37.86 ksoftirqd/11


      However, the load distribution becomes closer to even when I bind each RPS/IRQ queue to a single separate CPU:


      # top -d 10 -n 5 -bp $(echo `ps -Ao pid,comm|grep ksoftirqd|awk '{print $1}'`|sed -e 's/\s/,/g')

      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      3 root 20 0 0 0 0 S 4.9 0.0 2:45.94 ksoftirqd/0
      13 root 20 0 0 0 0 S 3.7 0.0 1:17.86 ksoftirqd/1
      38 root 20 0 0 0 0 S 3.0 0.0 0:25.50 ksoftirqd/6
      18 root 20 0 0 0 0 S 2.1 0.0 1:08.27 ksoftirqd/2
      23 root 20 0 0 0 0 S 2.0 0.0 1:07.14 ksoftirqd/3
      28 root 20 0 0 0 0 S 1.9 0.0 3:30.61 ksoftirqd/4
      33 root 20 0 0 0 0 S 1.8 0.0 0:25.92 ksoftirqd/5
      43 root 20 0 0 0 0 S 0.3 0.0 0:24.45 ksoftirqd/7
      48 root 20 0 0 0 0 S 0.1 0.0 0:03.98 ksoftirqd/8
      58 root 20 0 0 0 0 S 0.1 0.0 0:40.41 ksoftirqd/10
      63 root 20 0 0 0 0 S 0.1 0.0 0:38.02 ksoftirqd/11
      53 root 20 0 0 0 0 S 0.0 0.0 0:36.96 ksoftirqd/9


      Still imperfect, but much better. BTW, I used

      ./wrk -c 32768 -t 16 -d 300 http://192.168.0.1/

      run from another server.

      Delete