High Performance Linux

Friday, September 21, 2012

Linux: scaling softirq among many CPU cores

Some years ago I tested network interrupt affinity: you set ~0 as the CPU mask to balance network interrupts among all your CPU cores, and you get all softirq instances running in parallel. Distributing interrupts among all CPU cores this way is sometimes a bad idea due to CPU cache pressure and probable packet reordering. In most cases it is not recommended for servers running a TCP application (e.g. a web server). However, this ability is crucial for low-level packet applications like firewalls, routers or anti-DDoS solutions (in the last case most of the packets must be dropped as quickly as possible), which do a lot of work in softirq context. So for some time I believed there was no problem sharing softirq load between CPU cores.

To share softirq load between CPU cores you just need to run

    $ for irq in `grep eth0 /proc/interrupts | cut -d: -f1`; do \
        echo ffffff > /proc/irq/$irq/smp_affinity; \
    done

This makes (as I thought) your APIC distribute interrupts among all your CPUs in round-robin fashion (or probably using some cleverer technique). And this really did work in my tests.

Recently a client of ours asked about this ability, so I wrote a very simple test kernel module which just does extra work in softirq context:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

MODULE_LICENSE("GPL");

/**
 * Just eat some local CPU time and accept the packet.
 */
static unsigned int
st_hook(unsigned int hooknum, struct sk_buff *skb,
        const struct net_device *in,
        const struct net_device *out,
        int (*okfn)(struct sk_buff *))
{
    unsigned int i;
    for (i = 0; i <= 1000 * 1000; ++i)
        skb_linearize(skb);

    return NF_ACCEPT;
}

static struct nf_hook_ops st_ip_ops[] __read_mostly = {
    {
        .hook = st_hook,
        .owner = THIS_MODULE,
        .pf = PF_INET,
        .hooknum = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_FIRST,
    },
};

static int __init
st_init(void)
{
    int r;

    /* Propagate the real error code instead of a bare 1. */
    r = nf_register_hooks(st_ip_ops, ARRAY_SIZE(st_ip_ops));
    if (r) {
        printk(KERN_ERR "%s: can't register nf hook\n",
               __FILE__);
        return r;
    }
    printk(KERN_INFO "%s: loaded\n", __FILE__);

    return 0;
}

static void __exit
st_exit(void)
{
    nf_unregister_hooks(st_ip_ops, ARRAY_SIZE(st_ip_ops));
    printk(KERN_INFO "%s: unloaded\n", __FILE__);
}

module_init(st_init);
module_exit(st_exit);

I loaded the system with iperf over a 1Gbps link. And I was very confused to see that only one CPU of the 24-core machine was doing all the work while all the other CPUs were doing nothing!

To understand what's going on, let's have a look at how Linux handles incoming packets and interrupts from the network card (e.g. an Intel 10 Gigabit PCI Express adapter, whose driver lives in drivers/net/ixgbe). Softirqs run in per-CPU kernel threads, ksoftirqd (kernel/softirq.c: ksoftirqd()), i.e. if you have a 4-core machine, then you have 4 ksoftirqd threads (ksoftirqd/0, ksoftirqd/1, ksoftirqd/2 and ksoftirqd/3). ksoftirqd() calls do_softirq(), which in turn calls __do_softirq(). The latter uses the softirq_vec vector to get the required handler for the current softirq type (e.g. NET_RX_SOFTIRQ and NET_TX_SOFTIRQ for receive and transmit softirqs respectively). The next step is to call the handler's virtual function action(); for NET_RX_SOFTIRQ, net_rx_action() (net/core/dev.c) is called here. net_rx_action() reads a napi_struct from the per-CPU softnet_data queue and calls the virtual function poll(), a NAPI callback (ixgbe_poll() in our case) which actually reads packets from the device ring queues. The driver processes interrupts in ixgbe_intr(). This function schedules NAPI via __napi_schedule(), which pushes the current napi_struct onto the per-CPU softnet_data->poll_list, from which net_rx_action() reads packets (on the same CPU). Thus a softirq runs on the same core that received the hardware interrupt.

Thus, in theory, if hardware interrupts go to N cores, then these and only these N cores do the softirq work. So I had a look at the /proc/interrupts statistics and saw that only core 0 was actually receiving interrupts from the NIC, although I had set the ~0 mask in smp_affinity for the interrupt (actually it was an MSI-X card, so I set the mask for all of the card's interrupt vectors).

I started googling for answers: why on earth were the interrupts not distributed among all the cores? The first things I found were nice articles by Alexander Sandler:

According to these articles, not all hardware is actually able to spread interrupts between CPU cores. In my earlier tests I was using IBM servers of a particular model, but that is not the case for the client, who uses very different hardware. This is why I saw such a nice picture in my previous tests but faced quite different behaviour on other hardware.

The good news is that Linux 2.6.35 introduced a nice feature, RPS (Receive Packet Steering). The core of the feature is get_rps_cpu() in net/core/dev.c, which computes a hash from the source and destination IP addresses of an incoming packet and, based on that hash, determines which CPU to send the packet to. netif_receive_skb() or netif_rx(), which call this function, put the packet into the appropriate per-CPU queue for further processing by softirq. So there are two important consequences:
  1. packets are processed by different CPUs (by processing I mostly mean Netfilter pre-routing hooks);
  2. packets belonging to the same TCP stream are unlikely to be reordered (packet reordering is a well-known problem for TCP performance; see for example Beyond Softnet).
To enable the feature you specify a CPU mask as follows (the adapter in this example is connected via MSI-X and has 8 RX-TX queues, so we need to update the masks for all the queues):

    $ for i in `seq 0 7`; do \
        echo ffffff > /sys/class/net/eth0/queues/rx-$i/rps_cpus ; \
    done

After running linux-2.6.35 and allowing all CPUs to process softirq, I got the following nice picture in top:

  2238 root      20   0  411m  888  740 S  152  0.0   2:38.94 iperf
    10 root      20   0     0    0    0 R  100  0.0   0:35.44 ksoftirqd/2
    19 root      20   0     0    0    0 R  100  0.0   0:46.48 ksoftirqd/5
    22 root      20   0     0    0    0 R  100  0.0   0:29.10 ksoftirqd/6
    25 root      20   0     0    0    0 R  100  0.0   2:47.36 ksoftirqd/7
    28 root      20   0     0    0    0 R  100  0.0   0:33.73 ksoftirqd/8
    31 root      20   0     0    0    0 R  100  0.0   0:46.63 ksoftirqd/9
    40 root      20   0     0    0    0 R  100  0.0   0:45.33 ksoftirqd/12
    46 root      20   0     0    0    0 R  100  0.0   0:29.10 ksoftirqd/14
    49 root      20   0     0    0    0 R  100  0.0   0:47.35 ksoftirqd/15
    52 root      20   0     0    0    0 R  100  0.0   2:33.74 ksoftirqd/16
    55 root      20   0     0    0    0 R  100  0.0   0:46.92 ksoftirqd/17
    58 root      20   0     0    0    0 R  100  0.0   0:32.07 ksoftirqd/18
    67 root      20   0     0    0    0 R  100  0.0   0:46.63 ksoftirqd/21
    70 root      20   0     0    0    0 R  100  0.0   0:28.95 ksoftirqd/22
    73 root      20   0     0    0    0 R  100  0.0   0:45.03 ksoftirqd/23
     7 root      20   0     0    0    0 R   99  0.0   0:47.97 ksoftirqd/1
    37 root      20   0     0    0    0 R   99  0.0   2:42.29 ksoftirqd/11
    34 root      20   0     0    0    0 R   77  0.0   0:28.78 ksoftirqd/10
    64 root      20   0     0    0    0 R   76  0.0   0:30.34 ksoftirqd/20

So, as we can see, almost all the cores are doing softirq work.

15 comments:

  1. If we'll ever meet, I owe you a huge and tasty beer :D This article saved my masters degree!!! :D

    ReplyDelete
  2. What a nice article! This is exactly what I need.
    But once I applied RPS on an 8-core machine, I started seeing the system hang with the following back trace:
    INFO: rcu_sched self-detected stall on CPU { 4} (t=5251 jiffies g=207806 c=207805 q=168)
    CPU: 4 PID: 2646 Comm: bash Tainted: P O 3.12.19-linux #15
    INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 5, t=5252 jiffies, g=207806, c=207805, q=168)
    Task dump for CPU 4:
    bash R running task 0 2646 2626 0x00000004
    Call Trace:
    [c0000000f3873910] [0000000000000001] 0x1 (unreliable)
    Call Trace:
    [c0000000f3872ee0] [c00000000000a144] .show_stack+0x168/0x278 (uable)
    [c0000000f3872fd0] [c0000000008ac730] .dump_stack+0x84/0xb0
    [c0000000f3873050] [c0000000000d18e0] .rcu_check_callbacks+0x3f8/0x868
    [c0000000f3873190] [c00000000005871c] .update_process_times+0x50/0x94
    [c0000000f3873220] [c0000000000b2500] .tick_sched_handle.isra.17+0x5c/0x7c
    [c0000000f38732b0] [c0000000000b2584] .tick_sched_timer+0x64/0xa0
    [c0000000f3873350] [c000000000079164] .__run_hrtimer+0xc0/0x250
    [c0000000f38733f0] [c00000000007a00c] .hrtimer_interrupt+0x144/0x31c
    [c0000000f3873500] [c000000000013140] .timer_interrupt+0x12c/0x270
    [c0000000f38735b0] [c00000000001d054] exc_0x900_common+0x104/0x108
    --- Exception: 901 at .smp_call_function_many+0x344/0x3d4
    LR = .smp_call_function_many+0x300/0x3d4
    [c0000000f38738a0] [c0000000000b9794] .smp_call_function_many+0x2dc/0x3d4 (unreliable)
    [c0000000f3873980] [c00000000002daac] .flush_tlb_mm+0xac/0xb4
    [c0000000f3873a20] [c00000000013fdac] .tlb_flush_mmu.part.77+0x3c/0xbc
    [c0000000f3873ab0] [c000000000140050] .tlb_finish_mmu+0x7c/0x80
    [c0000000f3873b30] [c000000000148c2c] .unmap_region+0xf4/0x144
    [c0000000f3873c60] [c00000000014b630] .do_munmap+0x27c/0x394
    [c0000000f3873d20] [c00000000014b79c] .vm_munmap+0x54/0x88
    [c0000000f3873db0] [c00000000014c774] .SyS_munmap+0x28/0x38
    [c0000000f3873e30] [c000000000000598] syscall_exit+0x0/0x8c

    ReplyDelete
    Replies
    1. Hi,

      you have obviously hit a kernel bug. Unfortunately, I can't get a clue from the call trace. Meanwhile, RPS is quite an old and stable feature, so I believe you are using an unstable kernel, buggy drivers, or custom kernel modules.

      Delete
  3. Great find, answers some questions I've had for a long time

    ReplyDelete
  4. Very nice article with a lot of explanations, you made my softirq struggle become much clearer.
    Thank you!

    ReplyDelete
  5. I'm struggling with the same issue:
    all traffic is handled by one CPU.
    I can see that the server runs softirq on all CPUs,
    but most of the time only CPU 0 processes all the network traffic:

    root@goran-1:~# ps -ef | grep irq
    root 3 2 0 Jun17 ? 00:25:38 [ksoftirqd/0]
    root 62 2 0 Jun17 ? 00:01:17 [ksoftirqd/1]
    root 67 2 0 Jun17 ? 00:00:53 [ksoftirqd/2]
    root 72 2 0 Jun17 ? 00:00:42 [ksoftirqd/3]
    root 77 2 0 Jun17 ? 00:00:38 [ksoftirqd/4]
    root 82 2 0 Jun17 ? 00:00:34 [ksoftirqd/5]
    root 87 2 0 Jun17 ? 00:00:01 [ksoftirqd/6]
    root 93 2 0 Jun17 ? 00:00:00 [ksoftirqd/7]
    root 98 2 0 Jun17 ? 00:00:01 [ksoftirqd/8]
    root 103 2 0 Jun17 ? 00:00:00 [ksoftirqd/9]
    root 108 2 0 Jun17 ? 00:00:01 [ksoftirqd/10]
    root 113 2 0 Jun17 ? 00:00:01 [ksoftirqd/11]
    root 118 2 0 Jun17 ? 00:00:09 [ksoftirqd/12]
    root 123 2 0 Jun17 ? 00:00:09 [ksoftirqd/13]
    root 128 2 0 Jun17 ? 00:00:08 [ksoftirqd/14]
    root 133 2 0 Jun17 ? 00:00:08 [ksoftirqd/15]
    root 138 2 0 Jun17 ? 00:00:05 [ksoftirqd/16]
    root 143 2 0 Jun17 ? 00:00:06 [ksoftirqd/17]
    root 148 2 0 Jun17 ? 00:00:00 [ksoftirqd/18]
    root 153 2 0 Jun17 ? 00:00:00 [ksoftirqd/19]
    root 158 2 0 Jun17 ? 00:00:00 [ksoftirqd/20]
    root 163 2 0 Jun17 ? 00:00:00 [ksoftirqd/21]
    root 168 2 0 Jun17 ? 00:00:00 [ksoftirqd/22]
    root 173 2 0 Jun17 ? 00:00:00 [ksoftirqd/23]


    How can I tweak it so that all the traffic is divided among all these processes?

    ReplyDelete
    Replies
    1. Hi Aydin,

      do you use RSS? Do you bind the NIC queues to different CPUs? What is your workload like (how many IPs and ports are present in the ingress traffic)? Also, your NIC probably allows configuring the hash function used for RSS; just check the documentation for your adapter.

      Delete
  6. Hi Alexander,

    I set the parameters as below, but still only one ksoftirqd handling interrupts, my kernel version is 3.10.0-327.el7, centos 7.2.1511, do you have any idea?

    Thanks,
    Daniel

    /sys/class/net/em1/queues
    echo ffffffff > rx-0/rps_cpus
    echo ffffffff > rx-1/rps_cpus
    echo ffffffff > rx-2/rps_cpus
    echo ffffffff > rx-3/rps_cpus
    echo ffffffff > rx-4/rps_cpus
    echo ffffffff > rx-5/rps_cpus
    echo ffffffff > rx-6/rps_cpus
    echo ffffffff > rx-7/rps_cpus
    echo ffffffff > tx-0/xps_cpus
    echo ffffffff > tx-1/xps_cpus
    echo ffffffff > tx-2/xps_cpus
    echo ffffffff > tx-3/xps_cpus
    echo ffffffff > tx-4/xps_cpus
    echo ffffffff > tx-5/xps_cpus
    echo ffffffff > tx-6/xps_cpus
    echo ffffffff > tx-7/xps_cpus

    echo 0-31 > /proc/irq/31/smp_affinity_list
    echo 0-31 > /proc/irq/32/smp_affinity_list
    echo 0-31 > /proc/irq/33/smp_affinity_list
    echo 0-31 > /proc/irq/34/smp_affinity_list
    echo 0-31 > /proc/irq/35/smp_affinity_list
    echo 0-31 > /proc/irq/36/smp_affinity_list
    echo 0-31 > /proc/irq/37/smp_affinity_list
    echo 0-31 > /proc/irq/38/smp_affinity_list
    echo 0-31 > /proc/irq/39/smp_affinity_list

    ReplyDelete
    Replies
    1. [root@jucloud176 docker-593b95aa9984e179fa5ccafa3288504e854a2f7fb806684ff3ef7d706e3f554e.scope]# grep em1 /proc/interrupts
      31: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1
      32: 53802 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 509922331 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-0
      33: 412001491 1930 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-1
      34: 9770208 382549536 0 0 0 0 0 0 4385147 0 0 0 0 0 0 0 19255469 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-2
      35: 51568 0 462987078 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-3
      36: 52754 0 0 474545928 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-4
      37: 56773 0 0 0 374690735 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-5
      38: 56621 0 0 0 0 191607518 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-6
      39: 37805 0 0 0 0 0 237601961 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge em1-TxRx-7

      Delete
    2. Hi Daniel,

      how many flows do you use in the test? For example, if you run the benchmark with only one TCP flow, then it is expected that all the packets go to the same CPU: the kernel tries to minimize inter-CPU locking and data transfers, so it does its best to deliver packets from the same flow to the same CPU.

      Delete
    3. Hi Alexander,

      Thanks for your quick response!
      I use wrk with 8 threads running, command as below:
      ./wrk -t 8 -c 1000 -d 180 --latency "http://10.18.10.89:80/success"

      31228 root 20 0 710740 19308 1144 S 38.3 0.0 0:55.68 wrk
      31231 root 20 0 710740 19308 1144 S 37.6 0.0 0:55.92 wrk
      31230 root 20 0 710740 19308 1144 S 36.6 0.0 0:55.61 wrk
      31232 root 20 0 710740 19308 1144 S 36.6 0.0 0:55.37 wrk
      31227 root 20 0 710740 19308 1144 S 36.3 0.0 0:55.31 wrk
      31226 root 20 0 710740 19308 1144 S 35.6 0.0 0:56.18 wrk
      31229 root 20 0 710740 19308 1144 S 34.7 0.0 0:55.88 wrk
      31225 root 20 0 710740 19308 1144 S 34.0 0.0 0:55.14 wrk

      So there should be 8 TCP flows, right?
      Or do I have to run wrk on different servers?

      Thanks,
      Daniel

      Delete
    4. Hi Alexander,

      I just tried running wrk with the same command on two different servers at the same time, but still only one ksoftirqd is handling interrupts.

      Delete
    5. Hi Daniel,

      each TCP connection is an independent flow, so your test has 1000 flows. However, the flows differ only in the source TCP port. Hashing in NICs isn't perfect (though some adapters allow customizing the hash function), so such a small difference between flows doesn't allow even distribution among CPUs.

      I had another look at your data. From your /proc/interrupts I dare say all 8 queues of your adapter are processed on different CPUs. You have only 8 queues, so it makes sense to expect load on 8 CPU cores. Also, your adapter should coalesce interrupts (please check this with ethtool -c), so we basically don't see huge numbers in /proc/interrupts even under heavy network load.

      Since you have only 8 queues and 32 CPUs, it makes sense to bind each queue to only one CPU, or at most 4, i.e. use masks like '2' or 'f0' for rps_cpus and '2' or '4-7' for smp_affinity_list.

      I just tried your configuration (RPS and IRQ lines bound to all the CPUs) and really got results like the following on my 12-core server:

      # top -d 10 -n 5 -bp $(echo `ps -Ao pid,comm|grep ksoftirqd|awk '{print $1}'`|sed -e 's/\s/,/g')

      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      3 root 20 0 0 0 0 R 33.2 0.0 2:17.45 ksoftirqd/0
      18 root 20 0 0 0 0 S 6.6 0.0 1:04.88 ksoftirqd/2
      38 root 20 0 0 0 0 S 6.6 0.0 0:22.84 ksoftirqd/6
      13 root 20 0 0 0 0 S 0.0 0.0 1:11.92 ksoftirqd/1
      23 root 20 0 0 0 0 S 0.0 0.0 1:04.72 ksoftirqd/3
      28 root 20 0 0 0 0 S 0.0 0.0 3:28.37 ksoftirqd/4
      33 root 20 0 0 0 0 S 0.0 0.0 0:23.96 ksoftirqd/5
      43 root 20 0 0 0 0 S 0.0 0.0 0:22.80 ksoftirqd/7
      48 root 20 0 0 0 0 S 0.0 0.0 0:03.82 ksoftirqd/8
      53 root 20 0 0 0 0 S 0.0 0.0 0:36.81 ksoftirqd/9
      58 root 20 0 0 0 0 S 0.0 0.0 0:40.25 ksoftirqd/10
      63 root 20 0 0 0 0 S 0.0 0.0 0:37.86 ksoftirqd/11


      However, the load distribution becomes closer to even when I bind each RPS/IRQ queue to a single separate CPU:


      # top -d 10 -n 5 -bp $(echo `ps -Ao pid,comm|grep ksoftirqd|awk '{print $1}'`|sed -e 's/\s/,/g')

      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      3 root 20 0 0 0 0 S 4.9 0.0 2:45.94 ksoftirqd/0
      13 root 20 0 0 0 0 S 3.7 0.0 1:17.86 ksoftirqd/1
      38 root 20 0 0 0 0 S 3.0 0.0 0:25.50 ksoftirqd/6
      18 root 20 0 0 0 0 S 2.1 0.0 1:08.27 ksoftirqd/2
      23 root 20 0 0 0 0 S 2.0 0.0 1:07.14 ksoftirqd/3
      28 root 20 0 0 0 0 S 1.9 0.0 3:30.61 ksoftirqd/4
      33 root 20 0 0 0 0 S 1.8 0.0 0:25.92 ksoftirqd/5
      43 root 20 0 0 0 0 S 0.3 0.0 0:24.45 ksoftirqd/7
      48 root 20 0 0 0 0 S 0.1 0.0 0:03.98 ksoftirqd/8
      58 root 20 0 0 0 0 S 0.1 0.0 0:40.41 ksoftirqd/10
      63 root 20 0 0 0 0 S 0.1 0.0 0:38.02 ksoftirqd/11
      53 root 20 0 0 0 0 S 0.0 0.0 0:36.96 ksoftirqd/9


      Still imperfect, but much better. BTW, I used

      ./wrk -c 32768 -t 16 -d 300 http://192.168.0.1/

      run from another server.

      Delete