High Performance Linux

Wednesday, December 14, 2011

Linux Zero Copy Output

Recently we implemented high-performance network traffic processing server. It uses splice(2)/vmsplice(2) Linux system calls which provide zero-copy data transfers between file descriptors and user space. On modern linux kernels it only makes difference for network output since network input is implemented by common copy_to_user() call.

Before starting with the system calls I ran tests provided by Jens Axboe(initial developer of splice) on two servers and got 3.7 times performance improvement. The results are below (please consider only sys time for xmit (sender) programs, since data copy is performed in kernel space). Large reall time is due to TCP buffers overflow. So the real bottleneck is network throughput (ever for loopback interface), but this calls still could have performance impact on large data transfers and huge local cpu usage (e.g. replication process in parallel with local huge read loading).

Splice() output:

# ./nettest/xmit -s65536 -p1000000  127.0.0.1 5500
opt packets=1000000
use port 5500
xmit: msg=64kb, packets=1000000 vmsplice() -> splice()
Connecting to 127.0.0.1/5500
usr=9259, sys=6864, real=27973

# ./nettest/recv -s65536 -r 5500
recv: msg=64kb, recv()
Waiting for connect...
Got connect!
usr=219, sys=27746, real=27973


Sendmsg() output:

# ./nettest/xmit -s65536 -p1000000 -n 127.0.0.1 5500
opt packets=1000000
use normal io
use port 5500
xmit: msg=64kb, packets=1000000 send()
Connecting to 127.0.0.1/5500
usr=8762, sys=25497, real=34261

# ./nettest/recv -s65536 -r 5500
recv: msg=64kb, recv()
Waiting for connect...
Got connect!
usr=198, sys=30100, real=34261 
 
Usage examples can be found in Jens's test code. This is really fast. However the only one big drawback of the technology is necessity of double buffering. Since data pages which sent to kernel by vmsplice() are directly used by network stack (BTW you can use the syscalls not only with TCP, but also with SCTP and UDP protocols - it has generic implementation in the kernel), so you can not use the pages until kernel completely send them to network. When it happen? Not latter than you wrote 2 size of network output buffer. Thus only when you send double network output buffer sized data you can use the pages again. In practice it means that you need special memory (page) allocator.

2 comments:

  1. Where to get code for the above mentioned case study?

    regards
    Raghu

    ReplyDelete
  2. Hi Raghunath,

    there is a link to Jens Axboe's splice suite above. Here it is http://git.kernel.dk/?p=splice.git;a=summary

    ReplyDelete