Recently we implemented a high-performance network traffic processing server. It uses the splice(2) and vmsplice(2) Linux system calls, which provide zero-copy data transfers between file descriptors and user space. On modern Linux kernels this only makes a difference for network output, since network input is still implemented by an ordinary copy_to_user() call. Before starting with the system calls I ran the tests provided by Jens Axboe (the initial developer of splice) on two servers and got a 3.7 times improvement in sender system time (sys=6864 with splice versus sys=25497 with send()).

The results are below. Please consider only the sys time of the xmit (sender) programs, since the data copy is performed in kernel space. The large real time is due to TCP buffer overflow, so the real bottleneck is network throughput (even for the loopback interface). These calls can still have a performance impact when large data transfers coincide with heavy local CPU usage (e.g. a replication process running in parallel with a heavy local read load).

Splice() output:

    # ./nettest/xmit -s65536 -p1000000 127.0.0.1 5500
    opt packets=1000000
    use port 5500
    xmit: msg=64kb, packets=1000000 vmsplice() -> splice()
    Connecting to 127.0.0.1/5500
    usr=9259, sys=6864, real=27973

    # ./nettest/recv -s65536 -r 5500
    recv: msg=64kb, recv()
    Waiting for connect...
    Got connect!
    usr=219, sys=27746, real=27973

Sendmsg() output:

    # ./nettest/xmit -s65536 -p1000000 -n 127.0.0.1 5500
    opt packets=1000000
    use normal io
    use port 5500
    xmit: msg=64kb, packets=1000000 send()
    Connecting to 127.0.0.1/5500
    usr=8762, sys=25497, real=34261

    # ./nettest/recv -s65536 -r 5500
    recv: msg=64kb, recv()
    Waiting for connect...
    Got connect!
    usr=198, sys=30100, real=34261
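The sender path that the xmit output labels "vmsplice() -> splice()" can be sketched roughly as follows. This is only a minimal sketch and not Jens's code: the function name splice_send and its parameters are made up for illustration, and it assumes blocking descriptors and a pipe created by the caller.

```c
#define _GNU_SOURCE             /* exposes vmsplice() and splice() */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * splice_send() is a hypothetical helper, not part of any library.
 * It maps the user pages of buf into the pipe with vmsplice() and then
 * moves them from the pipe to out_fd (e.g. a connected socket) with
 * splice(). Returns 0 on success and -1 on failure.
 * The pages of buf stay referenced by the kernel after this returns,
 * so the caller must not overwrite them right away.
 */
static int splice_send(int pipe_fds[2], int out_fd, char *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    assert(buf != NULL);
    while (iov.iov_len > 0) {
        /* May accept less than requested when the pipe fills up. */
        ssize_t in = vmsplice(pipe_fds[1], &iov, 1, 0);
        if (in < 0)
            return -1;
        iov.iov_base = (char *)iov.iov_base + in;
        iov.iov_len -= (size_t)in;

        /* Drain what was just queued in the pipe into out_fd. */
        while (in > 0) {
            ssize_t out = splice(pipe_fds[0], NULL, out_fd, NULL,
                                 (size_t)in, 0);
            if (out <= 0)
                return -1;
            in -= out;
        }
    }
    return 0;
}
```

A caller would create the pipe once with pipe(), connect a socket, and then call splice_send(pipe_fds, sock, buf, len) for each message, in place of send().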
Usage examples can be found in Jens's test code. This is really fast. However, the technology has one big drawback: it requires double buffering. The data pages passed to the kernel by vmsplice() are used directly by the network stack (by the way, the syscalls work not only with TCP but also with SCTP and UDP, since the kernel implementation is generic), so you cannot reuse the pages until the kernel has completely sent them to the network. When does that happen? No later than the point at which you have written twice the size of the network output buffer. In other words, only after you have sent data amounting to double the network output buffer size can you use the pages again. In practice this means that you need a special memory (page) allocator.
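One way such an allocator might look is a ring of page-aligned chunks that is large enough for a chunk to come around again only after twice the output buffer size has been handed out in between. The sketch below is an illustration under assumptions, not the allocator from our server: the names page_ring, ring_init and ring_next are hypothetical, the caller supplies the output buffer size, and every chunk is assumed to be sent in full before the next one is requested.

```c
#define _POSIX_C_SOURCE 200112L   /* posix_memalign(), sysconf() */
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical ring of page-aligned chunks handed out in order. */
struct page_ring {
    char  *base;    /* start of the page-aligned region */
    size_t chunk;   /* bytes per chunk, a multiple of the page size */
    size_t count;   /* number of chunks in the ring */
    size_t next;    /* index of the next chunk to hand out */
};

/*
 * Size the ring so that between two uses of the same chunk at least
 * 2 * sndbuf bytes of other chunks are handed out (the rule of thumb
 * from the text). Returns 0 on success or an error number from
 * posix_memalign().
 */
static int ring_init(struct page_ring *r, size_t chunk, size_t sndbuf)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *mem = NULL;
    int err;

    assert(chunk > 0 && chunk % page == 0);
    r->chunk = chunk;
    r->count = (2 * sndbuf + chunk - 1) / chunk + 1;
    r->next = 0;
    err = posix_memalign(&mem, page, r->count * chunk);
    r->base = mem;
    return err;
}

/* Hand out the next chunk; the oldest chunk is reused after a full lap. */
static char *ring_next(struct page_ring *r)
{
    char *p = r->base + r->next * r->chunk;

    r->next = (r->next + 1) % r->count;
    return p;
}
```

Each outgoing message would be built in the chunk returned by ring_next() and then passed to vmsplice(). The ring trades memory for safety: it never asks the kernel whether a page is free, it only relies on the distance between reuses.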