Please feel free to add comments with additional references and good links about benchmarking, performance monitoring and profiling. Additions, corrections and other suggestions to make the "tutorial" more complete are very much appreciated.
Define the Reference and the Target
The first thing to do is to clearly define what "fast" means for your program. You probably have a business requirement like "the software must process at least 50,000 requests per second (RPS)". Or perhaps you reference some existing software, e.g. "be as fast as the competitor's solution".
Nginx is so good and so widely used that we use it as the reference in our development. Actually, we could reference Varnish or Apache Traffic Server as well (and we do keep their design in mind), but they are very close to Nginx in performance, simply because they all are traditional user-space daemons. However, since Tempesta FW is a pure kernel solution, with its pros and cons, it a priori delivers much more performance than any traditional user-space solution (solutions based on user-space TCP/IP are a different story). So we defined our target as "be 10 times faster than a traditional user-space Web accelerator" and use Nginx as the example of a "traditional user-space Web accelerator" throughout the article.
Define the Measurements
The next step is to define the performance measurements. The truth is that only real production usage will show whether the software is fast enough. However, it's usually unwise to deploy software in production without preliminary performance measurements. So we're doing preliminary, in other words synthetic, performance testing.
Why can't such tests usually show real performance? Because under real workload you'll face very different use cases. For example, a Web server in a real deployment sees various DDoS attacks, users with old and/or RFC-incompatible Web browsers, very different robots from Web search engines and many other clients. It's usually just impossible to emulate all these usage patterns in tests.
Thus, we define a very reduced set of test cases, or benchmarks, to analyze the performance of our software in particular use cases. For example, we can define the following benchmark scenarios:
- Analyze performance for 1, 8, 64, 512, 2048, 8192 and 32768 connections. This test case analyzes how well the program scales with an increasing number of connections. A small number of connections challenges single CPU core performance, while the largest connection counts challenge I/O multiplexing and probably the internal synchronization mechanisms of the Web accelerator and the operating system.
- Test minimal requests like 'GET / HTTP/1.0\r\n\r\n' (simple HTTP flood) as well as normal browser requests. The software should be fast for normal users as well as resistant to application layer DDoS attacks;
- Test small (16 bytes), medium (3KB) and large (16KB) Web content files (see the sketch after this list for generating such files). Well, 16 bytes seems an unrealistic payload size, but it efficiently tests networking I/O, HTTP processing and other logic. Meantime, very large payloads, like tens of megabytes or more, just test network throughput and minimize the Web accelerator's role, so we don't use them. The interesting case is the medium size: 3KB plus some space for HTTP headers and metadata fits a 4KB system page, which is a very comfortable case for memory allocators.
- Test normal Web content processing and the efficiency of blocking rate-limited clients. The motivation for this test case is the same as for (2): be fast under normal conditions and DDoS resistant.
- Test keep-alive HTTP sessions as well as a new TCP connection for each request. A new TCP connection per HTTP request puts good stress on network I/O, Web accelerator session initialization and so on, while keep-alive connections move the focus to HTTP processing.
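For reference, the three content sizes from the list above can be prepared with a couple of shell one-liners. This is just a sketch assuming a standard document root; the paths and the generation method are mine, not part of the original setup:
# 16-byte, 3KB and 16KB test pages (paths are examples)
printf '0123456789abcde\n' > /var/www/html/index.html
head -c 3072 /dev/urandom | base64 | head -c 3072 > /var/www/html/medium.html
head -c 16384 /dev/urandom | base64 | head -c 16384 > /var/www/html/large.html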
Another example could be a business requirement. A very useful requirement is to show performance results as high as possible, but without cheating (e.g. without using jumbo frames). Usually this requirement comes from the marketing department, since these guys must show where the product outperforms competitors. In this case you choose whichever benchmark test is the most promising for your solution and do fair performance tests on it.
In our example, test No. 4, blocking of rate-limited clients, is exactly this case. Any modern user-space Web accelerator runs rate limiting at the application layer: e.g. if Nginx rate limits a client, it still has to accept the TCP connection from the client, read and parse the request, and only after that do the limiting modules get control and block the request. Meantime, Tempesta FW passes blocked IP addresses directly to the IP layer, and all following requests from the malicious client are immediately blocked at the IP layer (you can find details in my presentation - it's in Russian, but the slides are in English). This is a unique feature where the solution shines, and the performance results will clearly show that.
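To illustrate the difference from the administrator's point of view, here is a rough, hedged sketch (the offending address is an example and the Nginx limit_req mechanism is named only for illustration, it is not the configuration used in the tests):
# Application layer (e.g. Nginx limit_req): the TCP connection is still accepted
# and the request parsed before the limiting module rejects it.
# IP layer (what Tempesta FW effectively does from the kernel): further packets
# from the offender never reach the HTTP code at all, conceptually similar to:
iptables -I INPUT -s 203.0.113.7 -j DROP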
Set Up the Testbed
The last preparation step is to define and set up the testbed. The testbed configuration is crucial since it must emulate a real-life use case. Meantime, the testbed must be able to explicitly show the software's performance. I'll describe hardware and configuration issues with the testbed setup below.
In our case we use the testbed configuration depicted in the figure. There are 2 servers: one runs the Web server performance measurement tool and the second one runs the Web accelerator, Tempesta FW or Nginx in our tests. Each server has 8 CPU cores (Intel Xeon E5620, a very old processor) and the servers are linked with 10Gbps Intel network adapters. The Nginx backend runs on port 8080. Tempesta FW uses port 80 and Nginx as a Web accelerator uses port 9000.
Note that we run the Nginx backend (upstream) on the same host as the analyzed Tempesta FW or Nginx Web accelerator. While a 3rd machine running the backend server looks more practical, it's indistinguishable from running the backend on the same host in our test scenario, because we use single-page Web content which is cached by the Web accelerator on the first hit. Testing a Web cache requires large Web content, such that the Web accelerator constantly evicts some cache entries and fetches new ones from the backend server. That introduces more network I/O, probably disk I/O in the Nginx case and surely the cache eviction algorithm. This is a relatively complex topic which merits its own benchmark and independent analysis; we'll do this for Tempesta FW in the future. Meantime I want to share an interesting Nginx vs Varnish performance study from the BBC which discusses complex topics in Web accelerator performance.
Warm Caches
So far so good, let's start benchmarking. ApacheBench (ab) is the first tool which Wikipedia mentions, so let's give it a try. We can make the following simple run against Nginx as the first try (all irrelevant output like the tool banner is skipped for brevity):
# ab -n 10 http://192.168.100.100:9000/
Server Software: nginx/1.6.2
Server Hostname: 192.168.100.100
Server Port: 9000
Document Path: /
Document Length: 867 bytes
Concurrency Level: 1
Time taken for tests: 0.076 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 10990 bytes
HTML transferred: 8670 bytes
Requests per second: 131.40 [#/sec] (mean)
Time per request: 7.610 [ms] (mean)
Time per request: 7.610 [ms] (mean, across all concurrent requests)
Transfer rate: 141.03 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 1
Processing: 0 7 21.2 0 68
Waiting: 0 7 21.2 0 68
Total: 1 8 21.4 1 69
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 69
95% 69
98% 69
99% 69
100% 69 (longest request)
One request took 69ms while the others completed within 1ms - too big a difference; what's the reason? The answer is simple - the cache. To service the first request the Web accelerator must forward the request to the backend server, and when it receives the response, it sends the response to the client and caches it. All following requests are serviced directly from the cache. So if you don't test the cache itself, ensure that all caches are properly warmed up before starting the real benchmark. As mentioned before, we don't test the cache, so I send one request to the Web accelerator before each benchmark. Also keep in mind that cache entries can become stale, so I use proxy_cache_valid any 30d in the Nginx configuration to make sure that there is no cache invalidation during the tests. Actually, there are many different caches which can affect benchmark results; e.g. the CPU cache plays a significant role in Intel TSX benchmarks. So the rule of thumb is to make a test run before the actual benchmark to warm all the caches.
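The warm-up request itself can be as trivial as a single fetch with any HTTP client right before the run, for example (the URL matches the Nginx proxy used above; any client would do):
# warm the Web accelerator cache before the benchmark starts
curl -s -o /dev/null http://192.168.100.100:9000/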
Performance of Performance Measurement Tool
Now the caches are warm and we can start the real measurement for concurrent sessions (several output lines are skipped as they're the same as in the previous run):
# ab -n 100000 -c 64 http://192.168.100.100:9000/
Concurrency Level: 64
Time taken for tests: 20.906 seconds
Complete requests: 100000
Failed requests: 0
Total transferred: 109900000 bytes
HTML transferred: 86700000 bytes
Requests per second: 4783.21 [#/sec] (mean)
Time per request: 13.380 [ms] (mean)
Time per request: 0.209 [ms] (mean, across all concurrent requests)
Transfer rate: 5133.54 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 6 0.7 6 10
Processing: 2 8 1.2 7 14
Waiting: 2 6 1.3 5 12
Total: 7 13 0.8 13 19
Percentage of the requests served within a certain time (ms)
50% 13
66% 14
75% 14
80% 14
90% 14
95% 15
98% 15
99% 15
100% 19 (longest request)
Only 4783 RPS for Nginx on an 8-core machine? Yes, and that's just because ab can't produce more load; it's essentially the bottleneck of our installation. If you run top -b -p `pidof ab` -n 10 on the load generation machine, you'll typically find something like:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9120 root 20 0 34416 5984 3892 R 99.3 0.3 0:11.57 ab
The problem is that ab is unable to use more than one CPU core. If you need more workload, you have to run many ab instances in parallel and merge their results somehow.
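For completeness, here is a rough sketch of such a workaround: pin one ab instance per CPU core and sum the per-instance RPS afterwards (the core count, request numbers and log paths are just examples):
for i in $(seq 0 7); do
    taskset -c $i ab -n 100000 -c 8 http://192.168.100.100:9000/ > /tmp/ab.$i.log &
done
wait
grep 'Requests per second' /tmp/ab.*.log | awk '{sum += $4} END {print sum, "RPS total"}'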
So it makes sense to try another popular HTTP benchmarking tool, wrk, which is able to generate load from many threads. Running wrk with 8 threads (equal to the number of CPUs) gives us much better results:
# ./wrk -t 8 -c 64 -d 30s http://192.168.100.100:9000/
Running 30s test @ http://192.168.100.100:9000/
8 threads and 64 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.66ms 10.93ms 285.48ms 99.14%
Req/Sec 2.09k 248.31 2.83k 90.03%
498152 requests in 30.02s, 524.48MB read
Requests/sec: 16594.13
Transfer/sec: 17.47MB
Tune the Reference System
17KRPS isn't such an impressive result for Nginx on an 8-core system, right? So the next step is to tune the operating system and the reference program to get more performance, so that our performance comparison becomes fair. There are many resources on the Internet about Linux and Nginx performance tuning, e.g. Tuning NGINX for Performance. Also this Linux Web Server Performance Benchmark can give you an idea of how to properly set up the system and which numbers you should expect from the benchmarks.
The previous results were for the default Debian Nginx 1.6.2 configuration with an almost 1KB index.html. So I downloaded Nginx 1.11.3, which introduces performance improvements, replaced the original index.html with "0123456789abcde" (16 bytes including the trailing '\n') and used the same sysctl settings as in the benchmark mentioned above:
sysctl -w fs.file-max=5000000
sysctl -w net.core.netdev_max_backlog=400000
sysctl -w net.core.optmem_max=10000000
sysctl -w net.core.rmem_default=10000000
sysctl -w net.core.rmem_max=10000000
sysctl -w net.core.somaxconn=100000
sysctl -w net.core.wmem_default=10000000
sysctl -w net.core.wmem_max=10000000
sysctl -w net.ipv4.conf.all.rp_filter=1
sysctl -w net.ipv4.conf.default.rp_filter=1
sysctl -w net.ipv4.ip_local_port_range='1024 65535'
sysctl -w net.ipv4.tcp_congestion_control=bic
sysctl -w net.ipv4.tcp_ecn=0
sysctl -w net.ipv4.tcp_max_syn_backlog=12000
sysctl -w net.ipv4.tcp_max_tw_buckets=2000000
sysctl -w net.ipv4.tcp_mem='30000000 30000000 30000000'
sysctl -w net.ipv4.tcp_rmem='30000000 30000000 30000000'
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.ipv4.tcp_syncookies=0
sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_wmem='30000000 30000000 30000000'
sysctl -w net.ipv4.tcp_tw_recycle=1
sysctl -w net.ipv4.tcp_tw_reuse=1
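Note that sysctl -w settings don't survive a reboot; for a reproducible testbed it may be worth saving the key=value pairs to a file and reloading them, e.g. (the path is just an example):
# reapply saved settings; `sysctl --system` also reads /etc/sysctl.d/*.conf
sysctl -p /etc/sysctl.d/90-benchmark.conf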
My nginx.conf is
# Don't use gzip module for small content files.
user www-data;
pid /run/nginx-proxy.pid;
worker_processes 8;
worker_rlimit_nofile 100000;
error_log /opt/nginx-1.11.3/logs/error-proxy.log crit;
events {
worker_connections 16384;
use epoll;
multi_accept on;
accept_mutex off;
}
http {
server {
listen 9000 backlog=131072 deferred reuseport fastopen=4096;
location / {
proxy_pass http://127.0.0.1:8080/;
}
}
access_log off;
proxy_cache_path /opt/nginx/cache keys_zone=one:10m;
proxy_cache one;
proxy_cache_valid any 30d;
# proxy_cache_valid 200 302 10m; # disabled: it would override the 30d validity above for 200/302
open_file_cache max=100000 inactive=30d;
open_file_cache_valid 30d;
open_file_cache_min_uses 2;
open_file_cache_errors on;
# `sendfile on` makes no sense for small files
etag off;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
keepalive_requests 100000000;
reset_timedout_connection on;
client_body_timeout 40;
send_timeout 10;
}
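Before rerunning the benchmark it's worth validating and explicitly (re)starting the tuned instance; a minimal sketch, assuming the 1.11.3 build lives under /opt/nginx-1.11.3 and the configuration above is saved as conf/nginx-proxy.conf (both paths are my assumptions):
# validate the configuration, then start the proxy instance
/opt/nginx-1.11.3/sbin/nginx -t -c /opt/nginx-1.11.3/conf/nginx-proxy.conf
/opt/nginx-1.11.3/sbin/nginx -c /opt/nginx-1.11.3/conf/nginx-proxy.conf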
So now we dramatically increase our result to almost 95KRPS for HTTP keep-alive connections:
# ./wrk -t 8 -c 64 -d 30s http://192.168.100.100:9000/
Running 30s test @ http://192.168.100.100:9000/
8 threads and 64 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 674.67us 632.24us 26.02ms 97.67%
Req/Sec 11.83k 1.09k 20.05k 57.67%
2830565 requests in 30.10s, 680.26MB read
Requests/sec: 94793.24
Transfer/sec: 22.60MB
And 17KRPS for "Connection: close" requests:
# ./wrk -t 8 -c 64 -d 30s --header 'Connection: close' http://192.168.100.100:9000/
Running 30s test @ http://192.168.100.100:9000/
8 threads and 64 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.27ms 2.26ms 26.19ms 71.41%
Req/Sec 2.16k 105.14 2.56k 69.83%
516962 requests in 30.03s, 121.77MB read
Requests/sec: 17215.52
Transfer/sec: 4.06MB
Compare Performance Measurements
So now we have data about the performance of the reference system, and it's time to measure the performance of our program. Comprehensive performance benchmarks, covering all the mentioned test cases, are complex and too large. So I use only 3 of them, probably the most notable. All the tests are run for 1, 8, 64, 512, 2048, 8192 and 32768 connections and use the same 16-byte index.html.
I used the following configuration file for Tempesta FW:
listen 192.168.100.100:80;
server 127.0.0.1:8080 conns_n=8;
cache 1;
cache_fulfill * *;
The first test uses keep-alive HTTP connections and sends the smallest possible GET requests:
./wrk -t $T -c $C -d 30s http://192.168.100.100:$P/
Here $C is the number of connections, from 1 to 32,768; $T is the number of threads, 1 for $C = 1 and 8 for larger values; and $P is the port, 9000 for Nginx and 80 for Tempesta FW.
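A possible driver for the whole series, just to make the parameterization explicit (a sketch; the thread selection follows the rule above):
P=9000   # 9000 for Nginx, 80 for Tempesta FW
for C in 1 8 64 512 2048 8192 32768; do
    T=$(( C == 1 ? 1 : 8 ))
    ./wrk -t $T -c $C -d 30s http://192.168.100.100:$P/
done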
The second test adds a 'Connection: close' header to the requests:
./wrk -t $T -c $C -d 30s --header 'Connection: close' http://192.168.100.100:$P/
And the last test mimics real-life requests with a large Cookie; keep-alive connections are used:
./wrk -t $T -c $C -d 30s --header 'Connection: keep-alive' --header 'Upgrade-Insecure-Requests: 1' --header 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' --header 'Accept-Encoding: gzip, deflate, sdch' --header 'Accept-Language: en-US,en;q=0.8,ru;q=0.6' --header 'Cookie: a=sdfasd; sdf=3242u389erfhhs; djcnjhe=sdfsdafsdjfb324te1267dd; sdaf=mo2u8943478t67437461746rfdgfcdc; ityu=9u489573484duifhd; GTYFT=nsdjhcbyq3te76ewgfcZ; uityut=23Y746756247856425784657; GA=URHUFVHHVSDNFDHGYSDGF; a=%45345%dfdfg%4656%4534sdfjhsdb.sdfsg.sdfgsf.; nsdjhfb=4358345y; jkbsdff=aaaa; aa=4583478; aaaaa=34435345; rrr=iy7t67t6tsdf; ggg=234i5y24785y78ry534785; mmm=23uy47fbhdsfbgh; bsdfhbhfgdqqwew=883476757%345345; jksdfb=2348y; ndfsgsfdg=235trHHVGHFGC; a=sdfasd; sdf=3242u389erfhhs; djcnjhe=sdfsdafsdjfb324te1267dd; sdaf=mo2u8943478t67437461746rfdgfcdc; ityu=9u489573484duifhd; GTYFT=nsdjhcbyq3te76ewgfcZ; uityut=23Y746756247856425784657; GA=URHUFVHHVSDNFDHGYSDGF; a=%45345%dfdfg%4656%4534sdfjhsdb.sdfsg.sdfgsf.; nsdjhfb=4358345y; jkbsdff=aaaa; aa=4583478; aaaaa=34435345; rrr=iy7t67t6tsdf; ggg=234i5y24785y78ry534785; mmm=23uy47fbhdsfbgh; bsdfhbhfgdqqwew=883476757%345345; jksdfb=2348y; ndfsgsfdg=235trHHVGHFGC; erertrt=3242342343423324234; ggggggggg=8888888888888888888888888888888888888888888888888888888888888788' http://192.168.100.100:$P/
The plots show that we beat the user-space Web accelerator by at most 2 times and only about 1.5 times on average. Does it mean that we did a poor job and didn't reach our targets? Not at all. Let's understand the results and figure out the next steps.
Understand the Results
As the first step, let's look at the top output on the Nginx server. It shows
%Cpu(s): 28.0 us, 44.7 sy, 0.0 ni, 0.2 id, 0.0 wa, 0.0 hi, 27.0 si, 0.0 st
for small requests and keep-alive connections or
%Cpu(s): 35.7 us, 42.3 sy, 0.0 ni, 0.2 id, 0.0 wa, 0.0 hi, 21.8 si, 0.0 st
for real-life requests with keep-alive connections. In all the cases the traffic generating machine is idle for about 15-30%:
%Cpu(s): 5.3 us, 52.6 sy, 0.0 ni, 28.0 id, 0.0 wa, 0.1 hi, 13.9 si, 0.0 st
Meantime the Tempesta FW benchmarks behave very differently:
%Cpu(s): 0.0 us, 1.2 sy, 0.0 ni, 58.4 id, 0.0 wa, 0.0 hi, 40.3 si, 0.0 st
for real-life requests and
%Cpu(s): 0.0 us, 0.2 sy, 0.0 ni, 94.3 id, 0.0 wa, 0.0 hi, 5.4 si, 0.0 st
for small requests. Note that in the last case 94% of the CPU is idle, i.e. Tempesta FW uses only about 5% of the CPU to process almost 140KRPS! That's not bad at all. By the way, Tempesta FW works fully in softirq context, which is why the top output shows the CPU running in softirq only.
And now let's have a look at the traffic generating machine:
%Cpu(s): 6.0 us, 62.2 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 31.7 si, 0.0 st
So the traffic generator is the bottleneck again. This is why we couldn't show results better than 2 times faster than Nginx. So this time, to put more load onto our Web accelerator, we need much stronger hardware to generate more HTTP load. However, simply increasing the number of CPU cores doesn't change the results for small connection counts, like 1 or 8, because only a few CPU cores are able to generate the load. To show the best results for small connection counts we need a machine that is much faster in single-core loads.
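A per-core breakdown makes this kind of bottleneck obvious; for example, mpstat from the sysstat package on the load generator (a sketch, not part of the original runs):
# 5 one-second samples for every CPU: a single core pegged while others idle
# means the client tool, not the server, limits the result
mpstat -P ALL 1 5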
What to Improve
OK, with more powerful hardware to generate HTTP traffic we could show numbers 5-10 times better for keep-alive connections and small requests. However, our top results for real-life requests weren't so promising: we already ate about 40% of the CPU in softirq. So this is the point for improvements.
perf top (again, I refer you to the perf tutorial and Brendan Gregg's nice article) shows us the following profile:
13.67% [tempesta_fw] [k] tfw_http_parse_req
4.37% [kernel] [k] __memset
3.80% [tempesta_fw] [k] __str_grow_tree
3.52% [tempesta_fw] [k] tfw_http_msg_field_chunk_fixup
2.98% [kernel] [k] debug_lockdep_rcu_enabled
2.80% [kernel] [k] lock_acquire
2.67% [kernel] [k] lock_release
2.14% [kernel] [k] check_poison_obj
2.03% [tempesta_fw] [k] tfw_pool_realloc
So the bottleneck is the HTTP parser, actually a known issue.
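The same profile can also be captured for offline analysis and comparison between runs; a minimal sketch using the standard perf workflow (system-wide, with call graphs, while the benchmark is running):
perf record -a -g -- sleep 30    # sample all CPUs for 30 seconds
perf report                      # browse the aggregated profile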
Rate Limiting
I didn't measure the 4th case, blocking of rate-limited clients. While the test is very notable, it's not easy to do properly. The reason is that traditional Web server performance measurement tools like ab or wrk can't measure how many requests or connections were blocked, only how many of them were processed. To address the issue we're developing our own benchmark tool, tfw_bomber, which "bombs" a Web server with normal as well as malicious requests and is able to measure blocked connections and requests. I'll describe the tool in upcoming articles.
Program delays are mainly attributable to the application code (80%) and to kernel code (20%)
I used to look at what we thought was causing programs to run for a long time. Most of the time, it was reading or writing small amounts of data, or rereading/rewriting the same data. The first optimisation we did was to increase the buffer size and write out many records at once.
The second method was to use an interrupt sampler like the one proposed.
We would sample the code and record the address in the code when the sample was taken. Then we would map the samples back to the compiled object/source(s). That showed us immediately where in the code the samples were concentrated.
We also spent time on code locality (demand paging systems can't do everything), to put highly used code close together. Code inspection helped. Adding counters inside the code also helped.
What we knew is that the kernel developers would use the aforementioned tool, but the application programmer was left to his experience. Experience counts.
Hi Leslie,
thank you for sharing your experience!
I'd say it's rather application- and workload-dependent how much time a program spends in kernel or user space.
Also, you may find the splice(2) family of system calls useful - they give you zero-copy I/O. I wrote about them some time ago: http://natsys-lab.blogspot.ru/2011/12/linux-zero-copy-output.html.
I do agree. When I was testing our system software with kcachegrind, a profiling tool, we found that copying small memory chunks many times was taking a long time. We corrected it by copying larger chunks of memory fewer times. This resulted in a huge difference, nearly 50%.