
Sunday, September 11, 2016

How To Analyze Performance of Your Program

Is your program fast? This question seems easy to answer, and typically people run some benchmarks to find out. However, believe it or not, even experienced engineers often do the wrong things in performance analysis. There are many tutorials about profiling (e.g. for perf). There are also tutorials and presentations about Linux performance monitoring (e.g. Brendan Gregg's or mine, in Russian) and tuning (e.g. from RedHat). However, I haven't seen any consolidated tutorial on Linux server software performance analysis so far. So this article is an example of performance analysis of a program, Tempesta FW, intended as an introductory tutorial on Linux software performance analysis for software engineers.

Please feel free to write comments with additional references and good links about benchmarking, performance monitoring and profiling. Additions, corrections, etc. to make the "tutorial" more complete are very much appreciated.


Define the Reference and the Target

The first thing to do is to clearly define what "fast" means for your program. Probably you have business requirements like "the software must process at least 50,000 requests per second (RPS)". Or probably you reference some existing software, e.g. "be as fast as the competitor's solution".

Nginx is so cool and so widely used that we use it as the reference in our development. Actually, we could reference Varnish or Apache Traffic Server as well (and we do keep their design in mind), but they are very close to Nginx in performance, simply because they all are traditional user-space daemons. However, since Tempesta FW is a pure kernel solution, with its pros and cons, it a priori delivers much more performance than any traditional user-space solution (solutions based on user-space TCP/IP are a different story). So we defined our target as "be 10 times faster than a traditional user-space Web accelerator" and use Nginx as the example of a "traditional user-space Web accelerator" throughout the article.


Define the Measurements

The next step is to define the performance measurements. The truth is that only real production usage will show whether the software is fast enough. However, it's usually undesirable to deploy software in production without preliminary performance measurements. So we're doing preliminary, in other words synthetic, performance testing.

Why can't such tests usually show real performance? Because with a real workload you'll face very different use cases. For example, a Web server in a real deployment sees various DDoS attacks, users with old and/or RFC-incompatible Web browsers, very different robots from Web search engines and many, many other clients. Usually it's just impossible to emulate all the usage patterns in tests.

Thus, we define a very reduced set of test cases, or benchmarks, to analyze the performance of our software in particular use cases. For example, we can define the following benchmark scenarios:
  1. Analyze performance for 1, 8, 64, 512, 2048, 8192 and 32768 connections. This test case analyzes how well the program scales with an increasing number of connections. A small number of connections challenges single CPU core performance, while the largest numbers of connections challenge I/O multiplexing and probably the Web accelerator's and operating system's internal synchronization mechanisms.
  2. Test minimal requests like 'GET / HTTP/1.0\n\n' (simple HTTP flood) as well as normal browser requests. The software should be fast for normal users as well as resistant to application layer DDoS attacks;
  3. Test small (16 bytes), medium (3KB) and large (16KB) Web content files. Well, 16 bytes seems like an unrealistic payload size, but it efficiently tests networking I/O, HTTP processing and other logic. Meantime, very large payloads, like tens of megabytes or more, just test network throughput, minimizing the Web accelerator's role, so we don't use such large payloads. The interesting case is the medium size: 3KB plus some space for HTTP headers and metadata should fit into a 4KB system page - a very comfortable case for memory allocators.
  4. Test normal Web content processing and the efficiency of blocking rate-limited clients. The motivation for this test case is the same as for (2): be fast under normal conditions and DDoS resistant.
  5. Test keep-alive HTTP sessions as well as a new TCP connection for each request. A new TCP connection for each HTTP request puts good stress on network I/O, Web accelerator session initialization and so on, while keep-alive connections move the focus to HTTP processing.
One of our clients, a VPN vendor, knew their bottleneck - it was hard for them to scale a single heavyweight TCP connection across many CPU cores for encryption. So the analysis case for them was exactly to pass one TCP connection, as large as possible, through the VPN agent and process it as quickly as possible using all available CPU cores. So knowing, or assuming a priori, the bottlenecks of your software is very useful for defining good benchmark test cases.

Another example could be a business requirement. A typical requirement is to show performance results as high as possible, but without cheating (e.g. without using jumbo frames). Usually such a requirement comes from the marketing department, since those guys must show where the product outperforms competitors. In that case you choose whichever benchmark test is the most promising for your solution and do fair performance tests.

In our example, test No. 4, blocking rate-limited clients, is exactly this case. Any modern user-space Web accelerator runs rate limiting at the application layer: e.g. if Nginx rate limits a client, it still has to accept the TCP connection from the client, read and parse the request, and only after that do the limiting modules get control and block the request. Meantime, Tempesta FW directly passes blocked IP addresses to the IP layer, and all following requests from the malicious client are immediately blocked at the IP layer (you can find details in my presentation - it's in Russian, but the slides are in English). This is a unique feature where the solution shines, and the performance results will clearly show that.
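
As a rough illustration of the difference (this is not Tempesta FW's internal mechanism, and the set name and address below are made up), an IP-layer block for an abusive client can be done by hand with ipset/iptables, so that its packets are dropped before any HTTP processing happens at all:

    # create a hash set of blocked client IPs and drop their packets at the IP layer
    ipset create http_flooders hash:ip
    iptables -I INPUT -m set --match-set http_flooders src -j DROP
    # once a client is classified as malicious, one ipset entry blocks it cheaply
    ipset add http_flooders 192.168.100.50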


Setup the Testbed

The last preparation step is defining and setting up the testbed. The testbed configuration is crucial since it must emulate a real-life use case. Meantime, the testbed must be able to explicitly show the software's performance. I'll describe hardware and configuration issues with the testbed setup below.

In our case we're using the testbed configuration depicted in the figure. There are 2 servers: one to run the Web server performance measurement tool and the second one to run the Web accelerator, Tempesta FW or Nginx in our tests. Each server has 8 CPU cores (Intel Xeon E5620, a very old processor) and the servers are linked with 10Gbps Intel network adapters. The Nginx backend is running on port 8080. Tempesta FW uses port 80 and Nginx as a Web accelerator uses port 9000.
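
Before running benchmarks it's also worth double-checking that the link really negotiated 10Gbps and that the NICs don't drop packets under load. A small sketch (the interface name eth0 is an assumption, adjust it for your setup):

    # assumed interface name; adjust for your NIC
    ethtool eth0 | grep -E 'Speed|Duplex'   # expect 10000Mb/s, Full
    ethtool -l eth0                         # RX/TX queue (channel) configuration
    ethtool -S eth0 | grep -i drop          # look for growing drop counters during a run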

Note that we run the Nginx backend (upstream) on the same host as the analyzed Tempesta FW or Nginx Web accelerator. While a third machine running the backend server would look more realistic, it's indistinguishable from running the backend on the same host in our test scenario, because we use single-page Web content which is cached by the Web accelerator on the first hit. Testing the Web cache requires large Web content, such that the Web accelerator constantly evicts some cache entries and fetches new ones from the backend server. That introduces more network I/O, probably disk I/O in the Nginx case, and surely exercises the cache eviction algorithm. This is a relatively complex topic which merits its own benchmark and independent analysis; we'll do this for Tempesta FW in the future. Meantime I want to share an interesting Nginx vs Varnish performance study from the BBC which discusses complex topics in Web accelerator performance.


Warm Caches

So far so good, let's start benchmarking. ApacheBench (ab) is the first tool which Wikipedia mentions, so let's give it a try. We can make the following simple run against Nginx as the first try (all senseless output like the tool banner is skipped for brevity):

    # ab -n 10 http://192.168.100.100:9000/

    Server Software:        nginx/1.6.2
    Server Hostname:        192.168.100.100
    Server Port:            9000

    Document Path:          /
    Document Length:        867 bytes

    Concurrency Level:      1
    Time taken for tests:   0.076 seconds
    Complete requests:      10
    Failed requests:        0
    Total transferred:      10990 bytes
    HTML transferred:       8670 bytes
    Requests per second:    131.40 [#/sec] (mean)
    Time per request:       7.610 [ms] (mean)
    Time per request:       7.610 [ms] (mean, across all concurrent requests)
    Transfer rate:          141.03 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.2      0       1
    Processing:     0    7  21.2      0      68
    Waiting:        0    7  21.2      0      68
    Total:          1    8  21.4      1      69

    Percentage of the requests served within a certain time (ms)
      50%      1
      66%      1
      75%      1
      80%      1
      90%     69
      95%     69
      98%     69
      99%     69
     100%     69 (longest request)

One request took 69ms while the others are within 1ms - too big a difference; what's the reason? The answer is simple - the cache. To service the first request, the Web accelerator must forward the request to the backend server, and when it receives the response, it sends the response to the client and caches it. All following requests are serviced directly from the cache. So if you don't test caches, ensure that all the caches are properly warmed up before starting the real benchmark. As mentioned before, we don't test the cache, so I send one request to the Web accelerator before each benchmark. Also keep in mind that cache entries can become stale, so I use proxy_cache_valid any 30d in the Nginx configuration to make sure that there is no cache invalidation during the tests. Actually, there are many different caches which can affect benchmark results. E.g. the CPU cache plays a significant role in Intel TSX benchmarks. So the rule of thumb is to make a test run before the actual benchmark to warm all the caches.
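
For example, a minimal warm-up run against the Nginx proxy from this testbed could look like this (a sketch; any single request to the tested URL does the job):

    # one request to populate the Web accelerator cache before the benchmark
    curl -s -o /dev/null -w 'status=%{http_code} time=%{time_total}s\n' \
         http://192.168.100.100:9000/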


Performance of Performance Measurement Tool

Now the caches are warm and we can start the real measurement for concurrent sessions (several output lines are skipped as they're the same as for the previous run):

    # ab -n 100000 -c 64 http://192.168.100.100:9000/

    Concurrency Level:      64
    Time taken for tests:   20.906 seconds
    Complete requests:      100000
    Failed requests:        0
    Total transferred:      109900000 bytes
    HTML transferred:       86700000 bytes
    Requests per second:    4783.21 [#/sec] (mean)
    Time per request:       13.380 [ms] (mean)
    Time per request:       0.209 [ms] (mean, across all concurrent requests)
    Transfer rate:          5133.54 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        1    6   0.7      6      10
    Processing:     2    8   1.2      7      14
    Waiting:        2    6   1.3      5      12
    Total:          7   13   0.8     13      19

    Percentage of the requests served within a certain time (ms)
      50%     13
      66%     14
      75%     14
      80%     14
      90%     14
      95%     15
      98%     15
      99%     15
     100%     19 (longest request)

Only 4783 RPS for Nginx on an 8-core machine? Yes, and that's just because ab can't produce more load; it's essentially the bottleneck of our installation. If you run top -b -p `pidof ab` -n 10 on the load generation machine, then typically you find something like:

       PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     9120 root      20   0   34416   5984   3892 R  99.3  0.3   0:11.57 ab



The problem is that ab is unable to use more than one CPU core. If you need more workload, then you have to run many ab instances in parallel and merge their results somehow, for example as sketched below.
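
A rough sketch of such a parallel run, pinning one ab instance per core with taskset and summing the reported RPS (the log file paths and request counts here are arbitrary):

    # run one ab instance per CPU core and sum the per-instance RPS
    for cpu in $(seq 0 7); do
        taskset -c $cpu ab -n 100000 -c 8 \
            http://192.168.100.100:9000/ > /tmp/ab.$cpu.log &
    done
    wait
    grep 'Requests per second' /tmp/ab.*.log \
        | awk '{sum += $4} END {print "Total RPS:", sum}'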

So it makes sense to try another popular HTTP benchmarking tool, wrk, which is able to generate load in many threads. Running wrk with 8 threads (equal to the number of CPU cores) gives us much better results:

    # ./wrk -t 8 -c 64 -d 30s http://192.168.100.100:9000/
    Running 30s test @ http://192.168.100.100:9000/
      8 threads and 64 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     4.66ms   10.93ms 285.48ms   99.14%
        Req/Sec     2.09k   248.31     2.83k    90.03%
      498152 requests in 30.02s, 524.48MB read
    Requests/sec:  16594.13
    Transfer/sec:     17.47MB


Tune the Reference System

17KRPS isn't such an impressive result for Nginx on an 8-core system, right? So the next step is to tune the operating system and the reference program to get more performance, such that our performance comparison becomes fair. There are many resources on the Internet about Linux and Nginx performance tuning, e.g. Tuning NGINX for Performance. Also this Linux Web Server Performance Benchmark can give you an idea how to properly set up the system and which numbers you should expect from the benchmarks.

The previous results were for the default Debian Nginx 1.6.2 configuration with almost 1KB of index.html content. So I downloaded Nginx 1.11.3, which introduces performance improvements, replaced the original index.html with "0123456789abcde" (16 bytes including the trailing '\n') and used the same sysctl settings as in the benchmark mentioned above:

    sysctl -w fs.file-max=5000000
    sysctl -w net.core.netdev_max_backlog=400000
    sysctl -w net.core.optmem_max=10000000
    sysctl -w net.core.rmem_default=10000000
    sysctl -w net.core.rmem_max=10000000
    sysctl -w net.core.somaxconn=100000
    sysctl -w net.core.wmem_default=10000000
    sysctl -w net.core.wmem_max=10000000
    sysctl -w net.ipv4.conf.all.rp_filter=1
    sysctl -w net.ipv4.conf.default.rp_filter=1
    sysctl -w net.ipv4.ip_local_port_range='1024 65535'
    sysctl -w net.ipv4.tcp_congestion_control=bic
    sysctl -w net.ipv4.tcp_ecn=0
    sysctl -w net.ipv4.tcp_max_syn_backlog=12000
    sysctl -w net.ipv4.tcp_max_tw_buckets=2000000
    sysctl -w net.ipv4.tcp_mem='30000000 30000000 30000000'
    sysctl -w net.ipv4.tcp_rmem='30000000 30000000 30000000'
    sysctl -w net.ipv4.tcp_sack=1
    sysctl -w net.ipv4.tcp_syncookies=0
    sysctl -w net.ipv4.tcp_timestamps=1
    sysctl -w net.ipv4.tcp_wmem='30000000 30000000 30000000'
    sysctl -w net.ipv4.tcp_tw_recycle=1
    sysctl -w net.ipv4.tcp_tw_reuse=1


My nginx.conf is

    # Don't use gzip module for small content files.
    user www-data;
    pid /run/nginx-proxy.pid;
    worker_processes 8;
    worker_rlimit_nofile 100000;
    error_log /opt/nginx-1.11.3/logs/error-proxy.log crit;
    events {
        worker_connections 16384;
        use epoll;
        multi_accept on;
        accept_mutex off;
    }
    http {
        server {
            listen 9000 backlog=131072 deferred reuseport fastopen=4096;
            location / {
                proxy_pass http://127.0.0.1:8080/;
            }
        }
        access_log off;
        proxy_cache_path /opt/nginx/cache keys_zone=one:10m;
        proxy_cache one;
        proxy_cache_valid any 30d;
        proxy_cache_valid 200 302 10m;
        open_file_cache max=100000 inactive=30d;
        open_file_cache_valid 30d;
        open_file_cache_min_uses 2;
        open_file_cache_errors on;
        # `sendfile on` makes no sense for small files
        etag off;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
        keepalive_requests 100000000;
        reset_timedout_connection on;
        client_body_timeout 40;
        send_timeout 10;
    }


So now we have dramatically increased our results to almost 95KRPS for HTTP keep-alive connections:

    # ./wrk -t 8 -c 64 -d 30s http://192.168.100.100:9000/
    Running 30s test @ http://192.168.100.100:9000/
      8 threads and 64 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   674.67us  632.24us  26.02ms   97.67%
        Req/Sec    11.83k     1.09k   20.05k    57.67%
      2830565 requests in 30.10s, 680.26MB read
    Requests/sec:  94793.24
    Transfer/sec:     22.60MB


And 17KRPS for "Connection: close" requests:

    # ./wrk -t 8 -c 64 -d 30s --header 'Connection: close' http://192.168.100.100:9000/
    Running 30s test @ http://192.168.100.100:9000/
      8 threads and 64 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     3.27ms    2.26ms  26.19ms   71.41%
        Req/Sec     2.16k   105.14     2.56k    69.83%
      516962 requests in 30.03s, 121.77MB read
    Requests/sec:  17215.52
    Transfer/sec:      4.06MB



Compare Performance Measurements

So now we have data about the performance of the reference system, and it's time to measure the performance of our program. Comprehensive performance benchmarks covering all the mentioned test cases are complex and too large, so I use only 3 of them, probably the most notable ones. All the tests are run for 1, 8, 64, 512, 2048, 8192 and 32768 connections and use the same 16-byte index.html.

I used the following configuration file for Tempesta FW:

    listen 192.168.100.100:80;
    server 127.0.0.1:8080 conns_n=8;
    cache 1;
    cache_fulfill * *;

The first test uses keep-alive HTTP connections and sends the smallest possible GET requests:

    ./wrk -t $T -c $C -d 30s http://192.168.100.100:$P/

where $C is the number of connections, from 1 to 32,768; $T is the number of threads, 1 for $C = 1 and 8 for larger values; and $P is the port, 9000 for Nginx and 80 for Tempesta FW. A simple driver for this test matrix is sketched below.
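
For completeness, a hypothetical driver loop for the keep-alive test might look like this (the port and thread heuristics just encode the rules above):

    # run the keep-alive test for each connection count and print RPS
    P=9000    # 9000 for Nginx, 80 for Tempesta FW
    for C in 1 8 64 512 2048 8192 32768; do
        T=$(( C == 1 ? 1 : 8 ))
        ./wrk -t $T -c $C -d 30s http://192.168.100.100:$P/ \
            | awk -v c=$C '/Requests\/sec/ {print "connections=" c, "rps=" $2}'
    done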


The second test adds 'Connection: close' header to the requests:

    ./wrk -t $T -c $C -d 30s --header 'Connection: close' http://192.168.100.100:$P/



And the last test mimics real life requests with large Cookie, keep-alive connections are used:

    ./wrk -t $T -c $C -d 30s --header 'Connection: keep-alive' --header 'Upgrade-Insecure-Requests: 1' --header 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' --header 'Accept-Encoding: gzip, deflate, sdch' --header 'Accept-Language: en-US,en;q=0.8,ru;q=0.6' --header 'Cookie: a=sdfasd; sdf=3242u389erfhhs; djcnjhe=sdfsdafsdjfb324te1267dd; sdaf=mo2u8943478t67437461746rfdgfcdc; ityu=9u489573484duifhd; GTYFT=nsdjhcbyq3te76ewgfcZ; uityut=23Y746756247856425784657; GA=URHUFVHHVSDNFDHGYSDGF; a=%45345%dfdfg%4656%4534sdfjhsdb.sdfsg.sdfgsf.; nsdjhfb=4358345y; jkbsdff=aaaa; aa=4583478; aaaaa=34435345; rrr=iy7t67t6tsdf; ggg=234i5y24785y78ry534785; mmm=23uy47fbhdsfbgh; bsdfhbhfgdqqwew=883476757%345345; jksdfb=2348y; ndfsgsfdg=235trHHVGHFGC; a=sdfasd; sdf=3242u389erfhhs; djcnjhe=sdfsdafsdjfb324te1267dd; sdaf=mo2u8943478t67437461746rfdgfcdc; ityu=9u489573484duifhd; GTYFT=nsdjhcbyq3te76ewgfcZ; uityut=23Y746756247856425784657; GA=URHUFVHHVSDNFDHGYSDGF; a=%45345%dfdfg%4656%4534sdfjhsdb.sdfsg.sdfgsf.; nsdjhfb=4358345y; jkbsdff=aaaa; aa=4583478; aaaaa=34435345; rrr=iy7t67t6tsdf; ggg=234i5y24785y78ry534785; mmm=23uy47fbhdsfbgh; bsdfhbhfgdqqwew=883476757%345345; jksdfb=2348y; ndfsgsfdg=235trHHVGHFGC; erertrt=3242342343423324234; ggggggggg=8888888888888888888888888888888888888888888888888888888888888788' http://192.168.100.100:$P/


The plots show that we beat the user-space Web accelerator by at most 2 times, and only about 1.5 times on average. Does it mean that we did a poor job and didn't reach our targets? Not at all. Let's understand the results and figure out the next steps.


Understand the Results

As the first step, let's look at the top output on the Nginx server. It shows

    %Cpu(s): 28.0 us, 44.7 sy,  0.0 ni,  0.2 id,  0.0 wa,  0.0 hi, 27.0 si,  0.0 st

for small requests and keep-alive connections or

    %Cpu(s):  35.7 us, 42.3 sy,  0.0 ni,  0.2 id,  0.0 wa,  0.0 hi, 21.8 si,  0.0 st

for real-life requests with keep-alive connections. In all cases the traffic-generating machine is about 15-30% idle:

    %Cpu(s):  5.3 us, 52.6 sy,  0.0 ni, 28.0 id,  0.0 wa,  0.1 hi, 13.9 si,  0.0 st

Meantime Tempesta FW benchmarks behave very differently:

    %Cpu(s):  0.0 us,  1.2 sy,  0.0 ni, 58.4 id,  0.0 wa,  0.0 hi, 40.3 si,  0.0 st

for real life requests and

    %Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 94.3 id,  0.0 wa,  0.0 hi,  5.4 si,  0.0 st

for small requests. Note that in the last case 94% of the CPU is idle, i.e. Tempesta FW uses only about 5% of the CPU to process almost 140KRPS! That's not bad at all. By the way, Tempesta FW works fully in softirq context, which is why the top output shows the CPU running in softirq only.

And now let's have a look at traffic generating machine:

    %Cpu(s):  6.0 us, 62.2 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi, 31.7 si,  0.0 st

So now the traffic generator is the bottleneck again. This is why we could not show results better than 2 times faster than Nginx. So to put more load onto our Web accelerator, we need much stronger hardware to generate more HTTP load. However, simply increasing the number of CPU cores doesn't change the results for small connection counts, like 1 or 8, because only a few CPU cores are able to generate the load. To show the best results for small connection counts we need a machine with much better single-core performance.
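
One way to see that only a few cores on the traffic generator actually produce the load (a suggestion, not part of the original setup) is to watch per-CPU statistics during a run:

    # per-CPU utilization in 1-second intervals: look for cores pinned at ~100%
    # while others stay idle
    mpstat -P ALL 1 5
    # network softirq activity per CPU
    watch -n1 'grep -E "NET_RX|NET_TX" /proc/softirqs'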


What to Improve

Ok, having more powerful hardware to generate HTTP traffic, we can show numbers 5-10 times better for keep-alive connections and small requests. However, our top results for real-life requests weren't so promising: we ate almost 60% of the CPU. So this is the place for improvements.

perf top (again, I refer to the perf tutorial and Brendan Gregg's nice article) shows us the following profile:

    13.67%  [tempesta_fw]  [k] tfw_http_parse_req
     4.37%  [kernel]       [k] __memset
     3.80%  [tempesta_fw]  [k] __str_grow_tree
     3.52%  [tempesta_fw]  [k] tfw_http_msg_field_chunk_fixup
     2.98%  [kernel]       [k] debug_lockdep_rcu_enabled
     2.80%  [kernel]       [k] lock_acquire
     2.67%  [kernel]       [k] lock_release
     2.14%  [kernel]       [k] check_poison_obj
     2.03%  [tempesta_fw]  [k] tfw_pool_realloc



So the bottleneck is the HTTP parser, which is actually a known issue.
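
To dig further into a hot function like this, a typical next step (a generic perf sketch, not specific to Tempesta FW) is to record a call-graph profile under benchmark load and annotate the hot symbol:

    # system-wide call-graph profile for 30 seconds while the benchmark runs
    perf record -a -g -- sleep 30
    perf report --stdio | head -n 40
    # per-instruction costs inside the hottest function
    perf annotate -s tfw_http_parse_req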


Rate Limiting

I didn't measure the 4th case, blocking of rate-limited clients. While the test is very notable, it's not easy to do properly. The reason is that traditional Web server performance measurement tools like ab or wrk can't measure how many requests or connections were blocked, only how many of them were processed. To address the issue we're developing our own benchmark tool, tfw_bomber, which "bombs" the Web server with normal as well as malicious requests and is able to measure blocked connections and requests. I'll describe the tool in upcoming articles.

3 comments:

  1. Program delays are mainly attributable to the application code (80%) and to kernel code (20%)

    I used to look at what we thought was causing programs to run for a long time. Most of the time, it was reading or writing small amounts of data or rereading/rewriting same. A first optimisation we did was to increase buffer size, and write out many records at one time.

    The second method was to use an interrupt sampler like the one proposed.

    We would sample the code and record the address in the code when the sample was taken. Then we would remap the code to the compiled object/source(s). That would show us immediately, where in the code the samples were most located.

    We also spent time on code locality (demand paging systems can't do everything), to put highly used code close together. Code inspection helped. Adding counters inside the code also helped.

    What we knew is that the kernel developers would use the aforementioned tool, but the application programmer was left to his experience. Experience counts.

    1. Hi Leslie,

      thank you for sharing your experience!

      I'd say it's kind of application and workload dependent how much time application spends in kernel or user space.

      Also probably you find useful splice(2) family system calls which give you zero-copy IO, I wrote about them some time ago: http://natsys-lab.blogspot.ru/2011/12/linux-zero-copy-output.html.

    2. I do agree. When I was testing our system software with kcachegrind, a profiling tool, we found that copying small memory chunks many times was taking a long time. It was corrected by copying longer chunks of memory fewer times. This resulted in a huge difference, nearly 50%.

