High Performance Linux: 2018

It may seem easy to proxy HTTP requests - after all, we just receive an HTTP request, queue it for retransmission, send it to a backend server, and do the same with the HTTP response when we get it from the server. However, things aren't so simple in modern HTTP proxies. In this article I'm going to address several interesting problems in HTTP/1.1 proxying. I'll mostly concentrate on HTTP reverse proxies, also known as web accelerators.

HTTP reverse proxying

First of all, let's see what an HTTP reverse proxy is and how it works internally. Besides web acceleration - i.e. caching web content - reverse proxies do a lot of other things:

Load balancing among several servers, sometimes with different performance characteristics. E.g. in the picture the 3rd server is larger and can handle more requests per second than the other 2, so it should get more requests.
Automatic failover of failed servers. The second server in the picture fails, so the proxy must balance ingress requests among the remaining 2 servers. When the server returns to normal operation, the load must be balanced among all 3 servers again.
TLS termination. Since TLS is resource hungry, it makes sense to terminate TLS on the proxy, so that backend servers spend their resources on more useful application logic.
Protocol conversion. Clients may run different software, too outdated or too recent, and the proxy should convert their protocols to a form more suitable for the backend servers, e.g. it can downgrade HTTP/2 to HTTP/1.1 or upgrade HTTP/1.0 to HTTP/1.1.
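
The first two items can be illustrated with a small scheduler. This is only a sketch under my own assumptions: it uses the smooth weighted round-robin scheme, the weights 1, 1 and 2 are made up to mimic the larger 3rd server, and the alive flag stands in for whatever a real health checker would report.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    weight: int          # relative capacity of the server (assumed values)
    alive: bool = True   # would be flipped by a health checker
    current: int = 0     # running score used by the scheduler


class WeightedBalancer:
    """Smooth weighted round-robin over the servers that are alive."""

    def __init__(self, backends):
        self.backends = list(backends)

    def pick(self):
        alive = [b for b in self.backends if b.alive]
        if not alive:
            raise RuntimeError("no backend servers available")
        total = sum(b.weight for b in alive)
        # Every live server earns its weight, the leader pays the total.
        for b in alive:
            b.current += b.weight
        best = max(alive, key=lambda b: b.current)
        best.current -= total
        return best.name


if __name__ == "__main__":
    lb = WeightedBalancer([Backend("s1", 1), Backend("s2", 1), Backend("s3", 2)])
    print([lb.pick() for _ in range(4)])  # prints ['s3', 's1', 's2', 's3']
```

Marking a server as not alive removes it from the rotation, and setting the flag back returns it, which is the failover behaviour described in the second item.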

HTTP reverse proxies can also do many other things, such as DDoS mitigation, web security, and request and content modification (SSI or ESI), but in this article I'm going to focus only on basic HTTP proxying issues. Several interesting questions arise immediately when we just want to pass a request to some server and forward the corresponding response back to a client:

How many connections should a proxy establish with each backend server?
Sometimes backend servers close connections (by default Nginx and Apache HTTPD close a connection after it has served 100 requests). When this happens, how should a proxy handle the closed connection? Obviously, it should do something cheaper than in the case of a server failure.
Since we pass HTTP messages from a client socket to a backend server socket and vice versa, there should be message queues and these queues must be properly managed with connection failover in mind.
Is it safe to resend an HTTP request to another backend if the current backend cannot properly answer the request?
When you pass data from one socket (e.g. client) to another (e.g. server), there are data copies and high lock contention. The problem is especially crucial for TLS encrypted data and HTTP/2.

The issues listed above are the topic for the article.

Performance of HTTP proxying

Let's start with a small test of a web server. I take Nginx 1.10.3 running in a Debian 9 VM with 2 virtual CPUs and 2GB of RAM on my laptop (Intel i7-6500U). For workload generation I'll use wrk. This is a toy test environment, so the numbers in the article only make sense relative to each other.

Let's first get numbers for the raw performance of a web server hosting only one small index.html file of 3 bytes in size (hereafter I make 3 test runs and take the best result):

    # ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:9090/
    Running 30s test @ http://192.168.100.4:9090/
      8 threads and 4096 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   129.32ms 208.24ms   1.99s    90.28%
        Req/Sec     7.86k     1.67k   13.88k    75.49%
      1877247 requests in 30.10s, 424.30MB read
    Socket errors: connect 0, read 0, write 0, timeout 1374
    Requests/sec: 62368.01
    Transfer/sec:     14.10MB

There are 62K HTTP RPS with no HTTP errors. The Nginx configuration is

    worker_processes auto;
    worker_cpu_affinity auto;
    events {
        worker_connections 65536;
        use                epoll;
        multi_accept       on;
        accept_mutex       off;
    }
    worker_rlimit_nofile   1000000;
    http {
        keepalive_timeout 600;
        keepalive_requests 10000000;
        sendfile         on;
        tcp_nopush       on;
        tcp_nodelay        on;
        open_file_cache    max=1000 inactive=3600s;
        open_file_cache_valid 3600s;
        open_file_cache_min_uses 2;
        open_file_cache_errors off;
        error_log          /dev/null emerg;
        access_log         off;
        server {
            listen 9090 backlog=131072 deferred reuseport fastopen=4096;
            location / { root /var/www/html; }
        }
    }

I didn't do any special sysctl settings since all the tests were running with the same OS settings and I didn't care much about absolute numbers in the tests. I just made basic performance tuning settings and switched off logging to remove the slow filesystem writing from the discussion.

Next let's run a proxy in front of the web server. The same Nginx in the same VM is used, but with a different configuration. Note that I switched off the web cache to learn how much overhead HTTP proxying incurs.

    worker_processes auto;
    worker_cpu_affinity auto;
    events {
        worker_connections 65536;
        use    epoll;
        multi_accept on;
        accept_mutex       off;
    }
    worker_rlimit_nofile   1000000;
    http {
        sendfile off; # too small file
        tcp_nopush on;
        tcp_nodelay        on;
        keepalive_timeout 600;
        keepalive_requests 1000000;
        access_log         off;
        error_log /dev/null emerg;
        gzip               off;
        upstream u {
            server 127.0.0.1:9090;
            keepalive 4096;
        }
        server {
            listen 9000 backlog=131072 deferred reuseport fastopen=4096;
            location / {
                proxy_pass http://u;
                proxy_http_version 1.1;
                proxy_set_header Connection "";
            }
        }
        proxy_cache off;
    }

Update: Thanks to Maxim Dounin, a lead Nginx developer, for pointing out the keepalive configuration option - without it Nginx performs about two times worse.

And let's see how many RPS we get in this configuration:

    # ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:9000/
    Running 30s test @ http://192.168.100.4:9000/
      8 threads and 4096 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   174.78ms 141.82ms   1.99s    87.09%
        Req/Sec     3.01k     0.95k    8.28k    70.62%
      718994 requests in 30.09s, 162.51MB read
    Socket errors: connect 0, read 0, write 0, timeout 229
    Requests/sec: 23894.05
    Transfer/sec:      5.40MB

Let's also try HAProxy and Tempesta FW, which usually deliver more performance for HTTP proxying. I tried HAProxy version 1.7.5 and Tempesta FW 0.5.0. The configuration for HAProxy is:

    global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        user haproxy
        group haproxy
        daemon
        maxconn 65536
        nbproc 2
        cpu-map 1 0
        cpu-map 2 1
    defaults
        log     global
        mode    http
        http-reuse always
        no log
        timeout connect 5000
        timeout client 50000
        timeout server 50000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http
    frontend test
        bind    :7000
        mode    http
        maxconn 65536
        default_backend nginx
    backend nginx
        mode    http
        balance static-rr
        server be1 127.0.0.1:9090

And its results are:

    # ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:7000/
    Running 30s test @ http://192.168.100.4:7000/
      8 threads and 4096 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   171.75ms 126.96ms   1.93s    81.56%
        Req/Sec     3.01k     0.88k   10.30k    73.52%
      719739 requests in 30.08s, 146.20MB read
    Socket errors: connect 0, read 0, write 0, timeout 514
    Requests/sec: 23925.79
    Transfer/sec:      4.86MB

The Tempesta FW configuration is just (note conns_n parameter):

    listen 192.168.100.4:80;
    server 127.0.0.1:9090 conns_n=128;
    cache 0;

While Tempesta FW shows the best results, they are still about 2 times worse than with no proxying at all:

    # ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:80/
    Running 30s test @ http://192.168.100.4:80/
      8 threads and 4096 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   146.89ms 170.77ms   2.00s    88.21%
        Req/Sec     4.05k   565.96     7.32k    77.58%
      967299 requests in 30.08s, 252.76MB read
    Socket errors: connect 0, read 0, write 0, timeout 19
    Requests/sec: 32157.20
    Transfer/sec:      8.40MB

Thus, HTTP proxying without a cache is very expensive: throughput drops by almost half even in the best case!

Besides web acceleration, HTTP proxying is also required for load balancing at the HTTP layer (e.g. using persistent HTTP sessions) and for WAFs (Web Application Firewalls), so the performance degradation is significant in many cases.

With a VM of only 2 CPUs, two Nginx instances with automatically spawned worker processes give 4 Nginx worker processes in total. I also ran the tests with worker_processes 1 to have a one-to-one mapping between CPUs and worker processes, but the results were a bit worse than these.

A very small static file is used in the tests, which makes the network and HTTP processing overhead more pronounced. Of course, if you run the tests with large static files or heavy dynamic logic, you won't see such dramatic differences in the numbers. For example, I ran the tests for a 64KB index.html with sendfile switched on at the proxy and the proxy overhead was just about 3%.

Thus, if you have a significant working set of small files which doesn't fit into your web cache, then a web accelerator may badly hurt the performance of your installation. Always analyze the access patterns to your web content or, better, run performance tests.

Backend connections

The first issue is about backend server connections. In most cases modern HTTP proxies use the following simple algorithm:

1. Establish a TCP connection with a backend server.
2. Send an HTTP request to the connection. Now the connection is in the busy state.
3. If a new request arrives while all connections are busy, establish a new TCP connection with the server and do step (2) for the new connection.
4. When an HTTP response arrives from the server, forward it to the client and mark the TCP connection as free. Now upcoming requests can be sent through the connection.
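
The algorithm can be modelled in a few lines. This is a toy model of my own, not code from any of the proxies discussed here: connections are plain integers and nothing is actually sent, but it shows how the number of backend connections follows the peak request concurrency.

```python
class OnDemandPool:
    """Backend connections opened on demand, one in-flight request each."""

    def __init__(self):
        self.free = []      # idle connection ids
        self.busy = set()   # connections waiting for a response
        self.opened = 0     # total connections ever established

    def send_request(self):
        # Steps (2) and (3): reuse a free connection or establish a new one.
        if self.free:
            conn = self.free.pop()
        else:
            conn = self.opened
            self.opened += 1
        self.busy.add(conn)
        return conn

    def on_response(self, conn):
        # Step (4): the response arrived, so the connection is free again.
        self.busy.remove(conn)
        self.free.append(conn)


if __name__ == "__main__":
    pool = OnDemandPool()
    in_flight = [pool.send_request() for _ in range(4096)]
    print(pool.opened)  # prints 4096
```

With 4096 requests in flight at once the model opens 4096 connections, and it never opens more as long as later bursts are not larger than the first one.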

Under heavy load there are thousands of client connections concurrently sending HTTP requests, so an HTTP proxy typically also establishes thousands of connections to a backend server. For example, in the test above with 4096 client connections HAProxy establishes more than 3 thousand connections with the backend server (even though I used http-reuse always to reuse backend server connections as much as possible):

    # ss -npto state established '( dport = :9090 )'|wc -l
    3383

You may have noticed that this is very close to the number of connections from wrk. I'll explain this when I discuss HTTP pipelining, but for now I want to emphasize that an HTTP proxy needs almost the same number of connections with a backend server as it has with all the clients. It's worth mentioning that this holds only for very aggressive clients which send a lot of requests, e.g. DDoS bots. The consequence is that traditional HTTP accelerators aren't suitable for DDoS mitigation, since the protected backend servers can get the same number of connections as the proxies.

Nginx since 1.11.5 supports the max_conns option for the server directive to limit the number of backend connections (so my Nginx 1.10.3 from the Debian 9 packages doesn't have the option). HAProxy also supports the maxconn option for backend servers. In the same way, Tempesta FW provides the conns_n option.

Actually, depending on the particular hardware and type of workload, web servers have an optimal connection concurrency level. For example, Tempesta FW reaches a peak of 1.8M HTTP RPS on a 4-core machine with 1024 concurrent connections. So starting from a single backend server connection at step (1) and establishing too many connections at step (3) introduces more latency for the "unhappy" requests, those that require a new TCP connection, and reduces overall request processing performance.

Persistent connections

Since establishing a new TCP connection at step (3) adds unwanted latency to the request processing time, it makes sense to keep persistent TCP connections with a backend server. If an HTTP proxy keeps a pool of persistent connections with a backend server, then it's always ready for a sudden spike of ingress client requests, e.g. due to a flash crowd or a DDoS attack. This is what Tempesta FW does with conns_n: if you specify, for example, conns_n=128, then Tempesta FW keeps exactly 128 established connections to the backend server. (By the way, you can specify a different conns_n value for each backend server.)

HTTP basically manages the persistency of connections with two headers (keep-alive connections are defined in RFC 2068), for example:

    Connection: keep-alive
    Keep-Alive: timeout=5, max=10000

Here the connection idle timeout is 5 seconds and at most 10 thousand requests may be passed over the connection. Note that a client can only request a persistent connection with Connection: keep-alive, while the Keep-Alive parameters of the connection are determined by the server.
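
A proxy can use the Keep-Alive parameters to predict when a backend connection is going to be closed. The helper names below are hypothetical and the parser is deliberately tolerant: it only understands integer parameters and silently skips everything else.

```python
def parse_keep_alive(value):
    """Parse a Keep-Alive header value such as 'timeout=5, max=10000'."""
    params = {}
    for item in value.split(","):
        name, sep, raw = item.strip().partition("=")
        # Keep only 'name=<integer>' pairs, ignore anything else.
        if sep and raw.strip().isdigit():
            params[name.strip().lower()] = int(raw)
    return params


def must_reconnect(params, idle_seconds, served_requests):
    """True if the server is expected to have closed the connection."""
    timeout = params.get("timeout")
    limit = params.get("max")
    return ((timeout is not None and idle_seconds >= timeout) or
            (limit is not None and served_requests >= limit))


if __name__ == "__main__":
    params = parse_keep_alive("timeout=5, max=10000")
    print(params)  # prints {'timeout': 5, 'max': 10000}
    print(must_reconnect(params, idle_seconds=6, served_requests=10))  # prints True
```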

TCP connections failovering

There are 3 cases when persistent connections with backend servers may fail:

If there is no workload for some time (e.g. if you didn't enable health monitoring of backend servers), the TCP or HTTP keepalive timer expires and the connection is closed.
Backend servers can close connections intentionally due to processing errors or default configuration (e.g. Nginx and Apache HTTPD close a client TCP connection after every 100th request). A server also closes the connection to mark the end of an HTTP response that has neither a Content-Length header nor chunked transfer encoding.
A backend server may also just go down due to maintenance or failure. This case is handled by a server health monitoring process, which catches service failures on different layers, e.g. a hard server reset as well as a web application failure when the web server still responds, but in a wrong way.

In all of these cases a proxy must reestablish TCP connection(s) with the backend server. But what if we have just sent a request to a backend server and the server fails? What should we do with the request? To provide a better service for clients the proxy can just resend the request to some other backend server. However, not all types of HTTP requests may be retransmitted. A later section dives deeper into the subject, but for now I'm going to dwell on the retransmission issue a bit more.

If we resend an HTTP request, then we have to limit the number of retransmissions for it. Consider a "killer" request which just crashes your backend web application: a proxy sends the request to a backend server, the server fails, the proxy resends the request to another server and that server goes down the same way, until all the servers in the backend cluster are down. To prevent such situations all (I hope) HTTP proxies limit the number of retransmissions of a request. For example, with Tempesta FW you can use server_forward_retries to set the limit (the default value is 5).

The next question is for how long a proxy should keep a request in its internal queues in the hope of getting a successful response for it. After all, a client waiting too long for a web page to render considers the service down at some point. So again all (I hope) HTTP proxies provide a time limit for request retransmissions. In the case of Tempesta FW server_forward_timeout does this.
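
Both limits fit into one small retry loop. The sketch below is synchronous and purely illustrative: send, BackendFailure and RequestDropped are names I made up, max_retries plays the role of server_forward_retries, timeout plays the role of server_forward_timeout, and the 60 second value is an arbitrary assumption rather than a documented default.

```python
import time


class BackendFailure(Exception):
    """Raised by the (hypothetical) send callback when a backend fails."""


class RequestDropped(Exception):
    """The retry or time budget of a request is exhausted."""


def forward_with_limits(request, backends, send, max_retries=5,
                        timeout=60.0, clock=time.monotonic):
    """Try backends in turn until one answers or a limit is hit."""
    deadline = clock() + timeout
    failures = 0
    while True:
        # Each retransmission goes to the next server in the list.
        backend = backends[failures % len(backends)]
        try:
            return send(backend, request)
        except BackendFailure:
            failures += 1
            if failures > max_retries:
                raise RequestDropped("retry limit reached") from None
            if clock() >= deadline:
                raise RequestDropped("request is too old") from None
```

With the default max_retries=5 a killer request is sent at most 6 times (the original transmission plus 5 retransmissions), so it cannot take down a cluster of more than 6 servers.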

The requirement to be able to resend a request introduces sk_buff copying. struct sk_buff is the Linux kernel descriptor of data being sent through the TCP/IP stack, and the TCP acknowledgement and retransmission mechanisms extensively update the descriptor. Since we may need to resend a request through some other TCP connection, we have to copy the sk_buff before its transmission through the TCP/IP stack. The problem is that the descriptor is relatively large, several hundred bytes in size. Originally Tempesta FW network I/O was designed to be zero-copy, but connection failover doesn't allow us to avoid copies completely. The design of Tempesta FW network I/O is described in my post What's Wrong With Sockets Performance And How to Fix It.

HTTP pipelining

HTTP requests can be pipelined, i.e. sent in a row without waiting for the response to each of them separately. Tempesta FW is one of the few HTTP proxies (the two others are Squid and Polipo) which can pipeline HTTP requests in backend server connections. All other HTTP proxies, including Nginx and HAProxy, are unable to use HTTP pipelining and wait for the response to each request before sending the next one over the same connection. I.e. if you configure your HTTP proxy, e.g. Nginx or HAProxy, to establish, say, at most 100 connections with a backend server, then only 100 requests can be sent concurrently to the backend server.

This is why we saw 3 thousand open connections between HAProxy and Nginx - the proxy needs so many connections to achieve the concurrency necessary to process ingress workload.

To analyze the performance impact of HTTP pipelining let's reconfigure Tempesta FW to use only one connection to the backend server:

    server 127.0.0.1:9090 conns_n=1;

And start the benchmark:

    # ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:80/
    Running 30s test @ http://192.168.100.4:80
      8 threads and 4096 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   162.20ms 197.03ms   1.99s    87.05%
        Req/Sec     3.73k     0.91k   12.67k    77.56%
      888501 requests in 30.06s, 232.19MB read
    Socket errors: connect 0, read 0, write 0, timeout 59
    Requests/sec: 29556.83
    Transfer/sec:      7.72MB

We see slightly lower performance due to the lower number of backend server connections, but the degradation is quite small. To see how pipelining actually works we can use Tempesta FW's application performance monitoring:

    # while :; do \
        grep '0 queue size' /proc/tempesta/servers/default/127.0.0.1\:9090; \
        sleep 1; \
    done
        Connection 000 queue size   : 97
        Connection 000 queue size   : 79
        Connection 000 queue size   : 163
        Connection 000 queue size   : 104
        Connection 000 queue size   : 326
        Connection 000 queue size   : 251
        Connection 000 queue size   : 286
        Connection 000 queue size   : 312
        Connection 000 queue size   : 923
        Connection 000 queue size   : 250
        Connection 000 queue size   : 485
    ^C

The script shows the instantaneous number of requests queued for transmission through the backend server connection. As you can see, there are hundreds of HTTP requests in flight at each moment.

Now let's make HAProxy use only one connection with the backend server. To do so we need to change the following settings:

    global
        ...
        nbproc 1
        ...
    backend nginx
        mode    http
        balance static-rr
        fullconn 1
        server be1 127.0.0.1:9090 minconn 1 maxconn 1

Since it cannot use pipelining, the performance degradation is significant, about 4 times:

    # ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:7000/
    Running 30s test @ http://192.168.100.4:7000/
      8 threads and 4096 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   674.95ms   89.56ms 934.93ms   83.60%
        Req/Sec   759.42    547.99     2.96k    67.90%
      175041 requests in 30.10s, 35.56MB read
    Requests/sec:   5815.86
    Transfer/sec:      1.18MB

Idempotence of HTTP requests

While HTTP pipelining is good, not all HTTP requests can be pipelined. It's actually not so easy to implement HTTP pipelining correctly.

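RFC 7231 defines safe methods (section 4.2.1) as methods that do not change the server state: GET, HEAD, OPTIONS and TRACE. Idempotent methods (section 4.2.2) are the safe methods plus PUT and DELETE, i.e. methods for which repeating a request has the same effect as sending it once. In what follows only the safe methods are considered freely pipelinable. Consider that we send two requests, one of which is non-safe (strictly speaking, both of the requests are non-idempotent), to a server in a pipeline: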

    GET /forum?post="new%20post%20content" HTTP/1.1
    Host: foo.com
    \r\n
    \r\n
    POST /forum?post HTTP/1.1
    Host: foo.com
    Content-Length: 16
    \r\n
    new post content
    \r\n
    \r\n

If the server connection terminates just after the transmission, we start the failover process and resend the requests to another connection or server. But can we resend the non-safe POST request? The problem is that we don't know whether the server processed the request and created a new post on the forum or not. If it did and we resend the request, we create the same post twice, which is undesirable. If we don't resend the request and just return an error code to the client, then our response may be false. Thus RFC 7230 6.3.1 requires that a proxy must not automatically retry non-idempotent requests.

Moreover, note that the first GET request actually invokes some dynamic logic which may also change the server state, just like the second POST request. The GET request is essentially non-idempotent, but it is the web application developer's responsibility to use the right request methods in their applications. Conversely, POST requests can be idempotent if, for example, an application developer uses them to send a web search query.

Since request idempotence depends on the particular web application, Tempesta FW provides a configuration option to define which request methods to which URIs are non-idempotent, e.g. to make the first GET request non-idempotent we can add the following option to the configuration file:

    nonidempotent GET prefix "/forum?post";
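
The decision logic behind such an option can be sketched as follows. This is my own illustration and not Tempesta FW code: IdempotenceRules and its methods are hypothetical names, only prefix matching is modelled, and by default only the safe methods are treated as idempotent.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE"}


class IdempotenceRules:
    """Method defaults plus per-application (method, URI prefix) overrides."""

    def __init__(self):
        self.nonidempotent = []  # list of (method, uri_prefix) pairs

    def add_nonidempotent(self, method, uri_prefix):
        self.nonidempotent.append((method.upper(), uri_prefix))

    def is_idempotent(self, method, uri):
        method = method.upper()
        # An explicit override wins over the method default.
        for rule_method, prefix in self.nonidempotent:
            if rule_method == method and uri.startswith(prefix):
                return False
        return method in SAFE_METHODS


if __name__ == "__main__":
    rules = IdempotenceRules()
    rules.add_nonidempotent("GET", "/forum?post")
    print(rules.is_idempotent("GET", '/forum?post="new%20post%20content"'))  # prints False
    print(rules.is_idempotent("GET", "/index.html"))  # prints True
```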

Pipeline queues

Let's consider pipelining of requests from 3 clients to 2 server connections. The first client sends a non-idempotent request (the large square marked as "NI" in the picture). Firstly, we keep all client requests in a per-client queue in order to forward server responses to a client in exactly the same order in which the client sent the corresponding requests. However, the queue is used only for ordering: when a request arrives, it's immediately processed by the load balancing logic ("LB" in the picture) and scheduled to some server queue. The non-idempotent request resides in a server queue just like an idempotent request, but we don't send other requests to the server connection until we receive a response to the non-idempotent request.

If you don't want HTTP pipelining at all you can set the server queue size to 1, i.e. only one request at a time will be queued:

    server_queue_size 1;

It's clear that in general the second server queue, which holds a non-idempotent request, is drained more slowly than the first one, so Tempesta FW's load balancing algorithm prefers server queues without non-idempotent requests and uses queues with such requests only if all others are too busy.
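
The server queue rules above can be modelled like this. It is a simplified sketch of my own, not the actual Tempesta FW scheduler: requests are dicts with an idempotent flag, "too busy" is interpreted as "the queue is full", and the queue size of 1000 is an arbitrary value.

```python
from collections import deque


class ServerQueue:
    """Requests scheduled to one backend connection."""

    def __init__(self, name, max_size=1000):
        self.name = name
        self.max_size = max_size   # 1 disables pipelining
        self.pending = deque()     # scheduled but not yet sent
        self.in_flight = deque()   # sent, waiting for responses

    def size(self):
        return len(self.pending) + len(self.in_flight)

    def has_nonidempotent(self):
        return any(not r["idempotent"]
                   for r in (*self.pending, *self.in_flight))

    def enqueue(self, request):
        self.pending.append(request)
        self._flush()

    def on_response(self):
        """Responses on one connection arrive in request order."""
        done = self.in_flight.popleft()
        self._flush()
        return done

    def _flush(self):
        # Pipeline pending requests, but never send anything behind a
        # non-idempotent request that is still waiting for its response.
        while self.pending:
            if self.in_flight and not self.in_flight[-1]["idempotent"]:
                break
            self.in_flight.append(self.pending.popleft())


def schedule(queues, request):
    """Prefer queues without non-idempotent requests, then shorter ones."""
    candidates = [q for q in queues if q.size() < q.max_size]
    if not candidates:
        raise RuntimeError("all server queues are full")
    best = min(candidates, key=lambda q: (q.has_nonidempotent(), q.size()))
    best.enqueue(request)
    return best.name
```

A queue holding a non-idempotent request still accepts new requests, but they stay in pending until the response to the non-idempotent request arrives, which is why the scheduler tries the other queues first.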

Pipelined messages retransmission

There could be a request-killer, crashing a web application, among pipelined requests. RFC 7230 6.3.2 says that after a connection is closed prematurely a client should not pipeline immediately after re-establishing the connection, since we don't know exactly which request is the killer. Tempesta FW does the same: if a server queue contains requests for retransmission, it doesn't schedule new requests to the queue until the last resent request has been answered.

Unless the server_retry_nonidempotent configuration option is specified, non-idempotent requests aren't resent and are just dropped. If we have idempotent requests before and after the non-idempotent one, then we can still resend them to a live server. The order of responses is preserved thanks to an error response, which is generated for the dropped non-idempotent request.

Since requests can be scheduled to different servers, the corresponding responses can arrive in a different order. When a server response arrives, it's linked with its request and the request is checked against the head of the client queue: if all the requests at the head of the queue have linked responses, then all these responses are sent at once (pipelined) to the client. For example, if we receive a response for the 1st client request while the 2nd and 3rd requests have already been answered, then the whole head of the client queue, i.e. the 3 responses for the first 3 requests, can be sent to the client in a pipeline.
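
The client queue logic is compact enough to show in full. Again this is a sketch with names of my own choosing: requests are identified by arbitrary ids and responses are plain strings.

```python
from collections import OrderedDict


class ClientQueue:
    """Keeps responses in the order in which the client sent requests."""

    def __init__(self):
        self.responses = OrderedDict()  # request id -> response or None

    def add_request(self, req_id):
        self.responses[req_id] = None

    def on_response(self, req_id, response):
        """Link a response and return everything that can be sent now."""
        self.responses[req_id] = response
        ready = []
        # Release the answered head of the queue, stop at the first gap.
        while self.responses:
            head_id = next(iter(self.responses))
            if self.responses[head_id] is None:
                break
            ready.append(self.responses.pop(head_id))
        return ready


if __name__ == "__main__":
    q = ClientQueue()
    for req_id in (1, 2, 3, 4):
        q.add_request(req_id)
    print(q.on_response(2, "r2"))  # prints []
    print(q.on_response(3, "r3"))  # prints []
    print(q.on_response(1, "r1"))  # prints ['r1', 'r2', 'r3']
```

An error response generated for a dropped non-idempotent request is stored in the queue like any other response, so it fills its slot and doesn't block the responses queued behind it.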

HTTP messages adjustment

HTTP proxies usually have to adjust the HTTP headers of forwarded messages, e.g. add a Via header or append the current IP to the X-Forwarded-For header. To do so we usually have to "rebuild" the headers from scratch: copy the original headers to some new memory location and add new headers and/or header values. Given that some HTTP headers, such as Cookie or User-Agent, as well as the URI can easily reach several kilobytes in size in modern web applications, the data copies are undesirable.

Thus, if we consider a user-space HTTP proxy, then typically we have at least 2 data copies:

Receive a request on the first CPU
Copy the request to user space
Update headers (2nd copy)
Copy the request to kernel space (can be eliminated if splice(2) is used - it seems only HAProxy is able to do this)
Send the request from the second CPU

Besides data copying, there is a problem with accessing sockets (TCP Control Blocks, TCBs) from different CPUs. As we saw above, modern HTTP proxies work with thousands of TCP connections while modern hardware has only tens of CPU cores, so each core handles hundreds of TCBs. So if we want to forward an HTTP request from a client socket to a server socket, we have to do at least one copy of the request data among different CPUs and touch TCBs on different CPUs. This is not a big deal for machines with a single processor package, but it is a problem for relatively large NUMA systems.

Linux kernel HTTP proxying

Tempesta FW is built into the Linux TCP/IP stack, so we can use the full power of zero-copy sk_buff fragments and per-CPU TCBs. Details of HTTP proxying in the Linux kernel can be found in my Netdev 2.1 talk Kernel HTTP/TCP/IP stack for HTTP DDoS mitigation.

The first problem, HTTP message transformation, is solved by HTTP message fragmentation: if we need to add, delete or update some data in the middle of an HTTP message, then we

create a new fragment pointer to the place where the new data must be inserted;
create a new fragment with the new data and place its pointer just before the pointer from the previous step.

Data deletion is handled by just moving the pointer to the tail fragment further, making a data gap between the first and the second fragments. An update is essentially a combination of deletion and addition.
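
The idea can be demonstrated in user space with memoryview slices, which point into the original buffer without copying it. This is only an analogy under my own assumptions: the real implementation works on sk_buff page fragments in the kernel, and FragmentedMessage is a hypothetical class.

```python
class FragmentedMessage:
    """An HTTP message kept as a list of views into existing buffers."""

    def __init__(self, data):
        self.frags = [memoryview(data)]

    def _split(self, offset):
        """Make a fragment boundary at offset, return the fragment index."""
        pos = 0
        for i, frag in enumerate(self.frags):
            if offset == pos:
                return i
            if offset < pos + len(frag):
                cut = offset - pos
                # Two views over the same memory, no data is copied.
                self.frags[i:i + 1] = [frag[:cut], frag[cut:]]
                return i + 1
            pos += len(frag)
        if offset == pos:
            return len(self.frags)
        raise IndexError("offset is beyond the end of the message")

    def insert(self, offset, data):
        # Only the new data gets a new fragment.
        self.frags.insert(self._split(offset), memoryview(data))

    def delete(self, offset, length):
        start = self._split(offset)
        end = self._split(offset + length)
        del self.frags[start:end]   # drop pointers, the data stays in place

    def to_bytes(self):
        # Linearization is needed only to print or compare the message.
        return b"".join(bytes(f) for f in self.frags)


if __name__ == "__main__":
    raw = b"GET / HTTP/1.1\r\nHost: foo.com\r\n\r\n"
    msg = FragmentedMessage(raw)
    msg.insert(raw.index(b"\r\n\r\n") + 2, b"Via: 1.1 proxy\r\n")
    print(msg.to_bytes())
```

After the insertion the message consists of 3 fragments: the original headers, the new Via header and the final CRLF, and the first and the last of them still point into the original buffer.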

To implement zero-copy HTTP message transformation we had to modify the sk_buff allocator to always use paged data.

To reduce the number and size of inter-CPU memory transfers we've introduced a per-CPU lock-free ring buffer for fast inter-CPU job transfer. Thanks to NIC RSS and the inter-CPU job transfer, TCBs are mostly accessed by the same CPU. If we need to forward a request processed on the first CPU through a TCB residing on the second CPU, we just put a job into the ring buffer of the second CPU and the softirq working on that CPU takes care of the actual transmission.

HTTP/2

You might wonder why I talk about HTTP/1.1 pipelining when there is HTTP/2, which provides much better request multiplexing and is free from the head-of-line blocking problem.

In the article I described HTTP/1.1 pipelining to backend servers only. It's harder to implement HTTP/2 in a zero-copy fashion (I'll address the problems in a later article). Meanwhile, the biggest advantage of the protocol shows up on global networks with low-speed connections and high delays, which is not the case for local networks with 10G links connecting an HTTP reverse proxy with backend servers. So the benefit of using HTTP/2 for backend connections is doubtful. By the way, neither HAProxy nor Nginx supports HTTP/2 for backend connections.
