HTTP reverse proxying
First, let's look at what an HTTP reverse proxy is and how it works internally. Besides web acceleration - i.e. caching web content - reverse proxies solve a number of problems:
- Load balancing among several servers, sometimes with different performance characteristics. E.g. in the picture we have a large 3rd server capable of handling more requests per second than the other 2, so that server should get more requests.
- Automatic failover of failed servers. The second server in the picture fails, so the proxy must load balance ingress requests between the 2 remaining servers. When the server returns to normal operation, the load must be balanced among all 3 servers again.
- Since TLS is resource hungry, it makes sense to terminate TLS on the proxy, so that backend servers can spend their resources on more useful application logic.
- There can be clients with different software, too outdated or too recent, and the proxy should convert between protocols to forms more suitable for the backend servers, e.g. it can downgrade HTTP/2 to HTTP/1.1 or upgrade HTTP/1.0 to HTTP/1.1.
- How many connections should a proxy establish with each backend server?
- Sometimes backend servers reset connections (by default, Nginx and Apache HTTPD reset a connection after it has served 100 requests). When this happens, how does a proxy handle the connection resets? Obviously, it should do something smarter than in the case of a server failure.
- Since we pass HTTP messages from a client socket to a backend server socket and vice versa, there must be message queues, and these queues must be managed properly with connection failover in mind.
- Is it safe to resend an HTTP request to another backend if the current backend cannot properly answer the request?
- When you pass data from one socket (e.g. client) to another (e.g. server), there are data copies and high lock contention. The problem is especially acute for TLS-encrypted data and HTTP/2.
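The load-balancing point above can be sketched with a simple static weighted round-robin scheduler: a more powerful server gets a proportionally larger share of requests. The server names and weights below are made up for illustration; real proxies use smoother weighted algorithms.

```python
from itertools import cycle

def build_schedule(servers):
    """Expand (name, weight) pairs into a static round-robin schedule.

    A server with weight 2 appears twice per cycle, so it receives
    twice as many requests per cycle as a weight-1 server.
    """
    slots = []
    for name, weight in servers:
        slots.extend([name] * weight)
    return cycle(slots)

# Hypothetical cluster: srv3 is twice as powerful as srv1 and srv2.
schedule = build_schedule([("srv1", 1), ("srv2", 1), ("srv3", 2)])
first_eight = [next(schedule) for _ in range(8)]
# srv3 gets half of all requests, srv1 and srv2 a quarter each.
```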
Performance of HTTP proxying
Let's start with a small test of a web server. I take Nginx 1.10.3 running in a Debian 9 VM with 2 virtual CPUs and 2GB of RAM on my laptop (Intel i7-6500U). For workload generation I'll use wrk. This is a toy test environment, so the numbers in the article only make sense in a relative way.
Let's start by getting numbers for the raw performance of a web server hosting only one small index.html file, 3 bytes in size (hereafter I make 3 test runs and take the best results):
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:9090/
Running 30s test @ http://192.168.100.4:9090/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 129.32ms 208.24ms 1.99s 90.28%
Req/Sec 7.86k 1.67k 13.88k 75.49%
1877247 requests in 30.10s, 424.30MB read
Socket errors: connect 0, read 0, write 0, timeout 1374
Requests/sec: 62368.01
Transfer/sec: 14.10MB
That's 62K HTTP RPS with no HTTP errors. The Nginx configuration is:
worker_processes auto;
worker_cpu_affinity auto;
events {
worker_connections 65536;
use epoll;
multi_accept on;
accept_mutex off;
}
worker_rlimit_nofile 1000000;
http {
keepalive_timeout 600;
keepalive_requests 10000000;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
open_file_cache max=1000 inactive=3600s;
open_file_cache_valid 3600s;
open_file_cache_min_uses 2;
open_file_cache_errors off;
error_log /dev/null emerg;
access_log off;
server {
listen 9090 backlog=131072 deferred reuseport fastopen=4096;
location / { root /var/www/html; }
}
}
I didn't apply any special sysctl settings since all the tests were run with the same OS settings and I didn't care much about absolute numbers. I just made basic performance tuning settings and switched off logging to take slow filesystem writes out of the discussion.
Next, let's run a proxy in front of the web server. The same Nginx in the same VM is used, but with a different configuration. Note that I switched off the web cache to learn how much overhead HTTP proxying incurs.
worker_processes auto;
worker_cpu_affinity auto;
events {
worker_connections 65536;
use epoll;
multi_accept on;
accept_mutex off;
}
worker_rlimit_nofile 1000000;
http {
sendfile off; # too small file
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 600;
keepalive_requests 1000000;
access_log off;
error_log /dev/null emerg;
gzip off;
upstream u {
server 127.0.0.1:9090;
keepalive 4096;
}
server {
listen 9000 backlog=131072 deferred reuseport fastopen=4096;
location / {
proxy_pass http://u;
proxy_http_version 1.1;
proxy_set_header Connection "";
}
}
proxy_cache off;
}
Update: thanks to Maxim Dounin, a lead Nginx developer, for pointing out the keepalive configuration option to me - without it Nginx shows half the performance.
And let's see how many RPS we get in this configuration:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:9000/
Running 30s test @ http://192.168.100.4:9000/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 174.78ms 141.82ms 1.99s 87.09%
Req/Sec 3.01k 0.95k 8.28k 70.62%
718994 requests in 30.09s, 162.51MB read
Socket errors: connect 0, read 0, write 0, timeout 229
Requests/sec: 23894.05
Transfer/sec: 5.40MB
Let's also try HAProxy and Tempesta FW, which usually deliver more performance for HTTP proxying. I tried HAProxy version 1.7.5 and Tempesta FW 0.5.0. The configuration for HAProxy is:
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
user haproxy
group haproxy
daemon
maxconn 65536
nbproc 2
cpu-map 1 0
cpu-map 2 1
defaults
log global
mode http
http-reuse always
no log
timeout connect 5000
timeout client 50000
timeout server 50000
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http
frontend test
bind :7000
mode http
maxconn 65536
default_backend nginx
backend nginx
mode http
balance static-rr
server be1 127.0.0.1:9090
And its results are:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:7000/
Running 30s test @ http://192.168.100.4:7000/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 171.75ms 126.96ms 1.93s 81.56%
Req/Sec 3.01k 0.88k 10.30k 73.52%
719739 requests in 30.08s, 146.20MB read
Socket errors: connect 0, read 0, write 0, timeout 514
Requests/sec: 23925.79
Transfer/sec: 4.86MB
The Tempesta FW configuration is just (note the conns_n parameter):
listen 192.168.100.4:80;
server 127.0.0.1:9090 conns_n=128;
cache 0;
While Tempesta FW shows the best results, they are still almost 2 times worse than with no proxying at all:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:80/
Running 30s test @ http://192.168.100.4:80/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 146.89ms 170.77ms 2.00s 88.21%
Req/Sec 4.05k 565.96 7.32k 77.58%
967299 requests in 30.08s, 252.76MB read
Socket errors: connect 0, read 0, write 0, timeout 19
Requests/sec: 32157.20
Transfer/sec: 8.40MB
Thus, HTTP proxying without a cache is very expensive: almost half the performance in the best case!
A very small static file is used in the tests, which emphasizes the overhead of network and HTTP processing. Of course, if you run the tests with large static files or heavy dynamic logic, you won't see such dramatic differences in the numbers. For example, I ran the tests with a 64KB index.html and sendfile switched on at the proxy, and the proxy overhead was only about 3%.
Thus, if you have a significant working set of small files which doesn't fit your web cache, a web accelerator may badly hurt the performance of your installation. Always analyze the access patterns of your web content or, better, run performance tests.
Backend connections
The first issue is backend server connections. In most cases modern HTTP proxies use the following simple algorithm:
- Establish a TCP connection with a backend server.
- Send an HTTP request over the connection. Now the connection is in the busy state.
- If a new request arrives, a new TCP connection is established with the server and we repeat step (2) for the new connection.
- When an HTTP response arrives from the server, we forward it to a client and mark the TCP connection as free. Now we can send upcoming requests through this connection.
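The free/busy algorithm above can be modeled in a few lines. This is an illustrative sketch, not any particular proxy's implementation; connections are stood in for by plain integers.

```python
class NaiveProxyPool:
    """Sketch of the simple free/busy backend connection algorithm."""

    def __init__(self):
        self.free = []       # connections with no request in flight
        self.busy = set()
        self.next_id = 0     # counts how many connections were opened

    def send_request(self):
        # Reuse a free connection if there is one, otherwise
        # "establish" a new TCP connection (steps 1-3 above).
        if self.free:
            conn = self.free.pop()
        else:
            conn = self.next_id
            self.next_id += 1
        self.busy.add(conn)
        return conn

    def on_response(self, conn):
        # Step 4: the response arrived, the connection is free again.
        self.busy.discard(conn)
        self.free.append(conn)

pool = NaiveProxyPool()
# 3 concurrent requests force 3 connections...
c1, c2, c3 = pool.send_request(), pool.send_request(), pool.send_request()
pool.on_response(c2)
# ...but the next request reuses the freed connection.
c4 = pool.send_request()
```

The consequence is direct: the number of backend connections grows to match the peak request concurrency.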
If we count the established backend connections during one of the proxy benchmarks, we see:
# ss -npto state established '( dport = :9090 )'|wc -l
3383
You may have noticed that this is very close to the number of connections from wrk. I'll explain this when I discuss HTTP pipelining, but for now I want to emphasize that an HTTP proxy needs almost the same number of connections with a backend server as it has with all its clients. It's worth mentioning that this happens only with very aggressive clients which send a lot of requests, e.g. DDoS bots. The consequence is that traditional HTTP accelerators aren't suitable for DDoS mitigation, since the protected backend servers can get the same number of connections as the proxies themselves.
Since 1.11.5 Nginx supports the max_conns option of the server directive to limit the number of backend connections (so my Nginx 1.10.3 from the Debian 9 packages doesn't have the option). HAProxy also supports the maxconn option for backend servers. In the same way, Tempesta FW provides the conns_n option.
Actually, depending on the particular hardware and type of workload, web servers have an optimal connection concurrency level. For example, Tempesta FW reaches a peak of 1.8M HTTP RPS on a 4-core machine with 1024 concurrent connections. So starting from a single backend server connection in step (1) and establishing too many connections in step (3) introduces more latency for "unhappy" requests - those requiring a new TCP connection - and reduces overall request processing performance.
Persistent connections
While establishing a new TCP connection in step (3) introduces unwanted latency into the request processing time, it makes sense to keep persistent TCP connections with a backend server. If an HTTP proxy keeps a pool of persistent connections with a backend server, then it's always ready for a sudden spike of ingress client requests, e.g. due to a flash crowd or a DDoS attack. This is what Tempesta FW does with conns_n: if you specify, for example, conns_n=128, then Tempesta FW keeps exactly 128 established connections to the backend server. (By the way, you can specify a different conns_n value for each backend server.)
HTTP basically manages the persistence of connections with two headers - keep-alive connections are defined in RFC 2068 - for example:
Connection: keep-alive
Keep-Alive: timeout=5, max=10000
I.e. the TCP connection timeout is 5 seconds and the maximum number of requests allowed over the connection is 10 thousand. Note that a client can only send Connection: keep-alive in requests, while the Keep-Alive parameters for the connection are determined by the server.
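A proxy or client has to extract these parameters from the header value. A minimal sketch of such a parser (simplified, not an RFC-grade implementation):

```python
def parse_keep_alive(value):
    """Parse a Keep-Alive header value such as 'timeout=5, max=10000'.

    Returns a dict of the recognized parameters; unrecognized tokens
    are ignored for simplicity.
    """
    params = {}
    for part in value.split(","):
        key, _, val = part.strip().partition("=")
        key = key.strip().lower()
        if key in ("timeout", "max") and val.strip().isdigit():
            params[key] = int(val.strip())
    return params

ka = parse_keep_alive("timeout=5, max=10000")
# ka["timeout"] drives the idle-connection timer, ka["max"] the
# request budget after which the connection must be closed.
```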
TCP connections failovering
There are 3 cases in which persistent connections with backend servers may fail:
- If there is no workload for some time (e.g. if you didn't enable backend server health monitoring), the TCP or HTTP keepalive timer elapses and the connection is closed.
- Backend servers can close such connections intentionally due to processing errors or default configuration (e.g. Nginx and Apache HTTPD close TCP client connections after every 100th request). Connection closing is also used to indicate the end of an HTTP response which has neither a Content-Length header nor chunked transfer encoding.
- A backend server may also just go down due to server maintenance or a failure. This case is handled by a server health monitoring process which catches service failures at different layers, e.g. a hard server reset as well as a web application failure in which the web server still responds, but in a wrong way.
If we resend an HTTP request, we have to limit the number of retransmissions. Consider a "killing" request which simply crashes your backend web application: a proxy sends the request to a backend server, the server fails, the proxy resends the request to another server and that server goes down the same way, until all the servers in the backend cluster are down. To prevent such situations all (I hope) HTTP proxies limit the number of request retransmissions; for example, with Tempesta FW you can use
server_forward_retries
for the limit (the default value is 5). The next question is: how long should a proxy keep a request in its internal queues in the hope of getting a successful response? After all, a client waiting too long for a web page to render considers the service down at some point. So again, all (I hope) HTTP proxies provide a time limit for request retransmissions. In the case of Tempesta FW, server_forward_timeout does this.
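The two limits compose naturally: a request is retried on failover until either the retry budget or the deadline is exhausted. Here is a hedged sketch of this logic; the parameter names only loosely mirror server_forward_retries and server_forward_timeout, and the "servers" are callables standing in for backend connections.

```python
import time

def forward_with_retries(request, servers, max_retries=5, timeout=60.0,
                         now=time.monotonic):
    """Resend a failed request, bounded by a retry count and a deadline.

    Each element of `servers` returns a response or raises
    ConnectionError on failure. Illustrative sketch only.
    """
    deadline = now() + timeout
    attempts = 0
    last_error = None
    # Initial attempt plus up to max_retries retransmissions.
    while attempts <= max_retries and now() < deadline:
        server = servers[attempts % len(servers)]  # failover rotation
        attempts += 1
        try:
            return server(request)
        except ConnectionError as e:
            last_error = e   # try the next server
    raise TimeoutError(f"giving up after {attempts} attempts") from last_error

# A hypothetical backend that fails twice, then succeeds:
calls = {"n": 0}
def flaky(req):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("reset")
    return f"200 OK for {req}"

resp = forward_with_retries("GET /", [flaky])
```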
The requirement to be able to resend a request introduces sk_buff copying. struct sk_buff is the Linux kernel descriptor of data being sent through the TCP/IP stack, and the TCP acknowledgement and retransmission mechanisms extensively update the descriptor. Since we may need to resend a request through some other TCP connection, we have to copy the sk_buff before its transmission through the TCP/IP stack. The problem is that the descriptor is relatively large, several hundred bytes in size. Originally Tempesta FW's network I/O was designed to be zero-copy, but connection failover doesn't allow us to fully avoid copies. The design of Tempesta FW's network I/O is described in my post What's Wrong With Sockets Performance And How to Fix It.
HTTP pipelining
HTTP requests can be pipelined, i.e. sent in a row without waiting for a response to each of them separately. Tempesta FW is one of the few HTTP proxies (the two others are Squid and Polipo) which can pipeline HTTP requests in backend server connections. All other HTTP proxies, including Nginx and HAProxy, are unable to use HTTP pipelining and wait for a response to each sent request. I.e. if you configure your HTTP proxy, e.g. Nginx or HAProxy, to establish at most, say, 100 connections with a backend server, then only 100 requests can be sent concurrently to the backend servers.
This is why we saw more than 3 thousand open connections between HAProxy and Nginx - the proxy needs that many connections to achieve the concurrency necessary to process the ingress workload.
To analyze the performance impact of HTTP pipelining let's reconfigure Tempesta FW to use only one connection to the backend server:
server 127.0.0.1:9090 conns_n=1;
And start the benchmark:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:80/
Running 30s test @ http://192.168.100.4:80
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 162.20ms 197.03ms 1.99s 87.05%
Req/Sec 3.73k 0.91k 12.67k 77.56%
888501 requests in 30.06s, 232.19MB read
Socket errors: connect 0, read 0, write 0, timeout 59
Requests/sec: 29556.83
Transfer/sec: 7.72MB
We see slightly lower performance due to the lower number of backend server connections, but the degradation is quite small. To see how pipelining actually works we can use Tempesta FW's application performance monitoring:
# while :; do \
grep '0 queue size' /proc/tempesta/servers/default/127.0.0.1\:9090; \
sleep 1; \
done
Connection 000 queue size : 97
Connection 000 queue size : 79
Connection 000 queue size : 163
Connection 000 queue size : 104
Connection 000 queue size : 326
Connection 000 queue size : 251
Connection 000 queue size : 286
Connection 000 queue size : 312
Connection 000 queue size : 923
Connection 000 queue size : 250
Connection 000 queue size : 485
^C
The script shows the instantaneous number of requests queued for transmission through the backend server connection. As you can see, there are hundreds of HTTP requests in flight at any moment in time.
Now let's make HAProxy use only one connection with the backend server. To do so we need to change the following settings:
global
...
nbproc 1
...
backend nginx
mode http
balance static-rr
fullconn 1
server be1 127.0.0.1:9090 minconn 1 maxconn 1
Since it cannot use pipelining, the performance degradation is significant, almost 4 times:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:7000/
Running 30s test @ http://192.168.100.4:7000/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 674.95ms 89.56ms 934.93ms 83.60%
Req/Sec 759.42 547.99 2.96k 67.90%
175041 requests in 30.10s, 35.56MB read
Requests/sec: 5815.86
Transfer/sec: 1.18MB
Idempotence of HTTP requests
While HTTP pipelining is good, not all HTTP requests can be pipelined. It's actually not so easy to implement HTTP pipelining correctly.
RFC 7231 4.2 defines safe methods - GET, HEAD, OPTIONS and TRACE - as methods which don't change server state, and safe methods are also idempotent. Only these methods can be pipelined safely. Consider that we send two requests, one of them non-safe (strictly speaking, both of the requests are non-idempotent here), to a server in a pipeline:
GET /forum?post="new%20post%20content" HTTP/1.1
Host: foo.com
\r\n
\r\n
POST /forum?post HTTP/1.1
Host: foo.com
Content-Length: 16
\r\n
new post content
\r\n
\r\n
If the server connection terminates just after the transmission, we start the failover process and resend the requests to other connections or another server. But can we resend the non-safe POST request? The problem is that we don't know whether the server processed the request and created a new post on the forum or not. If it did and we resend the request, we create the same post twice, which is undesirable. If we don't resend the request and just return an error code to the client, then our response is false. Thus RFC 7230 6.3.1 requires that a proxy must not automatically retry non-idempotent requests.
Moreover, note that the first GET request actually invokes some dynamic logic which may also change the server state, just like the second POST request. The GET request is essentially non-idempotent, but it's a web application developer's responsibility to use the right request methods in their applications. Conversely, POST requests can be idempotent if, for example, an application developer uses them to send a web search query.
Since request idempotence depends on the particular web application, Tempesta FW provides a configuration option to define which request methods to which URIs are non-idempotent, e.g. to make the first GET request non-idempotent we can add the following option to the configuration file:
nonidempotent GET prefix "/forum?post"
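The classification a proxy applies to each request can be sketched as follows. This is an illustrative model of such rules, not Tempesta FW's actual matching code; the prefix rule below mirrors the configuration example above.

```python
# Safe methods per RFC 7231; the article treats these as pipelinable.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE"}

def is_idempotent(method, uri, nonidempotent_prefixes=()):
    """Decide whether a request may be safely resent and pipelined.

    `nonidempotent_prefixes` plays the role of configured
    `nonidempotent` rules: URI prefixes whose safe-method requests
    actually mutate server state.
    """
    if method not in SAFE_METHODS:
        return False
    return not any(uri.startswith(p) for p in nonidempotent_prefixes)

rules = ("/forum?post",)
# A plain GET is idempotent; the configured forum GET and any POST are not.
```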
Pipeline queues
Let's consider pipelining of requests from 3 clients into 2 server connections. The first client sends a non-idempotent request (the large square marked "NI" in the picture). First, we keep all client requests in a per-client queue so that server responses are forwarded to a client in exactly the same order in which the client sent the corresponding requests. However, the queue is used only for ordering: when a request arrives, it's immediately processed by the load balancing logic ("LB" in the picture) and scheduled to some server queue. A non-idempotent request resides in a server queue just like an idempotent request, but we don't send other requests to that server connection until we receive a response for the non-idempotent request.
If you don't want HTTP pipelining at all you can set the server queue size to 1, i.e. only one request at a time will be queued:
server_queue_size 1;
It's clear that, in general, the second server queue holding a non-idempotent request drains more slowly than the first one, so Tempesta FW's load balancing algorithm prefers server queues without non-idempotent requests and uses such queues only if all the others are too busy.
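The preference described above can be modeled as a two-pass scheduler: first try unblocked, not-too-busy queues; fall back to the rest only when everything is busy. This is a simplified sketch, not Tempesta FW's actual scheduler, and the busy_threshold value is made up.

```python
def pick_queue(queues, busy_threshold=100):
    """Choose a server queue for a new idempotent request.

    Each queue is a dict with 'depth' (queued requests) and 'blocked'
    (True while a non-idempotent request awaits its response).
    Blocked queues are used only when every unblocked queue is too busy.
    """
    unblocked = [q for q in queues if not q["blocked"]]
    candidates = [q for q in unblocked if q["depth"] < busy_threshold]
    if not candidates:
        candidates = queues  # everything is busy: fall back to any queue
    return min(candidates, key=lambda q: q["depth"])

q1 = {"name": "conn1", "depth": 40, "blocked": False}
q2 = {"name": "conn2", "depth": 10, "blocked": True}
# conn2 is shorter but blocked by a non-idempotent request,
# so the scheduler prefers conn1.
chosen = pick_queue([q1, q2])
```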
Pipelined messages retransmission
There could be a request-killer, crashing a web application, among pipelined requests, and since we don't know which request exactly is the killer, RFC 7230 6.3.2 requires that a client must not pipeline immediately after connection establishment. Tempesta FW behaves the same way: if a server queue contains requests for retransmission, it doesn't schedule new requests into the queue until the last resent request is responded to.
Unless the server_retry_nonidempotent configuration option is specified, non-idempotent requests aren't resent and are just dropped. If we have idempotent requests before and after the non-idempotent one, we can still resend them to a live server. The sequence of responses is preserved thanks to an error response, which is generated for the dropped non-idempotent request.
Since requests can be scheduled to different servers, the corresponding responses can arrive in a different order. When a server response arrives, it's linked with the corresponding request, and the request is checked against the head of the client queue: if all the requests at the head of the queue have linked responses, then all those responses are sent at once (pipelined) to the client. For example, if we receive a response to the 1st client request while the 2nd and 3rd requests have already been responded to, then the whole head of the client queue - all 3 responses for the first 3 requests - can be sent to the client in a pipeline.
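The ordering logic above boils down to flushing the longest fully-responded prefix of the per-client queue. A minimal sketch, with request ids standing in for real request objects:

```python
from collections import deque

class ClientQueue:
    """Release responses in request order, as described above."""

    def __init__(self):
        self.pending = deque()   # request ids in the order they were sent
        self.responses = {}      # request id -> response body

    def on_request(self, req_id):
        self.pending.append(req_id)

    def on_response(self, req_id, resp):
        self.responses[req_id] = resp
        flushed = []
        # Flush the longest fully-responded prefix of the queue.
        while self.pending and self.pending[0] in self.responses:
            head = self.pending.popleft()
            flushed.append(self.responses.pop(head))
        return flushed

cq = ClientQueue()
for i in (1, 2, 3):
    cq.on_request(i)
out2 = cq.on_response(2, "resp2")   # head (1) not ready: nothing sent
out3 = cq.on_response(3, "resp3")   # still nothing
out1 = cq.on_response(1, "resp1")   # head ready: all three go out
```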
HTTP messages adjustment
HTTP proxies usually have to adjust the HTTP headers of forwarded messages, e.g. add a Via header or the current IP to the X-Forwarded-For header. To do so, we usually have to "rebuild" the headers from scratch: copy the original headers to some new memory location and add new headers and/or header values. Given that some HTTP headers, such as Cookie or User-Agent, as well as the URI, can easily reach several kilobytes in size for modern web applications, these data copies are undesirable.
Thus, if we consider a user-space HTTP proxy, then typically we have at least 2 data copies:
- Receive a request on first CPU
- Copy the request to user space
- Update headers (2nd copy)
- Copy the request back to kernel space (can be eliminated if splice(2) is used - it seems only HAProxy is able to do this)
- Send the request from the second CPU
Linux kernel HTTP proxying
Tempesta FW is built into the Linux TCP/IP stack, so we can use the full power of zero-copy sk_buff fragments and per-CPU TCBs. Details of HTTP proxying in the Linux kernel can be found in my Netdev 2.1 talk Kernel HTTP/TCP/IP stack for HTTP DDoS mitigation.
The first problem, HTTP message transformation, is solved by HTTP message fragmentation: if we need to add, delete or update some data in the middle of an HTTP message, then we
- create a new fragment pointer to the place where the new data must be inserted;
- create a new fragment with the new data and place its pointer just before the pointer from the previous step.
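The two steps above can be modeled with a list of byte fragments: inserting a header splits one fragment into two views and slots the new data between them, without copying any existing bytes. A toy model of the scheme, not the actual sk_buff code:

```python
def insert_fragment(frags, offset, new_data):
    """Insert `new_data` at byte `offset` without copying existing bytes.

    `frags` is a list of bytes-like fragments modeling sk_buff paged
    data; a fragment is only split into two memoryviews (no copies)
    and the new fragment is placed between them.
    """
    out = []
    pos = 0
    inserted = False
    for f in frags:
        end = pos + len(f)
        if not inserted and pos <= offset <= end:
            cut = offset - pos
            if cut:
                out.append(memoryview(f)[:cut])   # view, not a copy
            out.append(new_data)
            if cut < len(f):
                out.append(memoryview(f)[cut:])   # view, not a copy
            inserted = True
        else:
            out.append(f)
        pos = end
    if not inserted:
        out.append(new_data)
    return out

msg = [b"GET / HTTP/1.1\r\n", b"Host: foo.com\r\n\r\n"]
# Add a Via header right after the Host header.
msg = insert_fragment(msg, len(msg[0]) + len(b"Host: foo.com\r\n"),
                      b"Via: 1.1 proxy\r\n")
joined = b"".join(msg)
```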
To implement zero-copy HTTP message transformation we had to modify the sk_buff allocator to always use paged data.
To reduce the number and size of inter-CPU memory transfers we've introduced a per-CPU lock-free ring buffer for fast inter-CPU job transfer. Thanks to NIC RSS and the inter-CPU job transfer, TCBs are mostly accessed by the same CPU. If we need to forward a request processed on the first CPU through a TCB residing on the second CPU, we just put a job into the ring buffer of the second CPU, and the softirq working on that CPU takes care of the actual transmission.
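The job hand-off can be illustrated with a minimal fixed-size ring buffer. This is a toy, single-threaded model of the per-CPU queue: real lock-free operation needs atomic head/tail updates and memory barriers, which this sketch does not attempt.

```python
class RingBuffer:
    """Minimal fixed-size ring buffer for inter-CPU job hand-off.

    The producer CPU puts a transmit job; the consumer CPU (softirq
    in the kernel case) drains it in FIFO order.
    """
    def __init__(self, size=8):
        self.buf = [None] * size
        self.head = 0   # total jobs written
        self.tail = 0   # total jobs read

    def put(self, job):
        if self.head - self.tail == len(self.buf):
            return False            # full: caller must retry later
        self.buf[self.head % len(self.buf)] = job
        self.head += 1
        return True

    def get(self):
        if self.tail == self.head:
            return None             # empty
        job = self.buf[self.tail % len(self.buf)]
        self.tail += 1
        return job

rb = RingBuffer(size=2)
ok1, ok2, ok3 = rb.put("tx skb 1"), rb.put("tx skb 2"), rb.put("tx skb 3")
first = rb.get()                    # jobs come out in FIFO order
```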
HTTP/2
You might wonder why I talk about HTTP/1.1 pipelining when there is HTTP/2, which provides much better request multiplexing and is free from the head-of-line blocking problem.
In this article I described HTTP/1.1 pipelining to backend servers only. It's harder to implement HTTP/2 in a zero-copy fashion (I'll address the problems in a further article). Meanwhile, the biggest advantage of the protocol appears in the global network with low-speed connections and high delays, which is not the case for local networks with 10G links connecting an HTTP reverse proxy to backend servers. So using HTTP/2 for backend connections is doubtful. By the way, neither HAProxy nor Nginx supports HTTP/2 for backend connections.