NatSys Laboratory Ltd. was rebranded to Tempesta Technologies, so we closed the original site natsys-lab.com, and now it's time to move to the Tempesta Technologies blog.
All further posts will be available at http://tempesta-tech.com/blog/ . The blog will remain the same - we're passionate about technology, so we'll continue with the deep technical articles.
Here is a summary of the most interesting posts from this blog.
I hope you'll enjoy our new blog!
High Performance Linux
High Performance Multi-core Networked and Storage Systems for Linux
Saturday, May 18, 2019
Intelpocalypse: goodbye fast system calls
Intel announced the MDS (aka ZombieLoad) vulnerability. Earlier, in 2018, Meltdown was announced.
The modern Linux kernel is compiled with Kernel Page Table Isolation (KPTI) to prevent Meltdown. Essentially, KPTI is just the removal of an old technique to optimize system calls (aka lazy TLB): kernel space is mapped into the page tables of all user-space processes, so there is no need to flush the TLB on kernel/user-space context switches. The performance impact is serious: up to 20% for Nginx (MariaDB even saw 40% for certain workloads).
MDS goes further in slowing down system calls: it introduces mds_clear_cpu_buffers(), called on each context switch. The performance impact doesn't seem as large as for the Meltdown mitigation, but it's clear that system calls become even slower.
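For reference, the MDS mitigation boils down to issuing a VERW instruction with a memory operand, which updated microcode overloads to flush the affected CPU buffers. Below is a simplified sketch of the idea, modeled loosely on the kernel's mds_clear_cpu_buffers(); the real code lives in the arch/x86 headers and uses kernel-specific selectors and macros.

/* Sketch only: VERW with a memory operand flushes the affected CPU-internal
 * buffers on microcode-updated CPUs. The kernel passes __KERNEL_DS here. */
static inline void
clear_cpu_buffers(unsigned short kernel_ds)
{
    /* 'cc' clobber because VERW sets ZF. */
    asm volatile("verw %[ds]" : : [ds] "m" (kernel_ds) : "cc");
}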
The good news is that Tempesta FW works in kernel space, so there are no context switches and KPTI and MDS don't affect our performance at all. Moreover, we carefully program our most performance-critical code (HTTP processing) in assembly and use the retpoline Spectre mitigation only where it's necessary. Retpoline may have up to a 15% performance impact, but, fortunately, not every indirect jump must use retpoline to be safe against Spectre.
Wednesday, March 28, 2018
HTTP Requests Proxying
It may seem easy to proxy HTTP requests - after all, we just receive an HTTP request, queue it for retransmission, send it to a backend server, and do the same with the HTTP response when we get it from the server. However, things aren't so simple in modern HTTP proxies. In this article I'm going to address several interesting problems in HTTP/1.1 proxying. I'll mostly concentrate on HTTP reverse proxies, also known as web accelerators.
HTTP reverse proxying
First of all, let's see what an HTTP reverse proxy is and how it works internally. Besides web acceleration, i.e. caching web content, reverse proxies do a lot of other work:
- Load balancing among several servers, sometimes with different performance characteristics. E.g. in the picture we have a large 3rd server, capable of handling more requests per second than the other two, so it should get more requests.
- Automatic failover of failed servers. The second server in the picture fails, so the proxy must balance ingress requests among the two remaining servers. When the server returns to normal operation, the load must be balanced among all three servers again.
- Since TLS is resource hungry, it makes sense to terminate TLS on the proxy, so backend servers can spend their resources on more useful application logic.
- There can be clients with different software, too outdated or too recent, and the proxy should convert the protocols to forms more suitable for the backend servers, e.g. it can downgrade HTTP/2 to HTTP/1.1 or upgrade HTTP/1.0 to HTTP/1.1.
- How many connections should a proxy establish with each backend server?
- Sometimes backend servers reset connections (by default, Nginx and Apache HTTPD reset a connection after it has served 100 requests). When this happens, how does a proxy manage the connection reset? Obviously, it should be handled more gracefully than an actual server failure.
- Since we pass HTTP messages from a client socket to a backend server socket and vice versa, there must be message queues, and these queues must be properly managed with connection failover in mind.
- Is it safe to resend an HTTP request to another backend if the current backend cannot properly answer the request?
- When you pass data from one socket (e.g. a client's) to another (e.g. a server's), there are data copies and high lock contention. The problem is especially acute for TLS-encrypted data and HTTP/2.
Performance of HTTP proxying
Let's start with a small test of a web server. I take Nginx 1.10.3 running in a Debian 9 VM with 2 virtual CPUs and 2GB of RAM on my laptop (Intel i7-6500U). For workload generation I'll use wrk. This is a toy test environment, so the numbers in the article only make sense in a relative way.
Let's start by getting numbers for the raw performance of a web server hosting a single small index.html file of 3 bytes (hereafter I make 3 test runs and take the best results):
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:9090/
Running 30s test @ http://192.168.100.4:9090/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 129.32ms 208.24ms 1.99s 90.28%
Req/Sec 7.86k 1.67k 13.88k 75.49%
1877247 requests in 30.10s, 424.30MB read
Socket errors: connect 0, read 0, write 0, timeout 1374
Requests/sec: 62368.01
Transfer/sec: 14.10MB
That's 62K HTTP RPS with no HTTP errors. The Nginx configuration is:
worker_processes auto;
worker_cpu_affinity auto;

events {
    worker_connections 65536;
    use epoll;
    multi_accept on;
    accept_mutex off;
}

worker_rlimit_nofile 1000000;

http {
    keepalive_timeout 600;
    keepalive_requests 10000000;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;

    open_file_cache max=1000 inactive=3600s;
    open_file_cache_valid 3600s;
    open_file_cache_min_uses 2;
    open_file_cache_errors off;

    error_log /dev/null emerg;
    access_log off;

    server {
        listen 9090 backlog=131072 deferred reuseport fastopen=4096;
        location / { root /var/www/html; }
    }
}
I didn't apply any special sysctl settings since all the tests ran with the same OS settings and I didn't care much about absolute numbers. I just applied basic performance tuning and switched off logging to keep slow filesystem writes out of the discussion.
Next, let's run a proxy in front of the web server. The same Nginx in the same VM is used, but with a different configuration. Note that I switched off the web cache to measure how much overhead HTTP proxying itself incurs.
worker_processes auto;
worker_cpu_affinity auto;

events {
    worker_connections 65536;
    use epoll;
    multi_accept on;
    accept_mutex off;
}

worker_rlimit_nofile 1000000;

http {
    sendfile off; # too small file
    tcp_nopush on;
    tcp_nodelay on;

    keepalive_timeout 600;
    keepalive_requests 1000000;

    access_log off;
    error_log /dev/null emerg;

    gzip off;

    upstream u {
        server 127.0.0.1:9090;
        keepalive 4096;
    }

    server {
        listen 9000 backlog=131072 deferred reuseport fastopen=4096;
        location / {
            proxy_pass http://u;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }

    proxy_cache off;
}
Update: thanks to Maxim Dounin, a lead Nginx developer, for pointing out the keepalive configuration option to me - without it Nginx shows roughly half the performance.
Let's see how many RPS we get in this configuration:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:9000/
Running 30s test @ http://192.168.100.4:9000/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 174.78ms 141.82ms 1.99s 87.09%
Req/Sec 3.01k 0.95k 8.28k 70.62%
718994 requests in 30.09s, 162.51MB read
Socket errors: connect 0, read 0, write 0, timeout 229
Requests/sec: 23894.05
Transfer/sec: 5.40MB
Let's also try HAProxy and Tempesta FW, which usually deliver more performance for HTTP proxying. I used HAProxy 1.7.5 and Tempesta FW 0.5.0. The HAProxy configuration is:
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    user haproxy
    group haproxy
    daemon
    maxconn 65536
    nbproc 2
    cpu-map 1 0
    cpu-map 2 1

defaults
    log global
    mode http
    http-reuse always
    no log
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

frontend test
    bind :7000
    mode http
    maxconn 65536
    default_backend nginx

backend nginx
    mode http
    balance static-rr
    server be1 127.0.0.1:9090
And its results are:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:7000/
Running 30s test @ http://192.168.100.4:7000/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 171.75ms 126.96ms 1.93s 81.56%
Req/Sec 3.01k 0.88k 10.30k 73.52%
719739 requests in 30.08s, 146.20MB read
Socket errors: connect 0, read 0, write 0, timeout 514
Requests/sec: 23925.79
Transfer/sec: 4.86MB
The Tempesta FW configuration is just (note the conns_n parameter):
listen 192.168.100.4:80;
server 127.0.0.1:9090 conns_n=128;
cache 0;
While Tempesta FW shows the best results, they are still about 2 times worse than with no proxying at all:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:80/
Running 30s test @ http://192.168.100.4:80/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 146.89ms 170.77ms 2.00s 88.21%
Req/Sec 4.05k 565.96 7.32k 77.58%
967299 requests in 30.08s, 252.76MB read
Socket errors: connect 0, read 0, write 0, timeout 19
Requests/sec: 32157.20
Transfer/sec: 8.40MB
Thus, HTTP proxying without a cache is very expensive: almost half the performance even in the best case!
A very small static file is used in the tests, which puts more of the overhead into network and HTTP processing. Of course, if you run the tests with large static files or heavy dynamic logic, you won't see such dramatic differences. For example, I ran the tests with a 64KB index.html and sendfile enabled on the proxy, and the proxy overhead was only about 3%.
Thus, if you have a significant working set of small files that doesn't fit your web cache, a web accelerator may badly hurt the performance of your installation. Always analyze the access patterns of your web content or, better, run performance tests.
Backend connections
The first issue is backend server connections. In most cases modern HTTP proxies use the following simple algorithm (a sketch in C follows the list):
1. Establish a TCP connection with a backend server.
2. Send an HTTP request over the connection. Now the connection is in busy state.
3. If a new request arrives, a new TCP connection is established with the server and we do step (2) for the new connection.
4. When an HTTP response arrives from the server, we forward it to the client and mark the TCP connection as free. Now we can send upcoming requests through the connection.
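Here is a minimal sketch of this naive scheme in C; all names (conn_pool, tcp_connect_to_backend, forward_request) are hypothetical helpers, not any particular proxy's API.

#include <stdbool.h>
#include <stddef.h>

struct backend_conn {
    bool busy;              /* a request is in flight on this connection */
    int  fd;
};

struct conn_pool {
    struct backend_conn *conns;
    size_t               nr;    /* currently established connections */
    size_t               cap;   /* capacity of the conns array */
};

int tcp_connect_to_backend(void);                /* steps (1) and (3) */
void forward_request(int fd, const void *req);   /* step (2) */

static void
proxy_forward(struct conn_pool *p, const void *req)
{
    /* Reuse a free persistent connection if one exists... */
    for (size_t i = 0; i < p->nr; ++i) {
        if (!p->conns[i].busy) {
            p->conns[i].busy = true;
            forward_request(p->conns[i].fd, req);
            return;
        }
    }
    /* ...otherwise open a new one: this is how the number of backend
     * connections grows toward the number of client connections. */
    if (p->nr < p->cap) {
        p->conns[p->nr] = (struct backend_conn){
            .busy = true,
            .fd = tcp_connect_to_backend(),
        };
        forward_request(p->conns[p->nr].fd, req);
        p->nr++;
    }
}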
If we count the established proxy-to-backend connections while a benchmark against the proxy is running, we get:
# ss -npto state established '( dport = :9090 )'|wc -l
3383
You may have noticed that this is very close to the number of connections from wrk. I'll explain this when I discuss HTTP pipelining, but for now I want to emphasize that an HTTP proxy needs almost as many connections to a backend server as it has to all its clients. It's worth mentioning that this happens only with very aggressive clients that send a lot of requests, e.g. DDoS bots. The consequence is that traditional HTTP accelerators aren't suitable for DDoS mitigation, since protected backend servers can get the same number of connections as the proxies do.
Nginx since 1.11.5 supports the max_conns option of the server directive to limit the number of backend connections (so my Nginx 1.10.3 from the Debian 9 packages doesn't have the option). HAProxy also supports the maxconn option for backend servers. In the same way, Tempesta FW provides the conns_n option.
Actually, depending on the particular hardware and type of workload, web servers have an optimal connection concurrency level. For example, Tempesta FW reaches a peak of 1.8M HTTP RPS on a 4-core machine with 1024 concurrent connections. So starting from a single backend server connection at step (1) and establishing too many connections at step (3) adds latency for "unhappy" requests - those that require establishing a new TCP connection - and reduces overall request processing performance.
Persistent connections
Since establishing a new TCP connection at step (3) adds unwanted latency to request processing time, it makes sense to keep persistent TCP connections with a backend server. If an HTTP proxy keeps a pool of persistent connections with a backend server, then it's always ready for an instant spike of ingress client requests, e.g. due to a flash crowd or a DDoS attack. This is what Tempesta FW does with conns_n: if you specify, for example, conns_n=128, then Tempesta FW keeps exactly 128 established connections to the backend server. (By the way, you can specify a different conns_n value for each backend server.)
HTTP basically manages connection persistence with two headers - keep-alive connections are defined in RFC 2068 - for example:
Connection: keep-alive
Keep-Alive: timeout=5, max=10000
I.e. the TCP connection timeout is 5 seconds and at most 10 thousand requests may pass over the connection. Note that a client can only send Connection: keep-alive in requests, while the Keep-Alive parameters of the connection are determined by the server.
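As a small illustration, extracting the parameters from such a Keep-Alive value can be sketched as follows; a real parser must handle arbitrary parameter order, whitespace and missing fields, which this sketch ignores.

#include <stdio.h>

int
main(void)
{
    /* Header value in the fixed layout shown above (an assumption). */
    const char *value = "timeout=5, max=10000";
    int timeout = 0, max = 0;

    if (sscanf(value, "timeout=%d, max=%d", &timeout, &max) == 2)
        printf("keep the connection for %ds, at most %d requests\n",
               timeout, max);
    return 0;
}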
TCP connection failover
There are 3 cases when persistent connections with backend servers may fail:
- If there is no workload for some time (e.g. if you didn't enable backend servers' health monitoring), the TCP or HTTP keepalive timer expires and the connection is closed.
- Backend servers can close such connections intentionally due to processing errors or default configuration (e.g. Nginx and Apache HTTPD close client TCP connections after every 100th request). Connection closing is also used to indicate the end of an HTTP response that has neither a Content-Length header nor chunked transfer encoding.
- A backend server may also just go down due to maintenance or a failure. This case is handled by a server health monitoring process which catches service failures on different layers, e.g. a hard server reset as well as a web application failure where the web server still responds, but in a wrong way.
If we resend an HTTP request, we have to limit the number of retransmissions for it. Consider a "killer" request which simply crashes your backend web application: a proxy sends the request to a backend server, the server fails, the proxy resends the request to another server and that server goes down the same way, until all the servers in the backend cluster are down. To prevent such situations all (I hope) HTTP proxies limit the number of request retransmissions; for example, with Tempesta FW you can use server_forward_retries for the limit (the default value is 5).
The next question is: for how long should a proxy keep a request in its internal queues hoping to get a successful response for it? After all, a client waiting too long for a web page to render considers the service down at some point. So again, all (I hope) HTTP proxies provide a time limit for request retransmissions. In the case of Tempesta FW, server_forward_timeout does this.
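A hypothetical sketch of the per-request bookkeeping behind options like server_forward_retries and server_forward_timeout: a request stays eligible for another retransmission only while both limits hold. Field and function names are illustrative, not Tempesta FW's actual code.

#include <stdbool.h>
#include <time.h>

struct fwd_req {
    unsigned int retries;      /* how many times it was already resent */
    time_t       first_sent;   /* when it was first forwarded */
};

static bool
may_resend(const struct fwd_req *r, unsigned int max_retries,
           time_t fwd_timeout, time_t now)
{
    return r->retries < max_retries
           && now - r->first_sent < fwd_timeout;
}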
The requirement to be able to resend a request introduces sk_buff copying. struct sk_buff is the Linux kernel descriptor of data being sent through the TCP/IP stack, and the TCP acknowledgement and retransmission mechanisms extensively update the descriptor. Since we may need to resend a request through some other TCP connection, we have to copy the sk_buff before its transmission through the TCP/IP stack. The problem is that the descriptor is relatively large, several hundred bytes in size. Originally Tempesta FW network I/O was designed to be zero-copy, but connection failover doesn't allow us to fully avoid copies. The design of Tempesta FW network I/O is described in my post What's Wrong With Sockets Performance And How to Fix It.
HTTP pipelining
HTTP requests can be pipelined, i.e. sent in a row without waiting for the response to each of them separately. Tempesta FW is one of the few HTTP proxies (the other two are Squid and Polipo) which can pipeline HTTP requests in backend server connections. All other HTTP proxies, including Nginx and HAProxy, are unable to use HTTP pipelining and wait for a response to each sent request. I.e. if you configure your HTTP proxy, e.g. Nginx or HAProxy, to establish at most, say, 100 connections with a backend server, then only 100 requests can be sent concurrently to the backend servers.
This is why we saw 3 thousand open connections between HAProxy and Nginx - the proxy needs that many connections to achieve the concurrency necessary to process the ingress workload.
To analyze the performance impact of HTTP pipelining, let's reconfigure Tempesta FW to use only one connection to the backend server:
server 127.0.0.1:9090 conns_n=1;
And start the benchmark:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:80/
Running 30s test @ http://192.168.100.4:80
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 162.20ms 197.03ms 1.99s 87.05%
Req/Sec 3.73k 0.91k 12.67k 77.56%
888501 requests in 30.06s, 232.19MB read
Socket errors: connect 0, read 0, write 0, timeout 59
Requests/sec: 29556.83
Transfer/sec: 7.72MB
We see somewhat lower performance due to the lower number of backend server connections, but the degradation is quite small. To see how pipelining actually works, we can use Tempesta FW's application performance monitoring:
# while :; do \
grep '0 queue size' /proc/tempesta/servers/default/127.0.0.1\:9090; \
sleep 1; \
done
Connection 000 queue size : 97
Connection 000 queue size : 79
Connection 000 queue size : 163
Connection 000 queue size : 104
Connection 000 queue size : 326
Connection 000 queue size : 251
Connection 000 queue size : 286
Connection 000 queue size : 312
Connection 000 queue size : 923
Connection 000 queue size : 250
Connection 000 queue size : 485
^C
The script shows the instantaneous number of requests queued for transmission through the backend server connection. As you can see, there are hundreds of HTTP requests in flight at any moment.
Now let's make HAProxy use only one connection to the backend server. To do so we need to change the following settings:
global
    ...
    nbproc 1
    ...

backend nginx
    mode http
    balance static-rr
    fullconn 1
    server be1 127.0.0.1:9090 minconn 1 maxconn 1
Since it cannot use pipelining, the performance degradation is significant - almost 4x:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:7000/
Running 30s test @ http://192.168.100.4:7000/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 674.95ms 89.56ms 934.93ms 83.60%
Req/Sec 759.42 547.99 2.96k 67.90%
175041 requests in 30.10s, 35.56MB read
Requests/sec: 5815.86
Transfer/sec: 1.18MB
Idempotence of HTTP requests
While HTTP pipelining is good, not all HTTP requests can be pipelined, and it's actually not so easy to implement HTTP pipelining correctly.
RFC 7231 4.2 defines safe methods as methods that don't change server state: GET, HEAD, TRACE and OPTIONS. Only these methods can be pipelined safely. Consider that we send two requests, one of them non-safe (strictly speaking, both of the requests are non-idempotent, as discussed below), to a server in a pipeline:
GET /forum?post="new%20post%20content" HTTP/1.1
Host: foo.com
\r\n
\r\n
POST /forum?post HTTP/1.1
Host: foo.com
Content-Length: 16
\r\n
new post content
\r\n
\r\n
If the server connection terminates just after the transmission, we go through the failover process and resend the requests to another connection or server. But can we resend the non-safe POST request? The problem is that we don't know whether the server processed the request and created a new post on the forum or not. If it did and we resend the request, we create the same post twice, which is unwanted. If we don't resend the request and just return an error code to the client, then our response is wrong. This is why RFC 7230 6.3.1 requires that a proxy must not automatically retry non-idempotent requests.
Moreover, note that the first GET request actually invokes some dynamic logic which may also change server state, just like the second POST request. The GET request is essentially non-idempotent, but it is a web application developer's responsibility to use the right request methods in their applications. Conversely, POST requests can be idempotent if, for example, an application developer uses them to send a web search query.
Since request idempotence depends on the particular web application, Tempesta FW provides a configuration option to define which request methods to which URIs are non-idempotent; e.g. to mark the GET request above as non-idempotent we can add the following option to the configuration file:
nonidempotent GET prefix "/forum?post"
Pipeline queues
Let's consider pipelining of requests from 3 clients to 2 server connections. The first client sends a non-idempotent request (the large square marked "NI" in the picture). First, we keep all client requests in a per-client queue so that server responses can be forwarded to the client in exactly the same order in which the client sent the corresponding requests. However, the queue is used only for ordering: when a request arrives, it's immediately processed by the load balancing logic ("LB" in the picture) and scheduled to some server queue. A non-idempotent request resides in a server queue just like an idempotent one, but we don't send other requests to that server connection until we receive a response for the non-idempotent request.
If you don't want HTTP pipelining at all, you can set the server queue size to 1, i.e. only one request at a time will be queued:
server_queue_size 1;
It's clear that, in general, the second server queue, which holds a non-idempotent request, is drained more slowly than the first one, so Tempesta FW's load balancing algorithm prefers server queues without non-idempotent requests and uses queues containing them only if all others are too busy.
Pipelined messages retransmission
There could be a killer request, crashing a web application, among the pipelined requests, so RFC 7230 6.3.2 requires that a client must not pipeline immediately after connection establishment, since we don't know which request exactly is the killer. Tempesta FW does the same: if a server queue contains requests for retransmission, it doesn't schedule new requests to the queue until the last resent request has been responded to.
Unless the server_retry_nonidempotent configuration option is specified, non-idempotent requests aren't resent and are just dropped. If we have idempotent requests before and after the non-idempotent one, we can still resend them to a live server. The sequence of responses is kept thanks to an error response generated for the dropped non-idempotent request.
Since requests can be scheduled to different servers, their responses can arrive in a different order. When a server response arrives, it's linked with the appropriate request and the request is checked against the head of the client queue: if all the requests at the head of the queue have linked responses, then all those responses are sent at once (pipelined) to the client. For example, if we receive a response for the 1st client request while the 2nd and 3rd requests already have responses, then the whole head of the client queue - all 3 responses for the first 3 requests - can be sent to the client in a pipeline. A sketch of this logic is shown below.
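The following is a minimal sketch of the ordering logic, with illustrative names rather than Tempesta FW's actual structures: responses are flushed to the client only from the head of the per-client queue, and only while every request at the head already has a response linked to it.

struct response;                                 /* opaque response object */
void send_response_to_client(struct response *resp);

struct client_req {
    struct client_req *next;
    struct response   *resp;   /* NULL until the backend answers */
};

struct client_queue {
    struct client_req *head;   /* oldest request not yet answered to the client */
};

static void
on_backend_response(struct client_queue *q, struct client_req *req,
                    struct response *resp)
{
    req->resp = resp;          /* link the response to its request */

    /* Send responses from the head of the per-client queue only while every
     * request at the head already has a linked response; this preserves the
     * original request order even though backends answer out of order. */
    while (q->head && q->head->resp) {
        send_response_to_client(q->head->resp);
        q->head = q->head->next;
    }
}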
HTTP messages adjustment
HTTP proxies usually have to adjust the HTTP headers of forwarded messages, e.g. add a Via header or append the client IP to the X-Forwarded-For header. To do so we usually have to "rebuild" the headers from scratch: copy the original headers to some new memory location and add new headers and/or header values. Given that some HTTP headers, such as Cookie or User-Agent, as well as the URI, can easily reach several kilobytes in size for modern web applications, these data copies are unwanted.
Thus, if we consider a user-space HTTP proxy, we typically have at least 2 data copies:
- Receive a request on the first CPU
- Copy the request to user space
- Update headers (2nd copy)
- Copy the request back to kernel space (this can be eliminated if splice(2) is used - it seems only HAProxy is able to do this)
- Send the request from the second CPU
Linux kernel HTTP proxying
Tempesta FW is built into the Linux TCP/IP stack, so we can use the full power of zero-copy sk_buff fragments and per-CPU TCBs. Details of HTTP proxying in the Linux kernel can be found in my Netdev 2.1 talk Kernel HTTP/TCP/IP stack for HTTP DDoS mitigation.
The first problem, HTTP message transformation, is solved by HTTP message fragmentation: if we need to add, delete or update some data in the middle of an HTTP message, then we (see the sketch after the list):
- create a new fragment pointer to a place where the new data must be inserted
- create a new fragment with the new data and place its pointer just before the pointer from the previous step.
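Here is a simplified sketch of the idea, representing a message as a plain array of (pointer, length) fragments instead of real sk_buff fragments: inserting data in the middle splits one descriptor and adds another, and the payload itself is never copied. All names are illustrative, not Tempesta FW's code.

#include <stddef.h>

#define MAX_FRAGS 16

struct frag {
    const char *data;
    size_t      len;
};

struct msg {
    struct frag frags[MAX_FRAGS];
    int         nr_frags;
};

/* Insert @data of @len bytes at byte offset @off inside fragment @i. */
static int
msg_insert(struct msg *m, int i, size_t off, const char *data, size_t len)
{
    if (m->nr_frags + 2 > MAX_FRAGS)
        return -1;

    /* Shift the tail of the array to make room for two new descriptors. */
    for (int j = m->nr_frags - 1; j > i; --j)
        m->frags[j + 2] = m->frags[j];

    /* Split frags[i] at @off and put the new data between the halves. */
    m->frags[i + 2] = (struct frag){ m->frags[i].data + off,
                                     m->frags[i].len - off };
    m->frags[i + 1] = (struct frag){ data, len };
    m->frags[i].len = off;
    m->nr_frags += 2;

    return 0;
}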
To implement zero-copy HTTP message transformation we had to modify the sk_buff allocator to always use paged data.
To reduce the number and size of inter-CPU memory transfers we've introduced per-CPU lock-free ring buffers for fast inter-CPU job transfer. Thanks to NIC RSS and this job transfer mechanism, TCBs are mostly accessed by the same CPU. If we need to forward a request processed on the first CPU through a TCB residing on the second CPU, we just put a job into the ring buffer of the second CPU, and the softirq running on that CPU takes care of the actual transmission.
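Below is a minimal single-producer/single-consumer lock-free ring buffer sketch illustrating the idea of such per-CPU job queues; this is not Tempesta FW's actual implementation, and the names and size are illustrative.

#include <stdatomic.h>
#include <stdbool.h>

#define RB_SIZE 4096   /* must be a power of two */

struct job_ring {
    _Atomic unsigned long head;   /* written only by the producer CPU */
    _Atomic unsigned long tail;   /* written only by the consumer CPU */
    void *jobs[RB_SIZE];
};

/* Called on the CPU that processed the request. */
static bool
ring_push(struct job_ring *r, void *job)
{
    unsigned long head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned long tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail >= RB_SIZE)
        return false;                 /* ring is full */
    r->jobs[head & (RB_SIZE - 1)] = job;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Called from softirq context on the CPU owning the target TCB. */
static void *
ring_pop(struct job_ring *r)
{
    unsigned long tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned long head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail == head)
        return NULL;                  /* ring is empty */
    void *job = r->jobs[tail & (RB_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return job;
}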
HTTP/2
You might wonder why I talk about HTTP/1.1 pipelining when there is HTTP/2, which provides much better request multiplexing and is free from the head-of-line blocking problem.
In this article I described HTTP/1.1 pipelining to backend servers only. It's harder to implement HTTP/2 in a zero-copy fashion (I'll address those problems in a further article). Meanwhile, the biggest advantage of the protocol shows up on the global network with low-speed connections and high delays, which is not the case for local networks with 10G links connecting an HTTP reverse proxy to backend servers. So using HTTP/2 for backend connections is of doubtful value. By the way, neither HAProxy nor Nginx support HTTP/2 for backend connections.
Sunday, October 30, 2016
HTTP Strings Processing Using C, SSE4.2 and AVX2
In this article I describe the application of the standard C functions strcasecmp(3) and strspn(3) to HTTP parsing. Surprisingly, the functions can be specialized for the HTTP parsing task, which makes them much faster. Next I consider using SSE4.2 and AVX2 to implement the specialized versions of the functions and show a serious performance improvement. The GLIBC and Linux kernel implementations of strcasecmp(3) and strspn(3) are described, as well as the relevant routines from PicoHTTPParser, its modification by CloudFlare and, of course, Tempesta FW.
I finished my previous post with a bottleneck on long string parsing in our HTTP parser; the issue details can be found on GitHub. In particular there are two bottlenecks: strspn()-like functions searching for delimiters and strcasecmp(), which is used in many places.
There are a few important properties of strings in HTTP messages that make their processing special.
All mature HTTP parsers use an FSM (finite state machine) to process messages. The different approaches, with benchmarks, are covered in my earlier article. However, parsers differ in how deeply they analyze a message. For example, take the following HTTP message:
GET / HTTP/1.0\r\n
Host: example.net\r\n
Cookie: session=42; theme="dark"\r\n
When a parser reads the Cookie header it has several choices for how deeply to parse its value.
While Nginx accurately parses out all URI parts (see the sw_check_uri and sw_uri states in ngx_http_parse_request_line(), src/http/ngx_http_parse.c), PicoHTTPParser just checks the URI alphabet as 0x20 (Space) < ch < 0x7f (DEL). Various delimiters and special characters (e.g. '"' or '\') are therefore allowed by PicoHTTPParser, while they should have been filtered out.
Nginx uses an old-school loop-and-switch based approach like:
for (p = b->pos; p < b->last; p++) {
    ch = *p;
    switch (state) {
    case sw_start:
        // ....
    case sw_foo:
        // .....
    }
}
The FSM is obvious and easy to program, but it's quite slow. See my article about high-performance HTTP parsers and Kazuho Oku's presentation, slides 31-33, for an explanation. So PicoHTTPParser uses the SSE4.2 instruction PCMPESTRI. The instruction can match a 16-byte string against a set of characters, ranges or another string. Since this is a hardware-implemented string matcher, it works much faster than the naive loop-and-switch based FSM. However, you're very limited in what you can match: at most 8 ranges or 16 characters. Moreover, you can't mix range matching with character matching (i.e. you cannot match characters like 0x0 < ch < 0x20 && ch == '"'). The pity is that the character sets for URIs and most HTTP headers exceed these limits (the sets can have about 10 ranges), so using the instruction as an HTTP string matcher implies weak content checking. If you're going to use the instruction, which is somewhat tricky, you'll probably find Andi Kleen's calculator very useful.
While PicoHTTPParser is very fast, Vlad Krasnov from CloudFlare went further, replacing the PCMPESTRI instruction with AVX2 code. The code basically checks the range (ch >= 0x20 || ch == '\t') && (ch < 0x7f). While PicoHTTPParser uses only one instruction to do string matching, CloudFlare's version executes much more code: AVX2 has no string processing instruction, so there is a separate instruction for each comparison and logical operator. However, it executes much faster because it can eat 32 bytes per step. Moreover, Vlad also did loop unrolling, such that 128 bytes are eaten at a time.
I wrote a simple benchmark to study both approaches; you can find the whole code on GitHub. The benchmark is focused on URI processing, so the PicoHTTPParser approach looks like:
static size_t
findchar_fast(const char *str, size_t len, const char *ranges,
              size_t ranges_sz, int *found)
{
    __m128i ranges16 = _mm_loadu_si128((const __m128i *)ranges);
    const char *s = str;
    size_t left = len & ~0xf;

    *found = 0;
    do {
        __m128i b16 = _mm_loadu_si128((void *)s);
        int r = _mm_cmpestri(ranges16, ranges_sz, b16, 16,
                             _SIDD_LEAST_SIGNIFICANT
                             | _SIDD_CMP_RANGES
                             | _SIDD_UBYTE_OPS);
        if (r != 16) {
            *found = 1;
            return s - str + r;
        }
        s += 16;
        left -= 16;
    } while (left);

    return s - str;
}
size_t
picohttpparser_findchar_fast(const char *str, size_t len)
{
    static const unsigned char ranges[] __attribute__((aligned(16))) =
        "\x00 "        /* control chars and up to SP */
        "\"\""         /* 0x22 */
        "<<"           /* 0x3c,0x3c */
        ">>"           /* 0x3e,0x3e */
        "\\\\"         /* 0x5c,0x5c */
        "^^"           /* 0x5e,0x5e */
        "{}"           /* 0x7b-0x7d */
        "\x7f\xff";    /* 0x7f-0xff */
    const char *s;
    size_t n = 0;

    if (len >= 16) {
        int found;
        n = findchar_fast(str, len, ranges, sizeof(ranges) - 1,
                          &found);
        if (found)
            return n;
    }

    s = str + n;
    while (s - str < len && uri_a[*s])
        ++s;

    return s - str;
}
Since ranges are used for PCMPESTRI, we have to spend a whole range on a single character like '<' or '^'. Unfortunately, there are not enough ranges available for us, so we have to accept '`' in URIs even though it is not included in the URI character set by the RFC.
Code for CloudFlare's approach looks like:
const __m256i lb = _mm256_set1_epi8(0x1f); /* low bound */
const __m256i hb = _mm256_set1_epi8(0x7f); /* high bound */
const __m256i tab = _mm256_set1_epi8(0x09); /* allow TAB */
/* SPACE <= v */
__m256i low = _mm256_cmpgt_epi8(v, lb);
/* SPACE <= v < 0x7f */
__m256i bounds = _mm256_and_si256(_mm256_cmpgt_epi8(hb, v), low);
/* SPACE <= v < 0x7f || v == TAB */
__m256i r = _mm256_or_si256(_mm256_cmpeq_epi8(tab, v), bounds);
/* Generate bit mask */
*range = ~_mm256_movemask_epi8(r);
I skip the code for the 64- and 128-byte processing as well as the code that merges the functions' results. You can find the full code of the approach here. There are too many instructions to handle this simple character set, so we do only basic verification. The numbers on my Intel Core i7-6500U are:
PCMPESTRI/PicoHTTPParser:
str_len 1: 128ms
str_len 3: 138ms
str_len 10: 161ms
str_len 19: 151ms
str_len 28: 183ms
str_len 107: 218ms
str_len 178: 230ms
str_len 1023: 784ms
str_len 1500: 1069ms
AVX2/CloudFlare:
str_len 1: 171ms
str_len 3: 175ms
str_len 10: 189ms
str_len 19: 174ms
str_len 28: 196ms
str_len 107: 198ms
str_len 178: 203ms
str_len 1023: 375ms
str_len 1500: 458ms
More code is executed in CloudFlare's version, so the single-character case is much slower than PicoHTTPParser. But the AVX2 code shows much more stable performance with increasing string length (there are 9 sizes of processed strings, from 1 to 1500 bytes).
Here are the full results of the benchmark. Each string is processed 5,000,000 times. To get the results I executed the benchmark 5 times on my laptop with all heavy applications stopped (mail, browsers etc). For each benchmark I selected the best numbers to mitigate the impact of external activity by other processes still living in the system. I also used taskset to eliminate rescheduling overhead:
$ for i in `seq 0 4`; do taskset 0x2 ./str_benchmark > ./b.$i; done
Just to show how bad the standard strspn(3) is for checking HTTP character sets, I also wrote benchmarks for the GLIBC assembly version and the Linux kernel's naive C implementation. The Linux kernel doesn't use assembly for strcasecmp() and strspn() since there is no performance-critical string processing in the kernel, but the implementation clearly shows that plain C performs quite poorly for this task. The numbers are:
GLIBC strspn():
str_len 1: 350ms
str_len 3: 354ms
str_len 10: 380ms
str_len 19: 420ms
str_len 28: 398ms
str_len 107: 533ms
str_len 178: 650ms
str_len 1023: 2071ms
str_len 1500: 2856ms
Linux kernel strspn():
str_len 1: 324ms
str_len 3: 641ms
str_len 10: 1865ms
str_len 19: 3565ms
str_len 28: 4522ms
str_len 107: 18851ms
str_len 178: 28575ms
str_len 1023: 187992ms
str_len 1500: 273276ms
So we need a better alternative. It must quickly process short strings as well as long ones, and it must accurately verify the character sets defined by the RFC.
Tempesta FW now implements AVX2 routines for HTTP-specific string processing. They outperform both approaches and provide full accuracy in HTTP character set verification. Here are the numbers from the same benchmark:
Tempesta AVX2 constant URI matching:
str_len 1: 123ms
str_len 3: 127ms
str_len 10: 150ms
str_len 19: 139ms
str_len 28: 156ms
str_len 107: 167ms
str_len 178: 180ms
str_len 1023: 350ms
str_len 1500: 433ms
Now let's have a look at what's under the hood. The entry point of the algorithm is tfw_match_uri_const(). First, the function quickly processes very short strings, up to 4 bytes:
if (likely(len <= 4)) {
    switch (len) {
    case 0:
        return 0;
    case 4:
        c3 = uri_a[s[3]];
    case 3:
        c2 = uri_a[s[2]];
    case 2:
        c1 = uri_a[s[1]];
    case 1:
        c0 = uri_a[s[0]];
    }
    return (c0 & c1) == 0 ? c0 : 2 + (c2 ? c2 + c3 : 0);
}
uri_a is defined as a 256-byte constant array with a byte set for each ASCII character accepted in a URI. Previously we used 4 unsigned longs (256 bits) and determined whether a character is allowed by a set bit:
uri_a[c >> 6] & (1UL << (c & 0x3f))
However, we found that the bit variant does too many operations and a simple table lookup outperforms it. It's worth mentioning that while the 256-byte array occupies 4 cache lines, only 2 of them are frequently used. A small sketch comparing the two lookups follows.
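For illustration, the two lookup variants look like this (a sketch only; the real tables are filled with the URI character set rather than zeros):

/* 256-bit bitmap: 4 words, needs a shift, a mask and a bit test per lookup. */
static const unsigned long uri_bits[4];

/* 256-byte table: one byte per character, a single load per lookup. */
static const unsigned char uri_table[256];

static inline int
uri_allowed_bitmap(unsigned char c)
{
    return !!(uri_bits[c >> 6] & (1UL << (c & 0x3f)));
}

static inline int
uri_allowed_table(unsigned char c)
{
    return uri_table[c];
}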
Since the function must return the exact number of matched symbols, we have to execute a heavyweight condition in the return statement. The condition doesn't allow us to efficiently check more bytes in plain C, for example 8 bytes.
Next we process large strings in the following manner:
for ( ; unlikely(s + 128 <= end); s += 128) {
    n = match_symbols_mask128_c(__C.URI_BM, s);
    if (n < 128)
        return s - (unsigned char *)str + n;
}
if (unlikely(s + 64 <= end)) {
    n = match_symbols_mask64_c(__C.URI_BM, s);
    if (n < 64)
        return s - (unsigned char *)str + n;
    s += 64;
}
if (unlikely(s + 32 <= end)) {
    n = match_symbols_mask32_c(__C.URI_BM, s);
    if (n < 32)
        return s - (unsigned char *)str + n;
    s += 32;
}
if (unlikely(s + 16 <= end)) {
    n = match_symbols_mask16_c(__C.URI_BM128, s);
    if (n < 16)
        return s - (unsigned char *)str + n;
    s += 16;
}
The code processes strings longer than 16 bytes. So we have the gap between 4 and 16 bytes which is processed by following code. The code processes string tail as well as short strings. This is why I use unlikely in the conditions above: branch misprediction is super important for short strings, while long strings aren't so sensitive to several penalties. Tail of the string is processed in the same way as GLIBC generic C implementation does it:
while (s + 4 <= end) {
c0 = uri_a[s[0]];
c1 = uri_a[s[1]];
c2 = uri_a[s[2]];
c3 = uri_a[s[3]];
if (!(c0 & c1 & c2 & c3)) {
n = s - (unsigned char *)str;
return !(c0 & c1) ? n + c0 : n + 2 + (c2 ? c2 + c3 : 0);
}
s += 4;
}
c0 = c1 = c2 = 0;
switch (end - s) {
case 3:
c2 = uri_a[s[2]];
case 2:
c1 = uri_a[s[1]];
case 1:
c0 = uri_a[s[0]];
}
n = s - (unsigned char *)str;
return !(c0 & c1) ? n + c0 : n + 2 + c2;
, just usual loop unrolling.
And now is time for the core of the algorithm. I describe match_symbols_mask32_c() only, all other functions are just straightforward modifications for larger data processing.
const __m256i ARF = _mm256_setr_epi8(
0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,
0, 0, 0, 0, 0, 0, 0, 0,
0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,
0, 0, 0, 0, 0, 0, 0, 0);
const __m256i LSH = _mm256_set1_epi8(0xf);
static size_t
match_symbols_mask32_c(__m256i sm, const char *str)
{
__m256i v = _mm256_lddqu_si256((void *)str);
1: __m256i acbm = _mm256_shuffle_epi8(sm, v);
2: __m256i acols = _mm256_and_si256(LSH, _mm256_srli_epi16(v, 4));
3: __m256i arbits = _mm256_shuffle_epi8(ARF, acols);
4: __m256i sbits = _mm256_and_si256(arbits, acbm);
5: v = _mm256_cmpeq_epi8(sbits, _mm256_setzero_si256());
6: unsigned long r = 0xffffffff00000000UL
| _mm256_movemask_epi8(v);
7: return __tzcnt(r);
}
(I numbered important lines of the code for later description). sm is specially crafted representation of allowed characters set, so it varies depending on what we're parsing, e.g. URI or particular header value. Meantime ARF and LSH are constants identical for all the matchers. For URI we set sm by
sm = _mm256_setr_epi8(
0xb8, 0xfc, 0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c,
0xb8, 0xfc, 0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c);
To understand the constants lets have a look at ASCII table.
There are 16 rows and sm contains two equal series by 16 bytes: each of them describes positions of valid URI characters in ASCII table rows. Note that 'column' and 'row' can be interchanged depending on the ASCI table representation. Hereafter I describe the logic using the table representation above. The first constant 0xb8 is defined by first row of the table. 'p', 'P' and '@' are valid URI charaters, while '`' isn't. So we encode this as 1011 in binary representation, or 0xb. Next 4 bits we define as 0x8 since '0' only form the next 4 ASCII symbols is accepted as URI character. The next constant 0xfc is defined by the second row and so on.
To generate the constants and uri_a, which I mentioned above, I use following simple program:
static const unsigned char uri[] =
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"abcdefghijklmnopqrstuvwxyz"
"0123456789"
"-_.~!*'();:@&=+$,/?%#[]";
unsigned char r[256] = { 0 };
for (int i = 0; i < sizeof(uri)/sizeof(uri[0]) - 1; ++i)
r[A[i]] = 1;
for (int i = 0; i < 256; ++i)
printf("%u,%c", r[i], (i & 0xF) == 0xF ? '\n' : ' ');
printf("\n");
for (int i = 0; i < 16; ++i) {
unsigned char c;
for (c = 0, j = 7; j >= 0; --j)
c = (c << 1) | r[(j << 4) + i];
printf("%#x, ", c);
}
printf("\n");
The algorithm does following steps (see line numbers in the code above, Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2 and Intel Intrinsics Guide are excellent references for the intrinsics):
The next important function, which is one of the hottest spots, is strncasecmp().
GLIBC's __strncasecmp_l_avx() (at least 2.23 version) basically implements following straightforward logic. Firstly, we define constants for 'A' and 'Z' - the characters range to be converted to lower case:
const __m256i A = _mm256_set1_epi8('A' - 1);
const __m256i Z = _mm256_set1_epi8('Z' + 1);
And constant for the case conversion:
const __m256i CASE = _mm256_set1_epi8(0x20);
Next, we load 32 bytes of each input string:
__m256i v0 = _mm256_lddqu_si256((void *)s0);
__m256i v1 = _mm256_lddqu_si256((void *)s1);
And determine which characters of each of them we have to convert to lower case:
__m256i a0 = _mm256_cmpgt_epi8(v0, A);
__m256i a1 = _mm256_cmpgt_epi8(v1, A);
__m256i z0 = _mm256_cmpgt_epi8(Z, v0);
__m256i z1 = _mm256_cmpgt_epi8(Z, v1);
__m256i cmp_r0 = _mm256_and_si256(a0, z0);
__m256i cmp_r1 = _mm256_and_si256(a1, z1);
cmp_r defines which characters (vector items) we have to convert to lower case and now we set 0x20 in the positions getting bit masks converting the input strings to lower case:
__m256i lc0 = _mm256_and_si256(cmp_r0, CASE);
__m256i lc1 = _mm256_and_si256(cmp_r1, CASE);
The bit masks are used for case conversion by OR operator and finally we can compare both the strings and return zero value if all the characters match or non-zero value otherwise. Note that HTTP parser requires only boolean return value whether the whole strings match or not.
__m256i vl0 = _mm256_or_si256(v0, lc0);
__m256i vl1 = _mm256_or_si256(v1, lc1);
__m256i eq = _mm256_cmpeq_epi8(vl0, vl1);
return ~_mm256_movemask_epi8(eq);
Actually GLIBC version does the stuff using 16-byte vectors and in fact it's slower than the AVX2 implementation above. Note that we don't care about zero byte in the strings, but rather require strings of equal size. Technically it can be done by defining 3rd argument of the function as min(s1.length, s2.length): in all the cases for HTTP parser we know string lengths since the parser doesn't work with zero-terminated C strings.
GLIBC strncasecmp():
str_len 1: 133ms
str_len 3: 144ms
str_len 10: 143ms
str_len 19: 163ms
str_len 28: 168ms
str_len 107: 213ms
str_len 178: 253ms
str_len 1023: 861ms
str_len 1500: 1167ms
AVX2 strncasecmp():
str_len 1: 127ms
str_len 3: 131ms
str_len 10: 178ms
str_len 19: 206ms
str_len 28: 235ms
str_len 107: 199ms
str_len 178: 254ms
str_len 1023: 558ms
str_len 1500: 673ms
I also used very similar optimizations for short stings in plain C as in strspn()-like case above.
Actually we don't need to convert both the strings to lower case. Instead we can do XOR on the strings (i.e. compute the strings "difference") and determine whether the difference is exactly in case:
__m256i xor = _mm256_xor_si256(v0, v1);
__m256i lc = _mm256_cmpeq_epi8(xor, CASE);
lc stores positions where the stings differ in case only, i.e. 0x20. However, for example '-' (0x2d) also differs from 'M' (0x4d) for exactly 0x20 and lc also stores the position. To know which positions are in the interest we determine which characters of first string is in ['a', 'z'] range, and we do this for one string only:
__m256i a = _mm256_set1_epi8('a' - 0x80);
__m256i D = _mm256_set1_epi8('z' - 'a' + 1 - 0x80);
__m256i vl0 = _mm256_or_si256(v0, CASE);
__m256i sub = _mm256_sub_epi8(vl0, a);
__m256i cmp_r = _mm256_cmpgt_epi8(D, sub);
Here I use 2 tricks from Hacker's Delight. Computing 'a' <= v <= 'z' requires 3 operations since it must be coded as v >= 'a' && v <= 'z'. So Hacker's Delight proposes to replace the expression by v - 'a' < 'z' - 'a' + 1, which is just 2 operations since 'z' - 'a' + 1 is computed at compile time. However, we must use unsigned version of < operator here to be able to employ integer overflow for correct comparison. Meantime x86-64 provides only signed versions of the instruction, so we must use the 2nd trick. The trick is that we can replace unsigned version of the operator by signed operator using 0x80 subtraction from both the arguments. So our expression becomes 'a' - 0x80 < 'z' - 'a' + 1 - 0x80.
Next, we intersect cmp_r with lc and intersect the result with CASE to determine virtual (good) result of XOR over the two strings if they are different in case only:
__m256i good = _mm256_and_si256(lc, cmp_r);
__m256i good_xor = _mm256_and_si256(good, CASE);
Since we have result of actual XOR over the strings we can compare it with the virtual (good) XOR result: if they match, then the strings are equal.
__m256i match = _mm256_xor_si256(good_xor, xor);
match = _mm256_cmpeq_epi8(match, _mm256_setzero_si256());
return ~_mm256_movemask_epi8(match);
This time I used vector instructions to process the string tails. To do so I need at least 8 bytes, so I used large switch at begin of the function to be sure that following vector code has at least 8 byte arguments:
switch (len) {
case 0:
return 0;
case 8:
c |= lct[s1[7]] ^ lct[s2[7]];
case 7:
c |= lct[s1[6]] ^ lct[s2[6]];
case 6:
c |= lct[s1[5]] ^ lct[s2[5]];
case 5:
c |= lct[s1[4]] ^ lct[s2[4]];
case 4:
c |= lct[s1[3]] ^ lct[s2[3]];
case 3:
c |= lct[s1[2]] ^ lct[s2[2]];
case 2:
c |= lct[s1[1]] ^ lct[s2[1]];
case 1:
c |= lct[s1[0]] ^ lct[s2[0]];
return c;
}
The switch is cheap since we don't need to calculate complex conditional expression as previously. lct is 256-byte static constant table used for case lowering. Note that GLIBC's tolower(3) is actually slow since it requires far call of __ctype_tolower_loc(). So I used lct table wherever possible.
String tails processing is performed by __stricmp_avx2_tail(), which basically employ the same logic. The function accepts stings shorter than 32 bytes and longer than 8 bytes. If the strings tail after vector processing is shorter than 8 bytes, then we simply move backward to get necessary 8 bytes:
if (len < 8) {
i -= 8 - len;
len = 8;
}
return __stricmp_avx2_tail(s1 + i, s2 + i, c);
If the strings have at least 16 bytes, then __stricmp_avx2_tail() executes the same code as above, but using 16 byte registers. The code always leaves 16 bytes of data:
if (len >= 16) {
// do the vector processing using 16 byte registers
if (len == 16 || r)
return r;
s1 += len - 16;
s2 += len - 16;
len = 16;
}
So now we have at least 8 bytes and not more than 16 bytes. But we still use 16 byte vector processing. To do so we need to load the data, with some overlapping and we do this in this way:
v0 = _mm_loadh_pd(v0, (double *)s1);
v1 = _mm_loadh_pd(v1, (double *)s2);
v0 = _mm_loadl_pd(v0, (double *)(s1 + len - 8));
v1 = _mm_loadl_pd(v1, (double *)(s2 + len - 8));
That's not a problem to process some piece of data twice. And now we can execute exactly the same code with 16 byte instructions as above.
Lets have a look how fast the code is:
AVX2/64bit strncasecmp():
str_len 1: 121ms
str_len 3: 132ms
str_len 10: 166ms
str_len 19: 194ms
str_len 28: 227ms
str_len 107: 189ms
str_len 178: 236ms
str_len 1023: 463ms
str_len 1500: 588ms
And we can run ever faster if we know that one particular argument of strcasecmp() is always in lower case. For example if you compare input data with static strings, then it's trivial to define the static strings in lower case and pass them always by the second argument. There is no sense to use XOR approach for the case and I use simple range calculation using both the Hacker's Delight tricks as above:
__m256i sub = _mm256_sub_epi8(v0, A);
__m256i cmp_r = _mm256_cmpgt_epi8(D, sub);
__m256i lc = _mm256_and_si256(cmp_r, CASE);
__m256i vl = _mm256_or_si256(v0, lc);
__m256i eq = _mm256_cmpeq_epi8(vl, v1);
return ~_mm256_movemask_epi8(eq);
Surely the code outperforms all of the approaches described before:
AVX2/64bit strncasecmp(), one string case conversion:
str_len 1: 126ms
str_len 3: 129ms
str_len 10: 129ms
str_len 19: 133ms
str_len 28: 136ms
str_len 107: 154ms
str_len 178: 179ms
str_len 1023: 310ms
str_len 1500: 376ms
Tempesta FW is Linux kernel project and using FPU in kernel is not trivial. To do so you must call kernel_fpu_begin() and kernel_fpu_end(), which save the contents of the registers if user mode processes use the FPU. So the using FPU in kernel mode isn't cheap. Tempesta FW processes HTTP in softirq context, just as soon as the packet arrives to NIC. Thus, to mitigate the overhead we made special FPU safe wrapper __tempesta_do_softirq_fpusafe():
void
__tempesta_do_softirq_fpusafe(void)
{
/*
* Switch FPU context once per budget packets to let Tempesta
* run many vector operations w/o costly FPU switches.
* Eager FPU must be enabled.
*/
kernel_fpu_begin();
__do_softirq();
kernel_fpu_end();
}
, which is called from do_softirq_own_stack() assembly:
#ifdef CONFIG_SECURITY_TEMPESTA
call __tempesta_do_softirq_fpusafe
#else
call __do_softirq
#endif
So now we do only one FPU context store per softirq shot which can process many packets at once.
I finished my previous post with a bottleneck in long string parsing in our HTTP parser; the issue details can be found on GitHub. In particular, there are two bottlenecks: strspn()-like functions searching for delimiters, and strcasecmp(), which is used in many places.
What Makes HTTP Strings Processing Special
There are a few important properties of strings in HTTP messages that make their processing special:
- An HTTP message is a sequence of strings separated by special delimiters (e.g. ';' or ','). The most important delimiter is CRLF ("\r\n"). While RFC 7230 defines the delimiter as exactly CRLF, it recommends accepting a single LF as the delimiter as well. So there can be many different delimiters, and one of them has variable length. There are no '\0'-terminated strings; all the strings are just parts of contiguous network packets.
- The RFC defines strict rules for which character sets may appear in a particular HTTP request line, URI, status line, or header field name and value. The number of sets is relatively small.
- The character sets are known at compile time, so checking them with the standard strspn(3) is a bad idea: it spends a lot of cycles building the accept set on every call (see the sketch right after this list).
- Tempesta FW is a true zero-copy server, so a string cannot exceed 1500 bytes (one Ethernet frame) in any single string processing call; longer strings are processed in chunks. URI, Cookie, User-Agent and non-standard headers can easily reach tens of kilobytes in size.
- While long URIs are frequent, the most frequent URI still seems to be the single-byte '/'. Many HTTP flood DDoS attacks still use this short URI.
- Some strcasecmp(3) calls can have one of the arguments always in lower case. For example, if you compare an ingress string against the "Cookie:" pattern to find the Cookie header, you can use "cookie:" instead, so strcasecmp() can avoid case conversion for one of the arguments.
- In many cases we need only a boolean result from strncasecmp(): whether the strings match or not; we don't need to know which string is lexicographically greater.
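For reference, a minimal sketch of the straightforward libc approach that this post argues against (my illustration; match_uri_strspn is a hypothetical helper, and the accept set is the URI alphabet used later for the benchmark):
#include <string.h>

/*
 * Baseline check of a URI chunk with strspn(3): returns how many leading
 * bytes belong to the accept set. Besides rebuilding the accept set on
 * every call, strspn() also needs a '\0'-terminated buffer, which strings
 * inside network packets are not.
 */
static size_t
match_uri_strspn(const char *s)
{
	static const char uri_accept[] =
		"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
		"abcdefghijklmnopqrstuvwxyz"
		"0123456789"
		"-_.~!*'();:@&=+$,/?%#[]";

	return strspn(s, uri_accept);
}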
How HTTP Servers Process Strings
All mature HTTP parsers use an FSM (finite state machine) to process messages. The different approaches, with benchmarks, are covered in my earlier article. However, parsers differ in how deeply they analyze a message. For example, suppose we have the following HTTP message:
GET / HTTP/1.0\r\n
Host: example.net\r\n
Cookie: session=42; theme="dark"\r\n
When a parser reads the Cookie header, it has the following choices:
- Just read the header as opaque data, i.e. run memchr(buf, '\n', size) over the data and put the header into some string array. This is the fastest way, but if the server works with cookies, the cookie processing logic must scan the whole array for the Cookie header and then parse it. So you read the data at least twice, which is not fast at all. Thus the approach usually works only for simple HTTP proxies that don't care what data they transfer;
- A better way is to analyze the header name, i.e. parse at least the "Cookie:" string and store the rest of the string as opaque data. This time we know exactly whether we have a Cookie header and where to find it. Moreover, we can parse the header value faster since we know where it begins. The drawback of this approach is that we do not verify the header syntax, so we can pass an incorrect or even malicious header value to a vulnerable application;
- And the last option is to fully execute an FSM driven by the RFC grammar:
cookie-string = cookie-pair *( ";" SP cookie-pair )
cookie-pair = cookie-name "=" cookie-value
cookie-name = token
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
token = 1*<any CHAR except CTLs or separators>
The rule set is relatively trivial, but strict verification of the rules can be an issue if you must do it really quickly; a literal transcription of the cookie-octet rule is sketched below.
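For illustration, a direct transcription of the cookie-octet rule into C might look like this (my example; Tempesta FW itself uses the table- and AVX2-based matchers described later):
/* cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E */
static int
is_cookie_octet(unsigned char c)
{
	return c == 0x21 || (c >= 0x23 && c <= 0x2b)
	       || (c >= 0x2d && c <= 0x3a)
	       || (c >= 0x3c && c <= 0x5b)
	       || (c >= 0x5d && c <= 0x7e);
}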
While Nginx accurately parses out all URI parts (see the sw_check_uri and sw_uri states in ngx_http_parse_request_line(), src/http/ngx_http_parse.c), PicoHTTPParser just checks the URI alphabet as 0x20 (Space) < ch < 0x7f (DEL). Various delimiters and special characters (e.g. '"' or '\') are allowed by PicoHTTPParser, while they should have been filtered out.
Nginx uses an old-school loop & switch based approach like:
for (p = b->pos; p < b->last; p++) {
	ch = *p;
	switch (state) {
	case sw_start:
		// ....
	case sw_foo:
		// .....
	}
The FSM is obvious and easy to program, but it's quite slow. See my article about high-performance HTTP parsers and Kazuho Oku's presentation, slides 31-33, for an explanation. So PicoHTTPParser uses the SSE4.2 instruction PCMPESTRI. The instruction can match a 16-byte string against a set of characters, ranges, or another string. Since this is a hardware-implemented string matcher, it works much faster than the naive loop & switch based FSM. However, you're very limited in what you can match: at most 8 ranges or 16 characters, and you can't mix range matching with character matching (i.e. you cannot match a condition like 0x0 < ch < 0x20 && ch == '"'). The pity is that the character sets for URI and most HTTP headers exceed these limits (the sets can have about 10 ranges), so using the instruction as an HTTP string matcher implies weak content checking. If you're going to use the instruction, which is somewhat tricky, you'll probably find Andi Kleen's calculator very useful.
While PicoHTTPParser is very fast, Vlad Krasnov from CloudFlare went further and replaced the PCMPESTRI instruction with AVX2 code. The code basically checks the range (ch >= 0x20 || ch == '\t') && (ch < 0x7f). While PicoHTTPParser uses only one instruction to do the string matching, CloudFlare's version executes much more code: AVX2 doesn't have a string processing instruction, so there is a separate instruction for each comparison and logical operator. However, it runs much faster because it can eat 32 bytes per step. Moreover, Vlad also did loop unrolling, so that 128 bytes are eaten at a time.
I wrote a simple benchmark to study both approaches; you can find the whole code on GitHub. The benchmark is focused on URI processing, so the PicoHTTPParser approach looks like:
static const size_t
findchar_fast(const char *str, size_t len, const char *ranges,
size_t ranges_sz, int *found)
{
__m128i ranges16 = _mm_loadu_si128((const __m128i *)ranges);
const char *s = str;
size_t left = len & ~0xf;
*found = 0;
do {
__m128i b16 = _mm_loadu_si128((void *)s);
int r = _mm_cmpestri(ranges16, ranges_sz, b16, 16,
_SIDD_LEAST_SIGNIFICANT
| _SIDD_CMP_RANGES
| _SIDD_UBYTE_OPS);
if (r != 16) {
*found = 1;
return s - str + r;
}
s += 16;
left -= 16;
} while (left);
return s - str;
}
size_t
picohttpparser_findchar_fast(const char *str, size_t len)
{
static const unsigned char ranges[] __attribute__((aligned(16))) =
"\x00 " /* control chars and up to SP */
"\"\"" /* 0x22 */
"<<" /* 0x3c,0x3c */
">>" /* 0x3e,0x3e */
"\\\\" /* 0x5c,0x5c */
"^^" /* 0x5e,0x5e */
"{}" /* 0x7b-0x7d */
"\x7f\xff"; /* 0x7f-0xff */
const char *s;
size_t n = 0;
if (len >= 16) {
int found;
n = findchar_fast(str, len, ranges, sizeof(ranges) - 1,
&found);
if (found)
return n;
}
s = str + n;
while (s - str < len && uri_a[*s])
++s;
return s - str;
}
Since ranges are used for PCMPESTRI, we have to spend a range on a single character like '<' or '^'. Unfortunately, there are not enough ranges available, so we have to allow '`' in a URI even though the RFC does not include it in the URI character set.
Code for CloudFlare's approach looks like:
const __m256i lb = _mm256_set1_epi8(0x1f); /* low bound */
const __m256i hb = _mm256_set1_epi8(0x7f); /* high bound */
const __m256i tab = _mm256_set1_epi8(0x09); /* allow TAB */
/* SPACE <= v */
__m256i low = _mm256_cmpgt_epi8(v, lb);
/* SPACE <= v < 0x7f */
__m256i bounds = _mm256_and_si256(_mm256_cmpgt_epi8(hb, v), low);
/* SPACE <= v < 0x7f || v == TAB */
__m256i r = _mm256_or_si256(_mm256_cmpeq_epi8(tab, v), bounds);
/* Generate bit mask */
*range = ~_mm256_movemask_epi8(r);
I skip the code for 64- and 128-byte processing as well as the code that merges the functions' results. You can find the full code of the approach here. There are too many instructions to handle even this simple character set, so only basic verification is done. The numbers on my Intel Core i7-6500U are:
PCMPESTRI/PicoHTTPParser:
str_len 1: 128ms
str_len 3: 138ms
str_len 10: 161ms
str_len 19: 151ms
str_len 28: 183ms
str_len 107: 218ms
str_len 178: 230ms
str_len 1023: 784ms
str_len 1500: 1069ms
AVX2/CloudFlare:
str_len 1: 171ms
str_len 3: 175ms
str_len 10: 189ms
str_len 19: 174ms
str_len 28: 196ms
str_len 107: 198ms
str_len 178: 203ms
str_len 1023: 375ms
str_len 1500: 458ms
More code is executed in CloudFlare's version, so the single-character case is much slower than PicoHTTPParser. But the AVX2 code shows much more stable performance as the string length grows; there are 9 sizes of processed strings from 1 to 1500 bytes.
The full results of the benchmark are available. Each string is processed 5,000,000 times. To get the results I executed the benchmark 5 times on my laptop with all heavy applications stopped (mail, browsers etc.). For each benchmark I selected the best numbers to mitigate the impact of external activity from other processes still living in the system. I also used taskset to eliminate rescheduling overhead:
$ for i in `seq 0 4`; do taskset 0x2 ./str_benchmark > ./b.$i; done
Just to show how bad the standard strspn(3) is for checking HTTP character sets, I also wrote benchmarks for the GLIBC assembly version and the Linux kernel's naive C implementation. The Linux kernel doesn't use assembly for strcasecmp() and strspn() since there is no performance-critical string processing in the kernel. But the implementation clearly shows that plain C performs quite poorly at this task. So the numbers are:
GLIBC strspn():
str_len 1: 350ms
str_len 3: 354ms
str_len 10: 380ms
str_len 19: 420ms
str_len 28: 398ms
str_len 107: 533ms
str_len 178: 650ms
str_len 1023: 2071ms
str_len 1500: 2856ms
Linux kernel strspn():
str_len 1: 324ms
str_len 3: 641ms
str_len 10: 1865ms
str_len 19: 3565ms
str_len 28: 4522ms
str_len 107: 18851ms
str_len 178: 28575ms
str_len 1023: 187992ms
str_len 1500: 273276ms
Even Faster and More Accurate
So we need a better alternative. It must process short strings as quickly as long ones, and it must accurately verify the character sets defined by the RFC.
Now Tempesta FW implements AVX2 routines for HTTP-specific string processing. It outperforms both approaches and provides absolute accuracy in HTTP character set verification. Here are the numbers from the same benchmark:
Tempesta AVX2 constant URI matching:
str_len 1: 123ms
str_len 3: 127ms
str_len 10: 150ms
str_len 19: 139ms
str_len 28: 156ms
str_len 107: 167ms
str_len 178: 180ms
str_len 1023: 350ms
str_len 1500: 433ms
Now let's have a look at what's under the hood. The entry point of the algorithm is tfw_match_uri_const(). First, the function quickly processes very short strings of up to 4 bytes:
if (likely(len <= 4)) {
	switch (len) {
	case 0:
		return 0;
	case 4:
		c3 = uri_a[s[3]];
	case 3:
		c2 = uri_a[s[2]];
	case 2:
		c1 = uri_a[s[1]];
	case 1:
		c0 = uri_a[s[0]];
	}
	return (c0 & c1) == 0 ? c0 : 2 + (c2 ? c2 + c3 : 0);
}
uri_a is defined as a 256-byte constant array with bytes set for the ASCII characters accepted in a URI. Previously we used 4 unsigned longs (256 bits) and determined whether a character is allowed by a set bit:
uri_a[c >> 6] & (1UL << (c & 0x3f))
However, we found that the bit check takes too many operations, and a simple table lookup outperforms it. It is worth mentioning that while the 256-byte array occupies 4 cache lines, only 2 of them are frequently used.
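Just to make the comparison concrete, here are the two per-character checks side by side in plain C (a sketch with hypothetical names uri_bits and uri_tbl; the table contents are elided):
/* Old check: a 256-bit bitmap packed into 4 unsigned longs:
 * an index computation, a shift, a mask and an AND per character. */
static const unsigned long uri_bits[4] = { 0 /* bits of accepted characters */ };

static int
uri_allowed_bitmap(unsigned char c)
{
	return !!(uri_bits[c >> 6] & (1UL << (c & 0x3f)));
}

/* New check: a 256-byte table, a single dependent load per character. */
static const unsigned char uri_tbl[256] = { 0 /* 1 for accepted characters */ };

static int
uri_allowed_table(unsigned char c)
{
	return uri_tbl[c];
}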
Since the function must return the exact number of matched symbols, we have to execute a heavyweight condition in the return statement. The condition doesn't allow us to efficiently check more bytes in plain C, for example 8 bytes.
Next we process large strings in the following manner:
for ( ; unlikely(s + 128 <= end); s += 128) {
	n = match_symbols_mask128_c(__C.URI_BM, s);
	if (n < 128)
		return s - (unsigned char *)str + n;
}
if (unlikely(s + 64 <= end)) {
	n = match_symbols_mask64_c(__C.URI_BM, s);
	if (n < 64)
		return s - (unsigned char *)str + n;
	s += 64;
}
if (unlikely(s + 32 <= end)) {
	n = match_symbols_mask32_c(__C.URI_BM, s);
	if (n < 32)
		return s - (unsigned char *)str + n;
	s += 32;
}
if (unlikely(s + 16 <= end)) {
	n = match_symbols_mask16_c(__C.URI_BM128, s);
	if (n < 16)
		return s - (unsigned char *)str + n;
	s += 16;
}
The code above processes strings of 16 bytes and longer, so we have a gap between 4 and 16 bytes, which is handled by the following code. It processes the string tail as well as short strings. This is why I use unlikely in the conditions above: branch misprediction matters a lot for short strings, while long strings aren't so sensitive to a few penalties. The tail of the string is processed in the same way as the GLIBC generic C implementation does it:
while (s + 4 <= end) {
	c0 = uri_a[s[0]];
	c1 = uri_a[s[1]];
	c2 = uri_a[s[2]];
	c3 = uri_a[s[3]];
	if (!(c0 & c1 & c2 & c3)) {
		n = s - (unsigned char *)str;
		return !(c0 & c1) ? n + c0 : n + 2 + (c2 ? c2 + c3 : 0);
	}
	s += 4;
}
c0 = c1 = c2 = 0;
switch (end - s) {
case 3:
	c2 = uri_a[s[2]];
case 2:
	c1 = uri_a[s[1]];
case 1:
	c0 = uri_a[s[0]];
}
n = s - (unsigned char *)str;
return !(c0 & c1) ? n + c0 : n + 2 + c2;
This is just the usual loop unrolling.
And now it's time for the core of the algorithm. I describe match_symbols_mask32_c() only; all the other functions are just straightforward modifications for larger data sizes.
const __m256i ARF = _mm256_setr_epi8(
0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,
0, 0, 0, 0, 0, 0, 0, 0,
0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,
0, 0, 0, 0, 0, 0, 0, 0);
const __m256i LSH = _mm256_set1_epi8(0xf);
static size_t
match_symbols_mask32_c(__m256i sm, const char *str)
{
__m256i v = _mm256_lddqu_si256((void *)str);
1: __m256i acbm = _mm256_shuffle_epi8(sm, v);
2: __m256i acols = _mm256_and_si256(LSH, _mm256_srli_epi16(v, 4));
3: __m256i arbits = _mm256_shuffle_epi8(ARF, acols);
4: __m256i sbits = _mm256_and_si256(arbits, acbm);
5: v = _mm256_cmpeq_epi8(sbits, _mm256_setzero_si256());
6: unsigned long r = 0xffffffff00000000UL
| _mm256_movemask_epi8(v);
7: return __tzcnt(r);
}
(I numbered the important lines of the code for the description below.) sm is a specially crafted representation of the allowed character set, so it varies depending on what we're parsing, e.g. a URI or a particular header value. Meanwhile, ARF and LSH are constants identical for all the matchers. For URI we set sm as follows:
sm = _mm256_setr_epi8(
0xb8, 0xfc, 0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c,
0xb8, 0xfc, 0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c);
To understand the constants, let's have a look at the ASCII table.
There are 16 rows, and sm contains two equal series of 16 bytes: each of them describes the positions of valid URI characters within the ASCII table rows. Note that 'column' and 'row' can be interchanged depending on the ASCII table representation; hereafter I describe the logic using the table representation above. The first constant, 0xb8, is defined by the first row of the table: 'p', 'P' and '@' are valid URI characters, while '`' isn't, so we encode this as 1011 in binary, or 0xb. The next 4 bits we define as 0x8, since only '0' from the next 4 ASCII symbols is accepted as a URI character. The next constant, 0xfc, is defined by the second row, and so on.
To generate the constants and uri_a, which I mentioned above, I use the following simple program:
static const unsigned char uri[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	"abcdefghijklmnopqrstuvwxyz"
	"0123456789"
	"-_.~!*'();:@&=+$,/?%#[]";
unsigned char r[256] = { 0 };

for (int i = 0; i < sizeof(uri)/sizeof(uri[0]) - 1; ++i)
	r[uri[i]] = 1;

/* Print the uri_a table: 1 for accepted characters, 0 otherwise. */
for (int i = 0; i < 256; ++i)
	printf("%u,%c", r[i], (i & 0xF) == 0xF ? '\n' : ' ');
printf("\n");

/* Print the 16 bitmap bytes of the sm constant. */
for (int i = 0; i < 16; ++i) {
	unsigned char c = 0;
	for (int j = 7; j >= 0; --j)
		c = (c << 1) | r[(j << 4) + i];
	printf("%#x, ", c);
}
printf("\n");
The algorithm does the following steps (see the line numbers in the code above; the Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2 and the Intel Intrinsics Guide are excellent references for the intrinsics; a scalar model of the per-byte check follows the list):
- Shuffle the sm bytes using the input data as the shuffle control mask. For example, if 'p' (essentially 0x70) is the first character of the input, then 0 (its 4 least significant bits, defining the character's row in the ASCII table) selects the first sm constant, 0xb8, for that position. In other words, this step says that the first input character belongs to the first row of the ASCII table and places the bitmap of characters allowed in that row at index 0.
- Next, we build an array of ASCII table columns corresponding to the input data. To determine the column of our character we just do 'p' >> 4. However, the minimal shift unit is a word (2 bytes), so we have to apply the LSH (0xf) mask to clear the bits shifted in from the most significant byte of each word.
- ARF is a bit mask defining in which column of an ASCII row a character is placed: byte j of ARF holds the single bit 1 << j, so the column index from the previous step is translated into a column bit. Bytes >= 0x80 map to columns 8..15, for which ARF is zero, so non-ASCII input is rejected. At this step we arrange the column bits according to the input data.
- Now we have two arrays: bitmaps of allowed characters per ASCII row, and the column bit for each particular character. At this line of code we intersect both arrays with AND, determining whether each position holds an allowed character.
- The previous step leaves a non-zero byte in the vector for each allowed character. The comparison with zero turns each byte into 0x00 for allowed characters and 0xff for disallowed ones.
- Next, we aggregate the most significant bits of all bytes of the vector into a 32-bit integer. We store the result in a 64-bit integer with the upper 32 bits set, so that the next step returns at most 32 even when all input bytes match.
- Finally, we count the trailing zero bits, which is exactly the number of leading input bytes that match the URI character set.
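To make the vector steps concrete, here is a scalar model of the same check for a single byte (my illustration, not code from Tempesta FW; sm16 is the 16-byte half of the sm constant shown above):
/* One 16-byte half of the URI sm constant: per ASCII row, a bitmap of accepted columns. */
static const unsigned char sm16[16] = {
	0xb8, 0xfc, 0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
	0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c
};

static int
uri_char_allowed(unsigned char c)
{
	if (c >= 0x80)	/* non-ASCII bytes get a zero ARF bit, i.e. are rejected */
		return 0;
	/* Row bitmap selected by the low nibble, column bit by the high nibble. */
	return sm16[c & 0x0f] & (1u << (c >> 4));
}
E.g. for 'p' (0x70) the low nibble 0 selects 0xb8 and the high nibble 7 selects bit 0x80, which is set, so 'p' is accepted; for '`' (0x60) bit 0x40 of 0xb8 is clear, so it is rejected.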
strcasecmp()
The next important function, which is one of the hottest spots, is strncasecmp().
GLIBC's __strncasecmp_l_avx() (at least in version 2.23) basically implements the following straightforward logic. First, we define constants for 'A' and 'Z', the character range to be converted to lower case:
const __m256i A = _mm256_set1_epi8('A' - 1);
const __m256i Z = _mm256_set1_epi8('Z' + 1);
And a constant for the case conversion:
const __m256i CASE = _mm256_set1_epi8(0x20);
Next, we load 32 bytes of each input string:
__m256i v0 = _mm256_lddqu_si256((void *)s0);
__m256i v1 = _mm256_lddqu_si256((void *)s1);
And determine which characters of each of them we have to convert to lower case:
__m256i a0 = _mm256_cmpgt_epi8(v0, A);
__m256i a1 = _mm256_cmpgt_epi8(v1, A);
__m256i z0 = _mm256_cmpgt_epi8(Z, v0);
__m256i z1 = _mm256_cmpgt_epi8(Z, v1);
__m256i cmp_r0 = _mm256_and_si256(a0, z0);
__m256i cmp_r1 = _mm256_and_si256(a1, z1);
cmp_r defines which characters (vector items) we have to convert to lower case; now we set 0x20 at those positions, getting the masks that convert the input strings to lower case:
__m256i lc0 = _mm256_and_si256(cmp_r0, CASE);
__m256i lc1 = _mm256_and_si256(cmp_r1, CASE);
The masks are applied for case conversion with the OR operator, and finally we can compare both strings, returning zero if all the characters match and a non-zero value otherwise. Note that the HTTP parser requires only a boolean return value: whether the whole strings match or not.
__m256i vl0 = _mm256_or_si256(v0, lc0);
__m256i vl1 = _mm256_or_si256(v1, lc1);
__m256i eq = _mm256_cmpeq_epi8(vl0, vl1);
return ~_mm256_movemask_epi8(eq);
Actually, the GLIBC version does this using 16-byte vectors, and in fact it's slower than the AVX2 implementation above. Note that we don't care about a zero byte in the strings, but rather require strings of equal size. Technically that can be done by passing min(s1.length, s2.length) as the 3rd argument of the function: in all cases the HTTP parser knows the string lengths, since it doesn't work with zero-terminated C strings.
GLIBC strncasecmp():
str_len 1: 133ms
str_len 3: 144ms
str_len 10: 143ms
str_len 19: 163ms
str_len 28: 168ms
str_len 107: 213ms
str_len 178: 253ms
str_len 1023: 861ms
str_len 1500: 1167ms
AVX2 strncasecmp():
str_len 1: 127ms
str_len 3: 131ms
str_len 10: 178ms
str_len 19: 206ms
str_len 28: 235ms
str_len 107: 199ms
str_len 178: 254ms
str_len 1023: 558ms
str_len 1500: 673ms
I also applied very similar plain C optimizations for short strings as in the strspn()-like case above.
Actually, we don't need to convert both strings to lower case. Instead we can XOR the strings (i.e. compute the strings' "difference") and determine whether the difference is exactly the case bit:
__m256i xor = _mm256_xor_si256(v0, v1);
__m256i lc = _mm256_cmpeq_epi8(xor, CASE);
lc stores the positions where the strings differ by exactly 0x20, i.e. possibly in case only. However, for example, '-' (0x2d) also differs from 'M' (0x4d) by exactly 0x20, and lc stores that position as well. To know which positions are of interest, we determine which characters of the first string are in the ['a', 'z'] range, and we do this for one string only:
__m256i a = _mm256_set1_epi8('a' - 0x80);
__m256i D = _mm256_set1_epi8('z' - 'a' + 1 - 0x80);
__m256i vl0 = _mm256_or_si256(v0, CASE);
__m256i sub = _mm256_sub_epi8(vl0, a);
__m256i cmp_r = _mm256_cmpgt_epi8(D, sub);
Here I use 2 tricks from Hacker's Delight. Computing 'a' <= v <= 'z' requires 3 operations, since it must be coded as v >= 'a' && v <= 'z'. So Hacker's Delight proposes replacing the expression with v - 'a' < 'z' - 'a' + 1, which is just 2 operations, since 'z' - 'a' + 1 is computed at compile time. However, we must use the unsigned version of the < operator here, exploiting integer wrap-around for a correct comparison. Meanwhile, x86-64 provides only signed versions of the byte comparison instruction, so we must use the 2nd trick: an unsigned comparison can be replaced with a signed one by subtracting 0x80 from both sides. That's why the constants in the code above are 'a' - 0x80 and 'z' - 'a' + 1 - 0x80.
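A small scalar sketch of the two tricks, assuming two's complement byte wrap-around as on x86-64 (my illustration; the function names are made up):
#include <stdio.h>

/* Trick 1: 'a' <= v <= 'z' as a single unsigned comparison. */
static int
in_lower_range(unsigned char v)
{
	return (unsigned char)(v - 'a') < 'z' - 'a' + 1;
}

/*
 * Trick 2: the same check with a signed comparison, as the AVX2 code must
 * do since _mm256_cmpgt_epi8() compares signed bytes: bias both sides by 0x80.
 */
static int
in_lower_range_signed(unsigned char v)
{
	signed char sub = (signed char)(v - ('a' - 0x80));
	signed char lim = 'z' - 'a' + 1 - 0x80;

	return sub < lim;
}

int
main(void)
{
	/* Both variants must agree for every byte value. */
	for (int c = 0; c < 256; ++c)
		if (in_lower_range(c) != in_lower_range_signed(c))
			printf("mismatch at %#x\n", c);
	return 0;
}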
Next, we intersect cmp_r with lc, and intersect the result with CASE, to determine the virtual ("good") result of XOR over the two strings if they differ in case only:
__m256i good = _mm256_and_si256(lc, cmp_r);
__m256i good_xor = _mm256_and_si256(good, CASE);
Since we have the result of the actual XOR over the strings, we can compare it with the virtual ("good") XOR result: if they match, the strings are equal.
__m256i match = _mm256_xor_si256(good_xor, xor);
match = _mm256_cmpeq_epi8(match, _mm256_setzero_si256());
return ~_mm256_movemask_epi8(match);
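A scalar model of this per-byte logic may make the trick clearer (my illustration, not the actual Tempesta FW code):
/* Case-insensitive comparison of two bytes via the XOR "difference". */
static int
bytes_equal_nocase(unsigned char b0, unsigned char b1)
{
	unsigned char x = b0 ^ b1;			/* actual difference */
	int case_only = (x == 0x20);			/* differ by the case bit only? */
	int alpha = (unsigned char)((b0 | 0x20) - 'a') < 26;	/* is b0 a letter? */
	unsigned char good = (case_only && alpha) ? 0x20 : 0;	/* permitted difference */

	return (x ^ good) == 0;				/* equal up to case */
}
So 'A' and 'a' compare equal, while '-' and 'M', which also differ by exactly 0x20, do not.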
This time I used vector instructions to process the string tails as well. To do so I need at least 8 bytes, so I placed a large switch at the beginning of the function to be sure that the following vector code gets arguments of at least 8 bytes:
switch (len) {
case 0:
	return 0;
case 8:
	c |= lct[s1[7]] ^ lct[s2[7]];
case 7:
	c |= lct[s1[6]] ^ lct[s2[6]];
case 6:
	c |= lct[s1[5]] ^ lct[s2[5]];
case 5:
	c |= lct[s1[4]] ^ lct[s2[4]];
case 4:
	c |= lct[s1[3]] ^ lct[s2[3]];
case 3:
	c |= lct[s1[2]] ^ lct[s2[2]];
case 2:
	c |= lct[s1[1]] ^ lct[s2[1]];
case 1:
	c |= lct[s1[0]] ^ lct[s2[0]];
	return c;
}
The switch is cheap since we don't need to compute a complex conditional expression as before. lct is a 256-byte static constant table used for case lowering. Note that GLIBC's tolower(3) is actually slow, since it requires a far call to __ctype_tolower_loc(), so I used the lct table wherever possible.
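For illustration, such a table can be generated by a trivial program like the one below (my sketch; the actual lct definition in Tempesta FW may be laid out differently):
#include <stdio.h>

/* Emit a 256-byte case-lowering table: identity except 'A'..'Z' -> 'a'..'z'. */
int
main(void)
{
	for (int i = 0; i < 256; ++i) {
		unsigned char c = (i >= 'A' && i <= 'Z') ? i + 0x20 : i;
		printf("%3u,%c", c, (i & 0xF) == 0xF ? '\n' : ' ');
	}
	return 0;
}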
String tail processing is performed by __stricmp_avx2_tail(), which basically employs the same logic. The function accepts strings shorter than 32 bytes and longer than 8 bytes. If the string tail remaining after vector processing is shorter than 8 bytes, then we simply move backward to get the necessary 8 bytes:
if (len < 8) {
i -= 8 - len;
len = 8;
}
return __stricmp_avx2_tail(s1 + i, s2 + i, c);
If the strings have at least 16 bytes, then __stricmp_avx2_tail() executes the same code as above, but using 16-byte registers. The code always leaves exactly 16 bytes of data to process:
if (len >= 16) {
// do the vector processing using 16 byte registers
if (len == 16 || r)
return r;
s1 += len - 16;
s2 += len - 16;
len = 16;
}
So now we have at least 8 and at most 16 bytes, but we still use 16-byte vector processing. To do so we need to load the data with some overlap, and we do it this way:
v0 = _mm_loadh_pd(v0, (double *)s1);
v1 = _mm_loadh_pd(v1, (double *)s2);
v0 = _mm_loadl_pd(v0, (double *)(s1 + len - 8));
v1 = _mm_loadl_pd(v1, (double *)(s2 + len - 8));
It's not a problem to process some piece of data twice, and now we can execute exactly the same code with 16-byte instructions as above.
Let's have a look at how fast the code is:
AVX2/64bit strncasecmp():
str_len 1: 121ms
str_len 3: 132ms
str_len 10: 166ms
str_len 19: 194ms
str_len 28: 227ms
str_len 107: 189ms
str_len 178: 236ms
str_len 1023: 463ms
str_len 1500: 588ms
And we can run even faster if we know that one particular argument of strcasecmp() is always in lower case. For example, if you compare input data against static strings, it's trivial to define the static strings in lower case and always pass them as the second argument. There is no point in using the XOR approach for this case, so I use a simple range computation with both Hacker's Delight tricks as above:
__m256i sub = _mm256_sub_epi8(v0, A);
__m256i cmp_r = _mm256_cmpgt_epi8(D, sub);
__m256i lc = _mm256_and_si256(cmp_r, CASE);
__m256i vl = _mm256_or_si256(v0, lc);
__m256i eq = _mm256_cmpeq_epi8(vl, v1);
return ~_mm256_movemask_epi8(eq);
Naturally, the code outperforms all of the approaches described before:
AVX2/64bit strncasecmp(), one string case conversion:
str_len 1: 126ms
str_len 3: 129ms
str_len 10: 129ms
str_len 19: 133ms
str_len 28: 136ms
str_len 107: 154ms
str_len 178: 179ms
str_len 1023: 310ms
str_len 1500: 376ms
FPU in Linux Kernel
Tempesta FW is a Linux kernel project, and using the FPU in the kernel is not trivial. To do so you must call kernel_fpu_begin() and kernel_fpu_end(), which save and restore the register contents if user-mode processes use the FPU, so using the FPU in kernel mode isn't cheap. Tempesta FW processes HTTP in softirq context, as soon as a packet arrives at the NIC. Thus, to mitigate the overhead, we made a special FPU-safe wrapper, __tempesta_do_softirq_fpusafe():
void
__tempesta_do_softirq_fpusafe(void)
{
	/*
	 * Switch FPU context once per budget packets to let Tempesta
	 * run many vector operations w/o costly FPU switches.
	 * Eager FPU must be enabled.
	 */
	kernel_fpu_begin();
	__do_softirq();
	kernel_fpu_end();
}
This wrapper is called from the do_softirq_own_stack() assembly:
#ifdef CONFIG_SECURITY_TEMPESTA
call __tempesta_do_softirq_fpusafe
#else
call __do_softirq
#endif
So now we do only one FPU context save per softirq shot, which can process many packets at once.