High Performance Linux

> Try Tempesta FW, a high performance open source application delivery controller for the Linux/x86-64 platform.

> Or check custom high-performance solutions from Tempesta Technologies, INC.

> Careers: if you love low-level C/C++ hacking and Linux, we'll be happy to hear from you.

Wednesday, November 6, 2019

Tempesta Technologies blog

NatSys Laboratory Ltd. was rebranded to Tempesta Technologies, so we closed the original site natsys-lab.com and now it's time to move to the Tempesta Technilogies blog.

All the further posts will be available at http://tempesta-tech.com/blog/ . The blog will remain the same - we're passionate about technologies, so we'll continue with the deep technical articles.

There is the summary of the most interesting posts from this blog.

I hope you'll enjoy our new blog!

Saturday, May 18, 2019

Intelpocalypse: goodbye fast system calls

Intel announced MDS (aka ZombieLoad) vunerability. Earlier, in 2018, there was announced Metdown.

Modern Linux kernel is compiled with Kernel page table isolation (KPTI) to prevent Metldown. Essentially, KPTI is just a removal of old technique to optimize system calls, aka lazy TLB: kernel space is mapped to all page tables for user space processes, so there is no need to flush 1 layer caches on kernel/user-space context switches. Performance impact is serious: up to 20% for Nginx (MariaDB got even 40% for certain workloads).

MDS goes further in slowing down system calls, it introduces mds_clear_cpu_buffers() called on each context switch. Performance impact seems not so huge as for the Meltdown prevention, but it's clear that system calls become even more slow.

The good news is that Tempesta FW works in kernel space, so there is no context switches and KPTI and MDS do not affect our performance at all. Moreover, we accurately program our most performance crucial code (HTTP processing) in assembly and use retpoline Spectre prevention only where it's necessary. Retpoline may have up to 15% performance impact, but, fortunately, not each indirect jump must use retpoline to be safe against Spectre.

Wednesday, March 28, 2018

HTTP Requests Proxying

It may seem easy to proxy HTTP requests - after all we just receive an HTTP request, queue it for retransmission, send it to a backend server, and do the same with an HTTP response when we get it from the server. However, things aren't so simple in modern HTTP proxies. In this article I'm going to address several interesting problems in HTTP/1.1 proxying. I'll be mostly concentrating on HTTP reverse proxies also known as web accelerators.

HTTP reverse proxying

First of all let's see what HTTP reverse proxy is and how it works internally. Besides web acceleration - i.e. caching web content - reverse proxies do a lot of stuff:
  1. Load balancing among several servers, sometimes with different performance characteristics. E.g. on the picture we have large 3rd server, which is capable of handling more requests per second than the 2 others, so the server should get more requests.
  2. Automatic failovering of failed servers. The second server on the picture fails, so the proxy must load balance ingress requests among rest of the 2 servers. When the server backs to normal operations, the load must be balanced among all the 3 servers again.
  3. Since TLS is resource hungry, it has sense to terminate TLS on a proxy, so backend servers consumes resources for more useful application logic.
  4. There could be clients with different software, too outdated or too recent, and the proxy should convert different protocols to more suitable forms for the backend servers, e.g. it can downgrade HTTP/2 to HTTP/1.1 or upgrade HTTP/1.0 to HTTP/1.1.
Also HTTP reverse proxies can do many other things such as DDoS mitigation, web security, requests and content modification (SSI or ESI), but in this article I'm going to focus only on basic HTTP proxying issues. There are several interesting topics which immediately arise when we just want to pass a request to some server and forward corresponding response back to a client:
  1. How many connections should a proxy establish with each backend server?
  2. Sometimes backend servers reset connections (by default Nginx and Apache HTTPD reset connections each time when a connection serves 100 requests). When this happens how does a proxy manage the connection resets? Obviously, it should be something more optimal than in a case of a server failure.
  3. Since we pass HTTP messages from a client socket to a backend server socket and vice versa, there should be message queues and these queues must be properly managed with connections failovering in mind.
  4. Is it safe to resend an HTTP request to other backend if current backend can not properly answer the request?
  5. When you pass data from one socket (e.g. client) to an other (e.g. server), then there are data copies and high lock contention. The problem especially crucial for TLS encrypted data and HTTP/2.
The issues listed above are the topic for the article.

Performance of HTTP proxying

Let's start from a small test of a web server. I take Nginx 1.10.3 running in Debian 9 VM with 2 virtual CPUs and 2GB of RAM on my laptop (Intel i7-6500U). For workload generation I'll use wrk. This is a toy, test, environment, so the numbers in the article only make sense in a relative way.

Let's start from getting numbers for raw performance of a Web server hosting only one small index.html file of 3 bytes in size (hereafter I make 3 test runs and get the best results):

    # ./wrk -c 4096 -t 8 -d 30
    Running 30s test @
      8 threads and 4096 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   129.32ms  208.24ms   1.99s    90.28%
        Req/Sec     7.86k     1.67k   13.88k    75.49%
      1877247 requests in 30.10s, 424.30MB read
    Socket errors: connect 0, read 0, write 0, timeout 1374
    Requests/sec:  62368.01
    Transfer/sec:     14.10MB

There are 62K HTTP RPS with no HTTP errors. The Nginx configuration is

    worker_processes auto;

    worker_cpu_affinity auto;
    events {
        worker_connections 65536;
        use                epoll;
        multi_accept       on;
        accept_mutex       off;
    worker_rlimit_nofile   1000000;
    http {
        keepalive_timeout  600;
        keepalive_requests 10000000;
        sendfile           on;
        tcp_nopush         on;
        tcp_nodelay        on;
        open_file_cache    max=1000 inactive=3600s;
        open_file_cache_valid 3600s;
        open_file_cache_min_uses 2;
        open_file_cache_errors off;
        error_log /dev/null emerg;
        access_log         off;
        server {
            listen 9090 backlog=131072 deferred reuseport fastopen=4096;
            location / { root /var/www/html; }

I didn't do any special sysctl settings since all the tests were running with the same OS settings and I didn't care much about absolute numbers in the tests. I just made basic performance tuning settings and switched off logging to remove the slow filesystem writing from the discussion.

Next let's run a proxy in front of the web server. The same Nginx in the same VM is used, but with a different configuration. Note that I switched off the web cache to learn how much overhead HTTP proxying occurs.

    worker_processes auto;
    worker_cpu_affinity auto;
    events {
        worker_connections 65536;
        use                epoll;
        multi_accept       on;
        accept_mutex       off;
    worker_rlimit_nofile   1000000;
    http {
        sendfile           off; # too small file
        tcp_nopush         on;
        tcp_nodelay        on;
        keepalive_timeout  600;
        keepalive_requests 1000000;
        access_log         off;
        error_log /dev/null emerg;
        gzip               off;
        upstream u {
            keepalive 4096;
        server {
            listen 9000 backlog=131072 deferred reuseport fastopen=4096;
            location / {
                proxy_pass http://u;
                proxy_http_version 1.1;
                proxy_set_header Connection "";
        proxy_cache off;

Update: Thanks to Maxim Dounin, a lead Nginx developer, for pointing me out the keepalive configuration option - without it Nginx shows twice worse performance results.

And see how many RPSes we get in this configuration:

    # ./wrk -c 4096 -t 8 -d 30
    Running 30s test @
    8 threads and 4096 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   174.78ms  141.82ms   1.99s    87.09%
      Req/Sec     3.01k     0.95k    8.28k    70.62%
      718994 requests in 30.09s, 162.51MB read
    Socket errors: connect 0, read 0, write 0, timeout 229
    Requests/sec:  23894.05
    Transfer/sec:      5.40MB

Let's also try HAProxy and Tempesta FW, which usually delivers more performance for HTTP proxying. I tried HAProxy of version 1.7.5 and Tempesta FW 0.5.0. The configuration for HAProxy is:

        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        user haproxy
        group haproxy
        maxconn 65536
        nbproc 2
        cpu-map 1 0
        cpu-map 2 1
        log     global
        mode    http
        http-reuse always
        no log
        timeout connect 5000
        timeout client  50000
        timeout server  50000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http
    frontend test
        bind    :7000
        mode    http
        maxconn 65536
        default_backend nginx
    backend nginx
        mode    http
        balance static-rr
        server be1

And its results are:

    # ./wrk -c 4096 -t 8 -d 30
    Running 30s test @
    8 threads and 4096 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency   171.75ms  126.96ms   1.93s    81.56%
      Req/Sec     3.01k     0.88k   10.30k    73.52%
    719739 requests in 30.08s, 146.20MB read
    Socket errors: connect 0, read 0, write 0, timeout 514
    Requests/sec:  23925.79
    Transfer/sec:      4.86MB

The Tempesta FW configuration is just (note conns_n parameter):

    server conns_n=128;
    cache 0;

While Tempesta FW shows the best results, they are still 2 times worse than for no proxying at all:

    # ./wrk -c 4096 -t 8 -d 30
    Running 30s test @
    8 threads and 4096 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency   146.89ms  170.77ms   2.00s    88.21%
      Req/Sec     4.05k   565.96     7.32k    77.58%
    967299 requests in 30.08s, 252.76MB read
    Socket errors: connect 0, read 0, write 0, timeout 19
    Requests/sec:  32157.20
    Transfer/sec:      8.40MB

Thus, HTTP proxying without a cache is very expensive: almost twice worse performance in the best case!
Besides web acceleration, HTTP proxying is also required for load balancing on HTTP layer (e.g. using persistent HTTP sessions) and WAF (Web Application Firewalls), so the performance degradation is significant in many cases.
Having VM with only 2 CPUs and two Nginx instances with auto spawning worker processes makes 4 Nginx worker processes in total. I also ran tests with worker_processes 1 to have one to one CPU and worker process mapping, but the results were bit worse than these.

A very small static file is used in the tests, causing more overhead in network and HTTP processing. Of course, if you run the tests for large static files or a heavy dynamic logic, we won't see so dramatic differences in the numbers. For example, I ran the tests for 64KB index.html with switched on sendfile on the proxy and the proxy overhead was just about 3%.

Thus, if you have a significant work set of small files, which doesn't fit your web cache, then a web accelerator may hurt performance of your installation badly. Always analyze access patterns to your web content or, better, run performance tests.

Backend connections

The first issue is about backend server connections. In most cases modern HTTP proxies use following simple algorithm:
  1. Establish a TCP connection with a backend server.
  2. Send an HTTP request to the connection. Now the connection in busy state.
  3. If a new request arrives, a new TCP connection is established with the server and we do step (2) for the new connection.
  4. When an HTTP response arrives from the server, we forward it to a client and mark the TCP connection as free. Now we can send upcoming requests through the connection.
In busy loaded scenarios, there are thousands of client connections concurrently sending HTTP requests, so typically HTTP proxy establishes also thousands of connections to a backend server. For example, for the test above with 4096 client connections HAProxy establishes more than 3 thousands connections with the backend server (regardless, I used http-reuse always to reuse backend server connections as much as possible):

    # ss -npto state established '( dport = :9090 )'|wc -l

You may have noticed that this is very close to the number of connections from wrk. I'll explain this when I discuss HTTP pipelining, but now I want to emphasize that an HTTP proxy needs almost the same number of connections with a backend server as it has with all the clients. It's worth mentioning that this works only for very aggressive clients which send a lot of requests, e.g. DDoS bots. The consequence is that traditional HTTP accelerators aren't suitable for DDoS mitigation, since protected backend servers can get the same number of connections as the proxies.

Nginx since 1.11.5 supports max_conns option for server directive to limit number of backend connections (so my Nginx 1.10.3 from Debian 9 packages doesn't have the option). HAProxy also supports maxconn option for backend servers. The same way, Tempesta FW provides conns_n option.

Actually, depending on particular hardware and type of work load, web servers have optimal connection concurrency level. For example, Tempesta FW reaches a peak of 1.8M HTTP RPS on a 4 cores machine with 1024 concurrent connections. So starting from a single backend server connection on step (1) and establishing too many connections on step (3) introduces more latency for "unhappy" - requiring to establish a new TCP connection - requests  and reduces overall requests processing performance.

Persistent connections

While establishing a new TCP connection on step (3) introduces unwished latency to the request processing time, it has sense to keep persistent TCP connections with a backend server. If an HTTP proxy keeps a pool of persistent connections with a backend server, then it's always ready for instant spike of ingress client requests, e.g. due to a flash crowd or DDoS attack. This is what Tempesta FW does with conns_n: if you specify for example conns_n=128, then Tempesta FW keeps exactly 128 established connections to the backend server. (By the way, you can specify different values for conns_n for each backend server).

HTTP basically manages the persistency of connections with two headers - keep-alive connections are defined in RFC 2068 - for example:

    Connection: keep-alive
    Keep-Alive: timeout=5, max=10000

, i.e. the TCP connection timeout is 5 seconds and the maximum allowed requests passed over the connection is 10 thousand. Note that a client can only send Connection: keep-alive requests, while Keep-Alive specification for the connection is determined by a server.

TCP connections failovering

There are 3 cases when persistent connections with backed servers may fail:
  1. If there is no workload for some time (e.g. if you didn't enable backend servers' health monitoring), the TCP or HTTP keepalive timer elapses and the connection is closed.
  2. Backend servers can close such connections intentionally due to processing errors or default configuration (e.g. Nginx and Apache HTTPD close TCP client connections after each 100th request). Connection closing also happens to indicate the end of HTTP response without Content-Length header as well as chunked transfer encoding.
  3. Also a backend server may just go down due to a server maintenance or failure. This case is handled by a server health monitoring process which catches service failures on different layers, e.g. hard server reset as well as a web application failure when a web server still responds, but in a wrong way.
In all of these cases a proxy must reestablish TCP connection(s) with the backend server. But what if we just sent a request to a backend server and the server fails? What should we do with the request? To provide a better service for clients the proxy can just resend the request to some other backend server. However, not all types of HTTP requests are allowed to be retransmitted. The next section dives deeper into the subject, but now I'm going to stay on the retransmission issue a bit more.

If we resend an HTTP request, then we have to limit the number of retransmissions for it. Consider a "killing" request which just crashes your backend web application: a proxy sends a request to a backend server, the server fails, the proxy resends the request to another server and the server goes down the same way, so all the servers in the backend cluster are down. To prevent such situations all (I hope) HTTP proxies limit the number of a request retransmissions, for example, with Tempesta FW you can use server_forward_retries for the limit (default value is 5).

The next question is for how long should a proxy keep a request in internal queues in hope to get a successful response for it? After all a client waiting for too long time for a web page rendering considers the service down at some point. So again all (I hope) HTTP proxies provide the time limit for request retransmissions. In the case of Tempesta FW server_forward_timeout does this.

The requirement to be able to resend a request introduces sk_buff copying. struct sk_buff is a Linux kernel descriptor of data being sent through the TCP/IP stack, so the TCP acknowledgement and retransmission mechanisms extensively update the descriptor. Since we may need to resend a request through some other TCP connection, we have to copy sk_buff before it's transmission through TCP/IP stack. The problem is that the descriptor is relatively large, several hundred bytes in size. Originally Tempesta FW network I/O was designed to be zero-copy, but connections failovering doesn't allow to fully avoid copies. The design of Tempesta FW network I/O is described in my post What's Wrong With Sockets Performance And How to Fix It.

HTTP pipelining

HTTP requests can be pipelined, i.e. sent in a row without waiting responses for each of them separately. Tempesta FW is one of the few HTTP proxies (the two others are Squid and Polipo) which can pipeline HTTP requests in backend server connections. All other HTTP proxies, including Nginx and HAProxy, are unable to use HTTP pipelining and wait for a responses for each sent request. I.e. if you configure your HTTP proxy, e.g. Nginx or HAProxy, to establish say 100 connections with a backend server at the most, then only 100 requests can be sent concurrently to backend servers.

This is why we saw 3 thousand open connections between HAProxy and Nginx - the proxy needs so many connections to achieve the concurrency necessary to process ingress workload.

To analyze the performance impact of HTTP pipelining let's reconfigure Tempesta FW to use only one connection to the backend server:

    server conns_n=1;

And start the benchmark:

    # ./wrk -c 4096 -t 8 -d 30
    Running 30s test @
    8 threads and 4096 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency   162.20ms  197.03ms   1.99s    87.05%
      Req/Sec     3.73k     0.91k   12.67k    77.56%
    888501 requests in 30.06s, 232.19MB read
    Socket errors: connect 0, read 0, write 0, timeout 59
    Requests/sec:  29556.83
    Transfer/sec:      7.72MB

We see bit lower performance results due to lower number of backend server connections, but the performance degradation is quite small. To view how pipelining actually works we can use Tempesta FW's application performance monitoring:

    # while :; do \
        grep '0 queue size' /proc/tempesta/servers/default/\:9090; \
        sleep 1; \

        Connection 000 queue size    : 97
        Connection 000 queue size    : 79
        Connection 000 queue size    : 163
        Connection 000 queue size    : 104
        Connection 000 queue size    : 326
        Connection 000 queue size    : 251
        Connection 000 queue size    : 286
        Connection 000 queue size    : 312
        Connection 000 queue size    : 923
        Connection 000 queue size    : 250
        Connection 000 queue size    : 485

The script show the instant number of requests queued for transmission through the backend server connection. As you can see there are hundreds of HTTP requests on the fly at each moment of time.

Now let's make HAProxy to use only one connection with the backend server. To do so we need to change following settings:

        nbproc 1
    backend nginx
        mode    http
        balance static-rr
        fullconn 1
        server be1 minconn 1 maxconn 1

Since it can not use pipelining, the performance degradation is significant, almost 4 times:

    # ./wrk -c 4096 -t 8 -d 30
    Running 30s test @
    8 threads and 4096 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency   674.95ms   89.56ms 934.93ms   83.60%
      Req/Sec   759.42    547.99     2.96k    67.90%
    175041 requests in 30.10s, 35.56MB read
    Requests/sec:   5815.86
    Transfer/sec:      1.18MB

Idempotence of HTTP requests

While HTTP pipelining is good, not all HTTP requests can be pipelined. It's actually not so easy to implement HTTP pipelining correctly.

RFC 7231 4.2.2 defines idempotent methods as safe methods, not changing a server state. The safe methods are GET, HEAD, TRACE, OPTIONS. Only these methods can be pipelined. Consider that we send two requests, one of them is non-safe (stricktly speaking, both of the requests are non-idempotent), to a server in a pipeline:

    GET /forum?post="new%20post%20content" HTTP/1.1
    Host: foo.com
    POST /forum?post HTTP/1.1
    Host: foo.com
    Content-Length: 16
    new post content

If the server connection terminates just after the transmission, we're going to failover process and resend the requests to another connections or a server. But can we resend the non-safe POST request? The problem is that we don't know whether the server processed the request and created a new post on the forum or not. If it did and we resend it again, we create the same post twice which is unwished. If we don't resend the request and just return error code to a client, then our response is false. Thus RFC 7230 6.3.1 requires that a proxy must not automatically retry non-idempotent requests.

Moreover, note that the first GET request is actually requests some dynamic logic which also may change the server state, just like the second POST request. The GET request is essentially non-idempotent, but this is a web application developer's responsibility to use the right request methods in their applications. Actually, POST requests can be idempotent if, for example, an application developer uses them to send a web search query.

Since request idempotence depends on a particular web application, Tempesta FW provides a configuration option to define which request methods to which URIs are non-idemptont, e.g. to make the first GET request non-idempotent we can add following option to the configuration file:

    nonidempotent GET prefix “/forum?post”

Pipeline queues

Let's consider pipelining of requests from a 3 clients to a 2 server connections. The first client sends a non-idempotent request (the large square marked as "NI" on the picture). Firstly, we keep all client requests in a per-client queue to forward server responses to a client in exactly the same order in which the client sent corresponding requests. However, the queue is used only for ordering and when a request arrives it's immediately processed by the load balancing logic ("LB" on the picture) and is scheduled to some server queue. The non-idempotent request resides in a server queue just like idempotent request, but we don't send other requests to a server connection until we receive a response for the non-idempotent request.

If you don't want HTTP pipelining at all you can set the server queue size to 1, i.e. only one request at a time will be queued:

    server_queue_size 1;

It's clear that in general the second server queue having a non-idempotent request is drained slower than the first one, so Tempesta FW's load balancing algorithm makes preference to server queues without non-idempotent requests and uses such queues only if all others are too busy.

Pipelined messages retransmission

There could be a request-killer, crashing a web application, among pipelined requests, so RFC 7230 6.3.2 requires that a client must not pipeline immediately after connection establishment since we don't know which request exactly is the killer. So does Tempesta FW: if a server queue contains requests for retransmission, it doesn't schedule new requests to the queue until the last resent request is responded.

Unless server_retry_nonidempotent configuration option is specified, non-idempotent requests aren't resent and just dropped. If we have idempotent requests before and after the non-idempotent one, then we still can resend them to a live server. The sequence of responses is kept thanks to an error response, which is generated for the dropped non-idempotent request.

Since requests can be scheduled to different servers, appropriate responses can arrive in different order. When a server response arrives, it's linked with an appropriate request and the request is checked against head of the client queue: if all the requests in the head of the queue have linked responses, then all the responses are sent at once (pipelined) to a client. For example if we receive a response for the 1st client request while the 2nd and 3rd requests are already responded, then the whole head of the client queue, all the 3 responses for the first 3 requests, can be sent to the client in a pipeline.

HTTP messages adjustment

HTTP proxies usually have to adjust HTTP headers of forwarded messages, e.g. add Via header or current IP to X-Forwarded-For header. To do so we usually have to "rebuild" the headers from scratch: copy original header to some new memory location and add new headers and/or header values. Having that some HTTP headers, such as Cookie or User-Agent as well as URI can easily reach several kilobytes in size for modern web applications, the data copies aren't wished.

Thus, if we consider a user-space HTTP proxy, then typically we have at least 2 data copies:
  1. Receive a request on first CPU
  2. Copy the request to user space
  3. Update headers (2nd copy)
  4. Copy the request to kernel space (can be eliminated if splice(2) is used - it seems HAProxy only is able to do this)
  5. Send the request from the second CPU
Besides data copying, there is a problem with accessing sockets (TCP Control Blocks, TCBs) from different CPUs. As we saw above modern HTTP proxies work with thousands of TCP connections while modern hardware has only tens of CPU cores, so each core handles hundreds of TCBs. So if we want to forward an HTTP request from a client socket to a server socket, we have to do at least one copy of the request data among different CPUs and touch TCBs on different CPUs. This is not a big deal for single process package machines, but this is a problem for relatively large NUMA systems.

Linux kernel HTTP proxying

Tempesta FW is built-in to the Linux TCP/IP stack, so we can use full power of zero-copy sk_buff fragments and per-CPU TCBs. Details of the HTTP proxying in the Linux kernel can be found in my Netdev 2.1 talk Kernel HTTP/TCP/IP stack for HTTP DDoS mitigation.

The first problem of HTTP message transformation is solved by
HTTP message fragmentation: if we need to add, delete or update some data at the middle of an HTTP message, then we
  1. create a new fragment pointer to a place where the new data must be inserted
  2. create a new fragment with the new data and place its pointer just before the pointer from the previous step.
Data deletion is handled by just moving a pointer to the tail fragment further making a data gap between the first and the second fragments. Update is essentially a combination of deletion and addition.

To implement the zero-copy HTTP messages transformation we had to modify sk_buff allocator to always use paged data.

To reduce number and size of inter-CPU memory transfers we've introduced a per-CPU lock-free ring buffer for fast inter-CPU jobs transfer. Thanks to NIC RSS and the inter-CPU jobs transfer TCBs are mostly accessed by the same CPU. If we need to forward a request processed on the first CPU through a TCB residing on the second CPU we just put a job to the ring buffer of the second CPU and softirq working on the CPU takes care about the actual transmission.


You might wonder why do I talk about HTTP/1.1 pipelining if there is HTTP/2 providing much better requests multiplexing, which is free from head of line blocking problem?

In the article I described HTTP/1.1 pipelining to backend servers only. It's harder to implement HTTP/2 in zero-copy fashion (I'll address the problems in a further article). Meantime, the biggest advantage of the protocol comes in global network with low-speed connections and high delays, which are not the case for local networks with 10G links connecting an HTTP reverse proxy with backend servers. So using HTTP/2 for backend connections is doubtful. By the way, neither HAProxy nor Nginx support HTTP/2 for backend connections.

Sunday, October 30, 2016

HTTP Strings Processing Using C, SSE4.2 and AVX2

In this article I describe applications of standard C functions strcasecmp(3) and strspn(3) to HTTP parser. Surprisingly the functions can be specialized to HTTP parsing task which makes them much faster. Next I consider using SSE4.2 and AVX2 to implement the specialized versions of the functions and show serious performance improvement. GLIBC and Linux kernel implementations of strcasecmp(3) and strspn(3) are described as well as relevant routines from PicoHTTPParser, its modification by Cloud Flare and surely Tempesta FW.

I finished my previous post with bottleneck on long strings parsing in our HTTP parser, the issue details can be found at GitHub. In particular there are two bottlenecks: strspn()-like functions searching for delimiters and strcasecmp() used in many places.

What Makes HTTP Strings Processing Special

There are few important properties of strings in HTTP messages that make their processing special:

  1. HTTP message is sequence of strings separated by special delimiters (e.g. ';' or ','). The most important delimiter is CRLF ("\r\n"). While RFC 7230 defines the delimiter as exactly CRLF, it recommends to process single LF as the delimiter as well. So there can be many different delimiters and one of them has variable length. There are no '\0'-terminated strings, rather all the strings are just parts of contiguous network packets.
  2. RFC defines strict rules which character sets can be found in particular HTTP request line, URI, status line or header field name and value. The number of sets is relatively small.
  3. The character sets are know at compile time, so checking them using standard strspn(3) is a bad idea because it spends a lot of cycles for compiling accept range.
  4. Tempesta FW is true zero-copy server, so a string can not exceed 1500 bytes (1 Ethernet frame) for any string processing call. (Longer strings are processed by chunks.) URI, Cookie, User-Agent and non-standard headers can easily reach tens kilobytes in size.
  5. While long URIs are frequent, it seems the most frequent URI is still single-byte '/'. Many HTTP flood DDoS attacks still use this short URI.
  6. Some of strcasecmp(3) calls can have one of the arguments always in lower case. For example, if you compare ingress string with "Cookie:" pattern to find Cookie header, you can use "cookie:" instead, so strcasecmp() can avoid case conversion for one of the arguments.
  7. In many cases we need only boolean result from strncasecmp(): whether the strings match or not, we don't need to know which string is lexicographically greater.

How HTTP Servers Process Strings

All mature HTTP parsers use FSM (finite state machine) to process messages. The different approaches with benchmarks are covered in my earlier article. However, parsers are different in how deeply they analyze a message. For example we have following HTTP message:

        GET / HTTP/1.0\r\n
        Host: example.net\r\n
        Cookie: session=42; theme="dark"\r\n

When a parser reads Cookie header it has following choices:
  1. Just read the header as opaque data, i.e. just run memchr(buf, '\n', size) over the data and put the header to some string array. This is the fastest way, but if the server works with cookies, then the cookie processing logic must scan the whole array for Cookie header and next parse the header. So you read the data at least twice and that's not fast at all. Thus the approach usually works for simple HTTP proxies which doesn't care which data they transfer;
  2. The better way is to analyze the header name, i.e. parse at least "Cookie:" string and store the rest of the string as opaque data. This time we know exactly whether we have Cookie header and where we can find it. Moreover, we can faster parse the header value since we know where it begins. The drawback of the approach is that we do not verify the header syntax, so we can pass incorrect or ever evil value of the header to vulnerable application;
  3. And the last opportunity is to fully execute FSM driven by the RFC grammar:

    cookie-string = cookie-pair *( ";" SP cookie-pair )
    cookie-pair   = cookie-name "=" cookie-value
    cookie-name   = token
    cookie-value  = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
    cookie-octet  = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                    ; US-ASCII characters excluding CTLs,
                    ; whitespace DQUOTE, comma, semicolon,
                    ; and backslash
    token = 1*<any CHAR except CTLs or separators>

    The rules set is relatively trivial, but strict verification of the rules can be an issue if you must to do this really quickly.
The same points work for URI processing, but URI processing seems more crucial for security reasons since there are many Web attacks involving specially crafted URIs, e.g. SQL injection. Thus strict HTTP fields content verification is important for Web application protection.

While Nginx accurately parses out all URI parts (see sw_check_uri and sw_uri states in ngx_http_parse_request_line(), src/http/ngx_http_parse.c), PicoHTTPParser just checks the URI alphabet as 0x20 (Space) < ch < 0x7f (DEL)., various delimeters and special characters (e.g. '"' or '\') are allowed by PicoHTTPParser, while they should have been filtered out.

Nginx uses old-school loop & switch based approach like:

    for (p = b->pos; p < b->last; p++) {
         ch = *p;
         switch (state) {
             case sw_start:
                 // ....
             case sw_foo:
                 // .....


The FSM is obvious and easy to program, but it's quite slow. See my article about high-performance HTTP parsers and Kazuho Oku's presentation, slides 31-33, for explanation. So PicoHTTPParser uses SSE4.2 instruction PCMESTRI. The instruction can match a 16-byte string against set of characters, ranges or other string. Since this is hardware implemented string matcher, it works much faster than the dummy loop & switch based FSM. However, you're very limited in what you can match. You can match at most 8 ranges or 16 characters. Moreover, you can't mix range matches with characters matching (i.e. you can not match characters like 0x0 < ch < 0x20 && ch == '"'). The pity thing is that character sets for URI or most of HTTP headers exceed the limits (the sets can have about 10 ranges). So using the instruction as HTTP strings matcher involves weak content checking. If you're going to use the instruction, which is somewhat tricky, you probably find Andi Kleen's calculator very useful.

While PicoHTTPParser is very fast Vlad Krasnov from CloudFlare goes further replacing PCMPESTRI instruction by AVX2 code. The code basically checks range (ch >= 0x20 || ch == '\t') && (ch < 0x7f). While PicoHTTPParser uses only one instruction to do string matching, CloudFlare's version executes much more code: AVX2 doesn't have string processing instruction, so there are separate instruction for each comparison and logical operator. However, it executes much faster because it can eat 32 bytes per a step. Moreover, Vlad also did loop unrolling, such that 128 bytes are eaten at a time.

I wrote simple benchmark to learn both the approaches, you can find whole code at GitHub. Benchmark is focused on URI processing, so PicoHTTPParser approach looks as

    static const size_t
    findchar_fast(const char *str, size_t len, const char *ranges,
                  size_t ranges_sz, int *found)
        __m128i ranges16 = _mm_loadu_si128((const __m128i *)ranges); 
        const char *s = str;
        size_t left = len & ~0xf;

        *found = 0;
        do {
            __m128i b16 = _mm_loadu_si128((void *)s);
            int r = _mm_cmpestri(ranges16, ranges_sz, b16, 16,
                                 | _SIDD_CMP_RANGES
                                 | _SIDD_UBYTE_OPS);
                if (r != 16) {
                        *found = 1; 
                        return s - str + r;
                s += 16;
                left -= 16;
        } while (left);

        return s - str;

picohttpparser_findchar_fast(const char *str, size_t len)
    static const unsigned char ranges[] __attribute__((aligned(16))) = 
        "\x00 "         /* control chars and up to SP */ 
        "\"\""          /* 0x22 */ 
        "<<"            /* 0x3c,0x3c */ 
        ">>"            /* 0x3e,0x3e */ 
        "\\\\"          /* 0x5c,0x5c */ 
        "^^"            /* 0x5e,0x5e */ 
        "{}"            /* 0x7b-0x7d */ 
        "\x7f\xff";     /* 0x7f-0xff */ 
        const char *s; 
        size_t n = 0; 

        if (len >= 16) { 
            int found;
            n = findchar_fast(str, len, ranges, sizeof(ranges) - 1,
            if (found)
                return n;
        s = str + n;
        while (s - str < len && uri_a[*s])
        return s - str;

Since ranges are used for PCMPESTRI we have to spend a range for a single character like '<' or '^'. Unfortunately, there are not enough available ranges for us and we have to pass '`' in URI while it is not included in URI specification by RFC.

Code for CloudFlare's approach looks as:

    const __m256i lb = _mm256_set1_epi8(0x1f); /* low bound */
    const __m256i hb = _mm256_set1_epi8(0x7f); /* high bound */
    const __m256i tab = _mm256_set1_epi8(0x09); /* allow TAB */

    /* SPACE <= v */
    __m256i low = _mm256_cmpgt_epi8(v, lb);
    /* SPACE <= v < 0x7f */
    __m256i bounds = _mm256_and_si256(_mm256_cmpgt_epi8(hb, v), low);
    /* SPACE <= v < 0x7f || v == TAB */
    __m256i r = _mm256_or_si256(_mm256_cmpeq_epi8(tab, v), bounds);

    /* Generate bit mask */
    *range = ~_mm256_movemask_epi8(r);

I skip code for 64- and 128-byte processing as well as the functions results merging code. You can find the full code of the approach here. There are too many instructions to handle the simple characters set, so we do only basic verification. The numbers on my Intel Core i7-6500U are:

        str_len     1:     128ms
        str_len     3:     138ms
        str_len    10:     161ms
        str_len    19:     151ms
        str_len    28:     183ms
        str_len   107:     218ms
        str_len   178:     230ms
        str_len  1023:     784ms
        str_len  1500:    1069ms

        str_len     1:     171ms
        str_len     3:     175ms
        str_len    10:     189ms
        str_len    19:     174ms
        str_len    28:     196ms
        str_len   107:     198ms
        str_len   178:     203ms
        str_len  1023:     375ms
        str_len  1500:     458ms

More code is executed in CloudFlare's version, so single character case is much slower than PicoHTTPParser. But AVX2 code show much more stable performance with increasing string length, there are 9 sizes of processed strings from 1 to 1500 bytes.

There is full results of the benchmark. Each string is porcessed 5,000,000 times. To get the results I executed the benchmark 5 times on my laptop with all heavy applications stopped (mail, browsers etc). For each benchmark I selected the best numbers to mitigate impact of some external activity by other processes still leaving in the system. I also used taskset to eliminate rescheduling overhead:

    $ for i in `seq 0 4`; do taskset 0x2 ./str_benchmark > ./b.$i; done

Just to show how bad standard strspn(3) is for checking HTTP character sets I also wrote benchmarks for GLIBC assembly version and Linux kernel naive C implementation. Linux kernel doesn't use assembly for strcasecmp() and strspn() since there are no performance critical strings processing in kernel. But the implementation clearly shows that plain C for the task performs quite poorly. So the numbers are:

    GLIBC strspn():
        str_len     1:     350ms
        str_len     3:     354ms

        str_len    10:     380ms
        str_len    19:     420ms
        str_len    28:     398ms
        str_len   107:     533ms
        str_len   178:     650ms
        str_len  1023:    2071ms
        str_len  1500:    2856ms

    Linux kernel strspn():
        str_len     1:     324ms
        str_len     3:     641ms
        str_len    10:    1865ms
        str_len    19:    3565ms
        str_len    28:    4522ms
        str_len   107:   18851ms
        str_len   178:   28575ms
        str_len  1023:  187992ms
        str_len  1500:  273276ms

Even More Fast and Accurate

So we need a better alternative. It must quickly process short strings as well as long and it must accurately verify character sets defined by RFC.

Now Tempesta FW implements AVX2 routines for HTTP specific strings processing. It outperforms both the approaches and provides absolute accuracy in HTTP character sets verification. There are numbers from the same benchmark:

    Tempesta AVX2 constant URI matching:
        str_len     1:     123ms

        str_len     3:     127ms
        str_len    10:     150ms
        str_len    19:     139ms
        str_len    28:     156ms
        str_len   107:     167ms

        str_len   178:     180ms
        str_len  1023:     350ms
        str_len  1500:     433ms

Now lets have a look what's under the hood. The entry point of the algorithm is tfw_match_uri_const(). Firstly the function quickly process very short strings up to 4 bytes:

    if (likely(len <= 4)) {
        switch (len) {
        case 0:
            return 0;
        case 4:
            c3 = uri_a[s[3]];
        case 3:
            c2 = uri_a[s[2]];
        case 2:
            c1 = uri_a[s[1]];
        case 1:
           c0 = uri_a[s[0]];
       return (c0 & c1) == 0 ? c0 : 2 + (c2 ? c2 + c3 : 0);

uri_a is defined as 256-byte constant array with bytes set for ASCII characters acccepted by URI. Previously we used 4 unsigned longs (256 bits) and defined whether a character allowed by set bit

    uri_a[c >> 6] & (1UL << (c & 0x3f))

However, we found that the bit operation does to many operations and simple table lookup outperforms it. It worth to mention that while the 256-byte array wastes 4 cache lines, only 2 of them are frequently used.

Since the function must return exact number of matched symbols, we have to execute heavyweight condition at return statement. The condition doesn't allow us to efficiently check more bytes in plain C, for example 8 bytes.

Next we process large strings in following manner:

    for ( ; unlikely(s + 128 <= end); s += 128) {
        n = match_symbols_mask128_c(__C.URI_BM, s);
        if (n < 128)
            return s - (unsigned char *)str + n;
    if (unlikely(s + 64 <= end)) {
        n = match_symbols_mask64_c(__C.URI_BM, s);
        if (n < 64)
            return s - (unsigned char *)str + n;
        s += 64;
    if (unlikely(s + 32 <= end)) {
        n = match_symbols_mask32_c(__C.URI_BM, s);
        if (n < 32)
            return s - (unsigned char *)str + n;
        s += 32;
    if (unlikely(s + 16 <= end)) {
        n = match_symbols_mask16_c(__C.URI_BM128, s);
        if (n < 16)
            return s - (unsigned char *)str + n;
        s += 16;

The code processes strings longer than 16 bytes. So we have the gap between 4 and 16 bytes which is processed by following code. The code processes string tail as well as short strings. This is why I use unlikely in the conditions above: branch misprediction is super important for short strings, while long strings aren't so sensitive to several penalties. Tail of the string is processed in the same way as GLIBC generic C implementation does it:

    while (s + 4 <= end) {
        c0 = uri_a[s[0]];
        c1 = uri_a[s[1]];
        c2 = uri_a[s[2]];
        c3 = uri_a[s[3]];
        if (!(c0 & c1 & c2 & c3)) {
            n = s - (unsigned char *)str;
            return !(c0 & c1) ? n + c0 : n + 2 + (c2 ? c2 + c3 : 0);
        s += 4;
    c0 = c1 = c2 = 0;
    switch (end - s) {
    case 3:
        c2 = uri_a[s[2]];
    case 2:
        c1 = uri_a[s[1]];
    case 1:
        c0 = uri_a[s[0]];
    n = s - (unsigned char *)str;
    return !(c0 & c1) ? n + c0 : n + 2 + c2;

, just usual loop unrolling.

And now is time for the core of the algorithm. I describe match_symbols_mask32_c() only, all other functions are just straightforward modifications for larger data processing.

    const __m256i ARF = _mm256_setr_epi8(
        0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,

        0, 0, 0, 0, 0, 0, 0, 0, 
        0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,
        0, 0, 0, 0, 0, 0, 0, 0);
    const __m256i LSH = _mm256_set1_epi8(0xf);      

    static size_t 
    match_symbols_mask32_c(__m256i sm, const char *str)
        __m256i v = _mm256_lddqu_si256((void *)str);

1:      __m256i acbm = _mm256_shuffle_epi8(sm, v);

2:      __m256i acols = _mm256_and_si256(LSH, _mm256_srli_epi16(v, 4));
3:      __m256i arbits = _mm256_shuffle_epi8(ARF, acols);
4:      __m256i sbits = _mm256_and_si256(arbits, acbm);
5:      v = _mm256_cmpeq_epi8(sbits, _mm256_setzero_si256());
6:      unsigned long r = 0xffffffff00000000UL
                          | _mm256_movemask_epi8(v); 

7:      return __tzcnt(r);

(I numbered important lines of the code for later description). sm is specially crafted representation of allowed characters set, so it varies depending on what we're parsing, e.g. URI or particular header value. Meantime ARF and LSH are constants identical for all the matchers. For URI we set sm by

    sm = _mm256_setr_epi8(
        0xb8, 0xfc, 0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c,
        0xb8, 0xfc, 0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c);

To understand the constants lets have a look at ASCII table.

There are 16 rows and sm contains two equal series by 16 bytes: each of them describes positions of valid URI characters in ASCII table rows. Note that 'column' and 'row' can be interchanged depending on the ASCI table representation. Hereafter I describe the logic using the table representation above. The first constant 0xb8 is defined by first row of the table. 'p', 'P' and '@' are valid URI charaters, while '`' isn't. So we encode this as 1011 in binary representation, or 0xb. Next 4 bits we define as 0x8 since '0' only form the next 4 ASCII symbols is accepted as URI character. The next constant 0xfc is defined by the second row and so on.

To generate the constants and uri_a, which I mentioned above, I use following simple program:

    static const unsigned char uri[] =

    unsigned char r[256] = { 0 };
    for (int i = 0; i < sizeof(uri)/sizeof(uri[0]) - 1; ++i)
        r[A[i]] = 1;
    for (
int i = 0; i < 256; ++i)
        printf("%u,%c", r[i], (i & 0xF) == 0xF ? '\n' : ' ');
    for (
int i = 0; i < 16; ++i) {
        unsigned char c;
        for (
c = 0, j = 7; j >= 0; --j)
            c = (c << 1) | r[(j << 4) + i];
        printf("%#x, ", c);

The algorithm does following steps (see line numbers in the code above, Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2 and Intel Intrinsics Guide are excellent references for the intrinsics):
  1. shuffle sm bytes according to input data, using as shuffle control mask. For example if 'p', which is essentially 0x70, is the first character of input data, then 0 (4 least significant bits defining the character row in ASCII table) defines the new position of the first constant in sm, 0xb8. In other words this step says that first character of input data belongs to ASCII table first row and places corresponding bitmap of allowed characters in the row by 0th index.
  2. Next, we build an array of ASCII table columns corresponding to input data. To determine column of our character we just do 'p' >> 4. However, the minimal unit for shift is word, 2 bytes, so we have to use LSH (0xf) mask to clear moved least significant bits of most significant byte of the word.
  3. ARF is a bit mask defining at which column in a ASCII row a character is placed , i.e. 'p' placed at the most right column corresponds to least significant bit 0x1. So at this step we arrange the column bits according to input data.
  4. So now we have two arrays of bitmaps for allowed characters in ASCII row and a column bit for particular characters. And at this line of code we intersect both the arrays by AND determining whether we have allowed character at particular place.
  5. The previous step sets bits somewhere in each byte of the vector. And now we propagates the bit to most significant bit.
  6. Next, we aggregate most significant bits of all bytes of the vector to 32-bit integer. We store the result in 64-bit integer with set most significant bits, such that the next step can correctly count number of set bits in the value.
  7. Finally, we count non-zero bits corresponding to matching URI bits.
For now Tempesta FW determines 7 alphabets accepted by HTTP strings parsing FSM. The alphabets matching is localized by simple to use wrappers.


The next important function, which is one of the hottest spots, is strncasecmp().

GLIBC's __strncasecmp_l_avx() (at least 2.23 version) basically implements following straightforward logic. Firstly, we define constants for 'A' and 'Z' - the characters range to be converted to lower case:

    const __m256i A = _mm256_set1_epi8('A' - 1);
    const __m256i Z = _mm256_set1_epi8('Z' + 1);

And constant for the case conversion:

    const __m256i CASE = _mm256_set1_epi8(0x20);

Next, we load 32 bytes of each input string:

    __m256i v0 = _mm256_lddqu_si256((void *)s0);
    __m256i v1 = _mm256_lddqu_si256((void *)s1);

And determine which characters of each of them we have to convert to lower case:

    __m256i a0 = _mm256_cmpgt_epi8(v0, A);
    __m256i a1 = _mm256_cmpgt_epi8(v1, A);
    __m256i z0 = _mm256_cmpgt_epi8(Z, v0);
    __m256i z1 = _mm256_cmpgt_epi8(Z, v1);
    __m256i cmp_r0 = _mm256_and_si256(a0, z0);
    __m256i cmp_r1 = _mm256_and_si256(a1, z1);

cmp_r defines which characters (vector items) we have to convert to lower case and now we set 0x20 in the positions getting bit masks converting the input strings to lower case:

    __m256i lc0 = _mm256_and_si256(cmp_r0, CASE);
    __m256i lc1 = _mm256_and_si256(cmp_r1, CASE);

The bit masks are used for case conversion by OR operator and finally we can compare both the strings and return zero value if all the characters match or non-zero value otherwise. Note that HTTP parser requires only boolean return value whether the whole strings match or not.

    __m256i vl0 = _mm256_or_si256(v0, lc0);
    __m256i vl1 = _mm256_or_si256(v1, lc1);

    __m256i eq = _mm256_cmpeq_epi8(vl0, vl1);

    return ~_mm256_movemask_epi8(eq);

Actually GLIBC version does the stuff using 16-byte vectors and in fact it's slower than the AVX2 implementation above. Note that we don't care about zero byte in the strings, but rather require strings of equal size. Technically it can be done by defining 3rd argument of the function as min(s1.length, s2.length): in all the cases for HTTP parser we know string lengths since the parser doesn't work with zero-terminated C strings.

    GLIBC strncasecmp():
        str_len     1:     133ms
        str_len     3:     144ms
        str_len    10:     143ms
        str_len    19:     163ms
        str_len    28:     168ms
        str_len   107:     213ms
        str_len   178:     253ms
        str_len  1023:     861ms
        str_len  1500:    1167ms

    AVX2 strncasecmp():
        str_len     1:     127ms
        str_len     3:     131ms
        str_len    10:     178ms
        str_len    19:     206ms
        str_len    28:     235ms
        str_len   107:     199ms
        str_len   178:     254ms
        str_len  1023:     558ms
        str_len  1500:     673ms

I also used very similar optimizations for short stings in plain C as in strspn()-like case above.

Actually we don't need to convert both the strings to lower case. Instead we can do XOR on the strings (i.e. compute the strings "difference") and determine whether the difference is exactly in case:

    __m256i xor = _mm256_xor_si256(v0, v1);
    __m256i lc = _mm256_cmpeq_epi8(xor, CASE);

lc stores positions where the stings differ in case only, i.e. 0x20. However, for example '-' (0x2d) also differs from 'M' (0x4d) for exactly 0x20 and lc also stores the position. To know which positions are in the interest we determine which characters of first string is in ['a', 'z'] range, and we do this for one string only:

    __m256i a = _mm256_set1_epi8('a' - 0x80);
    __m256i D = _mm256_set1_epi8('z' - 'a' + 1 - 0x80);

    __m256i vl0 = _mm256_or_si256(v0, CASE);
    __m256i sub = _mm256_sub_epi8(vl0, a);
    __m256i cmp_r = _mm256_cmpgt_epi8(D, sub);

Here I use 2 tricks from Hacker's Delight. Computing 'a' <= v <= 'z' requires 3 operations since it must be coded as v >= 'a' && v <= 'z'. So Hacker's Delight proposes to replace the expression by v - 'a' < 'z' - 'a' + 1, which is just 2 operations since 'z' - 'a' + 1 is computed at compile time. However, we must use unsigned version of < operator here to be able to employ integer overflow for correct comparison. Meantime x86-64 provides only signed versions of the instruction, so we must use the 2nd trick. The trick is that we can replace unsigned version of the operator by signed operator using 0x80 subtraction from both the arguments. So our expression becomes 'a' - 0x80 < 'z' - 'a' + 1 - 0x80.

Next, we intersect cmp_r with lc and intersect the result with CASE to determine virtual (good) result of XOR over the two strings if they are different in case only:

    __m256i good = _mm256_and_si256(lc, cmp_r);
    __m256i good_xor = _mm256_and_si256(good, CASE);

Since we have result of actual XOR over the strings we can compare it with the virtual (good) XOR result: if they match, then the strings are equal.

    __m256i match = _mm256_xor_si256(good_xor, xor);
    match = _mm256_cmpeq_epi8(match, _mm256_setzero_si256());

    return ~_mm256_movemask_epi8(match);

This time I used vector instructions to process the string tails. To do so I need at least 8 bytes, so I used large switch at begin of the function to be sure that following vector code has at least 8 byte arguments:

    switch (len) {
    case 0:
        return 0;
    case 8:
        c |= lct[s1[7]] ^ lct[s2[7]];
    case 7:
        c |= lct[s1[6]] ^ lct[s2[6]];
    case 6:
        c |= lct[s1[5]] ^ lct[s2[5]];
    case 5:
        c |= lct[s1[4]] ^ lct[s2[4]];
    case 4:
        c |= lct[s1[3]] ^ lct[s2[3]];
    case 3:
        c |= lct[s1[2]] ^ lct[s2[2]];
    case 2:
        c |= lct[s1[1]] ^ lct[s2[1]];
    case 1:
        c |= lct[s1[0]] ^ lct[s2[0]];
        return c;

The switch is cheap since we don't need to calculate complex conditional expression as previously. lct is 256-byte static constant table used for case lowering. Note that GLIBC's tolower(3) is actually slow since it requires far call of __ctype_tolower_loc(). So I used lct table wherever possible.

String tails processing is performed by __stricmp_avx2_tail(), which basically employ the same logic. The function accepts stings shorter than 32 bytes and longer than 8 bytes. If the strings tail after vector processing is shorter than 8 bytes, then we simply move backward to get necessary 8 bytes:

    if (len < 8) {
        i -= 8 - len;
        len = 8;

    return __stricmp_avx2_tail(s1 + i, s2 + i, c);

If the strings have at least 16 bytes, then __stricmp_avx2_tail() executes the same code as above, but using 16 byte registers. The code always leaves 16 bytes of data:

    if (len >= 16) {
        // do the vector processing using 16 byte registers
        if (len == 16 || r)
            return r;
        s1 += len - 16;
        s2 += len - 16;
        len = 16;

So now we have at least 8 bytes and not more than 16 bytes. But we still use 16 byte vector processing. To do so we need to load the data, with some overlapping and we do this in this way:

    v0 = _mm_loadh_pd(v0, (double *)s1);
    v1 = _mm_loadh_pd(v1, (double *)s2);
    v0 = _mm_loadl_pd(v0, (double *)(s1 + len - 8));
    v1 = _mm_loadl_pd(v1, (double *)(s2 + len - 8));

That's not a problem to process some piece of data twice. And now we can execute exactly the same code with 16 byte instructions as above.

Lets have a look how fast the code is:

    AVX2/64bit strncasecmp():
        str_len     1:     121ms
        str_len     3:     132ms
        str_len    10:     166ms
        str_len    19:     194ms
        str_len    28:     227ms
        str_len   107:     189ms
        str_len   178:     236ms
        str_len  1023:     463ms
        str_len  1500:     588ms

And we can run ever faster if we know that one particular argument of strcasecmp() is always in lower case. For example if you compare input data with static strings, then it's trivial to define the static strings in lower case and pass them always by the second argument. There is no sense to use XOR approach for the case and I use simple range calculation using both the Hacker's Delight tricks as above:

    __m256i sub = _mm256_sub_epi8(v0, A);
    __m256i cmp_r = _mm256_cmpgt_epi8(D, sub);
    __m256i lc = _mm256_and_si256(cmp_r, CASE);
    __m256i vl = _mm256_or_si256(v0, lc);
    __m256i eq = _mm256_cmpeq_epi8(vl, v1);

    return ~_mm256_movemask_epi8(eq);

Surely the code outperforms all of the approaches described before:

    AVX2/64bit strncasecmp(), one string case conversion:
        str_len     1:     126ms
        str_len     3:     129ms
        str_len    10:     129ms
        str_len    19:     133ms
        str_len    28:     136ms
        str_len   107:     154ms
        str_len   178:     179ms
        str_len  1023:     310ms
        str_len  1500:     376ms

FPU in Linux Kernel

Tempesta FW is Linux kernel project and using FPU in kernel is not trivial. To do so you must call kernel_fpu_begin() and kernel_fpu_end(), which save the contents of the registers if user mode processes use the FPU. So the using FPU in kernel mode isn't cheap. Tempesta FW processes HTTP in softirq context, just as soon as the packet arrives to NIC. Thus, to mitigate the overhead we made special FPU safe wrapper __tempesta_do_softirq_fpusafe():


         * Switch FPU context once per budget packets to let Tempesta
         * run many vector operations w/o costly FPU switches.
         * Eager FPU must be enabled.



, which is called from do_softirq_own_stack() assembly:

        call __tempesta_do_softirq_fpusafe
        call __do_softirq

So now we do only one FPU context store per softirq shot which can process many packets at once.