High Performance Linux

Sunday, November 23, 2014

Fast Finite State Machine for HTTP Parsing

There is special type of DDoS attacks, application level DDoS, which is quite hard to combat against. Analyzing logic which filters this type of DDoS attack must operate on HTTP message level. So in most cases the logic is implemented as custom modules for application layer (usually nowadays user space) HTTP accelerators. And surely Nginx is the most widespread platform for such solutions. However, common HTTP servers and reverse proxies were not designed for DDoS mitigation- they are simply wrong tools for this issue. One of the reason is that they are too slow to combat with massive traffic (see my recent paper and presentation for other reasons).

If logging is switched off and all content is in cache, then HTTP parser becomes the hottest spot. Simplified output of perf for Nginx under simple DoS is shown below (Nginx’s calls begin with ’ngx’ prefix, memcpy and recv are standard GLIBC calls):


      %            symbol name
    1.5719    ngx_http_parse_header_line
    1.0303    ngx_vslprintf
    0.6401    memcpy
    0.5807    recv
    0.5156    ngx_linux_sendfile_chain
    0.4990    ngx_http_limit_req_handler


The next hot spots are linked to complicated application logic (ngx vslprintf ) and I/O. 

During Tempesta FW development We have studied several HTTP servers and proxies (Nginx, Apache Traffic Server, Cherokee, node.js, Varnish and userver) and learned that all of them use switch and/or if-else driven state machines.

The problem with the approach is that HTTP parsing code is comparable in size with L1i cache and processes one character at a time with significant number of branches. Modern compilers optimize large switch statements to lookup tables that minimizes number of conditional jumps, but branch misprediction and instruction cache misses still hurt performance of the state machine. So the method probably has poor performance.

The other well-known approach is table-driven automaton. However, simple HTTP parser can have more than 200 states and 72 alphabet cardinality. That gives 200 x 72 = 14400 bytes for the table, which is about half of L1d of modern microprocessors. So the approach is also could be considered as inefficient due to high memory consumption.

The first obvious alternative for the state machine is to use Hybrid State Machine (HSM) described in our paper, which combines very small table with also small switch statement. In our case we tried to encode outgoing transitions from a state with at most 4 ranges. If the state has more outgoing transitions, then all transitions over that 4 must be encoded in switch. All actions (like storing HTTP header names and values) must be performed in switch. Using this technique we can encode each state with only 16 bytes, i.e. one cache line can contain 4 states. Giving this the approach should have significantly improve data cache hit.

We also know that Ragel generates perfect automatons and combines case labels in switch statement with direct goto labels (it seems switch is used to be able to enter FSM from any state, i.e. to be able to process chunked data). Such automatons has lower number of loop cycle and bit faster than traditional a-loop-cycle-for-each-transition approach. There was successful attempt to generate simple HTTP parsers using Ragel, but the parsers are limited in functionality.

However there are also several research papers which says that an automaton states is just auxiliary information and an automaton can be significantly accelerated if state information is declined.
So the second interesting opportunity to generate the fastest HTTP parser is just to encode the automaton directly using simple goto statements, ever w/o any explicit loop.

Basically HTTP parsers just matches a string against set of characters (e.g. [A-Za-z_-] for header names), what strspn(3) does. SSE 4.2 provides PCMPSTR instructions family for this purpose (GLIBC since 2.16 uses SSE 4.2 implemenetation for strspn()). However, this is vector instruction which doesn't support accept or reject sets more than 16 characters, so it's not too usable for HTTP parsers.

Results
I made a simple benchmark for four approaches described above (http_ngx.c - Nginx HTTP parsing routines, http_table.c - table-driven FSM, http_hsm.c - hybrid state machine and http_goto.c - simple goto-driven FSM). And here are the results (routines with 'opt' or 'lw' - are optimized or lightweight versions of functions):

Haswell (i7-4650U)

    Nginx HTTP parser:
        ngx_request_line:       730ms
        ngx_header_line:        422ms
        ngx_lw_header_line:     428ms
        ngx_big_header_line:    1725ms

    HTTP Hybrid State Machine:
        hsm_header_line:        553ms

    Table-driven Automaton (DPI)
        tbl_header_line:        473ms
        tbl_big_header_line:    840ms

    Goto-driven Automaton:
        goto_request_line:      470ms
        goto_opt_request_line:  458ms
        goto_header_line:       237ms
        goto_big_header_line:   589ms

Core (Xeon E5335)

    Nginx HTTP parser:
        ngx_request_line:       909ms
        ngx_header_line:        583ms
        ngx_lw_header_line:     661ms
        ngx_big_header_line:    1938ms

    HTTP Hybrid State Machine:
        hsm_header_line:        433ms

    Table-driven Automaton (DPI)
        tbl_header_line:        562ms
        tbl_big_header_line:    1570ms

    Goto-driven Automaton:
        goto_request_line:      747ms
        goto_opt_request_line:  736ms
        goto_header_line:       375ms
        goto_big_header_line:   975ms

Goto-driven automaton shows the better performance in all the tests on both the architectures. Also it's much easier to implement in comparison with HSM. So in Tempesta FW we migrated from HSM to goto-driven atomaton, but with some additional optimizations.

Lessons Learned
Haswell has very good BPU
Core micro-architecture has show that HSM behaves much better than switch-driven and table-driven automatons. While this is not the case for Haswell - the approach loses to both the approaches. I've tried many optimizations techniques to improve HSM performance, but the results above are the best and they still worse than the simple FSM approaches.
Profiler shows that the problem (hot spot) in HSM on Haswell is in the following code
 
    if (likely((unsigned char)(c - RNG_CB(s, 0)) <= RNG_SUB(s, 0))) {
        st = RNG_ST(s, 0);
        continue;
    }

Here we extract transition information and compare current character with the range. In most cases only this one branch is observer in the test. 3rd and 4th branches are never observed. The whole automaton was encoded with only 2 cache lines.

In first test case, when XTrans.x structure is dereferenced to get access to the ranges, the compiler generates 3 pointer dereferences. In fact these instructions (part of the disassembled branch)
 
    sub    0x4010c4(%rax),%bl
    cmp    0x4010c5(%rax),%bl
    movzbl 0x4010cc(%rax),%eax

produce 3 accesses to L1d and the cache has very limited bandwidth (64 bytes for reading and 32 bytes for writing) on each cycle with minimal latency as 4 cycles for Haswell. While the only one cache line is accessed by all the instructions. So the test case bottle neck is L1d bandwidth.

If we use XTrans.l longs (we need only l[0], which can be loaded with only one L1d access, in all the cases) and use bitwise operations to extract the data, then we get lower number of L1d accesses (4G vs 6.7G for previous cases), but branch mispredictions are increased.

The problem is that more complex statement in the conditions makes harder to Branch Prediction Unit to predict branches.

However, we can see that simple branches (for switch-driven and goto-driven automatons) show perfect performance on Haswell. So advanced Haswell BPU perfectly processes simple automatons making complex HSM inadequate.

In fact HSM is only test which is slower on Haswell in comparison with Core Xeon. Probably, this is the difference between server and mobile chips that ever old server processor beats modern mobile CPU on complex loads...

-O3 is ambiguous
Sometimes -O3 (GCC 4.8.2) generates slower code than -O2. Also benchmarks for -O3 show very strange and unexpected results. For example the below are results for -O2:

    goto_request_line: 470ms

However, -O3 shows worse results:

    goto_request_line: 852ms



Automata must be encoded statically whenever possible
Table-driven and HSM automaton are encoded using static constant tables (in difference with run-time generated tables for current DPI parser). This was done during HSM optimizations. Sometimes compiler can't optimize code using run-time generated tables. And this is crucial for real hot spots (for HSM the table is used in the if-statement described above which gets about 50-70% of whole the function execution time) - after the moving to the static data the code can get up to 50% performance improvement (the case for HSM).

No comments:

Post a Comment