High Performance Linux

> Try Tempesta FW, a high performance open source application delivery controller for the Linux/x86-64 platform.

> Or check custom high-performance solutions from Tempesta Technologies, INC.

> Careers: if you love low-level C/C++ hacking and Linux, we'll be happy to hear from you.

Sunday, October 30, 2016

HTTP Strings Processing Using C, SSE4.2 and AVX2

In this article I describe applications of standard C functions strcasecmp(3) and strspn(3) to HTTP parser. Surprisingly the functions can be specialized to HTTP parsing task which makes them much faster. Next I consider using SSE4.2 and AVX2 to implement the specialized versions of the functions and show serious performance improvement. GLIBC and Linux kernel implementations of strcasecmp(3) and strspn(3) are described as well as relevant routines from PicoHTTPParser, its modification by Cloud Flare and surely Tempesta FW.

I finished my previous post with bottleneck on long strings parsing in our HTTP parser, the issue details can be found at GitHub. In particular there are two bottlenecks: strspn()-like functions searching for delimiters and strcasecmp() used in many places.

What Makes HTTP Strings Processing Special

There are few important properties of strings in HTTP messages that make their processing special:

  1. HTTP message is sequence of strings separated by special delimiters (e.g. ';' or ','). The most important delimiter is CRLF ("\r\n"). While RFC 7230 defines the delimiter as exactly CRLF, it recommends to process single LF as the delimiter as well. So there can be many different delimiters and one of them has variable length. There are no '\0'-terminated strings, rather all the strings are just parts of contiguous network packets.
  2. RFC defines strict rules which character sets can be found in particular HTTP request line, URI, status line or header field name and value. The number of sets is relatively small.
  3. The character sets are know at compile time, so checking them using standard strspn(3) is a bad idea because it spends a lot of cycles for compiling accept range.
  4. Tempesta FW is true zero-copy server, so a string can not exceed 1500 bytes (1 Ethernet frame) for any string processing call. (Longer strings are processed by chunks.) URI, Cookie, User-Agent and non-standard headers can easily reach tens kilobytes in size.
  5. While long URIs are frequent, it seems the most frequent URI is still single-byte '/'. Many HTTP flood DDoS attacks still use this short URI.
  6. Some of strcasecmp(3) calls can have one of the arguments always in lower case. For example, if you compare ingress string with "Cookie:" pattern to find Cookie header, you can use "cookie:" instead, so strcasecmp() can avoid case conversion for one of the arguments.
  7. In many cases we need only boolean result from strncasecmp(): whether the strings match or not, we don't need to know which string is lexicographically greater.

How HTTP Servers Process Strings

All mature HTTP parsers use FSM (finite state machine) to process messages. The different approaches with benchmarks are covered in my earlier article. However, parsers are different in how deeply they analyze a message. For example we have following HTTP message:

        GET / HTTP/1.0\r\n
        Host: example.net\r\n
        Cookie: session=42; theme="dark"\r\n

When a parser reads Cookie header it has following choices:
  1. Just read the header as opaque data, i.e. just run memchr(buf, '\n', size) over the data and put the header to some string array. This is the fastest way, but if the server works with cookies, then the cookie processing logic must scan the whole array for Cookie header and next parse the header. So you read the data at least twice and that's not fast at all. Thus the approach usually works for simple HTTP proxies which doesn't care which data they transfer;
  2. The better way is to analyze the header name, i.e. parse at least "Cookie:" string and store the rest of the string as opaque data. This time we know exactly whether we have Cookie header and where we can find it. Moreover, we can faster parse the header value since we know where it begins. The drawback of the approach is that we do not verify the header syntax, so we can pass incorrect or ever evil value of the header to vulnerable application;
  3. And the last opportunity is to fully execute FSM driven by the RFC grammar:

    cookie-string = cookie-pair *( ";" SP cookie-pair )
    cookie-pair   = cookie-name "=" cookie-value
    cookie-name   = token
    cookie-value  = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
    cookie-octet  = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                    ; US-ASCII characters excluding CTLs,
                    ; whitespace DQUOTE, comma, semicolon,
                    ; and backslash
    token = 1*<any CHAR except CTLs or separators>

    The rules set is relatively trivial, but strict verification of the rules can be an issue if you must to do this really quickly.
The same points work for URI processing, but URI processing seems more crucial for security reasons since there are many Web attacks involving specially crafted URIs, e.g. SQL injection. Thus strict HTTP fields content verification is important for Web application protection.

While Nginx accurately parses out all URI parts (see sw_check_uri and sw_uri states in ngx_http_parse_request_line(), src/http/ngx_http_parse.c), PicoHTTPParser just checks the URI alphabet as 0x20 (Space) < ch < 0x7f (DEL)., various delimeters and special characters (e.g. '"' or '\') are allowed by PicoHTTPParser, while they should have been filtered out.

Nginx uses old-school loop & switch based approach like:

    for (p = b->pos; p < b->last; p++) {
         ch = *p;
         switch (state) {
             case sw_start:
                 // ....
             case sw_foo:
                 // .....


The FSM is obvious and easy to program, but it's quite slow. See my article about high-performance HTTP parsers and Kazuho Oku's presentation, slides 31-33, for explanation. So PicoHTTPParser uses SSE4.2 instruction PCMESTRI. The instruction can match a 16-byte string against set of characters, ranges or other string. Since this is hardware implemented string matcher, it works much faster than the dummy loop & switch based FSM. However, you're very limited in what you can match. You can match at most 8 ranges or 16 characters. Moreover, you can't mix range matches with characters matching (i.e. you can not match characters like 0x0 < ch < 0x20 && ch == '"'). The pity thing is that character sets for URI or most of HTTP headers exceed the limits (the sets can have about 10 ranges). So using the instruction as HTTP strings matcher involves weak content checking. If you're going to use the instruction, which is somewhat tricky, you probably find Andi Kleen's calculator very useful.

While PicoHTTPParser is very fast Vlad Krasnov from CloudFlare goes further replacing PCMPESTRI instruction by AVX2 code. The code basically checks range (ch >= 0x20 || ch == '\t') && (ch < 0x7f). While PicoHTTPParser uses only one instruction to do string matching, CloudFlare's version executes much more code: AVX2 doesn't have string processing instruction, so there are separate instruction for each comparison and logical operator. However, it executes much faster because it can eat 32 bytes per a step. Moreover, Vlad also did loop unrolling, such that 128 bytes are eaten at a time.

I wrote simple benchmark to learn both the approaches, you can find whole code at GitHub. Benchmark is focused on URI processing, so PicoHTTPParser approach looks as

    static const size_t
    findchar_fast(const char *str, size_t len, const char *ranges,
                  size_t ranges_sz, int *found)
        __m128i ranges16 = _mm_loadu_si128((const __m128i *)ranges); 
        const char *s = str;
        size_t left = len & ~0xf;

        *found = 0;
        do {
            __m128i b16 = _mm_loadu_si128((void *)s);
            int r = _mm_cmpestri(ranges16, ranges_sz, b16, 16,
                                 | _SIDD_CMP_RANGES
                                 | _SIDD_UBYTE_OPS);
                if (r != 16) {
                        *found = 1; 
                        return s - str + r;
                s += 16;
                left -= 16;
        } while (left);

        return s - str;

picohttpparser_findchar_fast(const char *str, size_t len)
    static const unsigned char ranges[] __attribute__((aligned(16))) = 
        "\x00 "         /* control chars and up to SP */ 
        "\"\""          /* 0x22 */ 
        "<<"            /* 0x3c,0x3c */ 
        ">>"            /* 0x3e,0x3e */ 
        "\\\\"          /* 0x5c,0x5c */ 
        "^^"            /* 0x5e,0x5e */ 
        "{}"            /* 0x7b-0x7d */ 
        "\x7f\xff";     /* 0x7f-0xff */ 
        const char *s; 
        size_t n = 0; 

        if (len >= 16) { 
            int found;
            n = findchar_fast(str, len, ranges, sizeof(ranges) - 1,
            if (found)
                return n;
        s = str + n;
        while (s - str < len && uri_a[*s])
        return s - str;

Since ranges are used for PCMPESTRI we have to spend a range for a single character like '<' or '^'. Unfortunately, there are not enough available ranges for us and we have to pass '`' in URI while it is not included in URI specification by RFC.

Code for CloudFlare's approach looks as:

    const __m256i lb = _mm256_set1_epi8(0x1f); /* low bound */
    const __m256i hb = _mm256_set1_epi8(0x7f); /* high bound */
    const __m256i tab = _mm256_set1_epi8(0x09); /* allow TAB */

    /* SPACE <= v */
    __m256i low = _mm256_cmpgt_epi8(v, lb);
    /* SPACE <= v < 0x7f */
    __m256i bounds = _mm256_and_si256(_mm256_cmpgt_epi8(hb, v), low);
    /* SPACE <= v < 0x7f || v == TAB */
    __m256i r = _mm256_or_si256(_mm256_cmpeq_epi8(tab, v), bounds);

    /* Generate bit mask */
    *range = ~_mm256_movemask_epi8(r);

I skip code for 64- and 128-byte processing as well as the functions results merging code. You can find the full code of the approach here. There are too many instructions to handle the simple characters set, so we do only basic verification. The numbers on my Intel Core i7-6500U are:

        str_len     1:     128ms
        str_len     3:     138ms
        str_len    10:     161ms
        str_len    19:     151ms
        str_len    28:     183ms
        str_len   107:     218ms
        str_len   178:     230ms
        str_len  1023:     784ms
        str_len  1500:    1069ms

        str_len     1:     171ms
        str_len     3:     175ms
        str_len    10:     189ms
        str_len    19:     174ms
        str_len    28:     196ms
        str_len   107:     198ms
        str_len   178:     203ms
        str_len  1023:     375ms
        str_len  1500:     458ms

More code is executed in CloudFlare's version, so single character case is much slower than PicoHTTPParser. But AVX2 code show much more stable performance with increasing string length, there are 9 sizes of processed strings from 1 to 1500 bytes.

There is full results of the benchmark. Each string is porcessed 5,000,000 times. To get the results I executed the benchmark 5 times on my laptop with all heavy applications stopped (mail, browsers etc). For each benchmark I selected the best numbers to mitigate impact of some external activity by other processes still leaving in the system. I also used taskset to eliminate rescheduling overhead:

    $ for i in `seq 0 4`; do taskset 0x2 ./str_benchmark > ./b.$i; done

Just to show how bad standard strspn(3) is for checking HTTP character sets I also wrote benchmarks for GLIBC assembly version and Linux kernel naive C implementation. Linux kernel doesn't use assembly for strcasecmp() and strspn() since there are no performance critical strings processing in kernel. But the implementation clearly shows that plain C for the task performs quite poorly. So the numbers are:

    GLIBC strspn():
        str_len     1:     350ms
        str_len     3:     354ms

        str_len    10:     380ms
        str_len    19:     420ms
        str_len    28:     398ms
        str_len   107:     533ms
        str_len   178:     650ms
        str_len  1023:    2071ms
        str_len  1500:    2856ms

    Linux kernel strspn():
        str_len     1:     324ms
        str_len     3:     641ms
        str_len    10:    1865ms
        str_len    19:    3565ms
        str_len    28:    4522ms
        str_len   107:   18851ms
        str_len   178:   28575ms
        str_len  1023:  187992ms
        str_len  1500:  273276ms

Even More Fast and Accurate

So we need a better alternative. It must quickly process short strings as well as long and it must accurately verify character sets defined by RFC.

Now Tempesta FW implements AVX2 routines for HTTP specific strings processing. It outperforms both the approaches and provides absolute accuracy in HTTP character sets verification. There are numbers from the same benchmark:

    Tempesta AVX2 constant URI matching:
        str_len     1:     123ms

        str_len     3:     127ms
        str_len    10:     150ms
        str_len    19:     139ms
        str_len    28:     156ms
        str_len   107:     167ms

        str_len   178:     180ms
        str_len  1023:     350ms
        str_len  1500:     433ms

Now lets have a look what's under the hood. The entry point of the algorithm is tfw_match_uri_const(). Firstly the function quickly process very short strings up to 4 bytes:

    if (likely(len <= 4)) {
        switch (len) {
        case 0:
            return 0;
        case 4:
            c3 = uri_a[s[3]];
        case 3:
            c2 = uri_a[s[2]];
        case 2:
            c1 = uri_a[s[1]];
        case 1:
           c0 = uri_a[s[0]];
       return (c0 & c1) == 0 ? c0 : 2 + (c2 ? c2 + c3 : 0);

uri_a is defined as 256-byte constant array with bytes set for ASCII characters acccepted by URI. Previously we used 4 unsigned longs (256 bits) and defined whether a character allowed by set bit

    uri_a[c >> 6] & (1UL << (c & 0x3f))

However, we found that the bit operation does to many operations and simple table lookup outperforms it. It worth to mention that while the 256-byte array wastes 4 cache lines, only 2 of them are frequently used.

Since the function must return exact number of matched symbols, we have to execute heavyweight condition at return statement. The condition doesn't allow us to efficiently check more bytes in plain C, for example 8 bytes.

Next we process large strings in following manner:

    for ( ; unlikely(s + 128 <= end); s += 128) {
        n = match_symbols_mask128_c(__C.URI_BM, s);
        if (n < 128)
            return s - (unsigned char *)str + n;
    if (unlikely(s + 64 <= end)) {
        n = match_symbols_mask64_c(__C.URI_BM, s);
        if (n < 64)
            return s - (unsigned char *)str + n;
        s += 64;
    if (unlikely(s + 32 <= end)) {
        n = match_symbols_mask32_c(__C.URI_BM, s);
        if (n < 32)
            return s - (unsigned char *)str + n;
        s += 32;
    if (unlikely(s + 16 <= end)) {
        n = match_symbols_mask16_c(__C.URI_BM128, s);
        if (n < 16)
            return s - (unsigned char *)str + n;
        s += 16;

The code processes strings longer than 16 bytes. So we have the gap between 4 and 16 bytes which is processed by following code. The code processes string tail as well as short strings. This is why I use unlikely in the conditions above: branch misprediction is super important for short strings, while long strings aren't so sensitive to several penalties. Tail of the string is processed in the same way as GLIBC generic C implementation does it:

    while (s + 4 <= end) {
        c0 = uri_a[s[0]];
        c1 = uri_a[s[1]];
        c2 = uri_a[s[2]];
        c3 = uri_a[s[3]];
        if (!(c0 & c1 & c2 & c3)) {
            n = s - (unsigned char *)str;
            return !(c0 & c1) ? n + c0 : n + 2 + (c2 ? c2 + c3 : 0);
        s += 4;
    c0 = c1 = c2 = 0;
    switch (end - s) {
    case 3:
        c2 = uri_a[s[2]];
    case 2:
        c1 = uri_a[s[1]];
    case 1:
        c0 = uri_a[s[0]];
    n = s - (unsigned char *)str;
    return !(c0 & c1) ? n + c0 : n + 2 + c2;

, just usual loop unrolling.

And now is time for the core of the algorithm. I describe match_symbols_mask32_c() only, all other functions are just straightforward modifications for larger data processing.

    const __m256i ARF = _mm256_setr_epi8(
        0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,

        0, 0, 0, 0, 0, 0, 0, 0, 
        0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80,
        0, 0, 0, 0, 0, 0, 0, 0);
    const __m256i LSH = _mm256_set1_epi8(0xf);      

    static size_t 
    match_symbols_mask32_c(__m256i sm, const char *str)
        __m256i v = _mm256_lddqu_si256((void *)str);

1:      __m256i acbm = _mm256_shuffle_epi8(sm, v);

2:      __m256i acols = _mm256_and_si256(LSH, _mm256_srli_epi16(v, 4));
3:      __m256i arbits = _mm256_shuffle_epi8(ARF, acols);
4:      __m256i sbits = _mm256_and_si256(arbits, acbm);
5:      v = _mm256_cmpeq_epi8(sbits, _mm256_setzero_si256());
6:      unsigned long r = 0xffffffff00000000UL
                          | _mm256_movemask_epi8(v); 

7:      return __tzcnt(r);

(I numbered important lines of the code for later description). sm is specially crafted representation of allowed characters set, so it varies depending on what we're parsing, e.g. URI or particular header value. Meantime ARF and LSH are constants identical for all the matchers. For URI we set sm by

    sm = _mm256_setr_epi8(
        0xb8, 0xfc, 0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c,
        0xb8, 0xfc, 0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c);

To understand the constants lets have a look at ASCII table.

There are 16 rows and sm contains two equal series by 16 bytes: each of them describes positions of valid URI characters in ASCII table rows. Note that 'column' and 'row' can be interchanged depending on the ASCI table representation. Hereafter I describe the logic using the table representation above. The first constant 0xb8 is defined by first row of the table. 'p', 'P' and '@' are valid URI charaters, while '`' isn't. So we encode this as 1011 in binary representation, or 0xb. Next 4 bits we define as 0x8 since '0' only form the next 4 ASCII symbols is accepted as URI character. The next constant 0xfc is defined by the second row and so on.

To generate the constants and uri_a, which I mentioned above, I use following simple program:

    static const unsigned char uri[] =

    unsigned char r[256] = { 0 };
    for (int i = 0; i < sizeof(uri)/sizeof(uri[0]) - 1; ++i)
        r[A[i]] = 1;
    for (
int i = 0; i < 256; ++i)
        printf("%u,%c", r[i], (i & 0xF) == 0xF ? '\n' : ' ');
    for (
int i = 0; i < 16; ++i) {
        unsigned char c;
        for (
c = 0, j = 7; j >= 0; --j)
            c = (c << 1) | r[(j << 4) + i];
        printf("%#x, ", c);

The algorithm does following steps (see line numbers in the code above, Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2 and Intel Intrinsics Guide are excellent references for the intrinsics):
  1. shuffle sm bytes according to input data, using as shuffle control mask. For example if 'p', which is essentially 0x70, is the first character of input data, then 0 (4 least significant bits defining the character row in ASCII table) defines the new position of the first constant in sm, 0xb8. In other words this step says that first character of input data belongs to ASCII table first row and places corresponding bitmap of allowed characters in the row by 0th index.
  2. Next, we build an array of ASCII table columns corresponding to input data. To determine column of our character we just do 'p' >> 4. However, the minimal unit for shift is word, 2 bytes, so we have to use LSH (0xf) mask to clear moved least significant bits of most significant byte of the word.
  3. ARF is a bit mask defining at which column in a ASCII row a character is placed , i.e. 'p' placed at the most right column corresponds to least significant bit 0x1. So at this step we arrange the column bits according to input data.
  4. So now we have two arrays of bitmaps for allowed characters in ASCII row and a column bit for particular characters. And at this line of code we intersect both the arrays by AND determining whether we have allowed character at particular place.
  5. The previous step sets bits somewhere in each byte of the vector. And now we propagates the bit to most significant bit.
  6. Next, we aggregate most significant bits of all bytes of the vector to 32-bit integer. We store the result in 64-bit integer with set most significant bits, such that the next step can correctly count number of set bits in the value.
  7. Finally, we count non-zero bits corresponding to matching URI bits.
For now Tempesta FW determines 7 alphabets accepted by HTTP strings parsing FSM. The alphabets matching is localized by simple to use wrappers.


The next important function, which is one of the hottest spots, is strncasecmp().

GLIBC's __strncasecmp_l_avx() (at least 2.23 version) basically implements following straightforward logic. Firstly, we define constants for 'A' and 'Z' - the characters range to be converted to lower case:

    const __m256i A = _mm256_set1_epi8('A' - 1);
    const __m256i Z = _mm256_set1_epi8('Z' + 1);

And constant for the case conversion:

    const __m256i CASE = _mm256_set1_epi8(0x20);

Next, we load 32 bytes of each input string:

    __m256i v0 = _mm256_lddqu_si256((void *)s0);
    __m256i v1 = _mm256_lddqu_si256((void *)s1);

And determine which characters of each of them we have to convert to lower case:

    __m256i a0 = _mm256_cmpgt_epi8(v0, A);
    __m256i a1 = _mm256_cmpgt_epi8(v1, A);
    __m256i z0 = _mm256_cmpgt_epi8(Z, v0);
    __m256i z1 = _mm256_cmpgt_epi8(Z, v1);
    __m256i cmp_r0 = _mm256_and_si256(a0, z0);
    __m256i cmp_r1 = _mm256_and_si256(a1, z1);

cmp_r defines which characters (vector items) we have to convert to lower case and now we set 0x20 in the positions getting bit masks converting the input strings to lower case:

    __m256i lc0 = _mm256_and_si256(cmp_r0, CASE);
    __m256i lc1 = _mm256_and_si256(cmp_r1, CASE);

The bit masks are used for case conversion by OR operator and finally we can compare both the strings and return zero value if all the characters match or non-zero value otherwise. Note that HTTP parser requires only boolean return value whether the whole strings match or not.

    __m256i vl0 = _mm256_or_si256(v0, lc0);
    __m256i vl1 = _mm256_or_si256(v1, lc1);

    __m256i eq = _mm256_cmpeq_epi8(vl0, vl1);

    return ~_mm256_movemask_epi8(eq);

Actually GLIBC version does the stuff using 16-byte vectors and in fact it's slower than the AVX2 implementation above. Note that we don't care about zero byte in the strings, but rather require strings of equal size. Technically it can be done by defining 3rd argument of the function as min(s1.length, s2.length): in all the cases for HTTP parser we know string lengths since the parser doesn't work with zero-terminated C strings.

    GLIBC strncasecmp():
        str_len     1:     133ms
        str_len     3:     144ms
        str_len    10:     143ms
        str_len    19:     163ms
        str_len    28:     168ms
        str_len   107:     213ms
        str_len   178:     253ms
        str_len  1023:     861ms
        str_len  1500:    1167ms

    AVX2 strncasecmp():
        str_len     1:     127ms
        str_len     3:     131ms
        str_len    10:     178ms
        str_len    19:     206ms
        str_len    28:     235ms
        str_len   107:     199ms
        str_len   178:     254ms
        str_len  1023:     558ms
        str_len  1500:     673ms

I also used very similar optimizations for short stings in plain C as in strspn()-like case above.

Actually we don't need to convert both the strings to lower case. Instead we can do XOR on the strings (i.e. compute the strings "difference") and determine whether the difference is exactly in case:

    __m256i xor = _mm256_xor_si256(v0, v1);
    __m256i lc = _mm256_cmpeq_epi8(xor, CASE);

lc stores positions where the stings differ in case only, i.e. 0x20. However, for example '-' (0x2d) also differs from 'M' (0x4d) for exactly 0x20 and lc also stores the position. To know which positions are in the interest we determine which characters of first string is in ['a', 'z'] range, and we do this for one string only:

    __m256i a = _mm256_set1_epi8('a' - 0x80);
    __m256i D = _mm256_set1_epi8('z' - 'a' + 1 - 0x80);

    __m256i vl0 = _mm256_or_si256(v0, CASE);
    __m256i sub = _mm256_sub_epi8(vl0, a);
    __m256i cmp_r = _mm256_cmpgt_epi8(D, sub);

Here I use 2 tricks from Hacker's Delight. Computing 'a' <= v <= 'z' requires 3 operations since it must be coded as v >= 'a' && v <= 'z'. So Hacker's Delight proposes to replace the expression by v - 'a' < 'z' - 'a' + 1, which is just 2 operations since 'z' - 'a' + 1 is computed at compile time. However, we must use unsigned version of < operator here to be able to employ integer overflow for correct comparison. Meantime x86-64 provides only signed versions of the instruction, so we must use the 2nd trick. The trick is that we can replace unsigned version of the operator by signed operator using 0x80 subtraction from both the arguments. So our expression becomes 'a' - 0x80 < 'z' - 'a' + 1 - 0x80.

Next, we intersect cmp_r with lc and intersect the result with CASE to determine virtual (good) result of XOR over the two strings if they are different in case only:

    __m256i good = _mm256_and_si256(lc, cmp_r);
    __m256i good_xor = _mm256_and_si256(good, CASE);

Since we have result of actual XOR over the strings we can compare it with the virtual (good) XOR result: if they match, then the strings are equal.

    __m256i match = _mm256_xor_si256(good_xor, xor);
    match = _mm256_cmpeq_epi8(match, _mm256_setzero_si256());

    return ~_mm256_movemask_epi8(match);

This time I used vector instructions to process the string tails. To do so I need at least 8 bytes, so I used large switch at begin of the function to be sure that following vector code has at least 8 byte arguments:

    switch (len) {
    case 0:
        return 0;
    case 8:
        c |= lct[s1[7]] ^ lct[s2[7]];
    case 7:
        c |= lct[s1[6]] ^ lct[s2[6]];
    case 6:
        c |= lct[s1[5]] ^ lct[s2[5]];
    case 5:
        c |= lct[s1[4]] ^ lct[s2[4]];
    case 4:
        c |= lct[s1[3]] ^ lct[s2[3]];
    case 3:
        c |= lct[s1[2]] ^ lct[s2[2]];
    case 2:
        c |= lct[s1[1]] ^ lct[s2[1]];
    case 1:
        c |= lct[s1[0]] ^ lct[s2[0]];
        return c;

The switch is cheap since we don't need to calculate complex conditional expression as previously. lct is 256-byte static constant table used for case lowering. Note that GLIBC's tolower(3) is actually slow since it requires far call of __ctype_tolower_loc(). So I used lct table wherever possible.

String tails processing is performed by __stricmp_avx2_tail(), which basically employ the same logic. The function accepts stings shorter than 32 bytes and longer than 8 bytes. If the strings tail after vector processing is shorter than 8 bytes, then we simply move backward to get necessary 8 bytes:

    if (len < 8) {
        i -= 8 - len;
        len = 8;

    return __stricmp_avx2_tail(s1 + i, s2 + i, c);

If the strings have at least 16 bytes, then __stricmp_avx2_tail() executes the same code as above, but using 16 byte registers. The code always leaves 16 bytes of data:

    if (len >= 16) {
        // do the vector processing using 16 byte registers
        if (len == 16 || r)
            return r;
        s1 += len - 16;
        s2 += len - 16;
        len = 16;

So now we have at least 8 bytes and not more than 16 bytes. But we still use 16 byte vector processing. To do so we need to load the data, with some overlapping and we do this in this way:

    v0 = _mm_loadh_pd(v0, (double *)s1);
    v1 = _mm_loadh_pd(v1, (double *)s2);
    v0 = _mm_loadl_pd(v0, (double *)(s1 + len - 8));
    v1 = _mm_loadl_pd(v1, (double *)(s2 + len - 8));

That's not a problem to process some piece of data twice. And now we can execute exactly the same code with 16 byte instructions as above.

Lets have a look how fast the code is:

    AVX2/64bit strncasecmp():
        str_len     1:     121ms
        str_len     3:     132ms
        str_len    10:     166ms
        str_len    19:     194ms
        str_len    28:     227ms
        str_len   107:     189ms
        str_len   178:     236ms
        str_len  1023:     463ms
        str_len  1500:     588ms

And we can run ever faster if we know that one particular argument of strcasecmp() is always in lower case. For example if you compare input data with static strings, then it's trivial to define the static strings in lower case and pass them always by the second argument. There is no sense to use XOR approach for the case and I use simple range calculation using both the Hacker's Delight tricks as above:

    __m256i sub = _mm256_sub_epi8(v0, A);
    __m256i cmp_r = _mm256_cmpgt_epi8(D, sub);
    __m256i lc = _mm256_and_si256(cmp_r, CASE);
    __m256i vl = _mm256_or_si256(v0, lc);
    __m256i eq = _mm256_cmpeq_epi8(vl, v1);

    return ~_mm256_movemask_epi8(eq);

Surely the code outperforms all of the approaches described before:

    AVX2/64bit strncasecmp(), one string case conversion:
        str_len     1:     126ms
        str_len     3:     129ms
        str_len    10:     129ms
        str_len    19:     133ms
        str_len    28:     136ms
        str_len   107:     154ms
        str_len   178:     179ms
        str_len  1023:     310ms
        str_len  1500:     376ms

FPU in Linux Kernel

Tempesta FW is Linux kernel project and using FPU in kernel is not trivial. To do so you must call kernel_fpu_begin() and kernel_fpu_end(), which save the contents of the registers if user mode processes use the FPU. So the using FPU in kernel mode isn't cheap. Tempesta FW processes HTTP in softirq context, just as soon as the packet arrives to NIC. Thus, to mitigate the overhead we made special FPU safe wrapper __tempesta_do_softirq_fpusafe():


         * Switch FPU context once per budget packets to let Tempesta
         * run many vector operations w/o costly FPU switches.
         * Eager FPU must be enabled.



, which is called from do_softirq_own_stack() assembly:

        call __tempesta_do_softirq_fpusafe
        call __do_softirq

So now we do only one FPU context store per softirq shot which can process many packets at once.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.