High Performance Linux

> Try Tempesta FW, a high performance open source application delivery controller for the Linux/x86-64 platform.

> Or check custom high-performance solutions from Tempesta Technologies, INC.

> Careers: if you love low-level C/C++ hacking and Linux, we'll be happy to hear from you.

Sunday, April 17, 2016

x86-64 Wastes TLB Entries

Basically TLB caches page table translations. We need page table since contiguous virtual memory area can in fact consist of many physical fragments. So page table is used to map the physical memory fragments, pages, to some virtual address space. x86-64 has 4-level page table, so if you access a page, then in fact you for 5 memory accesses (4 for page table to resolve the virtual address and the last one is you actual access). To mitigate the performance overhead caused by page table translations TLB is used to cache virtual to physical address translations.

Surprisingly, when OS is loaded it maps whole physical memory to some virtual address space. It's important that the physical and virtual address spaces are both contiguous now. And OS kernel works with exactly this virtual address space. For example, kmalloc() in Linux kernel returns pointer from this virtual address space. However, vmalloc() maps physical pages to some other virtual address space, named vmalloc area.

Thus kernel addresses can be resolved by simple offsets (see linux/arch/x86/include/asm/page_64.h):

    static inline unsigned long
    __phys_addr_nodebug(unsigned long x)
        unsigned long y = x - __START_KERNEL_map;

        /* use the carry flag to determine if
           x was < __START_KERNEL_map */
        x = y + ((x > y)
                 ? phys_base
                 : (__START_KERNEL_map - PAGE_OFFSET));    

        return x;

Since virtual to physical address translation is trivial, why in earth do we need to use page table and waste invaluable TLB entries for the translations? However, there is nothing like MIPS's direct mapped kernel space segments for x86-64. The sad story about x86-64 is that trivial mappings wastes TLD entries and require extra memory transfers. The only one thing which x86-64 does to optimize TLD usage is global address spaces, e.g. kernel space, which are never invalidated in TLB on context switches. But still if you switch to kernel, kernel mappings evict your user-space mappings from TLB. Meantime, if your application is memory greedy, then syscalls can take long time due to TLB cache misses.