High Performance Linux

Sunday, April 17, 2016

x86-64 Wastes TLB Entries

Basically, the TLB caches page table translations. We need a page table because a contiguous virtual memory area can in fact consist of many physical fragments. The page table maps these physical memory fragments, pages, into a virtual address space. x86-64 has a 4-level page table, so when you access a page you in fact perform 5 memory accesses (4 for the page table walk that resolves the virtual address, and the last one is your actual access). To mitigate the performance overhead of page table translations, the TLB is used to cache virtual-to-physical address translations.
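
To make the walk concrete, here is a minimal user-space sketch (not kernel code) of how a 48-bit virtual address is cut into four 9-bit table indexes plus a 12-bit page offset. The struct and function names are made up for illustration; the field names borrow Linux's names for the four levels (PGD, PUD, PMD, PTE).

```c
#include <assert.h>
#include <stdint.h>

/* Indexes used by a 4-level x86-64 page walk for a 48-bit virtual address. */
struct va_parts {
    unsigned pgd;    /* bits 47..39: index into the top-level table */
    unsigned pud;    /* bits 38..30: index into the second-level table */
    unsigned pmd;    /* bits 29..21: index into the third-level table */
    unsigned pte;    /* bits 20..12: index into the last-level table */
    unsigned offset; /* bits 11..0: byte offset inside the 4 KiB page */
};

/* Split a virtual address into the pieces the hardware walker uses. */
static struct va_parts split_va(uint64_t va)
{
    struct va_parts p;

    p.pgd    = (unsigned)((va >> 39) & 0x1ff);
    p.pud    = (unsigned)((va >> 30) & 0x1ff);
    p.pmd    = (unsigned)((va >> 21) & 0x1ff);
    p.pte    = (unsigned)((va >> 12) & 0x1ff);
    p.offset = (unsigned)(va & 0xfff);
    return p;
}

/* Memory accesses for one load when the TLB misses:
   one read per table level plus the data access itself. */
static unsigned accesses_on_tlb_miss(unsigned levels)
{
    assert(levels > 0);
    return levels + 1;
}
```

For example, `split_va(0xffff880000001234)` gives top-level index 272, last-level index 1 and offset 0x234, and `accesses_on_tlb_miss(4)` gives the 5 accesses mentioned above.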

Surprisingly, when the OS boots it maps the whole physical memory into some virtual address space. What matters is that both the physical and the virtual address spaces are contiguous here, and the OS kernel works with exactly this virtual address space. For example, kmalloc() in the Linux kernel returns a pointer from this virtual address space. vmalloc(), however, maps physical pages into a different virtual address range, called the vmalloc area.

Thus kernel addresses can be resolved by simple offsets (see linux/arch/x86/include/asm/page_64.h):

    static inline unsigned long
    __phys_addr_nodebug(unsigned long x)
    {
        unsigned long y = x - __START_KERNEL_map;

        /* use the carry flag to determine if
           x was < __START_KERNEL_map */
        x = y + ((x > y)
                 ? phys_base
                 : (__START_KERNEL_map - PAGE_OFFSET));

        return x;
    }
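
The wrap-around trick is easier to see in a user-space model. The sketch below mirrors the arithmetic above with hard-coded constants. The `MODEL_` values are assumptions (a typical layout for kernels of this era without address randomization), `model_phys_base` is assumed to be zero, and `phys_addr_model()` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants, not taken from any particular kernel config. */
#define MODEL_START_KERNEL_MAP 0xffffffff80000000ULL /* kernel text mapping */
#define MODEL_PAGE_OFFSET      0xffff880000000000ULL /* direct map of all RAM */

/* Assumed: the kernel image sits at its compile-time physical address. */
static const uint64_t model_phys_base = 0;

/* Translate a kernel virtual address to a physical one by offsets only. */
static uint64_t phys_addr_model(uint64_t x)
{
    uint64_t y;

    /* The model only covers the direct map and the kernel text mapping. */
    assert(x >= MODEL_PAGE_OFFSET);

    y = x - MODEL_START_KERNEL_MAP;

    /* If x was below MODEL_START_KERNEL_MAP the subtraction wrapped,
       so y ends up greater than x and the direct-map branch is taken. */
    return y + ((x > y)
                ? model_phys_base
                : (MODEL_START_KERNEL_MAP - MODEL_PAGE_OFFSET));
}
```

Under these assumptions, `phys_addr_model(MODEL_PAGE_OFFSET + 0x1000)` returns 0x1000 and `phys_addr_model(MODEL_START_KERNEL_MAP + 0x200000)` returns 0x200000: two subtractions and a compare, with no table lookup involved.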

Since virtual-to-physical address translation is trivial, why on earth do we need to use the page table and waste invaluable TLB entries on these translations? Unfortunately, x86-64 has nothing like MIPS's direct-mapped kernel space segments. The sad story about x86-64 is that trivial mappings waste TLB entries and require extra memory transfers. The only thing x86-64 does to optimize TLB usage is global mappings, e.g. kernel space, whose entries are not invalidated in the TLB on context switches. But still, when you switch to the kernel, kernel mappings evict your user-space mappings from the TLB. Meanwhile, if your application is memory greedy, syscalls can take a long time due to TLB misses.
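
To put a rough number on the pressure, here is a minimal sketch that counts how many distinct translations are needed to touch every byte of a region at each x86-64 page size. The function and macro names are illustrative, and actual TLB capacities vary by CPU model, so the sketch only counts entries and makes no claim about any specific processor.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4K (1ULL << 12)
#define PAGE_2M (1ULL << 21)
#define PAGE_1G (1ULL << 30)

/* Number of distinct translations (TLB entries) needed to cover
   a page-aligned region of region_bytes with pages of page_size. */
static uint64_t tlb_entries_needed(uint64_t region_bytes, uint64_t page_size)
{
    /* Page sizes are powers of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    /* Round up: a partially used page still costs a whole entry. */
    return (region_bytes + page_size - 1) / page_size;
}
```

Covering 1 GiB of directly mapped memory takes 262144 translations with 4 KiB pages, 512 with 2 MiB pages, and a single one with a 1 GiB page.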


  1. Because of *process* virtual memory mappings and access rights. You need some form of lookup tables to manage these. For kernel space you are probably right because of the bijective mapping, but consider this: the big user of TLB entries is normally the userspace (ignoring large, "abnormal" iptables instances with ~4GB kernel memory footprints).

    1. Yeah, the kernel also has to track page access bits and so on. However, x86-64 should have optimized access to directly mapped pages, but it doesn't do that.

      Meanwhile, the page cache and all network traffic (plenty of skbs) are just the first heavyweight things that come to mind, and they efficiently blow the caches up.

  2. x86-64 has the "large" (1 GiB) memory pages and dedicated TLB for them (although on many models there are only 4 entries). Therefore, the OS kernel may use these 1 GiB pages for direct access to physical memory, to reduce TLB flushing, especially in combination with the "global" bit.

    1. Hi Julius,

      nice to see you here :)

      Yes, sure, the kernel can use huge and gigantic pages, but it doesn't do so. Moreover, some data, e.g. network packets, isn't suitable for huge pages, yet the total size of that data at a particular point in time can be enormous. Just consider a Web server processing 10 Gbps of traffic: at any given time it has a lot of packets in flight, and they are all mapped in the kernel.

    2. It is a pity that the kernel itself does not use large (1 GiB) pages. Even though the additional TLB is limited to four entries, using it would let us completely eliminate thrashing of the regular TLB entries (those related to user space) by the OS kernel. In addition, we could use a smart loop to optimize memory access for operations such as copying large scatter-gather lists of memory blocks, which could completely eliminate repeated reloading of the four special TLB entries. With the regular TLB this is not possible, because thousands of new entries will still arrive in the regular TLB and thrash the entries that were loaded during user code execution.

    3. Maybe I am misunderstanding what you're saying, but the kernel is using linear 1GB mappings (unless you're using KMEMCHECK or DEBUG_PAGEALLOC).

    4. I meant that regular kernel allocations, like slab_alloc or kmalloc, use the buddy allocator, which essentially operates on 4KB pages. If we used gigantic pages, we'd have to rewrite at least the second-level allocators like kmalloc. But, in general, you're right, the kernel is able to work with 1GB and 2MB pages.

    5. I didn't get that note about the buddy allocator, could you elaborate? The buddy allocator, AFAIU, has nothing to do with page mapping, so whether it allocates physical space in 4K pages or not is not important if we access physical addresses through a mapping that uses 1GB pages. Am I missing something?

    6. I only meant that it's not easy to move kernel allocations to gigantic or huge pages, and I referred to the buddy allocator as an example of a subsystem which would have to be heavily reworked in this case.

    7. OK, but what I really didn't get is why we need to move the buddy allocator to 1GB pages if it doesn't affect TLB usage. Isn't it enough to linearly map all physical memory using 1GB pages somewhere in the upper ranges of virtual memory and use those virtual addresses to access memory in the kernel? And AFAIU the buddy allocator can allocate contiguous 2MB ranges, and to allow allocation of 1GB ranges we would only need to fix the MAX_ORDER definition.
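
The block-size arithmetic behind the last few comments can be checked with a small sketch. It assumes 4 KiB base pages and treats a `MAX_ORDER` of 11 (orders 0 through 10 allocatable) as the default; that value is an assumption about kernels of this era and is configurable. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SHIFT   12 /* 4 KiB base pages */
#define ASSUMED_MAX_ORDER 11 /* assumed default: orders 0..10 are allocatable */

/* Smallest buddy order whose block (2^order base pages) holds `bytes`. */
static unsigned buddy_order_for(uint64_t bytes)
{
    unsigned order = 0;

    /* Keep the shift below well inside 64 bits. */
    assert(bytes > 0 && bytes <= ((uint64_t)1 << 62));

    while (((uint64_t)1 << (BASE_PAGE_SHIFT + order)) < bytes)
        order++;
    return order;
}

/* Largest contiguous block available for a given MAX_ORDER value. */
static uint64_t buddy_max_block_bytes(unsigned max_order)
{
    assert(max_order >= 1);
    return (uint64_t)1 << (BASE_PAGE_SHIFT + max_order - 1);
}
```

Under these assumptions a 2 MiB range is an order-9 block and fits, since the largest block is 4 MiB, while a 1 GiB range would be an order-18 block, which is why the comment talks about changing `MAX_ORDER`.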