High Performance Linux: x86-64 Wastes TLB Entries

Sunday, April 17, 2016

x86-64 Wastes TLB Entries

Basically TLB caches page table translations. We need page table since contiguous virtual memory area can in fact consist of many physical fragments. So page table is used to map the physical memory fragments, pages, to some virtual address space. x86-64 has 4-level page table, so if you access a page, then in fact you for 5 memory accesses (4 for page table to resolve the virtual address and the last one is you actual access). To mitigate the performance overhead caused by page table translations TLB is used to cache virtual to physical address translations.

Surprisingly, when OS is loaded it maps whole physical memory to some virtual address space. It's important that the physical and virtual address spaces are both contiguous now. And OS kernel works with exactly this virtual address space. For example, kmalloc() in Linux kernel returns pointer from this virtual address space. However, vmalloc() maps physical pages to some other virtual address space, named vmalloc area.

Thus kernel addresses can be resolved by simple offsets (see linux/arch/x86/include/asm/page_64.h):

static inline unsigned long
__phys_addr_nodebug(unsigned long x)
{
unsigned long y = x - __START_KERNEL_map;

/* use the carry flag to determine if
x was < __START_KERNEL_map */
x = y + ((x > y)
? phys_base
: (__START_KERNEL_map - PAGE_OFFSET));

return x;
}

Since virtual to physical address translation is trivial, why in earth do we need to use page table and waste invaluable TLB entries for the translations? However, there is nothing like MIPS's direct mapped kernel space segments for x86-64. The sad story about x86-64 is that trivial mappings wastes TLD entries and require extra memory transfers. The only one thing which x86-64 does to optimize TLD usage is global address spaces, e.g. kernel space, which are never invalidated in TLB on context switches. But still if you switch to kernel, kernel mappings evict your user-space mappings from TLB. Meantime, if your application is memory greedy, then syscalls can take long time due to TLB cache misses.

10 comments:

UnknownApril 17, 2016 at 11:07 AM
Because of *process* virtual memory mappings and access rights. You need some form of lookup tables to manage these. For kernel space you are probably right because of the bijective mapping, but consider this: the big user of TLB entries is normally the userspace (ignoring large, "abnormal" iptables instances with ~4GB kernel memory footprints).
ReplyDelete
Replies
UnknownApril 17, 2016 at 5:49 PM
x86-64 has the "large" (1 GiB) memory pages and dedicated TLB for them (although on many models there are only 4 entries). Therefore, the OS kernel may use these 1 GiB pages for direct access to physical memory, to reduce TLB flushing, especially in combination with the "global" bit.
ReplyDelete
Replies

Note: Only a member of this blog may post a comment.