Lock-free Multi-producer Multi-consumer Queue on Ring Buffer

Thursday, May 23, 2013

My article "Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer" was published by Linux Journal more than 30 days ago, so now I can post it here.

The work queue has always been one of the hottest spots in server software. Here is how to scale it effectively to a multi-core environment.

I. INTRODUCTION

Nowadays, high-performance server software (e.g. an HTTP accelerator) in most cases runs on multi-core machines. Modern hardware provides 32, 64, or more CPU cores. In such highly concurrent environments, lock contention sometimes hurts overall system performance more than data copying, context switches, and so on. Thus, moving the hottest data structures from a locked to a lock-free design can significantly improve the performance of software running in multi-core environments.

One of the hottest data structures in traditional server software is the work queue, which can see hundreds of thousands of push and pop operations per second from tens of producers and/or consumers.

A work queue is a FIFO data structure that has only two operations: push() and pop(). It usually limits its size so that pop() waits if there are no elements in the queue and push() waits if the queue already contains the maximum allowed number of elements. It is important that many threads can execute pop() and push() operations simultaneously on different CPU cores.

One possible work queue implementation is a ring buffer storing pointers to the queued elements. It has good performance, especially in comparison with the common non-intrusive linked list (which stores copies of the values passed by the user, e.g. std::list). A significant advantage of the ring buffer implementation is that it naturally limits its size — you only need to move the current position in round-robin fashion. A linked list, on the other hand, requires maintaining an additional field for the total queue length: with a linked list, push and pop operations have to modify the queue length in addition to updating the element links, so a lock-free implementation has to take more care of queue consistency.

Different CPU families provide different guarantees for memory operation ordering, and this is critical for lock-free algorithms. In this article we'll concentrate on x86, the most widespread architecture, rather than write generic (but slower) code.

II. NAIVE SYNCHRONIZED QUEUE

First of all, let's define the interface for our queue (I'll use C++11 in the article):

```cpp
template<class T, long Q_SIZE>
class NaiveQueue {
public:
    NaiveQueue();

    void push(T *x);
    T *pop();
};
```

The queue will store T* pointers and hold at most Q_SIZE items.

Let's see how the queue would look in a naive locked implementation. To develop the queue we need an array in which we place our ring buffer. We can define it as:

```cpp
T *ptr_array_[Q_SIZE];
```

Two members of the class, head_ and tail_, will point to the head (the next position to push an element to) and the tail (the next item to pop) of the queue, and should be initialized to zero in the class constructor. We can simplify our operations on the ring buffer by defining the counters as unsigned long. An unsigned long (which is 64 bits long on x86-64) is large enough to handle even millions of operations per second for thousands of years. So tail_ and head_ will be defined as:

```cpp
unsigned long head_;
unsigned long tail_;
```

This way we can access the elements (the same works for head_ and tail_) just by:

```cpp
ptr_array_[tail_++ & Q_MASK]
```

where Q_MASK is defined as:

```cpp
static const unsigned long Q_MASK = Q_SIZE - 1;
```

To get the current position in the array we could compute the remainder of integer division of tail_ by Q_SIZE, but instead we define Q_SIZE as a power of 2 (32768 in our case), so that we can use a bitwise AND between Q_MASK and tail_, which is a bit faster.
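As a quick sanity check (my own snippet, not from the article), you can verify that for a power-of-two Q_SIZE the bitwise AND gives exactly the same slot index as the modulo operation, even after the counter has wrapped around the buffer many times:

```cpp
#include <cassert>

static const unsigned long Q_SIZE = 32768; // must be a power of 2
static const unsigned long Q_MASK = Q_SIZE - 1;

int main()
{
    // i & Q_MASK keeps only the low bits of i, which is exactly
    // i % Q_SIZE when Q_SIZE is a power of two.
    for (unsigned long i = 0; i < 10 * Q_SIZE; ++i)
        assert((i & Q_MASK) == (i % Q_SIZE));

    return 0;
}
```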
Since operations on the queue must wait if there are no elements or if the queue is full, we need two condition variables:

```cpp
std::condition_variable cond_empty_;
std::condition_variable cond_overflow_;
```

to wait for new elements in the queue or for some free space, respectively. Naturally, we also need a mutex to serialize access to the queue:

```cpp
std::mutex mtx_;
```

This way we can write push() and pop() as follows (note that the lambdas capture this so they can read the head_ and tail_ members):

```cpp
void push(T *x)
{
    std::unique_lock<std::mutex> lock(mtx_);

    cond_overflow_.wait(lock, [this]() {
        return tail_ + Q_SIZE > head_;
    });

    ptr_array_[head_++ & Q_MASK] = x;

    cond_empty_.notify_one();
}

T *pop()
{
    std::unique_lock<std::mutex> lock(mtx_);

    cond_empty_.wait(lock, [this]() {
        return tail_ < head_;
    });

    T *x = ptr_array_[tail_++ & Q_MASK];

    cond_overflow_.notify_one();

    return x;
}
```

We perform both operations under an acquired exclusive lock on mtx_. Once the lock is acquired, we can check the current queue state: whether it is empty (so we cannot pop an element) or full (so we cannot push a new element). std::condition_variable::wait() moves the current thread to the sleep state until the specified predicate becomes true. Next we push or pop an element and notify the other side (with the notify_one() call) that we have changed the queue state. Since we add or remove only one element at a time, only one thread waiting for an available element or free slot can make progress, so we notify and wake up only one thread.

The problem with this implementation is that only one thread at a time can modify the queue. Moreover, mutexes and condition variables are expensive — on Linux they are implemented with the futex(2) system call. So each time a thread needs to wait on a mutex or condition variable, it calls futex(2), which reschedules the thread and moves it to a wait queue.

Now let's run a plain test which simply pushes and pops addresses to and from the queue with 16 producers and 16 consumers (please refer to the end of the article for a link to the full source code). On a box with 16 Xeon cores the test took about 7 minutes:

```
# time ./a.out
real    6m59.219s
user    6m21.515s
sys     72m34.177s
```

And strace with the -c and -f options shows that the program spends 99.98% of its time in the futex system call.

III. LOCK-FREE MULTI-PRODUCER MULTI-CONSUMER QUEUE

Fortunately, you do not have to ask the kernel for help with user-space thread synchronization. The CPU (at least on the best-known architectures) provides atomic memory operations and barriers. With the atomic operations you can:

* atomically read a memory operand, modify it, and write it back;
* atomically read a memory operand, compare it with a value, and swap it with another value.

Memory barriers are special assembly instructions, also known as fences. Fences guarantee the order of instruction execution on the local CPU and the order of visibility on other CPUs. Let's consider two instructions, A and B, with no data dependency between them, separated by a fence (let it be mfence, which orders both read and write operations):

```
A
mfence
B
```

The fence guarantees that:

1. compiler optimizations won't move A after the fence or B before the fence;
2. the CPU will execute instructions A and B in order (even though it normally executes instructions out of order);
3. other CPU cores and processor packages, which work on the same bus, will see the result of instruction A before the result of instruction B.
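To make this concrete, here is how both kinds of barrier are typically spelled with GCC on x86. This is a small sketch of mine, not code from the article (the article itself uses only the compiler-only barrier later on):

```cpp
// Compiler-only barrier: forbids the compiler from reordering memory
// accesses across this point, but emits no machine instruction. This is
// the exact barrier used in the queue code later in the article.
static inline void compiler_barrier()
{
    asm volatile("" ::: "memory");
}

// Full memory fence: additionally orders loads and stores on the CPU.
// GCC's __sync_synchronize() generates an equivalent fence portably.
static inline void memory_barrier()
{
    asm volatile("mfence" ::: "memory");
}
```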
For our queue we need to synchronize the access of multiple threads to the head_ and tail_ fields. When you run head_++ (this is an example of an RMW, Read-Modify-Write, operation, since the processor must read the current head_ value, increment it locally and write it back to memory) on two cores, both cores could read the current head_ value simultaneously, increment it, and simultaneously write the new value back, so one increment is lost.

For atomic operations, C++11 provides the std::atomic template, which should eventually replace the current GCC __sync_ intrinsics. Unfortunately, for my compiler (GCC 4.6.3 for x86-64), std::atomic<> methods still generate extra fences regardless of the specified memory order. So I'll use the old GCC intrinsics for atomic operations.

We can atomically read and increment a variable (e.g. our head_) with:

```cpp
__sync_fetch_and_add(&head_, 1);
```

This makes the CPU lock the shared memory location on which it is going to perform an operation (an increment in our case). In a multiprocessor environment, processors communicate with each other to ensure that they all see relevant data; this is known as the cache coherency protocol. Through this protocol a processor can take exclusive ownership of a memory location. However, these communications are not free, so we should use such atomic operations carefully and only when we really need them — otherwise we can hurt performance significantly. Meanwhile, plain aligned read and write operations on memory locations execute atomically and do not require any additional actions (like the 'lock' prefix which makes an instruction run atomically on the x86 architecture).

In our lock-free implementation we're going to abandon the mutex mtx_ and consequently both condition variables. However, we still need to wait if the queue is full on push and empty on pop. For push we would do this with a simple loop, like we did for the locked queue:

```cpp
while (tail_ + Q_SIZE <= head_)
    sched_yield();
```

(Note the <=: the queue is full as soon as head_ runs a whole Q_SIZE ahead of tail_.) sched_yield() just lets another thread run on the current processor. This is the native and fastest way to reschedule the current thread. However, if there is no other thread waiting for an available CPU in the scheduler's run queue, the current thread is immediately scheduled back, so we would always see 100% CPU usage, even if we have no data to process. To cope with this we could use usleep(3) with some small value instead.

Let's look more carefully at what is going on in the loop. First we read the tail_ value, next we read the value of head_, and after that we decide whether to wait or to push an element and move head_ forward. The current thread can be rescheduled at any point during this check and even after it. Let's consider a two-thread scenario:

```
Thread 1                Thread 2

read tail_              read tail_
read head_              read head_
(scheduled)             push an element
push an element
```

If there was only one free slot in the ring buffer, both threads pass the check and we overwrite the pointer to the oldest queued element. We can solve the problem by incrementing the shared head_ before the check and using a temporary local variable (i.e. we reserve the position at which we're going to insert the element and wait until it becomes free):

```cpp
unsigned long tmp_head = __sync_fetch_and_add(&head_, 1);

while (tail_ + Q_SIZE <= tmp_head)
    sched_yield();

ptr_array_[tmp_head & Q_MASK] = x;
```

We can write similar code for pop() (just swap head and tail). However, a problem still remains: two producers can increment head_, check that they have enough space, and both be rescheduled at the same time, just before inserting x. A consumer can wake up at that moment (it sees that head_ has moved forward by two positions) and read a value from the queue which has not been inserted yet.
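As an aside, the lost-increment behavior that motivated __sync_fetch_and_add() at the start of this section is easy to reproduce in isolation. Here is a tiny stand-alone demo of mine (not from the article): two threads increment one counter with a plain ++ (deliberately racy) and another with the locked RMW; the plain counter almost always comes up short:

```cpp
#include <cstdio>
#include <thread>

static const unsigned long N = 10 * 1000 * 1000;

unsigned long plain_cnt = 0;  // incremented with a plain RMW (a data race)
unsigned long atomic_cnt = 0; // incremented with a locked RMW

void worker()
{
    for (unsigned long i = 0; i < N; ++i) {
        ++plain_cnt;                          // increments can be lost
        __sync_fetch_and_add(&atomic_cnt, 1); // never loses an increment
    }
}

int main()
{
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();

    // atomic_cnt is exactly 2*N; plain_cnt is almost certainly smaller.
    printf("plain=%lu atomic=%lu expected=%lu\n",
           plain_cnt, atomic_cnt, 2 * N);
    return 0;
}
```

Compile with g++ -O2 -std=c++0x -pthread.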
Before solving the issue, let's see what picture we have in the case of 2 producers (P1 and P2) and 2 consumers (C1 and C2):

```
              LT                           LH
              |                            |
| _ | _ | _ | x | x | x | x | x | x | x | _ | _ | _ |
              ^   ^                        ^   ^
              |   |                        |   |
              C1  C2                       P1  P2
```

In the picture, '_' denotes free slots and 'x' denotes inserted elements. C1 and C2 are about to read values, and P1 and P2 are about to write elements to the currently free slots.

Let LT be the lowest tail value among all the consumers, stored in the tmp_tail of the slowest consumer, C1 in the picture. Consumer C1 is currently working on the queue at position LT (i.e. it is in the middle of fetching the element). Correspondingly, let LH be the lowest tmp_head value among all the producers. At any given time, no producer may push an element into the slot at position LT (otherwise it could wrap around and overwrite the element the slowest consumer is still reading), and no consumer should try to pop an element at a position equal to or greater than LH (it may not have been written yet). This means that all the producers have to care about the current LT value and all the consumers about the current LH value. So let's introduce two helper class members for LH and LT:

```cpp
volatile unsigned long last_head_;
volatile unsigned long last_tail_;
```

Thus, in the waiting loop above we should check against last_tail_ instead of tail_. We need to update these values from multiple threads, but we're going to do that with plain write operations, without RMW, so the members do not have to be of an atomic type. I simply declared the variables volatile to prevent caching of their values in local processor registers.

Now the question is who should update the last_head_ and last_tail_ values, and when. We expect that in most cases we can perform a push and/or pop operation on the queue without waiting. Thus we can update the two helper variables only when we really need them, i.e. inside the waiting loop. So when a producer realizes that it cannot insert a new element because the last_tail_ value is too small, it falls into the wait loop and tries to update last_tail_. To update it, the thread must inspect the current tmp_tail of every consumer, so we need to make these temporary values visible to other threads. One possible solution is to maintain an array of tmp_tail and tmp_head values with size equal to the number of running threads:

```cpp
struct ThrPos {
    volatile unsigned long head, tail;
};

ThrPos thr_p_[std::max(n_consumers_, n_producers_)];
```

where n_consumers_ is the number of consumers and n_producers_ is the number of producers. We could allocate the array dynamically, but let's leave it statically sized for simplicity for now.

Many threads read the elements of the array, but each element is updated by only one thread, with a plain move instruction (no RMW operation), so you can also use regular reads on these variables. Since the thr_p_ values are used only to limit the movement of the current queue pointers, we initialize them to the maximum allowed value, i.e. we do not limit head_ and tail_ movement until somebody actually pushes to or pops from the queue.

We can find the lowest tail value among all the consumers with the following loop:

```cpp
auto min = tail_;
for (size_t i = 0; i < n_consumers_; ++i) {
    auto tmp_t = thr_p_[i].tail;

    asm volatile("" ::: "memory"); // compiler barrier

    if (tmp_t < min)
        min = tmp_t;
}
```

The temporary variable tmp_t is required here because you cannot atomically compare thr_p_[i].tail with min and update min in one step: between the moment you read the consumer's tail and the moment you compare it with min, the consumer can move the tail.
It can move it only forward, so the check in the while condition remains correct and you won't overwrite live queue elements. But if you didn't use tmp_t and wrote the code like:

```cpp
if (thr_p_[i].tail < min)
    min = thr_p_[i].tail;
```

then the consumer could have a lower tail value while you're comparing it with min, but move it far forward after the comparison is done and just before the assignment, so you would likely end up with an incorrect minimum. I added the compiler barrier, asm volatile("" ::: "memory") (this is a GCC-specific compiler barrier), to make sure that the compiler won't reorder the thr_p_[i].tail access and will read the memory location only once — to load its value into tmp_t.

One important thing about the array is that it must be indexed by the current thread identifier. Since POSIX threads (and consequently C++ threads, which use them) do not use small monotonically increasing values as thread identifiers, we need our own thread wrapping. I'll use the inline thr_pos() method of the queue to access the array elements:

```cpp
ThrPos& thr_pos() const
{
    return thr_p_[ThrId()];
}
```

(You can find an example ThrId() implementation in the source referenced at the end of the article.)

Before writing the final implementation of push() and pop(), let's get back to the original application of our queue, the work queue. Usually, producers and consumers do a lot of work between operations on the queue; for instance, it could be a very slow IO operation. So what happens if one consumer fetches an element from the queue and goes to sleep in a long IO operation? Its tail value will stay the same for a long time and all the producers will wait on it, even if all the other consumers have fully drained the queue. This is not the behavior we want.

Let's fix this in two steps. First, let's assign the maximum allowed value to the per-thread tail pointer just after fetching the element. So we write the following at the end of the pop() method:

```cpp
T *ret = ptr_array_[thr_pos().tail & Q_MASK];
thr_pos().tail = ULONG_MAX;
return ret;
```

Since a producer in push() starts its search for the minimal allowed last_tail_ value from the current value of the global tail_, it can assign the current tail_ value to last_tail_ only when there are no active consumers. This is exactly what we want.

Generally speaking, other processors could see the thr_pos().tail update before the local processor's read from ptr_array_, so they could move past the position and overwrite it in the array before the local processor reads it. This is possible on processors with relaxed memory operation ordering. However, x86 provides relatively strict memory ordering rules; in particular, it guarantees that:

1. stores are not reordered with earlier loads;
2. stores are seen in a consistent order by other processors.

Thus, the load from ptr_array_ and the store to thr_pos().tail in the code above will be executed on x86, and seen by all other processors, in exactly this order. So we don't need any explicit memory barriers here.
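To make the reliance on x86 ordering explicit: on a weakly ordered architecture, the end of pop() would need a fence between the load and the store. A sketch of mine of the portable variant — on x86 the fence is exactly what we just argued we can omit:

```cpp
// Ending of pop() for a CPU with relaxed memory ordering (sketch only).
T *ret = ptr_array_[thr_pos().tail & Q_MASK];

// Ensure the load above completes before other processors can see the
// ULONG_MAX store below; otherwise a producer could overwrite the slot
// before we have read it. On x86 this fence is unnecessary.
__sync_synchronize();

thr_pos().tail = ULONG_MAX;
return ret;
```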
The second step is to correctly set thr_pos().tail at the beginning of pop(). The straightforward way would be to assign it with thr_pos().tail = __sync_fetch_and_add(&tail_, 1) alone. The problem is that this operation is atomic only for the tail_ shift, not for the thr_pos().tail assignment. So there is a time window in which thr_pos().tail == ULONG_MAX while tail_ may already have been shifted significantly by other consumers, so push() would set last_tail_ to the current, just-incremented, tail_. Therefore, when we are going to pop an element, we have to reserve a tail position less than or equal to the tail_ value which we'll pop the element from:

```cpp
thr_pos().tail = tail_;
thr_pos().tail = __sync_fetch_and_add(&tail_, 1);
```

In this code we actually perform the following three operations:

1. write tail_ to thr_pos().tail;
2. increment tail_;
3. write the previous value of tail_ to thr_pos().tail.

Again, in the general case we have no guarantee that other processors will "see" the results of these writes in the same order. Potentially, some other processor could first read the incremented tail_ value, try to find the new last_tail_, and only then read the new current thread tail value. However, __sync_fetch_and_add() executes a locked instruction, which implies an implicit full memory barrier on most architectures (including x86), so neither the first nor the third operation can be moved across the second one. Therefore we can skip explicit memory barriers here as well.

Thus, if the queue is almost full, all producers will stop at or before the position of the element which we're popping.

Now we're ready to write the final implementation of push() and pop(). Here it is:

```cpp
void push(T *ptr)
{
    thr_pos().head = head_;
    thr_pos().head = __sync_fetch_and_add(&head_, 1);

    while (__builtin_expect(thr_pos().head >= last_tail_ + Q_SIZE, 0))
    {
        ::sched_yield();

        auto min = tail_;
        for (size_t i = 0; i < n_consumers_; ++i) {
            auto tmp_t = thr_p_[i].tail;

            asm volatile("" ::: "memory"); // compiler barrier

            if (tmp_t < min)
                min = tmp_t;
        }
        last_tail_ = min;
    }

    ptr_array_[thr_pos().head & Q_MASK] = ptr;
    thr_pos().head = ULONG_MAX;
}

T *pop()
{
    thr_pos().tail = tail_;
    thr_pos().tail = __sync_fetch_and_add(&tail_, 1);

    while (__builtin_expect(thr_pos().tail >= last_head_, 0)) {
        ::sched_yield();

        auto min = head_;
        for (size_t i = 0; i < n_producers_; ++i) {
            auto tmp_h = thr_p_[i].head;

            asm volatile("" ::: "memory"); // compiler barrier

            if (tmp_h < min)
                min = tmp_h;
        }
        last_head_ = min;
    }

    T *ret = ptr_array_[thr_pos().tail & Q_MASK];
    thr_pos().tail = ULONG_MAX;
    return ret;
}
```

A careful reader will notice that multiple threads can scan the current head or tail values over all the producing or consuming threads at once. So several threads can find different min values and try to write them to last_head_ or last_tail_ simultaneously, so you would probably want to use a CAS operation here. However, atomic CAS is expensive, and the worst that can happen here is that you assign a too-small value to last_head_ or last_tail_ — or even overwrite a newer, higher value with a smaller old one — in which case you just fall into sched_yield() again. We may fall into sched_yield() more often than if we used a synchronized CAS operation, but in practice the cost of the extra atomic operation reduces performance.

Also, I used __builtin_expect with a zero expect argument to say that we do not expect the condition in the while statement to become true too often, so the compiler should move the inner loop code after the code executed when the condition is false. This way we can improve instruction cache usage.

Finally, let's run the same test as for the naive queue:

```
# time ./a.out
real    1m53.566s
user    27m55.784s
sys     2m4.461s
```

This is 3.7 times faster than our naive queue implementation!
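For reference, the test driving these numbers boils down to something like the sketch below. This is my reconstruction, not the article's exact harness (which also verifies correctness and assigns per-thread identifiers for ThrId(); see the source linked below) — the run_test name and the idea of pushing integer "addresses" are illustrative:

```cpp
#include <thread>
#include <vector>

static const size_t PRODUCERS = 16;
static const size_t CONSUMERS = 16;
static const size_t N = 10 * 1000 * 1000; // pushes per producer

// Works with any queue exposing push(T *) and T *pop(), e.g. the
// NaiveQueue above or the lock-free queue just shown.
template<class Q>
void run_test(Q &q)
{
    std::vector<std::thread> thrs;

    for (size_t i = 0; i < PRODUCERS; ++i)
        thrs.emplace_back([&q]() {
            for (size_t n = 1; n <= N; ++n)
                q.push((int *)n); // push arbitrary non-NULL "addresses"
        });

    for (size_t i = 0; i < CONSUMERS; ++i)
        thrs.emplace_back([&q]() {
            // consumers collectively pop everything the producers push
            for (size_t n = 0; n < N * PRODUCERS / CONSUMERS; ++n)
                q.pop();
        });

    for (size_t i = 0; i < thrs.size(); ++i)
        thrs[i].join();
}
```

A naive-queue run is then just NaiveQueue<int, 32768> q; run_test(q); timed with time(1), as shown above.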
IV. CONCLUSION

Nowadays, high performance computing is typically achieved in two ways: horizontal scaling (scale-out), by adding new computational nodes, and vertical scaling (scale-up), by adding extra computational resources (like CPUs or memory) to a single node. Unfortunately, linear scaling is possible only in theory. In practice, if you double your computational resources, you are likely to get only a 30-60% performance gain. Lock contention is one of the problems preventing efficient scale-up by adding extra CPUs. Lock-free algorithms make scale-up more productive and allow you to get more performance from multi-core environments.

The code for the naive and lock-free queue implementations, with correctness tests, is available at:

https://github.com/krizhanovsky/NatSys-Lab/blob/master/lockfree_rb_q.cc

Alexander Krizhanovsky is the software architect and founder of NatSys-Lab. Before NatSys-Lab he worked as a Senior Software Developer at IBM, Yandex and Parallels. He specializes in high performance solutions for UNIX environments.

Special thanks to Johann George from SanDisk for the final review of this paper.