High Performance Linux



> Try Tempesta FW, a high performance open source application delivery controller for the Linux/x86-64 platform.

> Or check custom high-performance solutions from Tempesta Technologies, INC.

> Careers: if you love low-level C/C++ hacking and Linux, we'll be happy to hear from you.


Monday, November 11, 2013

Studying Intel TSX Performance

Lock-free algorithms on atomic operations perfectly work with updating of small data (typically 8 or 16 bytes on modern x86-64 hardware). If you need to update more data, then you have to spin in checking loop to verify whether a particular update is consistent with other concurrent updates.

Suppose you have N source bank accounts and N destination bank accounts. And you need to transfer money from the source accounts to the destination at once. This is classic example for database transaction (usually database books use N = 1). For simplicity we can describe each account by one integer number, so if N = 1, then we can handle the transaction using double CAS (Compare And Swap, cmpxchg16b instruction on x86-64) operation. However, if N is much larger, then it's time to think about Transactional Memory. One year ago I've written about software transactional memory in GCC, but it's quite slow. So now it's time to see at Intel TSX.

The Test Case

Our target is to atomically execute following function:

    void
    trx_func(unsigned long trx_sz)

    {
        for (unsigned i = 0; i < trx_sz; ++i) {
            debit[i] += 1;
            credit[i] += -1;
        }
    }


(we move only one dollar in our example). Intel TSX operates by CPU cache lines (64 bytes for Haswell), so we need to ensure that each transaction reads and modifies only its own cache lines and doesn't affect cache lines of other transactions. So debit and credit could be defined as:

    struct CacheLine {
        long c[L1DSZ / sizeof(long)];
        CacheLine() : c{0} {}

        void
        operator+=(int x)
        {
            c[0] += x;
        }
    } __attribute__((aligned(L1DSZ)));

    CacheLine debit[TRX_BUF_SZ_MAX]

        __attribute__((aligned(L1DSZ)));
    CacheLine credit[TRX_BUF_SZ_MAX]

        __attribute__((aligned(L1DSZ)));

L1DSZ is size of cache line (getconf LEVEL1_DCACHE_LINESIZE). TRX_BUF_SZ_MAX is just some relatively big value, in my case it's 8192, we won't refer to it any more.

To understand TSX performance we need some code which can be compared with TSX transactions. So let's write simple spin lock synchronization:

    void
    execute_spinlock_trx(unsigned long trx_sz)
    {
        pthread_spin_lock(&spin_l);

        trx_func(trx_sz);

        pthread_spin_unlock(&spin_l);
    }


Certainly, the code must be run on many threads on multi core system. I won't show the threading code, you can find it at GitHub (compilation notes are in the header comment of the source code file).

Now let's have a look how we can use Intel TSX to execute trx_func() atomically:

    void
    execute_short_trx(unsigned long trx_sz)
    {
        while (1) {
            unsigned status = _xbegin();

            if (status == _XBEGIN_STARTED) {
                // we're in transactional context

                // Hacky check whether spinlock is locked.
                // See glibc/nptl/sysdeps/x86_64/pthread_spin_unlock.S

                if ((int)spin_l != 1)
                    _xabort(_ABORT_LOCK_BUSY);

                trx_func(trx_sz);

                _xend();

                return;
            }
            

            if (!(status & _XABORT_RETRY)
                && !(status & _XABORT_CONFLICT)
                && !((status & _XABORT_EXPLICIT)
                     && _XABORT_CODE(status) != _ABORT_LOCK_BUSY))
                break;

            _mm_pause();
        }

        // fallback to spinlock.
        execute_spinlock_trx(trx_sz);
    }


_xbegin(), _xend() and _xabort() functions as well as  _ABORT_LOCK_BUSY and _XABORT_* defines are stolen from glibc-2.18 code (nptl/sysdeps/unix/sysv/linux/x86/elision-lock.c, see also Lock Elision in the GNU C Library).

The function was also mostly written using __lll_lock_elision() from glibc-2.18 as an example. The function does following. Firstly, it starts TSX RTM (Restricted Transactional Memory) transaction using _xbegin(). If the transaction is normally started, then status has value _XBEGIN_STARTED and we're going into appropriate if branch. Code in the branch ends with return statement, so we exit function if the transaction is normally commited (using _xend() call). If the transaction aborts due to any reason, then all the changes in the branch are rolled back. Moreover, on rollback status takes different value and we jump to just after _xbegin() and test status again. Thus, the code after if corresponds to aborted transaction.

The function has a fallback path to spin lock. This is a common practise for TSX programming. Andi Kleen wrote nice article about this. Firstly, we check that spin lock is unlocked. This is done in transactional context, so TSX adds lock_l to its read set, so if some other CPU tries to acquire the lock, then it updates lock_l and current transaction aborts. If the lock is acquired, then somebody modifies protected data using the spin lock, so we need to abort the transaction. Next, there is two possibilities: try to execute the transaction again or also, like other CPU, fallback to spin lock.

Just falling back to spin lock it it's already acquired by other CPU gave very poor performance. Imagine that there is 2 CPUs. The first one tries to run transaction, but it aborts due to some reason (aborts are very common for TSX as we'll see bit later) and falls back to spin lock, acquires it and starts to update data. The second CPU also tries to execute transaction and sees that the lock is held by the first CPU, so it also fails back to spin lock. Spin lock is busy, so the second CPU goes to busy loop on it. When the first CPU finishes with its updates, then it releases the lock and the lock immediately acquired by waiting CPU. Now first CPUs tries to run transaction again and finds that the lock is acquired by other CPU, so it also fails back to spin lock... This scenario shows that naive fallback can lead to situation when only spin lock is usedto synchronize data and transactional memory doesn't work at all.

Glibc's __lll_lock_elision() uses adaptive locking algorithm which tries to balance between transaction restartings and fallbacks. We're interested in TSX properties, so our algorithms tries hardly to execute transaction.

On transaction abort processor sets flags which indicate the reason for abort. If _XABORT_RETRY is set, then processor suggests that there is sense to restart transaction. If we abort the transaction explicitly, then _XABORT_EXPLICIT is set. And _XABORT_CONFLICT indicates that there is data conflict with other transaction. In these three cases we restart current transaction. However, transaction can be aborted due to limited system resources (_XABORT_CAPACITY) or other, not for busy lock, explicit abort. So we check the abort code and fallback to spin lock in all other cases.

Test Results

For performance measurements I used Intel Core i7-4650U (dual core 1.70GHz with hyperthreading). The processor has 32KB 8-way Data L1 cache. The system was running Linux 3.12.0-rc6 with patches by Andi Kleen (git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git hsw/pmuX). X server and neworking were down during the tests and no any activity was performed on the machine.

It seems (see the abort tests below and Intel documentation: "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture" and "Intel 64 and IA-32 Architectures Optimization Reference Manual") that TSX transactions abort if data doesn't fit L1 data cache, so all the tests uses very small data set which fits into L1 cache. Sine there is no memory operations or other CPU waiting points, then this is the case to switch HyperThreading off for better performance. My computer doesn't have such BIOS option, so I just use 2 threads binded to physical cores (CPUs 0 and 1):

    $ grep 'processor\|core id' /proc/cpuinfo
    processor       : 0
    core id         : 0
    processor       : 1
    core id         : 1
    processor       : 2
    core id         : 0
    processor       : 3
    core id         : 1


All the tests below were ran for 10M iterations (i.e. iter variable is equal to 10000000).

Aborts on Single-threaded workload

Single-threaded workload shows how TSX transactions work without contention on shared data between CPUs. This testing workload is produced by following lines in main():

    for (int trx_sz = 32; trx_sz <= 1024; trx_sz += 4)
        run_test(1, trx_sz, 1, 0, iter, Sync::TSX);

Figure 1: Dependency of aborts on transaction size (1 CPU)
Dependency of aborts number on transaction size (in cache lines) is depicted on Figure 1 (both the axes are logarithm scaled). Number of aborts (precisely, transaction aborts with clean _XABORT_RETRY bit in status) reaches 100% (10M) at around 256 cache lines. I count aborts number by local integer counter inside transaction abort handling code (please, see execute_short_trx() the source code for details). TSX provides abort code for aborted transaction, so we easily can gather statistics which type of aborts dominate in this workload. Just compile the program with -DABORT_COUNT and run the test case for trx_sz = 256:

    explicit abrt: 0
    retry abrt: 18
    conflict abrt: 18
    capacity abrt: 9969559


Let's check the results with Intel PCM tool (output is reduced for brevity):

    # ./pcm-tsx.x a.out -e RTM_RETIRED.ABORTED_MISC1 -e RTM_RETIRED.ABORTED_MISC2 -e RTM_RETIRED.ABORTED_MISC3 -e RTM_RETIRED.ABORTED_MISC4

    Time elapsed: 10453 ms 

    Event0: RTM_RETIRED.ABORTED_MISC1 Number of times an RTM execution aborted due to various memory events (raw 0x8c9)
    Event1: RTM_RETIRED.ABORTED_MISC2 Number of times an RTM execution aborted due to uncommon conditions (raw 0x10c9) 
    Event2: RTM_RETIRED.ABORTED_MISC3 Number of times an RTM execution aborted due to HLE-unfriendly instructions (raw 0x20c9) 
    Event3: RTM_RETIRED.ABORTED_MISC4 Number of times an RTM execution aborted due to incompatible memory type (raw 0x40c9)

    Core | Event0  | Event1  | Event2  | Event3
       0   9966 K       0         0         0    
 
      1      0         0         0         0    
  
     2      0         0         0         0    
  
    3      0         0         0         0
--------------------------------------------------
  
    *   9966 K       0         0         0      


Figure 2: Dependency of retries on transaction size
 So most of the aborts are caused by capacity problem. 256 * 64 = 16384 and this is a half of L1 data cache. The cache has 8-way associativity, however, it's still unlikely that the transaction work set produces so many address collisions that we can't utilize the cache fully. It is also unlikely that other program data utilizes rest 1 / 2 of the cache. So it seems that transaction size has lower limit even than L1 data cache.

Let's also plot graphs for number of retries and whole test execution time depending on transaction buffer size. Results are shown of Figure 2 and Figure 3 correspondingly.

Figure 3: Dependency of execution time on transaction size
 The time plot also shows significant fluctuation around transaction size 256 cache lines. At transaction size 244 it jumps from 10180ms to 12292ms after which execution time smoothly decreases to 9094ms for transaction size 264 and grows again.

UPD 1: as slotty noticed in the comment below each transaction in trx_func() modifies actually 2 cache lines, for debit and credit updates. The figure was drawn for transactions rather than acual number of modified cache lines by each transaction. So TSX transactions actually are limited by full L1d cache size.

TSX vs Spin Lock: Transaction Time

To run this test case we need to modify our trx_func() in following way:

    void
    trx_func(int thr_id, unsigned long trx_sz, int trx_count)
    {
        for (int c = 0; c < trx_count; c++)

            for (unsigned i = 0; i < trx_sz; ++i) {
                unsigned shift = thr_id * trx_sz + i;
                debit[shift] += 1;
                credit[shift] += -1;

            }
    }

Thus, we just execute the same data updates multiple time, so transaction work set stays the same while transaction time increases. Plus to adding surrounding loop I also added thread ID (0 or 1) to calculation of offset of updated item. This change allows both the CPUs perform always on different cache lines, so there is no data contention. And following lines of source code are responsible for the test:

    for (int trx_count = 1; trx_count <= 201; trx_count += 10)
        run_test(2, 2, trx_count, 0, iter, Sync::TSX);
    for (int trx_count = 1; trx_count <= 201; trx_count += 10)

        run_test(2, 2, trx_count, 0, iter, Sync::SpinLock);

Results for the tests are depicted on Figure 4. So for the short transactions (trx_count < 50) TSX shows better execution time, but on trx_count = 51 spin lock overtakes it.

Figure 4: TSX vs Spin Lock: Transaction Time
This results shows that TSX performs 3 times better (401ms vs 1329ms for trx_count = 1) on small transaction. This is interesting, but how to use this results in practise? I.e. when we should use TSX and when spin lock? In this thread Andi Kleen suggests "For contended conventional locks we usually recommend at least 200ns". This is also "just a number" and real benchmarks for particular workload, which is observed for TSX applicability, must be done.

However, in our case we don't have have data contention, i.e. both the CPUs can perform in parallel. Obviously, spin lock which must be acquired to change any data makes the code singe threaded (only one CPU can update the data at any given time). I expected that TSX should show much better results for the test due to more parallelism, but it isn't so...

To understand the issue let's compare aborts statistics for trx_count = 1 and trx_count = 60. For trx_count = 1 our statistics shows:

    explicit abrt: 28
    retry abrt: 567
    conflict abrt: 589

    capacity abrt: 8

for CPU 0 and

    explicit abrt: 67
    retry abrt: 441
    conflict abrt: 506
    capacity abrt: 3


for CPU 1. Meantime, pcm-tsx reports:

    Core | Event0  | Event1  | Event2  | Event3
       0    596         0        28         0     
       1    508         0        67         0     


Thus we can see that Event 2 with cryptic description "Number of times an RTM execution aborted due to HLE-unfriendly instructions" exactly matches our explicit aborts. Intel TSX has set of instructions which leads to transaction aborts. It seems that the aborts are handled as explicit (this is why we need to check abort code in execute_short_trx()). However, it's unclear why we didn't see the aborts in single threaded workload and Intel documentation with list of the instructions doesn't answer the question. Values for Event 0, "Number of times an RTM execution aborted due to various memory events", are very close to conflict aborts... The corresponding values for trx_count = 60 are:

     explicit abrt: 8524329
     retry abrt: 8538461
     conflict abrt: 8538484
     capacity abrt: 61


for CPU 0 and

    explicit abrt: 8524788
    retry abrt: 8554159
    conflict abrt: 8554179
    capacity abrt: 187


for CPU 1. pcm-tsx says:

    Core | Event0  | Event1  | Event2  | Event3
       0   8538 K       0      8524 K       0     
       1   8554 K       0      8524 K       0     


So the reason for low performance on many iterations inside the transaction is too huge aborts rate. Why do we see so many conflict aborts on uncontended data updates? Actually we have contended data - our spin lock for fallback. If we comment the fallback code (spin lock checking in transaction and acquiring the lock at the end of the function), then we'll see much better picture for trx_count = 60:

    explicit abrt: 0
    retry abrt: 425
    conflict abrt: 425
    capacity abrt: 204


    explicit abrt: 0
    retry abrt: 1886
    conflict abrt: 1886
    capacity abrt: 139


    Core | Event0  | Event1  | Event2  | Event3
       0    629         0         0         0     
       1   2025         0         0         0     


So it seems that spin lock fallback produces two types of aborts at the same time. If we comment out only _xabort(_ABORT_LOCK_BUSY), then we'll see very similar picture - zero Event 2. So Event 2 is exactly our explicit aborts. Intel documentation notes that transactions can abort due to various reasons - it looks like we have these various reasons as Event 0 and conflict & retry aborts.

TSX vs Spin Lock: Transaction Size

Now do the same as for previous test, but vary transaction work set instead of running time. The source code lines for the test are:

    for (int trx_sz = 1; trx_sz <= 256; trx_sz <<= 1)
        run_test(2, trx_sz, 1, 0, iter, Sync::TSX);
    for (int trx_sz = 1; trx_sz <= 256; trx_sz <<= 1)
        run_test(2, trx_sz, 1, 0, iter, Sync::SpinLock);


The test results are depicted on Figure 5 (note that both the axes are logarithm scaled). As for previous test we see very similar picture - TSX outperforms spin locks only for small data sets and loses at already at 32 cache lines.
Figure 5: TSX vs Spin Lock: Transaction Size

64 cache lines is a point at which TSX gets too much aborts (6,4M in camparison with only 7K for 32 cache lines). In discussion on Intel forum Andi Kleen suggested to things to optimize TSX performance:
  • "wait for the lock to become free again before retrying";
  • and "additional randomized backoff can also improve it in some cases".
So I made couple adjustments in execute_short_trx() function. Firstly, if the lock is acquired the function initially retries a transaction if it fins the lock acquired, but does it in transactional context. So I added following loop in abort handling part of the function:

    if ((status & _XABORT_EXPLICIT)
         && _XABORT_CODE(status) != _ABORT_LOCK_BUSY)
    {
        while ((int)spin_l != 1)
            _mm_pause();
        continue;
    }


Figure 6: TSX aborts on dual core workload
(the full adjusted code is available on GitHub). So we're spinning in the busy loop in waiting for the spin lock releasing before we restart the transaction. Results are shown of Figure 5 by blue curve - it shows much better time for the point of 64 cache lines (2314ms vs. 3412ms). Some of the other points somewhat better and some of them are somewhat worse.

To implement random fallbacks I used local abort counter abrt for the function (how many aborts happen during this run) and small array abrt_fallback of 64 constant items for the counter values. In the test each thread does 20M iterations and I've seen maximum aborts values also very close to 20M, so transactions have 1 abort in average. Thus I used very small values in abrt_fallback array from 0 to 0x39. To get randomness I intermixed the values. Following code does the "random" fallbacks:

    if (++abrt == abrt_fallback[af]) {
        af = (af + 1) % (sizeof(abrt_fallback)
                         / sizeof(*abrt_fallback));
        break;
    }


where af is global thread local index in the array.

Figure 6 shows how TSX aborts (for basic and the optimized versions) number raises in dual CPU environment (the figure is logarithm scaled on both the axes). Random fallbacks provides the lower abort rate in most cases, however as Figure 5 show it doesn't have the best execution time. So sometimes it's better to have higher abort rates by cost to avoid spin lock fallbacks (note that acquiring spin lock means that transaction on other CPU aborts and likely to try acquire the lock).

TSX vs Spin Lock: Data Overlapping

If we would have the workload as we was observing so far, then we simply could use fine grained spin locks to protect the data for each thread. Moreover, we even could update the data concurrently on different CPUs without any locks at all because using thread identifier thr_id we update different memory locations on different CPUs.

So now it's time to see more realistic example with arbitrary data overlapping. This is where transactional memory can't be easily replaced by fine grained locks.

Again, we need to modify our trx_func() that now it accepts additional parameter overlap and computes shift in following way:

    shift = thr_id * trx_sz + i - overlap * thr_id;

So now we can specify by overlap parameter how many data cells will be overlapped between CPUs. And the testing code is

    for (int overlap = 0; overlap <= 32; overlap++)
        run_test(2, 32, 1, overlap, iter, Sync::TSX);
    for (int overlap = 0; overlap <= 32; overlap++)
        run_test(2, 32, 1, overlap, iter, Sync::SpinLock);


Figure 7: TSX vs Spin Lock:data overlapping
The test was performed for transaction size of 32 cache lines with overlaping from 0 to all 32 cache lines.

Results are depicted on Figure 7. Average value for execution time for TSX is 2811ms and for spin lock is 2631ms.

It's expectable for spin lock that running time won't vary significantly with changing data overlapping - we have only one lock, so there is no difference to modify the same data cells on both the CPUs or completely different sets of cells. However I expected that transactional memory is sensitive to data overlapping, but it isn't so. We've already seen above that even nonoverlapping transactions still produces a lot of conflict aborts. And the same for this test - number of aborts for zero and all overlapping cells are the same, 14%.

UPD 2: Since we use spin lock as a fallback for TSX, then the spinlock can be that conflicting cache line which doesn't allow TSX scale on non-overlapping tests (i.e. the spinlock is the conflicting cache line). So I've ran the same test for TSX overlapping transactions with commented out spin lock fallback code. Unfortunately, it didn't change the curve for TSX on Figure 7.

21 comments:

  1. Not surprised by the performance.. transactional memory isn't really so much about performance, but more about making programming easier..

    One thing to think about, is the case when transactional memory typically does show better performance than locks.. that happens when there are many different shared locations, but two CPUs rarely try to update the same location at the same time.
    One way to test this case would be to create a very large array, then have each CPU randomly choose a cache-line segment from it, which it modifies. For spin lock, it will have to lock the entire array, or else divide the array into chunks and create a separate lock for each chunk. If it locks the entire array, then I would expect transactional memory to always win. But if there is a separate lock for each cache line sized chunk, then I would expect spin locks to win. It might be interesting to see at which size of block between those extremes the two cross over.
    The other thing is that the more CPUs there are, the better I would expect transactional memory to be. That's because locks serialize, and the more CPUs, the longer each has to wait.. it's multiplicative, so 4 CPUs means 3 wait to 1 compute, while 8 CPUs means 7 to 1.. but in contrast, transactional memory only depends on conflicts in accesses (in theory, at least.. I have no idea why all the conflicts were happening in the case of independent locations!).. so, it will only degrade if the greater amount of computation causes a proportionally larger number of conflicts (this depends on the nature of the application, and the nature of those unknown aborts)..

    Thanks for running these experiments and sharing!

    ReplyDelete
    Replies
    1. Hi Sean,

      nice to see you :)

      Figures 4 and 6 depict 2 thread (1 thread per CPU core) for spinlock and TSX. There is no data overlapping. I did exactly what you described - just updated two different (always different) memory locations (parts of huge array) in the threads. And TSX outperforms spinlock only on small transactions.

      There is fundamental problem - we need spinlock fallback. TSX can abort transaction due to many reasons, so we need to fallback to spinlock if a transaction aborted. If we just rerun transaction for all kinds of aborts, then we get very very smal performance. The system can progress so slow than you can think that it hangs... So falling back spinlock is that shared memory area on which we have contention event if we modify always different pieces of large array...

      Delete
  2. I didn't conclude the benchmarks since there are 3 important questions:

    1. We see huge jump of transactional aborts when transaction work set reaches 256 cache lines (Figure 1). This is 16KB which is just only a quarter of L1d cache. The workload is single threaded, running strictly on one CPU core. Also I didn't use HyperThreading. I know that the cache has 8-way associativity, however I used static memory allocations (which are continous in virtual memory and likely continous physically) and it's unlikely that I have too many collisions in cache lines to get only 16KB of available cache memory for the transaction. So how the spike can be explained? Also Figure 3 (dependency of execution time on transaction size) shows strange fluctuation around 256 cache lines. Why execution time firstly quickly raises, then jumps down and after that smoothly increases again? There is no other such fluctuations on the graph.

    2. Test with overlapping data has shown very confusing results: TSX performance doesn't show peformance increasing for transactions with small data contention in comparison with transactions which modify the same data set on different CPUs. Probably I'm wrong with my fallback code and TSX transaction always finds the lock acquired. But I did special steps (retry transaction on _XABORT_RETRY, _XABORT_CONFLICT and _XABORT_EXPLICIT with my abort code events) to retry transaction more times instead of fallback to spin lock. In this scenario I'd expect smaller number of aborts for transactions with no data ovelapping in comparison with transactions with tight contention, but abort rates are the same. Why?

    3. And finally, I was experimenting with fallback condition a lot and results in the post are the best. How can I reduce number of transaction aborts (especially in conflict aborts for non-overlapping or with very small data overlapping transactions) and increase overall transactional concurrency?


    I asked the question on Intel forum (http://software.intel.com/en-us/forums/topic/488911), but there is no answers. So research to be continued... :)

    ReplyDelete
    Replies
    1. The article is really helpful. Regarding to question 1, I think as each transaction needs 2 cacheline (debit, credit), 32K cache can only support 256 transaction size at most in best situation.

      Delete
    2. Slotty, thank you very much for the comment!

      I've conclusions in "Aborts on Single-threaded workload" according to your notice. So actually, TSX transactions are limited by L1d cache size.

      Delete
  3. Regarding Figure 3.
    If I understand your code correctly there is no real work for transactional memory beyond 256 CL( really 512 CL) so fluctuation around this point might be related to branch misspredictions and other burden of complex logic that works when abort probability is high but not 100%

    ReplyDelete
    Replies
    1. Hi Sasha,

      results for figure 3 are produced by first loop:

      for (int trx_sz = 32; trx_sz <= 1024; trx_sz += 4)
      run_test(1, trx_sz, 1, 0, iter, Sync::TSX);

      so the upper limit for trx_sz is 1024, which gives us in trx_func()'s inner loop maximum transaction size as 2048 cache lines.

      Delete
  4. In execute_short_trx function, a "hacky" check is used to decide if the lock is used. Is it necessary to make this check and abort the RTM if the lock is used? And if yes why? Is there a similar way to check if a mutex (rather than a spinlcok) is locked
    I am making the above question because I am testing a program of mine and i have the following problem. The program executes RTM and uses a mutex-lock fallback path if the RTM aborts more than 10 times. But the results i take show violation of coherence
    I have the same program with no fallback path and only RTM and runs fine

    ReplyDelete
  5. Hi George,

    let's consider two threads: one of them is currently running TSX transaction while the other one falls back to spin lock due to some reason (it happens fairly frequently with TSX). Thus we have to synchronize both the threads, i.e. spin lock in my case and mutex in your case must be synchronized with RTM which doesn't know anything about the lock and which operates only with memory locations. To do so we add the spin_lock to read set of the transaction - if it is locked or changed during transaction (after our check but before transaction commit) then transaction aborts.

    I started my TSX learning also from transactions without fallback at all. To make the program safe we need to rerun transactions until there is successful iteration. I found that this way has worse performance than fallback to spin-lock. Moreover, I've seen that the program goes into infinite loop (at least it was running too long), i.e. hardware transactions were aborting and aborting and aborting....

    ReplyDelete
    Replies
    1. Thank you for your answer Alexander,

      I totally agree with you that a fallback path must exist. What I was questioning, was the need for synchronization between a thread using locks and a thread using RTM.

      My first though was that, if the thread using locks reads or writes the critical section then the thread that is using RTM will abort, because it will detect the conflict. However it seems to be that one situation where:
      1)the thread using locks, will acquire the lock and read the read set.
      2)then the thread using RTM will start and successfully commit(as no other thread is reading or writing the critical section at that time), changing the values in the write set
      3)the thread with the locks will continue its calculations and overwrite the write set
      (I hope I was clear with my example)

      So yes, it seems that there must be a synchronization between the two threads.

      As I said before, I am using mutexes but I couldn't find a way to add the mutex in the read set, without forcing the mutex to lock in the case that it was free in the first place... Any idea how I can do that?

      Delete
    2. Yes, your scenario is clear and I agree with it.

      Basically, a lock (spin lock in my case and mutex in yours) is just a memory area, which is updated when the lock is acquired, so we add it to transaction read set to be sure that nobody acquires the lock simultaneously with the transaction. Also we check that the lock isn't acquired because current thread can preempt a thread working under the lock or, as you pointed out, just run concurrently on other CPU.

      Could you rephrase "without forcing the mutex to lock in the case that it was free in the first place..."?

      I just had a quick look at glibc mutex implementation, so probably I'm wrong, but hope the following will be helpful. __pthread_mutex_lock() from glibc2.18/nptl/pthread_mutex_lock.c calls LLL_MUTEX_LOCK(mutex) or LLL_MUTEX_LOCK_ELISION(mutex) and the both operates with pthread_mutex_t.__data.__lock. Also the last one leads to __lll_lock_elision() and the __lock member (passed as futex) is checked in TSX transaction to know whether the lock is busy or not. So probably you should use pthread_mutex_t.__data.__lock in your transactional context to synchronize hardware transaction with mutex.

      Delete
    3. Alexander you where actually right. One can add the instruction bellow if he is using mutex and not spin locks:

      if (pthread_mutex_t.__data.__lock != 0) _xabort ();

      Thank you for all your help and info!

      Delete
    4. You're welcome. I'm glad that the info was useful for you.

      Delete
  6. Hi
    If I run the same code on Intel TSX i4770 for threads ranging from 2 to 8, will it give correct results ? Or is the code written above is specific to dual core processors only.
    i4770 has 4 cores and can run 8 concurrent threads at a time.

    ReplyDelete
  7. Hi Sahila,

    sorry for late reply.

    I'd expect different results running on 4-core CPU - there is higher contention between the cores and/or hardware threads and we saw that TSX is very sensitive to number of concurrent threads...

    However, i7-4650U is the only Haswell CPU with which I made the experiments, so I'd be happy to see the benchmarks numbers for i7-4770 or other Haswell CPUs.

    ReplyDelete
  8. Hey Alexander!

    I try to use RTM instructions, using pthreads in a c file, but it always aborts
    before entering the transaction.
    Have you any idea why this is happening?
    Thank you a lot in advance!


    void *addone(void *arg)
    {
    int i;
    unsigned status_tsx ;

    while(1) {

    status_tsx = _xbegin() ;


    if (status_tsx == _XBEGIN_STARTED ){

    if(pthread_mutex_trylock (&mutexsum) !=0) {

    _xabort(_ABORT_LOCK_BUSY);

    }

    else {

    pthread_mutex_unlock (&mutexsum);

    for (i=0;i<100;i++) sum = sum + 1.0;


    pthread_exit(NULL);

    _xend();

    return;
    }

    }

    if (!(status_tsx & _XABORT_RETRY)
    && !(status_tsx & _XABORT_CONFLICT)
    && !((status_tsx & _XABORT_EXPLICIT)
    && _XABORT_CODE(status_tsx) != _ABORT_LOCK_BUSY))
    break;

    ++retries;

    }

    pthread_mutex_lock (&mutexsum);
    for (i=0;i<100;i++) sum = sum + 1.0;
    pthread_mutex_unlock (&mutexsum);

    pthread_exit(NULL);

    }

    ReplyDelete
    Replies
    1. Hi Giorgio,

      sorry for the delay - there were too many business trips for the last time.

      I have few point about the program and what's possible to try:

      1. it's not clear at which point exactly transactions abort. Does it mean that you just can't start a transaction? How many CPU cores are executing the code concurrently? Is there CPU binding enabled?

      2. you use pthread_mutex_trylock() inside the transaction and the call in fact uses TSX since glib 2.18. TSX should be ok with nested transactions, but I'd try to debug the TSX code from as simple transaction code as possible and increase the code complexity until you get the code point which leads to enormous aborts number.

      3. It seems the code also suffers from 'lemming effect' and probably pause instruction before transaction restart can make the program behave better. Probably you find the recent article from Andi Kleen, https://software.intel.com/en-us/articles/tsx-anti-patterns-in-lock-elision-code, useful.

      Lastly TSX is very self-willed technology. It absolutely doesn't like context switches and large/long transaction code. Also errata was found recently in the technology (http://www.anandtech.com/show/8376/intel-disables-tsx-instructions-erratum-found-in-haswell-haswelleep-broadwelly).

      Delete
    2. pthread_mutex_trylock is not a right way to read the status of the lock. It always writes to the lock creating a conflict that abort transactions.PThreads library does not provide any method to read-(only) the status of the lock. As Alexander stated, newer glibc can support TSX lock elision out of the box. Intel TBB supports speculative locks based on TSX.

      Delete
  9. >This results shows that TSX performs 3 times better (401ms vs 1329ms for trx_count = 1) on small transaction.

    > I expected that TSX should show much better results for the test due to more parallelism, but it isn't so...

    what parallel speedup would you expect on a two core machine?

    ReplyDelete
    Replies
    1. In that test there was no data dependency while spinlock still synchronizes transactions on both the cores, so it has sense to expect nearly double performance gain on updating independent data on two cores in parallel.

      Delete
  10. Is figure 2 correct with that many cache lines ... shouldn't it be the same scale as figure 1?

    ReplyDelete

Note: Only a member of this blog may post a comment.