Comments on High Performance Linux: Studying Intel TSX Performance

Fred123232 (2019-10-15 00:09):

Is figure 2 correct with that many cache lines? Shouldn't it be on the same scale as figure 1?

Alexander Krizhanovsky (2015-11-12 13:49):

In that test there was no data dependency, while the spinlock still synchronizes transactions on both cores, so it makes sense to expect a nearly double performance gain when updating independent data on two cores in parallel.

Roman (2015-11-12 07:09):

pthread_mutex_trylock() is not the right way to read the status of the lock: it always writes to the lock, creating a conflict that aborts transactions. The pthreads library does not provide any method to read only the status of the lock. As Alexander stated, newer glibc supports TSX lock elision out of the box, and Intel TBB supports speculative locks based on TSX.

Roman (2015-11-12 06:58):

> This results shows that TSX performs 3 times better (401ms vs 1329ms for trx_count = 1) on small transaction.

> I expected that TSX should show much better results for the test due to more parallelism, but it isn't so...

What parallel speedup would you expect on a two-core machine?
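As a side note on Roman's trylock observation above, here is a minimal sketch of why even a "probe" with trylock conflicts with other threads. The function and variable names are illustrative, not from the post; compile with gcc -mrtm.

    #include <pthread.h>
    #include <immintrin.h>   /* RTM intrinsics */

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

    int lock_was_free(void)
    {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* Even when it succeeds, trylock WRITES the lock word, so the
             * lock's cache line enters the transaction write set and any
             * concurrent access to that line becomes a conflict abort.
             * A plain load would only add the line to the read set. */
            int was_free = (pthread_mutex_trylock(&mtx) == 0);
            if (was_free)
                pthread_mutex_unlock(&mtx);
            _xend();
            return was_free;
        }
        return -1;   /* transaction aborted */
    }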
Alexander Krizhanovsky (2014-11-15 15:35):

Hi Giorgio,

sorry for the delay - there have been too many business trips lately.

I have a few points about the program and what's possible to try:

1. It's not clear at which point exactly the transactions abort. Does it mean that you just can't start a transaction? How many CPU cores are executing the code concurrently? Is CPU binding enabled?

2. You use pthread_mutex_trylock() inside the transaction, and that call in fact uses TSX since glibc 2.18. TSX should be fine with nested transactions, but I'd debug the TSX code starting from as simple a transactional code path as possible and increase the complexity until you find the point that leads to the enormous number of aborts.

3. The code also seems to suffer from the 'lemming effect', and a pause instruction before the transaction restart may make the program behave better. You may find Andi Kleen's recent article useful: https://software.intel.com/en-us/articles/tsx-anti-patterns-in-lock-elision-code

Lastly, TSX is a very self-willed technology. It absolutely doesn't like context switches or large/long transactional code. An erratum was also recently found in the technology (http://www.anandtech.com/show/8376/intel-disables-tsx-instructions-erratum-found-in-haswell-haswelleep-broadwelly).
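As a concrete illustration of point 3 above, a minimal sketch of pausing before a transaction restart. The retry limit is illustrative, not from the post, and the fallback-lock handling discussed further down the thread is omitted here for brevity; compile with -mrtm.

    #include <immintrin.h>   /* _xbegin()/_xend()/_mm_pause() */

    #define MAX_RETRIES 10   /* illustrative limit */

    /* Returns 1 if the transaction committed; 0 if the caller should
     * take the fallback lock instead. */
    int run_with_retries(void (*critical)(void))
    {
        for (int i = 0; i < MAX_RETRIES; ++i) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                critical();
                _xend();
                return 1;
            }
            if (!(status & _XABORT_RETRY))
                break;        /* hardware hints that retrying is pointless */
            _mm_pause();      /* back off so the threads stop aborting each other */
        }
        return 0;
    }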
Anonymous (2014-10-13 08:48):

Hey Alexander!

I try to use RTM instructions with pthreads in a C file, but it always aborts before entering the transaction. Do you have any idea why this is happening? Thank you a lot in advance!

    /* _ABORT_LOCK_BUSY, mutexsum, sum and retries are presumably
     * defined elsewhere in the asker's file */
    void *addone(void *arg)
    {
        int i;
        unsigned status_tsx;

        while (1) {
            status_tsx = _xbegin();

            if (status_tsx == _XBEGIN_STARTED) {
                if (pthread_mutex_trylock(&mutexsum) != 0) {
                    _xabort(_ABORT_LOCK_BUSY);
                } else {
                    pthread_mutex_unlock(&mutexsum);
                    for (i = 0; i < 100; i++)
                        sum = sum + 1.0;

                    pthread_exit(NULL);   /* note: runs before _xend() */
                    _xend();
                    return NULL;
                }
            }

            if (!(status_tsx & _XABORT_RETRY)
                && !(status_tsx & _XABORT_CONFLICT)
                && !((status_tsx & _XABORT_EXPLICIT)
                     && _XABORT_CODE(status_tsx) != _ABORT_LOCK_BUSY))
                break;

            ++retries;
        }

        pthread_mutex_lock(&mutexsum);
        for (i = 0; i < 100; i++)
            sum = sum + 1.0;
        pthread_mutex_unlock(&mutexsum);

        pthread_exit(NULL);
    }
Alexander Krizhanovsky (2014-05-29 10:24):

Hi Sahila,

sorry for the late reply.

I'd expect different results on a 4-core CPU - there is higher contention between the cores and/or hardware threads, and we saw that TSX is very sensitive to the number of concurrent threads...

However, the i7-4650U is the only Haswell CPU on which I ran the experiments, so I'd be happy to see benchmark numbers for the i7-4770 or other Haswell CPUs.
Anonymous (2014-05-19 01:15):

Hi,

if I run the same code on an Intel i7-4770 with TSX, for thread counts ranging from 2 to 8, will it give correct results? Or is the code written above specific to dual-core processors only? The i7-4770 has 4 cores and can run 8 concurrent threads at a time.

Alexander Krizhanovsky (2014-02-14 01:09):

You're welcome. I'm glad the info was useful for you.

Anonymous (2014-02-13 09:38):

Alexander, you were actually right. One can add the check below when using a mutex rather than a spinlock:

    if (mutex.__data.__lock != 0)   /* mutex: a pthread_mutex_t */
        _xabort(0xff);

Thank you for all your help and info!
Alexander Krizhanovsky (2014-01-17 06:11):

Yes, your scenario is clear and I agree with it.

Basically, a lock (a spinlock in my case, a mutex in yours) is just a memory area which is updated when the lock is acquired, so we add it to the transaction read set to be sure that nobody acquires the lock simultaneously with the transaction. We also check that the lock isn't already acquired, because the current thread can preempt a thread working under the lock or, as you pointed out, just run concurrently on another CPU.

Could you rephrase "without forcing the mutex to lock in the case that it was free in the first place..."?

I just had a quick look at the glibc mutex implementation, so I may be wrong, but I hope the following is helpful. __pthread_mutex_lock() from glibc-2.18/nptl/pthread_mutex_lock.c calls LLL_MUTEX_LOCK(mutex) or LLL_MUTEX_LOCK_ELISION(mutex), and both operate on pthread_mutex_t.__data.__lock. The latter also leads to __lll_lock_elision(), where the __lock member (passed as the futex) is checked inside the TSX transaction to learn whether the lock is busy. So you should probably use pthread_mutex_t.__data.__lock in your transactional context to synchronize the hardware transaction with the mutex.
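A minimal sketch pulling together this suggestion and the one-liner a couple of comments above: peek at glibc's internal lock word so the mutex enters the transaction's read set without being written. __data.__lock is a glibc implementation detail, so this is illustrative and non-portable; compile with -mrtm.

    #include <pthread.h>
    #include <immintrin.h>

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static double sum;

    void add_elided(double x)
    {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (mtx.__data.__lock != 0)
                _xabort(0xff);       /* mutex busy: abort explicitly */
            sum += x;                /* critical section */
            _xend();
            return;
        }
        pthread_mutex_lock(&mtx);    /* fallback path */
        sum += x;
        pthread_mutex_unlock(&mtx);
    }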
Anonymous (2014-01-17 05:09):

Thank you for your answer Alexander,

I totally agree with you that a fallback path must exist. What I was questioning was the need for synchronization between a thread using locks and a thread using RTM.

My first thought was that if the thread using locks reads or writes the critical section, then the thread using RTM will abort, because it will detect the conflict. However, there seems to be one situation where:
1) the thread using locks acquires the lock and reads the read set;
2) then the thread using RTM starts and successfully commits (as no other thread is reading or writing the critical section at that time), changing the values in the write set;
3) the thread with the locks continues its calculations and overwrites the write set.
(I hope my example is clear.)

So yes, it seems that there must be synchronization between the two threads.

As I said before, I am using mutexes, but I couldn't find a way to add the mutex to the read set without forcing the mutex to lock in the case that it was free in the first place... Any idea how I can do that?
Alexander Krizhanovsky (2014-01-15 12:34):

Hi George,

let's consider two threads: one of them is currently running a TSX transaction while the other falls back to the spin lock for some reason (this happens fairly frequently with TSX). Thus we have to synchronize the two threads, i.e. the spin lock in my case (the mutex in yours) must be synchronized with RTM, which knows nothing about the lock and operates only on memory locations. To do so we add the spin lock to the read set of the transaction - if it is locked or changed during the transaction (after our check but before the transaction commits), the transaction aborts.

I also started my TSX study with transactions and no fallback at all. To make the program safe we then have to rerun transactions until one succeeds. I found that this performs worse than falling back to a spin lock. Moreover, I've seen the program go into an infinite loop (at least it ran far too long), i.e. the hardware transactions were aborting and aborting and aborting...

Anonymous (2014-01-15 10:08):

In the execute_short_trx function, a "hacky" check is used to decide whether the lock is taken. Is it necessary to make this check and abort the RTM transaction if the lock is taken, and if so, why? Is there a similar way to check whether a mutex (rather than a spinlock) is locked?

I am asking because I am testing a program of mine and have the following problem. The program executes RTM transactions and uses a mutex-lock fallback path if a transaction aborts more than 10 times. But the results show a violation of coherence. The same program with no fallback path, RTM only, runs fine.
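For reference, the general shape of the check being discussed - a sketch under assumed names, not the post's exact execute_short_trx() code; compile with -mrtm. Reading the lock inside the transaction adds it to the read set, so a concurrent lock acquisition aborts the transaction, which is exactly the synchronization with the fallback path that Alexander describes.

    #include <immintrin.h>
    #include <stdatomic.h>

    static atomic_int spin;              /* 0 = free, 1 = held */
    static long shared_counter;

    void update(void)
    {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (atomic_load_explicit(&spin, memory_order_relaxed))
                _xabort(0xff);           /* fallback path active: bail out */
            ++shared_counter;
            _xend();
            return;
        }
        /* fallback: take the spinlock for real */
        while (atomic_exchange_explicit(&spin, 1, memory_order_acquire))
            _mm_pause();
        ++shared_counter;
        atomic_store_explicit(&spin, 0, memory_order_release);
    }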
Alexander Krizhanovsky (2014-01-10 07:01):

Hi Sasha,

the results for figure 3 are produced by the first loop:

    for (int trx_sz = 32; trx_sz <= 1024; trx_sz += 4)
        run_test(1, trx_sz, 1, 0, iter, Sync::TSX);

so the upper limit for trx_sz is 1024, which in trx_func()'s inner loop gives a maximum transaction size of 2048 cache lines.
Sasha (2014-01-09 04:36):

Regarding Figure 3: if I understand your code correctly, there is no real work for transactional memory beyond 256 cache lines (really 512), so the fluctuation around this point might be related to branch mispredictions and the other burden of the complex logic that kicks in when the abort probability is high but not 100%.
Alexander Krizhanovsky (2013-12-12 09:28):

Slotty, thank you very much for the comment!

I've updated the conclusions in "Aborts on Single-threaded workload" according to your notice. So actually, TSX transactions are limited by the L1d cache size.

slotty (2013-12-09 19:54):

The article is really helpful. Regarding question 1: since each transaction entry needs 2 cache lines (debit, credit), a 32K cache can only support a transaction size of 256 at most, in the best case.

Anonymous (2013-11-18 05:49):

Thanks a lot, good job.

Alexander Krizhanovsky (2013-11-15 03:14):

I didn't draw final conclusions from the benchmarks, since three important questions remain:

1. We see a huge jump in transactional aborts when the transaction working set reaches 256 cache lines (Figure 1). This is 16KB, only a quarter of the L1d cache. The workload is single-threaded, running strictly on one CPU core, and I didn't use HyperThreading. I know the cache is 8-way set associative, but I used static memory allocations (contiguous in virtual memory and likely contiguous physically), so it's unlikely that there are enough cache-line collisions to leave only 16KB of cache available to the transaction. So how can the spike be explained? Figure 3 (execution time versus transaction size) also shows a strange fluctuation around 256 cache lines. Why does execution time first rise quickly, then drop, and then increase smoothly again? There is no other such fluctuation on the graph.

2. The test with overlapping data showed very confusing results: TSX shows no performance increase for transactions with small data contention compared with transactions that modify the same data set on different CPUs. Perhaps my fallback code is wrong and the TSX transaction always finds the lock acquired. But I took special steps (retrying the transaction on _XABORT_RETRY, _XABORT_CONFLICT and _XABORT_EXPLICIT with my own abort codes) to retry the transaction more times instead of falling back to the spin lock. In this scenario I'd expect fewer aborts for transactions with no data overlap compared with tightly contending transactions, but the abort rates are the same. Why?

3. Finally, I experimented a lot with the fallback condition, and the results in the post are the best I got. How can I reduce the number of transaction aborts (especially conflict aborts for transactions with no or very small data overlap) and increase overall transactional concurrency?

I asked the question on the Intel forum (http://software.intel.com/en-us/forums/topic/488911), but there are no answers yet. So the research is to be continued... :)
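A back-of-the-envelope check of slotty's arithmetic against question 1, assuming 64-byte cache lines and a 32 KB L1d as on Haswell (the numbers are assumptions stated in the thread, not measured here):

    #include <stdio.h>

    int main(void)
    {
        const int line = 64;            /* assumed bytes per cache line */
        const int l1d  = 32 * 1024;     /* assumed L1d size, bytes */
        /* each transaction entry touches 2 lines (debit + credit) */
        printf("max trx_sz fitting in L1d: %d\n", l1d / (2 * line)); /* 256 */
        return 0;
    }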
Alexander Krizhanovsky (2013-11-15 03:02):

Hi Sean,

nice to see you :)

Figures 4 and 6 depict 2 threads (1 thread per CPU core) for the spinlock and TSX. There is no data overlapping. I did exactly what you described - the threads just update two different (always different) memory locations (parts of a huge array). And TSX outperforms the spinlock only on small transactions.

There is a fundamental problem - we need the spinlock fallback. TSX can abort a transaction for many reasons, so we have to fall back to the spinlock when a transaction aborts. If we just rerun the transaction for all kinds of aborts, we get very poor performance; the system progresses so slowly that it looks hung... So the fallback spinlock is the shared memory area on which we still contend, even if we always modify different pieces of a large array...

Sean (2013-11-12 12:11):

Not surprised by the performance. Transactional memory isn't really so much about performance as about making programming easier.

One thing to think about is the case when transactional memory typically does show better performance than locks: when there are many different shared locations, but two CPUs rarely try to update the same location at the same time.

One way to test this would be to create a very large array, then have each CPU randomly choose a cache-line segment from it, which it modifies. With a spin lock, it would have to lock the entire array, or else divide the array into chunks and create a separate lock for each chunk. If it locks the entire array, then I would expect transactional memory to always win. But if there is a separate lock for each cache-line-sized chunk, then I would expect spin locks to win. It might be interesting to see at which block size between those extremes the two cross over.

The other thing is that the more CPUs there are, the better I would expect transactional memory to be. That's because locks serialize, and the more CPUs, the longer each has to wait. It's multiplicative: 4 CPUs means 3 waiting to 1 computing, while 8 CPUs means 7 to 1. In contrast, transactional memory only depends on conflicts in accesses (in theory, at least - I have no idea why all the conflicts were happening in the case of independent locations!), so it will only degrade if the greater amount of computation causes a proportionally larger number of conflicts (this depends on the nature of the application and of those unknown aborts).

Thanks for running these experiments and sharing!
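A rough sketch of the fine-grained (per-chunk-lock) variant of the experiment Sean proposes; the sizes and names are illustrative. The coarse-grained variant would take one global lock around the same update; the TSX variant would wrap it in a transaction instead.

    #include <pthread.h>
    #include <stdlib.h>

    #define LINE     64                 /* assumed cache-line size, bytes */
    #define N_CHUNKS 65536              /* 4 MB array; illustrative */

    static char array[N_CHUNKS * LINE];
    static pthread_spinlock_t chunk_lock[N_CHUNKS];

    void init_locks(void)
    {
        for (int i = 0; i < N_CHUNKS; ++i)
            pthread_spin_init(&chunk_lock[i], PTHREAD_PROCESS_PRIVATE);
    }

    /* Each thread repeatedly picks a random cache-line-sized chunk and
     * modifies it under that chunk's own lock. */
    void touch_random_chunk(unsigned *seed)
    {
        int c = rand_r(seed) % N_CHUNKS;
        pthread_spin_lock(&chunk_lock[c]);
        for (int i = 0; i < LINE; ++i)
            array[c * LINE + i]++;
        pthread_spin_unlock(&chunk_lock[c]);
    }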