
James Sutherland wrote:
I have been testing thread performance on Linux and Mac. My Linux system has two dual-core processors and my Mac has one dual-core processor. Both are Intel chips.
For the code snippet given below, the execution time should ideally decrease as the number of threads increases. However, the opposite trend is observed. For example, compiling with -O3 on my Linux desktop produces the following timings:

  1 thread:  0.66 sec
  2 threads: 0.9 sec
  3 threads: 1.2 sec
  4 threads: 1.4 sec
I do not have a lot of experience with threads, and was wondering whether this result surprises anyone.
Hi James, Quoting your code out of order:
  for( int itask=0; itask<nTasks; ++itask ){
    boost::thread_group threads;
    for( int i=0; i<nThreads; ++i ){
      threads.create_thread( MyStruct(itask++ + 100) );
    }
    threads.join_all();
  }
Did you really want the ++itask in the first for()? Isn't it being incremented enough in the create_thread line?
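If the intent is for each created thread to get its own consecutive tag, then the increment belongs in only one place. A rough sketch of one way to write that (purely an illustration of the control flow; it assumes the same MyStruct, nTasks and nThreads as in your code, with boost/thread.hpp included):

  int itask = 0;
  while( itask < nTasks ){
    boost::thread_group threads;
    // launch up to nThreads workers, one task (and one tag) per thread
    for( int i=0; i<nThreads && itask<nTasks; ++i, ++itask ){
      threads.create_thread( MyStruct(itask + 100) );
    }
    threads.join_all();
  }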
  struct MyStruct
  {
    explicit MyStruct(const int i) : tag(i) {}
    void operator()() const
    {
      const int n = 100;
      std::vector<int> nums(n,0);
      for( int j=0; j<1000000; ++j )
        for( int i=0; i<n; ++i )
          nums[i] = i+tag;
    }
  private:
    int tag;
  };
So sizeof(MyStruct)==sizeof(int) [for the tag].

Now, if you were creating the MyStruct objects like this:

  MyStruct my_structs[n];

then I would say that they are all sharing a cache line, and that cache line is being fought over by the different processors when they read tag, and that you should add some padding. But you're not; you're passing a temporary MyStruct to create_thread, which presumably stores a copy of it.

How does boost::thread_group store the functors that are passed to it? If it is storing them in some sort of array or vector, then that could still be the problem - and it could be fixed by adding padding inside boost.thread, or by copying the functor onto the new thread's stack.

Also, I would imagine that the compiler would keep tag in a register. What happens if you declare it as const?

I suggest that you try adding some padding and see what happens.

Phil.
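P.S. If you do try the padding experiment, here is a rough sketch of what I have in mind (purely illustrative - it assumes a 64-byte cache line and says nothing about how boost.thread actually stores its functors):

  #include <vector>

  // Hypothetical padded version of MyStruct: the pad member pushes
  // sizeof(MyStructPadded) up to an assumed 64-byte cache line, so that
  // copies stored contiguously (e.g. in a vector) cannot share a line.
  struct MyStructPadded
  {
    explicit MyStructPadded(const int i) : tag(i) {}
    void operator()() const
    {
      const int n = 100;
      std::vector<int> nums(n,0);
      for( int j=0; j<1000000; ++j )
        for( int i=0; i<n; ++i )
          nums[i] = i+tag;
    }
  private:
    int tag;
    char pad[64 - sizeof(int)];  // padding, assuming a 64-byte cache line
  };

It would be passed to create_thread exactly as before; if the timings change, that would point strongly at false sharing.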