very poor thread performance

I have been testing thread performance on Linux and Mac. My Linux system has two dual-core processors and my Mac has one dual-core processor. Both are Intel chips. For the code snippet given below, the execution time should ideally decrease as the number of threads increases. However, the opposite trend is observed. For example, using -O3 flags on my Linux desktop produces the following timings:

1 thread:  0.66 sec
2 threads: 0.9 sec
3 threads: 1.2 sec
4 threads: 1.4 sec

I do not have a lot of experience with threads, and was wondering if this result surprises anyone?

James

-----
#include <boost/thread.hpp>
#include <cassert>
#include <iostream>
#include <vector>
#include <time.h>

struct MyStruct {
  explicit MyStruct(const int i) : tag(i) {}
  void operator()() const {
    const int n = 100;
    std::vector<int> nums(n,0);
    for( int j=0; j<1000000; ++j )
      for( int i=0; i<n; ++i )
        nums[i] = i+tag;
  }
private:
  int tag;
};

int main() {
  using namespace std;
  const int nTasks = 12;
  const int nThreads = 4;
  assert( nTasks%nThreads == 0 );
  assert( nThreads<=nTasks );
  cout << "Executing " << nTasks << " tasks using " << nThreads << " threads." << endl;

  time_t t1 = clock();
  for( int itask=0; itask<nTasks; ++itask ){
    boost::thread_group threads;
    for( int i=0; i<nThreads; ++i ){
      threads.create_thread( MyStruct(itask++ + 100) );
    }
    threads.join_all();
  }
  time_t t2 = clock();
  cout << "time: " << difftime( t2, t1 )/CLOCKS_PER_SEC << endl;
}
-----

On Wednesday 14 May 2008 15:31, James Sutherland wrote:
I have been testing thread performance on Linux and Mac. My Linux system has two dual-core processors and my Mac has one dual-core processor. Both are Intel chips.
For the code snippet given below, the execution time should ideally decrease as the number of threads increases. However, the opposite trend is observed. For example, using -O3 flags on my Linux desktop produces the following timings:

1 thread:  0.66 sec
2 threads: 0.9 sec
3 threads: 1.2 sec
4 threads: 1.4 sec
I do not have a lot of experience with threads, and was wondering if this result surprises anyone?
If you print a debug message next to your threads.create_thread() call, you'll see that you are creating more threads than you think you are.

-- Frank

The result has been noticed before. In fact I wrote about it in several posts on c.l.c++.m, and N2486 uses this effect in its first few pages to illustrate a point about allocators. Nevertheless, the result is still surprising!! What is worse is that different compilers and libraries will give you wildly different results for this test, so it is not just a function of Intel. (I have not tried to reproduce the results on anything but Intel.)

It is quite common that completely independent algorithms on a multicore machine will slow each other down -- worse than just running the same algorithms sequentially. Of course we can make programs go faster by parallelism, but the theoretical approach of making many isolated simultaneous tasks clearly does not meet practice as a way to improve performance. So forget about "superlinear" programs. I would just settle for C++ containers being linear.

The result has been noticed before. In fact I wrote about it in several posts on c.l.c++.m, and N2486 uses this effect in its first few pages to illustrate a point about allocators. Nevertheless, the result is still surprising!! What is worse is that different compilers and libraries will give you wildly different results for this test, so it is not just a function of Intel. (I have not tried to reproduce the results on anything but Intel.)
Changing from a std::vector<int> with a single up-front allocation (not push_back) to an int[] improved the speed but not the scalability...

James

Changing from a std::vector<int> with a single up-front allocation (not push_back) to an int[] improved the speed but not the scalability...

Now that you've fixed the measurement problem, give it a try with other containers and other operations. You will find some combination that demonstrates that this does not always scale as you expect.

James Sutherland wrote:
I have been testing thread performance on Linux and Mac. My Linux system has two dual-core processors and my Mac has one dual-core processor. Both are Intel chips.
For the code snippet given below, the execution time should ideally decrease as the number of threads increases. However, the opposite trend is observed. For example, using -O3 flags on my Linux desktop produces the following timings:

1 thread:  0.66 sec
2 threads: 0.9 sec
3 threads: 1.2 sec
4 threads: 1.4 sec
I do not have a lot of experience with threads, and was wondering if this result surprises anyone?
Hi James, Quoting your code out of order:
for( int itask=0; itask<nTasks; ++itask ){
  boost::thread_group threads;
  for( int i=0; i<nThreads; ++i ){
    threads.create_thread( MyStruct(itask++ + 100) );
  }
  threads.join_all();
}
Did you really want the ++itask in the first for()? Isn't it being incremented enough in the create_thread line?
struct MyStruct {
  explicit MyStruct(const int i) : tag(i) {}
  void operator()() const {
    const int n = 100;
    std::vector<int> nums(n,0);
    for( int j=0; j<1000000; ++j )
      for( int i=0; i<n; ++i )
        nums[i] = i+tag;
  }
private:
  int tag;
};
So sizeof(MyStruct)==sizeof(int) [for the tag]. Now, if you were creating the MyStruct objects like this:

MyStruct my_structs[n];

then I would say that they are all sharing a cache line, and that cache line is being fought over by the different processors when they read tag, and that you should add some padding. But you're not; you're passing a temporary MyStruct to create_thread, which presumably stores a copy of it.

How does boost::thread_group store the functors that are passed to it? If it is storing them in some sort of array or vector then that could still be the problem - and it could be fixed by adding padding inside Boost.Thread, or by copying the functor onto the new thread's stack.

Also, I would imagine that the compiler would keep tag in a register. What happens if you declare it as const?

I suggest that you try adding some padding and see what happens.

Phil.
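For reference, a minimal sketch of the padding Phil suggests, assuming a 64-byte cache line (typical for Intel chips of this era); the figure 64 and the layout are illustrative, not measured:

-----
#include <vector>

// Pad the functor out to a full cache line so that two copies sitting
// next to each other in memory cannot share a line. If false sharing
// on the tag member were the problem, this version would scale better.
struct PaddedStruct {
  explicit PaddedStruct(const int i) : tag(i) {}
  void operator()() const {
    const int n = 100;
    std::vector<int> nums(n,0);
    for( int j=0; j<1000000; ++j )
      for( int i=0; i<n; ++i )
        nums[i] = i+tag;
  }
private:
  int tag;
  char pad[64 - sizeof(int)];  // assumed 64-byte cache line
};
-----

(As James reports below, padding turns out to have no effect here, which points away from false sharing as the cause.)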

Phil, Thank you for your suggestions - I am hunting in the dark here...
for( int itask=0; itask<nTasks; ++itask ){
  boost::thread_group threads;
  for( int i=0; i<nThreads; ++i ){
    threads.create_thread( MyStruct(itask++ + 100) );
  }
  threads.join_all();
}
Did you really want the ++itask in the first for()? Isn't it being incremented enough in the create_thread line?
This was intentional. This highly contrived example creates a fixed amount of work (nTasks) and divides it up among nThreads threads. As a thread takes a task, I increment itask. Strange, I know...
struct MyStruct {
  explicit MyStruct(const int i) : tag(i) {}
  void operator()() const {
    const int n = 100;
    std::vector<int> nums(n,0);
    for( int j=0; j<1000000; ++j )
      for( int i=0; i<n; ++i )
        nums[i] = i+tag;
  }
private:
  int tag;
};
So sizeof(MyStruct)==sizeof(int) [for the tag]. Now, if you were creating the MyStruct objects like this:
MyStruct my_structs[n];
then I would say that they are all sharing a cache line, and that cache line is being fought over by the different processors when they read tag, and that you should add some padding. But you're not; you're passing a temporary MyStruct to create_thread which presumably stores a copy of it. How does boost::thread_group store the functors that are passed to it? If it is storing them in some sort of array or vector then that could still be the problem - and it could be fixed by adding padding inside boost.thread, or by copying the functor onto the new thread's stack.
I tried adding a member double pad[9999] (uninitialized) to MyStruct to increase its size. This had no effect on performance. I think that boost::thread stores the functors by copy on each individual thread, not on the boost::thread_group object.
Also, I would imagine that the compiler would keep tag in a register. What happens if you declare it as const?
I changed it to const and there was no effect on the performance... James

On Wednesday 14 May 2008 16:24, James Sutherland wrote:
Phil, Thank you for your suggestions - I am hunting in the dark here...
for( int itask=0; itask<nTasks; ++itask ){
  boost::thread_group threads;
  for( int i=0; i<nThreads; ++i ){
    threads.create_thread( MyStruct(itask++ + 100) );
  }
  threads.join_all();
}
Did you really want the ++itask in the first for()? Isn't it being incremented enough in the create_thread line?
This was intentional. This highly contrived example creates a fixed amount of work (nTasks) and divides them up among nThreads threads. As a thread takes a task, I increment itask. Strange, I know...
No, if you also increment itask in the outer loop, you are doing fewer tasks when nThreads is smaller. Although, what I said before about creating more threads than you think was wrong (I misunderstood your program).

-- Frank

On Wednesday 14 May 2008 16:36, Frank Mori Hess wrote:
No, if you also increment itask in the outer loop, you are doing fewer tasks when nThreads is smaller. Although, what I said before about creating more threads than you think was wrong (I misunderstood your program).
Also, your program appears to output the CPU time spent on the program, not the real time it took. What I see (after fixing the itask increment):

$ time ./a.out
Executing 12 tasks using 1 threads.
time: 9.64

real 0m9.668s
user 0m9.641s
sys  0m0.004s

$ time ./a.out
Executing 12 tasks using 2 threads.
time: 9.91

real 0m4.991s
user 0m9.909s
sys  0m0.012s

-- Frank

No, if you also increment itask in the outer loop, you are doing fewer tasks when nThreads is smaller. Although, what I said before about creating more threads than you think was wrong (I misunderstood your program).
Also, your program appears to output the cpu time spent on the program, not the real time it took. What I see (after fixing the itask increment):
$time ./a.out Executing 12 tasks using 1 threads. time: 9.64
real 0m9.668s user 0m9.641s sys 0m0.004s
$ time ./a.out Executing 12 tasks using 2 threads. time: 9.91
real 0m4.991s user 0m9.909s sys 0m0.012s
You are right - here is the modified loop structure:

int itask=0;
while( itask<nTasks ){
  boost::thread_group threads;
  for( int i=0; i<nThreads; ++i ){
    threads.create_thread( MyStruct(itask++ + 100) );
  }
  threads.join_all();
}

Any thoughts on what is chewing up the extra time then? Isn't the total time the correct measure here? Is the difference between the "real" time and the "system" time due to the join_all?

James

On Wednesday 14 May 2008 16:47, James Sutherland wrote:
Any thoughts on what is chewing up the extra time then? Isn't the total time the correct measure here? Is the difference between the "real" time and the "system" time due to the join_all?
There is no extra time, or not much. You seem to be misinterpreting the output of the "time" command. The real time is the elapsed time in the real world. The sys plus user time is the time all CPUs spent executing the program in total, which you would expect to remain roughly constant.

If 2 CPUs each spent 1 second running the program in parallel, the real time would be 1 second and the system plus user time would be 2 seconds. If one CPU spent 2 seconds running the program, the real time would be 2 seconds and the system plus user time would still be 2 seconds.

Oh, I also made my version loop 10 times as long, which is why it takes almost 10 seconds.

-- Frank

Any thoughts on what is chewing up the extra time then? Isn't the total time the correct measure here? Is the difference between the "real" time and the "system" time due to the join_all?
There is no extra time, or not much. You seem to be misinterpreting the output of the "time" command. The real time is the elapsed time in the real world. The sys plus user time is the time all CPUs spent executing the program in total, which you would expect to remain roughly constant.
If 2 cpus each spent 1 second running the program in parallel, the real time would be 1 second and the system plus user time would be 2 seconds.
If one CPU spent 2 seconds running the program, the real time would be 2 seconds and the system plus user time would still be 2 seconds.
Oh, I also made my version loop 10 times as long, which is why it takes almost 10 seconds.
Okay, then I am interpreting things correctly, and my internal timer seems to be consistent with the output of the "time" command that you refer to. To restate, the sum of "real" and "system" time is the pertinent measure. The results you posted then indicate that there is no speedup associated with increasing the number of threads. I am now seeing the same thing... So any thoughts as to why there is no speedup? James

On Wednesday 14 May 2008 17:14, James Sutherland wrote:
Okay, then I am interpreting things correctly, and my internal timer seems to be consistent with the output of the "time" command that you refer to. To restate, the sum of "real" and "system" time is the pertinent measure. The results you posted then indicate that there is no speedup associated with increasing the number of threads. I am now seeing the same thing...
No, the "real" time is the pertinent measure. I'm saying the program took 4.991s to complete with 2 threads and 9.668s to complete with 1 thread. You might be less confused if you increase the number of loops in operator() so the program is slow enough you can measure the execution time by watching a clock.

-- Frank

Okay, then I am interpreting things correctly, and my internal timer seems to be consistent with the output of the "time" command that you refer to. To restate, the sum of "real" and "system" time is the pertinent measure. The results you posted then indicate that there is no speedup associated with increasing the number of threads. I am now seeing the same thing...
No, the "real" time is the pertinent measure. I'm saying the program took 4.991s to complete with 2 threads and 9.668s to complete with 1 thread. You might be less confused if you increase the number of loops in operator() so the program is slow enough you can measure the execution time by watching a clock.
Frank, Thank you for your patience with me. I am now understanding this correctly. I did not realize that the clock() function would double-count in multithreaded applications. I took your suggestion of increasing the loop count and that clarified it for me. It appears that I do have scalability after all...

James
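To make the distinction concrete: clock() accumulates CPU time across all threads, so a 2-thread run can report more clock() time than a 1-thread run even though it finishes sooner. A minimal sketch of wall-clock timing on POSIX systems (the workload placeholder is illustrative):

-----
#include <sys/time.h>
#include <iostream>

int main() {
  timeval start, stop;
  gettimeofday( &start, 0 );

  // ... run the threaded workload here ...

  gettimeofday( &stop, 0 );
  // Elapsed wall-clock seconds, independent of how many
  // threads were burning CPU at the same time.
  const double elapsed = (stop.tv_sec  - start.tv_sec )
                       + (stop.tv_usec - start.tv_usec) / 1e6;
  std::cout << "wall time: " << elapsed << " sec" << std::endl;
}
-----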

James Sutherland wrote:
Okay, then I am interpreting things correctly, and my internal timer seems to be consistent with the output of the "time" command that you refer to. To restate, the sum of "real" and "system" time is the pertinent measure. The results you posted then indicate that there is no speedup associated with increasing the number of threads. I am now seeing the same thing...
So any thoughts as to why there is no speedup?
I'm sorry to jump in the middle, but now you've lost me. The way I see it, the measurements going from 1 thread to 2 cut the overall execution time roughly in half. User time is measuring the amount of work performed, and that didn't change. Sorry if I missed something obvious here.

Matt

On Wednesday 14 May 2008 23:14:47, James Sutherland wrote:
Okay, then I am interpreting things correctly, and my internal timer seems to be consistent with the output of the "time" command that you refer to. To restate, the sum of "real" and "system" time is the pertinent measure. The results you posted then indicate that there is no speedup associated with increasing the number of threads. I am now seeing the same thing...
So any thoughts as to why there is no speedup?
real: the real (wall-clock) time it took
user: the CPU time spent in user code
sys:  the CPU time spent in the kernel on the program's behalf

Thus these timings:

$ time ./a.out
Executing 12 tasks using 1 threads.
time: 9.64

real 0m9.668s
user 0m9.641s
sys  0m0.004s

$ time ./a.out
Executing 12 tasks using 2 threads.
time: 9.91

real 0m4.991s
user 0m9.909s
sys  0m0.012s

say that the 2-thread version is about two times faster than the single-threaded version.

Best,
-- Maik

Ran your test on Windows for kicks, msvc9 with full optimizations. Dual-core AMD Athlon X2 4800 (2 x 2.5GHz running in 32-bit compat mode).

1 thread  = 1.50 s
2 threads = 1.05 s
3 threads = 1.30 s
4 threads = 1.55 s

So, somewhat expected results here.

Chris

Hi,

IMHO the problem is that the tasks are too simple and are executed too quickly when compared to thread creation time. A task, or several tasks, that spends less than one second of CPU time is not worth creating a new dedicated thread for. This is actually why thread pools are created. In my experience it is much more efficient to put such short tasks into a queue and execute them sequentially. If you have many such tasks (hundreds or more) and you want to use multiple CPUs, then create several worker threads at application startup that pick the tasks from the queue and execute them (a minimal sketch follows below).

I didn't try to run your example myself, but I've seen things like this happening before. Maybe try changing const int n = 100; to something like const int n = 10000; in MyStruct::operator() and see what happens.

However, it's also possible that I'm missing something :-) and there is indeed some issue with the implementation of Boost.Threads.
-- ________________ ::matus_chochlik
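For illustration, a minimal sketch of the worker-pool approach Matus describes, reusing MyStruct from the original post; the TaskQueue class and its exit-when-empty discipline are assumptions of this sketch, not an existing Boost facility:

-----
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <queue>

// Illustrative task queue: a mutex-protected FIFO of task tags. All
// tasks are pushed before the workers start, so a worker can simply
// exit when it finds the queue empty.
class TaskQueue {
public:
  void push( const int tag ){
    boost::mutex::scoped_lock lock(mutex_);
    tags_.push(tag);
  }
  bool pop( int& tag ){
    boost::mutex::scoped_lock lock(mutex_);
    if( tags_.empty() ) return false;
    tag = tags_.front();
    tags_.pop();
    return true;
  }
private:
  std::queue<int> tags_;
  boost::mutex mutex_;
};

// Each long-lived worker drains the queue; thread-creation cost is
// paid nThreads times in total instead of once per task.
void worker( TaskQueue* queue ){
  int tag;
  while( queue->pop(tag) ){
    MyStruct(tag)();  // reuse the functor from the original example
  }
}

int main(){
  const int nTasks = 12;
  const int nThreads = 4;

  TaskQueue queue;
  for( int itask=0; itask<nTasks; ++itask )
    queue.push( itask + 100 );

  boost::thread_group threads;
  for( int i=0; i<nThreads; ++i )
    threads.create_thread( boost::bind(&worker, &queue) );
  threads.join_all();
}
-----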
participants (8):
- Chris Fairles
- Frank Mori Hess
- James Sutherland
- Lance.Diduck@ubs.com
- Maik Beckmann
- Matt Doyle
- Matus Chochlik
- Phil Endecott