Hello Boost-Users,

I have a strange problem and a theory about its cause, but I do not really understand the reason and hope somebody can explain it. I am using Boost.Thread to execute some algorithms in parallel. I can post the source code in detail if required.

I have some work that is the same for different data. Concretely: I compute forces on a deformation model of geometrical edges. For n available processors I create n-1 threads. So with, say, 10,000 edges on four processors, 3 times 2,500 edges are processed by boost threads and 2,500 are processed by the main thread. Because this is a worker crew, I join the threads and then the work is finished. All of this is done on Win32 using Visual Studio Express and Boost.

At the moment I use a wrapper structure:

    struct Wrappy
    {
        // Process `count` elements starting at `array`.
        void operator()(EdgeProcessor* array, int count)
        {
            for (int i = 0; i < count; i++)
                array[i].compute();
        }
    };

My main application creates an EdgeProcessor array of 10,000 elements. In the boost::thread constructor I pass a pointer to the first element the thread should compute, and count is 2,500 for every thread. As you can see, everything is straightforward, and it has worked fine in a lot of situations.

Now, EdgeProcessor is a class compiled in a separate DLL (Multithreaded DLL) that makes a lot of calls, some of them recursive. These are just basic C++ calls; no std containers are in use.

The timings are:

- Main application only, no threads: 0.07 sec, with 100% usage of one CPU (I have an i5, so 25% in total).
- All threads (three boost threads plus the main thread): 0.19 sec, more than twice the time. All four CPUs are at 100% usage, so they are all working, but together they need more than twice the time. Yes, I am sure that every thread is working on just 2,500 elements.
- Just one boost::thread, with no processing in the main application: again 0.07 sec.

So as soon as more than one thread is calling the EdgeProcessor, the task runs very slowly.

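For reference, here is a minimal sketch of the worker-crew setup described above. It is not my real code, only an illustration under some assumptions: EdgeProcessor here is a trivial placeholder for the class in the DLL, process_all is a name made up for this post, and std::thread stands in for boost::thread so that the sketch compiles without Boost (boost::thread accepts a callable plus its arguments in the same way).

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical placeholder for the real EdgeProcessor, which lives in the DLL.
struct EdgeProcessor {
    double force = 0.0;
    void compute() { force += 1.0; }  // stand-in for the real force computation
};

// Same functor as in the post: process `count` elements starting at `array`.
struct Wrappy {
    void operator()(EdgeProcessor* array, int count) const {
        for (int i = 0; i < count; ++i)
            array[i].compute();
    }
};

// Split `edges` into `parts` contiguous chunks. parts-1 chunks are handed to
// worker threads; the last chunk (plus any leftover elements) is processed
// on the calling thread. Then all workers are joined.
void process_all(std::vector<EdgeProcessor>& edges, int parts) {
    assert(parts >= 1);
    const int total = static_cast<int>(edges.size());
    const int chunk = total / parts;

    std::vector<std::thread> workers;
    for (int t = 0; t + 1 < parts; ++t)
        workers.emplace_back(Wrappy(), edges.data() + t * chunk, chunk);

    // The calling thread takes everything the workers did not get.
    const int done = (parts - 1) * chunk;
    Wrappy()(edges.data() + done, total - done);

    for (std::thread& w : workers)
        w.join();
}
```

With 10,000 elements and parts = 4 this gives exactly the 3 times 2,500 plus 2,500 split mentioned above. Each thread touches only its own contiguous range of the array, so the sketch itself needs no locking.
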
By the way, the data for the EdgeProcessor is fully independent: every EdgeProcessor has its own data, and there is no overlap and no synchronisation at all.

I assume that there is a really big overhead because of the DLL. Maybe access to the class cannot be done in parallel, or not as fast as in the statically linked case. Maybe there is a hidden synchronisation. Do you have an explanation for this?

I use the same approach in other parts of the application without problems. The only differences are:

a) the use of a class instance that lives in a DLL, and
b) compute() results in a couple of recursive calls (traversing a tree).

I have experience with this kind of thing, but this behaviour is strange. I am sorry if this is a little Win32 / DLL / Visual Studio specific, but I am not sure where else to ask.

Thanks for your ideas!

Simon