Thread lib (good reason for static)

[This is from an email about the impact of no longer having a static Threads lib available. I wanted to post it here to see if anyone else has similar experiences, since so far I have only seen posts about the problem of distributing RTL dlls, not about the impact on computation time.]
Hello,
I was reading some of the posts about your thread implementation and the removal of the static RTL option. I understand the technical reason for this, but I wanted to submit another compelling reason (for me) for static RTL linkage, even if this means losing some of the features under win32.
Some of the software I develop deals with different types of analysis (such as scientific and financial). The runs can take as much as 20 hours to complete. If I do nothing but change to link with the dynamic RTL, the run time increases by 35% on average. This means an additional 7 hours of run time! So dynamic RTL is really undesirable.
Do you see any way of making a more limited static Boost.Threads available in the next Boost release? [Above I say "limited" because some of the features of the Thread lib must be implemented via dll to work under win32.]
Peter

Some of the software I develop deals with different types of analysis (such as scientific and financial). The runs can take as much as 20 hours to complete. If I do nothing but change to link with the dynamic RTL, the run time increases by 35% on average. This means an additional 7 hours of run time! So dynamic RTL is really undesirable.
You didn't mention what OS/compiler you are using. We've found that (on Linux with gcc) there is a difference in run-time between shared library with -fpic turned on, and static libraries. However, when we do not compile with -fpic, the difference goes away. According to some posts on gcc, -fpic is not needed on Linux, as the loader takes care of the translation. It can increase memory usage, and will increase application load time, but it will work: http://gcc.gnu.org/ml/gcc/2000-06/msg00814.html Our speed ups are not always on the order of 35%, but we have reached that at times. Perhaps trying a build without -fpic, but still shared will show you the same benefit. If that's the case, then we might not need a different release of the Threads library. TJ -- Trey Jackson tjackson@ichips.intel.com "Ripley, she doesn't have bad dreams because she's just a piece of plastic." -- Newt

I am using WinXP (actually, a cluster of up to 25 machines) with .NET 7.1. Occasionally on Linux, but the XP cluster is the primary target.
Also, I remember reading somewhere that optimizers cannot do some optimizations on dlls.
I would be quite happy with very basic threading (i.e. only that offered in the thread class) as a static lib. The rest is certainly not essential.
Some of the other posts mentioned not liking RTL dlls because of deployment headaches (in high volume).
Boost seems to be such a nice effort, it would be a real shame if they decide to not support high end users. As it stands now, Threads not offering a static subset means we cannot use it, because the computational hit is enormous on XP platforms.
I will try the suggestion below to see the difference on Linux, but if it is true that some optimizations are not available on shared libs, it may still be problematic.
Peter
Trey Jackson <tjackson@ichips.intel.com> wrote:
Some of the software I develop deals with different types of analysis (such as scientific and financial). The runs can take as much as 20 hours to complete. If I do nothing but change to link with the dynamic RTL, the run time increases by 35% on average. This means an additional 7 hours of run time! So dynamic RTL is really undesirable.
You didn't mention what OS/compiler you are using.

"Peter Danford" <pdanford_qed@yahoo.com> wrote in message news:20040601230327.7391.qmail@web60603.mail.yahoo.com...
I am using WinXP (actually, a cluster of up to 25 machines) with .NET 7.1. Occasionally on Linux, but the XP cluster is the primary target.
Also, I remember reading somewhere that optimizers cannot do some optimizations on dlls.
I would be quite happy with very basic threading (i.e. only that offered in the thread class) as a static lib. The rest is certainly not essential.
Some of the other posts mentioned not liking RTL dlls because of deployment headaches (in high volume).
Boost seems to be such a nice effort, it would be a real shame if they decide to not support high end users. As it stands now, Threads not offering a static subset means we cannot use it, because the computational hit is enormous on XP platforms.
I will try the suggestion below to see the difference on Linux, but if it is true that some optimizations are not available on shared libs, it may still be problematic.
Peter
Trey Jackson <tjackson@ichips.intel.com> wrote:
Some of the software I develop deals with different types of analysis (such as scientific and financial). The runs can take as much as 20 hours to complete. If I do nothing but change to link with the dynamic RTL, the run time increases by 35% on average. This means an additional 7 hours of run time! So dynamic RTL is really undesirable.
You didn't mention what OS/compiler you are using.
I'm working on supporting static linking in Boost.Threads. Have you determined why the statically linked version runs so much faster?
Mike

I'm not sure, other than the reasons regarding optimizations possible with statically linked libs. I know that there is also some overhead involved in calling dll functions as opposed to their static counterparts, but this surely is a small cost. Other than that, I am not sure.
I am very glad to hear that you are working on the static problem though. The latest checkout from the thread_dev branch revealed the use of thread specific storage in thread.cpp, so it looks like the latest work there may be making it more difficult to make a static subset...
Peter
Michael Glassford <glassfordm@hotmail.com> wrote:
I'm working on supporting static linking in Boost.Threads. Have you determined why the statically linked version runs so much faster?
Mike

Hi, "Peter Danford" <pdanford_qed@yahoo.com> wrote in message news:20040602024428.45768.qmail@web60606.mail.yahoo.com...
I'm not sure other than the reasons regarding optimizations possible with statically linked libs. I know that there is also some overhead involved in calling dll functions as opposed to the static counterpart, but this surely is a small cost. Other than that, I am not sure.
As you are running VS.NET 2003, you could try the (free) community edition of Compuware's profiler, downloadable from: http://www.compuware.com/products/devpartner/profiler/default.asp
Instrumenting your application (and boost.thread) should help you to find out the reason for the performance degradation when using the runtime linked version.
And, yes, I second the opinion to make a statically linked version available - even without or with limited support for TSS/TLS.
[following is windows-specific stuff]
I seem to recall that the reason for requiring a dynamically linked version was solely for automatically cleaning up thread_specific_ptr's at thread exit - is that right?
I've in the past made a solution based on lazy cleanup of previously allocated TLS data - i.e. when someone attempts to alloc a TLS (aka TSS) slot/data, the currently invalid data (belonging to an "ex-thread") is deleted. This requires some metadata to be globally available to all threads, and that this data is tagged with the owning thread's id, but is certainly doable.
It could also be possible to expose a manual TLS-data "garbage-collection" routine, e.g. "collect_tss_data", for the users who really need it. Not beautiful, but considering the options ...
There are a few potential problems though (off the top of my head):
1. Performance - each creation of thread-specific data requires exclusive access to the metadata. Whether this is a real issue or not depends on the real-world scenario. Developers sensitive to this should anyway try to create only one thread_specific_ptr to host all their TSS data. Also, TLS slots are a limited resource (especially under earlier NT versions). Or is this automatically managed by the current TSS implementation?
2. Operating system's reuse of thread ids. This could (in theory) cause data to hang around longer than necessary, but should otherwise not cause problems, as each thread should only be able to access data it created itself.
3. Destruction-time for TSS data would be indeterminate (unless explicitly clearing things up).
HTH
// Johan
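The lazy cleanup idea described above could be sketched roughly as follows, in portable C++ rather than Win32. Everything here is an assumption made for illustration: the names tss_registry, thread_id and is_alive are hypothetical and not part of Boost.Threads, and is_alive stands in for whatever liveness check the platform offers.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for an OS thread id.
using thread_id = unsigned long;

// Global metadata: cleanup actions tagged with the owning thread's id.
class tss_registry {
public:
    // 'is_alive' is an assumed hook that reports whether a thread still exists.
    explicit tss_registry(std::function<bool(thread_id)> is_alive)
        : is_alive_(std::move(is_alive)) {}

    // Register data for 'owner'. Before doing so, lazily collect the data
    // of any thread that is no longer alive.
    void add(thread_id owner, std::function<void()> cleanup) {
        std::lock_guard<std::mutex> lock(mutex_);  // exclusive access (problem 1)
        collect_locked();
        entries_[owner].push_back(std::move(cleanup));
    }

    // Manual "garbage collection", in the spirit of collect_tss_data.
    // Returns the number of cleanup actions that were run.
    std::size_t collect() {
        std::lock_guard<std::mutex> lock(mutex_);
        return collect_locked();
    }

private:
    std::size_t collect_locked() {
        std::size_t run = 0;
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (is_alive_(it->first)) { ++it; continue; }
            for (auto& cleanup : it->second) { cleanup(); ++run; }
            it = entries_.erase(it);
        }
        return run;
    }

    std::function<bool(thread_id)> is_alive_;
    std::mutex mutex_;
    std::map<thread_id, std::vector<std::function<void()>>> entries_;
};
```

Note that if a new thread recycles the id of an exited one, is_alive keeps returning true for that id, so the old data stays registered until the id is finally unused (problem 2 above).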

"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9k069$9jf$1@sea.gmane.org...
[snip performance profiling discussion]
[following is windows-specific stuff]
I seem to recall that the reason for requiring a dynamically linked version was solely for automatically cleaning up thread_specific_ptr's at thread exit - is that right?
Yes.
I've in the past made a solution based on lazy cleanup of previously allocated TLS data - i.e. when someone attempts to alloc a TLS (aka TSS) slot/data, the currently invalid data (belonging to an "ex-thread") is deleted. This requires some metadata to be globally available to all threads, and that this data is tagged with the owning thread's id, but is certainly doable.
I'll have to think about this.
It could also be possible to expose a manual TLS-data "garbage-collection" routine, e.g. "collect_tss_data", for the users who really need it. Not beautiful, but considering the options ...
I had thought of something like this, too. Or of exposing cleanup functions that could be called from the user's dllmain (if they have one) rather than requiring Boost.Threads to have its own dllmain function to detect when threads go away. And there's the idea that Roland proposed some time ago, which could be built on top of this, of having Boost.Threads create a "pseudo-dll" on the fly that can detect when threads go away and tell Boost.Threads about it. Unfortunately, I still haven't been able to spend much time actually trying these ideas out. Most of my recent Boost time has been spent working on finishing the conversion of Boost.Threads docs to BoostBook format and updating them.
There are a few potential problems though (off the top of my head):
1. Performance - each creation of thread-specific data requires exclusive access to the metadata. Whether this is a real issue or not depends on the real-world scenario.
Yes.
Developers sensitive to this should anyway try to create only one thread_specific_ptr to host all their TSS data. Also, TLS slots are a limited resource (especially under earlier NT versions). Or is this automatically managed by the current TSS implementation?
Boost.Threads now uses only one real TLS slot no matter how many thread_specific_ptrs you create, so this shouldn't be a problem.
2. Operating system's reuse of thread ids. This could (in theory) cause data to hang around longer than necessary, but should otherwise not cause problems, as each thread should only be able to access data it created itself.
I can see reuse of thread ids being a problem, but I'm not sure I understand the rest of this.
3. Destruction-time for TSS data would be indeterminate (unless explicitly clearing things up).
Unless using one of the alternative dllmain schemes I mentioned above.
Mike
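How Boost.Threads maps many thread_specific_ptrs onto one real TLS slot is not shown in this thread. The following is only a sketch of one way such multiplexing could work, assuming a per-thread table indexed by a process-wide key. The names allocate_key, slot_table, get_value and set_value are hypothetical, and a C++ thread_local variable stands in for the single OS slot.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hand out process-wide keys; each logical "TSS pointer" owns one key.
inline std::size_t allocate_key() {
    static std::atomic<std::size_t> next{0};
    return next++;
}

// The one per-thread table. It plays the role of the single real TLS slot:
// every logical key is just an index into it.
inline std::vector<void*>& slot_table() {
    thread_local std::vector<void*> table;
    return table;
}

// Read the calling thread's value for 'key'; nullptr if never set.
inline void* get_value(std::size_t key) {
    std::vector<void*>& table = slot_table();
    return key < table.size() ? table[key] : nullptr;
}

// Store the calling thread's value for 'key', growing the table on demand.
inline void set_value(std::size_t key, void* value) {
    std::vector<void*>& table = slot_table();
    if (key >= table.size()) table.resize(key + 1, nullptr);
    table[key] = value;
}
```

With this layout the number of logical keys is bounded by memory rather than by the number of OS TLS slots, which is the property the reply above describes.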

"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9kodd$krt$1@sea.gmane.org...
"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9k069$9jf$1@sea.gmane.org...
[snip performance profiling discussion]
[...]
I've in the past made a solution based on lazy cleanup of previously allocated TLS data - i.e. when someone attempts to alloc a TLS (aka TSS) slot/data, the currently invalid data (belonging to an "ex-thread") is deleted. This requires some metadata to be globally available to all threads, and that this data is tagged with the owning threads id, but is certainly doable.
I'll have to think about this.
Ok, please also see my comment below on periodic TSS data cleanup.
It could also be possible to expose a manual TLS-data "garbage-collection" routine, e.g. "collect_tss_data", for the users who really need it. Not beautiful, but considering the options ...
I had thought of something like this, too. Or of exposing cleanup functions that could be called from the user's dllmain (if they have one) rather than requiring Boost.Threads to have its own dllmain function to detect when threads go away. And there's the idea that Roland proposed some time ago, which could be built on top of this, of having Boost.Threads create a "pseudo-dll" on the fly that can detect when threads go away and tell Boost.Threads about it.
Could this cause problems in the future, when Windows makes use of the protect-memory-from-execution functionality in recent Intel processors (sorry, I just don't remember the proper name)?
An alternative could be to have Boost.Threads automatically create a background thread to periodically collect unowned thread-specific data. Now, if boost threads could have a priority assigned to them, this could be made (the equivalent of) a THREAD_PRIORITY_IDLE thread - is that in the works? It might not be an issue anyway, as this special support is platform specific and should only get compiled under Windows.
[...]
Boost.Threads now uses only one real TLS slot no matter how many thread_specific_ptrs you create, so this shouldn't be a problem.
Great.
2. Operating system's reuse of thread id's. This could (in theory) cause data to be hanging around longer than necessary, but should otherwise not be causing problems as each thread should only be able to access data created by themselves.
I can see reuse of thread ids being a problem, but I'm not sure I understand the rest of this.
Highly theoretical:
1. Thread A is created with id:1
2. Thread A creates thread_specific_ptr (first time), implicitly allocating TLS slot.
3. Thread A exits
4. Thread B is created, getting the recycled id of the first thread (id:1)
5. Thread B creates thread_specific_ptr; this is the first time, so the implementation now also tries to perform a 'lazy' cleanup of any unowned data. There is still data allocated by Thread A, but this is mapped through the thread's id and so can't be detected as ready for collection.
Thread B can still create its own data, but Thread A's won't be collected until after Thread B has exited (and another thread attempts to create thread-specific data - causing lazy cleanup).
3. Destruction-time for TSS data would be indeterminate (unless explicitly clearing things up).
Unless using one of the alternative dllmain schemes I mentioned above.
Yes, but that's still forcing the user to use a "special" dll just for that purpose. // Johan

"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9mk92$815$1@sea.gmane.org...
"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9kodd$krt$1@sea.gmane.org...
"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9k069$9jf$1@sea.gmane.org...
[snip]
It could also be possible to expose a manual TLS-data "garbage-collection" routine, e.g. "collect_tss_data", for the users who really need it. Not beautiful, but considering the options ...
I had thought of something like this, too. Or of exposing cleanup functions that could be called from the user's dllmain (if they have one) rather than requiring Boost.Threads to have its own dllmain function to detect when threads go away. And there's the idea that Roland proposed some time ago, which could be built on top of this, of having Boost.Threads create a "pseudo-dll" on the fly that can detect when threads go away and tell Boost.Threads about it.
Could this cause problems in the future, when Windows will make use of the protect-memory-from-execution functionality in recent Intel processors (sorry, I just don't remember the proper name)?
I don't know. I was worried about that myself, but haven't looked into it yet.
An alternative could be to have Boost.Threads automatically create a background thread to periodically collect unowned thread-specific data.
This could also be used in addition to lazy cleanup instead of as an alternative, and seems to have the same problems.
Now if boost threads could have priority assigned to them to make this a (corresponding to) THREAD_PRIORITY_IDLE thread - is that in the works?
The (unfinished) changes on the thread_dev branch do implement thread priorities. They won't make it into the next Boost release, but I hope they will be in the one after that.
[snip]
2. Operating system's reuse of thread ids. This could (in theory) cause data to hang around longer than necessary, but should otherwise not cause problems, as each thread should only be able to access data it created itself.
I can see reuse of thread ids being a problem, but I'm not sure I understand the rest of this.
Highly theoretical:
1. Thread A is created with id:1
2. Thread A creates thread_specific_ptr (first time), implicitly allocating TLS slot.
3. Thread A exits
4. Thread B is created, getting the recycled id of the first thread (id:1)
5. Thread B creates thread_specific_ptr; this is the first time, so the implementation now also tries to perform a 'lazy' cleanup of any unowned data. There is still data allocated by Thread A, but this is mapped through the thread's id and so can't be detected as ready for collection.
I follow as far as this.
Thread B can still create its own data,
How does thread B know that it needs to create its own data--i.e., what prevents it from thinking that thread A's data is its own and using it? Here's a specific case I have in mind: the implementation of the thread class on the thread_dev branch. In this implementation, the thread class has become a handle class that holds a reference-counted pointer to a thread_data class. When a thread class is created, it gets access to the thread_data class for the thread using a global static thread_specific_ptr. In the scenario you outline above, when a thread object created on Thread B tries to access its thread data through this thread_specific_ptr, it will get the thread data for thread A, which is a Bad Thing.
but Thread A's won't be collected until after Thread B has exited (and another thread attempts to create thread-specific data - causing lazy cleanup).
3. Destruction-time for TSS data would be indeterminate (unless explicitly clearing things up).
Unless using one of the alternative dllmain schemes I mentioned above.
Yes, but that's still forcing the user to use a "special" dll just for that purpose.
Not necessarily. If the user's code is in a dll, its own dllmain could be used; or the dllmain of the "pseudo dll" that is created on the fly could be used.
Mike

"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9nemm$gt8$1@sea.gmane.org...
"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9mk92$815$1@sea.gmane.org...
"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9kodd$krt$1@sea.gmane.org...
"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9k069$9jf$1@sea.gmane.org...
[snip]
An alternative could be to have Boost.Threads automatically create a background thread to periodically collect unowned thread-specific data.
This could also be used in addition to lazy cleanup instead of as an alternative, and seems to have the same problems.
Implementing the solution another poster mentioned (have a background thread wait either on thread termination or on an event signalling that a new thread requests TSS data) would be better. Or why not implement a catch (...) around the call to the user-provided thread entry point and free the data when it returns - but that would prevent users from creating threads directly and still using thread_specific_ptr's. Hmmm ... or is that impossible in the current implementation as well?
Now if boost threads could have priority assigned to them to make this a (corresponding to) THREAD_PRIORITY_IDLE thread - is that in the works?
The (unfinished) changes on the thread_dev branch do implement thread priorities. They won't make it into the next Boost release, but I hope they will be in the one after that.
Not a problem in this case; you could just use (the boost equivalent of):
// pseudo pseudo-code
#if defined (WIN32)
::SetThreadPriority(...);
#else
#error <appropriate error message>
#endif
[snip]
Highly theoretical:
1. Thread A is created with id:1
2. Thread A creates thread_specific_ptr (first time), implicitly allocating TLS slot.
3. Thread A exits
4. Thread B is created, getting the recycled id of the first thread (id:1)
5. Thread B creates thread_specific_ptr; this is the first time, so the implementation now also tries to perform a 'lazy' cleanup of any unowned data. There is still data allocated by Thread A, but this is mapped through the thread's id and so can't be detected as ready for collection.
I follow as far as this.
Thread B can still create its own data,
How does thread B know that it needs to create its own data--i.e., what prevents it from thinking that thread A's data is its own and using it?
By pure magic I suppose ... ;-)
Seriously, I haven't checked the thread_specific_ptr implementation. I've always been under the impression that if calling TlsGetValue(<tls index>) returns NULL and GetLastError() == NO_ERROR, the calling thread's slot is uninitialized => create whatever and store the pointer.
The thread id as mapped by the suggestion above is only used for deleting "unowned data". References to the data are stored globally, protected by a synchronization object and mapped per thread id. This "meta"-data is not used for _accessing_ the data.
Here's a specific case I have in mind: the implementation of the thread class on the thread_dev branch. In this implementation, the thread class has become a handle class that holds a reference-counted pointer to a thread_data class. When a thread class is created, it gets access to the thread_data class for the thread using a global static thread_specific_ptr. In the scenario you outline above, when a thread object created on Thread B tries to access its thread data through this thread_specific_ptr, it will get the thread data for thread A, which is a Bad Thing.
If it accesses the data through its own id <-> data map, yes. Not if it leaves that to the operating system's service internally (TlsGetValue). I might be missing something though.
[snip]
Yes, but that's still forcing the user to use a "special" dll just for that purpose.
Not necessarily. If the user's code is in a dll, its own dllmain could be used; or the dllmain of the "pseudo dll" that is created on the fly could be used.
Sorry, I meant you're forcing the user to _use a dll_ for that purpose (i.e. not necessarily a "special" one). // Johan
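The access pattern described above (an empty slot means "not initialized on this thread yet", so create and store) can be sketched without Win32 as follows. This is an illustration under assumptions, not the thread_specific_ptr implementation: a C++ thread_local pointer stands in for the raw TLS slot, and per_thread_data, raw_slot, get_or_create and release are hypothetical names.

```cpp
#include <cassert>

// Hypothetical per-thread payload.
struct per_thread_data {
    int counter = 0;
};

// Stand-in for the raw TLS slot. It starts out as nullptr in every thread,
// much as an unset slot reads back as NULL from TlsGetValue.
inline per_thread_data*& raw_slot() {
    thread_local per_thread_data* slot = nullptr;
    return slot;
}

// Access goes through the slot only. No thread-id map is consulted, so a
// recycled thread id cannot hand one thread another thread's data.
inline per_thread_data& get_or_create() {
    per_thread_data*& slot = raw_slot();
    if (slot == nullptr) {
        slot = new per_thread_data();  // first access on this thread
    }
    return *slot;
}

// Explicit release for the calling thread; safe to call on an empty slot.
inline void release() {
    delete raw_slot();
    raw_slot() = nullptr;
}
```

In this sketch the id-to-data metadata from the lazy cleanup proposal would only ever be needed by the collector, never by get_or_create.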

"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9pn3h$q4a$1@sea.gmane.org...
"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9nemm$gt8$1@sea.gmane.org...
"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9mk92$815$1@sea.gmane.org...
"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9kodd$krt$1@sea.gmane.org...
"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in
news:c9k069$9jf$1@sea.gmane.org...
[snip]
An alternative could be to have Boost.Threads automatically create a background thread to periodically collect unowned thread-specific data.
This could also be used in addition to lazy cleanup instead of as an alternative, and seems to have the same problems.
Implementing the solution another poster mentioned (have a background thread wait either on thread termination or an event signalling a new thread request TSS data) would be better.
Except that Malcolm, who mentioned that approach, abandoned it as unworkable--or was that only the optimization? But without the optimization you'd have to create a "watchdog" thread for each thread being watched, which doesn't sound like a good idea, either. I agree that it's worth looking into, though, to see if the problems he had can be dealt with.
Or why not implement a catch (...) around the call to the threads user's provided thread entry and free data when it returns - but that would preclude users to create threads directly and still use thread_specific_ptr's.
Yes, and that's pretty important.
Hmmm ... or is that impossible in the current implementation as well?
No, it's quite possible to use a thread_specific_ptr in a thread not created by Boost.Threads.
Now if boost threads could have priority assigned to them to make this a (corresponding to) THREAD_PRIORITY_IDLE thread - is that in the works?
The (unfinished) changes on the thread_dev branch do implement thread priorities. They won't make it into the next Boost release, but I hope they will be in the one after that.
Not a problem in this case; you could just use (the boost equivalent of):
// pseudo pseudo-code
#if defined (WIN32)
::SetThreadPriority(...);
#else
#error <appropriate error message>
#endif
True. I was reading "is that in the works" as asking about availability of setting thread priorities in general, and answering it as such.
[snip]
Highly theoretical:
1. Thread A is created with id:1
2. Thread A creates thread_specific_ptr (first time), implicitly allocating TLS slot.
3. Thread A exits
4. Thread B is created, getting the recycled id of the first thread (id:1)
5. Thread B creates thread_specific_ptr; this is the first time, so the implementation now also tries to perform a 'lazy' cleanup of any unowned data. There is still data allocated by Thread A, but this is mapped through the thread's id and so can't be detected as ready for collection.
I follow as far as this.
Thread B can still create its own data,
How does thread B know that it needs to create its own data--i.e., what prevents it from thinking that thread A's data is its own and using it?
By pure magic I suppose ... ;-)
Seriously, I haven't checked the thread_specific_ptr implementation. Previously I've always been under the impression that if calling TlsGetValue(<tls index>) returns NULL and GetLastError() == NO_ERROR, the calling thread's slot is uninitialized => create whatever and store pointer.
I need to look at the code again and think about this. I think I may have missed something before.
The thread id as mapped by the suggestion above is only used for deleting "unowned data". References to the data are stored globally, protected by a synchronization object and mapped per thread id. This "meta"-data is not used for _accessing_ the data.
Here's a specific case I have in mind: the implementation of the thread class on the thread_dev branch. In this implementation, the thread class has become a handle class that holds a reference-counted pointer to a thread_data class. When a thread class is created, it gets access to the thread_data class for the thread using a global static thread_specific_ptr. In the scenario you outline above, when a thread object created on Thread B tries to access its thread data through this thread_specific_ptr, it will get the thread data for thread A, which is a Bad Thing.
If it accesses the data through its own id <-> data map, yes. Not if it leaves that to the operating system's service internally (TlsGetValue). I might be missing something though.
As I said above, I need to look at the code again.
[snip]
Yes, but that's still forcing the user to use a "special" dll just for that purpose.
Not necessarily. If the user's code is in a dll, its own dllmain could be used; or the dllmain of the "pseudo dll" that is created on the fly could be used.
Sorry, I meant you're forcing the user to _use a dll_ for that purpose (i.e. not necessarily a "special" one).
OK.
Mike

"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9qn4v$s45$1@sea.gmane.org...
"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9pn3h$q4a$1@sea.gmane.org...
"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9nemm$gt8$1@sea.gmane.org...
[snip]
Implementing the solution another poster mentioned (have a background thread wait either on thread termination or an event signalling a new thread request TSS data) would be better.
Except that Malcolm, who mentioned that approach, abandoned it as unworkable--or was that only the optimization? But without the optimization you'd have to create a "watchdog" thread for each thread being watched, which doesn't sound like a good idea, either.
I agree that it's worth looking into, though, to see if the problems he had can be dealt with.
I meant including the optimization; i.e. (definitely) _not_ creating one "watchdog" thread for each "real" thread. The problem being, as he mentioned, that WFMO only takes so many events, making it necessary to create one watchdog thread for every 63 user threads.
I don't see any implementation problems with this approach as long as it would only be used inside executable modules (i.e. static linkage), but as I haven't actually tried to implement it ...
The 63-thread limit shouldn't be a problem in well-designed applications; it actually should be more than sufficient. If going for a watchdog-thread based solution, I'm not so sure it's worth the effort to support more than that, due to the added complexity (at least that's my initial impression).
Regards,
Johan
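The arithmetic behind the "one watchdog per 63 user threads" figure can be written down as a small helper. This is only a sketch: it assumes that each watchdog spends one of the 64 handles WaitForMultipleObjects accepts (MAXIMUM_WAIT_OBJECTS) on a wake-up event, which is a reading of the discussion above rather than something stated in it, and the names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// WaitForMultipleObjects accepts at most MAXIMUM_WAIT_OBJECTS (64) handles.
constexpr std::size_t max_wait_objects = 64;

// Assumption: one handle per watchdog is reserved for the "new thread
// requests TSS data" event, leaving 63 for user-thread handles.
constexpr std::size_t threads_per_watchdog = max_wait_objects - 1;

// Number of watchdog threads needed to watch 'user_threads' threads
// (integer ceiling of user_threads / 63).
constexpr std::size_t watchdogs_needed(std::size_t user_threads) {
    return (user_threads + threads_per_watchdog - 1) / threads_per_watchdog;
}
```

Under these assumptions, an application with up to 63 threads using TSS would need a single watchdog, and a second one only appears at the 64th such thread.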

I don't see any implementation problems with this approach as long as it would only be used inside executable modules (i.e. static linkage), but as I haven't actually tried to implement it ...
IIRC, the complexity arises when you start to consider global scope thread specific variables in different translation units, together with the possibility that threads can be started before 'main' and waited on, i.e. joined, after 'main' has completed. Ah, I see the remains of the bad dream now . . . imagine you need to wait for completion of some thread that is joined in the destructor of a global in a different translation unit. You need to make sure nothing will wait for your thread for any reason (otherwise you deadlock), and you need to somehow make sure that any object cleanup is handled by the thread local object and doesn't rely on any statics since they may have been cleaned up before the global in the other translation unit.
I have a vague recollection that I was trying to ensure that the thread that I started actually completed by joining it, but it makes no sense to do that so I think now that this approach might work, although it's a little complex. Note that the overhead of one additional 'thread per thread' might not be too great since the secondary (cleanup) thread gets started only once (this could be either on the first access, or when a boost thread starts) and then goes to sleep until the primary thread ends. The simplicity of this design might be preferable to the limitation/complexity imposed by WFMO.
Also you should be aware that the contract that states that the cleanup handler *will* be called when the primary terminates is technically invalid, since there is no way to ensure that the cleanup thread will ever be scheduled when the primary thread ends. In most cases it will be scheduled, but there are no actual guarantees. Looking through the archives, I believe Bill Kempf was very keen to preserve this contract (I can understand why and I think I agree).
One other advantage to 'one thread per thread' is that the cleanup thread only has to manage one set of cleanup routines (for the thread that it's waiting for) and it is *guaranteed* to stay asleep until the primary has ended. For this reason you wouldn't need any locks guarding the thread local object, as seem to be needed with the dll design (at least my recollection of the code in thread_dev is that there is a lock on the tss cleanup collection, although it may be possible to design that away). It might be possible to do this with WFMO too, but it may get a bit complex.
The other alternative approach (requiring an instance of an object allocated in the thread function and/or functor) also requires no locks, and guarantees that the cleanup must be called as long as the destructor gets called, which will happen when the thread/functor exits normally or via an exception (i.e. it won't happen if you call TerminateThread(), but all bets are off then anyway). Oh, and it feels more in keeping with C++ since the language enforces cleanup via a destructor and it doesn't require any additional threads either ;-)
Malcolm Noyes
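The destructor-based alternative Malcolm describes can be sketched in portable C++. This is only an illustration under stated assumptions: it uses std::thread from C++11 (which postdates this thread) rather than Boost.Thread, and the names tss_cleanup_guard, at_thread_exit and run_guarded_thread are hypothetical, not part of any Boost interface.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Boost.Thread: owns the cleanup handlers
// for one thread and runs them when it goes out of scope.
class tss_cleanup_guard {
public:
    tss_cleanup_guard() = default;
    tss_cleanup_guard(const tss_cleanup_guard&) = delete;
    tss_cleanup_guard& operator=(const tss_cleanup_guard&) = delete;

    // Only the owning thread touches the list, so no lock is needed.
    void at_thread_exit(std::function<void()> handler) {
        handlers_.push_back(std::move(handler));
    }

    // Runs on normal return and during stack unwinding alike.
    ~tss_cleanup_guard() {
        for (auto it = handlers_.rbegin(); it != handlers_.rend(); ++it) {
            try { (*it)(); } catch (...) { /* a destructor must not throw */ }
        }
    }

private:
    std::vector<std::function<void()>> handlers_;
};

// Starts a thread whose function owns a guard, optionally leaves the guarded
// scope via an exception, and returns how many handlers ran.
int run_guarded_thread(bool leave_by_exception) {
    int cleaned = 0;  // written only by the worker, read after join()
    std::thread worker([&cleaned, leave_by_exception] {
        try {
            tss_cleanup_guard guard;
            guard.at_thread_exit([&cleaned] { ++cleaned; });
            guard.at_thread_exit([&cleaned] { ++cleaned; });
            if (leave_by_exception) throw 1;
        } catch (...) {
            // Swallowed so the thread function itself returns normally.
        }
    });
    assert(worker.joinable());
    worker.join();
    return cleaned;
}
```

Both handlers run whether the guarded scope is left normally or by an exception. As the message notes, nothing runs if the thread is killed with TerminateThread(), because the stack is never unwound.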

"Malcolm Noyes" <boost@alchemise.com> wrote in message news:cte9c0h0c4ojj2uqb7ao2ilatc1hkg9shq@4ax.com...
I don't see any implementation problems with this approach as long as it would only be used inside executable modules (i.e. static linkage), but as I haven't actually tried to implement it ...
IIRC, the complexity arises when you start to consider global scope thread specific variables in different translation units, together with the possibility that threads can be started before 'main' and waited on, i.e. joined, after 'main' has completed. Ah, I see the remains of the bad dream now . . . imagine you need to wait for completion of some thread that is joined in the destructor of a global in a different translation unit. You need to make sure nothing will wait for your thread for any reason (otherwise you deadlock), and you need to somehow make sure that any object cleanup is handled by the thread local object and doesn't rely on any statics since they may have been cleaned up before the global in the other translation unit.
Yes, global (implicit or explicit) tss data is likely to cause non-deterministic behaviour in such a solution. I'm not so sure you should wait for other threads to complete in the context of exiting the process/terminating the rtl (actually I'm convinced it's not a good idea, even if this is not inside a dll).
I have a vague recollection that I was trying to ensure that the thread that I started actually completed by joining it, but it makes no sense to do that so I think now that this approach might work, although it's a little complex. Note that the overhead of one additional 'thread per thread' might not be too great since the secondary (cleanup) thread gets started only once (this could be either on the first access, or when a boost thread starts) and then goes to sleep until the primary thread ends. The simplicity of this design might be preferable to the limitation/complexity imposed by WFMO.
I don't agree, but having different opinions is definitely allowed :-)
Also you should be aware that the contract that states that the cleanup handler *will* be called when the primary terminates is technically invalid, since there is no way to ensure that the cleanup thread will ever be scheduled when the primary thread ends. In most cases it will be scheduled, but there are no actual guarantees. Looking through the archives, I believe Bill Kempf was very keen to preserve this contract (I can understand why and I think I agree).
I'm more and more leaning towards a non-worker-thread, lazy tss cleanup implementation (if any at all for static linking). Look further down as well.
One other advantage to 'one thread per thread' is that the cleanup thread only has to manage one set of cleanup routines (for the thread that it's waiting for) and it is *guaranteed* to stay asleep until the primary has ended. For this reason you wouldn't need any locks guarding the thread local object, as seem to be needed with the dll design (at least my recollection of the code in thread_dev is that there is a lock on the tss cleanup collection, although it may be possible to design that away). It might be possible to do this with WFMO too, but it may get a bit complex.
See comments below.
The other alternative approach (requiring an instance of an object allocated in the thread function and/or functor) also requires no locks, and guarantees that the cleanup must be called as long as the destructor gets called, which will happen when the thread/functor exits normally or via an exception (i.e. it won't happen if you call TerminateThread(), but all bets are off then anyway). Oh, and it feels more in keeping with C++ since the language enforces cleanup via a destructor and it doesn't require any additional threads either ;-)
I believe all of the above suggestions (including the WFMO + event variation) would render it impossible to use boost's tss functionality outside threads created using Boost.Thread. As I mentioned above, perhaps a pure lazy-tss-cleanup would be the way to go (possibly in combination with a timer-activated cleanup thread). To guarantee cleanup of thread data at primary thread exit, the thread-id <-> data map could clean up all data on destruction (assuming it is a global/singleton object). The details would need to be fleshed out and a test implementation heavily exercised, I guess. The only real problem I can foresee is for threads requiring deterministic cleanup of their resources. A possibility to explicitly clean up their tss data could also be provided for this purpose.
// Johan
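A minimal sketch of the lazy cleanup idea, assuming C++11 facilities in place of Boost.Thread and Win32. The class lazy_tss_registry and its members are hypothetical names invented for illustration. The map from thread id to data releases whatever is left when it is destroyed, and release_current_thread() is the explicit cleanup for threads that need deterministic behaviour.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical registry: thread id -> (value, cleanup function).
class lazy_tss_registry {
public:
    using cleanup_fn = std::function<void(void*)>;

    lazy_tss_registry() = default;
    lazy_tss_registry(const lazy_tss_registry&) = delete;
    lazy_tss_registry& operator=(const lazy_tss_registry&) = delete;

    // Lazy part: whatever exited threads left behind is released here.
    ~lazy_tss_registry() {
        for (auto& entry : slots_) release(entry.second);
    }

    // Store a value for the calling thread, releasing any previous one.
    void set(void* value, cleanup_fn cleanup) {
        slot previous;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            slot& mine = slots_[std::this_thread::get_id()];
            previous = std::move(mine);
            mine.value = value;
            mine.cleanup = std::move(cleanup);
        }
        release(previous);  // run user code outside the lock
    }

    void* get() {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = slots_.find(std::this_thread::get_id());
        return it == slots_.end() ? nullptr : it->second.value;
    }

    // Explicit, deterministic cleanup for the calling thread.
    void release_current_thread() {
        slot mine;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = slots_.find(std::this_thread::get_id());
            if (it == slots_.end()) return;
            mine = std::move(it->second);
            slots_.erase(it);
        }
        release(mine);
    }

private:
    struct slot {
        void* value = nullptr;
        cleanup_fn cleanup;
    };
    static void release(slot& s) {
        if (s.value && s.cleanup) s.cleanup(s.value);
        s.value = nullptr;
    }
    std::mutex mutex_;
    std::map<std::thread::id, slot> slots_;
};

// One thread leaves its data behind, the other cleans up explicitly.
// Returns {cleanups run before the registry died, cleanups run in total}.
std::pair<int, int> demo_lazy_cleanup() {
    int cleaned = 0;
    int after_threads = 0;
    {
        lazy_tss_registry registry;
        auto cleanup = [&cleaned](void* p) { delete static_cast<int*>(p); ++cleaned; };
        std::thread lazy([&] { registry.set(new int(1), cleanup); });
        std::thread tidy([&] {
            registry.set(new int(2), cleanup);
            assert(*static_cast<int*>(registry.get()) == 2);
            registry.release_current_thread();
        });
        lazy.join();
        tidy.join();
        after_threads = cleaned;  // only the explicit release has run so far
    }
    return {after_threads, cleaned};
}
```

Two caveats apply to this sketch. The leftover data is released on whichever thread destroys the registry, not on the thread that stored it. Thread ids may also be reused once a thread has ended, so a real implementation would need some way to tell a new thread from a stale entry.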

From: "Johan Nilsson" <johan.nilsson@esrange.ssc.se>
"Malcolm Noyes" <boost@alchemise.com> wrote in message
IIRC, the complexity arises when you start to consider global scope thread specific variables in different translation units, together with the possibility that threads can be started before 'main' and waited on, i.e. joined, after 'main' has completed. Ah, I see the
Do you really want to account for the creation and joining of threads outside of main()? I'm no MT expert, but I recall learning that one should avoid creating threads prior to main. It makes statics and even main() difficult to write. Isn't that correct? If so, should Boost.Thread really account for bad designs? -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

Rob Stewart wrote:
From: "Johan Nilsson" <johan.nilsson@esrange.ssc.se>
"Malcolm Noyes" <boost@alchemise.com> wrote in message
IIRC, the complexity arises when you start to consider global scope thread specific variables in different translation units, together with the possibility that threads can be started before 'main' and waited on, i.e. joined, after 'main' has completed. Ah, I see the
Do you really want to account for the creation and joining of threads outside of main()? I'm no MT expert, but I recall learning that one should avoid creating threads prior to main. It makes statics and even main() difficult to write. Isn't that correct? If so, should Boost.Thread really account for bad designs?
The biggest problem with this is that global static objects are initialized before main(). Is it necessarily a bad thing for the constructor of such an object to create a thread? Mike

From: Michael Glassford <glassfordm@hotmail.com>
Rob Stewart wrote:
Do you really want to account for the creation and joining of threads outside of main()? I'm no MT expert, but I recall learning that one should avoid creating threads prior to main. It makes statics and even main() difficult to write. Isn't that correct? If so, should Boost.Thread really account for bad designs?
The biggest problem with this is that global static objects are initialized before main(). Is it necessarily a bad thing for the constructor of such an object to create a thread?
My understanding is that it is a problem, but I may be wrong. The problem you run into with permitting threads before main() is that you have to make access to shared resources thread safe when accessed by static objects that can be created in any order, and main() can't do any required initialization without worrying about thread safety. The better approach is for statics to be initialized in a single thread and for main() to do its initialization and, only then, does main() cause the various threads to come into existence. That is, main() doesn't need to create the threads, but it does need to make the (member) function calls necessary to do so. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

"Rob Stewart" <stewart@sig.com> wrote in message news:200406081538.i58FceS29292@lawrencewelk.systems.susq.com...
From: Michael Glassford <glassfordm@hotmail.com>
Rob Stewart wrote:
Do you really want to account for the creation and joining of threads outside of main()? I'm no MT expert, but I recall learning that one should avoid creating threads prior to main. It makes statics and even main() difficult to write. Isn't that correct? If so, should Boost.Thread really account for bad designs?
The biggest problem with this is that global static objects are initialized before main(). Is it necessarily a bad thing for the constructor of such an object to create a thread?
My understanding is that it is a problem, but I may be wrong. The problems you run into with permitting threads before main() is that you have to make access to shared resources thread safe when accessed by static objects that can be created in any order and main() can't do any required initialization without worrying about thread safety. The better approach is for statics to be initialized in a single thread and for main() to do its initialization and, only then, does main() cause the various threads to come into existence.
You are assuming that the author of main() knows all about the inner workings of all objects it makes use of (directly as well as indirectly). I believe that he/she should not necessarily have to. // Johan

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org]On Behalf Of Michael Glassford Sent: Wednesday, June 09, 2004 2:32 AM To: boost@lists.boost.org Subject: [boost] Re: Thread lib (good reason for static)
[snip]
Do you really want to account for the creation and joining of threads outside of main()? I'm no MT expert, but I recall learning that one should avoid creating threads prior to main. It makes statics and even main() difficult to write. Isn't that correct? If so, should Boost.Thread really account for bad designs?
The biggest problem with this is that global static objects are initialized before main(). Is it necessarily a bad thing for the constructor of such an object to create a thread?
Assume use of a technique like the following (I think this has been presented as an "object shim"?):
    T & anastasia() { static T *p = new T; return *p; }
as a means of dealing with the global construction issue, i.e. instead of expressions like "anastasia.data_member" the usage is "anastasia().data_member". Also assume that some of these global objects (i.e. shims) deploy mutexes for MT reasons. What happens when there is more than one thread running around making calls to shims? Instantiation of the global itself becomes an MT issue. It is a specific example, but it is unfortunate to lose the utility of a technique that is the current preference for another curly issue. Cheers, Scott
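For comparison, here is one way the shim could be made safe against concurrent first calls. This is a sketch that assumes a C++11 compiler, which postdates this thread; the type T, its constructed counter and the helper all_threads_see_same_object are placeholders invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Placeholder type standing in for the global object behind the shim.
struct T {
    static std::atomic<int> constructed;
    int data_member = 0;
    T() { ++constructed; }
};
std::atomic<int> T::constructed{0};

// The shim from the message, with the first construction serialized.
T& anastasia() {
    static std::once_flag flag;  // constant-initialized, so no race here
    static T* p = nullptr;
    std::call_once(flag, [] { p = new T; });  // intentionally never deleted
    return *p;
}

// Calls the shim from several threads at once and checks that every
// thread received a reference to the same object.
bool all_threads_see_same_object(std::size_t thread_count) {
    std::vector<T*> seen(thread_count, nullptr);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < thread_count; ++i)
        threads.emplace_back([&seen, i] { seen[i] = &anastasia(); });
    for (std::thread& t : threads) t.join();
    for (T* p : seen)
        if (p != seen[0]) return false;
    return true;
}
```

Since C++11 the initialization of a function-local static is itself required to be thread-safe, so on a conforming compiler the original one-line shim has the same guarantee. The explicit std::call_once just makes the synchronization visible.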

"Rob Stewart" <stewart@sig.com> wrote in message news:200406081404.i58E49L27521@lawrencewelk.systems.susq.com...
From: "Johan Nilsson" <johan.nilsson@esrange.ssc.se>
"Malcolm Noyes" <boost@alchemise.com> wrote in message
IIRC, the complexity arises when you start to consider global scope thread specific variables in different translation units, together with the possibility that threads can be started before 'main' and waited on, i.e. joined, after 'main' has completed. Ah, I see the
Do you really want to account for the creation and joining of threads outside of main()? I'm no MT expert, but I recall learning that one should avoid creating threads prior to main. It makes statics and even main() difficult to write. Isn't that correct? If so, should Boost.Thread really account for bad designs?
Well, you replied to my posting but to _Malcolm's_ text ... whatever. I think you actually need to account for this and that such a design is not necessarily bad. As an example, consider a singleton utilizing a thread pool created at object creation time (which might occur anytime in the program). When the singleton is destroyed, it wants its workers to exit. To make sure the threads have a possibility to do a _controlled_ shutdown, the singleton signals the threads that they should terminate, and then waits for them to exit (by using e.g. join) before returning from the destructor. To be on the safe side though, it would be preferable to use a timed join ... there's always the possibility that one of the worker threads might hang causing the entire program to stop at exit time. [Before embarking on the singletons-are-also-bad-design trail, remember that there are exceptions to the rule.] // Johan
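The controlled shutdown described above might look roughly like this. The sketch assumes C++11 std::thread instead of Boost.Thread, and worker_pool and clean_exits_after_destruction are hypothetical names. A singleton that owns such a pool would run the same destructor during static destruction, after main() has returned.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical pool: workers idle until told to stop, then exit cleanly.
class worker_pool {
public:
    worker_pool(std::size_t count, std::atomic<int>& clean_exits)
        : clean_exits_(clean_exits) {
        for (std::size_t i = 0; i < count; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Signal every worker to terminate, then wait for each one to exit.
    ~worker_pool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_all();
        for (std::thread& w : workers_) w.join();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        wake_.wait(lock, [this] { return stop_; });
        lock.unlock();
        ++clean_exits_;  // stands in for the worker's own shutdown work
    }

    std::atomic<int>& clean_exits_;
    std::mutex mutex_;
    std::condition_variable wake_;
    bool stop_ = false;
    std::vector<std::thread> workers_;  // declared last: started once the rest exists
};

// Creates and destroys a pool, returning how many workers shut down cleanly.
int clean_exits_after_destruction(std::size_t count) {
    std::atomic<int> clean_exits{0};
    {
        worker_pool pool(count, clean_exits);
        assert(clean_exits.load() == 0);  // nobody exits before the signal
    }
    return clean_exits.load();
}
```

std::thread::join() has no timeout, so the timed join Johan prefers would need extra machinery, for example workers reporting completion through a condition variable that the destructor waits on with wait_for(). What to do with a worker that never reports is left open here, as it is in the message.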

The simplicity of this [one thread per thread] design might be preferable to the limitation/complexity imposed by WFMO.
I don't agree, but having different opinions is definitely allowed :-)
I seem to recall being told once to avoid premature optimisation ;-) Unless a developer is in the habit of creating threads frequently then 1 additional thread per thread probably wouldn't make much difference. A naive implementation of a socket server that allocated 1 thread per connection probably wouldn't perform well for example. Needs to be tested to see what the limits are, I think.
I believe all of the above suggestions (including the WFMO + event variation) would render it impossible to use boost's tss functionality outside threads created using Boost.Thread.
Not sure I agree. I have test cases running now that work with non-boost threads and seem pretty simple to use. I have more tests that I want to implement, including MFC UI threads (and I see no reason why it shouldn't work). Also, I'm still not convinced that tss cleanup on *thread* exit is the best option; for a thread pool I still think that I'd want automatic cleanup on *functor* exit (for the same reason that exception propagation should probably happen on functor exit). Perhaps an implementation that works with QueueUserWorkItem wouldn't go amiss . . . Malcolm Malcolm Noyes

"Malcolm Noyes" <boost@alchemise.com> wrote in message news:ntqjc012df7c2tcrt1offg51b2gp9sj1t0@4ax.com...
The simplicity of this [one thread per thread] design might be preferable to the limitation/complexity imposed by WFMO.
I don't agree, but having different opinions is definitely allowed :-)
I seem to recall being told once to avoid premature optimisation ;-) Unless a developer is in the habit of creating threads frequently then 1 additional thread per thread probably wouldn't make much difference. A naive implementation of a socket server that allocated 1 thread per connection probably wouldn't perform well for example. Needs to be tested to see what the limits are, I think.
I believe all of the above suggestions (including the WFMO + event variation) would render it impossible to use boost's tss functionality outside threads created using Boost.Thread.
Not sure I agree. I have test cases running now that work with non-boost threads and seem pretty simple to use. I have more tests that I want to implement, including MFC UI threads (and I see no reason why it shouldn't work).
Thanks for doing so much of my work for me--it will get done a lot faster that way.
Also, I'm still not convinced that tss cleanup on *thread* exit is the best option; for a thread pool I still think that I'd want automatic cleanup on *functor* exit (for the same reason that exception propagation should probably happen on functor exit).
It would probably be easy enough to make the functor trigger the tss cleanup, or perhaps to make the thread pool wrap the functor with a functor wrapper that triggers tss cleanup when it exits.
Perhaps an implementation that works with QueueUserWorkItem wouldn't go amiss . .
Mike

Johan Nilsson wrote:
"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9qn4v$s45$1@sea.gmane.org...
"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9pn3h$q4a$1@sea.gmane.org...
"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9nemm$gt8$1@sea.gmane.org...
[snip]
Implementing the solution another poster mentioned (have a background thread wait either on thread termination or an event signalling a new thread request TSS data) would be better.
Except that Malcolm, who mentioned that approach, abandoned it as unworkable--or was that only the optimization? But without the optimization you'd have to create a "watchdog" thread for each thread being watched, which doesn't sound like a good idea, either.
I agree that it's worth looking into, though, to see if the problems he had can be dealt with.
I meant including the optimization; i.e. (definitely) _not_ creating one "watchdog" thread for each "real" thread. Problem being, as he mentioned, that WFMO only takes so many events; making it necessary to create one watchdog thread for every 63 user-threads. I don't see any implementation problems with this approach as long as it would only be used inside executable modules (i.e. static linkage), but as I haven't actually tried to implement it ...
POSIX-conformant thread-specific storage destructors run in the context of the thread that is about to exit, AFAIK. They are even allowed to use thread-specific storage themselves.

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:01fc01c44d59$76b94d20$0600a8c0@pdimov...
Johan Nilsson wrote:
"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9qn4v$s45$1@sea.gmane.org...
"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9pn3h$q4a$1@sea.gmane.org...
"Michael Glassford" <glassfordm@hotmail.com> wrote in message news:c9nemm$gt8$1@sea.gmane.org...
[...]
Except that Malcolm, who mentioned that approach, abandoned it as unworkable--or was that only the optimization? But without the optimization you'd have to create a "watchdog" thread for each thread being watched, which doesn't sound like a good idea, either.
I agree that it's worth looking into, though, to see if the problems he had can be dealt with.
I meant including the optimization; i.e. (definitely) _not_ creating one "watchdog" thread for each "real" thread. Problem being, as he mentioned, that WFMO only takes so many events; making it necessary to create one watchdog thread for every 63 user-threads. I don't see any implementation problems with this approach as long as it would only be used inside executable modules (i.e. static linkage), but as I haven't actually tried to implement it ...
POSIX-conformant thread-specific storage destructors run in the context of the thread that is about to exit, AFAIK.
Ok, I have zip experience in using pthreads, only Win32 threads and the boost.thread library. If that's a definite requirement for the implementation it'll be virtually impossible to implement under Win32 (assuming static linkage to boost.thread) - without limiting users to use tss from "boost threads" only, IMHO.
They are even allowed to use thread-specific storage themselves.
Do you mean that the "destructors" can allocate/use tss under execution, or? What is the definition of "destructors" under pthreads, anyway (I assume it's not the same as a C++ destructor)? // Johan

Johan Nilsson wrote: [...]
Do you mean that the "destructors" can allocate/use tss under execution, or?
<quote> After all cancellation cleanup handlers have been executed, if the thread has any thread-specific data, appropriate destructor functions shall be called in an unspecified order. [...] Both pthread_getspecific() and pthread_setspecific() may be called from a thread-specific data destructor function. A call to pthread_getspecific() for the thread-specific data key being destroyed shall return the value NULL, unless the value is changed (after the destructor starts) by a call to pthread_setspecific(). Calling pthread_setspecific() from a thread-specific data destructor routine may result either in lost storage (after at least PTHREAD_DESTRUCTOR_ITERATIONS attempts at destruction) or in an infinite loop. </quote>
What is the definition of "destructors" under pthreads, anyway (I assume it's not the same as a C++ destructor)?
It's extern "C" void (*destructor)(void*);
<quote> An optional destructor function may be associated with each key value. At thread exit, if a key value has a non-NULL destructor pointer, and the thread has a non-NULL value associated with that key, the value of the key is set to NULL, and then the function pointed to is called with the previously associated value as its sole argument. The order of destructor calls is unspecified if more than one destructor exists for a thread when it exits. If, after all the destructors have been called for all non-NULL values with associated destructors, there are still some non-NULL values with associated destructors, then the process is repeated. If, after at least {PTHREAD_DESTRUCTOR_ITERATIONS} iterations of destructor calls for outstanding non-NULL values, there are still some non-NULL values with associated destructors, implementations may stop calling destructors, or they may continue calling destructors until no non-NULL values with associated destructors exist, even though this might result in an infinite loop. </quote>
regards, alexander.
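The behaviour quoted above can be observed with a small POSIX program. This is a sketch only: the global names and the helper destructor_ran_on_exiting_thread are made up for the example. It registers a destructor with pthread_key_create() and checks that the destructor is called once, on the thread that is exiting.

```cpp
#include <cassert>
#include <pthread.h>

// Globals invented for this example.
pthread_key_t g_key;
pthread_t g_worker_id;
bool g_ran_on_worker = false;
int g_destructor_calls = 0;

// Thread-specific data destructor: receives the value previously stored.
extern "C" void destroy_value(void* value) {
    // The worker is still alive at this point, so both ids are valid.
    g_ran_on_worker = pthread_equal(pthread_self(), g_worker_id) != 0;
    ++g_destructor_calls;
    delete static_cast<int*>(value);
}

extern "C" void* thread_body(void*) {
    g_worker_id = pthread_self();
    int rc = pthread_setspecific(g_key, new int(42));
    assert(rc == 0);
    (void)rc;
    return nullptr;  // returning from the start routine exits the thread
}

// Runs one worker and reports whether its destructor ran exactly once,
// in the context of the exiting worker thread.
bool destructor_ran_on_exiting_thread() {
    g_ran_on_worker = false;
    g_destructor_calls = 0;
    if (pthread_key_create(&g_key, destroy_value) != 0) return false;
    pthread_t worker;
    if (pthread_create(&worker, nullptr, thread_body, nullptr) != 0) {
        pthread_key_delete(g_key);
        return false;
    }
    pthread_join(worker, nullptr);
    pthread_key_delete(g_key);
    return g_destructor_calls == 1 && g_ran_on_worker;
}
```

This is the guarantee that is hard to reproduce with a statically linked library on Win32: a separate cleanup thread can run the handlers, but not in the context of the thread that owned the data.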

Michael Glassford wrote:
"Johan Nilsson" <johan.nilsson@esrange.ssc.se> wrote in message news:c9k069$9jf$1@sea.gmane.org...
[following is windows-specific stuff]
I seem to recall that the reason for requiring a dynamically linked version was solely for automatically cleaning up thread_specific_ptr's at thread exit - is that right?
Yes.
There is a little-known mechanism present in the Windows executable support, called TLS callbacks, that can call a list of arbitrary functions when a process starts and exits, and when threads attach or detach (very similar to DllMain). The calls are made in the context of the relevant thread. No linked DLL or overridden or special thread functions are necessary, as with ordinary approaches.
It's likely that few people know about this because there are apparently no compilers that use it--that I am familiar with. It is presently only documented in the PE specification. However, it appears to work without problems on all versions of IA32 Windows, and quite likely on anything else that uses PE, also.
Support for it requires minimal compiler support (the ability to emit data in arbitrarily-named sections, or even better, support for constructors and destructors in __declspec(thread), which no compiler I am familiar with has, including MSVC), a linker that minimally supports TLS, and possibly some "glue code" to support the linker.
Presently, MSVC has the necessary support, and GNU targets MinGW and Cygwin have all of the critical support as well, but lack the "glue."
The way to do it is to emit a pointer to the function to be called (which has the same signature as DllMain) in any of the 25 sections .CRT$XLA through .CRT$XLY. The linker merges all of these sections into a single null-terminated table of function pointers and creates a pointer to it from the executable image's TLS directory.
I am presently using this feature on the GCC target MinGW for this purpose. I have not distributed my code, though, so there is no widespread testing.
I am also not sure what the limitations are with regards to what functions may safely be called from the context of one of these callbacks. I assume it is similar to DllMain(), which makes it dangerous to use this as a general-purpose mechanism to execute arbitrary code.
In particular, constructors often load DLLs as a side effect, particularly if lazy loading is being used, and this may cause problems that are difficult to diagnose or reproduce if one such constructor is being used in a TLS construction or destruction context. More research is required in this area.
In any case, I think this mechanism is definitely something that should be examined if static thread support is being considered.
Aaron W. LaFramboise

"Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com> wrote in message news:40C6168B.20400@aaronwl.com... [snip]
There is a little-known mechanism present in the Windows executable support, called TLS callbacks, that can call a list of arbitrary functions when a process starts and exits, and when threads attach or detach (very similar to DllMain). The calls are made in the context of the relevant thread. No linked DLL or overridden or special thread functions are necessary, as with ordinary approaches.
That's cool - never heard of it before.
It's likely that few people know about this because there are apparently no compilers that use it--that I am familiar with. It is presently only documented in the PE specification. However, it appears to work without problems on all versions of IA32 Windows, and quite likely on anything else that uses PE, also.
You've actually tried it under Win9x, NT4, w2k and XP/2003? Longhorn? I'm not skeptical, just interested. In any case, if it's part of the official PE documentation it should be possible. I just can't imagine why I've never heard of this before ...
Support for it requires minimal compiler support (the ability to emit data in arbitrarily-named sections, or even better, support for constructors and destructors in __declspec(thread), which no compiler I am familiar with has, including MSVC),
Ok.
a linker that minimally supports TLS, and possibly some "glue code" to support the linker.
... meaning?
Presently, MSVC has the necessary support, and GNU targets MinGW and Cygwin have all of the critical support as well, but lack the "glue."
The way to do it is to emit a pointer to the function to be called (which has the same signature as DllMain) in any of the 25 sections .CRT$XLA through .CRT$XLY. The linker merges all of these sections into a single null-terminated table of function pointers and creates a pointer to it from the executable image's TLS directory.
Do you mean something as simple as (for msvc):
--------------------
#pragma code_seg(push, old_seg)
#pragma code_seg(".CRT$XLA")
BOOL WINAPI TlsCallback(HINSTANCE, DWORD, LPVOID)
{
    <whatever>
    return TRUE;
}
#pragma code_seg(pop, old_seg)
------------------
What are the descriptions for the parameters above?
I am presently using this feature on the GCC target MinGW for this purpose. I have not distributed my code, though, so there is no widespread testing.
I am also not sure what the limitations are with regards to what functions may safely be called from the context of one of these callbacks. I assume it is similar to DllMain(), which makes it dangerous to use this as a general-purpose mechanism to execute arbitrary code.
I would guess it has some (more or less) serious restrictions (but it shouldn't need to hold the loader lock, making it somewhat less restrictive than DllMain).
In particular, constructors often load DLLs as a side effect, particularly if lazy loading is being used, and this may cause difficult to diagnose or reproduce problems if one such constructor is being used in a TLS construction or destruction context. More research is required in this area.
In any case, I think this mechanism is definitely something that should be examined if static thread support is being considered.
Definitely agreed on that. // Johan

"Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com> wrote in message news:40C6168B.20400@aaronwl.com...
Johan Nilsson wrote:
There is a little-known mechanism present in the Windows executable support, called TLS callbacks, that can call a list of arbitrary functions when a process starts and exits, and when threads attach or detach (very similar to DllMain). The calls are made in the context of the relevant thread. No linked DLL or overridden or special thread functions are necessary, as with ordinary approaches.
You've actually tried it under Win9x, NT4, w2k and XP/2003? Longhorn? I'm not skeptical, just interested. In any case, if it's part of the official PE documentation it should be possible. I just can't imagine why I've never heard of this before ...
I have not exhaustively tested it due to lack of places to test. I am fairly confident it works reliably on Win9x and Win2k/XP. I would suspect it works on similar platforms. I don't have any idea about older NTs, Longhorn, CE, Win32s, or non-IA32 platforms, other than that the spec seems to indicate that it should work. You probably didn't hear about it for the same reason I didn't until I started studying the PE specification: there is no documented support for it outside of the specification. It was only relatively recently that I actually tried to use it, and was somewhat surprised when it actually worked.
TLS, and possibly some "glue code" to support the linker.
... meaning?
Well, I'm thinking more from an implementor's point of view. The linker requires a file, that should be provided by someone in the compiler system, to define a few symbols the linker needs to set up the TLS directory. More below.
Presently, MSVC has the necessary support, and GNU targets MinGW and Cygwin have all of the critical support as well, but lack the "glue."
I apologize; I misspoke on the fact that the MSVC support is complete. I believe this has something to do with missing support in tlssup.obj, but I'm limited in my ability to investigate due to intellectual property issues. Not too sure how to handle this.
The way to do it is to emit a pointer to the function to be called (which has the same signature as DllMain) in any of the 25 sections .CRT$XLA through .CRT$XLY. The linker merges all of these sections into a single null-terminated table of function pointers and creates a pointer to it from the executable image's TLS directory.
Do you mean something as simple as (for msvc):
--------------------
#pragma code_seg(push, old_seg)
#pragma code_seg(".CRT$XLA")
BOOL WINAPI TlsCallback(HINSTANCE, DWORD, LPVOID)
{
    <whatever>
    return TRUE;
}
#pragma code_seg(pop, old_seg)
------------------
What are the descriptions for the parameters above?
In C,
#include <stdio.h>
#include <windows.h>
void NTAPI my_callback (PVOID DllHandle, DWORD Reason, PVOID Reserved)
{
    char buf[50];
    sprintf(buf, "0x%lx 0x%lx 0x%lx", (unsigned long)DllHandle, (unsigned long)Reason, (unsigned long)Reserved);
    MessageBox(0, buf, "callback", 0);
}
#pragma data_seg(push, old_seg)
#pragma data_seg(".CRT$XLB")
DWORD my_callback_ptr = (DWORD)my_callback;
#pragma data_seg(pop, old_seg)
int main()
{
    printf("hi\n");
    return 0;
}
Note, this won't work because, as mentioned, for some reason it isn't being added to the TLS directory.
The parameters and all other docs on TLS callbacks are in "Microsoft Portable Executable and Common Object File Format Specification Revision 6.0 - February 1999" 6.7.2, apparently freely available by websearch.
Contrary to what I might have misleadingly stated before, no compiler seems to directly support using this out of the box. I don't know what is wrong with Microsoft's tlssup.obj, if that is the problem. I do know that the callbacks work in actual executables, and I have gotten it to work on GNU binutils with some patches, and MSVC LINK with a replacement for tlssup.obj.
It is my intention to get this feature working 100% on GNU GCC and binutils, which I admit are my primary interest, but I am unsure about MSVC. I am very hesitant to investigate MSVC any more than necessary as I do not want to get myself into ridiculous IP problems.
I also neglected to mention that this feature may be unusable, in general, because the TLS callbacks are called before the C library's are. This means that stdio and the heap may not be fully available, at least for MSVC. The standard C library (and perhaps other DLLs?) need to expect that they may not be notified first of ctors/dtors. MSVCRT may do this, but I am not sure. I don't know what guarantees there are about MSVCRT.DLL being notified before any other loaded DLL, anyway.
So, well, I have no idea how helpful this is to anyone, and I'm painfully aware of how hypocritical it is to not be able to provide a simple testcase. I just wanted to let people know that Windows seems to have this capability, and that the GNU environment on Windows will probably support this sometime soon. Aaron W. LaFramboise

<snip>
I apologize; I misspoke on the fact that the MSVC support is complete. I believe this has something to do with missing support in tlssup.obj, but I'm limited in my ability to investigate due to intellectual property issues. Not too sure how to handle this.
This is an interesting possibility, however there are references in the archives to Bill Kempf having discussions with folks at Microsoft about how to fix tss/tls cleanup on Win32; unfortunately he doesn't reveal whether using TLS callbacks was suggested, so we don't know if it has already been considered, and possibly rejected.
It is my intention to get this feature working 100% on GNU GCC and binutils, which I admit are my primary interest, but I am unsure about MSVC. I am very hesitant to investigate MSVC any more than necessary as I do not want to get myself into ridiculous IP problems.
Unfortunately MSVC is probably what most people will be using :-(
I also neglected to mention that this feature may be unusable, in general, because the TLS callbacks are called before the C library's are. This means that stdio and the heap may not be fully available, at least for MSVC. The standard C library (and perhaps other DLLs?) need to expect that they may not be notified first of ctors/dtors. MSVCRT may do this, but I am not sure. I don't know what guarantees there are about MSVCRT.DLL being notified before any other loaded DLL, anyway.
Hmm. So an application whose tls cleanup required 'C' library functions would probably fail. An application that required a TLS cleanup handler that (for example) acquired a process shared mutex, fiddled with some process shared memory and then released the mutex might work, but would anyone like to guarantee it (without support from Microsoft)? I know that this is a 'corner case' that some would consider boost::thread doesn't need to support; my imagination has run as far as a monitoring application that uses a process shared mutex to 'watch' the threads created and destroyed by some other application (the one that we want to test) - I have applications now where a utility like this could be useful. You could do this by allocating a thread_specific_ptr object that recorded the process id, thread id and optional name of the thread on thread startup; thread_specific_ptr object destruction would remove the thread from the list in the 'monitoring' application. Another way to do this would be to use sockets - anyone like to wager anything on whether this might work with TLS callbacks?
So, well, I have no idea how helpful this is to anyone, and I'm painfully aware of how hypocritical it is to not be able to provide a simple testcase. I just wanted to let people know that Windows seems to have this capability, and that the GNU environment on Windows will probably support this sometime soon.
It's certainly interesting; like you though, I'm not convinced that it will solve the general case :-( Malcolm Malcolm Noyes

Malcolm Noyes wrote:
It is my intention to get this feature working 100% on GNU GCC and binutils, which I admit are my primary interest, but I am unsure about MSVC. I am very hesitant to investigate MSVC any more than necessary as I do not want to get myself into ridiculous IP problems.
Unfortunately MSVC is probably what most people will be using :-(
Well, I think it might be possible to get it working by linking against a replacement for tlssup.obj that will do the right thing. In fact, I happen to have such a replacement that was used with GCC to get static TLS to work, but I haven't tested it. I think it could be made to work for MSVC. As mentioned, I also plan to have it working reliably with GCC. Between GCC and MSVC, perhaps we have enough compilers to have a good basis for implementation. And since TLS callbacks are standard, who knows, maybe they're easy to get working on other compilers too.
I also neglected to mention that this feature may be unusable, in general, because the TLS callbacks are called before the C library's are. This means that stdio and the heap may not be fully available, at least for MSVC. The standard C library (and perhaps other DLLs?) need to expect that they may not be notified first of ctors/dtors. MSVCRT may do this, but I am not sure. I don't know what guarantees there are about MSVCRT.DLL being notified before any other loaded DLL, anyway.
Hmm. So an application whose tls cleanup required 'C' library functions would probably fail. An application that required a TLS cleanup handler that (for example) acquired a process shared mutex, fiddled with some process shared memory and then released the mutex might work, but would anyone like to guarantee it (without support from Microsoft)? I know that this is a 'corner case' that some would consider boost::thread doesn't need to support; my imagination has run as far as a monitoring application that uses a process shared mutex to 'watch' the threads created and destroyed by some other application (the one that we want to test) - I have applications now where a utility like this could be useful. You could do this by allocating a thread_specific_ptr object that recorded the process id, thread id and optional name of the thread on thread startup; thread_specific_ptr object destruction would remove the thread from the list in the 'monitoring' application. Another way to do this would be to use sockets - anyone like to wager anything on whether this might work with TLS callbacks?
I'm not sure exactly what happens. It needs testing, but in any case, I'm sure there are no documented guarantees. I also think some hack to work around problems, if they are problems, could be made to work reliably. I do feel fairly strongly that the ability to nondeterministically call destructors for objects in dynamic TLS is important, and achievable, somehow. The "I think this can work but I can't really say for sure" is due to me not having time to tear into this yet. Since interest in this mechanism has been indicated, I will definitely look into this and get the answers with regards to its usability within a few weeks. Hopefully then I will be able to say more definitively, "Yes, Boost can use this, and this is how" or whatever. Aaron W. LaFramboise

"Aaron W. LaFramboise" <aaronrabiddog51@aaronwl.com> wrote in message news:40C94580.5070600@aaronwl.com...
Malcolm Noyes wrote:
[snip discussion of GCC & MSVC implementations]
I also neglected to mention that this feature may be unusable, in general, because the TLS callbacks are called before the C library's are. This means that stdio and the heap may not be fully available, at least for MSVC. The standard C library (and perhaps other DLLs?) need to expect that they may not be notified first of ctors/dtors. MSVCRT may do this, but I am not sure. I don't know what guarantees there are about MSVCRT.DLL being notified before any other loaded DLL, anyway.
I assume this means that "attach" callbacks are called before the C library's and "detach" callbacks are called *after* the C library's? Otherwise it's not a problem, as Boost.Thread's tss cleanup is interested primarily in detach callbacks (a process attach callback would be used to call TlsAlloc(), but that should work regardless of whether the C library is loaded). It may not be much of a problem, anyway, except when the cleanup is of a sort that must be performed even on process exit, which seems unusual to me for thread-specific storage. What I mean is, if necessary, the tss cleanup could detect that the process is exiting and stop trying to run cleanup commands (to prevent access errors, etc.), "leaking" any resources stored in thread_specific_ptr objects. It seems to me that for many applications such leaking wouldn't be much of a problem (but I may be wrong).
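The "stop running cleanup once the process is exiting" idea can be sketched in portable C++. This is only an illustration under stated assumptions: `cleanup_registry`, `run_thread_cleanup` and `mark_process_exiting` are hypothetical names, not part of Boost.Thread or the Win32 API, and standard C++ stands in for the attach/detach notifications discussed here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical per-thread list of cleanup actions (a stand-in for tss slots).
class cleanup_registry {
public:
    void add(std::function<void()> fn) {
        assert(fn && "cleanup handler must be callable");
        handlers_.push_back(std::move(fn));
    }

    // Intended to be called on thread detach.
    // Returns the number of handlers that were actually run.
    std::size_t run_thread_cleanup() {
        if (process_exiting().load()) {
            // The runtime may already be partly torn down, so touch nothing
            // and deliberately "leak" whatever the handlers would have freed.
            return 0;
        }
        std::size_t ran = 0;
        // Run in reverse order of registration, as destructors would.
        for (auto it = handlers_.rbegin(); it != handlers_.rend(); ++it) {
            (*it)();
            ++ran;
        }
        handlers_.clear();
        return ran;
    }

    // Intended to be called once, when process shutdown is detected.
    static void mark_process_exiting() { process_exiting().store(true); }

private:
    static std::atomic<bool>& process_exiting() {
        static std::atomic<bool> flag{false};
        return flag;
    }
    std::vector<std::function<void()>> handlers_;
};
```

How the exiting state would be detected on Win32 is left open here, as it is in the discussion above.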
Hmm. So an application whose tls cleanup required 'C' library functions would probably fail.
That's what it sounds like to me--but only when the process is exiting, I presume?
An application that required a TLS cleanup handler that (for example) acquired a process shared mutex, fiddled with some process shared memory and then released the mutex might work, but would anyone like to guarantee it (without support from Microsoft)? I know that this is a 'corner case' that some would consider boost::thread doesn't need to support; my imagination has run as far as a monitoring application that uses a process shared mutex to 'watch' the threads created and destroyed by some other application (the one that we want to test) - I have applications now where a utility like this could be useful. You could do this by allocating a thread_specific_ptr object that recorded the process id, thread id and optional name of the thread on thread startup; thread_specific_ptr object destruction would remove the thread from the list in the 'monitoring' application.
I'm not seeing this. Do you mean that the watching application would create a thread_specific_ptr object for a thread in the other application? I don't see how that would work: AFAIK the Win32 API Tls* functions don't work that way. Even if they did, it sounds like you would need static Tls unload messages from the other application, and I don't see how that would work, either. Unless you mean that the watched application is monitoring its own threads and telling the watcher application when they are attached and detached?
Another way to do this would be to use sockets - anyone like to wager anything on whether this might work with TLS callbacks ?
I'm not sure exactly what happens. It needs testing, but in any case, I'm sure there are no documented guarantees. I also think some hack to work around problems, if they are problems, could be made to work reliably. I do feel fairly strongly that the ability to nondeterministically call destructors for objects in dynamic TLS is important, and achievable, somehow.
I agree.
The "I think this can work but I can't really say for sure" is due to me not having time to tear into this yet. Since interest in this mechanism has been indicated, I will definitely look into this and get the answers with regards to its usability within a few weeks. Hopefully then I will be able to say more definitively, "Yes, Boost can use this, and this is how" or whatever.
Please do: I'm very interested. Mike

Michael Glassford wrote: [...]
It may not be much of a problem, anyway, except when the cleanup is of a sort that must be performed even on process exit, which seems unusual to me for thread-specific storage. ...
Yep. http://www.opengroup.org/austin/mailarchives/ag-review/msg01792.html regards, alexander.

"Alexander Terekhov" <terekhov@web.de> wrote in message news:40C9D052.A5F4987F@web.de...
Michael Glassford wrote: [...]
It may not be much of a problem, anyway, except when the cleanup is of a sort that must be performed even on process exit, which seems unusual to me for thread-specific storage. ...
Yep.
http://www.opengroup.org/austin/mailarchives/ag-review/msg01792.html
IIUC, the text at this link seems to refer to processes that are terminated due to an error condition; what about processes that exit normally? Mike

Michael Glassford wrote: [...]
http://www.opengroup.org/austin/mailarchives/ag-review/msg01792.html
IIUC, the text at this link seems to refer to processes that are terminated due to an error condition;
The section refers to normal process termination. http://www.opengroup.org/onlinepubs/009695399/functions/exit.html If you join http://www.opengroup.org/austin (there are no fees for membership), you can download free-only-to-members better formatted (with line numbers, etc.) PDFs. http://www.opengroup.org/publications/catalog/c046.htm http://www.opengroup.org/publications/catalog/c047.htm http://www.opengroup.org/publications/catalog/c048.htm http://www.opengroup.org/publications/catalog/c049.htm regards, alexander.

On Fri, 11 Jun 2004 07:14:40 -0400, in gmane.comp.lib.boost.devel Michael Glassford wrote:
I assume this means that "attach" callbacks are called before the C library's and "detach" callbacks are called *after* the C library's? Otherwise it's not a problem, as Boost.Thread's tss cleanup is interested primarily in detach callbacks (a process attach callback would be used to call TlsAlloc(), but that should work regardless of whether the C library is loaded).
I agree - loading shouldn't be a problem, however wouldn't you need to have the C runtime loaded so that when the tls cleanup handler runs it has access to functions that may be called during object releasing - for example a custom cleanup handler might call 'free', and this might fail, if I understand correctly ? Also, my impression is that this is related in some way to __declspec(thread) variables which suggests that: a) this may not work well for dynamically loaded DLLs (i.e. LoadLibrary, not statically linked to the stubs) b) we'd probably need to check that whatever mechanism works for MSVC doesn't break __declspec(thread) variables. Both will be pretty easy to check, of course.
It may not be much of a problem, anyway, except when the cleanup is of a sort that must be performed even on process exit, which seems unusual to me for thread-specific storage. What I mean is, if necessary, the tss cleanup could detect that the process is exiting and stop trying to run cleanup commands (to prevent access errors, etc.), "leaking" any resources stored in thread_specific_ptr objects. It seems to me that for many applications such leaking wouldn't be much of a problem (but I may be wrong).
I think I'd agree, and for any 'corner cases' it's always possible to require a call to 'tss.reset(0)' to free the resources before the process exits.
Hmm. So an application whose tls cleanup required 'C' library functions would probably fail.
That's what it sounds like to me--but only when the process is exiting, I presume?
That was my understanding as well; having read the relevant bits of the documentation it almost looks like the TLS callback 'hooks' were put there to implement just what we need. My only concern would be why they never got implemented (it appears to have been in the PE format for some time) - maybe someone from MS can enlighten us . . .
I'm not seeing this.
Sorry - didn't explain myself very well. Application to be tested does this:
    struct thread_info {
        thread_info(const std::string& name) {
            _id = GetCurrentThreadId();
            _id_name = // make some string to identify the thread
            HANDLE h = CreateMutex(..., "NameKnownToUsAndMonitor");
            WaitForSingleObject(h,INFINITE);
            // stuff the string into process shared memory
            ReleaseMutex(h);
            CloseHandle(h);
        }
        ~thread_info() {
            HANDLE h = CreateMutex(..., "NameKnownToUsAndMonitor");
            WaitForSingleObject(h,INFINITE);
            // remove the string from shared memory
            ReleaseMutex(h);
            CloseHandle(h);
        }
        DWORD _id;
        std::string _id_name;
    };
    thread_specific_ptr<thread_info> g_ti;
    DWORD WINAPI thread_proc(void*) {
        g_ti.reset(new thread_info("foo thread"));
        ....
        // when thread exits, thread local g_ti object is destroyed,
        // removes id from shared memory
    }
    const int NUM_THREADS = 10;
    int main(int, char*[]) {
        HANDLE h[NUM_THREADS];
        for (int i=0; i < NUM_THREADS; ++i)
            h[i] = CreateThread(NULL, 0, thread_proc, NULL, 0, NULL);
        WaitForMultipleObjects(NUM_THREADS, h,
                               TRUE, // wait all
                               INFINITE);
    }
Monitor does something like this:
    void foo() {
        HANDLE h = CreateMutex(..., "NameKnownToUsAndMonitor");
        while(1) {
            WaitForSingleObject(h,INFINITE);
            // read strings from shared memory and display on dialog box
            ReleaseMutex(h);
            Sleep(100); // for example
        }
        CloseHandle(h);
    }
Thus monitor 'knows' about all threads in test app. I'm planning to implement this as a test case for the three variations of 'fixes' that I outlined the other day (I'm working on test cases for 2 of these variations now ;-).
Please do: I'm very interested.
So am I, and having read the docs (and not being an expert on linking), I can't begin to imagine how Aaron will make it work with MSVC, but I'll be very happy if it does work ;-) Good luck Aaron!
Malcolm
Malcolm Noyes

"Malcolm Noyes" <boost@alchemise.com> wrote in message news:nbojc05l4vscs8jth4rneuhaj56tertm7f@4ax.com...
On Fri, 11 Jun 2004 07:14:40 -0400, in gmane.comp.lib.boost.devel Michael Glassford wrote:
I assume this means that "attach" callbacks are called before the C library's and "detach" callbacks are called *after* the C library's? Otherwise it's not a problem, as Boost.Thread's tss cleanup is interested primarily in detach callbacks (a process attach callback would be used to call TlsAlloc(), but that should work regardless of whether the C library is loaded).
I agree - loading shouldn't be a problem, however wouldn't you need to have the C runtime loaded so that when the tls cleanup handler runs it has access to functions that may be called during object releasing - for example a custom cleanup handler might call 'free', and this might fail, if I understand correctly ?
Yes. However, it seems to me that the C library will be available in the normal scenario of a thread detaching when the process isn't exiting, and that it will only become unavailable when the process is exiting, which, as I mentioned below, could be treated as a special case where cleanup is not attempted (better: is performed by some other mechanism). All this is assuming that the C library is actually unloaded too soon, which AFAIK hasn't been verified yet.
Also, my impression is that this is related in some way to __declspec(thread) variables which suggests that:
That's what I was thinking, too, though I haven't attempted to verify it yet.
a) this may not work well for dynamically loaded DLLs (i.e. LoadLibrary, not statically linked to the stubs)
Presumably DllMain will work here, however--if we can hook into it.
b) we'd probably need to check that whatever mechanism works for MSVC doesn't break __declspec(thread) variables.
Both will be pretty easy to check, of course.
It may not be much of a problem, anyway, except when the cleanup is of a sort that must be performed even on process exit, which seems unusual to me for thread-specific storage. What I mean is, if necessary, the tss cleanup could detect that the process is exiting and stop trying to run cleanup commands (to prevent access errors, etc.), "leaking" any resources stored in thread_specific_ptr objects. It seems to me that for many applications such leaking wouldn't be much of a problem (but I may be wrong).
I think I'd agree, and for any 'corner cases' it's always possible to require a call to 'tss.reset(0)' to free the resources before the process exits.
Or, hopefully, there might be another way to force a cleanup of all threads if the process is exiting.
Hmm. So an application whose tls cleanup required 'C' library functions would probably fail.
That's what it sounds like to me--but only when the process is exiting, I presume?
That was my understanding as well; having read the relevant bits of the documentation it almost looks like the TLS callback 'hooks' were put there to implement just what we need. My only concern would be why they never got implemented (it appears to have been in the PE format for some time) - maybe someone from MS can enlighten us . . .
I'm not seeing this.
Sorry - didn't explain myself very well. Application to be tested does this:
[snip code showing what it does]
Thus monitor 'knows' about all threads in test app.
OK, that makes sense. But I don't understand what the problem is, then--unless you're thinking that some Win32 API functions won't be available at the point where the tss cleanup is called?
I'm planning to implement this as a test case for the three variations of 'fixes' that I outlined the other day (I'm working on test cases for 2 of these variations now ;-).
Good!
Please do: I'm very interested. So am I, and having read the docs (and not being an expert on linking), I can't begin to imagine how Aaron will make it work with MSVC, but I'll be very happy if it does work ;-) Good luck Aaron!
Mike

I have already tried using the profiler mentioned below. It is a very slick thing. But it was not that useful on a complete run (that normally takes 20 hours or so) because the instrumented version takes orders of magnitude longer to run. In fact, after about an hour of running the instrumented version, the run was not even past the preliminary initialization stage :(. That being said, I think there is surely a lot of room for more clever and faster code, but it simply is computationally intensive by nature....
Peter
Johan Nilsson <johan.nilsson@esrange.ssc.se> wrote:
Hi,
"Peter Danford" wrote in message news:20040602024428.45768.qmail@web60606.mail.yahoo.com...
I'm not sure other than the reasons regarding optimizations possible with statically linked libs. I know that there is also some overhead involved in calling dll functions as opposed to the static counterpart, but this surely is a small cost. Other than that, I am not sure.
As you are running VS.NET 2003, you could try the (free) community edition of Compuware's profiler, downloadable from: http://www.compuware.com/products/devpartner/profiler/default.asp Instrumenting your application (and boost.thread) should help you to find out the reason for the performance degradation when using the runtime linked version.

"Peter Danford" <pdanford_qed@yahoo.com> wrote in message news:20040603033313.94424.qmail@web60604.mail.yahoo.com...
I have already tried using the profiler mentioned below. It is a very slick thing. But it was not that useful on a complete run (that normally takes 20 hours or so) because the instrumented version takes orders of magnitude longer to run. In fact, after about an hour of running the instrumented version, the run was not even past the preliminary initialization stage :(.
If you could just leave it running long enough to perform the "real work" for some time (leave it running overnight perhaps) I guess you could abort it and take a look at the data collected so far. If you could do this once for the dynamic-linked version and once for the static-linked version, you should be able to see any percentage differences between times spent in the various Boost.Thread functions. Perhaps you could leave the main modules uninstrumented, and just prepare the Boost.Thread stuff for instrumentation to avoid the excessive startup time. This is just pure speculation - I don't even know if this is possible.
That being said, I think there is surely a lot of room for more clever and faster code, but it simply is computationally intensive by nature....
The computationally intensive stuff should be unrelated to Boost.Thread performance. Actually I have a hard time finding out why there should be a difference at all unless you create a lot of temporary threads all the time (causing locks in DllMain's THREAD_ATTACH / DETACH notifications IIRC). // Johan

The problem is not the Boost.thread's dll itself, it is that using this dll requires using the RTL dll also. The extensive use of the RTL (in dll form) is apparently where the cost is. But I do not know if it is because I lose some of the optimizations or what...
Peter
Johan Nilsson <johan.nilsson@esrange.ssc.se> wrote:
That being said, I think there is surely a lot of room for more clever and faster code, but it simply is computationally intensive by nature....
The computationally intensive stuff should be unrelated to Boost.Thread performance. Actually I have a hard time finding out why there should be a difference at all unless you create a lot of temporary threads all the time (causing locks in DllMain's THREAD_ATTACH / DETACH notifications IIRC). // Johan
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

I am very glad to hear that you are working on the static problem though. The latest checkout from thread_dev branch revealed the use of thread specific storage in thread.cpp, so it looks like the latest work there may be making it more difficult to make a static subset...
Yes, I have pointed this out in the past. I have changes ready that allow building Boost.Threads as a statically linked library, which simply omit the tss implementation from the statically linked version of the library on Win32 platforms. I've hesitated to check them in because tss will be required by the implementation of the next version of the thread class, so unless I can implement tss cleanup properly on Win32 (which I haven't proved yet) the static library option would have to disappear again on Win32. Mike

By introducing the tss into the thread implementation mentioned below, are we gaining functionality so useful it outweighs the benefit realized by having a static option available?
Peter
Michael Glassford <glassfordm@hotmail.com> wrote:
I am very glad to hear that you are working on the static problem though. The latest checkout from thread_dev branch revealed the use of thread specific storage in thread.cpp, so it looks like the latest work there may be making it more difficult to make a static subset...
Yes, I have pointed this out in the past. I have changes ready that allow building Boost.Threads as a statically linked library, which simply omit the tss implementation from the statically linked version of the library on Win32 platforms. I've hesitated to check them in because tss will be required by the implementation of the next version of the thread class, so unless I can implement tss cleanup properly on Win32 (which I haven't proved yet) the static library option would have to disappear again on Win32. Mike

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org]On Behalf Of Peter Danford Sent: Thursday, June 03, 2004 3:28 PM To: boost@lists.boost.org Subject: Re: [boost] Re: Re: Thread lib (good reason for static)
By introducing the tss into the thread implementation mentioned below, are we gaining functionality so useful it outweighs the benefit realized by having a static option available?
Hi Boosters,
While I cannot respond directly to the question posed above, maybe I can make useful observations on the wider issue? I have just been through the process of upgrading a plug-in architecture. This was based fundamentally (no surprise to anyone) on DLLs. A few relevant experiences from this upgrade follow:
1. Exposing STL containers across load modules is problematic. Firstly, this was implementation-related; statics in template classes (STL - MSVC6/P.J. Plauger). An STL upgrade fixed that. Secondly, any runtime changes to such a container involve heap operations. If you modify the containers in different load modules then chaos ensues as the template generated code operates on the heap in the calling load module.
2. Complex link-time dependencies forced a backdown on wide use of DLLs as a general library strategy. An explanation is a bit too involved to take up space here. Suffice to say that with the complex combination of parts (application, local static libs, compiler runtime support; CRT, STL, Boost and 3rd party libs), DLLs as a general approach did not appear to be viable. Some effort was expended in this direction as the perceived benefits were very cool. The effort wasn't quite wasted but the goal was not reached.
While this experience is not specifically about Boost it did involve templates and DLLs. It is also quite possible that the goals of the upgrade were just too ambitious; limitations of DLLs were just not properly appreciated (they are now :-). So far I have sidestepped the issue with Boost by only using those components that come in header(s).
Cheers, Scott

Peter Danford wrote:
By introducing the tss into the thread implementation mentioned below, are we gaining functionality so useful it outweighs the benefit realized by having a static option available?
This is a question we better not get into. ;-) A brief history. At first, Boost.Threads supported static linking. It used a helper DLL (threadmon.dll) to implement TSS cleanup on Windows. At that time I suggested to Bill Kempf that (a) it might be useful to isolate the mechanism in a dedicated "at_thread_exit" function exported by threadmon.dll, thereby allowing users to implement their own cleanup (or to omit it entirely if they know that their application does not continually create and destroy threads) and (b) that if Boost.Threads is implemented itself as a DLL (as pthreads-win32 is) there would be no need for a separate threadmon.dll. Now obviously (b) without (a) - the current status quo - is somewhat problematic for people that don't like DLLs or the DLL runtime (of which I am one). So I think that we ought to implement (a) as well; statically linking to Boost.Threads would then result in an unresolved reference to at_thread_exit, expecting a suitable definition from the user (or from at_thread_exit.dll, which we'd also supply). The DLL version of Boost.Threads should, of course, continue to work out of the box, with no user intervention required. It is very unfortunate that Bill is no longer active, or even reachable for comment. I wonder what happened to him.
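A rough sketch of what a user-supplied `at_thread_exit` could look like, written in portable C++ for illustration. Only the function name comes from the proposal above; its signature, the `sketch` namespace, `exit_list` and `run_demo` are assumptions, and C++ `thread_local` storage is swapped in for the DLL thread-detach mechanism that threadmon.dll relied on.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

namespace sketch {

// Holds the callbacks registered by one thread and runs them when that
// thread ends (the destructor of a thread_local object runs at thread exit).
struct exit_list {
    std::vector<std::function<void()>> callbacks;
    ~exit_list() {
        // Last registered runs first, mirroring destructor order.
        for (auto it = callbacks.rbegin(); it != callbacks.rend(); ++it) {
            (*it)();
        }
    }
};

// Hypothetical hook: in the proposal the library would only declare this,
// and the user (or a helper DLL) would supply a definition such as this one.
inline void at_thread_exit(std::function<void()> fn) {
    assert(fn && "callback must be callable");
    thread_local exit_list list;  // one list per thread
    list.callbacks.push_back(std::move(fn));
}

// Example: record the order in which two callbacks run on a worker thread.
inline std::vector<int> run_demo() {
    std::vector<int> order;
    std::thread worker([&order] {
        at_thread_exit([&order] { order.push_back(1); });
        at_thread_exit([&order] { order.push_back(2); });
    });
    worker.join();  // the worker's thread_local destructors have run by now
    return order;
}

}  // namespace sketch
```

An application that never destroys threads could supply an empty definition instead, which is the "omit it entirely" case mentioned above.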

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:00b101c4495e$72390730$0600a8c0@pdimov...
Peter Danford wrote:
By introducing the tss into the thread implementation mentioned below, are we gaining functionality so useful it outweighs the benefit realized by having a static option available?
This is a question we better not get into. ;-)
A brief history. At first, Boost.Threads supported static linking. It used a helper DLL (threadmon.dll) to implement TSS cleanup on Windows. At that time I suggested to Bill Kempf that (a) it might be useful to isolate the mechanism in a dedicated "at_thread_exit" function exported by threadmon.dll, thereby allowing users to implement their own cleanup (or to omit it entirely if they know that their application does not continually create and destroy threads) and (b) that if Boost.Threads is implemented itself as a DLL (as pthreads-win32 is) there would be no need for a separate threadmon.dll.
Now obviously (b) without (a) - the current status quo - is somewhat problematic for people that don't like DLLs or the DLL runtime (of which I am one). So I think that we ought to implement (a) as well; statically linking to Boost.Threads would then result in an unresolved reference to at_thread_exit, expecting a suitable definition from the user (or from at_thread_exit.dll, which we'd also supply). The DLL version of Boost.Threads should, of course, continue to work out of the box, with no user intervention required.
I like the idea. I had vague thoughts along those lines, but hadn't really formulated anything yet. Now that you mention your discussion with Bill Kempf, which I wasn't aware of before, I see signs in the code that he was working in that direction. I'll see what I can do to finish it.
It is very unfortunate that Bill is no longer active, or even reachable for comment. I wonder what happened to him.
I agree. Mike

"Peter Danford" <pdanford_qed@yahoo.com> wrote in message news:20040603032735.71640.qmail@web60601.mail.yahoo.com...
By introducing the tss into the thread implementation mentioned below, are we gaining functionality so useful it outweighs the benefit realized by having a static option available?
I think so. Here's a quote from a previous post I made on the subject: "The reasons [for using tss in the thread class's implementation] are two-fold: 1) The thread class becomes a handle class that holds a reference-counted pointer to a thread_data class. When a thread class is created, it gets access to the thread_data class for the thread using TLS. 2) As you might guess from the name of the thread_data class, there is other information associated with each thread; for instance, a thread id, a flag indicating whether the thread has been cancelled (yes, there is an exception-based implementation of thread cancellation), etc." Mike
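The handle-plus-`thread_data` arrangement described in the quote might look roughly like the following. This is a sketch, not the thread_dev implementation: `thread_handle`, `current_thread_data` and the member names are hypothetical, and C++ `thread_local` with `std::shared_ptr` stands in for Win32 TLS and the library's own reference counting.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>

namespace sketch {

// Per-thread bookkeeping; the fields follow the quoted description
// (a thread id and a "cancelled" flag).
struct thread_data {
    std::thread::id id = std::this_thread::get_id();
    std::atomic<bool> cancelled{false};
};

// Returns the calling thread's data, creating it on first use.
inline std::shared_ptr<thread_data> current_thread_data() {
    thread_local std::shared_ptr<thread_data> data =
        std::make_shared<thread_data>();
    return data;
}

// Handle class: every handle created on a thread shares that thread's
// reference-counted thread_data, found through thread-local storage.
class thread_handle {
public:
    thread_handle() : data_(current_thread_data()) { assert(data_); }

    void cancel() { data_->cancelled.store(true); }
    bool cancelled() const { return data_->cancelled.load(); }
    std::thread::id id() const { return data_->id; }
    bool same_thread(const thread_handle& other) const {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<thread_data> data_;
};

}  // namespace sketch
```

Because the handle holds a counted reference, the data can outlive the thread it describes, which is one reason tss cleanup has to drop the thread's own reference when the thread ends.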

unless I can implement tss cleanup properly on Win32 (which I haven't proved yet) the static library option would have to disappear again on Win32.
There are 3 possible ways that I'm aware of to attempt to fix this problem, although all have their drawbacks. I've tested at least 2 of them on some samples so I know that in principle they work, but I haven't had enough time to test these to identify all the issues. The 3 methods that I know of are:
i) At the first use of tss data, start a new thread to wait on the original thread handle. The theory is that when the original thread ends, the handle gets signalled, the 'watchdog' thread wakes up and cleans up the tss data for the original thread. An obvious optimisation is to have a watchdog class that can wait for several threads per instance (up to 63 since you'd have to have an event [or similar] to 'wake up' the thread when you wanted to add a new thread handle to wait for). This sounds easy but the optimisation complicates the implementation, as does the possibility that you might be waiting on a thread that doesn't end until after 'main' has ended (maybe the thread is joined in a static destructor). This is the solution that I tried first and I recall that there was some problem with termination that made it unworkable, but I'm sure that it should be possible. The idea of a watchdog has of course been mentioned before ;-)
ii) Implement a thread watchdog via a dll using thread detach. This is the idea that Michael referred to that Roland proposed. I haven't tried this for 2 reasons; first it *requires* the distribution of a dll, and I'm aware from my own experience how difficult that is in some organisations from a political and logistical point of view. Second, there are things that you can't do in DllMain, for example MSDN mentions this: "Calling Win32 functions other than TLS, object-creation, and file functions may result in problems that are difficult to diagnose" (as usual with MS documentation, it's not clear whether this restriction applies to DLL_THREAD_DETACH). So whilst it would be possible to create a dll that had a 'C' interface (so that we could 'LoadLibrary' easily) and provided a callback for a cleanup function (passing a thread id so that we can use it to lookup in a map and perform the cleanup), there may be (i.e. probably will be) restrictions on what 'cleanup' might be allowed to do, so basically any non-trivial cleanup (e.g. calling socket functions) would not be allowed.
iii) Implement a 'cleanup guard' as part of the thread function. Basically this means that we have an object that is responsible for managing the lifetime of the tss data and the lifetime of the guard object is limited by the 'scope' of the thread function (i.e. an instance of an object is declared at the beginning of the thread function and the destructor cleans up). Clearly this is easy to implement automatically for boost initialised threads, and would probably also be possible for use of tss data in the primary thread (i.e. dynamic initialisation of statics and 'main') by introducing a static. That would leave 'adopted' threads which would have to declare an instance of the guard object to be able to use tss data. This solution has the advantage of simplicity (it's trivial to implement the allocation and cleanup of the tss data using a vector, for example) but has the disadvantage that it would require a change to the published interface for tss data - users of adopted threads with tss data requirements have to declare an instance of the guard to manage the scope of the tss data. It also has one other significant advantage IMO, which is that it should be possible to allow each new instance of the guard to manage all tss data for the thread until it gets destroyed, which would allow for an instance of the guard to be declared in a functor that was passed to a thread pool. This would mean that tss data used by functors when a thread pool thread executes the function would be cleaned up when the *functor* exits, not when the thread exits (which conceptually may never happen). This seems to me to be a more sensible place to apply the cleanup than when the thread exits.
I'm trying to pull together a test implementation of all three methods - if I ever get there I'll post the results . . . Malcolm Noyes

"Malcolm Noyes" <boost@alchemise.com> wrote in message news:9lsub0pe67kve6uk1hfaif6o374hs9agg9@4ax.com...
unless I can implement tss cleanup properly on Win32 (which I haven't proved yet) the static library option would have to disappear again on Win32.
There are 3 possible ways that I'm aware of to attempt to fix this problem, although all have their drawbacks. I've tested at least 2 of them on some samples so I know that in principle they work, but I haven't had enough time to test these to identify all the issues. The 3 methods that I know of are:
i) At the first use of tss data, start a new thread to wait on the original thread handle. The theory is that when the original thread ends, the handle gets signaled, the 'watchdog' thread wakes up and cleans up the tss data for the original thread. An obvious optimisation is to have a watchdog class that can wait for several threads per instance (up to 63 since you'd have to have an event [or similar] to 'wake up' the thread when you wanted to add a new thread handle to wait for). This sounds easy but the optimisation complicates the implementation, as does the possibility that you might be waiting on a thread that doesn't end until after 'main' has ended (maybe the thread is joined in a static destructor). This is the solution that I tried first and I recall that there was some problem with termination that made it unworkable, but I'm sure that it should be possible. The idea of a watchdog has of course been mentioned before ;-)
This is an interesting approach, and worth looking into in spite of the problems you mention.
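The "up to 63" figure in (i) follows from the Win32 limit of 64 handles (MAXIMUM_WAIT_OBJECTS) per WaitForMultipleObjects call, with one slot reserved for the wake-up event. A small sketch of that arithmetic, using hypothetical names and hard-coding the constant so it compiles without the Windows headers:

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the Win32 MAXIMUM_WAIT_OBJECTS limit for WaitForMultipleObjects.
constexpr std::size_t max_wait_objects = 64;

// One slot is reserved for the event that wakes the watchdog.
constexpr std::size_t threads_per_watchdog = max_wait_objects - 1;

// Number of watchdog threads needed to monitor n thread handles.
constexpr std::size_t watchdogs_needed(std::size_t n) {
    return (n + threads_per_watchdog - 1) / threads_per_watchdog;
}
```

So a process with 64 threads using tss data would already need a second watchdog thread under this scheme.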
ii) Implement a thread watchdog via a dll using thread detach. This is the idea that Michael referred to that Roland proposed. I haven't tried this for 2 reasons; first it *requires* the distribution of a dll,
Actually, if I understand it correctly, the whole point of Roland's approach is that it doesn't require the distribution of a dll. The "dll" was created on-the-fly by the Boost.Thread code.
and I'm aware from my own experience how difficult that is in some organisations from a political and logistical point of view.
Unfortunately true.
Second, there are things that you can't do in DllMain, for example MSDN mentions this: "Calling Win32 functions other than TLS, object-creation, and file functions may result in problems that are difficult to diagnose" (as usual with MS documentation, it's not clear whether this restriction applies to DLL_THREAD_DETACH). So whilst it would be possible to create a dll that had a 'C' interface (so that we could 'LoadLibrary' easily) and provided a callback for a cleanup function (passing a thread id so that we can use it to lookup in a map and perform the cleanup), there may be (i.e. probably will be) restrictions on what 'cleanup' might be allowed to do, so basically any non-trivial cleanup (e.g. calling socket functions) would not be allowed.
This is a good point.
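The callback-plus-map arrangement described in (ii) might look something like the sketch below. `cleanup_registry` and `on_thread_detach` are hypothetical names, and `std::thread::id` stands in for the Win32 thread id so the example is portable. It only shows the lookup; it does nothing about the DllMain restrictions raised above, which would still limit what the cleanup itself may do.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical registry that a helper DLL would call back into on detach.
class cleanup_registry {
public:
    // Record the cleanup to run for the given thread.
    void add(std::thread::id id, std::function<void()> fn) {
        std::lock_guard<std::mutex> lock(mutex_);
        table_[id] = std::move(fn);
    }
    // Called with the id of the thread that is ending; returns true if a
    // cleanup was found and run.
    bool on_thread_detach(std::thread::id id) {
        std::function<void()> fn;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = table_.find(id);
            if (it == table_.end()) return false;
            fn = std::move(it->second);
            table_.erase(it);
        }
        fn();   // run outside the lock so the cleanup may use the registry
        return true;
    }
private:
    std::mutex mutex_;
    std::map<std::thread::id, std::function<void()>> table_;
};

// Demo: the cleanup runs once; a second notification finds nothing.
inline bool demo_cleanup_registry() {
    cleanup_registry reg;
    int cleaned = 0;
    std::thread::id id = std::this_thread::get_id();
    reg.add(id, [&cleaned] { ++cleaned; });
    bool first = reg.on_thread_detach(id);
    bool second = reg.on_thread_detach(id);
    return first && !second && cleaned == 1;
}
```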
iii) Implement a 'cleanup guard' as part of the thread function. Basically this means that we have an object that is responsible for managing the lifetime of the tss data and the lifetime of the guard object is limited by the 'scope' of the thread function (i.e. an instance of an object is declared at the beginning of the thread function and the destructor cleans up). Clearly this is easy to implement automatically for boost initialised threads, and would probably also be possible for use of tss data in the primary thread (i.e. dynamic initialisation of statics and 'main') by introducing a static. That would leave 'adopted' threads which would have to declare an instance of the guard object to be able to use tss data. This solution has the advantage of simplicity (it's trivial to implement the allocation and cleanup of the tss data using a vector, for example) but has the disadvantage that it would require a change to the published interface for tss data - users of adopted threads with tss data requirements have to declare an instance of the guard to manage the scope of the tss data. It also has one other significant advantage IMO, which is that it should be possible to allow each new instance of the guard to manage all tss data for the thread until it gets destroyed, which would allow for an instance of the guard to be declared in a functor that was passed to a thread pool. This would mean that tss data used by functors when a thread pool thread executes the function would be cleaned up when the *functor* exits, not when the thread exits (which conceptually may never happen). This seems to me to be a more sensible place to apply the cleanup than when the thread exits.
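A rough sketch of the guard in (iii), assuming C++11 and using hypothetical names (`tss_guard`, `at_scope_exit`); it is not a proposed Boost interface. Each guard collects cleanups in a vector and runs them in its destructor, and nested guards hand control back to the outer one, which is what allows a pool functor to declare its own.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical guard: owns the tss cleanups registered while it is the
// innermost guard on the current thread.
class tss_guard {
public:
    tss_guard() : previous_(current()) { current() = this; }
    ~tss_guard() {
        // Run cleanups in reverse order, then restore the outer guard.
        for (auto it = cleanups_.rbegin(); it != cleanups_.rend(); ++it) (*it)();
        current() = previous_;
    }
    tss_guard(const tss_guard&) = delete;
    tss_guard& operator=(const tss_guard&) = delete;

    // Register a cleanup with the innermost guard; false if there is none.
    static bool at_scope_exit(std::function<void()> fn) {
        if (current() == nullptr) return false;
        current()->cleanups_.push_back(std::move(fn));
        return true;
    }
private:
    // Innermost live guard on the calling thread, or nullptr.
    static tss_guard*& current() {
        thread_local tss_guard* g = nullptr;
        return g;
    }
    tss_guard* previous_;
    std::vector<std::function<void()>> cleanups_;
};

// Demo: a pool-style functor declares its own guard, so its data is
// released when the functor returns, not when the thread exits.
inline int demo_tss_guard() {
    int released = 0;
    std::thread worker([&released] {
        for (int job = 0; job < 3; ++job) {
            tss_guard guard;   // scope of one functor call
            tss_guard::at_scope_exit([&released] { ++released; });
        }
    });
    worker.join();
    return released;
}
```

A thread with no guard in scope gets `false` back from `at_scope_exit`, which corresponds to the interface change noted above: adopted threads would have to declare a guard before using tss data.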
I'm trying to pull together a test implementation of all three methods - if I ever get there I'll post the results . . .
I'd be very interested in the results. Mike
participants (10)
- Aaron W. LaFramboise
- Alexander Terekhov
- Johan Nilsson
- Malcolm Noyes
- Michael Glassford
- Peter Danford
- Peter Dimov
- Rob Stewart
- scott
- Trey Jackson