The algorithm makes sense, and if all these claims hold then it should work, I think. I'm still concerned about step (2), however, if only because of the assumption that what you write as a load/ store in C++ maps to a load/store in ML. Will it? Always? I ask as a non-rhetorical question: does the C++ standard *guarantee* it? Because if it doesn't then it seems like a potential problem, but maybe I'm being too careful.
If I could answer in the affirmative, I could prove it works and not have to talk to anyone. I can make a bunch of lesser claims, that of course fail to add up to the strong claim but make me feel better: 1. For years, people have written and relied on C/C++ characteristics that apparently were not in the standard until later if at all. If the popular platforms support it, it takes a strong reason to justify a violation of it. 2. Java, on any platform, would seem to strongly depend upon a reference assignment being accomplished all-or-nothing as a machine assignment. The alternative would be threads that see partial references being assigned in different threads and all the safety of references is out the window. This is such a common operation, it would be difficult to believe that there is thread synchronization going on around every pointer/reference assignment. 3. I have looked at the output of many C compilers and have never seen a pointer assignment take multiple instructions. The exception might be if someone used a poorly-written memcpy-like operations to transfer the contents of a structure byte by byte. A well-written one takes the respective alignments into account and transfers aligned 32-bit quantities one-by-one. We are talking about code in a controlled Hashtable class, it would have to be the "C" compiler choosing to transfer a class consisting of a single address field using something other than a single instruction. It doesn't sound logical to me, but it wouldn't surprise me completely to see it happen, but I would report it as a compiler bug and expect it to be addressed even though it is not in the standard. 4. These issues have to be addressed with the multi-processor cores becoming more prevalent because they are basic to lock-free processing, or can C++ accept significant inferiority to Java with respect to lock-free multiprocessing? Sorry that my arguments in this case frequently refer to Java.
Have you read any of the literature on lock-free data structures, btw? There was articles in the October and December '04 C/C++ User Journal about them.
I have read a number of articles on the subject, several specifically on the topic of lock-free hashtables, several with implementations they claim to be proven to be correct. My own first attempt is far from a proven-correct implementation. Many seem to arrive at similar conclusions, although each has its own slightly-different take on the subject and quite different mechanisms. They are also very often theoretical and build their proofs of completeness on features not apparently explicitly present in "C" standards. I certainly have yet to find an implementation I can just plug in and use. They often resort to garbage collection to solve the hard problems (even if they are working in C) or if they solve the deallocation problem, they only do so for the main array, not for the entries in the hashtable itself.
I realize that I haven't answered your original question about atomic_count, btw, and I'm sorry: the reason is simply that I don't know!
Thanks for the comments anyway, which were thought-provoking. For what it is worth, if anyone else thinks the solution is worth a generic contribution, or is of the opposite opinion that it would not be generic enough, I would be interested in hearing. I suppose this sort of question should go to the developers list. I think it would be much more difficult to implement a table of shared_ptr than it is for intrusive_ptr, because assigning a shared_ptr is much less likely to be uninterrupted, leaving other threads with a broken visible state. There is clearly more detail to discuss as well. Ray Whitmer