
Howard Hinnant wrote:
On Aug 22, 2007, at 2:59 PM, Zach Laine wrote:
Could the folks who object to the current design spell it out for me a bit more explicitly -- what in the design is dangerous/inconvenient enough to throw away one or more of the 4 goals?
I would like to see an answer to Zach's question too. I do not know what the major objection is with the "current proposal". I only know that people are suggesting alternatives.
I deliberately only submitted an alternative for consideration instead of poking holes in your design or argumentation. But if you insist...

I don't believe that my suggested alternative throws away any of the four goals, hypothetical malicious vendor D notwithstanding. It is still possible for the vendor to meet the goals. In addition, it adds Goal 5: the ability to control the level of checking globally; and Goal 6: ... without source changes or even recompilation. This can help if hypothetical user D, lured by the "no overhead, less L1 cache misses!" slogan, uses unchecked<> as a matter of habit. "Checking is for other people."

On reflection though, I'll change the constructor from

    explicit condition( Mutex * pm = 0 );

to

    explicit condition( Mutex * pm );

as it's too easy to accidentally disable checking by not including the condition in the member init list.

Going back to:

    class shared_mutex
    {
        typedef mutex mutex_t;
        typedef condition< unchecked<mutex_t> > cond_t;

        mutex_t mut_;
        cond_t gate1_;
        cond_t gate2_;
        unsigned state_;
        ...

and the L1 cache miss argument for:

    class A
    {
        shared_mutex mx_;
        ...
    };

    vector<A> v;

1. The size of shared_mutex, according to your numbers, is 104. If we assume 16 bytes of state in A, this makes sizeof(A) 120. The addition of two pointers makes it 128. This is a 7% increase, but it also happens to round the size of A up to 128, which means an A never straddles a cache line, so the "more bloated" version will actually be faster. Note that I picked the number 16 before realizing that. :-) If A had 24 bytes of state, the two pointers would of course be detrimental.

2. I'm having a hard time imagining a program where the L1 cache misses due to the increased size of A would matter. An object of type A allocates a not insignificant amount of kernel resources, so it would be hard to keep enough A's in memory for the +7% L1 misses to show up.

3. What is sizeof(pthread_rwlock_t) on your platform? Is it not something like 48-56? That's roughly half the size of the above shared_mutex, so users who are conscious of L1 cache misses will not use shared_mutex anyway.

4. The vector v would need to be protected by its own rwlock as well, since you can't reallocate it while someone is accessing the A's, which creates a central bottleneck. vector< shared_ptr<A> > does not suffer from this problem, as you can reallocate it while someone is holding a reference to an element in the form of a shared_ptr.

5. If optimizing the L1 cache friendliness of shared_mutex is an important goal, I would consider moving the conditions to the heap, as they aren't accessed on the fast path:

    class shared_mutex
    {
        mutex mx_;
        unsigned state_;
        void * rest_;
    };

I know that your actual shared_mutex will use atomics on state_, so you might even be able to make

    class shared_mutex
    {
        unsigned state_; // , state2_?
        void * rest_;
    };

work. Uncontended access now doesn't touch *rest_, and L1 cache misses are a fraction of what they used to be. Contended access does cause more cache misses, but this is overshadowed by the cost of the contention.
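
For illustration, here is a rough, self-contained sketch of the size comparison in point 5. It uses raw pthread types as stand-ins for the proposal's mutex and condition (so the numbers it prints are platform-dependent and are not the 104 quoted above), and it only compares layouts; it doesn't implement any locking:

    #include <pthread.h>
    #include <stdio.h>

    // Layout with the condition variables stored inline,
    // as in the shared_mutex sketch above.
    struct shared_mutex_inline
    {
        pthread_mutex_t mut_;
        pthread_cond_t gate1_;
        pthread_cond_t gate2_;
        unsigned state_;
    };

    // Layout with the slow-path machinery moved behind a single
    // pointer; the uncontended fast path only touches state_.
    struct shared_mutex_compact
    {
        unsigned state_;
        void * rest_; // lazily allocated mutex + conditions
    };

    int main()
    {
        printf( "inline layout:  %lu bytes\n", (unsigned long)sizeof( shared_mutex_inline ) );
        printf( "compact layout: %lu bytes\n", (unsigned long)sizeof( shared_mutex_compact ) );
        return 0;
    }

The compact version is just a word of state plus a pointer, which is the point: the condition machinery is only pulled into cache when there is actual contention.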