Before we go much further, please let me explain my use case and considerations. My context is the need for a more capable and highly performant shadow memory representation, as a lower-memory alternative for taint analysis. The lower memory requirement and richer taint identity come from the range-based encoding, which consumes less space than a bitvector shadow and leaves more room for identity storage. The scalability requirement is rather wild: we only ever need to be able to insert. We don't really need lookups, only a test for a single special key upon insertion, which is trivial, requiring only a single CPU test instruction.

The way I see the next-generation dynamic binary instrumentation library & runtime operating is this: rather than use a bitmap index, which has fast update times but O(N) space and overall poor identity storage, we can use an index plus some careful optimizations to reduce the number of "cache misses" by taking advantage of context. These optimizations actually exploit statistical characteristics of typical benign code rather than anything to do with the data structures themselves. By using a pre-allocated trie, we can index on the first three bytes of any given address, and retrieve/update/insert into that table in just a single instruction. That table fits in main memory easily, and maps keys into any range-offset encoding data structure. Between insertions, the algorithm just compares the first three bytes. If that test succeeds, only the variables that were last updated are incremented; i.e. the range-offset encoding is extended (one more instruction). That makes 2-3 instructions. Statistically speaking, the majority of memory affected is rather contiguous, for both source and destination operands.
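To make the fast path concrete, here is a minimal single-threaded sketch of the prefix-cached shadow map I have in mind. All names (`ShadowMap`, `RangeRecord`, etc.) are illustrative, not from any existing framework, and records displaced on the slow path are leaked for brevity:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One range-offset record: [base, base + len) carries one taint identity.
struct RangeRecord {
    uint32_t base;      // starting address of the range
    uint32_t len;       // length in bytes
    uint32_t identity;  // taint label for the whole range
};

// Sketch of the prefix-cached shadow map. The table is indexed by the top
// `prefix_bits` of the address (24 = the first three bytes).
class ShadowMap {
public:
    explicit ShadowMap(unsigned prefix_bits = 24)
        : shift_(32 - prefix_bits), table_(1u << prefix_bits, nullptr) {}

    // Record a one-byte access at `addr` carrying taint `id`.
    void insert(uint32_t addr, uint32_t id) {
        uint32_t prefix = addr >> shift_;
        // Fast path: same three-byte prefix as the previous access and the
        // new byte extends the cached range -> just bump its length.
        if (prefix == last_prefix_ && last_rec_ != nullptr &&
            addr == last_rec_->base + last_rec_->len &&
            id == last_rec_->identity) {
            ++last_rec_->len;
            return;
        }
        // Slow path ("cache miss"): open a new range and repoint the slot.
        last_rec_ = new RangeRecord{addr, 1, id};
        table_[prefix] = last_rec_;
        last_prefix_ = prefix;
    }

    const RangeRecord* lookup(uint32_t addr) const {
        return table_[addr >> shift_];
    }

private:
    unsigned shift_;
    std::vector<RangeRecord*> table_;
    uint32_t last_prefix_ = ~0u;      // no prefix cached yet
    RangeRecord* last_rec_ = nullptr; // record touched by the last insert
};
```

With `prefix_bits = 24` the table holds 2^24 slots as described above; shrinking it for a quick test keeps the same logic at a fraction of the memory.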
On modern hardware, we may even get away with storing multiple recent lookup contexts in CPU registers, there being quite a few available to work with, reducing penalties even further. I'm curious about using C++'s new relaxed memory-ordering capabilities, with fences and whatnot, and possibly some ring buffers holding copies of subsegments of said trie, with a dedicated maintainer thread merging those upon each cache flush from each of the respective worker threads. So, total core overhead is something like *N+C additional*, where N is the number of cores that the target process requires, and C is the one (or however many) additional maintainer threads that merge between the workers and the coherent data structure. Hell, if it works out well, let each worker have its own trie.

Overall, though, this is a data structure that is very sensitive to concurrent operation, because target threads may race with themselves, and beyond that we do not want analysis code to race either with its "brethren" or with that of the application. So, really, wait-free red-black trees were the only thing I could think of that was fastest in a concurrent context, afforded correctness, and could also store range-value encodings. But I'm open to anything that operates in a concurrent context and satisfies the requirements.

Now, back to everything you just said:

@rebalancing generates a lot of cache line invalidation traffic

Do you think a lazy rebalancing stratagem, possibly with a dedicated worker thread, could be applied?

@nedtries provide a size_t key index algorithms. You can build the rest from that.

I looked over the page where you linked me in further depth. The graphs are good, but I'm really disconcerted about the applicability of nedtries because the metrics aren't what I'm interested in. An insertion has to be performed once per memory instruction.
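As a rough sketch of that worker/maintainer split (the names, the ring size, and the OR-merge are my own assumptions, not anything from nedtries): each worker pushes updates into its own single-producer/single-consumer ring with release stores, and the maintainer drains them with acquire loads and folds identities into the coherent structure, here just a `std::map` standing in for the trie. Because OR-merging labels is associative and commutative, the drain order across rings cannot affect the final shadow state:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>

// One buffered shadow update from a worker thread.
struct Update {
    uint32_t addr;
    uint32_t identity;
};

// Minimal single-producer/single-consumer ring, one per worker thread.
// Release/acquire pairs on head_/tail_ publish slot contents without locks.
template <std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");

public:
    // Worker side: returns false when the ring is full.
    bool push(Update u) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;
        buf_[t & (N - 1)] = u;
        tail_.store(t + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Maintainer side: returns nothing when the ring is empty.
    std::optional<Update> pop() {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;
        Update u = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);  // free the slot
        return u;
    }

private:
    std::array<Update, N> buf_{};
    std::atomic<std::size_t> head_{0};  // advanced by the maintainer
    std::atomic<std::size_t> tail_{0};  // advanced by the worker
};

// Maintainer-side fold: OR over labels is associative and commutative,
// so updates may be applied in any interleaving.
inline void drain(SpscRing<64>& ring, std::map<uint32_t, uint32_t>& shadow) {
    while (auto u = ring.pop()) shadow[u->addr] |= u->identity;
}
```

The maintainer would loop over every worker's ring on each "cache flush"; one ring is shown here to keep the sketch single-threaded and testable.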
I have read and heard that for most software, memory operations statistically occur about 33% of the time. So I have to focus on the number of instructions per insertion. Update is really just insertion; nothing is ever deleted, just updated.

@Bitwise tries are hugely superior to red black trees for close and exact find lookups

"Bitwise" - I don't want to fall back to using a bitvector. Those have already been applied in academia, and are publicly available. That's where I think everyone falls short; you might get ultimate speed there, but with good concurrent design and use of hardware primitives you can vastly reduce *perceived latency*. Also, identity diversity is a moot topic with bitvector shadow memory.

@I think you may need to scale your estimates...

Well, *immediate* tree rebalancing shouldn't ever occur unless the worst case occurs, which is that two target application threads contend for memory at nearly the same time. Also, all trees operate on colocal data, regionally colocated within roughly 255 bytes to be exact (on 32-bit hardware). In fact, most tree insertion operations can be made very highly lazy: don't really update any of the other nodes of the tree, just continuously insert, even if numerous conflicting addresses aren't immediately resolved. I think the identity storage scheme can be proven associative and commutative, greatly facilitating both concurrency and overall correctness. In this way, rebalancing becomes a rather infrequent operation.

On Thu, Mar 5, 2015 at 7:30 PM, Niall Douglas <s_sourceforge@nedprod.com> wrote:
On 5 Mar 2015 at 13:49, Kenneth Adam Miller wrote:
Well, my requirements are that I store range-offset encodings and be able to access them, for any reason, as fast as possible. What I was thinking of doing was storing a range-value encoding as a tree; it could be any kind of self-balancing tree. The thing with red-black trees is, I know how to make them wait-free by reading and implementing a whitepaper that UT Dallas published.
The rebalancing generates a lot of cache line invalidation traffic, and ruins concurrent modify performance. Academic theory often misses the reality.
Does your nedtrie cover the ability to store range-value encodings? My thinking is that with a trie you can index the first 3 bytes of an integer without giving up too much space (it requires only slightly less than 32 megabytes, contiguously stored). Each entry in those 32 megabytes can be a pointer to a red-black tree node, and when you do an insertion you fall back to the wait-free red-black tree implementation. So you can still do range-value encodings, but it will also be far, far faster than if you didn't have this caching to accelerate it.
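For what it's worth, that two-level layout might look something like the sketch below, with `std::map` standing in for the wait-free red-black tree (all names are illustrative; a real 32-bit build would size the slot table to hit the ~32 MB budget mentioned above, whereas 64-bit pointers cost more per slot):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Range {
    uint32_t len;       // length of the range in bytes
    uint32_t identity;  // taint label for the range
};

// Two-level sketch: a flat table indexed by the top three bytes of the
// address, each occupied slot owning an ordered map of base -> Range.
// std::map stands in for the wait-free red-black tree implementation.
class PrefixedRangeStore {
public:
    explicit PrefixedRangeStore(unsigned prefix_bits = 24)
        : shift_(32 - prefix_bits), slots_(1u << prefix_bits) {}

    void insert(uint32_t addr, uint32_t len, uint32_t id) {
        auto& slot = slots_[addr >> shift_];
        if (!slot) slot = std::make_unique<std::map<uint32_t, Range>>();
        (*slot)[addr] = Range{len, id};  // tree insert on the slow path
    }

    const Range* find(uint32_t addr) const {
        const auto& slot = slots_[addr >> shift_];
        if (!slot) return nullptr;
        auto it = slot->find(addr);
        return it == slot->end() ? nullptr : &it->second;
    }

private:
    unsigned shift_;
    // One pointer per three-byte prefix; trees are allocated lazily.
    std::vector<std::unique_ptr<std::map<uint32_t, Range>>> slots_;
};
```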
nedtries provide a size_t key index algorithms. You can build the rest from that.
Bitwise tries are hugely superior to red black trees for close and exact find lookups, but inferior for closest find lookups. You can see some graphs at http://www.nedprod.com/programs/portable/nedtries/.
I estimate that it requires at most about 11 instructions in the absolute worst case to do an insertion, typically 7 instructions on the usual "cache miss", and an amortized 2 instructions for contiguous memory insertions. It is very highly scalable, and can be managed flexibly with various concurrency runtime patterns.
I think you may need to scale your estimates by about 10x. A cache line miss costs about 250 CPU cycles alone, and any kind of tree algorithm spends most of its time waiting on cache lines.
This is why bitwise tries are much faster than red black trees on out of order CPUs: they execute out of order much better. On an in order core like Intel Atom bitwise tries are extremely slow, and they're not great on even a Cortex A15 ARM either.
nedtries cost about 100 CPU cycles per insert/find/remove. That is exceptionally well balanced; most algorithms are much slower at, say, inserts than finds (e.g. hash tables). If you do equal amounts of finding, inserting and removing, and your item count is less than 10,000, bitwise tries are unbeatable.
Niall
-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost