Ah, ok. I apologize for posting too fast; I'll be sure to exhaust the available resources more thoroughly before posting in the future. That said, I did find some of the descriptions of the iostreams components and how they compose (especially the examples, which are what I was really looking for) a bit vague, IMHO.
No, I haven't measured performance, and I understand your concern about misplaced attention to performance. The general advice is to make it work first, then make it work faster; I agree with that strongly, but I think you might agree with me if I explain my context. If I see a spot where unnecessary reallocations and re-initializations (as when a vector grows and has to reallocate and copy its contents to keep them contiguous) are likely to add latency, I try to eliminate them while making it work the first time. I'm writing a pintool, and you kind of have to understand what that is to know where I'm coming from. Intel's Pin is a dynamic binary instrumentation framework that lets programmers register callbacks at different levels of granularity and write their own instrumentation functions. So if you wanted to gather data on the fly about every image load, instruction execution, or routine call, you could do it.
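For a feel of what that looks like, here is a minimal sketch of a pintool, along the lines of the instruction-counting example that ships with Pin (the counter and the final message are just illustrative): an instrumentation routine registered with INS_AddInstrumentFunction decides what to insert, and the analysis routine it inserts then runs for every executed instruction.

#include "pin.H"
#include <iostream>

static UINT64 insCount = 0;  // illustrative counter

// Analysis routine: executed before every instruction in the target.
VOID CountIns()
{
    insCount++;
}

// Instrumentation routine: called once per instruction the first time
// Pin encounters it; it decides which analysis calls to insert.
VOID Instruction(INS ins, VOID* v)
{
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountIns, IARG_END);
}

// Called when the target program exits.
VOID Fini(INT32 code, VOID* v)
{
    std::cerr << "Executed " << insCount << " instructions" << std::endl;
}

int main(int argc, char* argv[])
{
    if (PIN_Init(argc, argv)) return 1;

    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);

    PIN_StartProgram();  // never returns
    return 0;
}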
Anyway, the context in which I'm using this compression utility is one where speed matters a great deal. A set of dynamic analysis routines steadily generates data from the target program being analyzed, and the part I'm asking about is the handoff: once the analysis threads have generated enough data, they pass a handle to their buffer to a pool of compression threads, and each compression thread drops the data it has been handed directly into a fresh buffer (ideally) that goes through a compressor. The problem is that, because I'm instrumenting at instruction-level granularity, my analysis code has to synchronize a Lamport clock across the target application's threads (now I'm really getting out of hand with my explanation lol). So the whole program might produce a 400 MB XML file, and these buffers get flushed every time they accumulate about 30 KB. That's a lot of flushing! In the worst case, those unnecessary allocations and copies would cause a 2-3x slowdown of the entire program, because it would literally be pausing to repeat work that isn't necessary.
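To make the handoff concrete, here is roughly the shape I have in mind (a sketch only; BufferQueue, the buffer sizes, and the fake "<record/>" payload are placeholders, not my actual code): analysis threads fill a buffer to about 30 KB, move it into a queue, and compression threads pull buffers off the queue to run them through the compressor.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Hypothetical handoff queue: analysis threads push full buffers,
// compression threads pop them.
class BufferQueue
{
public:
    void push(std::string buf)
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(buf));
        }
        cv_.notify_one();
    }

    // Returns false once shutdown() has been called and the queue is drained.
    bool pop(std::string& out)
    {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || done_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }

    void shutdown()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool done_ = false;
};

int main()
{
    BufferQueue queue;

    // "Analysis" thread: accumulates ~30 KB, hands the buffer off,
    // then reuses the (now empty) string for the next batch.
    std::thread analysis([&] {
        std::string buf;
        buf.reserve(32 * 1024);
        for (int flush = 0; flush < 10; ++flush) {
            while (buf.size() < 30 * 1024)
                buf.append("<record/>");
            queue.push(std::move(buf));
            buf.clear();               // moved-from; clear and keep reusing
            buf.reserve(32 * 1024);
        }
        queue.shutdown();
    });

    // "Compression" thread: would run each buffer through the compressor.
    std::thread compression([&] {
        std::string buf;
        while (queue.pop(buf))
            std::cout << "compressing " << buf.size() << " bytes\n";
    });

    analysis.join();
    compression.join();
    return 0;
}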
From what I had understood, though, if my string was empty (zeroed out) and I appended to it, then I would end up with precisely my compressed data. The idea behind acquireStringFromPool is that it returns a large string that is always cleared out, ready for the filtering_ostream to fill up as though it were a device. Is that incorrect? How can I get that functionality?
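To show concretely what I'm picturing (a sketch only; zlib_compressor is just a stand-in for whatever filter chain I actually use, and it assumes Boost.Iostreams was built with zlib support): the pooled string is clear()'d but keeps its reserve()'d capacity, and the filtering_ostream appends the compressed bytes into it through a back_inserter device.

#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <iostream>
#include <string>

namespace io = boost::iostreams;

int main()
{
    std::string compressed;
    compressed.reserve(64 * 1024);   // capacity survives reuse across flushes
    compressed.clear();              // size 0, so appended bytes are the whole content

    {
        io::filtering_ostream out;
        out.push(io::zlib_compressor());          // example filter only
        out.push(io::back_inserter(compressed));  // appends into the string
        out << "some analysis data to compress";
    }   // destroying the stream flushes the compressor's remaining output

    std::cout << "compressed size: " << compressed.size() << '\n';
    return 0;
}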
Perhaps that reserve approach would be good, though. I apologize the email was so long lol.