
In my experience, there are very few instances where parallelism can usefully be concealed behind a library interface. OpenMP has very high overhead and will help only for very long-running functions, at least under Microsoft's compiler on x86 and x64. Algorithms that are likely to be applied to very large datasets could have the OpenMP pragmas inserted optionally, but they would need to be protected by #ifdef logic (or given distinct names), because otherwise the overhead will cripple programs that make more frequent calls on smaller datasets. In VS2005, a parallel for whose if() clause is a literal 0 still incurs all the overhead of one that is enabled, so you cannot use OpenMP's own enabling logic to select serial or parallel execution by data size without replicating the loop (see the first sketch at the end of this post).

Identifying interfaces that could usefully be altered to let applications exploit parallelism more readily may be harder, but it is a lot more likely to pay off. Consider also that in an application that is already parallelized there are no spare cores for the library to use internally. On a smaller scale, adding "vectorized" and/or "streaming producer-consumer" interfaces to selected libraries may help a lot by encouraging use of the vector instruction/execution units, enabling loop unrolling, and improving instruction and data locality (see the second sketch below).

I must also repeat the age-old, time-tested, capital-T Truth about optimization: if you do something that is not suggested by, and validated against, careful analysis of realistic use cases, you are wasting your time. I strongly advise you not to start hacking without solid data.

Gathering real-world use cases into a "library" of performance-oriented application code and datasets would, in my opinion, be a pretty good summer's work. Produce a final report that others can later mine for performance-improvement opportunities, and wrap the use-case library up as a performance regression test suite, plumbed into the Boost automated testing infrastructure.

Another good use for a library of smaller use cases is as fodder for the profile-guided optimization (PGO) offered by many modern compilers. If you still have time left over, you could add PGO support to boost.build. On some chips, notably Itanium, PGO makes a very noticeable difference. It is hard to see how that would help with header-only libraries, though, except by giving application programmers an example to follow.
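
First sketch: the #ifdef guard mentioned above. The macro name MYLIB_USE_OPENMP and the size cutoff are made up for illustration; the point is that the serial loop has to be duplicated, because (per the VS2005 behaviour described above) OpenMP's own if() clause does not avoid the setup overhead.

    // Minimal sketch of a compile-time guarded OpenMP loop. MYLIB_USE_OPENMP
    // and the 100000 cutoff are hypothetical. Build with /openmp (MSVC) or
    // -fopenmp (GCC) to enable the pragma.
    #include <vector>

    void scale(std::vector<double>& v, double factor)
    {
        const int n = static_cast<int>(v.size()); // OpenMP 2.0 wants a signed index

        // Tempting but, per the above, ineffective under VS2005: the if()
        // clause still pays the full parallel-region setup cost when false.
        //
        //   #pragma omp parallel for if(n > 100000)

    #if defined(MYLIB_USE_OPENMP)
        if (n > 100000) {                 // hypothetical size cutoff
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                v[i] *= factor;
            return;
        }
    #endif
        // Serial path: the loop body is replicated, as noted above.
        for (int i = 0; i < n; ++i)
            v[i] *= factor;
    }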
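
Second sketch: what a "vectorized" interface buys. The routine and the names horner / horner_v are invented for illustration, not taken from any real library; the batch form hands the compiler a whole array to unroll and vectorize over, and keeps the coefficients hot across all elements.

    #include <cstddef>

    // Scalar interface: each call pays its own call overhead and the compiler
    // sees only one element at a time.
    double horner(double x, const double* coeff, std::size_t degree)
    {
        double r = coeff[degree];
        for (std::size_t k = degree; k > 0; --k)
            r = r * x + coeff[k - 1];
        return r;
    }

    // Batch ("vectorized") interface: one call evaluates the polynomial over a
    // whole array. The inner loop is a natural target for unrolling and SSE
    // vectorization, and instruction/data locality improves.
    void horner_v(const double* x, double* out, std::size_t n,
                  const double* coeff, std::size_t degree)
    {
        for (std::size_t i = 0; i < n; ++i) {
            double r = coeff[degree];
            for (std::size_t k = degree; k > 0; --k)
                r = r * x[i] + coeff[k - 1];
            out[i] = r;
        }
    }

A streaming producer-consumer variant would be the same idea applied to chunks fed through a queue, so that each stage's working set stays in cache.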