
The task of evaluating the beta function for each value in a container is of course parallelizable, but doing so requires parallelized STL algorithms and has nothing to do with the Boost.Math library itself.
Doing vectorized math fast often requires that the per-value calculation be explicitly coded to help the compiler with software pipelining, or even manually pipelined. That issue is indeed orthogonal to multithreading, which requires only reentrancy on the part of the function implementations. On processors with deep pipelines, multiple execution units, and plentiful registers, the difference between a monolithic loop body and one that can be software pipelined can easily be 2:1.

I'm not contradicting the point quoted, just saying that if you want the container-oriented interfaces to the math functions to be fast, they will require cooperation from the function implementations on most compilers. Parallelizing the calls to black-box functions will exploit multiple cores at low cost, but you'll be leaving a lot of performance on the table.