
> To actually make it part of std::sort, I'd need to have float, integer,
> and string versions, check for which applies, and if none apply, call the
> current std::sort. That will require significant work.

Yes, that is the work I had in mind. But as I said, I'm not sure it would be worthwhile, because nobody stands to benefit from a faster serial sorting algorithm for large data sets.

> 2) The only fully serial part is finding the size of the bins to create
> in the first iteration. This could be made parallel by allocating external
> memory for bins, but at the cost of locking code, or merging bins from
> each processor, in addition to the memory allocation, which probably isn't
> worth it with a small number of processors. If you did insert into bins at
> this point, the algorithm would be fully parallel, and it could skip the
> next two steps.

I disagree with you here. You could easily split the input into kernels of data, process each one as a task that computes the size of the bins required by that kernel, and then join the results.
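The split/count/join idea could look roughly like the following. This is only a minimal sketch, not Spreadsort's actual code: it assumes 32-bit unsigned keys and a fixed 512-bin first pass, and uses the top 9 bits of the key as the bin index (the real algorithm derives its bins from the data's observed range).

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <future>
#include <vector>

// Count how many elements fall into each of 512 top-level bins, in parallel.
// Each task builds a private histogram for its kernel of data (no locking),
// and the results are joined by summing the histograms.
std::array<std::size_t, 512>
parallel_bin_sizes(const std::vector<unsigned>& data, unsigned num_tasks)
{
    auto count_chunk = [&](std::size_t begin, std::size_t end) {
        std::array<std::size_t, 512> local{};  // private histogram
        for (std::size_t i = begin; i < end; ++i)
            ++local[data[i] >> 23];            // top 9 bits -> bin 0..511
        return local;
    };

    std::vector<std::future<std::array<std::size_t, 512>>> tasks;
    std::size_t chunk = (data.size() + num_tasks - 1) / num_tasks;
    for (std::size_t b = 0; b < data.size(); b += chunk)
        tasks.push_back(std::async(std::launch::async, count_chunk,
                                   b, std::min(b + chunk, data.size())));

    std::array<std::size_t, 512> total{};      // join: merge the histograms
    for (auto& t : tasks) {
        auto local = t.get();
        for (std::size_t i = 0; i < 512; ++i)
            total[i] += local[i];
    }
    return total;
}
```

Because each kernel writes only to its own histogram, there is no locking in the hot loop; the only synchronization is the final join, which touches 512 counters per task.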
> 3) Allocate bins (there are normally n/512 bins in the first iteration,
> so this is fast, even if serial)

Allocation of bins would happen as a serial stage in your pipeline, yes, but it would be very fast and would not harm performance.

> 4) In the swap loop, as soon as a bin is filled, fire it off on a
> separate thread. As you cut the data into 512-or-so pieces in the first
> iteration, each successive iteration should now be running efficiently in
> parallel.

The number of swaps required to fill the first bin is not predictable, so this step is not fully parallel.
You are thinking about threading in a thread-oriented instead of a task-oriented way. If you recursively create threads, you will oversubscribe your machine and the threads will contend for cores. 512 threads is way, way, way too many. That's not a recipe for success. Instead of creating threads, you want to create tasks and have a task scheduler that assigns them to threads. The first iteration of spreadsort can be fully parallel, because if you cut the original input into enough kernels and process them as tasks, you can keep all your cores busy until all the bins are filled, then proceed with parallel recursion into each bin.

> As to std::sort being a straw man, I'd be happy to compare against a
> faster general-case serial sorting algorithm (if said algorithm exists).
> Improvements in serial sorting algorithms should mostly apply to parallel
> algorithms, and nothing stops you from running Spreadsort in parallel.
> I'd be happy to look into developing parallel versions once integer_sort,
> float_sort, and string_sort are in Boost, and hopefully I have enough
> money to buy a test machine with enough cores to be worth testing on.
You sort of missed my point about std::sort being a straw man. It is the best general-case serial sorting algorithm, but it isn't a good benchmark to compare against in a world where parallel sorting algorithms and multi-core machines are quickly becoming ubiquitous. Nobody should bother investigating a faster serial algorithm for sorting large data sets when they would be much better served by a freely available implementation of a parallel algorithm.

The thing stopping me from running spreadsort in parallel is that you haven't released a threaded version of the algorithm with a nice, clean, generic API. I would never parallelize it myself, because if I wanted a parallel sorting algorithm I'd use one off the shelf. Also, it stands to reason that there is at least a 50/50 chance parallel spreadsort would under-perform existing parallel sorting implementations; whether it turns out to be better or worse than the best parallel implementation is what I'm interested in finding out. Being marginally better than std::sort for large n only tells me that you are at a good point to start thinking about threading the algorithm, since you have bottomed out on serial performance improvements.

Luke