Re: [boost] [SORT] Parallel Algorithms
Hi,

In the file at this link, in my Dropbox space, you have the code, the test programs, and the benchmarks with GCC: https://dl.dropboxusercontent.com/u/8437476/works/sort/Sort_V0.1.7z

*FILE DESCRIPTION*

In the root folder of the archive there is an INSTRUCTIONS.txt file detailing the content of all the folders, information on how to compile the programs, where each file is, and which options and libraries you must link.

*Test* contains the test programs for each class and algorithm.

*Src* contains all the source code. In the folder *src/boost/sort* there is a file *algorithm.hpp*, which is the only include file needed to invoke the algorithms. The parallel algorithms have a parameter of type NThread which indicates the number of threads to use; by default they use the same number as the HW threads of the machine. It's a very simple structure defined in boost/sort/util/atomic.hpp.

*Benchmarks* contains the benchmark code for the GCC compiler under Linux x64. In the future I will add benchmarks with other compilers and operating systems. The folders *cpp-TimSort-master* and *parallel_stable_sort* have the C++ code of the TimSort algorithm, found at https://travis-ci.org/gfx/cpp-TimSort, and the code of the sample sort with TBB. The code had to be modified, because the four versions use the same namespace and many common names, and it was not possible to invoke the four versions in the same program. The modification puts each version in a different namespace.

In the folder *GCC* you can find three folders:
- *algorithm*: has the code of the benchmark programs.
- *bin*: has the compiled, statically linked versions of the benchmark programs. In this folder there is a shell script run.sh which runs the programs sequentially and collects the results in .txt files with the same name as the program.
- *Results*: in this folder you can find the results generated by the benchmark programs on different machines.
Each machine has a folder where you can find the text files with the results (same name as the programs) and a description.txt file describing the HW where the benchmarks ran. The format used in this folder is the one suggested for presenting the results of all the machines used in the test. Here you have the results from my two computers, which can give a first idea of the results to expect on other machines. The benchmarks, as in the old version, have a version with 64-bit numbers and a version with objects of several sizes.

About the memory used by the algorithms: if you consider M the memory used by the data, the total memory used by the algorithms is:
- GCC sort → M
- boost sort → M
- GCC parallel_sort → M
- boost parallel_sort → M
- TBB parallel_sort → M
- GCC stable_sort → 2 M
- boost stable_sort → 1.5 M
- TimSort → 2 M
- GCC parallel_stable_sort → 2.5 M
- TBB parallel_stable_sort → 2 M
- boost sample_sort → 2 M
- boost parallel_stable_sort → 1.5 M

The algorithms use the indirect version when the size of the objects is greater than 128 bytes.

I am speaking with several friends about running the benchmarks on more powerful machines. I think the first thing is to check the speed. If the speed is good, we have something useful; if not, we must decide what to do with this, whether something is useful or we must leave it all. If we decide to continue, the next steps are:
- Reorganize the benchmarks, and include benchmarks with string objects. Your idea about the use of files is, I think, the correct way to follow.
- Finish all the documentation of the code in order to generate Doxygen documentation.
- Generate the documentation web pages for the Boost project.
- Modify the test programs to work with the bjam files.

If you find any problem, error, or doubt, please contact me. Francisco
Francisco,
Thanks for your detailed explanation.
I'll repeat what I've said before about your object benchmark:
I suggest changing the struct you use for benchmarking to this:
template
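(The template code was truncated in the archive. Based on the suggestion described in this email, a hedged sketch of the kind of struct being proposed might look like this: implicit default/copy operations, fully randomized contents, and comparison over the whole array. The name and member names are illustrative, not from the original message.)

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch (the actual template was truncated in the archive):
// use the implicit default/copy operations, randomize every element in the
// benchmark, and compare the full contents so that ties on the first
// integer fall through to the rest of the array.
template <int Size>
struct benchmark_object {
    uint64_t data[Size];

    bool operator<(const benchmark_object &rhs) const {
        return std::lexicographical_compare(data, data + Size,
                                            rhs.data, rhs.data + Size);
    }
    // Construction, copy construction, and operator= are left implicit,
    // so the compiler cannot skip work the way it can with hand-written
    // loops that copy only part of the object.
};
```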
- GCC sort → M
- boost sort → M
I don't think we'll ship this; this looks more like a proof-of-concept, unless the indirect sorting benefits are demonstrable after fixing the object you use in testing as described above.
- GCC parallel_sort → M
- boost parallel_sort → M
I see no advantage to this sort relative to tbb, and its performance on already-sorted data needs to improve to be competitive with tbb.
- TBB parallel_sort → M
- GCC stable_sort → 2 M
- boost stable_sort → 1.5 M
The combination of improved speed and lower memory usage looks impressive. I'll try to verify this, but this looks promising.
- TimSort → 2 M
- GCC parallel_stable_sort → 2.5 M
- TBB parallel_stable_sort → 2 M
- boost sample_sort → 2 M
- boost parallel_stable_sort → 1.5 M
Both of these algorithms look good. I'm inclined towards parallel_stable_sort because memory is becoming more of a bottleneck on modern systems, but I'll have to do some testing. Timsort seems fairly impressive, but specialized and already available for free, hence not necessary to include in the library.
If you find any problem, error, or doubt, please contact me.
Please fix your object benchmarks as described at the beginning of my email; otherwise I don't trust them. I also would like to see string sorting speeds. Once that is all together, I'd like to test and review them carefully and look to see if there are any weaknesses missed by your current tests. If they look good, and we can agree on which algorithms you want to add to the library, we should move on to cleaning this up for review. Steve
On March 24, 2015 10:32:35 PM EDT, Steven Ross
TimSort -> 2M
[snip]
Timsort seems fairly impressive, but specialized and already available for free, hence not necessary to include in the library.
While "specialized" may be sufficient grounds to avoid including Timsort, its availability elsewhere for free is not a good justification for excluding it from Boost. Providing a single source of good, varied algorithms, with equivalent documentation including comparative information to help users choose among the options would seem more than ample reason to include an algorithm. I realize that the more algorithms you include, the greater your documentation and maintenance burden, so you may still choose to reject Timsort on such grounds. I just didn't want you to dismiss it too quickly. ___ Rob (Sent from my portable computation engine)
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Rob Stewart Sent: 25 March 2015 08:43 To: boost@lists.boost.org Subject: Re: [boost] [SORT] Parallel Algorithms
On March 24, 2015 10:32:35 PM EDT, Steven Ross
wrote: TimSort -> 2M
[snip]
Timsort seems fairly impressive, but specialized and already available for free, hence not necessary to include in the library.
While "specialized" may be sufficient grounds to avoid including Timsort, its availability elsewhere for free is not a good justification for excluding it from Boost.
Providing a single source of good, varied algorithms, with equivalent documentation including comparative information to help users choose among the options would seem more than ample reason to include an algorithm.
+1 Paul --- Paul A. Bristow Prizet Farmhouse Kendal UK LA8 8AB +44 (0) 1539 561830
On Wed, Mar 25, 2015 at 4:43 AM, Rob Stewart
On March 24, 2015 10:32:35 PM EDT, Steven Ross
wrote: TimSort -> 2M
[snip]
Timsort seems fairly impressive, but specialized and already available for free, hence not necessary to include in the library.
While "specialized" may be sufficient grounds to avoid including Timsort, its availability elsewhere for free is not a good justification for excluding it from Boost.
Providing a single source of good, varied algorithms, with equivalent documentation including comparative information to help users choose among the options would seem more than ample reason to include an algorithm.
I realize that the more algorithms you include, the greater your documentation and maintenance burden, so you may still choose to reject Timsort on such grounds. I just didn't want you to dismiss it too quickly.
Sure, if a Timsort library author wishes to volunteer to include it in boost, I'll add it.
Hi Steven, thanks for your fast response. I want to clarify several questions from your message. Please excuse my English; it is not my mother tongue, and I make errors and imprecisions.

TEST OBJECTS

The object used in the test (Reburcio is a word from an exercise in my first Pascal book, and doesn't mean anything) copies all the elements with a loop in the constructor, copy constructor, and operator=. The only operation that uses only the first element is the comparison. I have written a new comparison which compares the sum of all the elements of the object. The times are similar: add about 5% for the small objects up to 15% for the big objects.

TIMSORT

The TimSort code is not mine. TimSort is used in Java and Android, and is especially efficient with big objects (you can see this in benchmark_single_stable_sort_objects.txt). I use it for speed comparison. I didn't explore TimSort much, because I saw that the merge-based methods were faster, and I discarded the algorithm. If you think it deserves a second opportunity, we can look for new implementations and check the speed in order to take a decision. I didn't measure the auxiliary memory used by TimSort. Wikipedia talks about auxiliary memory of O(N), but perhaps saying it's equal to the memory used by the data is not correct.

In the next benchmark programs, as Steven says, the data are loaded from a file, and the program executes only one algorithm. With this configuration, checking the memory used is simple. Now it is very difficult, because the program executes all the algorithms, and it's not possible to know how much memory each algorithm uses.

DESIGN PHASE

Now we are in the design phase. Nothing is fixed. We can comment on and change anything. The specification is simple: provide sort algorithms, stable and not stable, able to run with one thread or with any number of threads, without any other external library (such as OpenMP, TBB, Cilk Plus...), only with the C++ standard. If you want to discuss this, we can.

We welcome ideas, suggestions, and opinions. Any help is appreciated. The final decision about whether the code is included in the Boost Sort library depends on Steven, the creator of the library. After the design phase, with the decisions taken and the interfaces fixed, we will prepare the final version, redesign the test system, test in depth, and create the documentation.

ABOUT MYSELF

This software comes from another project for the Boost library (Contertree, concurrent red-black trees with access by position, like vectors), pending final revision since nearly two years ago. When I began, the implementation of the parallel non-stable sort had similar performance to the GCC version, based on OpenMP, and to the TBB version. The stable sort with one thread was slower than the GCC version, and the parallel stable sort was faster than the GCC version based on OpenMP. Then someone mentioned the TBB sample sort implementation, which, even though it is experimental, doesn't destroy the objects in the auxiliary memory, and is not exception safe. That began a crisis: that version was 20-30% faster than ours. I want to show my admiration for TBB and their team; I think they are the best at doing concurrent software. I tried to improve. I would like to have more time, more equipment, and better documentation in order to find the best options, but I don't, and I do this in my free time after my work and my family. I designed six stable sort algorithms in order to select the fastest (internally named smart_merge_sort, about 10% faster than the GCC version). I prepared a version of sample sort, only with threads and atomic variables. In the benchmarks it has similar performance to the TBB version, uses half the auxiliary memory used by TBB, and is exception safe. In my opinion, the parallel stable sort is reasonably good enough to be included in the final version. The important case, which shows the quality of the algorithm, is when the data are unsorted.

I will try to implement the detection of sorted elements, as in TBB, but I am not sure about the results. The internal loops of the algorithm are delicate, and a small change can produce big changes in the performance, as happened in other algorithms. We are in the design phase. We must test on other machines, big and small. Perhaps we must change or redesign things. But we need help. If someone tests the programs, please send me the result file and the machine description (processor, number of cores, speed, memory, cache) in order to obtain conclusions, fix the code, and pass to the next phase. Thanks in advance, Francisco
Hi Francisco,
On Wed, Mar 25, 2015 at 3:57 PM Francisco José Tapia
TEST OBJECTS
The object used in the test (Reburcio is a word from an exercise in my first Pascal book, and doesn't mean anything) copies all the elements with a loop in the constructor, copy constructor, and operator=. The only operation that uses only the first element is the comparison.
I have written a new comparison which compares the sum of all the elements of the object. The times are similar: add about 5% for the small objects up to 15% for the big objects.
I'm not precisely sure of your meaning, but I suggested modification of the test object to use default operators where possible, to randomize the entire contents of the object (so that not all integers in the array are the same), and to use the entire contents of the object for comparison. I believe that these suggested changes make it more reflective of the conventional large-object comparison case. Most of the time, only the first integer will be compared, but in equal cases the entire array will be compared. I don't like feeding overly simplified objects into benchmarks, because compilers can make some surprising optimizations.
TIMSORT
The Tim Sort code is not mine.
Agreed. I think we can drop it from this comparison, as your algorithm is clearly much faster. I mentioned it because it is used with other languages, and responded when somebody else on the email group suggested I allow it. I'm not asking you for an implementation. Timsort uses N/2 additional memory.
DESIGN PHASE
Now we are in the design phase. Nothing is fixed. We can comment on and change anything.
The specification is simple: provide sort algorithms, stable and not stable, able to run with one thread or with any number of threads, without any other external library (such as OpenMP, TBB, Cilk Plus...), only with the C++ standard.
If you want to discuss this, we can. We welcome ideas, suggestions, and opinions. Any help is appreciated. The final decision about whether the code is included in the Boost Sort library depends on Steven, the creator of the library.
After the design phase, with the decisions taken and the interfaces fixed, we will prepare the final version, redesign the test system, test in depth, and create the documentation.
I suggest deciding what you want to try to include. Getting a parallel unstable sort competitive with tbb will be hard, and I'm skeptical of including a library that just replicates what tbb provides for easy use.
I tried to improve. I would like to have more time, more equipment, and better documentation in order to find the best options, but I don't, and I do this in my free time after my work and my family.
I'm in the same situation with regards to available time. I think many boost participants are.
I designed six stable sort algorithms in order to select the fastest (internally named smart_merge_sort, about 10% faster than the GCC version). I prepared a version of sample sort, only with threads and atomic variables. In the benchmarks it has similar performance to the TBB version, uses half the auxiliary memory used by TBB, and is exception safe.
I like your parallel stable sorts. They look like the most promising part of your proposed library additions. Indirect sorting is also interesting. I just want reliable large-object and string comparison numbers for them before diving in further. Have you looked at integrating your testing with the approach in the Boost Sort library? The randomgen binary generates files with different types of random number distributions that can test a variety of scenarios, and the tune.pl script controls automated performance testing.
In my opinion, the parallel stable sort is reasonably good enough to be included in the final version. The important case, which shows the quality of the algorithm, is when the data are unsorted. I will try to implement the detection of sorted elements, as in TBB, but I am not sure about the results. The internal loops of the algorithm are delicate, and a small change can produce big changes in the performance, as happened in other algorithms.
Is there a way you can do a fast initial pass, checking for already-sorted data? That's what Spreadsort does (at each level of the sort).
Hi Steven Thanks by your message. Until now , I had been focused mainly in the algorithms, and I didn't dedicate many time to others things as the test format and the integration with boost::sort.Now, the first things to do are : - Try to detect when the data are ordered, - Revise in deep your library and code , in order to adapt the test and benchmarks to the procedures used in your library About the object used in the benchmarks, we can do something simple. Reburcio is a very small class in the file Benchmarks/GCC/algorithm/reburcio.hpp. Please rewrite and send me, and only need to recompile the benchmarks. *About the parts to include* With the parallel unstable sort, all the algorithms I examined have the same approach, and due this, similar performances. The range of decision is small. The goal is provide algorithms independent of any library or Operating System, fast , robust and easy to use. The idea of to do the same than a company is the origin of many Free SW, think about Oracle and MySQL, Internet Explorer and Firefox. Even the C++ standard, is to do the same than many companies are doing since many years ago, each with its own way. The final users are grateful because simplify our work. TBB is available in Windows, Linux and Os X. With others operating system ( by example Android and all the real time OS), you must recompile all the source code, and I have not clear about the dynamic linking of TBB in these operating systems. Many small machines don't have task scheduler, but have threads. To force to recompile TBB for to use a parallel sort is like force to rent a bus, for to transport only one person. Our code have similar performance, small code, independent of any Operating System, of any other code or library. If you are C++11 compliant you have parallel sort. I think, we must include sort, parallel sort, stable sort and parallel stable sort. 
Perhaps, too, sample sort, but it's less important Now I am beginning to examine your code for to integrate the new code with the boost sort approach. Yours Francisco
On Sun, Mar 29, 2015 at 6:25 AM Francisco José Tapia
Hi Steven
Thanks for your message.
Until now I had been focused mainly on the algorithms, and I didn't dedicate much time to other things such as the test format and the integration with boost::sort. Now, the first things to do are:
- Try to detect when the data are ordered.
Just verify that each successive element is not less than the previous. If inputs are unique (or completely identical if they compare the same), then you can compare results with those of std::sort.
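In C++11 the check Steven describes is a one-pass walk over the range, which is exactly what std::is_sorted does. A minimal sketch (the wrapper name is illustrative):

```cpp
#include <algorithm>

// Fast initial pass: returns true when there is nothing to do, because
// each successive element is not less than the previous one.
template <class Iter, class Compare>
bool already_sorted(Iter first, Iter last, Compare comp) {
    return std::is_sorted(first, last, comp);
}
```

A parallel sort can run this check first and return immediately on sorted input, which is how it can stay competitive on the already-sorted case mentioned earlier in the thread.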
- Revise your library and code in depth, in order to adapt the tests and benchmarks to the procedures used in your library.
About the object used in the benchmarks, we can do something simple. Reburcio is a very small class in the file Benchmarks/GCC/algorithm/reburcio.hpp. Please rewrite it and send it to me; I only need to recompile the benchmarks.
I suggest renaming the variables to be clearer, but here's the idea:
template
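(The template code was again cut off by the archive. Since Francisco later adopts the name intarray for this class, the idea presumably looked something like the following sketch; the member name is illustrative, not from the original message.)

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the renamed test object ("intarray" is the name adopted later
// in the thread; the member name is illustrative). All NInts integers are
// randomized by the benchmark, and all of them take part in the
// comparison, so equal-prefix ties are resolved by the rest of the array.
template <unsigned NInts>
struct intarray {
    uint64_t ints[NInts];

    bool operator<(const intarray &rhs) const {
        return std::lexicographical_compare(ints, ints + NInts,
                                            rhs.ints, rhs.ints + NInts);
    }
};
```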
*About the parts to include*
With the parallel unstable sort, all the algorithms I examined have the same approach, and because of this, similar performance. The range of decision is small.
The goal is to provide algorithms independent of any library or operating system: fast, robust, and easy to use. The idea of doing the same thing as a company is the origin of much free software; think about Oracle and MySQL, Internet Explorer and Firefox. Even the C++ standard does the same as what many companies have been doing for many years, each in its own way. The final users are grateful, because it simplifies our work.
TBB is available on Windows, Linux, and OS X. With other operating systems (for example Android and all the real-time OSes), you must recompile all the source code, and I am not clear about the dynamic linking of TBB on these operating systems. Many small machines don't have a task scheduler, but do have threads. To force recompiling TBB in order to use a parallel sort is like renting a bus to transport only one person.
Who wants to do a parallel sort on Android? The OS often only gives you one core (depending on priorities), and it would burn the battery at ridiculous speed. I just don't see the use case, unless you can provide some advantage in some fairly common use case, and it would have to fully match tbb's features to within a few percent, including efficiency on already-sorted input.
Our code has similar performance, is small, and is independent of any operating system and of any other code or library. If you are C++11 compliant, you have parallel sort. I think we must include sort, parallel sort, stable sort, and parallel stable sort. Perhaps also sample sort, but it's less important.
Your stable sort and parallel stable sort look promising, and sample sort is a possibility. I recommend concentrating on them for now.
On Monday, April 06, 2015 11:13 AM, Steven Ross wrote:
Who wants to do a parallel sort on Android? The OS often only gives you one core (depending on priorities), and it would burn the battery at ridiculous speed.
I was under the impression that it's better on battery life to use all the cores at maximum and then sleep as quickly as possible. Clock gating of the components is getting better, but for example (and I'm completely making this up now), if two CPUs share a cache, then the cache is alive anyway so you may as well use the other CPU. I've heard the term "race to sleep" to describe this. I think it's hard to guess at; mobile CPUs now have 8 cores, and they're not even homogeneous, some are in order execution and some out of order. Ben
It's not just CPUs. Most of the physics engines used in games for phones are CPU based. The current multi-threaded CPU architecture brings gaming to a whole new level. I see some physics done with sort, map, gather, scatter, and so on. PS: I proposed this as a part of Boost.Compute this GSoC, giving GPU support for iOS devices using the Metal shading language. Regards, Aditya Atluri.
On Apr 6, 2015, at 1:37 AM, Ben Pope
wrote: On Monday, April 06, 2015 11:13 AM, Steven Ross wrote: Who wants to do a parallel sort on Android? The OS often only gives you one core (depending on priorities), and it would burn the battery at ridiculous speed.
I was under the impression that it's better on battery life to use all the cores at maximum and then sleep as quickly as possible. Clock gating of the components is getting better, but for example (and I'm completely making this up now), if two CPUs share a cache, then the cache is alive anyway so you may as well use the other CPU. I've heard the term "race to sleep" to describe this.
I think it's hard to guess at; mobile CPUs now have 8 cores, and they're not even homogeneous, some are in order execution and some out of order.
Ben
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
On Mon, Apr 6, 2015 at 1:55 AM, Aditya Atluri
It's not just CPUs. Most of the physics engines used in games for phones are CPU based. The current multi-threaded CPU architecture brings gaming to a whole new level. I see some physics done with sort, map, gather, scatter, and so on.
PS: I proposed this as a part of Boost.Compute this GSoC, giving GPU support for iOS devices using the Metal shading language.
If you can find somebody whose frame rate on mobile is significantly impacted by CPU sorting speed, I'd love to chat with them.
On Apr 6, 2015, at 1:37 AM, Ben Pope
wrote: On Monday, April 06, 2015 11:13 AM, Steven Ross wrote: Who wants to do a parallel sort on Android? The OS often only gives you one core (depending on priorities), and it would burn the battery at ridiculous speed.
I was under the impression that it's better on battery life to use all the cores at maximum and then sleep as quickly as possible. Clock gating of the components is getting better, but for example (and I'm completely making this up now), if two CPUs share a cache, then the cache is alive anyway so you may as well use the other CPU. I've heard the term "race to sleep" to describe this.
I think it's hard to guess at; mobile CPUs now have 8 cores, and they're not even homogeneous, some are in order execution and some out of order.
The term race-to-sleep applies to faster but higher-power processors sometimes being more power efficient because the system can go to sleep faster. That said, as modern mobile systems normally operate with all but one core asleep, they are optimized for the single-core use case, and very little power is wasted on the unused CPUs. When you add in that parallel sorting has 75% of the CPU efficiency, it just doesn't make sense from a power perspective especially if some other task can be computed in parallel with the sort, and as there is a single-threaded option (spreadsort) that can sort 1.5-2X faster on just one core.
On Mon, Apr 6, 2015 at 12:53 PM, Steven Ross
The term race-to-sleep applies to faster but higher-power processors sometimes being more power efficient because the system can go to sleep faster. That said, as modern mobile systems normally operate with all but one core asleep,
Are you sure about that?
they are optimized for the single-core use case, and very little power is wasted on the unused CPUs.
-- Olaf
Regards, Aditya Atluri.
On Apr 6, 2015, at 6:53 AM, Steven Ross
wrote: On Mon, Apr 6, 2015 at 1:55 AM, Aditya Atluri
wrote: It's not just CPUs. Most of the physics engines used in games for phones are CPU based. The current multi-threaded CPU architecture brings gaming to a whole new level. I see some physics done with sort, map, gather, scatter, and so on. PS: I proposed this as a part of Boost.Compute this GSoC, giving GPU support for iOS devices using the Metal shading language.
If you can find somebody whose frame rate on mobile is significantly impacted by CPU sorting speed, I'd love to chat with them.
Unreal Engine has gone open source; check it out. OpenSubdiv is Pixar's surface subdivision software, widely used in movie production and games (COD Ghosts). There are backends for Android and iOS. Last time I checked it had an ARM NEON based backend (I need to recheck). It is a common practice in the gaming industry to use the CPU for physics (from before CUDA and OpenCL came along, but it is still in practice). I don't know anyone personally, but you can contact the developers through the GitHub repositories.
On Apr 6, 2015, at 1:37 AM, Ben Pope
wrote: On Monday, April 06, 2015 11:13 AM, Steven Ross wrote: Who wants to do a parallel sort on Android? The OS often only gives you one core (depending on priorities), and it would burn the battery at ridiculous speed.
I was under the impression that it's better on battery life to use all the cores at maximum and then sleep as quickly as possible. Clock gating of the components is getting better, but for example (and I'm completely making this up now), if two CPUs share a cache, then the cache is alive anyway so you may as well use the other CPU. I've heard the term "race to sleep" to describe this.
I think it's hard to guess at; mobile CPUs now have 8 cores, and they're not even homogeneous, some are in order execution and some out of order.
The term race-to-sleep applies to faster but higher-power processors sometimes being more power efficient because the system can go to sleep faster. That said, as modern mobile systems normally operate with all but one core asleep, they are optimized for the single-core use case, and very little power is wasted on the unused CPUs. When you add in that parallel sorting has 75% of the CPU efficiency, it just doesn't make sense from a power perspective especially if some other task can be computed in parallel with the sort, and as there is a single-threaded option (spreadsort) that can sort 1.5-2X faster on just one core.
Regards, Aditya Atluri.
On Apr 6, 2015, at 6:53 AM, Steven Ross
wrote: On Mon, Apr 6, 2015 at 1:55 AM, Aditya Atluri
wrote: It's not just CPUs. Most of the physics engines used in games for phones are CPU based. The current multi-threaded CPU architecture brings gaming to a whole new level. I see some physics done with sort, map, gather, scatter, and so on. PS: I proposed this as a part of Boost.Compute this GSoC, giving GPU support for iOS devices using the Metal shading language.
If you can find somebody whose frame rate on mobile is significantly impacted by CPU sorting speed, I'd love to chat with them
FYI, most rendering applications have asynchronous ability. For example, after issuing a draw call, the GPU takes time to render the scene; meanwhile, the CPU does all the physics calculation. Doing physics on the GPU causes an increase in frame rendering time. This is one of the reasons why mobile devices are still using CPU based physics and AI systems. The design call of using CPU or GPU is made by the developer by testing whether the frame rate is reasonable: if the frame rates are high, he may choose to go for GPU based physics or AI; otherwise he sticks to the CPU.
On Apr 6, 2015, at 1:37 AM, Ben Pope
wrote: On Monday, April 06, 2015 11:13 AM, Steven Ross wrote: Who wants to do a parallel sort on Android? The OS often only gives you one core (depending on priorities), and it would burn the battery at ridiculous speed.
I was under the impression that it's better on battery life to use all the cores at maximum and then sleep as quickly as possible. Clock gating of the components is getting better, but for example (and I'm completely making this up now), if two CPUs share a cache, then the cache is alive anyway so you may as well use the other CPU. I've heard the term "race to sleep" to describe this.
I think it's hard to guess at; mobile CPUs now have 8 cores, and they're not even homogeneous, some are in order execution and some out of order.
The term race-to-sleep applies to faster but higher-power processors sometimes being more power efficient because the system can go to sleep faster. That said, as modern mobile systems normally operate with all but one core asleep, they are optimized for the single-core use case, and very little power is wasted on the unused CPUs. When you add in that parallel sorting has 75% of the CPU efficiency, it just doesn't make sense from a power perspective especially if some other task can be computed in parallel with the sort, and as there is a single-threaded option (spreadsort) that can sort 1.5-2X faster on just one core.
Hi,

About changing the name and using intarray: it's OK, I am changing it now. I was looking at the code of randomgen and didn't understand how it works; with the explanation about the operator >>, now I understand, and I will use it in the test implementation.

About Android, I recommend you look at the Shield consoles from Nvidia. The SW inside these consoles is comparable with any Linux workstation. But perhaps we don't have the correct perspective on this. We don't pretend to tell programmers what they must use and what not; we aim to provide tools that simplify their work, and using them or not is their decision. In the current implementations of parallel sort algorithms in TBB or with OpenMP, the way to select only a part of the HW threads is not clear. Recently Ezchip announced a processor with 100 ARM cores. Our implementation has a parameter to select the number of HW threads to use; by default it uses all of them. But perhaps you want to use only 40 HW threads, or a percentage of the HW threads of the machine where the program is running.

In the future we must involve the GPU, but, in my opinion, first we must obtain fast and robust parallel algorithms on the CPU.

Francisco
On Mon, Apr 6, 2015 at 10:12 AM Aditya Atluri
Regards, Aditya Atluri.
On Apr 6, 2015, at 6:53 AM, Steven Ross
wrote: On Mon, Apr 6, 2015 at 1:55 AM, Aditya Atluri
wrote: It's not just CPUs. Most of the physics engines used in games for phones are CPU based. The current multi-threaded CPU architecture brings gaming to a whole new level. I see some physics done with sort, map, gather, scatter, and so on. PS: I proposed this as a part of Boost.Compute this GSoC, giving GPU support for iOS devices using the Metal shading language.
If you can find somebody whose frame rate on mobile is significantly impacted by CPU sorting speed, I'd love to chat with them
FYI, most rendering applications have asynchronous ability. For example, after issuing a draw call, the GPU takes time to render the scene; meanwhile, the CPU does all the physics calculation. Doing physics on the GPU causes an increase in frame rendering time. This is one of the reasons why mobile devices are still using CPU based physics and AI systems. The design call of using CPU or GPU is made by the developer by testing whether the frame rate is reasonable: if the frame rates are high, he may choose to go for GPU based physics or AI; otherwise he sticks to the CPU.
I understand that they're doing physics calculations, but do those physics calculations contain a substantial amount of purely sequential sort time, where they can't run other computations in parallel? If so, I'd be interested in talking with them to understand what they're doing.
participants (7)
- Aditya Atluri
- Ben Pope
- Francisco José Tapia
- Olaf van der Spek
- Paul A. Bristow
- Rob Stewart
- Steven Ross