Re: [Boost-users] [Thread] Beginner question regarding thread groups
Date: Mon, 25 Aug 2008 08:31:16 -0700
From: michel.lestrade@crosslight.com
To: boost-users@lists.boost.org
Subject: Re: [Boost-users] [Thread] Beginner question regarding thread groups
Hi,
I actually looked at that first as well as another Intel solution called TBB (Threading Building Blocks). It seems to me that these approaches are made for instruction-level parallelism (i.e. processing a long loop with parallel_do or different but independent sections of code). I have a ton of other Fortran code that would benefit from this but ray tracing is not one of them.
They have vectorization and parallelization, but IIRC OpenMP is quite general multi-threading. [ soliciting a sales pitch from a boost person here, LOL ]
From what I read so far, what I am looking for is task-level parallelism, not instruction-level, which is why I am looking into threads.
If you want some ideas for reference, I did a quick check on the Intel site and they have some papers and apparently even an OpenMP Fortran compiler, if that helps you [ obviously their comments are specific to their products but quite generally useful, especially if you are looking for ideas ]:
http://cache-www.intel.com/cd/00/00/21/92/219292_hyperthreading_extract.pdf
http://www.google.com/search?hl=en&q=site%3Aintel.com+openmp+performance+optimization
In our approach, light emission is isotropic (physically, it corresponds to spontaneous emission in a semiconductor) and we want to know what the extraction efficiency is (how much light actually gets out) and what the output distribution is going to look like. This is affected by die shaping, external packaging, refractive index of cladding layers, position and reflectivity of metal contacts, TE/TM emission, etc. So the difference between nearby rays is critical, unlike a visualization/CG render situation.
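For a rough sense of scale on the extraction-efficiency question, here is a toy C++ sketch. It is not the Fortran code discussed in this thread. It assumes a single flat interface, no Fresnel losses, no reflections back into the die, and a hypothetical inside index of 3.5 against air. Under those assumptions the escaping fraction of isotropically emitted rays is the solid-angle fraction of one escape cone, (1 - cos θc)/2 with sin θc = n_outside/n_inside, and a fixed-seed Monte Carlo sample should land close to that value. All function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>

// Analytic fraction of isotropically emitted rays that fall inside ONE escape
// cone of a flat interface (assumption: no Fresnel loss, no second chances).
inline double escape_cone_fraction(double n_inside, double n_outside) {
    assert(n_inside > n_outside && n_outside > 0.0);
    const double sin_c = n_outside / n_inside;           // sine of critical angle
    const double cos_c = std::sqrt(1.0 - sin_c * sin_c); // cosine of critical angle
    return 0.5 * (1.0 - cos_c);
}

// Monte Carlo estimate of the same quantity. For directions uniform on the
// sphere, cos(theta) measured from the surface normal is uniform in [-1, 1],
// so the azimuth never needs to be sampled for a flat surface.
inline double sampled_escape_fraction(double n_inside, double n_outside,
                                      std::size_t ray_count, unsigned seed) {
    assert(ray_count > 0);
    const double sin_c = n_outside / n_inside;
    const double cos_c = std::sqrt(1.0 - sin_c * sin_c);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> cos_theta(-1.0, 1.0);
    std::size_t escaped = 0;
    for (std::size_t i = 0; i < ray_count; ++i) {
        if (cos_theta(gen) > cos_c) ++escaped;  // ray lies inside the cone
    }
    return static_cast<double>(escaped) / static_cast<double>(ray_count);
}
```

With these numbers only about 2% of the rays get out on the first pass, which is why the die shaping, packaging and contact effects listed above matter so much in a real device.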
I hadn't read your original post at the time, but I'm not saying "be sloppy", just that you could still do quantitative work while making better use of the surface characteristics or locality. Ideas that may or may not be useful include expanding trig functions based on nearby calculations and breaking things up based on the "small" number of (mostly flat) surfaces. Even things like look-up tables can end up costing performance (the extreme case of course being VM). In at least one case, I found that an "irrelevant" sort of some text files made a huge difference in execution time (due to later thrashing).
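One possible reading of "expanding trig functions based on nearby calculations" is a Taylor update: if sin and cos are already known for one ray's angle, the values for a neighbouring angle a small step delta away can be approximated from them without calling the library again. The sketch below is only that interpretation, with hypothetical names, and it uses a second-order expansion, so the error grows roughly like delta cubed. Whether that accuracy is acceptable for quantitative work is something only the real code can answer.

```cpp
#include <cassert>
#include <cmath>

// Sine and cosine of one angle, kept together so both can be updated at once.
struct SinCos {
    double s;
    double c;
};

// Reference values from the standard library.
inline SinCos exact_sincos(double theta) {
    return SinCos{std::sin(theta), std::cos(theta)};
}

// Second-order Taylor update around a known angle:
//   sin(t + d) ~ sin t + d cos t - (d^2 / 2) sin t
//   cos(t + d) ~ cos t - d sin t - (d^2 / 2) cos t
// Only sensible for small |delta|; the truncation error is of order delta^3.
inline SinCos nearby_sincos(SinCos base, double delta) {
    const double half_sq = 0.5 * delta * delta;
    return SinCos{base.s + delta * base.c - half_sq * base.s,
                  base.c - delta * base.s - half_sq * base.c};
}
```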
Just FYI but here is a quick reference to a few papers by my predecessor who wrote this Fortran code in the first place: http://scholar.google.ca/scholar?hl=en&lr=&q=author%3AShmatov+ray+tracing&btnG=Search
Thanks, I always ask people to post links to their work since it makes things more interesting and I'll look as soon as I can. I thought the code may be a bit old given the language :)
We already have locality to a certain extent since each ray is bouncing around in its own "box" with a fixed number of facets and only interacts with certain predefined neighboring "boxes". So I am not too worried about cache at this point. I may have to change my mind later on though if this turns out to be a bottleneck.
In this case, you could imagine a thread-per-box or something. Don't underestimate the return on things like upfront sorting or difference representations (if deltas are smaller and reduce memory, you may still get a net gain in performance even after paying the cost of undoing the "compression").
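A minimal sketch of the thread-per-box idea, under stated assumptions: the ray record, the function names and the "tracing" itself (here just a sum of ray powers) are all placeholders, not anything from the original code. Rays are first grouped by box, which is the up-front sort in its simplest form, and then one thread works on each box. std::thread is used only to keep the example self-contained; with Boost.Thread the same shape could be written with boost::thread_group, using create_thread for each box and join_all at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical ray record; only the fields this sketch needs.
struct BoxRay {
    int box;       // index of the box the ray starts in
    double power;  // optical power carried by the ray
};

// Placeholder for the real per-box tracing: it only sums the ray powers.
inline void trace_box(const std::vector<BoxRay>& rays, double& result) {
    double sum = 0.0;
    for (const BoxRay& r : rays) sum += r.power;
    result = sum;
}

// Group rays by box, then run one thread per box and wait for all of them.
inline std::vector<double> trace_thread_per_box(const std::vector<BoxRay>& rays,
                                                std::size_t box_count) {
    std::vector<std::vector<BoxRay>> per_box(box_count);
    for (const BoxRay& r : rays) {
        assert(r.box >= 0 && static_cast<std::size_t>(r.box) < box_count);
        per_box[static_cast<std::size_t>(r.box)].push_back(r);
    }

    // Each thread writes to its own element, so no locking is needed.
    std::vector<double> results(box_count, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(box_count);
    for (std::size_t b = 0; b < box_count; ++b) {
        workers.emplace_back(trace_box, std::cref(per_box[b]),
                             std::ref(results[b]));
    }
    for (std::thread& t : workers) t.join();
    return results;
}
```

If there are many more boxes than cores, a fixed pool of threads that each pull several boxes would probably be a better fit than one thread per box; and if rays hop between neighbouring boxes, the hand-off between threads would need synchronization that this sketch leaves out.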
While you say that rays are independent, if you do classical physical optics, nearby rays tend to have similar trajectories etc. Rather than let an ignorant but fair thread scheduler decide what piece of memory to access next, if you are cache aware, you could even consider something like sorting the rays to get the best locality and making them dependent with a transform scheme that recognizes they are similar if nearby etc.
Regards,
Michel Lestrade
Crosslight Software
participants (2)
- Michel Lestrade
- Mike Marchywka