Hi, I actually looked at that first, as well as another Intel solution called TBB (Threading Building Blocks). It seems to me that these approaches are made for instruction-level parallelism (i.e. processing a long loop with parallel_do, or different but independent sections of code). I have a ton of other Fortran code that would benefit from this, but the ray tracer is not part of it.
From what I have read so far, what I am looking for is task-level parallelism, not instruction-level, which is why I am looking into threads.

If you want some ideas for reference, I did a quick check on the Intel site and they have some papers and apparently even an OpenMP Fortran compiler, if that helps you [obviously their comments are specific to their products but quite generally useful, esp. if you are looking for ideas]:
http://cache-www.intel.com/cd/00/00/21/92/219292_hyperthreading_extract.pdf
http://www.google.com/search?hl=en&q=site%3Aintel.com+openmp+performance+optimization

In our approach, light emission is isotropic (physically, it corresponds to spontaneous emission in a semiconductor) and we want to know what the extraction efficiency is (how much light actually gets out) and what the output distribution is going to look like. This is affected by die shaping, external packaging, refractive index of cladding layers, position and reflectivity of metal contacts, TE/TM emission, etc. So the difference between nearby rays is critical, unlike a visualization/CG render situation.
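To make "isotropic emission" and "extraction efficiency" concrete, here is a minimal Python sketch (not the Fortran code discussed here). It assumes a single flat facet with its normal along +z, ignores Fresnel losses, absorption and multiple bounces, and uses hypothetical function names and refractive indices chosen only for illustration. A ray counts as extracted if it hits the facet inside the escape cone set by the critical angle.

```python
import math
import random

def sample_isotropic_direction(rng):
    """Return a unit vector drawn uniformly over the full sphere."""
    # Uniform in cos(theta) and in phi gives a uniform solid-angle density.
    cos_theta = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    sin_theta = math.sqrt(1.0 - cos_theta * cos_theta)
    return (sin_theta * math.cos(phi), sin_theta * math.sin(phi), cos_theta)

def escape_fraction_single_facet(n_inside, n_outside, n_rays, seed=0):
    """Monte Carlo estimate of the fraction of rays inside the escape cone.

    Assumption: one planar facet with normal +z, no Fresnel reflection,
    no absorption, no second chances after total internal reflection.
    """
    rng = random.Random(seed)
    # Critical angle: sin(theta_c) = n_outside / n_inside.
    cos_critical = math.sqrt(1.0 - (n_outside / n_inside) ** 2)
    escaped = 0
    for _ in range(n_rays):
        _, _, dz = sample_isotropic_direction(rng)
        if dz > cos_critical:  # angle to the facet normal is below theta_c
            escaped += 1
    return escaped / n_rays

def escape_fraction_analytic(n_inside, n_outside):
    """Solid angle of the escape cone divided by 4*pi: (1 - cos(theta_c)) / 2."""
    cos_critical = math.sqrt(1.0 - (n_outside / n_inside) ** 2)
    return 0.5 * (1.0 - cos_critical)

if __name__ == "__main__":
    # Illustrative indices only: a high-index semiconductor emitting into air.
    print(escape_fraction_single_facet(3.5, 1.0, 200_000, seed=1))
    print(escape_fraction_analytic(3.5, 1.0))
```

With these illustrative indices only about 2% of the rays fall inside the escape cone of a single facet, which is why the geometry and the reflectors matter so much for the final number.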
Just FYI, here is a quick reference to a few papers by my predecessor, who wrote this Fortran code in the first place:
http://scholar.google.ca/scholar?hl=en&lr=&q=author%3AShmatov+ray+tracing&btnG=Search

We already have locality to a certain extent, since each ray is bouncing around in its own "box" with a fixed number of facets and only interacts with certain predefined neighboring "boxes". So I am not too worried about cache at this point. I may have to change my mind later on, though, if this turns out to be a bottleneck.
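As a rough illustration of how the "box" structure could map onto task-level parallelism, here is a small Python sketch that groups rays by their starting box and hands each group to a worker as one task. Everything in it is hypothetical: the `(ray_id, box_id)` tuples, the function names, and the `trace_group` placeholder, which only counts rays instead of tracing them. Note also that CPython threads do not speed up CPU-bound work because of the global interpreter lock; the sketch only shows the structure, and the real thing would use native threads or OpenMP tasks in Fortran.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def group_rays_by_box(rays):
    """Map box id -> list of ray ids starting in that box.

    Each ray is a hypothetical (ray_id, box_id) tuple.
    """
    groups = defaultdict(list)
    for ray_id, box_id in rays:
        groups[box_id].append(ray_id)
    return dict(groups)

def trace_group(box_id, ray_ids):
    """Placeholder for the real tracer: it only reports how many rays it got."""
    return box_id, len(ray_ids)

def trace_all(rays, n_workers=4):
    """Submit one task per box so rays sharing a box are handled together."""
    groups = group_rays_by_box(rays)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(trace_group, box_id, ids)
                   for box_id, ids in groups.items()]
        return dict(f.result() for f in futures)

if __name__ == "__main__":
    rays = [(0, "a"), (1, "b"), (2, "a"), (3, "c")]
    print(trace_all(rays))
```

Grouping by box keeps the facet data for one box in use by one task at a time, which is one simple way to preserve the locality described above.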
While you say that rays are independent, if you do classical physical optics, nearby rays tend to have similar trajectories etc. Rather than let an ignorant but fair thread scheduler decide what piece of memory to access next, if you are cache aware, you could even consider something like sorting the rays to get the best locality and making them dependent with a transform scheme that recognizes they are similar if nearby etc.

Regards,
Michel Lestrade
Crosslight Software