[phoenix] compile-time performance

On 9/12/2011 4:06 AM, Thomas Heller wrote:
On Monday, September 12, 2011 02:57:21 PM Joel de Guzman wrote:
For example, here is the current CT status of Phoenix2 vs Phoenix3 comparing the elapsed (CT) time for the phoenix2 vs. phoenix3 lambda_tests.cpp (**):
MSVC 10: Phoenix2: 00:04.5 Phoenix3: 00:29.9
G++ 4.5: Phoenix2: 00:02.6 Phoenix3: 00:04.7
I wasn't aware that Phoenix3 was so bad under MSVC 10.
Me neither. If that's the case, we have work to do.
You all know that Phoenix2 uses Fusion exclusively. Phoenix3 uses Proto, which according to Eric does not use Fusion, although IIRC the core of Phoenix3 still uses some Fusion (quick check: Thomas uses a PP-optimized version of fusion::vector for Phoenix3).
Heller did a helluva lot of perf tweaks on Phx3 to get that number for g++ (alas, not MSVC). In fairness, I did absolutely no CT perf tweaks for either Phoenix2 or Fusion.
(** I made sure both tests have exactly the same code, so I removed the last test. I can post the exact code if need be)
OK, I retract everything I said. Fusion rulez, and Proto needs help. Sorry for the blind accusation. Let's stop this and figure out what needs to be fixed.
FWIW, there are some unit tests that outperform the compile times of Phoenix2 (with gcc); the current bad hit on compile times seems to occur only with let, lambda and switch/case expressions.
This /suggests/ to me that the core of Phx3 is sound, but that the implementations of let, lambda, and switch are heavy. But I'm not blaming Thomas. I'm just suggesting a logical place to begin an investigation. And regardless, we should point Steven's template profiler at the code and see what it says. Thomas, do you want to give that a shot?
-- Eric Niebler
BoostPro Computing
http://www.boostpro.com

On Mon, Sep 12, 2011 at 8:54 PM, Eric Niebler <eric@boostpro.com> wrote:
On 9/12/2011 4:06 AM, Thomas Heller wrote:
On Monday, September 12, 2011 02:57:21 PM Joel de Guzman wrote:
For example, here is the current CT status of Phoenix2 vs Phoenix3 comparing the elapsed (CT) time for the phoenix2 vs. phoenix3 lambda_tests.cpp (**):
MSVC 10: Phoenix2: 00:04.5 Phoenix3: 00:29.9
G++ 4.5: Phoenix2: 00:02.6 Phoenix3: 00:04.7
I wasn't aware that Phoenix3 was so bad under MSVC 10.
Me neither. If that's the case, we have work to do.
You all know that Phoenix2 uses Fusion exclusively. Phoenix3 uses Proto, which according to Eric does not use Fusion, although IIRC the core of Phoenix3 still uses some Fusion (quick check: Thomas uses a PP-optimized version of fusion::vector for Phoenix3).
Heller did a helluva lot of perf tweaks on Phx3 to get that number for g++ (alas, not MSVC). In fairness, I did absolutely no CT perf tweaks for either Phoenix2 or Fusion.
(** I made sure both tests have exactly the same code, so I removed the last test. I can post the exact code if need be)
OK, I retract everything I said. Fusion rulez, and Proto needs help. Sorry for the blind accusation. Let's stop this and figure out what needs to be fixed.
FWIW, there are some unit tests that outperform the compile times of Phoenix2 (with gcc); the current bad hit on compile times seems to occur only with let, lambda and switch/case expressions.
This /suggests/ to me that the core of Phx3 is sound, but that the implementations of let, lambda, and switch are heavy. But I'm not blaming Thomas. I'm just suggesting a logical place to begin an investigation.
FWIW, I experimented with different implementations; I even copied the phx2 implementation almost identically (as far as I could). For the current implementation I have several suspects. First of all, I think the code of the mentioned parts is very ugly (in both V2 and V3). Second, there are some pretty severe type transformations going on, repeated calls to fusion::make_map etc. I am absolutely sure that there is quite some room for improvement.
And regardless, we should point Steven's template profiler at the code and see what it says. Thomas, do you want to give that a shot?
Yes, I can do that, though I am swamped with work until the end of the week. So if anyone wants to give some hints ... feel free to do so ...

On 9/12/2011 4:45 PM, Thomas Heller wrote:
On Mon, Sep 12, 2011 at 8:54 PM, Eric Niebler <eric@boostpro.com> wrote:
On 9/12/2011 4:06 AM, Thomas Heller wrote:
FWIW, there are some unit tests that outperform the compile times of Phoenix2 (with gcc); the current bad hit on compile times seems to occur only with let, lambda and switch/case expressions.
This /suggests/ to me that the core of Phx3 is sound, but that the implementations of let, lambda, and switch are heavy. But I'm not blaming Thomas. I'm just suggesting a logical place to begin an investigation.
FWIW, I experimented with different implementations; I even copied the phx2 implementation almost identically (as far as I could). For the current implementation I have several suspects. First of all, I think the code of the mentioned parts is very ugly (in both V2 and V3). Second, there are some pretty severe type transformations going on, repeated calls to fusion::make_map etc. I am absolutely sure that there is quite some room for improvement.
<looks on trunk>
fusion::make_map doesn't appear in the phoenix code anywhere. I see fusion::map appears in scope/detail/make_locals.hpp, but that header doesn't seem to get included from anywhere except itself, AFAICT. So, uh, what's up with that?
-- Eric Niebler
BoostPro Computing
http://www.boostpro.com

On Monday, September 12, 2011 11:38:44 PM Eric Niebler wrote:
On 9/12/2011 4:45 PM, Thomas Heller wrote:
On Mon, Sep 12, 2011 at 8:54 PM, Eric Niebler <eric@boostpro.com> wrote:
On 9/12/2011 4:06 AM, Thomas Heller wrote:
FWIW, there are some unit tests that outperform the compile times of Phoenix2 (with gcc); the current bad hit on compile times seems to occur only with let, lambda and switch/case expressions.
This /suggests/ to me that the core of Phx3 is sound, but that the implementations of let, lambda, and switch are heavy. But I'm not blaming Thomas. I'm just suggesting a logical place to begin an investigation.
FWIW, I experimented with different implementations; I even copied the phx2 implementation almost identically (as far as I could). For the current implementation I have several suspects. First of all, I think the code of the mentioned parts is very ugly (in both V2 and V3). Second, there are some pretty severe type transformations going on, repeated calls to fusion::make_map etc. I am absolutely sure that there is quite some room for improvement.
<looks on trunk>
fusion::make_map doesn't appear in the phoenix code anywhere. I see fusion::map appears in scope/detail/make_locals.hpp, but that header doesn't seem to get included from anywhere except itself, AFAICT. So, uh, what's up with that?
My apologies ... that file is a leftover from one of my older implementations. After looking at the code I can explain what's going on. First of all there is scope/local_variable.hpp. This file contains the definition of the locals; that is, it is hooked into the custom_terminal mechanism I invented to also recognize reference_wrapper. Second are the definitions for the is_nullary metafunction. Third is the include of scope/detail/local_variable.hpp, which is responsible for the lookup of local variables and is basically just copied from phoenix V2. Then there are let and lambda, which are more or less independent of the local variables except for the lookup. Now, let shouldn't be so bad ... the only thing that might hit badly is the local lookup (no real proof though ...). lambda is bad, compile-time-wise, because it generates two proto expression trees: once for the lambda itself, and then again after the first evaluation round.
HTH, Thomas

On 9/14/2011 4:01 AM, Thomas Heller wrote:
On Monday, September 12, 2011 11:38:44 PM Eric Niebler wrote:
On 9/12/2011 4:45 PM, Thomas Heller wrote:
On Mon, Sep 12, 2011 at 8:54 PM, Eric Niebler <eric@boostpro.com> wrote:
On 9/12/2011 4:06 AM, Thomas Heller wrote:
FWIW, there are some unit tests that outperform the compile times of Phoenix2 (with gcc); the current bad hit on compile times seems to occur only with let, lambda and switch/case expressions.
This /suggests/ to me that the core of Phx3 is sound, but that the implementations of let, lambda, and switch are heavy. But I'm not blaming Thomas. I'm just suggesting a logical place to begin an investigation.
FWIW, I experimented with different implementations; I even copied the phx2 implementation almost identically (as far as I could). For the current implementation I have several suspects. First of all, I think the code of the mentioned parts is very ugly (in both V2 and V3). Second, there are some pretty severe type transformations going on, repeated calls to fusion::make_map etc. I am absolutely sure that there is quite some room for improvement.
<looks on trunk>
fusion::make_map doesn't appear in the phoenix code anywhere. I see fusion::map appears in scope/detail/make_locals.hpp, but that header doesn't seem to get included from anywhere except itself, AFAICT. So, uh, what's up with that?
My apologies ... that file is a leftover from one of my older implementations. After looking at the code I can explain what's going on. First of all there is scope/local_variable.hpp. This file contains the definition of the locals; that is, it is hooked into the custom_terminal mechanism I invented to also recognize reference_wrapper. Second are the definitions for the is_nullary metafunction. Third is the include of scope/detail/local_variable.hpp, which is responsible for the lookup of local variables and is basically just copied from phoenix V2. Then there are let and lambda, which are more or less independent of the local variables except for the lookup. Now, let shouldn't be so bad ... the only thing that might hit badly is the local lookup (no real proof though ...). lambda is bad, compile-time-wise, because it generates two proto expression trees: once for the lambda itself, and then again after the first evaluation round.
Here are my current timings:

vc 10:
  let     phx2  4.2    phx3 10.9
  lambda  phx2  4.7    phx3 30.1
  switch  phx2  4.2    phx3  6.1

g++:
  let     phx2  2.9    phx3  2.9
  lambda  phx2  2.8    phx3  4.8
  switch  phx2  2.5    phx3  2.2

I think we should concentrate on lambda, as it is the worst hit of all, especially for MSVC. As you noted, some tests are faster due to the recent CT-performance tweaks you did for Phoenix3, at least on g++. That is evident on e.g. switch. The others are still reasonable (e.g. let) but hopefully can be improved on MSVC.

Regards,
-- Joel de Guzman
http://www.boostpro.com
http://boost-spirit.com

Hi,

I'm interested in this compile-time issue because I'm facing the same kind of issue in some of my own projects. I'm extensively using Boost, so I don't really know if it comes from it or if it comes from my code (which similarly uses a lot of templates).

Anyway, could you share a little bit about how you find out what the compile-time hitters are? I've got some sources taking like 40s to build with gcc and it's getting really annoying. Are there some best practices when using template metaprogramming? Some tricks to know about which patterns are slow and which are quicker for the compiler?

For example -- as this is a phoenix thread -- could you share some examples of what took time in phoenix and how you fixed it?

I've tried Steven Watanabe's template profiler but got a lot of trouble with the STL complaining about its code being modified (which is true, as the template profiler adds some code to it).

Best regards,
-- Beren Minor

On Wednesday, September 14, 2011 10:15:54 AM Beren Minor wrote:
Hi,
I'm interested in this compile-time issue because I'm facing the same kind of issue in some of my own projects. I'm extensively using Boost so I don't really know if it comes from it or if it comes from my code (which similarly uses a lot of templates).
Anyway, could you share a little bit about how you find out what the compile-time hitters are? I've got some sources taking like 40s to build with gcc and it's getting really annoying. Are there some best practices when using template metaprogramming? Some tricks to know about which patterns are slow and which are quicker for the compiler?
For example -- as this is a phoenix thread -- could you share some examples of what took time in phoenix and how you fixed it?
Unfortunately I can't really give general advice on what to do. Here is what I have done for parts of phoenix that brought down compile times on gcc and clang; obviously this doesn't hold for MSVC.

1) Partially preprocess headers that have preprocessor loops to emulate variadic templates. This helped reduce the constant time needed for the preprocessor to generate code. Compile times were reduced significantly, but only in TUs that are not that big (i.e. if you have long expressions that take ages to compile it is not really noticeable; an example here is Spirit, which rarely depends on the variadics and spends most of its time instantiating templates).

2) Avoid MPL metafunctions like at, if_ etc. and use at_c, if_c instead. This reduces the number of instantiated templates, because those metafunctions are usually implemented in terms of their _c counterparts, for example:

    template <typename Sequence, typename N>
    struct at : at_c<Sequence, N::value> {};

Unfortunately this trick doesn't hold for Fusion, as it is the other way around there: the _c functions are implemented in terms of their non-_c counterparts. (FWIW, I think here lies a potential optimization opportunity for Fusion.)

3) Avoid full specializations. According to the standard, a template that is fully specialized needs to be instantiated. I used this technique for the various customization points in phoenix, for example to register rules and actions. The definition for an action is:

    struct default_actions
    {
        template <typename Rule, typename Dummy = void>
        struct when;
    };

When registering an action for a certain rule you can write:

    template <typename Dummy>
    struct default_actions::when<your_rule, Dummy>
    {
        // ...
    };

This avoids the instantiation of that specific template when the header containing it is included but the expression triggering that template isn't actually used, thus saving time. However, I tried to apply that trick to the various Fusion extension points and it didn't work. One possible explanation for this behaviour is that the Fusion extension points are very lightweight structs themselves. That is, they contain a nested struct which is itself a template and can't be instantiated yet. Thus the added complexity of the Dummy parameter led to more compile time instead of decreasing it through fewer instantiations. The phoenix actions' nested when struct is different, as it is heavier to instantiate (it is usually implemented to derive from phoenix::call, which might be quite heavy on the compiler ... there might be another optimization possibility).

4) Avoid SFINAE. It doesn't scale. Consider a function that is overloaded with different SFINAE conditions enabled. Upon a function call, the compiler puts all of the overloads that have the same arity as the call into the "suitable function set"; after that, every SFINAE template needs to be instantiated so it can be decided whether the type expression is valid or not. Prefer tag dispatching and boolean metafunctions to dispatch to the correct function.
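To make point 4 concrete, here is a minimal sketch (not Phoenix code; all names are made up for illustration) of replacing a set of enable_if-guarded overloads with one boolean metafunction plus tag dispatching:

    #include <boost/mpl/bool.hpp>

    // boolean metafunction deciding which implementation applies to T
    template <typename T>
    struct is_fancy : boost::mpl::false_ {};

    // the two implementations, selected by a tag argument
    template <typename T>
    void process_impl(T const&, boost::mpl::true_)  { /* "fancy" path */ }

    template <typename T>
    void process_impl(T const&, boost::mpl::false_) { /* plain path */ }

    // single entry point: only one overload for the compiler to collect,
    // and no enable_if expressions to instantiate for every candidate
    template <typename T>
    void process(T const& t)
    {
        process_impl(t, typename is_fancy<T>::type());
    }

Compared with several process() overloads each wrapped in enable_if, the compiler only ever sees one entry point per call, and only the is_fancy<T> metafunction gets instantiated to pick the implementation.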
I've tried Steven Watanabe's template profiler but got a lot of trouble with the STL complaining about its code being modified (which is true, as the template profiler adds some code to it).
I haven't really used it myself yet, as I don't find it very helpful for extracting valuable information out of the output. Additionally, the instantiation count alone isn't really meaningful, as can be seen from the diverging compile times of gcc and msvc in the other posts. Also, the results of the various optimization attempts show that it isn't enough. I hope the lines I wrote above are enough to continue the discussion. Please correct me if I am wrong, and I would be happy if others contributed their experience too.

Joel Falcou also suggested compiling a list of the usual TMP scenarios with different use-case patterns. Based on these small-scale examples we could analyze certain optimization or pessimization techniques more efficiently. I think we just don't fully understand the impact of TMP on compilers yet. I would like to get opinions and insight from compiler vendors too, as they are at the source of the "evil".

On 9/14/2011 6:32 PM, Thomas Heller wrote:
Unfortunately this trick doesn't hold for Fusion, as it is the other way around there: the _c functions are implemented in terms of their non-_c counterparts. (FWIW, I think here lies a potential optimization opportunity for Fusion.)
Patches always welcome ;-)
Regards,
-- Joel de Guzman
http://www.boostpro.com
http://boost-spirit.com

On 9/14/2011 6:57 PM, Joel de Guzman wrote:
On 9/14/2011 6:32 PM, Thomas Heller wrote:
Unfortunately this trick doesn't hold for Fusion, as it is the other way around there: the _c functions are implemented in terms of their non-_c counterparts. (FWIW, I think here lies a potential optimization opportunity for Fusion.)
Patches always welcome ;-)
Now I recall: I was merely following MPL. I took a look at MPL and it's done this way there too. See at.hpp:

    template<
          typename BOOST_MPL_AUX_NA_PARAM(Sequence)
        , typename BOOST_MPL_AUX_NA_PARAM(N)
        >
    struct at
        : at_impl< typename sequence_tag<Sequence>::type >
            ::template apply< Sequence,N >
    {
        BOOST_MPL_AUX_LAMBDA_SUPPORT(2,at,(Sequence,N))
    };

    template<
          typename Sequence
        , BOOST_MPL_AUX_NTTP_DECL(long, N)
        >
    struct at_c
        : at_impl< typename sequence_tag<Sequence>::type >
            ::template apply< Sequence,mpl::long_<N> >
    {
    };

And, thinking about this, it would break backward compatibility if we changed the API for the extension mechanism to accept plain integral constants instead. The gains you see in using at_c happen when you start to do some computation, e.g. at_c<a + b> is faster than at<mpl::plus<a, b> >.

Regards,
-- Joel de Guzman
http://www.boostpro.com
http://boost-spirit.com
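(Purely to illustrate that last point, with a made-up sequence that isn't from Phoenix: the difference is computing the index as a plain integral constant for at_c versus building int_<>/plus<> wrappers for at.)

    #include <boost/mpl/vector.hpp>
    #include <boost/mpl/at.hpp>
    #include <boost/mpl/int.hpp>
    #include <boost/mpl/plus.hpp>
    #include <boost/static_assert.hpp>
    #include <boost/type_traits/is_same.hpp>

    typedef boost::mpl::vector<char, short, int, long> seq;

    // index computed as a plain compile-time integer: only at_c is involved
    typedef boost::mpl::at_c<seq, 1 + 2>::type a;   // long

    // index computed in metafunction form: int_<>, plus<> and at<> all get
    // instantiated on top of the same underlying lookup
    typedef boost::mpl::at<
        seq
      , boost::mpl::plus< boost::mpl::int_<1>, boost::mpl::int_<2> >::type
    >::type b;                                       // long

    BOOST_STATIC_ASSERT((boost::is_same<a, b>::value));

Both forms yield the same type; the second just drags a few more template instantiations along for the ride.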

On 14.09.2011 12:32, Thomas Heller wrote:
Joel Falcou also suggested compiling a list of the usual TMP scenarios with different use-case patterns. Based on these small-scale examples we could analyze certain optimization or pessimization techniques more efficiently. I think we just don't fully understand the impact of TMP on compilers yet. I would like to get opinions and insight from compiler vendors too, as they are at the source of the "evil".
Such a list would be interesting from the compiler perspective as well, as we would have small test cases on which to analyze (and hopefully improve) performance.
Sebastian
participants (5):
- Beren Minor
- Eric Niebler
- Joel de Guzman
- Sebastian Redl
- Thomas Heller