Random access and memoization in TMP

Hi. The other day I was pondering the unique built-in memoization capabilities of templates. It occurred to me that if one could share more between different instantiations, the compilation process would be sped up.

For example, take drop :: Int -> [a] -> [a], defined so that

  (drop 0 list) is (list)
  (drop 1 list) is (tail list)
  (drop 2 list) is (tail (tail list))

In Haskell this would be written

  drop 0 list = list
  drop n list = drop (n-1) (tail list)

so that the last clause exploits tail call optimization. However, another way would be

  drop 0 list = list
  drop n list = tail (drop (n-1) list)

Now, take

  at :: Int -> [a] -> a
  at n list = head (drop n list)

If I construct a list in O(n) and then take the last element with at, also in O(n), then access to every element becomes O(1), because traversing to the last element precalculated all the other positions.

I made a test with gcc in which I traverse a list in the forward direction but naively use at to get to the current position. There are four variants: homemade_slow uses the first drop definition, homemade_fast uses the second, and mpl_vector and mpl_list use those sequence classes. The times were, in seconds:

  homemade_slow   41
  homemade_fast    0.7
  mpl_vector       4
  mpl_list        11

I've been thinking these past few days of other functions like drop that could be improved in a similar way, but they have eluded me. I didn't find mention of similar techniques in the MPL book, so I wonder if I'm making any sense.

Regards,
Bruno
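For reference, the two styles translate to C++ roughly as in the sketch below. This is not the code used for the timings; the cons list and the names drop_slow, drop_fast and at_fast are invented for illustration.

struct nil {};

template<class Head, class Tail>
struct cons
{
    typedef Head head;
    typedef Tail tail;
};

// First style: recurse on the tail, like "drop n list = drop (n-1) (tail list)".
// Every starting position spawns its own chain of instantiations, so nothing
// is shared between drop_slow<i, L> for different i.
template<int N, class List>
struct drop_slow
{
    typedef typename drop_slow<N-1, typename List::tail>::type type;
};

template<class List>
struct drop_slow<0, List>
{
    typedef List type;
};

// Second style: recurse on the same list, like "drop n list = tail (drop (n-1) list)".
// drop_fast<0, L> .. drop_fast<N-1, L> all get instantiated (and memoized), so a
// later drop_fast<M, L> with M <= N is found in the compiler's instantiation cache.
template<int N, class List>
struct drop_fast
{
    typedef typename drop_fast<N-1, List>::type::tail type;
};

template<class List>
struct drop_fast<0, List>
{
    typedef List type;
};

// at n list = head (drop n list)
template<int N, class List>
struct at_fast
{
    typedef typename drop_fast<N, List>::type::head type;
};

// Hypothetical usage: touching the last element once precalculates the rest.
typedef cons<char, cons<short, cons<int, cons<long, nil> > > > list4;
typedef at_fast<3, list4>::type last_elem;    // instantiates drop_fast<0..3, list4>
typedef at_fast<1, list4>::type second_elem;  // reuses the memoized drop_fast<1, list4>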

On 08/17/2005 12:41 PM, Bruno Martínez wrote:
The other day I was pondering the unique built-in memoization capabilities of templates. [snip] If I construct a list in O(n) and then take the last element with at, also in O(n), then access to every element becomes O(1), because traversing to the last element precalculated all the other positions. [snip] I've been thinking these past few days of other functions like drop that could be improved in a similar way, but they have eluded me. I didn't find mention of similar techniques in the MPL book, so I wonder if I'm making any sense.
Makes perfect sense to me. The fold metafunction calculates a sequence of partial results which can then be accessed with child_i_depth_j as shown in the test driver in:

http://boost-sandbox.sourceforge.net/vault/index.php?&direction=0&order=&directory=cppljevans/mpl

There's another application in the cppljevans/mpl directory which can calculate the strides in a multi-dimensional array with constant sizes for each dimension. There's also another application which calculates offsets of elements in a tuple. I can envision another which calculates the closure of a bool matrix, but that would probably tax the compiler a bit much ;(

On Wed, 17 Aug 2005 15:29:51 -0300, Larry Evans <cppljevans@cox-internet.com> wrote:
On 08/17/2005 12:41 PM, Bruno Martínez wrote:
The other day I was pondering the unique built-in memoization capabilities of templates. [snip] If I construct a list in O(n) and then take the last element with at, also in O(n), then access to every element becomes O(1), because traversing to the last element precalculated all the other positions. [snip] I've been thinking these past few days of other functions like drop that could be improved in a similar way, but they have eluded me. I didn't find mention of similar techniques in the MPL book, so I wonder if I'm making any sense.
Makes perfect sense to me. The fold metafunction calculates a sequence of partial results which can then be accessed with child_i_depth_j as shown in test driver in:
The recursion in child_i_depth_j is a tail call. IIUC, the recursion is like the first drop, so increasing depths don't share part of the calculation. However, it could be written in the other style. Bruno

On 08/18/2005 12:07 AM, Bruno Martínez wrote:
On Wed, 17 Aug 2005 15:29:51 -0300, Larry Evans [snip]
Makes perfect sense to me. The fold metafunction calculates a sequence of partial results which can then be accessed with child_i_depth_j as shown in test driver in:
The recursion in child_i_depth_j is a tail call. IIUC, the recursion is like the first drop, so increasing depths don't share part of the calculation. However, it could be written in the other style.
Yeah, you're right. I was thinking more of how fold was done. It calculates all the intermediate results, but leaves no way to access them. That was the purpose of child_i_depth_j. I'll *eventually* try to write it the other way, but I'm not sure when. Of course someone *expert* in doing that might try ;) -Larry

On 08/18/2005 05:08 AM, Larry Evans wrote: [snip]
Yeah, you're right. I was thinking more of how fold was done. It calculates all the intermediate results, but leaves no way to access them. That was the purpose of child_i_depth_j. I'll *eventually* try to write it the other way, but I'm not sure when. Of course someone *expert* in doing that might try ;)
Now that I think of it, the code at

http://cvs.sourceforge.net/viewcvs.py/boost-sandbox/boost-sandbox/boost/indexed_types/product.hpp?rev=1.2&view=auto

does that, but, unlike fold, it's specific to the factor template. In addition, it uses

  TypeMap<Indices(Index-1)>::type

to access the elements in the TypeSequence used to create the product (or tuple, to use the C++ standard term). Since that's O(1) time complexity, it should be no problem. Now, if we could just modify fold to do something similar. Unfortunately, the problem with fold is that the arguments of the partial result, ForwardOp<PreviousPartial,CurrentElement>, always vary; hence, they cannot be named easily with something like

  factor<Indices, TypeMap, Index+1>

as in product.hpp :( Maybe this would provide you (and me) with some ideas for how to do this.

-best of luck
Larry

Larry Evans <cppljevans@cox-internet.com> writes:
On 08/18/2005 12:07 AM, Bruno Martínez wrote:
On Wed, 17 Aug 2005 15:29:51 -0300, Larry Evans [snip]
Makes perfect sense to me. The fold metafunction calculates a sequence of partial results which can then be accessed with child_i_depth_j as shown in test driver in:
The recursion in child_i_depth_j is a tail call. IIUC, the recursion is like the first drop, so increasing depths don't share part of the calculation. However, it could be written in the other style.
Yeah, you're right. I was thinking more of how fold was done. It calculates all the intermediate results, but leaves no way to access them.
Of course you can use fold to build an O(1) access structure like vector or a map containing all of your intermediate results, and then access that. -- Dave Abrahams Boost Consulting www.boost-consulting.com
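For example (a sketch only, with invented names, assuming the Boost.MPL interfaces of that era): a fold whose state is a pair of <current partial result, vector of all partials so far> leaves every intermediate result in a random-access sequence:

#include <boost/mpl/fold.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/mpl/vector_c.hpp>
#include <boost/mpl/push_back.hpp>
#include <boost/mpl/pair.hpp>
#include <boost/mpl/plus.hpp>
#include <boost/mpl/int.hpp>
#include <boost/mpl/at.hpp>
#include <boost/mpl/placeholders.hpp>

using namespace boost::mpl;
using namespace boost::mpl::placeholders;

// Invented example: running sums of 1,2,3,4.  The fold state is a
// pair< current partial result, vector of all partial results so far >.
typedef vector_c<int,1,2,3,4> nums;

typedef fold<
    nums
  , pair< int_<0>, vector<> >
  , pair<
        plus< first<_1>, _2 >                          // new partial result
      , push_back< second<_1>, plus< first<_1>, _2 > > // remember it
    >
>::type folded;

// second<folded>::type is now an mpl::vector of the partial sums
// (1, 3, 6, 10), each accessible in O(1) with at_c.
typedef at_c< second<folded>::type, 2 >::type third_partial; // ::value == 6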

Bruno Martínez <br1@internet.com.uy> writes:
I made a test with gcc, in which I traverse a list in forward direction but naively use at to get to the current position. There are four variants. homemade_slow uses the first drop definition. homemade_fast uses the second. mpl_vector and mpl_list use those sequence classes. The times were, in seconds:
  homemade_slow   41
  homemade_fast    0.7
  mpl_vector       4
  mpl_list        11
Memoization has lots of weird effects like this. If you amortize your "fast" access over touching all the elements of the list you clearly are beating mpl::vector. But what happens when you only need to look at the last 5 elements of a 100-element sequence? I wonder how likely that is after all.
I've been thinking these past few days of other functions like drop that could be improved in a similar way, but they have eluded me. I didn't find mention of similar techniques in the MPL book, so I wonder if I'm making any sense.
You're making plenty of sense; it's something I've thought about quite a bit. But I never quite realized that you can make it "okay" to use O(N) operations as though they were O(1). Maybe this demands a re-think of metaprogram efficiency. If the MPL algorithms and iterators could be reworked to take better advantage of memoization, it might be that list becomes the most efficient data structure overall, again. It would also be very interesting to see what happens to an algorithm like sort when you take forced memoization into account. -- Dave Abrahams Boost Consulting www.boost-consulting.com

On Thu, 18 Aug 2005 10:59:51 -0300, David Abrahams <dave@boost-consulting.com> wrote:
Memoization has lots of weird effects like this. If you amortize your "fast" access over touching all the elements of the list you clearly are beating mpl::vector. But what happens when you only need to look at the last 5 elements of a 100-element sequence? I wonder how likely that is after all.
It should be just as efficient, because there's no 'extra' work involved. The same number of elements are touched. On the other hand, I haven't been able to think of a way to accelerate index_of<>, even at the cost of a small speed hit.
I've been thinking these past few days of other functions like drop that could be improved in a similar way, but they have eluded me. I didn't find mention of similar techniques in the MPL book, so I wonder if I'm making any sense.
You're making plenty of sense; it's something I've thought about quite a bit. But I never quite realized that you can make it "okay" to use O(N) operations as though they were O(1). Maybe this demands a re-think of metaprogram efficiency. If the MPL algorithms and iterators could be reworked to take better advantage of memoization, it might be that list becomes the most efficient data structure overall, again.
The patch to mpl::advance seems easy. I haven't done it locally because I don't know which version is the one my compiler actually sees.
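The idea would be something along these lines (a sketch only, not an actual patch: the real mpl::advance dispatches on the iterator category, and advance_memo is an invented name). Recurse on the same starting iterator so that all intermediate steps are memoized against it:

#include <boost/mpl/next_prior.hpp>

// Sketch only; not the real mpl::advance.  advance_memo<Iter, N> recurses
// on the *same* Iter, so advance_memo<Iter, 0> .. advance_memo<Iter, N> all
// end up in the compiler's instantiation cache, and a later advance from
// the same iterator reuses them.
template<class Iter, int N>
struct advance_memo
{
    typedef typename boost::mpl::next<
        typename advance_memo<Iter, N-1>::type
    >::type type;
};

template<class Iter>
struct advance_memo<Iter, 0>
{
    typedef Iter type;
};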
It would also be very interesting to see what happens to an algorithm like sort when you take forced memoization into account.
Indeed. I searched for theory on memoization before posting, without success. It would seem the C++ template system is the first language with complete memoization. It probably doesn't make sense to have a language like that on a speed basis. If I don't misunderstand the efficiency appendix of your book, even at<0, myvec> is O(N) because of the similar instantiations the compiler has to go through. The difference is that this work is not interpreted. Even without some weird O(N)->O(1) cases, memoization has a software engineering impact. As the book says, it's easier to avoid blobs, for example, because there's no speed hit. It seems to me that there should be several patterns and idioms appropriate for such a system treated in some paper somewhere. Bruno

Bruno Martínez <br1@internet.com.uy> writes:
On Thu, 18 Aug 2005 10:59:51 -0300, David Abrahams <dave@boost-consulting.com> wrote:
Memoization has lots of weird effects like this. If you amortize your "fast" access over touching all the elements of the list you clearly are beating mpl::vector. But what happens when you only need to look at the last 5 elements of a 100-element sequence? I wonder how likely that is after all.
It should be just as efficient, because there's no 'extra' work involved. The same number of elements are touched.
My point was that for mpl::vector that's O(1) while for you it's O(N), so vector might win in that case.
On the other hand, I haven't been able to think how to accelerate index_of<>, even if it takes a little speed hit.
Hmm.
I've been thinking these past few days of other functions like drop that could be improved in a similar way, but they have eluded me. I didn't find mention of similar techniques in the MPL book, so I wonder if I'm making any sense.
You're making plenty of sense; it's something I've thought about quite a bit. But I never quite realized that you can make it "okay" to use O(N) operations as though they were O(1). Maybe this demands a re-think of metaprogram efficiency. If the MPL algorithms and iterators could be reworked to take better advantage of memoization, it might be that list becomes the most efficient data structure overall, again.
The patch to mpl::advance seems easy. I haven't done it locally because I don't know which version is the one my compiler actually sees.
That's easy enough to test by preprocessing your source.
It would also be very interesting to see what happens to an algorithm like sort when you take forced memoization into account.
Indeed. I searched for theory on memoization before posting, without success. It would seem the C++ template system is the first language with complete memoization.
Probably the first one where "you pay for memoization no matter what, so you may as well use it."
It probably doesn't make sense to have a language like that on a speed basis. If I don't misunderstand the efficiency appendix of your book, even at<0, myvec> is O(N) because of the similar instantiations the compiler has to go through.
What led you to believe that?
The difference is that this work is not interpreted.
<blink> Sorry, you lost me.
Even without some weird O(N)->O(1) cases, memoization has a software engineering impact. As the book says, it's easier to avoid blobs, for example, because there's no speed hit.
The book says that?
It seems to me that there should be several patterns and idioms appropriate for such a system treated in some paper somewhere.
Sounds like a good research topic for... someone like you! ;-) -- Dave Abrahams Boost Consulting www.boost-consulting.com

On Thu, 18 Aug 2005 17:42:26 -0300, David Abrahams <dave@boost-consulting.com> wrote:
Bruno Martínez <br1@internet.com.uy> writes:
On Thu, 18 Aug 2005 10:59:51 -0300, David Abrahams <dave@boost-consulting.com> wrote:
Memoization has lots of weird effects like this. If you amortize your "fast" access over touching all the elements of the list you clearly are beating mpl::vector. But what happens when you only need to look at the last 5 elements of a 100-element sequence? I wonder how likely that is after all.
It should be just as efficient, because there's no 'extra' work involved. The same number of elements are touched.
My point was that for mpl::vector that's O(1) while for you it's O(N), so vector might win in that case.
You're right.
It probably doesn't make sense to have a language like that on a speed basis. If I don't misunderstand the efficiency appendix of your book, even at<0, myvec> is O(N) because of the similar instantiations the compiler has to go through.
What led you to believe that?
The book. It has to search for the "memoization record" in a list, even if re-instantiating the template would have been simple.
The difference is that this work is not interpreted.
<blink> Sorry, you lost me.
Nothing new. The search for the memoization record is faster because it's built in, and it doesn't go through the process of template instantiation. Metrowerks seems to be bringing the elements accessed to the front of the list, even.
Even without some weird O(N)->O(1) cases, memoization has a software engineering impact. As the book says, it's easier to avoid blobs, for example, because there's no speed hit.
The book says that?
Sorry for putting words in your mouth.
It seems to me that there should be several patterns and idioms appropriate for such a system treated in some paper somewhere.
Sounds like a good research topic for... someone like you! ;-)
:D Bruno

Bruno Martínez <br1@internet.com.uy> writes:
On Thu, 18 Aug 2005 17:42:26 -0300, David Abrahams <dave@boost-consulting.com> wrote:
Bruno Martínez <br1@internet.com.uy> writes:
It probably doesn't make sense to have a language like that on a speed basis. If I don't misunderstand the efficiency appendix of your book, even at<0, myvec> is O(N) because of the similar instantiations the compiler has to go through.
What led you to believe that?
The book. It has to search for the "memoization record" in a list, even if re-instantiating the template would have been simple.
Okay, let's be precise. *Technically*, at<0, myvec> is O(X) -- where X is the number of prior instantiations of at<i,v> -- on most existing compilers, but note that v doesn't have to be myvec nor does X have any relation to size<myvec>::value.
The difference is that this work is not interpreted.
<blink> Sorry, you lost me.
Nothing new. The search for the memoization record is faster because it's built in, and it doesn't go through the process of template instantiation.
Okay, that's one way to look at it. The reasons it is faster are not so important for the purposes of this analysis, so I prefer to say that it can be very difficult to observe this X because the constant is so low compared to the cost of instantiating a single template.
Metrowerks seems to be bringing the elements accessed to the front of the list, even.
I think GCC does it too.
Even without some weird O(N)->O(1) cases, memoization has a software engineering impact. As the book says, it's easier to avoid blobs, for example, because there's no speed hit.
The book says that?
Sorry for putting words in your mouth.
I guess you're suggesting that instantiating N templates with 1 nested typedef might be presumed to be slower than instantiating 1 template with N nested typedefs? -- Dave Abrahams Boost Consulting www.boost-consulting.com

Okay, let's be precise. *Technically*, at<0, myvec> is O(X) -- where X is the number of prior instantiations of at<i,v> -- on most existing compilers, but note that v doesn't have to be myvec nor does X have any relation to size<myvec>::value.
Yes.
Okay, that's one way to look at it. The reasons it is faster are not so important for the purposes of this analysis, so I prefer to say that it can be very difficult to observe this X because the constant is so low compared to the cost of instantiating a single template.
Yes.
Metrowerks seems to be bringing the elements accessed to the front of the list, even.
I think GCC does it too.
Really? I don't know for sure about Metrowerks, but reordering the list would explain the jumpy behaviour. In any case, the strategies of both compilers are different.
Even without some weird O(N)->O(1) cases, memoization has a software engineering impact. As the book says, it's easier to avoid blobs, for example, because there's no speed hit.
The book says that?
Sorry for putting words in your mouth.
I guess you're suggesting that instantiating N templates with 1 nested typedef might be presumed to be slower than instantiating 1 template with N nested typedefs?
No, that's not what I'm trying to say. Let me give an example, in Haskell notation:

  blob   :: Int -> (Int, Int)
  first  :: Int -> Int
  second :: Int -> Int

In a strict language the fastest implementation is:

  first i  = func1 (commonfunc i)
  second i = func2 (commonfunc i)
  blob i   = (func1 aux, func2 aux) where aux = commonfunc i

The client makes different calls depending on whether it needs first i, second i, or both. In a lazy language this way is just as fast:

  first i  = r where (r, _) = blob i
  second i = r where (_, r) = blob i
  blob i   = (func1 aux, func2 aux) where aux = commonfunc i

The client can just call blob and then extract one of the results without paying for the other. With TMP I can go even further:

  first i  = func1 (commonfunc i)
  second i = func2 (commonfunc i)
  blob i   = (first i, second i)

So blob can disappear without affecting performance.

Bruno
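In C++ template terms the same shape might look like the sketch below (all names invented; common_of just stands in for whatever the expensive shared computation is):

// Sketch only.  Because the compiler memoizes common_of<I>, first_of<I>
// and second_of<I> can each use it directly without duplicating the shared
// work, and blob_of<I> is just a convenience wrapper that adds no cost.
template<int I>
struct common_of   // "commonfunc": the expensive shared part
{
    static const int value = I * I;   // stand-in for real work
};

template<int I>
struct first_of    // first i = func1 (commonfunc i)
{
    static const int value = common_of<I>::value + 1;
};

template<int I>
struct second_of   // second i = func2 (commonfunc i)
{
    static const int value = common_of<I>::value - 1;
};

template<int I>
struct blob_of     // blob i = (first i, second i)
{
    static const int first  = first_of<I>::value;
    static const int second = second_of<I>::value;
};

// Asking only for first_of<7>::value never pays for second_of<7>, and
// asking for blob_of<7> pays for common_of<7> only once.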