[chrono] steady_clock efficiency

I expect most people will use boost::chrono::steady_clock to measure performance. I've been considering switching from my raw calls to the Windows performance counter APIs to steady_clock, but I'm a little worried about the overhead of what should be a non-intrusive facility.

In examining the source, I see that steady_clock::now() reads the performance counter and then multiplies it by the reciprocal of the counter's frequency (acquired on first use of the clock) to return the time in nanoseconds. How expensive is this 64-bit multiply on a 32-bit system? How hard would it be to have now() return a time_point that stores the raw tick value and only converts it to nanoseconds when a calculation in real units is required? (Comparison to another time_point could still be done in performance counter ticks, saving the multiply.) Is it worth the extra complexity to demand-convert ticks to nanoseconds?

Source is in boost_1_48_0\boost\chrono\detail\inlined\win\chrono.cpp
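For reference, the conversion the question describes can be sketched portably like this. This is a simplified stand-in, not the actual Boost source: the counter frequency below is a made-up constant replacing the real QueryPerformanceFrequency() call, and the real chrono.cpp should be consulted for the exact code. The shape of the math is the point: the reciprocal is cached once, so each now() costs one counter read plus one 64-bit multiply.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for QueryPerformanceFrequency(); a real machine reports its own
// counter rate. 3579545 Hz (the ACPI PM timer rate) is used for illustration.
const std::int64_t counter_frequency = 3579545;

// Computed once, on first use of the clock in the real implementation.
const double nanosecs_per_tick = 1e9 / static_cast<double>(counter_frequency);

// The per-call conversion the original question asks about: one
// floating-point multiply to turn raw ticks into nanoseconds.
std::int64_t ticks_to_ns(std::int64_t ticks) {
    return static_cast<std::int64_t>(nanosecs_per_tick * ticks);
}
```

One full second of ticks should convert to one billion nanoseconds, give or take a rounding unit from the double arithmetic.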

On 29/11/11 04:22, Kenneth Porter wrote:
I expect most people will use boost::chrono::steady_clock to measure performance.
I guess that also.
I've been considering switching from my raw calls to the Windows performance counter APIs to steady_clock. I'm a little worried about the overhead of what should be a non-intrusive facility.
Glad to hear you are considering it. Do you have any measurements that suggest a serious overhead? How do you use the ::now() function to make your performance measurements?
In examining the source, I see that steady_clock::now() reads the performance counter and then multiplies it by the reciprocal of the counter's frequency (acquired on first use of the clock) to return the time in nanoseconds. How expensive is this 64-bit multiply on a 32-bit system?
I have recently added a performance test in https://svn.boost.org/svn/boost/branches/release/libs/chrono/perf/store_now_.... I have no access to a Windows platform right now, but could you run it on your system? For a Mac OS
I get for 100000 the following figures:

debug
    system_clock              107995476 nanoseconds   changes: 137173
    steady_clock               27631675 nanoseconds   changes: 999999
    high_resolution_clock      27683032 nanoseconds   changes: 999999
    process_real_cpu_clock    640388694 nanoseconds   changes: 560828
    process_user_cpu_clock   1090746878 nanoseconds   changes: 50
    process_system_cpu_clock 1083428100 nanoseconds   changes: 62
    process_cpu_clock        1093743446 nanoseconds   changes: 238

release
    system_clock               44870075 nanoseconds   changes: 46173
    steady_clock               23371245 nanoseconds   changes: 999999
    high_resolution_clock      23554410 nanoseconds   changes: 999999
    process_real_cpu_clock    619828025 nanoseconds   changes: 531914
    process_user_cpu_clock   1079721262 nanoseconds   changes: 45
    process_system_cpu_clock 1081643605 nanoseconds   changes: 63
    process_cpu_clock        1077525848 nanoseconds   changes: 217

That is ~24ns per call for steady_clock::now().
How hard would it be to have now() return a time_point that stores the raw tick value and only converts it to nanoseconds when a calculation in real units is required?
Ummm, we can store whatever we want in steady_clock::time_point::duration::rep, but duration::period must be known at compile time, and on Windows this is run-time information. So no, we cannot store the result of the performance counter and convert it when required without making it a special case that would break the whole chrono architecture.
(Comparison to another time_point could still be done in performance counter ticks, saving the multiply.) Is it worth the extra complexity to demand-convert ticks to nanoseconds?
I see the advantage. I would like to consider it if the performance results are really catastrophic for you.
Best, Vicente

On 29/11/11 09:39, Vicente J. Botet Escriba wrote:
On 29/11/11 04:22, Kenneth Porter wrote:
How hard would it be to have now() return a time_point that stores the raw tick value and only converts it to nanoseconds when a calculation in real units is required?
Ummm, we can store whatever we want in steady_clock::time_point::duration::rep, but duration::period must be known at compile time, and on Windows this is run-time information. So no, we cannot store the result of the performance counter and convert it when required without making it a special case that would break the whole chrono architecture.
(Comparison to another time_point could still be done in performance counter ticks, saving the multiply.) Is it worth the extra complexity to demand-convert ticks to nanoseconds?
I see the advantage. I would like to consider it if the performance results are really catastrophic for you.
Hi again,

I've been thinking about the cost induced by getting the period for steady_clock and other clocks at run time. If I'm not wrong, the fact that this period (the processor frequency) can change at run time has the consequence that the measured time might be neither monotonic, nor steady, nor exact. Maybe others with more knowledge could confirm my suspicion or, on the contrary, explain why this is not a real problem.

When using these clocks to measure performance, it is clear that what we want to measure is the number of ticks, independently of processor frequency changes. In addition, as you suggest, the conversion to a time_point or a duration could be delayed until the unit is needed. Unfortunately the duration class cannot be used for this purpose. Maybe Boost could have a run-time duration class that stores the number of ticks and that is able to make conversions to duration using a discrete sample of the period at the time of the conversion. In this way we could expect that the time spent doing performance measurements could perhaps be divided by 2.

Waiting for some comments from the Boost community,

P.S. Beman has added a new version of his Boost.Timer library; maybe it could help you.

Best, Vicente
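The run-time duration class Vicente describes could look roughly like this. This is a hypothetical sketch: `rt_duration` and `ticks_per_second()` are made-up names, not Boost.Chrono API, and the frequency source here is a fixed stand-in for whatever the platform would sample at conversion time. The key property is that arithmetic and comparison stay in raw ticks, and the multiply happens only when a caller asks for real units.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical frequency source; on Windows this would wrap
// QueryPerformanceFrequency(). A fixed 10 MHz value stands in here.
std::int64_t ticks_per_second() { return 10000000; }

// Sketch of a run-time duration: stores raw ticks, defers unit conversion.
class rt_duration {
public:
    explicit rt_duration(std::int64_t ticks) : ticks_(ticks) {}

    std::int64_t count() const { return ticks_; }

    // Arithmetic and comparison stay in ticks -- no multiply needed.
    friend rt_duration operator-(rt_duration a, rt_duration b) {
        return rt_duration(a.ticks_ - b.ticks_);
    }
    friend bool operator<(rt_duration a, rt_duration b) {
        return a.ticks_ < b.ticks_;
    }

    // The expensive conversion happens only here, using a discrete
    // sample of the period taken at conversion time.
    std::chrono::nanoseconds to_ns() const {
        const std::int64_t freq = ticks_per_second();
        // Split into whole seconds and remainder to avoid overflow.
        const std::int64_t secs = ticks_ / freq;
        const std::int64_t rem  = ticks_ % freq;
        return std::chrono::seconds(secs)
             + std::chrono::nanoseconds(rem * 1000000000 / freq);
    }

private:
    std::int64_t ticks_;
};
```

With a 10 MHz counter, a 25-tick difference converts to 2500 ns, but the subtraction that produced it never touched the frequency.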

"Vicente J. Botet Escriba" <vicente.botet@wanadoo.fr> wrote in news:4ED49A27.2030500@wanadoo.fr:
Glad to hear you are considering it. Do you have any measurements that suggest a serious overhead? How do you use the ::now() function to make your performance measurements?
Thanks for the feedback. A typical use case:

    const time_point startTime = steady_clock::now();
    // stuff to measure here
    const time_point endTime = steady_clock::now();
    const duration elapsed = endTime - startTime;

Ideally the conversion from ticks to nanoseconds would be done by the subtraction operator.

Another use case, waiting a precise amount of time:

    void wait(const duration timeToWait)
    {
        const time_point endTime = steady_clock::now() + timeToWait;
        while (steady_clock::now() < endTime)
            ;
    }

Here, the conversion from nanoseconds to ticks would be done by the addition operator, so that the time_point comparison would be in ticks. I'd use this where I need to wait a small amount of time (much less than a scheduler tick) for a hardware device to respond, before giving up my time slice.

Kenneth Porter-2 wrote
"Vicente J. Botet Escriba" <vicente.botet@> wrote in news:4ED49A27.2030500@:
Glad to hear you are considering it. Do you have any measurements that suggest a serious overhead? How do you use the ::now() function to make your performance measurements?
Thanks for the feedback. A typical use case:
    const time_point startTime = steady_clock::now();
    // stuff to measure here
    const time_point endTime = steady_clock::now();
    const duration elapsed = endTime - startTime;
Ideally the conversion from ticks to nanoseconds would be done by the subtraction operator.
We could see a little improvement in this case, as the conversion could be done only once instead of twice. Note the rt_ prefixes in the code (for run-time):

    const rt_time_point startTime = rt_steady_clock::now();
    // stuff to measure here
    const rt_time_point endTime = rt_steady_clock::now();
    const duration elapsed = duration_cast<duration>(endTime - startTime); // conversion done by the cast

The conversion from rt_duration to duration could also be implicit, in which case the last line becomes

    const duration elapsed = endTime - startTime; // done by an implicit conversion operator

Another use case, waiting a precise amount of time:

    void wait(const duration timeToWait)
    {
        const time_point endTime = steady_clock::now() + timeToWait;
        while (steady_clock::now() < endTime)
            ;
    }

Here, the conversion from nanoseconds to ticks would be done by the addition operator, so that the time_point comparison would be in ticks.
More exactly, it should be done in the conversion from timeToWait to ticks to wait:

    void wait(const rt_duration ticksToWait)
    {
        const rt_time_point endTime = rt_steady_clock::now() + ticksToWait; // ** ticks
        while (rt_steady_clock::now() < endTime)
            ;
    }

    void wait(const duration timeToWait)
    {
        wait(rt_duration_cast<rt_duration>(timeToWait)); // conversion to ticks
    }

Again, if the conversion from duration to rt_duration is implicit, the last line becomes

    wait(timeToWait);

and so the wait specialization is not really needed.
I'd use this where I need to wait a small amount of time (much less than a scheduler tick) for a hardware device to respond, before giving up my time slice.
I understand the use case. I don't see a major difference here, as what is important is the accuracy, which IMO is preserved. With the current design/implementation your processor will just be busier, which is not good, however. I will add some performance tests that measure the time spent getting the ticks and the period, to see if it is worth continuing.
Best, Vicente

"Vicente J. Botet Escriba" <vicente.botet@wanadoo.fr> wrote in news:4ED49A27.2030500@wanadoo.fr:
I have recently added a performance test in https://svn.boost.org/svn/boost/branches/release/libs/chrono/perf/store_now_in_vector.cpp. I have no access to a Windows platform now, but could you run it on your system?
Grepping for "steady" from the output files:

boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_header.test\msvc-10.0\debug\asynch-exceptions-on\threading-multi\store_now_in_vector_header.output:
    steady_clock 649713231 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_header.test\msvc-10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_header.output:
    steady_clock 697898058 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_shared.test\msvc-10.0\debug\asynch-exceptions-on\threading-multi\store_now_in_vector_shared.output:
    steady_clock 645896685 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_shared.test\msvc-10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_shared.output:
    steady_clock 659919417 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_static.test\msvc-10.0\debug\asynch-exceptions-on\threading-multi\store_now_in_vector_static.output:
    steady_clock 646621148 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_static.test\msvc-10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_static.output:
    steady_clock 636746989 nanoseconds

So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.

BTW, I'd like to suggest changing the output format in the program to put the times first, in fixed-width right-justified format, followed by the clock names, so the times will line up nicely in columns.
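The formatting tweak Kenneth suggests is straightforward with <iomanip>. This is a sketch, not the actual test code; `format_row` is a hypothetical helper name:

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Print the time first, right-justified in a fixed-width field, then the
// clock name, so successive rows line up in columns.
std::string format_row(long long nanoseconds, const std::string& clock_name) {
    std::ostringstream os;
    os << std::setw(12) << std::right << nanoseconds
       << " nanoseconds  " << clock_name;
    return os.str();
}
```

A row such as `format_row(636746989, "steady_clock")` then pads the number out to twelve characters, so times of different magnitudes align.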

Kenneth Porter <shiva.blacklist@sewingwitch.com> wrote in news:Xns9FACBC567FEF9shivawellcom@80.91.229.10:
So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.
I recoded now() to eliminate the math, with no improvement. The Windows API call totally dominates the time spent, so there's no significant saving in deferring the conversion. The lower numbers on XP may be due to it being single-core, and not needing to deal with which core is reading the counters. My Vista system is dual-core, and I understand there's logic in the API to worry about that.

On 11/29/2011 08:31 PM, Kenneth Porter wrote:
10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_static.output: steady_clock 636746989 nanoseconds
So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.
Some of that may reflect quantization error. E.g., the clock output might be truncated to microsecond precision, which introduces a 500 ns error on average, while the actual read overhead is something like 150 ns. - Marsh

Marsh Ray <marsh@extendedsubset.com> wrote in news:4ED7BF7F.8050404@extendedsubset.com:
On 11/29/2011 08:31 PM, Kenneth Porter wrote:
10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_static.output: steady_clock 636746989 nanoseconds
So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.
Some of that may reflect quantization error.
E.g., the clock output might be truncated to microsecond precision which introduces a 500 ns error on average and the actual read overhead is something like 150 ns.
The profiling program reads the clock a million times, storing the results in a pre-allocated array, and times the whole operation. (An initial run that constructs a million time_points is used to factor out the loop and array member constructor time.) How would microsecond jitter affect the overall operation to that degree?
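The methodology Kenneth describes can be sketched portably with std::chrono. Assumptions: `std::chrono::steady_clock` stands in for the Boost clock, the sample count is a parameter rather than the test's fixed million, and the real test additionally subtracts a baseline pass that only constructs the time_points:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using clk = std::chrono::steady_clock;

// Time n back-to-back now() calls into a pre-allocated vector and return
// the estimated cost per call in nanoseconds. The vector is sized up front
// so no allocation happens inside the timed region.
double cost_per_now_call_ns(std::size_t n) {
    std::vector<clk::time_point> samples(n);  // pre-allocate, as in the test
    const clk::time_point start = clk::now();
    for (std::size_t i = 0; i < n; ++i)
        samples[i] = clk::now();
    const clk::time_point end = clk::now();
    const auto total =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    return static_cast<double>(total.count()) / static_cast<double>(n);
}
```

Because the whole loop is timed as one operation, per-sample truncation inside the stored values does not perturb the total; only the two endpoint reads matter.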

On 12/01/2011 10:23 PM, Kenneth Porter wrote:
Some of that may reflect quantization error.
E.g., the clock output might be truncated to microsecond precision which introduces a 500 ns error on average and the actual read overhead is something like 150 ns.
The profiling program reads the clock a million times, storing the results in a pre-allocated array, and times the whole operation. (An initial run that constructs a million time_points is used to factor out the loop and array member constructor time.) How would microsecond jitter affect the overall operation to that degree?
Ah, well not if you do it that way. When you look at the array, does it reflect any particular quantization? It might be interesting to run the test on bare metal and again in a VMware-type VM. The high resolution counters tend to be virtualized in VMs and could produce greatly different results. - Marsh
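Marsh's question about quantization in the array can be answered by examining the deltas between consecutive samples. A sketch under the same portable assumptions as above (`std::chrono::steady_clock` standing in for the Boost clock): a heavily quantized clock yields only a few distinct nonzero deltas, all multiples of its granularity.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using clk = std::chrono::steady_clock;

// Collect the distinct deltas between consecutive now() samples. If the
// clock were truncated to, say, microseconds, every nonzero delta would be
// a multiple of 1000 ns and the set would be small.
std::set<std::int64_t> distinct_deltas_ns(std::size_t n) {
    std::vector<clk::time_point> samples(n);
    for (std::size_t i = 0; i < n; ++i)
        samples[i] = clk::now();

    std::set<std::int64_t> deltas;
    for (std::size_t i = 1; i < n; ++i)
        deltas.insert(std::chrono::duration_cast<std::chrono::nanoseconds>(
                          samples[i] - samples[i - 1]).count());
    return deltas;
}
```

On a steady (monotonic) clock every delta is non-negative; the smallest nonzero value in the set approximates the clock's effective resolution.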

On 05/12/11 04:41, Marsh Ray wrote:
On 12/01/2011 10:23 PM, Kenneth Porter wrote:
Some of that may reflect quantization error.
E.g., the clock output might be truncated to microsecond precision which introduces a 500 ns error on average and the actual read overhead is something like 150 ns.
The profiling program reads the clock a million times, storing the results in a pre-allocated array, and times the whole operation. (An initial run that constructs a million time_points is used to factor out the loop and array member constructor time.) How would microsecond jitter affect the overall operation to that degree?
Ah, well not if you do it that way.
When you look at the array, does it reflect any particular quantization?
It might be interesting to run the test on bare metal and again in a VMware-type VM. The high resolution counters tend to be virtualized in VMs and could produce greatly different results.
Hi,

The times in the array have the precision of the clock and store the actual time, not a duration.

I have no access to a VM. Could others run the performance test?

Best, Vicente.

"Vicente J. Botet Escriba" <vicente.botet@wanadoo.fr> wrote in news:4EDC6CE2.8060203@wanadoo.fr:
I have no access to a VM. Could others run the performance test?
For those interested in running the test, simply go to the root of your boost installation and run this command: bjam libs/chrono/perf

On 06/12/11 08:07, Kenneth Porter wrote:
bjam libs/chrono/perf
This is currently broken on trunk for 64-bit Ubuntu Linux/g++ 4.6.

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

On 06/12/11 11:01, Anthony Williams wrote:
On 06/12/11 08:07, Kenneth Porter wrote:
bjam libs/chrono/perf
This is currently broken on trunk for 64-bit Ubuntu Linux/g++ 4.6
You are right. I hadn't updated it since I changed to version 2.

Index: Jamfile.v2
===================================================================
--- Jamfile.v2 (revision 75799)
+++ Jamfile.v2 (working copy)
@@ -42,6 +42,7 @@
       # remark #383: value copied to temporary, reference to temporary used
       # remark #1418: external function definition with no prior declaration
       <toolset>intel:<cxxflags>-wd304,383,1418
+      <define>BOOST_CHRONO_VERSION=2
     ;

 rule chrono-run ( sources )

Committed revision 75827.

Best, Vicente
participants (5):
- Anthony Williams
- Kenneth Porter
- Marsh Ray
- Vicente Botet
- Vicente J. Botet Escriba