Re: [boost] [endian] swap_in_place use case

Please don't drop attributions. Vicente Botet wrote:
Dave Handley wrote:
Memory map a network endian file. Swap_in_place. Use.
You definitely don't want a copy in this case, since your file could easily be very big. If the file is huge (say 10GB), you really don't want to copy and swap, since that puts your memory need at 20GB instead of 10GB.
Yes, this could be a use case. I don't usually handle such big files. If I had to work with one, I would never copy the whole file. But I'm not sure that I would use a swap_in_place on the whole file either; that could take too much time. I would try to split the work on the whole file into smaller parts.
What would you do with this big file that makes swap in place the best choice?
The file could be many things. It could be a day of market data for a given exchange. It could be image data or video data that I'm going to perform image analysis on (maybe run a filter over it, or something similar). The file doesn't even have to be that big. If I were memory mapping a 10MB file and needed to swap it, I wouldn't want to use 20MB instead of 10MB.
For pretty much anything I want to do to that file that involves looking at most or all of the data, I would be much better off using swap in place than any copying swap implementation. Examples of the sorts of things you might want to do to large files include running filters or normalisers over image or video files.
I have lots of programs in which multiple threads constantly memory map files ranging in size from relatively small to hundreds of MB or low numbers of GB. Given that memory allocation is a key component of the run time of these programs, they would run significantly slower if I had to allocate double the amount of memory.
Don't forget: if you need the whole file to be swapped, then the fastest way to do it is a swap in place of the whole file.
I will reiterate something I said in an earlier post. If Boost accepts an endian library that does not provide an efficient swap in place, I will be unable to use it. The library will end up on the list of Boost libraries that are too inefficient to use in performance-sensitive production code.
Dave Handley
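To make the memory argument concrete, here is a minimal sketch of swapping a contiguous buffer of 32-bit values where it sits. The names `byteswap32` and `swap_range_in_place` are hypothetical and are not taken from any of the proposed libraries; the buffer stands in for a writable memory mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reverse the byte order of one 32-bit value (hypothetical helper).
inline std::uint32_t byteswap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Swap every element of a contiguous buffer in place.
// No second buffer is allocated, so peak memory stays at the buffer's size.
inline void swap_range_in_place(std::uint32_t* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        data[i] = byteswap32(data[i]);
    }
}
```

With a real file, `data` would point at the start of the mapping, which would have to be mapped writable. A copying version would need a destination buffer of the same size, which is the doubling described above.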

----- Original Message -----
From: "Dave Handley" <Dave.Handley@morganstanley.com>
To: <boost@lists.boost.org>
Sent: Friday, June 04, 2010 5:47 PM
Subject: Re: [boost] [endian] swap_in_place use case
Please don't drop attributions.
I did?
Vicente Botet wrote:
Dave Handley wrote:
Memory map a network endian file. Swap_in_place. Use.
You definitely don't want a copy in this case, since your file could easily be very big. If the file is huge (say 10GB), you really don't want to copy and swap, since that puts your memory need at 20GB instead of 10GB.
Yes, this could be a use case. I don't usually handle such big files. If I had to work with one, I would never copy the whole file. But I'm not sure that I would use a swap_in_place on the whole file either; that could take too much time. I would try to split the work on the whole file into smaller parts.
What would you do with this big file that makes swap in place the best choice?
The file could be many things. It could be a day of market data for a given exchange. It could be image data or video data that I'm going to perform image analysis on (maybe run a filter over it, or something similar). The file doesn't even have to be that big. If I was memory mapping a 10MB file and needed to swap it, I wouldn't want to use 20MB instead of 10MB.
I'm not proposing to make a copy of the whole file.
For pretty much anything I want to do to that file that involves looking at most or all of the data, I would be much better off using swap in place than any copying swap implementation. Examples of the sorts of things you might want to do to large files include running filters or normalisers over image or video files.
Couldn't the filter be adapted to the endianness of the file and work directly on the disk format?
I have lots of programs that have multiple threads constantly memory mapping files that range in size from relatively small to hundreds of MB or low numbers of GB. Given that memory allocation is a key component of the run time of these programs, they would run significantly slower if I had to allocate double the amount of memory.
I repeat: I'm not proposing to make a copy of the whole file. I'm just trying to see whether swap_in_place is the tool to apply in all cases, or whether it is restricted to some specific uses.
Don't forget, if you need the whole file to be swapped, then the fastest way to do it will be a swap in place of the whole file.
For example, if I have a file of records with several different fields and I want to count on one specific field, I don't need to swap the whole file. Iterating over the records and converting only that field should perform much better than doing a swap_in_place of the whole file and then iterating over the records to use that field.
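A minimal sketch of that record-by-record approach, under assumptions: the `raw_record` layout, its field names, and the helper `load_big32` are all hypothetical, chosen only to illustrate decoding a single big-endian field while leaving the rest of the file untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical on-disk record: every field is stored big-endian as raw bytes.
struct raw_record {
    unsigned char id[4];
    unsigned char price[4];
    unsigned char quantity[4];
};

// Decode a big-endian 32-bit value; correct on any host byte order.
inline std::uint32_t load_big32(const unsigned char* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

// Count records whose quantity exceeds a threshold, decoding only that field.
inline std::size_t count_large_quantities(const raw_record* recs,
                                          std::size_t n,
                                          std::uint32_t threshold) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (load_big32(recs[i].quantity) > threshold) {
            ++hits;
        }
    }
    return hits;
}
```

Only one field per record is converted, and the data can stay read-only. The trade-off is that every later pass over the same field pays for the conversion again.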
I will reiterate something I said in an earlier post. If boost accepts an endian library which does not provide an efficient swap in place, I will be unable to use it. The library will end up in the list of boost libraries which are too inefficient to use in performance sensitive production code.
I understand. And I see that you absolutely need the swap_in_place of Tom's library.
Best,
Vicente

Dave Handley wrote:
You definitely don't want a copy in this case, since your file could easily be very big. If the file is huge (say 10GB), you really don't want to copy and swap, since that puts your memory need at 20GB instead of 10GB.
If you're only going to access each element of the file, then converting endian on access should be faster than swapping in place and then processing, without needing double the memory. Could you please give a specific example (preferably with code) that I could play with?
terry

If you're only going to access each element of the file, then converting endian on access should be faster than swapping in place and then processing, without needing double the memory. Could you please give a specific example (preferably with code) that I could play with?
Terry, I was under the impression I already provided you with a use case in an earlier thread. Converting to endian on access certainly won't be faster than swap_in_place, as you cannot have a "zero cost" version. Additionally, in our scenarios the data gets accessed multiple times. But just to reiterate:
1) swap_in_place<>() can be zero cost in the case of no swapping, which cannot be said for the copying approach. I.e. it has a much better best-case behaviour and the same cost in the worst case.
2) It is one of the safer ways of handling floating point numbers.
Tom
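The "zero cost when no swapping is needed" point can be sketched as follows. This is only an illustration under assumptions: the `byte_order` enum and this two-parameter `swap_in_place` signature are hypothetical and are not the interface of Tom's library. Both byte orders are template arguments, so the comparison is a compile-time constant and an optimizing compiler can drop the whole body when they match.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

enum class byte_order { big, little };  // hypothetical tag type

// Convert `value` from byte order From to byte order To, in place.
template <byte_order From, byte_order To, typename T>
void swap_in_place(T& value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "sketch only handles trivially copyable types");
    if (From == To) {
        return;  // same order: nothing is read or written
    }
    // Different order: reverse the object's bytes without creating a second T.
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T));
    std::reverse(bytes, bytes + sizeof(T));
    std::memcpy(&value, bytes, sizeof(T));
}
```

A copying interface that returns a new value has to write its result somewhere even when the orders match, which is the best-case difference described in point 1.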

Tom wrote:
Terry wrote:
If you're only going to access each element of the file, then converting endian on access should be faster than swapping in place and then processing, without needing double the memory. Could you please give a specific example (preferably with code) that I could play with?
I was under the impression I already provided you with a use case in an earlier thread.
I'm looking for a specific application, (perhaps a specific video compression algorithm), that I can try. My previous test program read a large disk file, as you proposed. Now, I need something
Converting to endian on access certainly won't be faster than swap_in_place, as you cannot have a "zero cost" version. Additionally, in our scenarios the data gets accessed multiple times.
What is an example of one of these scenarios?
But just to reiterate:
1) swap_in_place<>() can be zero cost in the case of no swapping, which cannot be said for the copying approach. I.e. it has a much better best-case behaviour and the same cost in the worst case.
2) It is one of the safer ways of handling floating point numbers.
I disagree. I have demonstrated that...
1) endian-on-access can be zero cost in the native case.
2) swapping in place is not any safer, because the C++ type system cannot help to determine which portions of an object have been swapped, nor document the endian properties of a data structure.
endian<big, double> can, and should, be defined to provide portable floating-point transfer and persistent storage.
terry
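For readers unfamiliar with the endian-on-access idea, here is a minimal sketch of a type that records its byte order in the type system and converts on every read and write. The class name `big_endian` and its members are hypothetical and deliberately simpler than the `endian<big, T>` template under discussion; it handles unsigned integers only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical wrapper: stores T as big-endian bytes, converts on access.
template <typename T>
class big_endian {
    static_assert(std::is_unsigned<T>::value,
                  "sketch handles unsigned integers only");
public:
    big_endian() : bytes_{} {}
    big_endian(T v) { store(v); }
    big_endian& operator=(T v) { store(v); return *this; }

    // Reading decodes the stored bytes into a native value.
    operator T() const {
        T v = 0;
        for (std::size_t i = 0; i < sizeof(T); ++i) {
            v = static_cast<T>((v << 8) | bytes_[i]);
        }
        return v;
    }

    const unsigned char* data() const { return bytes_; }

private:
    // Writing encodes the native value, most significant byte first.
    void store(T v) {
        for (std::size_t i = 0; i < sizeof(T); ++i) {
            bytes_[sizeof(T) - 1 - i] = static_cast<unsigned char>(v & 0xFF);
            v = static_cast<T>(v >> 8);
        }
    }
    unsigned char bytes_[sizeof(T)];
};
```

A structure built from such members documents its own wire format, and code that reads a field cannot forget the conversion. Whether the decode compiles down to a plain load on a big-endian host depends on the compiler, so the zero-cost claim is something to measure rather than assume from this sketch.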

I'm looking for a specific application, (perhaps a specific video compression algorithm), that I can try. My previous test program read a large disk file, as you proposed. Now, I need something
The FFT and PCM encoding are both good sample cases. Depending on the bandwidth, the number of samples can be very large, i.e. each 1 MHz of bandwidth requires 2 megasamples per second with up to 16 bits per sample, and it has to be done fast. These algorithms typically require multiple passes through homogeneous input data and are typically done in place. The FFT is O(N log N). It's clear to me now that some form of in-place endian conversion should be part of an endian library. Furthermore, that conversion should be specializable for data types where the target hardware supports endian conversion.
terry

Terry Golubiewski wrote:
Tom wrote:
Terry wrote:
If you're only going to access each element of the file, then converting endian on access should be faster than swapping in place and then processing, without needing double the memory.
Let's be clear about a few things to avoid talking past one another:
- No one is suggesting swapping an entire large file in place just to access a few bytes.
- Accessing each swapped byte only once reduces the possible difference in the various approaches to one copy.
- Accessing values multiple times, when swapping is actually required, means that swapping up front and then accessing the data will be faster than swapping on access.
Because of those differences, swap-in-place is valuable. Whether one chooses to use it in any given use case or context is a separate matter. It is also clear that it is easier to make mistakes using the function-based approach, so the object-based approach is safer, if less efficient in specific use cases.
But just to reiterate:
1) swap_in_place<>() can be zero cost in the case of no swapping, which cannot be said for the copying approach. I.e. it has a much better best-case behaviour and the same cost in the worst case.
2) It is one of the safer ways of handling floating point numbers.
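The multiple-access point can be illustrated by counting conversions. This is a sketch under assumptions: `counting_swapper` and both summing functions are hypothetical, and the workload (summing 16-bit samples over several passes) merely stands in for a real multi-pass algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte-swaps a 16-bit value and counts how often it was asked to.
struct counting_swapper {
    std::size_t swaps = 0;
    std::uint16_t operator()(std::uint16_t v) {
        ++swaps;
        return static_cast<std::uint16_t>((v << 8) | (v >> 8));
    }
};

// Convert on every access: each pass pays for the swap again.
inline std::uint64_t sum_swapping_on_access(const std::vector<std::uint16_t>& data,
                                            int passes, counting_swapper& sw) {
    std::uint64_t total = 0;
    for (int p = 0; p < passes; ++p) {
        for (std::uint16_t v : data) {
            total += sw(v);
        }
    }
    return total;
}

// Convert once up front, then every pass reads native values.
inline std::uint64_t sum_after_swap_in_place(std::vector<std::uint16_t>& data,
                                             int passes, counting_swapper& sw) {
    for (std::uint16_t& v : data) {
        v = sw(v);
    }
    std::uint64_t total = 0;
    for (int p = 0; p < passes; ++p) {
        for (std::uint16_t v : data) {
            total += v;
        }
    }
    return total;
}
```

For N elements and P passes, the first function performs N*P swaps and the second performs N. With a single pass the counts are equal, which matches the second bullet above.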
I disagree. I have demonstrated that...
1) endian-on-access can be zero cost in the native-case.
I think both can have zero cost if the interface is designed to handle that correctly, but then one must also consider the worst case costs.
2) swapping in place is not any safer because the C++ type system cannot help to determine which portions of an object have been swapped, nor document the endian properties of a data structure.
That presumes that the entire object hasn't been swapped, of course. In use cases in which only selected data are accessed, there can be an advantage to the endian type, but it comes at the cost of having to declare a parallel structure definition in the typical case (an OS/RTL structure must be redefined using the endian types). That is particularly onerous when only selected fields are accessed, but it does mean that any alteration to the code reading the fields will automatically use the correct values should yet another field be read. In the non-object-based approach, changing the algorithm to read a new field means the maintainer must remember to swap the new field before using it.
These differences should be discussed in the documentation in order to help the library user understand whether to use the function-based approach or the object-based approach. Because there are clear desires to use each, the documentation should not be biased between them but should clearly document their strengths and weaknesses.
endian<big, double> can, and should, be defined to provide portable floating-point transfer and persistent storage.
Since endian is designed to return T by value, it suffers from the normalization problem described elsewhere in these discussions. That means the floating point specializations would, of necessity, require a different interface than those for integral types. That will be discomfiting.
_____
Rob Stewart robert.stewart@sig.com
Software Engineer, Core Software using std::disclaimer;
Susquehanna International Group, LLP http://www.sig.com
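As a rough illustration of why in-place handling sidesteps that concern: a byte-swapped double is generally not a meaningful floating-point value, so it is safest never to treat it as one. The sketch below reverses the bytes where they sit and never returns the swapped pattern as a `double` by value. The function names are hypothetical, and the sketch assumes an 8-byte `double`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == 8, "sketch assumes an 8-byte double");

// Reverse the bytes of a double where it sits, treating it purely as storage.
inline void swap_double_in_place(double& d) {
    unsigned char* p = reinterpret_cast<unsigned char*>(&d);
    std::reverse(p, p + sizeof(double));
}

// Inspect the bit pattern without performing any floating-point operation.
inline std::uint64_t bits_of(const double& d) {
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}
```

Swapping twice restores the original bit pattern exactly. An interface that returns the swapped value as a `double` cannot promise the same on every platform, which is presumably why a different interface would be needed for floating point types.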
participants (5): Dave Handley, Stewart, Robert, Terry Golubiewski, Tomas Puverle, vicente.botet