[serialize] xml archives and base 64 encoded binary data


Russell Hind

20 Oct 2004

1:46 p.m.

Hi,

I'm writing some binary data to an xml_oarchive (a vector of floats).

I'm then giving the XML files to some Python users we have here so they can read them in. They had trouble with the base64 encoding of the binary data.

One has mentioned that the base64 spec says an '=' must be appended to the data to indicate the padding up to the next 4-byte multiple.

When they added an '=' to the data between the tags in the resultant XML file, they could decode it successfully.

If it is in the spec (I am trying to find a link to check this), can the base64 encoding in serialize be modified to do this before the 1.32.0 release, so it conforms to the standard?

Thanks

Russell


Russell Hind

20 Oct

2:34 p.m.

OK, looking at the spec from http://www.faqs.org/rfcs/rfc1521.html section 5.2, '=' is only needed when the data doesn't end on a four-character boundary, and is used to pad it to a four-character boundary.

Does the encoding in serialize do this? (The data stream we had was long, so I'm not sure exactly how many characters were in it; it is hard to calculate from the XML file because of line breaks etc.)

When a single '=' was added to the data we sent, the Python decoding routine decoded it correctly, which implies that one character of padding was necessary for our data and that serialize doesn't appear to have added it.

Thanks

Russell

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
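
As an illustration of the padding rule discussed above (this is not the serialization library's code), the following Python sketch computes how many '=' characters standard base64 output should end with for a given input length, and prints it next to what Python's standard `base64` module produces. The helper names `padding_chars` and `encoded_length` are made up for this example.

```python
import base64


def padding_chars(num_bytes: int) -> int:
    """Number of '=' characters standard base64 appends for num_bytes of input."""
    # Every 3 input bytes become 4 output characters. A final group of
    # 1 byte needs two '=' and a final group of 2 bytes needs one '='.
    return (3 - num_bytes % 3) % 3


def encoded_length(num_bytes: int) -> int:
    """Length of the padded base64 text, ignoring any line breaks."""
    return 4 * ((num_bytes + 2) // 3)


for n in range(1, 7):
    text = base64.b64encode(bytes(n)).decode("ascii")
    print(n, text, padding_chars(n), encoded_length(n))
```

For instance, five 4-byte floats occupy 20 bytes, which leaves a remainder of 2 when divided by 3, so the padded text would end in a single '='.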

Robert Ramey

4:17 p.m.

When I wrote the code for encoding/decoding base64, I didn't include the trailing "=" required by the standard definition of base64. This was for a couple of reasons:

a) My de-serialization code didn't require the existence of "=" padding characters.
b) I didn't anticipate that the base64 text would be used by applications other than serialization.
c) I was short of time and had lots of other things to do.

We'll take a look at this for 1.33.

Robert Ramey

Russell Hind

5:17 p.m.

Robert Ramey wrote:

> We'll take a look at this for 1.33.

In that case, is there a way I can get a 'string' of base64-encoded data from the archive and write it as a string so I can handle this myself? I'm trying to generate data that can be parsed by other systems, but if it doesn't add this stuff in, then that's harder to do.

I'm happy to call the encoding routine myself and append the data if you can give some pointers as to how I do this.

Thanks

Russell

Robert Ramey

6:04 p.m.

"Russell Hind" <rh_gmane@mac.com> wrote in message news:cl66ju$ftg$2@sea.gmane.org...

> Robert Ramey wrote:
>> a) My de-serialization code didn't require the existence of "=" padding characters
>> b) I didn't anticipate that the base64 text would be used by applications other than serialization

It would seem that you're using the xml archive for purposes other than serialization. Of course I don't see any problem with this (until one decides to edit it and change its schema). But I am curious what use you've found for it. I originally did it only to satisfy boost nit-pickers, as I felt it was an inefficient way to implement serialization. I've since found it useful for debugging archives. It seems to be compatible with xml viewers, so it's useful for rendering archives in a visible way. So, after all, I have to concede that the nit-pickers do have a point. I have a sneaking suspicion that it will turn up in all kinds of unexpected places, and I wonder what those might be.

> In that case, is there a way I can get a 'string' of base64 encoded data from the archive and write it as a string so I can handle this myself?

You have a couple of options:

a) Make your own derivation of xml_(i/o)archive which uses your own version of write/read_binary. Advantage: it wouldn't touch the current archive classes. The manual describes how to do this.

b) Just fix the current code that does the read/write_binary text data. You could roll this into your own version of 1.32 and be on your way. This is implemented as part of the dataflow iterators, and I don't think it is very difficult, except that understanding my dataflow iterator idea would take some investment of effort that might not be worthwhile. There is already a test for serialization of binary data, so even that is done.

The reason I don't do it now is that it starts a whole chain reaction regarding testing on all the platforms that boost supports, and it is a very inconvenient time to do this. Also, no one raised the issue until now.

Russell Hind

21 Oct

8:02 a.m.

Robert Ramey wrote:

> It would seem that you're using the xml archive for purposes other than serialization. Of course I don't see any problem with this (until one decides to edit it and change its schema). But I am curious what use you've found for it.

I've been using our in-house serialization code for a few years, which offers similar functionality to yours. Unfortunately ours was very much geared towards quickly dealing with large (>1Gb) files that have hundreds of thousands of pointer-based objects stored in them, so it was tied to a specific app.

The systems we are dealing with at the moment only generate smaller files (20Mb or so), so boost::serialization will hopefully support them nicely. It also gives the advantage of XML/text archives as well as binary.

We have an R&D group who only use Python for testing purposes and want to read in our data files for extra processing and trying out new ideas. Binary files are by far the most efficient, but describing the structure of a binary archive to someone who only uses Python isn't easy at all. So XML seems like the way to go, as they can visually look at it and see the information they want to pick out easily.

Our data consists of many settings, 3D model information, user comments etc., all of which are textual, so XML/text supports them well, but three quarters of the data is vectors of floating-point scan data. Writing these textually would lead to an over-the-top archive. Complete binary would mean passing it to Python users would be a pain, so XML with encoding seems like a good solution.

When the files get bigger, we can put them through a zip, because the Python lot could still handle unzipping and then reading XML, so that isn't an issue.

If it wasn't for the need to let our R&D group have access to data in this way, I would go for a binary format, but I'm hoping that ultimately zipped XML won't be a lot larger for our files (hoping to test in the next few days).

The urgency of getting serialization up and running is that I've shied away from introducing our serialization code into the project and generating files in its format, because I was hoping that boost serialization would be out in time (we ship in December) and we could move to that, as it is a much more flexible system than our in-house one.
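
To make the "XML with base64-encoded floats, then zipped" idea above concrete, here is a minimal Python sketch of a round trip. It is only an illustration under stated assumptions: the element names `scan` and `samples`, the `count` attribute, and the little-endian 32-bit float layout are all invented for this example and are not the layout that boost serialization's XML archives actually use.

```python
import base64
import gzip
import struct
import xml.etree.ElementTree as ET

samples = [0.5, 1.25, -3.0, 42.0, 1e-3]

# Writer side: pack as little-endian 32-bit floats, then base64-encode.
raw = struct.pack(f"<{len(samples)}f", *samples)
root = ET.Element("scan")  # hypothetical element names, not Boost's layout
ET.SubElement(root, "samples", count=str(len(samples))).text = (
    base64.b64encode(raw).decode("ascii")
)
blob = gzip.compress(ET.tostring(root))  # the "zipped XML" file contents

# Reader side: unzip, parse, decode the base64 text, unpack the floats.
node = ET.fromstring(gzip.decompress(blob)).find("samples")
count = int(node.get("count"))
decoded = struct.unpack(f"<{count}f", base64.b64decode(node.text))
print(decoded)
```

Here the encoder is Python's own, so the text is correctly padded; a reader consuming unpadded text would have to restore the '=' characters before calling `base64.b64decode`.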

> you have a couple of options:
>
> a) Make your own derivation of xml_(i/o)archive which uses your own version of write/read_binary.
> b) Just fix the current code that does the read/write_binary text data.

Fixing the current code would be my ideal solution; I'll just have to see how much time I get to look into this. If not, for now, I'm sure the Python lot can handle adding the necessary padding characters.

I take it the archive version will be increased for the next release if something like this changes, so current files will be compatible?

Thanks

Russell

Robert Ramey

3:31 p.m.

Russell Hind wrote:

> I take it the archive version will be increased for the next release if something like this changes so current files will be compatible?

Yes. The idea is that the archive version will be incremented every time a change occurs that will require different code to de-serialize. I hope that I can make the change in base64 text i/o so that the padding is skipped without having to increment the archive version number.

Robert Ramey

Russell Hind

3:51 p.m.

Robert Ramey wrote:

> Yes. The idea is that the archive version will be incremented every time a change occurs that will require different code to de-serialize. I hope that I can make the change in base64 text i/o so that the padding is skipped without having to increment the archive version number.

OK, thanks. The version increment wouldn't matter, as it would allow us to know whether the encoding had padding or not. Without incrementing it, we would have to check the length before adding the padding ourselves. I suppose it doesn't really matter either way.

Cheers

Russell
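
The "check the length before adding it" approach could look something like the Python sketch below, which accepts base64 text whether or not the writer emitted the trailing '=' characters. The function name `decode_lenient` is hypothetical, and the whitespace handling assumes the text was lifted out of an XML file that may contain line breaks.

```python
import base64


def decode_lenient(text: str) -> bytes:
    """Decode base64 text whether or not trailing '=' padding is present."""
    # Drop line breaks and other whitespace that the XML file may contain.
    compact = "".join(text.split())
    # Strip any existing padding, then pad back up to a multiple of 4.
    compact = compact.rstrip("=")
    compact += "=" * (-len(compact) % 4)
    return base64.b64decode(compact)


payload = bytes(range(20))
padded = base64.b64encode(payload).decode("ascii")
unpadded = padded.rstrip("=")
print(decode_lenient(padded) == decode_lenient(unpadded) == payload)
```

Because the padding is normalized rather than blindly appended, the same reader works on files written before and after any change to the encoder, independent of the archive version number.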

Robert Ramey

25 Oct

1:23 a.m.

"Russell Hind" <rh_gmane@mac.com> wrote in message news:cl7qf0$b1a$1@sea.gmane.org...

> We have an R&D group who only use Python for testing purposes and want to read in our data files for extra processing and trying out new ideas. Binary files are by far the most efficient, but describing the structure of a binary archive to someone who only uses Python isn't easy at all. So XML seems like the way to go, as they can visually look at it and see the information they want to pick out easily.

Can't you use Boost.Python to call boost serialization from within Python via some wrapper function? Wouldn't this make the whole process totally painless? I believe someone else (I forget whom) was doing this with good success.

Russell Hind

8:43 a.m.

Robert Ramey wrote:

> Can't you use boost python to call boost serialization from within python via some wrapper function? Wouldn't this make the whole process totally painless? I believe someone else (I forget whom) was doing this with good success.

Possibly, but I don't know Python at all, and the people here who do use it don't know it brilliantly. The idea of using XML is that it's basically text-readable, so they can understand the file format easily. But we do require some binary data, otherwise the files would get too big, so zipped XML seemed like a good option, which is why we decided serialization would help when it was announced.

Personally, I'd stick to a pure binary format, but describing a file format to them which has things like reference counting information and type information in it is no easy task.

One day, we plan to provide DLLs to access our files, which would solve all problems, but we have other things as higher priority for now.

Thanks

Russell

Robert Ramey

4:40 p.m.

Well, I don't know Python, and have never worked with Boost.Python. However, in response to your question I looked at the documentation for Boost.Python. It would seem to me that making a small function to wrap serialization so it could be called from Boost.Python should be investigated. It looks to me like a couple of hours' work could save you days or weeks of work in parsing the XML archives. I can't believe it's not worth the time to at least investigate.

"Russell Hind" <rh_gmane@mac.com> wrote in message news:clieb6$u87$1@sea.gmane.org...

> Robert Ramey wrote:
>> Can't you use boost python to call boost serialization from within python via some wrapper function? Wouldn't this make the whole process totally painless? I believe someone else (I forget whom) was doing this with good success.
>
> Possibly, but I don't know python at all, and the people here who do use it don't know it brilliantly. The idea of using XML is so it's basically text readable so they can understand the file format easily, but we do require some binary data otherwise the files would get too big, so zipped XML seemed like a good option, which is why when serialization was announced we decided that would help.
>
> Personally, I'd stick to a pure binary format but describing a file format to them which has things like reference counting information and type information in it is no easy task.

It would be if the above could be made to work.

> One day, we plan to provide dlls to access our files which would solve all problems,

Well, that's an optimistic assessment!

> but we have other things as higher priority for now.

Russell Hind

5:11 p.m.

Robert Ramey wrote:

> It would seem to me that making a small function to wrap serialization so it could be called from boost python should be investigated. It looks to me like a couple of hours' work could save you days or weeks of work in parsing the XML archives.

This would then entail the Python code having duplicates of all our data structures, wouldn't it? A function to call serialization wouldn't be too hard, but it would read the data into C++ data structures, wouldn't it? (We have lists of shared_ptrs of objects etc., and I've no idea how Python could deal with these.)

They only pick out the information they require, not the entire file, so pointing them at the relevant bits isn't too hard.

Also, you haven't met our physicists who use the Python code :) They tend to get fairly annoyed as soon as we mention something they can't understand, but they like the letters 'XML', so that tends to get them off our backs a bit, as they think they might be able to handle the files :) I don't think they've really thought it through that they'd still need to skip over loads of bits and learn what tags are what.

Cheers

Russell

Daryle Walker

21 Oct

6:16 a.m.

On 10/20/04 12:17 PM, "Robert Ramey" <ramey@rrsd.com> wrote:

> When I wrote the code for encoding/decoding base64 I didn't include the trailing "=" required by the standard definition of base64.
>
> We'll take a look at this for 1.33.

This looks like a potential bug in the making, so I don't think we should just punt on the issue until next time (especially if the time between releases continues to lengthen). This could lead us, and all other base-64 decoders written after 1.32 is released, to include code that forgives the lack of "=" symbols for "backwards compatibility with the 'broken from a standards perspective' base-64 encoder given in Boost 1.32". (In other words, we'll be _creating_ a situation like coding workarounds for broken compilers or coding HTML/CSS workarounds for broken browsers.)

What we could do:

1. Put a note in the serialization docs saying that its base-64 technique is private to itself and should _not_ be used in conjunction with outside code expecting to encode or decode standard base-64 data.
2. Fix the current base-64 code to match the outside standard.

[1] is quick-and-dirty and barely better than doing nothing. [2] would be the better overall fix. How long would it take to add "=" processing? If it's short enough, maybe the release could wait for it. (We've already used a lot of time between the 1.32 preparation announcement and the actual branch.)

--
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com

Bronek Kozicki

23 Oct

8:46 a.m.

Daryle Walker wrote:

> 2. Fix the current base-64 code to match the outside standard

FWIW I agree with you, adding "... before 1.32 release". I hope it's possible to be fixed quickly.

B.

Robert Ramey

25 Oct

1:30 a.m.

I did it.

Robert Ramey

"Bronek Kozicki" <brok@rubikon.pl> wrote in message news:417A1A77.5080508@rubikon.pl...

> FWIW I agree with you, adding "... before 1.32 release". I hope it's possible to be fixed quickly.
