[boost] [serialization] Binary archive STL container template specialization?

20 May 2008

Greetings everyone,

I need to stream data over a very low-bandwidth connection (think
cell-phones and GPRS).  The type of data will vary, but will typically  
contain strings, numbers, and short groups of these basic types.  The  
application on both ends of the pipe will be written in C++, so I've  
loosely decided on the boost::serialization library as it virtually  
eliminates all of the code I would have needed to manually write.   
(Awesome!)

I've made up a bunch of test archives, and I'd like to get some feedback  
on a possible optimization, or at least specialization, of the code that  
streams out STL collections.

The current serialization methodology for STL containers saves the size of  
the container followed by each item inside the container.  This code is  
also used for std::string as it behaves like an STL container of  
characters.  The downside is that each string uses a minimum of 8
bytes (a 64-bit size field) plus the string payload.

Proposal: Write out a single byte that indicates the number of elements to
follow.  If the number of elements is 255 or more, write out a single byte
0xFF, followed by the size_type holding the actual count.  Reading
follows the same pattern in reverse: read a single byte; if the byte is
0xFF, read a size_type to get the count, otherwise the byte is the count.
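
A minimal sketch of this encoding, working on a plain byte buffer rather
than a real archive.  The names write_count, read_count and kEscape are
hypothetical and not part of Boost.Serialization, and the sketch uses 0xFF
as the escape byte since that is the value that matches the 255 threshold.
The wide count is copied in native byte order, so both ends are assumed to
share the same size_t width and endianness.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical helpers for illustration; not Boost.Serialization API.
const std::uint8_t kEscape = 0xFF;  // marks "full-width count follows"

// Append the element count to `out` using the compact encoding.
void write_count(std::vector<std::uint8_t>& out, std::size_t count) {
    if (count < kEscape) {
        // Counts 0..254 fit in the single leading byte.
        out.push_back(static_cast<std::uint8_t>(count));
        return;
    }
    out.push_back(kEscape);
    // Raw native-endian copy of size_t (assumption: same layout on both ends).
    const std::uint8_t* p = reinterpret_cast<const std::uint8_t*>(&count);
    out.insert(out.end(), p, p + sizeof(count));
}

// Read a count starting at `pos`, advancing `pos` past the bytes consumed.
std::size_t read_count(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    if (pos >= in.size()) throw std::runtime_error("truncated count");
    const std::uint8_t first = in[pos++];
    if (first != kEscape) return first;  // short form: the byte is the count
    if (in.size() - pos < sizeof(std::size_t))
        throw std::runtime_error("truncated count");
    std::size_t count = 0;
    std::memcpy(&count, &in[pos], sizeof(count));
    pos += sizeof(count);
    return count;
}
```

With this scheme a count of 254 or less costs one byte, and anything larger
costs one byte more than the current format.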

Simple example from my problem domain: If I have a list of three-letter  
bin locations in a warehouse and each bin contains a quantity of a  
specific item, I will have the following data to send:

DER: 427
ALU: 582
COM: 821
TER: 991
FLO: 0
TER: 298
ALP: 332
PED: 773

Using the boost serialization framework, that data becomes 160 bytes (8
for the size, 8+length for each string, and 8 for each integer).  Using my
proposal, the data size drops to 97 bytes (1 for the size, 1+length for
each string, and 8 for each integer), a reduction of nearly 40%.
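
The byte counts above can be reproduced with a small stand-alone
calculation.  This is only a sketch of the arithmetic: encoded_size and
example_bins are hypothetical names, and it assumes 8-byte integers and no
archive header or per-object overhead, as in the figures quoted here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte count for a list of (string, integer) records, given the width of
// each count/length prefix. Assumes 8-byte integers and ignores any archive
// header or tracking overhead.
std::size_t encoded_size(const std::vector<std::string>& bins,
                         std::size_t prefix_bytes) {
    const std::size_t integer_bytes = 8;
    std::size_t total = prefix_bytes;  // element count of the outer list
    for (const std::string& name : bins) {
        total += prefix_bytes + name.size();  // string length prefix + payload
        total += integer_bytes;               // quantity for this bin
    }
    return total;
}

// The eight bin names from the example data.
std::vector<std::string> example_bins() {
    return {"DER", "ALU", "COM", "TER", "FLO", "TER", "ALP", "PED"};
}
```

encoded_size(example_bins(), 8) gives 160 and encoded_size(example_bins(), 1)
gives 97.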

I theorize that many serialized strings and collections hold fewer than 255
items or characters (especially in my problem domain) and that this
technique will save many on-the-wire bytes over time.

A) Do you think this is a reasonable addition/modification to the  
serialization library?

B) Is there any way to add this functionality to the serialization library  
without breaking existing archives?  I see a call to get_library_version
in the code, but I'm not sure what its purpose is.
Anyone?

Thanks,
Eric

Eric Hill