Re: [Boost-users] serialization & several text archive within the same files...

24 Mar 2008

Hi Robert and all,
...
...
...
On Sun, March 23, 2008 10:11 pm, Robert Ramey wrote:
if you don't like tracking - you can turn it off for the types you want.
see below.
...
to add your own characters between archives try something like this:
This is what I have done, of course, to make it run.
But I consider such behavior to be the responsibility
of the core lib, not of the user (me); that's my point.
I use a 'serialization manager' class that picks a
txt/xml/bin archive depending on the file extension
provided by the users of my lib, so I would find it more, say, 'elegant'
not to treat text archives differently from xml/bin archives.
But maybe it is a purist issue, and maybe not really relevant!
Of course, this is not a problem if I add this 'blank' char by hand.
So it is fine for me.
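
To illustrate why a trailing blank matters, here is a minimal sketch using only standard C++ streams. It is a toy model, not Boost code: each "archive" is reduced to a single integer token, and the names write_records and read_records are hypothetical. It only shows that two whitespace-delimited text outputs written back to back fuse into one token unless a separator is added by hand.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for "one text archive per record": every record is written
// as a single integer token. This is NOT Boost.Serialization code.
std::string write_records(const std::vector<int>& records, bool add_separator) {
    std::ostringstream out;
    for (int value : records) {
        assert(value >= 0);              // keep the toy tokens purely numeric
        out << value;                    // last token, no trailing blank
        if (add_separator) out << '\n';  // the hand-added 'blank' character
    }
    return out.str();
}

// Reads back whitespace-delimited integer tokens until the stream fails.
std::vector<int> read_records(const std::string& text) {
    std::istringstream in(text);
    std::vector<int> records;
    int value = 0;
    while (in >> value) records.push_back(value);
    return records;
}
```

With add_separator set to false, the records 12 and 34 are stored as the text "1234" and read back as one value; with the separator, both values are recovered.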

-- next point:

About memory tracking: after a few investigations
with my needs in mind, I find it very powerful.
It only requires some care when using it.
As I use serialization by pointers on nested objects
in my lib, tracking is very useful because it handles the links
properly.

The only problem I met is the following:
- I have, say, 1000000 records (some instances of a class) to write
in the archive.
- Each record makes heavy use of std::vector or std::list as internal members
  (subrecords with dynamically allocated memory)
- Each record (with its internal subrecords)
  is more than 1000 bytes long --> so total storage for my data set
  is about 1 GB in a single output file!
- Moreover, each record has pointer members in it, so tracking is activated
  and I NEED it.

Now, when I write all the records using a kind of loop on my archive,
I must not reuse the same memory addresses (as explained in the sample
programs and docs in boost::serialization, I cannot use a temporary
record instance in the loop).
So I have to keep in the RAM of the machine
the whole set of records (and dynamically allocated subrecords)
at the same time before storing them in the archive (1 GB!).
Typically I use a std::list for this.
This requires a machine with 1 GB of available RAM,
which cannot be guaranteed for all my users,
even at our computing center...

On the other hand, during the loop,
there is no safe way to erase (or reduce the size of)
previously written records while saving the current one
in the archive. Of course this would
reduce the memory in use, but the freed memory
could be reused at some point for a later record
(this is rather a random process that depends on the
system),
and then the tracking mechanism would misinterpret the new data
as already serialized stuff reached through my pointers. Then
the output would be corrupted.
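
The hazard can be sketched with a toy tracker written in standard C++. This is an assumption-laden model, not Boost's implementation: ToyArchive, Record and save_with_reused_temporary are hypothetical names, and the "archive" is just a log of strings. It only mimics the idea of tracking objects by address: the first save through a given address writes the data, later saves through the same address write a back-reference.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Record { int payload; };  // hypothetical record type

// Toy address-keyed tracker; NOT Boost's implementation.
class ToyArchive {
public:
    // First time an address is seen: log the full data.
    // Afterwards: log only a back-reference to the earlier object id.
    void save(const Record* r) {
        assert(r != nullptr);
        auto it = seen_.find(r);
        if (it == seen_.end()) {
            const std::size_t id = seen_.size();
            seen_[r] = id;
            log_.push_back("obj" + std::to_string(id) + "=" +
                           std::to_string(r->payload));
        } else {
            log_.push_back("ref" + std::to_string(it->second));
        }
    }
    const std::vector<std::string>& log() const { return log_; }

private:
    std::map<const void*, std::size_t> seen_;
    std::vector<std::string> log_;
};

// Reuses one temporary record for every iteration of the loop.
std::vector<std::string> save_with_reused_temporary(int count) {
    ToyArchive ar;
    Record tmp{0};
    for (int i = 0; i < count; ++i) {
        tmp.payload = i;  // new content, same address
        ar.save(&tmp);
    }
    return ar.log();
}
```

In this model only the first record's payload reaches the log; every later iteration is recorded as a reference to it, which is the kind of silent data loss described above.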

This is what I call a "long-range memory tracking effect":

- within a single record (short range),
memory tracking is fine because it makes it possible to
maintain some connections between 'subrecords' through pointers
in a very nice and 'storage saving' way.
Moreover, de-serialization works perfectly, without duplicate
records/subrecords or memory leak issues.

- for successive records (long range),
I have no need of pointers to make links
between objects (records and subrecords in them),
so tracking is useless but it is still activated
in the same program!
It then leads to mix-ups between memory addresses among
different records. This is a corruption case, unless, as I explained
above, all records (and internal subrecords) are kept in memory
till the end of the serialization process... then I need a machine
with 1 GB of RAM!

The only way I have found in my programs to break this long-range effect
is to use one archive per record. It then seems that
memory tracking is confined within the limits of the current archive,
which is exactly what I want.

Finally, what would be useful to enable my "per-record" serialization
approach for a large data set is a kind of "memory tracking reset"
function that could be invoked on the fly while looping on the archive.
I have no idea whether it is possible to implement this, or whether other
people would find it useful, Robert first!
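
As a rough sketch of what such a reset would mean, here is a toy address table in standard C++. Nothing here is an existing Boost function: ToyTracker, reset and full_writes are hypothetical names used only to show the intended effect, namely that clearing the table between records (or, equivalently, starting a fresh table per record, as with one archive per record) lets a reused address be written in full again.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Toy address table standing in for an archive's tracking state.
// NOT Boost code; reset() is the hypothetical "memory tracking reset".
class ToyTracker {
public:
    // True if the address is new (full data would be written),
    // false if it would be written as a back-reference.
    bool is_new(const void* address) {
        assert(address != nullptr);
        return seen_.insert(address).second;
    }
    void reset() { seen_.clear(); }

private:
    std::set<const void*> seen_;
};

// Counts how many of `count` saves of one reused buffer are written in full.
std::size_t full_writes(int count, bool reset_between_records) {
    ToyTracker tracker;
    int reused_buffer = 0;  // same address for every record
    std::size_t written = 0;
    for (int i = 0; i < count; ++i) {
        reused_buffer = i;
        if (tracker.is_new(&reused_buffer)) ++written;
        if (reset_between_records) tracker.reset();
    }
    return written;
}
```

Without the reset, only the first of the records is written in full; with a reset after each record, all of them are, while tracking still works inside each record.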

I hope my point has been understood.
At least, if you can confirm that the tracking mechanism is confined
within a single archive, I can use this strategy of
multiple archives per file.

Thanks a lot for your attention, advice and constructive criticism.

And many many thanks to Robert
for this very nice and elegant serialization
library. This is really a great and useful work!

frc
--
...
François Mauger wrote:
...
Hi
I have a large amount of data to store/load from files.
I use the boost::serialization library to do it.
It has very nice features.
I use version 1.33:
...
...
...
libboost-serialization-dev                 1.33.1-9ubuntu3.1
Checking different strategies to play with my data and i/o archives,
I met the following problem:
if I save several text_oarchives within the same output file (a trick to
break side-effects of memory tracking), then the deserialization fails
because there is no separator between successive text archives.
I have to explicitly add a 'std::endl' to the output stream
to make it run.
This problem does not appear
with xml archives, because the </tag> at the end makes the
end of each archive unambiguous to parse. I did not check for binary archives, but I guess
there will be no problem.
In my opinion a mandatory 'white' character
should be added as the last byte of a text output archive (when the
destructor is invoked?). This would make text archives more coherent (symmetric!) in
comparison with xml/binary archives. This is only a suggestion: I cannot
imagine all the side-effects such a strategy could imply.
A sample demo file is attached.
Thanks for your attention.
frc
...
_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users
-- 
Francois Mauger
Laboratoire de Physique Corpusculaire de Caen et Universite de Caen
ENSICAEN - 6, Boulevard du Marechal Juin, 14050 CAEN Cedex, FRANCE
e-mail: mauger@lpccaen.in2p3.fr
tel.: (0/+33) 2 31 45 25 12
fax: (0/+33) 2 31 45 25 49