[serialization] feedback on a possible port of the XML archive grammar to Qi

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I've recently started working on porting Boost.Serialization's XML parser from Spirit Classic to Spirit Qi (I've got some free time on my hands, and I heard that there was some interest in this task). Before I start putting more time into this, I'd love to get some feedback/suggestions and make sure someone else isn't already working on this. My understanding is that porting the parser from Spirit Classic to Spirit Qi will not only speed up XML serialization, but would also increase portability (on some compilers, Serialization only works with Spirit 1.6, which is not shipped with the latest Boost release). My current plan is to rewrite the grammar in Qi, and modify the existing XML archive classes to use the new parser without changing the interface or the outputted XML. - Bryce Lelbach -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkx9e9oACgkQO/fqqIuE2t4IRwCdGOUqHtLhWeVIK3ruTipf4jHu +j8AoIP7CUssPlkBpLTWLHYRcPls0/ZJ =ki/H -----END PGP SIGNATURE-----

Bryce Lelbach aka wash wrote:
I've recently started working on porting Boost.Serialization's XML parser from Spirit Classic to Spirit Qi (I've got some free time on my hands, and I heard that there was some interest in this task). Before I start putting more time into this, I'd love to get some feedback/suggestions and make sure someone else isn't already working on this.
My understanding is that porting the parser from Spirit Classic to Spirit Qi will not only speed up XML serialization, but would also increase portability (on some compilers, Serialization only works with Spirit 1.6, which is not shipped with the latest Boost release).
Spirit 1.6 has to be used on Borland compilers, since subsequent versions of Spirit don't work on Borland. I don't think that Qi will work with Borland either, so this would mean that XML serialization won't work with Borland compilers any more. But the current version of the library won't build with Borland compilers anyway, so undertaking this task won't have any real downside.
My current plan is to rewrite the grammar in Qi, and modify the existing XML archive classes to use the new parser without changing the interface or the outputted XML.
This gets my vote. Actually I don't think you'll have to modify anything other than xml_grammar.cpp. My understanding is that Qi is supposed to be much faster than previous versions and presumably better all around. There is a caveat: I tweaked xml_grammar.cpp to use a lower-level entry point in the hopes of permitting xml_archives to be thread-safe. I believe that this was successful but I can't know for sure. Good luck with this. Robert Ramey

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 09/01/2010 12:45 AM, Robert Ramey wrote:
This gets my vote. Actually I don't think you'll have to modify anything other than xml_grammar.cpp.
So far, the only places I've had to make major changes are in /boost/archive/impl/basic_xml_grammar.hpp and /libs/serialization/src/basic_xml_grammar.ipp. Once I get the basic rewrite finished and fully working, I might refactor how the grammar is instantiated/called internally. ATM I'm leaving all uses of the grammar class in Serialization untouched.
There is a caveat. I tweaked xml_grammar.cpp to use a lower-level entry point in the hopes of permitting xml_archives to be thread-safe. I believe that this was successful but I can't know for sure.
Maybe we could come up with a few use cases for thread-safety in xml_archives? I'd be happy to implement a small test suite if you could give me some parameters to work off of. Then you could verify the existing implementation and earlier versions of Serialization, and I could ensure my rewrite performs as desired in concurrent applications. At this point I've ported about 90% of the grammar. I've also replaced your assign/append functors (for Spirit Classic semantic actions) in /boost/archive/impl/basic_xml_grammar.ipp with Phoenix lambda expressions. The Classic XML parser has a number of workarounds in the grammar for template depth issues on Darwin (GCC 3.1). These workarounds seem to be deprecated (circa 2004); Spirit Qi tends to instantiate deeper templates than Spirit Classic, and more modern versions of Darwin GCC seem capable of handling Spirit Qi. I'd appreciate it if you could let me know if I'm missing part of the picture in regards to these workarounds (they're in the ctor for the grammar, in /boost/archive/impl/basic_xml_grammar.ipp). - Bryce Lelbach -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkyIV+UACgkQO/fqqIuE2t65HgCfSYTGFO0WXea6ll/M79f/XXsy rmAAoIZ8iVvHtHB7rD1spz5+3Iqakt2/ =mBq/ -----END PGP SIGNATURE-----

Bryce Lelbach aka wash wrote:
Maybe we could come up with a few use cases for thread-safety in xml_archives?
I assume you mean tests.
I'd be happy to implement a small test suite if you could give me some parameters to work off of. Then you could verify the existing implementation and earlier versions of Serialization, and I could ensure my rewrite performs as desired in concurrent applications.
Note that there is an extensive test suite for the serialization library. You should run that on your local machine. This entails running bjam with the right switches. The way I do it is to cd to the test directory and invoke ../../../regression/tools/src/library_test.sh, which runs all the tests with your new version and generates a complete table of test results. Be sure to test all combinations: debug/release, static/dynamic versions. It takes some time to run, but it doesn't require any manual effort.
At this point I've ported about 90% of the grammar.
Great, as with all software projects, you've only got 90% left to go.
I've also replaced your assign/append functors (for Spirit Classic semantic actions) in /boost/archive/impl/basic_xml_grammar.ipp with Phoenix lambda expressions.
I assumed something like that would be necessary.
The Classic XML parser has a number of workarounds in the grammar for template depth issues on Darwin (GCC 3.1). These workarounds seem to be deprecated (circa 2004);
I'm not sure what it means to deprecate a work around. Maybe you mean that they aren't necessary with later versions. The workaround code should indicate which compiler version they apply to. So maybe it's just a question of changing. I would like to avoid hearing from someone who has an old compiler if I can avoid it.
Spirit Qi tends to instantiate deeper templates than Spirit Classic, and more modern versions of Darwin GCC seem capable of handling Spirit Qi. I'd appreciate it if you could let me know if I'm missing part of the picture in regards to these workarounds (they're in the ctor for the grammar, in /boost/archive/impl/basic_xml_grammar.ipp).
The way it works is that the package is released, and the problems come in. As they come in, they are addressed with the workarounds. I don't see any way around this. If you start with a clean slate, the cycle will begin anew. So, I would leave in the workarounds if they still seem applicable. If they don't seem to fit anymore, then clean things up. It's really a matter of judgement. I'm going to leave it up to you since you'll be dealing with any complaints which arrive in the future. Thanks for taking this on. My understanding is that the new library is much faster than the currently used one so people should be happy about that. Good luck with this. Robert Ramey

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 09/09/2010 02:32 AM, Robert Ramey wrote:
Note that there is an extensive test suite for the serialization library. You should run that on your local machine.
I've finished and done (preliminary) testing on the new grammar. All tests passed on x86_64-linux-gnu-gcc-4.5.1 (link=shared, variant=debug), four warnings (unsigned/signed comparisons in other parts of Serialization that I haven't touched). I've got my machine running the full test suite (link=shared,static, variant=debug,profile,release). I'm also running the profile.sh script in performance (I can't find any existing performance data for Serialization to compare against in the HTML docs, though). I can't do Windows tests: I currently only have Linux machines available to me. My local copy of the boost trunk is checked out with svn via https. How should I get the changes to you for review? Patch?
My understanding is that the new library is much faster than the currently used one so people should be happy about that.
I don't think we'll see a notable speed increase (at least, not with the work I've done so far). The bottleneck in the xml parser is the input stream -> intermediary string -> spirit design pattern in basic_xml_grammar<CharType>::my_parse. Removing the middleman string is a bit of a problem. A stream iterator such as std::istream_iterator can't be used with Spirit directly, because it is a single-pass input iterator and a recursive descent parser such as Qi needs to backtrack. Spirit provides a multi_pass iterator (fulfills forward iterator) that can wrap an input iterator for use with Qi. Rules that present the possible need to backtrack will cause buffering with the multi_pass iterator. So, to (hopefully) get a respectable increase in speed, I'll have to refactor the new Qi grammar to minimize use of rules that will backtrack, and then modify the grammar and interface to use the multi_pass iterator. If the new parser doesn't break horribly on Windows/other compilers, I'll get started on the multi_pass stuff this weekend. - Bryce Lelbach -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkyJyGwACgkQO/fqqIuE2t6QqACdE2Cw4OGtbnnclDXx2t2N3lRx 9ZkAn2zH//H16XTMtkpcbAVKF8f4RL/J =9oYL -----END PGP SIGNATURE-----
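The "input stream -> intermediary string -> parser" pattern described in this message can be sketched with the standard library alone. This is only an illustration, not the code in basic_xml_grammar<CharType>::my_parse: read_chunk and parse_start_tag are hypothetical names, the '>' delimiter is an assumption, and the hand-written matcher merely stands in for the Spirit grammar.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: copy characters from the stream into a temporary
// string, up to and including `delimiter`. Returns false if the stream
// ends before the delimiter is seen. This copy is the "middleman string".
inline bool read_chunk(std::istream& is, std::string& chunk, char delimiter = '>') {
    chunk.clear();
    char c;
    while (is.get(c)) {
        chunk += c;
        if (c == delimiter) return true;
    }
    return false;
}

// Stand-in for the grammar (NOT Spirit): accepts "<name>" and extracts name.
// It works on string iterators, which can be re-read freely, so backtracking
// inside the parser is never a problem once the chunk has been copied.
inline bool parse_start_tag(const std::string& chunk, std::string& name) {
    std::string::const_iterator first = chunk.begin();
    const std::string::const_iterator last = chunk.end();
    while (first != last && std::isspace(static_cast<unsigned char>(*first))) ++first;
    if (first == last || *first != '<') return false;
    ++first;
    name.clear();
    while (first != last &&
           (std::isalnum(static_cast<unsigned char>(*first)) || *first == '_')) {
        name += *first++;
    }
    return !name.empty() && first != last && *first == '>' && ++first == last;
}
```

Every tag is copied once into the temporary string before any parsing happens. That extra copy is the cost suspected of being the bottleneck here.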

Bryce Lelbach aka wash wrote:
On 09/09/2010 02:32 AM, Robert Ramey wrote:
Note that there is an extensive test suite for the serialization library. You should run that on your local machine.
I've finished and done (preliminary) testing on the new grammar. All tests passed on x86_64-linux-gnu-gcc-4.5.1 (link=shared, variant=debug), four warnings (unsigned/signed comparisons in other parts of Serialization that I haven't touched). I've got my machine running the full test suite (link=shared,static, variant=debug,profile,release).
OK, this looks very good. Just out of curiosity - what method did you use to run the tests?
I'm also running the profile.sh script in performance (I can't find any existing performance data for Serialization to compare against in the HTML docs, though).
Very good. I've only made a few performance tests. Perhaps you want to make one or two for xml_?archives. I realize this is sort of a pain. But it is very helpful and the test will be a permanent contribution.
I can't do Windows tests: I currently only have Linux machines available to me.
My local copy of the boost trunk is checked out with svn via https. How should I get the changes to you for review? Patch?
If you have update access to the trunk, the easiest for me would be if you just checked them in. And BTW, you'd get test results right off the test matrix. If you don't have trunk access, I guess the best would be to send me the changed files.
My understanding is that the new library is much faster than the currently used one so people should be happy about that.
I don't think we'll see a notable speed increase (at least, not with the work I've done so far). The bottleneck in the xml parser is the input stream -> intermediary string -> spirit design pattern in basic_xml_grammar<CharType>::my_parse.
Hmmm - how do you know this?
Removing the middleman string is a bit of a problem. A stream iterator such as std::istream_iterator can't be used with Spirit, because of the backtracking present in a recursive descent parser such as Qi.
Spirit provides a multi_pass iterator (fulfills forward iterator) that can wrap an input iterator for use with Qi. Rules that present the possible need to backtrack will cause buffering with the multi_pass iterator.
As I remember that is what I used. I found it to be no problem in 8 years. In my recollection, it simplified my code. So you should really look into this.
So, to (hopefully) get a respectable increase in speed, I'll have to refactor the new Qi grammar to minimize use of rules that will backtrack, and then modify the grammar and interface to use the multi_pass iterator. If the new parser doesn't break horribly on Windows/other compilers, I'll get started on the multi_pass stuff this weekend.
I'm quite confident that if it passes with a recent version of gcc it will pass with recent versions of Windows compilers (with a few tweaks). So before you check it in I would like to see: a) a performance test, b) usage of the multi-pass iterator. Robert Ramey
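Why backtracking forces buffering on a single-pass stream can be shown with a small hand-rolled wrapper. This is not Spirit's multi_pass and makes no claim about how multi_pass is implemented. buffered_input and match are hypothetical names, and the sketch only illustrates the idea that characters already consumed must be kept around so a failed alternative can be retried from its starting point.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical wrapper: remembers every character pulled from a
// single-pass stream so that a parser can rewind to an earlier position.
class buffered_input {
public:
    explicit buffered_input(std::istream& is) : is_(is), pos_(0) {}

    // Fetch the next character, reading from the stream only when the
    // buffer has been exhausted. Returns false at end of input.
    bool next(char& c) {
        if (pos_ == buffer_.size()) {
            char fresh;
            if (!is_.get(fresh)) return false;
            buffer_ += fresh;
        }
        c = buffer_[pos_++];
        return true;
    }

    std::size_t mark() const { return pos_; }            // current position
    void rewind(std::size_t m) { pos_ = m; }             // backtrack
    std::size_t buffered() const { return buffer_.size(); }

private:
    std::istream& is_;
    std::string buffer_;
    std::size_t pos_;
};

// Try to match `literal`; on failure, rewind so that another alternative
// can be tried from the same starting point.
inline bool match(buffered_input& in, const std::string& literal) {
    const std::size_t start = in.mark();
    for (std::size_t i = 0; i < literal.size(); ++i) {
        char c;
        if (!in.next(c) || c != literal[i]) {
            in.rewind(start);
            return false;
        }
    }
    return true;
}
```

In this sketch the buffer only grows. A grammar with fewer alternatives that can fail late needs to keep less, which is the motivation for minimizing backtracking rules before switching iterators.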

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 09/10/2010 01:07 PM, Robert Ramey wrote:
OK, this looks very good. Just out of curiosity - what method did you use to run the tests?
I used the Boost.Regression script you mentioned ITT.
If you have update access to the trunk, the easiest for me would be if you just checked them in. And BTW, you'd get test results right off the test matrix. If you don't have trunk access, I guess the best would be to send me the changed files.
I don't have trunk access; I could request it if that'd be easier for you.
I don't think we'll see a notable speed increase (at least, not with the work I've done so far). The bottleneck in the xml parser is the input stream -> intermediary string -> spirit design pattern in basic_xml_grammar<CharType>::my_parse.
Hmmm - how do you know this?
a) a performance test b) usage of the multi-pass iterator.
The only way to find out is to do ^ a) and b), and compare the results against the Spirit.Classic version of the XML grammar and the Spirit.Qi version of the XML grammar that uses the intermediary string. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkyKb/gACgkQO/fqqIuE2t7XtACZAYZGK2tfT8jsMbx+U3Vd4cFE w28AoJTnS5DclzJfBvdTlNTL27Hwl9Ht =WWAu -----END PGP SIGNATURE-----

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 An update on this: using the multi_pass iterator decreases memory usage, but it slows things down. IMO this is undesirable, given that Serialization XML is not even close to being a memory hog.
I've developed a test suite for the XML grammar parser using Boost.PP, the high_resolution_timer class in Boost.Spirit, and (ironically) Boost.Serialization to save the results in an XML archive. The test harness consists of a bunch of macros, a few template classes, and a handful of utility functions. Various test variants can be produced by defining preprocessor tokens with the parameters and invoking a macro which builds the appropriate main() function for the test.
The test works by creating a tree of N-ary nodes of a fixed maximum depth D, with an underlying serializable type S. Currently, only strings and integer types can be used as the underlying type. Each node has N children; for nodes at depth < D, the children are more nodes. Nodes at depth D have S children. (This was a bit of a pain to implement with Boost.PP. I ended up having to implement a preprocessor power function.) Each of the serializable values is initialized with random data using Boost.Uuid. At the end of the day, a test with depth D, node arity N and underlying type S creates a tree with N^D + N^(D-1) + ... + N^0 nodes and N^D strings/integers, each initialized with a unique identifier.
Currently, I've created 8 tests to be run, classified by the total number of integers/strings created by each test: string4, int4, string16, int16, string64, int64, string256, int256. I tried doing the next greatest depth, string1024/int1024, but the preprocessor can't handle it. Each individual test will create one tree, serialize it to a temporary file, and then record how long it takes xml_iarchive to load the data.
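The tree sizes quoted above follow directly from the formula in the message. A minimal sketch that evaluates it, using ordinary constexpr functions rather than the Boost.PP power function the harness actually uses; value_count and node_count are hypothetical names:

```cpp
#include <cstdint>

// Number of strings/integers in the tree as stated in the message: N^D.
constexpr std::uint64_t value_count(std::uint64_t n, unsigned d) {
    return d == 0 ? 1 : n * value_count(n, d - 1);
}

// Number of nodes as stated in the message: N^D + N^(D-1) + ... + N^0.
constexpr std::uint64_t node_count(std::uint64_t n, unsigned d) {
    return d == 0 ? 1 : value_count(n, d) + node_count(n, d - 1);
}
```

The message does not say which N and D each test uses. As one illustration only, N = 4 with D = 1 through 4 would give 4, 16, 64 and 256 values, and the next step, 1024, is the size the preprocessor could not handle.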
The test program will then look for an existing xml result archive (each test uses its own result archive), and create a new result archive if an existing one isn't found. Stored in the archive is a serialized std::list containing accumulated test results. The compiler/platform that tests are performed on is serialized into the archive and the test will throw an exception if they don't match, e.g. if you attempt to combine results performed on a Windows machine with those performed on Linux. I still need to write a Jamfile to run all the tests multiple times. I also need to write a program to load the std::list of results from the finalized result archives and accumulate the data. I should finish that today, and then I'll run the performance test on the Qi grammar (the one without multi_pass) and the Classic grammar. It occurs to me that I might have overdone this a little bit :p. - - Bryce Lelbach -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkyOayYACgkQO/fqqIuE2t75VQCg8AXGVRtGfb3MXtdzZvuohQ+Y NzMAmwbj9j+K/WUqfy8d1RNBLiAHrwtq =gHY8 -----END PGP SIGNATURE-----
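The accumulate-or-reject behaviour described for the result archive can be sketched as below. This is an assumption-laden illustration and not the harness code: result_set and append_result are hypothetical names, the timings are plain doubles, and the Boost.Serialization load/save of the std::list is left out entirely.

```cpp
#include <cassert>
#include <list>
#include <stdexcept>
#include <string>

// Hypothetical in-memory picture of one result archive: the environment
// the results were recorded on, plus the accumulated timings.
struct result_set {
    std::string platform;
    std::string compiler;
    std::list<double> seconds;
};

// Add one timing. A fresh (empty) result set adopts the caller's
// environment; an existing one refuses results from a different
// compiler/platform, mirroring the exception described in the message.
inline void append_result(result_set& results,
                          const std::string& platform,
                          const std::string& compiler,
                          double elapsed_seconds) {
    const bool fresh = results.seconds.empty() &&
                       results.platform.empty() &&
                       results.compiler.empty();
    if (fresh) {
        results.platform = platform;
        results.compiler = compiler;
    } else if (results.platform != platform || results.compiler != compiler) {
        throw std::runtime_error(
            "results were recorded on a different compiler/platform");
    }
    results.seconds.push_back(elapsed_seconds);
}
```

Rejecting mismatched environments keeps every archive comparable with itself, so timings from a Windows build are never averaged together with timings from a Linux build.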

Bryce Lelbach aka wash wrote:
It occurs to me that I might have overdone this a little bit :p.
This is Boost. That's pretty hard to do. You should forward this message to whomever is in charge of granting update privileges to the trunk. Please give bruce these privileges. Bruce. Update the trunk with your version. This should require little or no changes in boost/libs/serialization/test. Put your performance tests in boost/libs/serialization/performance. Robert Ramey

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Bruce.
(It's with a y, Bryce, actually :)) I sent in a request for trunk access this weekend, linked to this thread. Hopefully I'll hear back from the mods sometime today or tomorrow. It is of course my intention to actively maintain this patch.
Update the trunk with your version. This should require little or no changes in boost/libs/serialization/test. Put your performance tests in boost/libs/serialization/performance.
From the root of my working copy of the trunk, I have:
M boost/serialization/item_version_type.hpp
M boost/archive/impl/xml_iarchive_impl.ipp
M boost/archive/impl/basic_xml_grammar.hpp
A libs/serialization/xml_performance
A libs/serialization/xml_performance/string64_test.cpp
A libs/serialization/xml_performance/int16_test.cpp
A libs/serialization/xml_performance/string256_test.cpp
A libs/serialization/xml_performance/int64_test.cpp
A libs/serialization/xml_performance/macro.hpp
A libs/serialization/xml_performance/int256_test.cpp
A libs/serialization/xml_performance/high_resolution_timer.hpp
A libs/serialization/xml_performance/harness.hpp
A libs/serialization/xml_performance/string4_test.cpp
A libs/serialization/xml_performance/node.hpp
A libs/serialization/xml_performance/Jamfile.v2
A libs/serialization/xml_performance/string16_test.cpp
A libs/serialization/xml_performance/int4_test.cpp
M libs/serialization/src/xml_grammar.cpp
M libs/serialization/src/basic_xml_grammar.ipp
M libs/serialization/src/xml_wgrammar.cpp
xml_performance is just a temporary directory I was using as a sandbox. Once I put the finishing touches on my work, run the performance tests on linux, and get write access, I will check in. The change to boost/serialization/item_version_type.hpp is the addition of "#include <cassert>"; I came across a build error (in one of the examples, I believe) arising from a missing declaration of assert. - - Bryce Lelbach -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEUEARECAAYFAkyOgvAACgkQO/fqqIuE2t5F3wCY+o78zF+xaWGymQTqlNQG/8sN AwCfTliBPdoSE/R6DsawGJug1lQpO+E= =Jx5c -----END PGP SIGNATURE-----

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Bumping this thread as I still haven't heard back from the Boost moderators about commit access. Ramey, would you like me to email you the changes as a patch? -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkyaGO0ACgkQO/fqqIuE2t6WsQCfY91ccGpwg79QjoyPDHqw0H+3 4PYAn1EqB3oVCIzr9uOFgVbNayjZ7tV5 =x0cw -----END PGP SIGNATURE-----

Bryce Lelbach aka wash wrote:
Bumping this thread as I still haven't heard back from the Boost moderators about commit access. Ramey, would you like me to email you the changes as a patch?
I still would much prefer that you be given commit access. You've done the work and run it through the same exhaustive testing that I do. So there is nothing I can do that you haven't done already. Without that, I would have to intervene in the process and I can't do that now. I know it's a pain, but continuing to bug the Boost moderators really is the most effective solution. I suspect it's just a question of finding the right one. Try sending email directly to the moderators listed on the web page which mentions them. Robert Ramey
participants (2): Bryce Lelbach aka wash, Robert Ramey