[serialization] wrongly initialized char set for XML grammar

Hello All, XML grammar parser in boost::serialization has a definition of spirit primitive to deal with XML character set. Initialization of this pimitive looks like following (file $(BOOST)/libs/serialization/src/xml_grammar.cpp): Char = chset_t("\x9\xA\xD\x20-\xFF"); Obviously, if char type is signed, then \xFF means -1 and the above initialization will be equivalent to Char = chset_t("\x9\xA\xD"); -- -- Vyacheslav E. Andrejev -- System Architect, Optech International, Inc. -- E-mail: mortituris@mail.ru

Vyacheslav E. Andrejev wrote:
Hello All,
XML grammar parser in boost::serialization has a definition of spirit primitive to deal with XML character set. Initialization of this pimitive looks like following (file $(BOOST)/libs/serialization/src/xml_grammar.cpp):
Char = chset_t("\x9\xA\xD\x20-\xFF");
Obviously, if char type is signed, then \xFF means -1 and the above initialization will be equivalent to
Char = chset_t("\x9\xA\xD");
No, that does not happen. The signed(ness) is ignored by the char-set class. The full 8-bits is mapped to a 256-bit bitset. Do you see a need for negative char values? Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Hello Joel, JG> No, that does not happen. The signed(ness) is ignored by the JG> char-set class. The full 8-bits is mapped to a 256-bit bitset. Do JG> you see a need for negative char values? Unfortunately, it does happen. Look at the definition of two functions. The first is boost::spirit::utility::impl::construct_chset (in the file $(BOOST)/boost/spirit/utility/impl/chset.ipp): template <typename CharT, typename CharT2> void construct_chset(boost::shared_ptr<basic_chset<CharT> >& ptr, CharT2 const* definition) { CharT2 ch = *definition++; while (ch) { CharT2 next = *definition++; if (next == '-') { next = *definition++; if (next == 0) { ptr->set(ch); ptr->set('-'); break; } ptr->set(ch, next); } else { ptr->set(ch); } ch = next; } } It is obvious that for our input "\x9\xA\xD\x20-\xFF" the function "ptr->set(ch, next);" will be invoked with the first argument equals to '\x20', i.e. 32, and the second equals to '\xff', i.e. -1. Yes, you right, the implementation then will try to map these arguments into a bitset of unsigned values. But look at the way it is implemented in boost::spirit::basic_chset_8bit<CharT>::set. You can find the implementation of in $(BOOST)/boost/spirit/utility/impl/chset/basic_chset.ipp: basic_chset_8bit<CharT>::set(CharT from, CharT to) { for (int i = from; i <= to; ++i) bset.set((unsigned char)i); } There will be not a single iteration when from == 32 and to == -1. To correct this issue I suggest change Char = chset_t("\x9\xA\xD\x20-\xFF"); to Char = chset_t("\x9\xA\xD\x20-\x7f\x80-\xFF"); The latter one will work independently of sign'ness of the char type. JG> Do JG> you see a need for negative char values? No. But char is signed by default on my compiler. That fact along with improper initialization of Char primitive caused me problems when I tried to implement XML comments skipping in boost::serialization. Regards. -- -- Vyacheslav E. Andrejev -- System Architect, Optech International, Inc. -- E-mail: mortituris@mail.ru JG> Vyacheslav E. Andrejev wrote: JG>
Hello All,
XML grammar parser in boost::serialization has a definition of spirit primitive to deal with XML character set. Initialization of this pimitive looks like following (file $(BOOST)/libs/serialization/src/xml_grammar.cpp):
Char = chset_t("\x9\xA\xD\x20-\xFF");
Obviously, if char type is signed, then \xFF means -1 and the above initialization will be equivalent to
Char = chset_t("\x9\xA\xD");
JG> No, that does not happen. The signed(ness) is ignored by the JG> char-set class. The full 8-bits is mapped to a 256-bit bitset. Do JG> you see a need for negative char values? JG> JG> Regards, JG> JG> _______________________________________________ JG> Unsubscribe & other changes: JG> http://lists.boost.org/mailman/listinfo.cgi/boost

Vyacheslav E. Andrejev wrote:
Hello Joel,
JG> No, that does not happen. The signed(ness) is ignored by the JG> char-set class. The full 8-bits is mapped to a 256-bit bitset. Do JG> you see a need for negative char values?
Unfortunately, it does happen. Look at the definition of two functions. The first is boost::spirit::utility::impl::construct_chset (in the file $(BOOST)/boost/spirit/utility/impl/chset.ipp):
[snips]
It is obvious that for our input "\x9\xA\xD\x20-\xFF" the function "ptr->set(ch, next);" will be invoked with the first argument equals to '\x20', i.e. 32, and the second equals to '\xff', i.e. -1. Yes, you right, the implementation then will try to map these arguments into a bitset of unsigned values. But look at the way it is implemented in boost::spirit::basic_chset_8bit<CharT>::set. You can find the implementation of in $(BOOST)/boost/spirit/utility/impl/chset/basic_chset.ipp:
basic_chset_8bit<CharT>::set(CharT from, CharT to) { for (int i = from; i <= to; ++i) bset.set((unsigned char)i); }
There will be not a single iteration when from == 32 and to == -1.
You are right!
To correct this issue I suggest change
Char = chset_t("\x9\xA\xD\x20-\xFF");
to
Char = chset_t("\x9\xA\xD\x20-\x7f\x80-\xFF");
The latter one will work independently of sign'ness of the char type.
Your solution is a good workaround for that specific case. However, in the interest of correcting this (doing the right thing) for the whole, and especially for Spirit2 currently in development, the general question is: should we ignore the signedness of a char? A simple fix is to cast CharT to unsigned char (the handling in basic_chset_8bit is inadequate in this regard). But again, the question is: should we ignore the signedness of a char? Hartmut and I had a discussion and we think the answer is yes. Expressions like in the XML parser are the normal expected behavior. However, we're a bit reluctant to enforce this as this would entail lots of code changes library-wide and might break existing code using the library. Perhaps the most sensible thing, for now, is to apply the fix you gave and document this behavior. I welcome comments and opinions. I'm cross-posting to the Spirit lists. Here's Vyacheslav's original post (for context): Vyacheslav E. Andrejev wrote:
Hello All,
XML grammar parser in boost::serialization has a definition of spirit primitive to deal with XML character set. Initialization of this pimitive looks like following (file $(BOOST)/libs/serialization/src/xml_grammar.cpp):
Char = chset_t("\x9\xA\xD\x20-\xFF");
Obviously, if char type is signed, then \xFF means -1 and the above initialization will be equivalent to
Char = chset_t("\x9\xA\xD");
Thank you! Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Hello Joel, JG> Perhaps the most sensible thing, for now, is to JG> apply the fix you gave and document this behavior. I entirely agree. -- -- Vyacheslav E. Andrejev -- System Architect, Optech International, Inc. -- E-mail: mortituris@mail.ru
participants (2)
-
Joel de Guzman
-
Vyacheslav E. Andrejev