[serialization]A question about the implementation of XML_name in basic_xml_oarchive.ipp

I ran across the implementation of XML_name. It is a functor used to validate each single character in an XML name. The initialization of a stack-based look-up table everytime the functor is called puzzles me (see excerpt below). void operator()(CharType t) const{ unsigned char lookup_table[] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0, // -. 1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0, // 0-9 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1, // -Z _ 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // a- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0, // -z 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 }; if((unsigned)t > 127) return; if(0 == lookup_table[(unsigned)t]) boost::throw_exception( xml_archive_exception( xml_archive_exception::xml_archive_tag_name_error ) ); } Is there any reason why it was done this way? For threading issues? Thanks, Sean

Is there any reason why it was done this way? For threading issues?
LOL, it is done that way because a) that was the first way it occurred to me to do it b) it worked c) it didn't occur to me when I wrote it that there might be a performance issue Of course, now that you mention it ... My first inclination would be to use const unsigned char lookup_table[] = { ... Though I'm not sure that every compiler will exploit the const-ness to avoid re-initalizing lookup_table[] every time. Thanks for a very astute observation. Robert Ramey Sean Huang wrote:
I ran across the implementation of XML_name. It is a functor used to validate each single character in an XML name. The initialization of a stack-based look-up table everytime the functor is called puzzles me (see excerpt below).
void operator()(CharType t) const{ unsigned char lookup_table[] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0, // -. 1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0, // 0-9 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1, // -Z _ 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // a- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0, // -z 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 }; if((unsigned)t > 127) return; if(0 == lookup_table[(unsigned)t]) boost::throw_exception( xml_archive_exception( xml_archive_exception::xml_archive_tag_name_error ) ); }
Is there any reason why it was done this way? For threading issues?
Thanks,
Sean _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Robert Ramey wrote:
Is there any reason why it was done this way? For threading issues?
LOL, it is done that way because
a) that was the first way it occurred to me to do it b) it worked c) it didn't occur to me when I wrote it that there might be a performance issue
Of course, now that you mention it ...
My first inclination would be to use
const unsigned char lookup_table[] = { ...
Though I'm not sure that every compiler will exploit the const-ness to avoid re-initalizing lookup_table[] every time.
What about: static const unsigned char lookup_table[] = { ... }; The static tells the compiler that this is a global constant and not just related to this call of the function. The const tells the compiler that the data isn't (shouldn't be) modified. I think this will work on all compilers. (Compare this with declaring class-based constants using "static const ..."). - Reece

"Reece Dunn" <msclrhd@hotmail.com> wrote in message news:BAY101-F67FCBC7D549563DC6FD47A08D0@phx.gbl...
What about:
static const unsigned char lookup_table[] = { ... };
The static tells the compiler that this is a global constant and not just related to this call of the function. The const tells the compiler that the data isn't (shouldn't be) modified.
I think this will work on all compilers. (Compare this with declaring class-based constants using "static const ...").
- Reece
Could you please elaborate this for a bit? What is the issue with using a const static member? Thanks, Sean

On 9/28/05 11:28 AM, "Robert Ramey" <ramey@rrsd.com> wrote:
Is there any reason why it was done this way? For threading issues?
LOL, it is done that way because
a) that was the first way it occurred to me to do it b) it worked c) it didn't occur to me when I wrote it that there might be a performance issue
There a lot more issues than that.
Of course, now that you mention it ...
My first inclination would be to use
const unsigned char lookup_table[] = { ...
Though I'm not sure that every compiler will exploit the const-ness to avoid re-initalizing lookup_table[] every time.
As another poster said, you could make the "lookup_table" static too.
Thanks for a very astute observation.
But you got other problems, here's the entire class in <boost/archive/impl/basic_xml_oarchive.ipp> //======================================================================== //... namespace boost { namespace archive { namespace detail { template<class CharType> struct XML_name { typedef bool result_type; void operator()(CharType t) const{ unsigned char lookup_table[] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0, // -. 1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0, // 0-9 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1, // -Z _ 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // a- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0, // -z 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 }; if((unsigned)t > 127) return; if(0 == lookup_table[(unsigned)t]) boost::throw_exception( xml_archive_exception( xml_archive_exception::xml_archive_tag_name_error ) ); } }; //... //======================================================================== A static-const table would have to be created for each version of XML_name, but (in the code later in the file) you only use a "const char" version of the template. So why bother making this a template at all? Believe it or not, Boost code doesn't _have_ to be made of templates. * Why do you use "const char" instead of just "char" in your instantiations? * result_type doesn't match what operator() returns. * If the table is just 0's and 1's, why not just use "bool"? Is it because "bool" can be bigger than a byte? * The code assumes that CharType is compatible with "unsigned". What if it isn't? This suggests to be that you should make a non-template function that optimizes the code for "char" and make the class template call that function (with appropriate conversions). * The code assumes that CharType's code points are ASCII compatible. * XML is supposed to be Unicode-friendly. What about non-ASCII characters that could be used in XML names? * As another poster said, what's wrong with using functions like "isalnum"? If you're afraid it may include non-ASCII letters and numbers, then just use a giant switch statement. Hopefully compiler makers haven't forgotten how to optimize those (usually into a table). * Why do bad values outside the range simply return while ones inside the range throw?
Sean Huang wrote:
I ran across the implementation of XML_name. It is a functor used to validate each single character in an XML name. The initialization of a stack-based look-up table everytime the functor is called puzzles me (see excerpt below).
void operator()(CharType t) const{ unsigned char lookup_table[] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0, // -. 1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0, // 0-9 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1, // -Z _ 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // a- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0, // -z 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 }; if((unsigned)t > 127) return; if(0 == lookup_table[(unsigned)t]) boost::throw_exception( xml_archive_exception( xml_archive_exception::xml_archive_tag_name_error ) ); }
Is there any reason why it was done this way? For threading issues?
-- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com

Daryle Walker wrote:
But you got other problems, here's the entire class in <boost/archive/impl/basic_xml_oarchive.ipp>
//======================================================================== //... namespace boost { namespace archive {
namespace detail { template<class CharType> struct XML_name { typedef bool result_type; void operator()(CharType t) const{ unsigned char lookup_table[] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0, // -. 1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0, // 0-9 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1, // -Z _ 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // a- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0, // -z 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 }; if((unsigned)t > 127) return; if(0 == lookup_table[(unsigned)t]) boost::throw_exception( xml_archive_exception( xml_archive_exception::xml_archive_tag_name_error ) ); } };
//... //========================================================================
A static-const table would have to be created for each version of XML_name, but (in the code later in the file) you only use a "const char" version of the template. So why bother making this a template at all? Believe it or not, Boost code doesn't _have_ to be made of templates.
When I wrote this I didn't know that it woudl be used for only one kind of character.
* Why do you use "const char" instead of just "char" in your instantiations?
off hand I don't remember. But it might ot might not have had something to do with what happens when one passes a const argument to an argument taking a non-const. This will generate a compile error for class types. I don't remember if compilers complain for primitive types or not.
* result_type doesn't match what operator() returns.
It seems that that is vestige of some earlier incarnation.
* If the table is just 0's and 1's, why not just use "bool"? Is it because "bool" can be bigger than a byte?
yes
* The code assumes that CharType is compatible with "unsigned". What if it isn't? This suggests to be that you should make a non-template function that optimizes the code for "char" and make the class template call that function (with appropriate conversions).
* The code assumes that CharType's code points are ASCII compatible.
an oversight on my part.
* XML is supposed to be Unicode-friendly. What about non-ASCII characters that could be used in XML names?
The code assumes that CharType can be converted to an unsigned integer. This is true for the instantiation used.
* As another poster said, what's wrong with using functions like "isalnum"? If you're afraid it may include non-ASCII letters and numbers, then just use a giant switch statement. Hopefully compiler makers haven't forgotten how to optimize those (usually into a table).
isalnum didn't include certain punctuation that is legal in xml tags so it whould been isalmum || ... . In fact, what is really needed is is_xxml_tag_char - so I made this.
* Why do bad values outside the range simply return while ones inside the range throw?
I don't remember why I did this now. There is a whole layer going on here that isn't really apparent. When an XML archive is read xml tags are converted to string variables for internal usage withing the archive implementation. This presents a problem. There are currently two cases xml_?archive for normal 8 bit characters and _w?archive for wide characters. xml_archive just reads in the characters into the string. Whether the xml is in ascii, the local multi-byte character set or utf8 is not taken into account. That is the codecvt facet doesn't do anything. xml_warchive presumes that the xml is coded in utf16. variables to be placed in strings - such as xml tags are tranlated into multi-byte characters in the current locale. Then they are checked here for invalid xml_characters. So what's really missing here is a good ix_xml_tag function. Given the concerns raised in this email, it would seem that it should consider the posibility of EBCDIC characters as well. Should anyone have an implementation of such a function I would be pleased to consider it. Robert Ramey

"Sean Huang" <huangsean@hotmail.com> writes:
I ran across the implementation of XML_name. It is a functor used to validate each single character in an XML name. The initialization of a stack-based look-up table everytime the functor is called puzzles me (see excerpt below).
void operator()(CharType t) const{ unsigned char lookup_table[] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0, // -. 1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0, // 0-9 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // A- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1, // -Z _ 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // a- 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0, // -z 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 }; if((unsigned)t > 127) return; if(0 == lookup_table[(unsigned)t]) boost::throw_exception( xml_archive_exception( xml_archive_exception::xml_archive_tag_name_error ) ); }
Is there any reason why it was done this way? For threading issues?
Hmm... Also, is the apparent dependency on ASCII encoding truly portable? -- Dave Abrahams Boost Consulting www.boost-consulting.com

On 9/28/05, David Abrahams <dave@boost-consulting.com> wrote:
Hmm... Also, is the apparent dependency on ASCII encoding truly portable?
Doubtful. Wouldn't testing for std::isalnum || '-' || '_' be a better idea? Perhaps not quite as performant (once the lookup table was made static), but certainly more portable and simpler to read. -- Caleb Epstein caleb dot epstein at gmail dot com

Caleb Epstein wrote:
On 9/28/05, David Abrahams <dave@boost-consulting.com> wrote:
Hmm... Also, is the apparent dependency on ASCII encoding truly portable?
Doubtful. Wouldn't testing for std::isalnum || '-' || '_' be a better idea? Perhaps not quite as performant (once the lookup table was made static), but certainly more portable and simpler to read.
-- Caleb Epstein caleb dot epstein at gmail dot com _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
In most implementations, the is*()'s are implemented using exactly the same method. It's generally safe, as pretty much every code-page barring EBCDIC (or whatever it is) that isn't obviously non-latin (CJK codepages, espcially) uses the ASCII characters for values less than 128, including the Unicode code-points (8 is particularly good at this: all the surrogates are > 127). If it is a CJK code-page, you're screwed no matter what you do. (wide characters, and just what is an "alphabetic" ideograph?)
participants (7)
-
Caleb Epstein
-
Daryle Walker
-
David Abrahams
-
Reece Dunn
-
Robert Ramey
-
Sean Huang
-
Simon Buchan