
Stefan Seefeld <seefeld@sympatico.ca> writes:
Anthony Williams wrote:
That's one of my main criticisms of your suggested API --- it's too tightly bound to libxml, and doesn't really allow for substitution of another parser.
In order to use a particular external type, such as std::string, the user has to supply a specialization of converter<> for their type, which converts to and from the libxml xmlChar type.
Correct. That's the price to pay for not forcing any particular unicode library on users who want to use the XML API.
Hmm. What is an xmlChar? From your string.hpp, it appears to be the same as a normal char, since you can cast a char* to an xmlChar*, but I don't know libxml2, so I wouldn't like to assume. I would rather that the boost::xml API defined a type (even if it was a typedef for the libxml xmlChar), and the requirements on that type (e.g. ASCII, UTF-32 or UTF-8 encoding).

By exposing the underlying character type of the backend like this, you are restricting the backends to those that share the same internal character type and encoding, or imposing an additional conversion layer on backends with a different internal encoding.

Just as an example, my axemill parser uses a POD struct containing an unsigned long as the character type, so that each Unicode codepoint is a single "character", and I don't have to worry about variable-length encodings such as UTF-8 internally. If I wanted to use axemill as the backend parser, and handle std::wstring input on a platform where wchar_t is UTF-32, but keep xmlChar in the API, the converter would have to change UTF-32 to UTF-8 (I assume), and then internally this would have to be converted back to UTF-32.

I would suggest that the API accept input in UTF-8, UTF-16 and UTF-32. The user then has to supply a conversion function from their encoding to one of these, and the library converts internally if the one they choose is not the "correct" one. I would imagine that a user who works in UTF-8 will choose to provide a UTF-8 conversion, someone who works with UCS-2 wchar_t characters will provide a UTF-16 conversion, and someone who works with UTF-32 wchar_t characters will provide a UTF-32 conversion. Someone who uses a different encoding, such as EBCDIC, will provide a conversion appropriate to their usage. This should produce the minimum of cross-encoding conversions.
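To make the suggestion concrete, a converter along these lines could tag which encoding form the user's conversion produces, so the library only converts further when the backend needs a different one. All names here (converter, the encoding tags) are hypothetical illustrations, not the actual proposed boost::xml interface:

```cpp
#include <cassert>
#include <string>

// Encoding tags: the user's specialization declares which form it produces.
struct utf8_tag {};
struct utf16_tag {};
struct utf32_tag {};

// Users specialize this for their own external string type.
template <typename StringT>
struct converter;

// Example: a user working in plain std::string, assumed here to hold UTF-8,
// so both directions are no-ops and no cross-encoding conversion happens.
template <>
struct converter<std::string>
{
    typedef utf8_tag encoding;  // tells the library what it will receive
    static std::string to_internal(std::string const& s)   { return s; }
    static std::string from_internal(std::string const& s) { return s; }
};
```

A UCS-2 wchar_t user would instead specialize for std::wstring with `utf16_tag`, and the library would dispatch on the `encoding` typedef to decide whether an internal re-encoding is needed.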
My other criticism so far is the node::type() function. I really don't believe in such type tags; we should be using virtual function dispatch instead, using the Visitor pattern. Your traversal example could then ditch the traverse(node_ptr) overload, and instead be called with document->root.visit(traversal)
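A minimal sketch of the visitor arrangement being proposed, with illustrative names (the real node set and handler names would come from the library, not from this sketch):

```cpp
#include <cassert>

struct element;
struct text_node;

// One handler per concrete node type, instead of a switch on a type tag.
struct node_visitor
{
    virtual ~node_visitor() {}
    virtual void handle(element&) = 0;
    virtual void handle(text_node&) = 0;
};

struct node
{
    virtual ~node() {}
    virtual void visit(node_visitor&) = 0;  // double-dispatch entry point
};

struct element : node
{
    void visit(node_visitor& v) { v.handle(*this); }
};

struct text_node : node
{
    void visit(node_visitor& v) { v.handle(*this); }
};

// A traversal implements node_visitor; no casts, no type() checks.
struct counting_traversal : node_visitor
{
    int elements = 0, texts = 0;
    void handle(element&)   { ++elements; }
    void handle(text_node&) { ++texts; }
};
```

Calling `some_node.visit(traversal)` then dispatches to the correct overload at runtime, which is the shape of the `document->root.visit(traversal)` call suggested above.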
Node types aren't (runtime-)polymorphic right now, but is that really a big deal? Polymorphism is important for extensibility. However, here the set of node types is well known (and rather limited).
Polymorphism is not just important for extensibility of the polymorphic set, but also for type-safe, convenient handling of objects whose exact type is only known at runtime. Check-type-flag-and-cast is a nasty smell that I would rather avoid where possible.
I understand and agree with you...in principle. There are a number of situations where a method call will return a node_ptr, and the user typically wants to cast to the exact type. Examples include element child iteration and node_sets (resulting from xpath lookup, etc.). However, doing this with a visitor-like visit/accept pair of methods incurs two virtual method calls just to get hold of the real type. That's a lot!
Two virtual method calls (node.visit/visitor.handle) vs two plain function calls (node.type/handle_xxx), a switch, and a "cast" that constructs a new object. Have you run a profiler on it? Premature optimization is the root of all evil; I would rather have something that helps me write correct code than something fast. I really dislike switch-on-type code, and I'm not convinced that it is necessarily faster in all cases.
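For contrast, the check-type-flag-and-cast style under discussion looks roughly like this (illustrative names only, not the proposed API):

```cpp
#include <cassert>

// A runtime type tag on the base class, plus a switch and a cast at each
// point of use -- the pattern the visitor approach is meant to replace.
enum node_type { element_tag, text_tag };

struct node
{
    node_type type_;
    node_type type() const { return type_; }
};

struct element : node { element() { type_ = element_tag; } };
struct text    : node { text()    { type_ = text_tag; } };

int classify(node& n)
{
    switch (n.type())  // two plain calls plus a cast per dispatch site
    {
    case element_tag:
        { element& e = static_cast<element&>(n); (void)e; return 1; }
    case text_tag:
        { text& t = static_cast<text&>(n); (void)t; return 2; }
    }
    return 0;
}
```

Every new place that needs per-type behaviour repeats this switch, which is the maintenance and correctness cost being weighed against the two virtual calls.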
As my node implementations already know their type (in terms of an enum tag), casting is a simple matter of rewrapping the implementation by a new proxy.
There's nothing stopping this from continuing to work.
Using RTTI to represent the node's type is definitely possible. I'm just not convinced of its advantages.
I'm not convinced of the advantage of not using it ;-)
One additional comment on re-reading the samples --- having to instantiate every template for the external string type seems rather awkward.
One alternative is to accept and return an internal string type, and provide conversion functions to/from the user's external string type. This way, the library is not dependent on the string type, but it does add complexity to the interface.
Right, I considered that. One has to be careful with those string conversions, though, to avoid unnecessary copies.
Yes. However, this problem exists with your proposed API anyway --- by converting on every call to the API, you are forcing possibly-unnecessary conversions on your users. For example, they may want to add the same attribute to 50 nodes; your proposed API requires that the attribute name is converted 50 times. Accepting an internal string type, and making the user do the conversion allows the user to do the conversion once, and then pass the converted string 50 times.
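The convert-once argument can be sketched as follows, with hypothetical names (an `internal_string` boundary type and a user-supplied `to_internal` conversion; neither is part of the actual proposal):

```cpp
#include <cassert>
#include <string>

static int conversions = 0;  // counts how many conversions actually happen

// The library's internal string type, here assumed to hold UTF-8.
struct internal_string
{
    std::string utf8;
};

// User-supplied conversion from their external type (toy: ASCII subset only).
internal_string to_internal(std::wstring const& s)
{
    ++conversions;
    return internal_string{ std::string(s.begin(), s.end()) };
}

// The API accepts the internal type, so it never converts per call.
struct node
{
    int attribute_count = 0;
    void set_attribute(internal_string const&) { ++attribute_count; }
};
```

With this shape, adding the same attribute to 50 nodes costs one conversion rather than 50, because the caller converts once and passes the `internal_string` each time.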
Another alternative is to make the functions that accept or return the user's string type into templates, whilst leaving the enclosing class as a non-template, since there are no data members of the user's string type. Template parameter type deduction can be used to determine the type when it is given as a parameter, and explicit specification can be used when it is needed for a return type.
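A small sketch of that alternative, assuming hypothetical names throughout: the class itself is a non-template because it stores only the internal encoding, and only the accessors are member templates.

```cpp
#include <cassert>
#include <string>

struct attribute  // non-template class: no data member of the user's type
{
    std::string value_utf8;  // internal storage in a fixed encoding

    // Deduction: StringT is inferred from the argument.
    template <typename StringT>
    void set_value(StringT const& s)
    {
        value_utf8.assign(s.begin(), s.end());  // toy conversion, ASCII only
    }

    // Explicit specification: the caller names the return type.
    template <typename StringT>
    StringT value() const
    {
        return StringT(value_utf8.begin(), value_utf8.end());
    }
};
```

So `a.set_value(my_wstring)` deduces the type, while `a.value<std::wstring>()` specifies it explicitly, and the same `attribute` object can serve callers using different string types.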
Actually the use I have in mind is really what I demonstrate in the examples: developers who decide to use boost::xml::dom fix the unicode binding by specifying the string type, conversion, etc. in a single place (such as my 'string.hpp'), so that all the rest of the code doesn't need to be aware of this mechanism.
In your case I couldn't encapsulate the binding in a single place, as you mention yourself.
Agreed, but you wouldn't have to. It's also more flexible --- it would allow input to come in with one encoding/string type, and output to be generated with a different encoding/string type, but the same boost::xml::dom objects could be used.
What would be possible, though, is to put all types into a single parametrized struct:
template <typename S>
struct types
{
    typedef document<S>           document_type;
    typedef node_ptr<element<S> > element_ptr;
    ...
};
This is preferable to the current proposed API, but I still prefer that the conversion happens at the boundary as per my suggestions, rather than the entire classes being parameterized.

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk