
Stefan Seefeld <seefeld@sympatico.ca> writes:
Anthony Williams wrote:
Assume I know the encoding and character type I wish to use as input. In order to specialize converter<> for my string type, I need to know what encoding and character type the library is using. If the encoding and character type are not specified in the API, but are instead open to the whims of the backend, I cannot write my conversion code.
Ah, I think I understand what you mean by 'character type'. Yes, you are right. The code as I posted it to the vault is missing the bits that enable users to write converters without knowing backend-specific details. However, some 'dom::char_trait' should be enough, right?
Yes and no. Suppose my incoming data is a stream of 8-bit "characters" using Shift-JIS encoding. I need to write a converter that turns this into whatever encoding the XML API accepts, so I need to know that encoding when I write the converter --- if the API is expecting UTF-16 stored in a string of unsigned shorts, my converter is going to be quite different than if the API is expecting UTF-8 stored in a string of unsigned chars. I also need to know how to construct the final string --- whether I need to provide a boost::xml::char_type*, construct a boost::xml::string_type from a pair of iterators, or do something else.
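To make that concrete, a rough sketch (the convert_* names and typedefs here are invented for illustration, and a char_traits specialization is assumed for the non-char element types):

    #include <string>

    // If the API expects UTF-16 stored in a string of unsigned shorts,
    // I write one converter...
    typedef std::basic_string<unsigned short> api_utf16_string;
    api_utf16_string convert_sjis_to_utf16(char const* first, char const* last);

    // ...and if it expects UTF-8 stored in a string of unsigned chars,
    // I write a quite different one.
    typedef std::basic_string<unsigned char> api_utf8_string;
    api_utf8_string convert_sjis_to_utf8(char const* first, char const* last);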
I would suggest that the API accepts input in UTF-8, UTF-16 and UTF-32. The user then has to supply a conversion function from their encoding to one of these, and the library converts internally if the one they choose is not the "correct" one.
It already does. libxml2 provides conversion functions; I need to hook them into such an 'xml char trait'.
I don't understand how your response ties in with my comment, so I'll try again.
I was suggesting that we have overloads like:
    node::append_element(utf8_string_type);
    node::append_element(utf16_string_type);
    node::append_element(utf32_string_type);
With two of them (but unspecified which two) converting to the correct internal encoding.
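A sketch of what I mean (hypothetical typedefs and helpers; suppose for the sake of argument that the internal encoding is UTF-8 --- the user need not know which overload is the direct one):

    #include <string>

    typedef std::basic_string<unsigned char>  utf8_string_type;
    typedef std::basic_string<unsigned short> utf16_string_type;
    typedef std::basic_string<unsigned int>   utf32_string_type;

    utf8_string_type convert_utf16_to_utf8(utf16_string_type const&);
    utf8_string_type convert_utf32_to_utf8(utf32_string_type const&);

    class node
    {
    public:
        // matches the internal encoding: stored directly
        void append_element(utf8_string_type const& name);

        // the other two convert on the way in
        void append_element(utf16_string_type const& name)
        {
            append_element(convert_utf16_to_utf8(name));
        }
        void append_element(utf32_string_type const& name)
        {
            append_element(convert_utf32_to_utf8(name));
        }
    };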
Oh, but that multiplies quite a chunk of the API by four!
What's the fourth option? Yes, I agree it multiplies the API, but for the convenience of users.
Typically, a Unicode library provides converter functions, so what advantage would such a rich interface have over asking the user to do the conversion before calling into the XML library?
It avoids the user doing any conversion in many cases.
If the internal storage encoding is a compile-time constant that can be queried from the proposed dom::char_trait, it should be simple for users to decide how to write the converter, and in particular, how to pass strings in the most efficient way.
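Something like this, perhaps (names invented purely for illustration):

    namespace dom
    {
        enum encoding { utf8, utf16, utf32 };

        struct char_trait
        {
            typedef unsigned char char_type;                // backend-specific
            static encoding const internal_encoding = utf8; // queryable at
                                                            // compile time
        };
    }

A converter author could then dispatch on dom::char_trait::internal_encoding to pick the appropriate conversion.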
If the encoding is only available as a compile-time constant, that won't help me write a converter. I need it available as a software-writing-time constant for that (i.e. specified in the documentation). If you don't want to fix the encoding in the docs, maybe we should require that the user supply conversions to each of UTF-8, UTF-16 and UTF-32, and the library will use whichever is most convenient.
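i.e. something along these lines (hypothetical signatures; char_traits specializations assumed for the non-char element types):

    #include <string>

    // The user specializes this for their string type, supplying all
    // three conversions; the library calls whichever one matches its
    // internal encoding.
    template <typename StringType>
    struct converter
    {
        static std::basic_string<unsigned char>  to_utf8(StringType const&);
        static std::basic_string<unsigned short> to_utf16(StringType const&);
        static std::basic_string<unsigned int>   to_utf32(StringType const&);
    };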
If I specify the conversions to use directly on the input and output, then I can cleanly separate my application into three layers --- process the input and build the DOM in the internal encoding; process the DOM as necessary; display the results to the user.
If the string type and encoding are inherently part of the DOM types, this is not so simple.
I still don't understand what you have in mind: are you thinking of using two separate Unicode libraries / string types for input and output? Again, Unicode libraries should provide encoding conversion, if all you want is to use distinct encodings.
I may not understand the details well enough, but asking for the API to integrate the string conversions as you seem to be doing sounds exactly like what you accused me of doing: premature optimization. ;-)
It seems I am failing to communicate my thoughts correctly, since optimization is certainly far from my mind. It is separation of concerns that I am currently thinking about.

In the input layer of an application, you need to deal with all the variety of encodings that the user might supply. I'm quite happy to use a single Unicode library to deal with the conversions, but I can imagine having to deal with numerous external encodings. I would like the rest of the application to have no need to know about the complications of the input handling, and the variety of encodings used --- provided I get a set of DOM objects from somewhere, the rest of the application shouldn't care. Once the input has been handled, and the DOM built, there might be additional input in terms of XPath expressions, or element names, which might be in yet another encoding. Again, the choice of input encoding here should have no impact on the rest of the application.

With the current design, the whole API is tied to a single external string type, with a single converter function for converting to the internal string type. This implies that if you wish to use different encodings, you need a different external string type, and therefore you end up with different template instantiations for different encodings, and my nicely separated application parts suddenly need to know what encodings are used for input and output.
I'm not sure I understand your requirement. Do you really want to plug in multiple Unicode libraries / string types? Or do you want to use multiple encodings?
Multiple encodings, generally. However, your converter<> template doesn't allow for that --- it only allows one encoding per string type.
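That is, as I understand the current shape:

    #include <string>

    // converter<> is keyed on the string type alone...
    template <typename StringType> struct converter;

    // ...so this specialization has to commit to ONE interpretation of
    // the bytes: Shift-JIS and ISO-8859-1 input both arriving as
    // std::string cannot be told apart here.
    template <> struct converter<std::string>
    {
        // conversion to the internal string type goes here
    };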
Ah, well, the converter is not even half-finished; in its current form it is tied to the string type. It certainly requires some substantial design work to be of any practical use.
Ok. I'm trying to raise issues which will affect the design.

For axemill, I decided to provide a set of conversion templates for converting between encodings. Firstly, there is the Decode template, which takes a pair of input iterators and returns the first UTF-32 character in the sequence, advancing the start iterator in the process. Secondly, there is the Encode template, which takes a single UTF-32 character and an output iterator, and writes the character to the output iterator in the appropriate encoding. These templates are then specialized for each encoding, using types as tags (so there are types axemill::encoding::ASCII, axemill::encoding::UTF8, axemill::encoding::UTF32_LE, axemill::encoding::ISO_8859_1, etc.).

Then I provide template functions convertFrom<someEncoding>(start,end) and convertFrom<someEncoding>(std::string), which convert to the internal string type, and convertFrom<someEncoding>(start,end,out), which converts an input sequence and appends it to the specified output sequence (which must be a sequence of the internal UTF-32 characters). Complementing these are the convertTo overloads, which convert an internal string to a std::string in some encoding, or convert an input range of internal UTF-32 characters, writing to an output iterator. Finally, there is a recode<inputEncoding,outputEncoding>(start,end,out) template function, which takes an input range in some encoding and writes it out in a different encoding, going through the internal UTF-32 character type in the middle.

This allows the full complement of input and output encodings to be used (provided appropriate specializations of Encode<> and Decode<> are supplied), while the main part of the library stays oblivious to all of this and just uses the internal UTF-32 string type.
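In outline, the pieces fit together something like this (heavily condensed --- only the ASCII specializations are shown, and the real signatures differ in detail):

    #include <string>

    namespace axemill
    {
        typedef unsigned int utf32_char;              // internal character type
        typedef std::basic_string<utf32_char> string; // internal string type
                                                      // (assumes a char_traits
                                                      // specialization exists)

        namespace encoding // tag types used to select specializations
        {
            struct ASCII {};
            struct UTF8 {};
            struct UTF32_LE {};
            struct ISO_8859_1 {};
        }

        // Decode<E>: read one UTF-32 character from [start,end),
        // advancing start past the units consumed.
        template <typename Encoding> struct Decode;

        template <> struct Decode<encoding::ASCII>
        {
            template <typename InputIterator>
            static utf32_char apply(InputIterator& start, InputIterator)
            {
                return static_cast<unsigned char>(*start++);
            }
        };

        // Encode<E>: write one UTF-32 character to out in encoding E.
        template <typename Encoding> struct Encode;

        template <> struct Encode<encoding::ASCII>
        {
            template <typename OutputIterator>
            static OutputIterator apply(utf32_char c, OutputIterator out)
            {
                *out++ = static_cast<char>(c); // assumes c < 0x80
                return out;
            }
        };

        // recode: pipe an input range through UTF-32 into another
        // encoding; convertFrom/convertTo are built the same way.
        template <typename InEnc, typename OutEnc,
                  typename InputIterator, typename OutputIterator>
        OutputIterator recode(InputIterator start, InputIterator end,
                              OutputIterator out)
        {
            while (start != end)
                out = Encode<OutEnc>::apply(
                          Decode<InEnc>::apply(start, end), out);
            return out;
        }
    }

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk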