
Stefan Seefeld <seefeld@sympatico.ca> writes:
Anthony Williams wrote:
Assume I know the encoding and character type I wish to use as input. In order to specialize converter<> for my string type, I need to know what encoding and character type the library is using. If the encoding and character type are not specified in the API, but are instead open to the whims of the backend, I cannot write my conversion code.
Ah, I think I understand what you mean by 'character type'. Yes, you are right. The code as I posted it to the vault is missing the bits that enable users to write converters without knowing backend-specific details. However, some 'dom::char_trait' should be enough, right?
Yes and no. Suppose my incoming data is a stream of 8-bit "characters" using Shift-JIS encoding. I need to write a converter that turns this into whatever encoding the XML API accepts, so I need to know that encoding when I write the converter --- if the API is expecting UTF-16 stored in a string of unsigned shorts, my converter is going to be quite different than if the API is expecting UTF-8 stored in a string of unsigned chars. I also need to know how to construct the final string --- whether I need to provide a boost::xml::char_type*, construct a boost::xml::string_type from a pair of iterators, or do something else.
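To make that concrete, a rough sketch (the convert_* names and typedefs here are invented for illustration, and a char_traits specialization is assumed for the non-char element types):

    #include <string>

    // If the API expects UTF-16 stored in a string of unsigned shorts,
    // I write one converter...
    typedef std::basic_string<unsigned short> api_utf16_string;
    api_utf16_string convert_sjis_to_utf16(char const* first, char const* last);

    // ...and if it expects UTF-8 stored in a string of unsigned chars,
    // I write a quite different one.
    typedef std::basic_string<unsigned char> api_utf8_string;
    api_utf8_string convert_sjis_to_utf8(char const* first, char const* last);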
I would suggest that the API accepts input in UTF-8, UTF-16 and UTF-32. The user then has to supply a conversion function from their encoding to one of these, and the library converts internally if the one they choose is not the "correct" one.
It already does. libxml2 provides conversion functions; I need to hook them into such an 'xml char trait'.
I don't understand how your response ties in with my comment, so I'll try again.
I was suggesting that we have overloads like:
    node::append_element(utf8_string_type);
    node::append_element(utf16_string_type);
    node::append_element(utf32_string_type);
With two of them (but unspecified which two) converting to the correct internal encoding.
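A sketch of what I mean (hypothetical typedefs and helpers; suppose for the sake of argument that the internal encoding is UTF-8 --- the user need not know which overload is the direct one):

    #include <string>

    typedef std::basic_string<unsigned char>  utf8_string_type;
    typedef std::basic_string<unsigned short> utf16_string_type;
    typedef std::basic_string<unsigned int>   utf32_string_type;

    utf8_string_type convert_utf16_to_utf8(utf16_string_type const&);
    utf8_string_type convert_utf32_to_utf8(utf32_string_type const&);

    class node
    {
    public:
        // matches the internal encoding: stored directly
        void append_element(utf8_string_type const& name);

        // the other two convert on the way in
        void append_element(utf16_string_type const& name)
        {
            append_element(convert_utf16_to_utf8(name));
        }
        void append_element(utf32_string_type const& name)
        {
            append_element(convert_utf32_to_utf8(name));
        }
    };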
Oh, but that multiplies quite a chunk of the API by four!
What's the fourth option? Yes, I agree it multiplies the API, but for the convenience of users.
Typically, a Unicode library provides converter functions, so what advantage would such a rich interface have over asking the user to do the conversion before calling into the XML library?
It avoids the user doing any conversion in many cases.
If the internal storage encoding is a compile-time constant that can be queried from the proposed dom::char_trait, it should be simple for users to decide how to write the converter, and in particular, how to pass strings in the most efficient way.
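Something like this, perhaps (names invented purely for illustration):

    namespace dom
    {
        enum encoding { utf8, utf16, utf32 };

        struct char_trait
        {
            typedef unsigned char char_type;                // backend-specific
            static encoding const internal_encoding = utf8; // queryable at
                                                            // compile time
        };
    }

A converter author could then dispatch on dom::char_trait::internal_encoding to pick the appropriate conversion.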
If the encoding is only available as a compile-time constant, that won't help me write a converter. I need it available as a software-writing-time constant for that (i.e. specified in the documentation). If you don't want to fix the encoding in the docs, maybe we should require that the user supply conversions to each of UTF-8, UTF-16 and UTF-32, and the library will use whichever is most convenient.
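i.e. something along these lines (hypothetical signatures; char_traits specializations assumed for the non-char element types):

    #include <string>

    // The user specializes this for their string type, supplying all
    // three conversions; the library calls whichever one matches its
    // internal encoding.
    template <typename StringType>
    struct converter
    {
        static std::basic_string<unsigned char>  to_utf8(StringType const&);
        static std::basic_string<unsigned short> to_utf16(StringType const&);
        static std::basic_string<unsigned int>   to_utf32(StringType const&);
    };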
If I specify the conversions to use directly on the input and output, then I can cleanly separate my application into three layers --- process the input and build the DOM in the internal encoding; process the DOM as necessary; display the results to the user.
If the string type and encoding are inherently part of the DOM types, this is not so simple.
I still don't understand what you have in mind: are you thinking of using two separate Unicode libraries / string types for input and output? Again, Unicode libraries should provide encoding conversion, if all you want is to use distinct encodings.
I may not understand the details well enough, but asking for the API to integrate the string conversions as you seem to be doing sounds exactly like what you accused me of doing: premature optimization. ;-)
It seems I am failing to communicate my thoughts correctly, since optimization is certainly far from my mind. It is separation of concerns that I am currently thinking about.

In the input layer of an application, you need to deal with all the variety of encodings that the user might supply. I'm quite happy to use a single Unicode library to deal with the conversions, but I can imagine having to deal with numerous external encodings. I would like the rest of the application to have no need to know about the complications of the input handling, and the variety of encodings used --- provided I get a set of DOM objects from somewhere, the rest of the application shouldn't care. Once the input has been handled, and the DOM built, there might be additional input in terms of XPath expressions, or element names, which might be in yet another encoding. Again, the choice of input encoding here should have no impact on the rest of the application.

With the current design, the whole API is tied to a single external string type, with a single converter function for converting to the internal string type. This implies that if you wish to use different encodings, you need a different external string type, and therefore you end up with different template instantiations for different encodings, and my nicely separated application parts suddenly need to know what encodings are used for input and output.
I'm not sure I understand your requirement. Do you really want to plug in multiple Unicode libraries / string types? Or do you want to use multiple encodings?
Multiple encodings, generally. However, your converter<> template doesn't allow for that --- it only allows one encoding per string type.
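That is, as I understand the current shape:

    #include <string>

    // converter<> is keyed on the string type alone...
    template <typename StringType> struct converter;

    // ...so this specialization has to commit to ONE interpretation of
    // the bytes: Shift-JIS and ISO-8859-1 input both arriving as
    // std::string cannot be told apart here.
    template <> struct converter<std::string>
    {
        // conversion to the internal string type goes here
    };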
Ah, well, the converter is not even half-finished; in its current form it is tied to the string type. It certainly requires some substantial design work to be of any practical use.
Ok. I'm trying to raise issues which will affect the design.

For axemill, I decided to provide a set of conversion templates for converting between encodings. Firstly, there is the Decode template, which takes a pair of input iterators and returns the first UTF-32 character in the sequence, advancing the start iterator in the process. Secondly, there is the Encode template, which takes a single UTF-32 character and an output iterator, and writes the character to the output iterator in the appropriate encoding. These templates are then specialized for each encoding, using types as tags (so there are types axemill::encoding::ASCII, axemill::encoding::UTF8, axemill::encoding::UTF32_LE, axemill::encoding::ISO_8859_1, etc.).

Then I provide template functions convertFrom<someEncoding>(start,end) and convertFrom<someEncoding>(std::string), which convert to the internal string type, and convertFrom<someEncoding>(start,end,out), which converts an input sequence and appends it to the specified output sequence (which must be a sequence of the internal UTF-32 characters). Complementing these are the convertTo overloads, which convert an internal string to a std::string in some encoding, or convert an input range of internal UTF-32 characters, writing to an output iterator. Finally, there is a recode<inputEncoding,outputEncoding>(start,end,out) template function, which takes an input range in some encoding and writes it out in a different encoding, going through the internal UTF-32 character type in the middle.

This allows the full complement of input and output encodings to be used (provided appropriate specializations of Encode<> and Decode<> are supplied), while the main part of the library stays oblivious to all of this and just uses the internal UTF-32 string type.
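In outline, the pieces fit together something like this (heavily condensed --- only the ASCII specializations are shown, and the real signatures differ in detail):

    #include <string>

    namespace axemill
    {
        typedef unsigned int utf32_char;              // internal character type
        typedef std::basic_string<utf32_char> string; // internal string type
                                                      // (assumes a char_traits
                                                      // specialization exists)

        namespace encoding // tag types used to select specializations
        {
            struct ASCII {};
            struct UTF8 {};
            struct UTF32_LE {};
            struct ISO_8859_1 {};
        }

        // Decode<E>: read one UTF-32 character from [start,end),
        // advancing start past the units consumed.
        template <typename Encoding> struct Decode;

        template <> struct Decode<encoding::ASCII>
        {
            template <typename InputIterator>
            static utf32_char apply(InputIterator& start, InputIterator)
            {
                return static_cast<unsigned char>(*start++);
            }
        };

        // Encode<E>: write one UTF-32 character to out in encoding E.
        template <typename Encoding> struct Encode;

        template <> struct Encode<encoding::ASCII>
        {
            template <typename OutputIterator>
            static OutputIterator apply(utf32_char c, OutputIterator out)
            {
                *out++ = static_cast<char>(c); // assumes c < 0x80
                return out;
            }
        };

        // recode: pipe an input range through UTF-32 into another
        // encoding; convertFrom/convertTo are built the same way.
        template <typename InEnc, typename OutEnc,
                  typename InputIterator, typename OutputIterator>
        OutputIterator recode(InputIterator start, InputIterator end,
                              OutputIterator out)
        {
            while (start != end)
                out = Encode<OutEnc>::apply(
                          Decode<InEnc>::apply(start, end), out);
            return out;
        }
    }

Anthony
--
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk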