Re: [boost] Proposal: XML APIs in boost

1 Nov 2005

      ...
...
IMO, Unicode support goes way beyond a string template parameter. Unicode
means different character sets to support, different encoding formats,
different encoding scheme sets, and different optimization tradeoffs across
all of the above.
...
Sort of. For XML processing, the primary feature of Unicode is the extended
character set. For XML 1.0, once an XML processor has decided whether or not
a given character is whitespace, one of the special characters (such as <, >,
and &), a name start character, a name character or "other", the
peculiarities of Unicode are mostly irrelevant. Obviously, there has to be
code to handle the detection of the input encoding, and conversion to a
stream of Unicode codepoints, in order to facilitate such classification.
However, beyond that, the details don't matter.
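
To make the classification described above concrete, here is a minimal C++
sketch. The names (XmlCharClass, classify, and the is_* helpers) are
hypothetical and not taken from any proposed Boost interface. The name
character ranges follow the NameStartChar and NameChar productions of XML 1.0
Fifth Edition, which postdates this thread; earlier editions used longer
per-character tables. The "special" set is limited to the three characters
mentioned above.

```cpp
// Coarse character classes that an XML 1.0 tokenizer needs (hypothetical names).
enum class XmlCharClass { Whitespace, Special, NameStart, NameChar, Other };

// XML 1.0 production S: space, tab, carriage return, line feed.
constexpr bool is_whitespace(char32_t c) {
    return c == 0x20 || c == 0x9 || c == 0xD || c == 0xA;
}

// Only the markup characters named in the text; a real parser may track more.
constexpr bool is_special(char32_t c) {
    return c == U'<' || c == U'>' || c == U'&';
}

// XML 1.0 (Fifth Edition) NameStartChar.
constexpr bool is_name_start(char32_t c) {
    return c == U':' || c == U'_' ||
           (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') ||
           (c >= 0xC0 && c <= 0xD6) || (c >= 0xD8 && c <= 0xF6) ||
           (c >= 0xF8 && c <= 0x2FF) || (c >= 0x370 && c <= 0x37D) ||
           (c >= 0x37F && c <= 0x1FFF) || (c >= 0x200C && c <= 0x200D) ||
           (c >= 0x2070 && c <= 0x218F) || (c >= 0x2C00 && c <= 0x2FEF) ||
           (c >= 0x3001 && c <= 0xD7FF) || (c >= 0xF900 && c <= 0xFDCF) ||
           (c >= 0xFDF0 && c <= 0xFFFD) || (c >= 0x10000 && c <= 0xEFFFF);
}

// XML 1.0 (Fifth Edition) NameChar: NameStartChar plus a few extras.
constexpr bool is_name_char(char32_t c) {
    return is_name_start(c) || c == U'-' || c == U'.' ||
           (c >= U'0' && c <= U'9') || c == 0xB7 ||
           (c >= 0x300 && c <= 0x36F) || (c >= 0x203F && c <= 0x2040);
}

// Classify an already-decoded code point. NameStart is checked before
// NameChar so that letters report the more specific class.
constexpr XmlCharClass classify(char32_t c) {
    return is_whitespace(c) ? XmlCharClass::Whitespace
         : is_special(c)    ? XmlCharClass::Special
         : is_name_start(c) ? XmlCharClass::NameStart
         : is_name_char(c)  ? XmlCharClass::NameChar
                            : XmlCharClass::Other;
}
```

The function takes a char32_t, so it says nothing about how the code point was
obtained; encoding detection and decoding sit in front of it.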
I think it's more than just that.

Scenario 1: I prefer to parse documents that use only the first plane, with
UCS-2 as the encoding format and UTF-8 and UTF-16 as encoding schemes. IOW,
I will always use wchar_t and wstring.
Scenario 2: I prefer to parse documents that use only ASCII chars, with 8-bit
as the encoding format and 7-bit as the encoding scheme. IOW, I prefer to use
char and std::string, and I do not want to know about any transcoding, wide
chars, etc.
Scenario 3: I prefer to parse documents that use the whole Unicode set, with
UTF-16 as the encoding format and UTF-8 and UTF-16 as encoding schemes, and I
want the parser to be lazy. IOW, if it is a big (huge) XML document that uses
UTF-8, I do not want the parser to convert any CDATA into the native encoding
form immediately; it should wait until requested and do only the local
char-by-char conversion required for markup detection. (Essentially, I want
to limit memory usage and unnecessary work.)
Scenario 4: I prefer to parse documents that use the whole Unicode set, with
UCS-4 as the encoding format and support for a wide variety (10 or more) of
encoding schemes. I do not care that much about performance and memory usage,
but I prefer a single parser that does it all.
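
As an illustration of the lazy behaviour in Scenario 3, here is a rough C++
sketch. The class lazy_text and the helpers decode_utf8 and append_utf16 are
hypothetical names invented for this example. It assumes well-formed UTF-8
input and does no validation. For simplicity it stores a copy of the bytes; a
real parser would more likely keep a view into the input buffer.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Decode one code point from well-formed UTF-8 starting at s[i], advancing i.
// No validation: malformed input is out of scope for this sketch.
inline char32_t decode_utf8(const std::string& s, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                       // single-byte (ASCII)
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);               // payload bits of lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Append a code point as UTF-16, using a surrogate pair above U+FFFF.
inline void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Hypothetical text node: keeps the bytes exactly as they appeared in the
// document and transcodes to UTF-16 only when the caller asks for it.
class lazy_text {
public:
    explicit lazy_text(std::string raw_utf8) : raw_(std::move(raw_utf8)) {}

    // Untouched input bytes; no conversion cost.
    const std::string& raw() const { return raw_; }

    // Conversion happens here, on request, not at parse time.
    std::u16string to_utf16() const {
        std::u16string out;
        for (std::size_t i = 0; i < raw_.size();)
            append_utf16(out, decode_utf8(raw_, i));
        return out;
    }

private:
    std::string raw_;
};
```

A caller that never touches a given text node never pays for its conversion,
which is the memory and work saving Scenario 3 asks for.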

I could list a lot of different usage schemes with different tradeoffs.
Eventually this is bound to affect the XML parser interface with regard to
Unicode support. (Instead of Unicode I would prefer to use the terms charsets
and encoding scheme sets: Unicode is just one particular charset/encoding
scheme set combination.)
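
One way such tradeoffs could surface in the interface is as compile-time
policies. The following is only a sketch under assumed names
(xml_parser_config, scheme_set, accepts, and the *_scheme tags are all
hypothetical); it is not a proposal for the actual Boost API.

```cpp
#include <string>
#include <type_traits>

// Hypothetical tags for external encoding schemes a parser may accept.
struct ascii_scheme {};
struct utf8_scheme {};
struct utf16_scheme {};

// Compile-time list of accepted encoding schemes.
template <class... Schemes> struct scheme_set {};

// accepts<S, Set>::value is true when scheme S is a member of Set.
template <class S, class Set> struct accepts;
template <class S>
struct accepts<S, scheme_set<> > : std::false_type {};
template <class S, class Head, class... Tail>
struct accepts<S, scheme_set<Head, Tail...> >
    : std::integral_constant<bool,
          std::is_same<S, Head>::value ||
          accepts<S, scheme_set<Tail...> >::value> {};

// Bundles the in-memory character type, the accepted input schemes, and
// whether text content is transcoded lazily.
template <class CharT, class Schemes, bool LazyText = false>
struct xml_parser_config {
    using char_type = CharT;
    using string_type = std::basic_string<CharT>;
    using schemes = Schemes;
    static constexpr bool lazy_text = LazyText;
};

// The first three scenarios expressed as configurations.
using scenario1 =
    xml_parser_config<wchar_t, scheme_set<utf8_scheme, utf16_scheme> >;
using scenario2 = xml_parser_config<char, scheme_set<ascii_scheme> >;
using scenario3 =
    xml_parser_config<char16_t, scheme_set<utf8_scheme, utf16_scheme>, true>;
```

Scenario 4 would be the same template with char32_t and a longer scheme_set.
The point is only that the charset and the encoding scheme set are separate
knobs, not that templates are the right mechanism.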

Gennadiy

Gennadiy Rozental