
Alexander Lamaison wrote:
I don't understand how it could possibly not help. If I see an API function call_me(std::string arg), I know next to nothing about what it's expecting from the string (except that, by convention, it tends to mean 'string in OS-default encoding').
You should read the documentation of call_me (*). Yes, I know that in the real world the documentation often doesn't specify an encoding (worse - the encoding varies between platforms and even versions of the same library), but if the developer of call_me hasn't bothered to document the encoding of the argument, he won't bother to use a special UTF-8 type for the argument, either. :-)

(*) And the documentation should either say that call_me accepts UTF-8, or that call_me is encoding-agnostic, that is, it treats the string as a byte sequence.

I can think of one reason to use a separate type - if you want to overload on encoding:

    void f( latin1_t arg );
    void f( utf8_t arg );

In most such cases that spring to mind, however, what the user actually wants is:

    void f( string arg, encoding_t enc );

or even

    void f( string arg, string encoding );

In principle, as Chad Nelson says, it's useful to have separate types if the program uses several different encodings at once, fixed at compile time. I don't consider such a way of programming a good idea, though. Strings should be either byte sequences or UTF-8; input can be of any encoding, possibly not known until runtime, but it should always be either processed as a byte sequence or converted to UTF-8 as a first step.

Regarding the OS-default encoding: if, on Windows, you ever encounter or create a string in the OS-default encoding, you've already lost - this code can't be correct. :-)