
On Mon, Jan 17, 2011 at 7:33 PM, Dave Abrahams <dave@boostpro.com> wrote:
On Mon, Jan 17, 2011 at 10:09 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Alexander Lamaison wrote:
I don't understand how it could possibly not help. If I see an API function call_me(std::string arg), I know next to nothing about what it's expecting from the string (except that, by convention, it tends to mean 'string in OS-default encoding').
You should read the documentation of call_me (*). Yes, I know that in the real world the documentation often doesn't specify an encoding (worse - the encoding varies between platforms and even versions of the same library), but if the developer of call_me hasn't bothered to document the encoding of the argument, he won't bother to use a special UTF-8 type for the argument, either. :-)
(*) And the documentation should either say that call_me accepts UTF-8, or that call_me is encoding-agnostic, that is, it treats the string as a byte sequence.
I can think of one reason to use a separate type - if you want to overload on encoding:
    void f( latin1_t arg );
    void f( utf8_t arg );
In most such cases that spring to mind, however, what the user actually wants is:
void f( string arg, encoding_t enc );
or even
void f( string arg, string encoding );
In principle, as Chad Nelson says, it's useful to have separate types if the program uses several different encodings at once, fixed at compile time. I don't consider such a way of programming a good idea though. Strings should be either byte sequences or UTF-8; input can be of any encoding, possibly not known until runtime, but it should always be either processed as a byte sequence or converted to UTF-8 as a first step.
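A minimal sketch of that convert-once-at-the-boundary policy (the Latin-1 case and the function name are purely illustrative; real code would dispatch on the runtime-supplied encoding name and delegate to iconv, ICU, or a similar facility):

    #include <string>

    // Illustrative only: widen ISO-8859-1 (Latin-1) bytes to UTF-8.
    std::string latin1_to_utf8(const std::string& bytes)
    {
        std::string out;
        out.reserve(bytes.size());
        for (std::string::size_type i = 0; i < bytes.size(); ++i)
        {
            unsigned char c = static_cast<unsigned char>(bytes[i]);
            if (c < 0x80)
            {
                out += static_cast<char>(c);                  // ASCII maps to itself
            }
            else
            {
                out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
                out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
            }
        }
        return out;
    }

    // Input of whatever encoding is converted once, at the boundary;
    // everything past this point deals exclusively in UTF-8 std::strings.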
DISCLAIMER: I have almost no experience with the details of this stuff. I only know a few general things about programming (fewer every day).
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on UTF-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in UTF-8 land. The typical failures I've seen where there is no such mechanism (e.g. in Python, where there's no static typing) occur because programmers lose track of whether what they're handling is encoded as UTF-8 or not.
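A minimal sketch of such a barrier (utf8_t and the surrounding names are hypothetical, purely for illustration, not an existing Boost interface):

    #include <string>

    // Holding a utf8_t means "this text has already been converted/checked";
    // construction from raw bytes is explicit, so every boundary crossing
    // where a conversion might be needed shows up as a compile error until
    // the programmer says how the bytes are to be interpreted.
    class utf8_t
    {
    public:
        explicit utf8_t(const std::string& bytes) : data_(bytes)
        {
            // a real implementation would validate or transcode here
        }
        const std::string& bytes() const { return data_; }
    private:
        std::string data_;
    };

    // Internal code traffics only in utf8_t:
    void render_caption(const utf8_t& text);

    // Encoding-agnostic or OS-facing code hands back plain bytes:
    std::string query_window_title();

    // render_caption(query_window_title());          // error: no implicit conversion
    // render_caption(utf8_t(query_window_title()));  // the conversion point is visible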
UTF-8 allows the use of char * for type erasure for strings, much like void * allows that in general. Using C++ type tags to discriminate between different data pointed to by void pointers is mostly redundant except when type safety is postponed until run time; and that's only marginally safer than using string tags.
Emil Dotchevski
Reverge Studios, Inc.
http://revergestudios.com/reblog/index.php?n=ReCode.ReCode