
On Mon, Jan 17, 2011 at 7:33 PM, Dave Abrahams <dave@boostpro.com> wrote:
On Mon, Jan 17, 2011 at 10:09 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Alexander Lamaison wrote:
I don't understand how it could possibly not help. If I see an API function call_me(std::string arg), I know next to nothing about what it's expecting from the string (except that, by convention, it tends to mean 'string in OS-default encoding').
You should read the documentation of call_me (*). Yes, I know that in the real world the documentation often doesn't specify an encoding (worse - the encoding varies between platforms and even versions of the same library), but if the developer of call_me hasn't bothered to document the encoding of the argument, he won't bother to use a special UTF-8 type for the argument, either. :-)
(*) And the documentation should either say that call_me accepts UTF-8, or that call_me is encoding-agnostic, that is, it treats the string as a byte sequence.
I can think of one reason to use a separate type - if you want to overload on encoding:
    void f( latin1_t arg );
    void f( utf8_t arg );
In most such cases that spring to mind, however, what the user actually wants is:
void f( string arg, encoding_t enc );
or even
void f( string arg, string encoding );
In principle, as Chad Nelson says, it's useful to have separate types if the program uses several different encodings at once, fixed at compile time. I don't consider such a way of programming a good idea though. Strings should be either byte sequences or UTF-8; input can be of any encoding, possibly not known until runtime, but it should always be either processed as a byte sequence or converted to UTF-8 as a first step.
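A minimal sketch of that convert-once-at-the-boundary policy (the Latin-1 case and the function name are purely illustrative; real code would dispatch on the runtime-supplied encoding name and delegate to iconv, ICU, or a similar facility):

    #include <string>

    // Illustrative only: widen ISO-8859-1 (Latin-1) bytes to UTF-8.
    std::string latin1_to_utf8(const std::string& bytes)
    {
        std::string out;
        out.reserve(bytes.size());
        for (std::string::size_type i = 0; i < bytes.size(); ++i)
        {
            unsigned char c = static_cast<unsigned char>(bytes[i]);
            if (c < 0x80)
            {
                out += static_cast<char>(c);                  // ASCII maps to itself
            }
            else
            {
                out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
                out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
            }
        }
        return out;
    }

    // Input of whatever encoding is converted once, at the boundary;
    // everything past this point deals exclusively in UTF-8 std::strings.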
DISCLAIMER: I have almost no experience with the details of this stuff. I only know a few general things about programming (fewer every day).
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on UTF-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in UTF-8 land. The typical failures I've seen where there is no such mechanism (e.g. in Python, where there's no static typing) occur because programmers lose track of whether what they're handling is encoded as UTF-8 or not.
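A minimal sketch of such a barrier (utf8_t and the surrounding names are hypothetical, purely for illustration, not an existing Boost interface):

    #include <string>

    // Holding a utf8_t means "this text has already been converted/checked";
    // construction from raw bytes is explicit, so every boundary crossing
    // where a conversion might be needed shows up as a compile error until
    // the programmer says how the bytes are to be interpreted.
    class utf8_t
    {
    public:
        explicit utf8_t(const std::string& bytes) : data_(bytes)
        {
            // a real implementation would validate or transcode here
        }
        const std::string& bytes() const { return data_; }
    private:
        std::string data_;
    };

    // Internal code traffics only in utf8_t:
    void render_caption(const utf8_t& text);

    // Encoding-agnostic or OS-facing code hands back plain bytes:
    std::string query_window_title();

    // render_caption(query_window_title());          // error: no implicit conversion
    // render_caption(utf8_t(query_window_title()));  // the conversion point is visible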
UTF-8 allows the use of char * for type erasure for strings, much like void * allows that in general. Using C++ type tags to discriminate between different data pointed to by void pointers is mostly redundant except when type safety is postponed until run time; and that's only marginally safer than using string tags.
Emil Dotchevski
Reverge Studios, Inc.
http://revergestudios.com/reblog/index.php?n=ReCode.ReCode