
Dave Abrahams wrote:
At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference between taking/returning a string and taking/returning a utf8_t is that the latter forces people to write an explicit conversion. This penalizes people who are already in UTF-8 land, because it forces them to use utf8_t( s, encoding_utf8 ) and s.c_str( encoding_utf8 ) everywhere, without any gain or need.

It's true that for people whose strings are not UTF-8, forcing those explicit conversions may be considered a good thing. So it depends on what your goals are. Do you want to promote the use of UTF-8 for all strings, or do you want to enable people to remain in non-UTF-8-land?
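To make the call-site cost concrete, here is a rough sketch of the two interface styles side by side. The names utf8_t and encoding_utf8 come from the discussion above, but everything about how they are defined here (the tag type, the constructor, the c_str overload, the greet_* functions) is an assumption made up for illustration, not a proposed or existing interface.

```cpp
#include <cassert>
#include <string>

// Hypothetical tag type; the thread only mentions the name 'encoding_utf8'.
struct encoding_utf8_t {};
constexpr encoding_utf8_t encoding_utf8{};

// Hypothetical wrapper. It stores the bytes unchanged and performs no
// validation; it exists only to make the conversion explicit.
class utf8_t {
public:
    // No implicit conversion from std::string: the caller must name the encoding.
    utf8_t(std::string const& s, encoding_utf8_t) : bytes_(s) {}

    // Getting the bytes back out also requires naming the encoding.
    char const* c_str(encoding_utf8_t) const { return bytes_.c_str(); }

private:
    std::string bytes_;
};

// Style A: plain strings, UTF-8 by convention only.
std::string greet_plain(std::string const& name) {
    return "hello, " + name;
}

// Style B: the same operation behind the type barrier.
utf8_t greet_typed(utf8_t const& name) {
    return utf8_t(std::string("hello, ") + name.c_str(encoding_utf8),
                  encoding_utf8);
}
```

A caller whose std::string s already holds UTF-8 writes greet_plain( s ) in the first style, but greet_typed( utf8_t( s, encoding_utf8 ) ).c_str( encoding_utf8 ) in the second. Both produce the same bytes; the second only adds the ceremony.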
It's a bit like defining a separate integer type for nonnegative ints for type safety reasons - useful in theory, but nobody does it.
I refer you to Boost.Units
I'm sure that there are many libraries that use units in their interfaces, I just haven't heard of them. :-)

There's also the additional consideration of utf8_t's invariant. Does it require valid UTF-8? One possible specification of fopen might be:

    FILE* fopen( char const* name, char const* mode );

The 'name' argument must be UTF-8 on Unicode-aware platforms and file systems such as Windows/NTFS and Mac OS X/HFS+. It can be an arbitrary byte sequence on encoding-agnostic platforms and file systems such as Linux and Solaris, but UTF-8 is recommended. On Windows, the UTF-8 sequence may be invalid due to the presence of UTF-16 surrogates encoded as single code points, but such use is discouraged.
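A minimal sketch of how such an fopen might be implemented, under stated assumptions: utf8_fopen and widen_utf8 are hypothetical names, the Windows branch uses the Win32 MultiByteToWideChar and the CRT _wfopen, and every other platform passes the bytes straight through. This sketch does not attempt the lenient handling of encoded surrogates mentioned above; MultiByteToWideChar applies its own rules to invalid input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Convert a NUL-terminated UTF-8 string to UTF-16.
// Returns an empty string if the conversion fails.
static std::wstring widen_utf8(char const* s) {
    // With a length of -1 the returned count includes the terminating NUL.
    int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s, -1, &w[0], n);
    w.resize(static_cast<std::size_t>(n) - 1); // drop the counted NUL
    return w;
}
#endif

// Hypothetical wrapper: 'name' is taken as UTF-8 where the platform
// is Unicode-aware, and as an opaque byte sequence elsewhere.
FILE* utf8_fopen(char const* name, char const* mode) {
#ifdef _WIN32
    std::wstring wname = widen_utf8(name);
    std::wstring wmode = widen_utf8(mode);
    if (wname.empty() || wmode.empty()) return nullptr;
    return _wfopen(wname.c_str(), wmode.c_str());
#else
    // Encoding-agnostic file systems: hand the bytes over unchanged.
    return std::fopen(name, mode);
#endif
}
```

Note that the interface takes a plain char const*, so a caller already holding UTF-8 needs no conversion at the call site; the only place that knows about UTF-16 is the Windows branch.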