
----- Original Message ----
From: Peter Dimov <pdimov@pdimov.com>
Dave Abrahams wrote:
At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference between taking/returning a string and taking/returning a utf8_t is that the latter forces people to write an explicit conversion. This penalizes people who are already in UTF-8 land, because it forces them to use utf8_t( s, encoding_utf8 ) and s.c_str( encoding_utf8 ) everywhere, without any gain or need. It's true that for people whose strings are not UTF-8, forcing those explicit conversions may be considered a good thing. So it depends on what your goals are. Do you want to promote the use of UTF-8 for all strings, or do you want to enable people to remain in non-UTF-8 land?
+1
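To make the trade-off concrete, here is a minimal sketch of what such a wrapper might look like. Only the names utf8_t and encoding_utf8 and the two call shapes come from the thread; the tag type encoding_utf8_t, the storage, and the byte_length function are assumptions made up for illustration, not anyone's proposed design.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical tag type; the thread only names the tag value `encoding_utf8`.
struct encoding_utf8_t {};
constexpr encoding_utf8_t encoding_utf8{};

// Sketch of a UTF-8 string wrapper (assumed design).
class utf8_t {
public:
    // The caller must state explicitly that `s` already holds UTF-8.
    // Note: this sketch does not validate the bytes.
    utf8_t(std::string s, encoding_utf8_t) : bytes_(std::move(s)) {}

    // The caller must state explicitly which encoding it wants back.
    const char* c_str(encoding_utf8_t) const { return bytes_.c_str(); }

private:
    std::string bytes_;
};

// The barrier: a plain std::string does not convert implicitly.
static_assert(!std::is_convertible<std::string, utf8_t>::value,
              "std::string must not convert to utf8_t implicitly");

// Hypothetical function that lives in "UTF-8 land".
std::size_t byte_length(const utf8_t& text) {
    return std::strlen(text.c_str(encoding_utf8));
}
```

A caller whose std::string s is already UTF-8 still has to write byte_length( utf8_t( s, encoding_utf8 ) ) at every boundary, which is the cost described above; a caller whose strings are in some other encoding is stopped by the compiler, which is the benefit.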
There's also the additional consideration of utf8_t's invariant. Does it require valid UTF-8? One possible specification of fopen might be:
FILE* fopen( char const* name, char const* mode );
The 'name' argument must be UTF-8 on Unicode-aware platforms and file systems such as Windows/NTFS and Mac OS X/HFS+. It can be an arbitrary byte sequence on encoding-agnostic platforms and file systems such as Linux and Solaris, but UTF-8 is recommended.
+1 as well. I would also like to add a small note about a general design principle of C++ as a language: you don't pay for what you don't use. And 95% of all uses of strings are encoding-agnostic!

Artyom
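For reference, the fopen specification quoted above could be approximated roughly as follows. This is only a sketch under assumptions: the names fopen_utf8 and widen are hypothetical, the Windows branch relies on the Win32 MultiByteToWideChar function and the CRT's _wfopen, and the non-Windows branch simply passes the bytes through unchanged, as the "arbitrary byte sequence" wording suggests.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: UTF-8 to UTF-16. Returns an empty string if the
// input is empty or is not valid UTF-8.
static std::wstring widen(const char* utf8) {
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8, -1, nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring out(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8, -1, &out[0], n);
    out.resize(static_cast<std::size_t>(n) - 1);  // drop the terminating NUL
    return out;
}
#endif

// Hypothetical name; same signature as the specification in the thread.
std::FILE* fopen_utf8(const char* name, const char* mode) {
#ifdef _WIN32
    // Unicode-aware platform: 'name' must be UTF-8, so convert it and
    // call the wide-character API.
    std::wstring wname = widen(name);
    std::wstring wmode = widen(mode);
    if (wname.empty() || wmode.empty()) return nullptr;
    return _wfopen(wname.c_str(), wmode.c_str());
#else
    // Encoding-agnostic platform: the bytes go to the system unchanged.
    return std::fopen(name, mode);
#endif
}
```

Callers who already hold UTF-8 in a char const* pay nothing extra at the call site, e.g. fopen_utf8( path, "rb" ); the conversion cost appears only on the platform that needs it.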