On 09.10.2015 17:41, Peter Dimov wrote:
Beman Dawes wrote:
IMO, a critical aspect of all of those, including UTF-8 to UTF-8, is that they detect all UTF-8 errors, since ill-formed UTF-8 is used as an attack vector.
That is what I alluded to earlier with my bikeshedding comment - I personally find this policy a bit too firm for my taste. Sure, sometimes I do want to reject any invalid UTF-8 with extreme prejudice, but at other times I do not. For instance, when I get a Windows file name, it may well be invalid UTF-16, which, when converted, becomes invalid UTF-8 but which round-trips correctly back to its original invalid UTF-16 form and refers to the same file. That's why things like CESU-8 or WTF-8 exist.
So I like the "method" argument of locale::conv::utf_to_utf, except that I think it doesn't offer enough control.
I think UTF-8 is UTF-8 (i.e. the character encoding that is described by the standard), and a tool for working with it should adhere to the specification. This includes signalling invalid code sequences instead of producing them. WTF-8 and CESU-8 are not UTF-8 but different encodings. Dealing with them should be the user's explicit choice (e.g. the user should write utf16_to_wtf8 instead of utf16_to_utf8).