On 4/25/24 18:48, Peter Dimov wrote:
> Andrey Semashev wrote:
>> The right way to not deal with these issues is to simply not take
>> wide strings in the first place. This forces the user to supply "the
>> canonical octet representation".
>>
>> Since we do take wide strings, we have implicitly accepted the
>> responsibility to produce the canonical octet representation for
>> them. And inserting zeroes randomly is simply wrong.
> Ok, so maybe we should simply deprecate the support for wide string
> inputs?
That's one possible way to deal with it, yes.
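To make the point about octet representation concrete, here is a minimal sketch. It assumes name_generator_sha1's raw-buffer overload, operator()(void const*, std::size_t): the same ASCII text hashed once as narrow bytes and once as raw wchar_t bytes yields two different UUIDs, because the wide representation pads every character with zero bytes.

    #include <boost/uuid/name_generator_sha1.hpp>
    #include <boost/uuid/string_generator.hpp>
    #include <boost/uuid/uuid.hpp>
    #include <boost/uuid/uuid_io.hpp>
    #include <cstring>
    #include <iostream>

    int main()
    {
        boost::uuids::string_generator sg;
        // RFC 4122 DNS namespace UUID
        boost::uuids::uuid ns = sg("6ba7b810-9dad-11d1-80b4-00c04fd430c8");
        boost::uuids::name_generator_sha1 gen(ns);

        char const narrow[] = "www.example.com";
        wchar_t const wide[] = L"www.example.com";

        // Hash the narrow text exactly as its bytes appear.
        boost::uuids::uuid u1 = gen(narrow, std::strlen(narrow));

        // Hash the raw bytes of the wide text; on a typical platform
        // every ASCII character drags one or three zero bytes with it.
        boost::uuids::uuid u2 = gen(wide, sizeof(wide) - sizeof(wchar_t));

        std::cout << u1 << "\n" << u2 << "\n"; // two different UUIDs
    }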
> Although I think that for char16_t and char32_t inputs the canonical
> representation is unambiguous.
If you mean that char16_t and char32_t strings will still be converted
to UTF-8 internally, then you still have the issue of incorrect
UTF-16/32 strings on input. I think the only input character type we
should allow is char, and we should not care what encoding it uses or
whether it is valid. IOW, we should take the input as an opaque
sequence of bytes.
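For illustration, a minimal sketch of the validity problem with UTF-16 input (decode_one is a hypothetical helper, not part of Boost.Uuid): a char16_t sequence can contain a lone surrogate, which has no UTF-8 encoding at all, so a generator that converts internally must pick an error policy the caller never chose.

    #include <cstdint>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: decode one code point from UTF-16,
    // rejecting lone surrogates.
    std::uint32_t decode_one(std::u16string const& s, std::size_t& i)
    {
        std::uint32_t c = s[i++];
        if (c >= 0xD800 && c <= 0xDBFF) // high surrogate: needs a low one
        {
            if (i >= s.size() || s[i] < 0xDC00 || s[i] > 0xDFFF)
                throw std::invalid_argument("lone high surrogate");
            c = 0x10000 + ((c - 0xD800) << 10) + (s[i++] - 0xDC00);
        }
        else if (c >= 0xDC00 && c <= 0xDFFF) // low surrogate, no high one
        {
            throw std::invalid_argument("lone low surrogate");
        }
        return c;
    }

    int main()
    {
        std::u16string bad(1, char16_t(0xD800)); // invalid UTF-16 input
        std::size_t i = 0;
        try
        {
            decode_one(bad, i);
        }
        catch (std::exception const& e)
        {
            // A generator that converts internally has to make this
            // choice (throw? substitute U+FFFD? skip?) for the caller.
            std::cout << "invalid input: " << e.what() << "\n";
        }
    }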
> This leaves wchar_t, and while nobody on POSIX will shed a tear,
> Windows users will probably be disappointed if we take that away.
Well, we do provide libraries for character encoding conversion, so
users are free to use those. Sure, it adds a bit of work, but not that
much. And I doubt the name generator is very popular.
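As a sketch of that user-side workflow, assuming Boost.Locale's utf_to_utf as the conversion facility: convert the wide string to UTF-8 yourself, with an error policy of your choosing, then feed the narrow result to the name generator.

    #include <boost/locale/encoding_utf.hpp>
    #include <boost/uuid/name_generator_sha1.hpp>
    #include <boost/uuid/string_generator.hpp>
    #include <boost/uuid/uuid.hpp>
    #include <boost/uuid/uuid_io.hpp>
    #include <iostream>
    #include <string>

    int main()
    {
        std::wstring wide = L"www.example.com";

        // Transcode to UTF-8 up front; conv::stop makes malformed
        // input throw instead of being silently dropped.
        std::string utf8 = boost::locale::conv::utf_to_utf<char>(
            wide, boost::locale::conv::stop);

        boost::uuids::string_generator sg;
        boost::uuids::name_generator_sha1 gen(
            sg("6ba7b810-9dad-11d1-80b4-00c04fd430c8")); // RFC 4122 DNS namespace

        std::cout << gen(utf8) << "\n";
    }

This way the error policy for invalid input is chosen explicitly at the call site, and the generator itself only ever sees the canonical octet sequence.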