On Tue, Jan 7, 2020 at 5:16 PM Peter Dimov via Boost wrote:
Gavin Lambert wrote:
But the conversion from WTF-8 to UTF-16 can interpret the joining point as a different character, resulting in a different sequence. Unless I've misread something, this could occur if the first string ended in an unpaired high surrogate and the second started with an unpaired low surrogate (or rather the WTF-8 equivalents thereof).
I don't see why you think this would present a problem. The conversion of the first string will end in an unpaired high surrogate. The conversion of the second string will start with an unpaired low surrogate. The two, when concatenated, will form a valid UTF-16 encoding of a non-BMP character. Where is the issue here?
That's my point essentially. However, Gavin refers to the fact that the current WTF-8 spec explicitly says that encoding a high/low surrogate pair as two separate three-byte sequences is invalid in WTF-8. For example, the UTF-16 sequence d83d de09 should be encoded as the WTF-8 bytes f0 9f 98 89. But if one "UTF-16" string ended in d83d and the other started with de09, concatenating their WTF-8 forms would yield the "invalid WTF-8" bytes ed a0 bd ed b8 89, which the spec explicitly prohibits.

The rationale behind this is to have a unique representation of any "UTF-16" stream, just as UTF-8 requires the shortest representation. That may matter for security reasons if you are going to compare those "invalid WTF-8" strings, but it is not an issue if the next thing you do is convert them back to UTF-16.

--
Yakov Galka
http://stannum.co.il/
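[Editor's note: a minimal, self-contained sketch of the byte sequences discussed above; this is not Boost or Boost.Nowide code, and the helper names are invented for illustration. It encodes each unpaired surrogate as a generalized-UTF-8 three-byte sequence, producing the forbidden ed a0 bd ed b8 89 form, then decodes those bytes back to the UTF-16 code units d83d de09, showing why the round trip to UTF-16 is unaffected.]

    // Sketch only: hypothetical helpers, not part of any Boost library.
    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Encode one 16-bit code unit in U+0800..U+FFFF -- including an
    // unpaired surrogate -- as a generalized-UTF-8 three-byte sequence.
    std::string wtf8_encode_unit(std::uint16_t u) {
        std::string out;
        out.push_back(static_cast<char>(0xE0 | (u >> 12)));
        out.push_back(static_cast<char>(0x80 | ((u >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        return out;
    }

    // Decode a three-byte sequence back to its 16-bit code unit.
    std::uint16_t wtf8_decode_3byte(const unsigned char* b) {
        return static_cast<std::uint16_t>(
            ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F));
    }

    void dump(const std::string& s) {
        for (unsigned char c : s) std::printf("%02x ", c);
        std::printf("\n");
    }

    int main() {
        // Correct WTF-8 (identical to UTF-8 here) for U+1F609,
        // whose UTF-16 form is the surrogate pair d83d de09:
        dump("\xf0\x9f\x98\x89");                        // f0 9f 98 89

        // Converting each surrogate on its own, as happens when two strings
        // are concatenated at the surrogate boundary, yields the byte
        // sequence the WTF-8 spec forbids:
        std::string joined =
            wtf8_encode_unit(0xD83D) + wtf8_encode_unit(0xDE09);
        dump(joined);                                    // ed a0 bd ed b8 89

        // Decoding those bytes back to UTF-16 code units recovers the valid
        // surrogate pair, so the round trip through UTF-16 is unaffected:
        const unsigned char* p =
            reinterpret_cast<const unsigned char*>(joined.data());
        std::printf("%04x %04x\n",
                    static_cast<unsigned>(wtf8_decode_3byte(p)),
                    static_cast<unsigned>(wtf8_decode_3byte(p + 3)));
        // prints: d83d de09
    }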