
On 11/09/2010 20:34, Artyom wrote:
Ahh, I see. I do the following:
When I read, for example, 4 bytes of UTF-8 that decode to a code point > 0xFFFF, I do this:
1. I write the first surrogate of the pair to the output stream, update the state to reflect that the first part of the pair was written, and **I do not consume input**.
2. I read the same 4 UTF-8 bytes again, see that the state is marked to show that the first part of the pair was written, so I write the second surrogate and consume the input.
So do_in is actually called twice for the same input.
The code in question is in a loop that keeps going until from reaches from_end or the conversion fails (due to insufficient input or otherwise), so both surrogates should be written in the same do_in invocation.
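To make the two points above concrete, here is a minimal sketch of such a loop. It is not the actual library code: the function name utf8_to_utf16, the result enum and the plain std::uint16_t state are assumptions made for illustration, and only 4-byte UTF-8 sequences are handled to keep it short.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result codes, mirroring std::codecvt_base::ok / partial / error.
enum class result { ok, partial, error };

// Hypothetical helper, not the real facet: converts 4-byte UTF-8 sequences
// (code points > 0xFFFF) to UTF-16 surrogate pairs.
// state == 0: nothing pending; state == 1: high surrogate already written.
result utf8_to_utf16(const unsigned char* from, const unsigned char* from_end,
                     const unsigned char*& from_next,
                     char16_t* to, char16_t* to_end, char16_t*& to_next,
                     std::uint16_t& state)
{
    from_next = from;
    to_next = to;
    while (from_next != from_end) {
        if (to_next == to_end) return result::partial;         // output full
        if (from_end - from_next < 4) return result::partial;  // insufficient input
        const unsigned char* p = from_next;
        if ((p[0] & 0xF8) != 0xF0 || (p[1] & 0xC0) != 0x80 ||
            (p[2] & 0xC0) != 0x80 || (p[3] & 0xC0) != 0x80)
            return result::error;
        std::uint32_t cp = ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12) |
                           ((p[2] & 0x3Fu) << 6)  |  (p[3] & 0x3Fu);
        if (cp < 0x10000 || cp > 0x10FFFF) return result::error;
        cp -= 0x10000;
        if (state == 0) {
            // Step 1: write the high surrogate, remember it, do NOT consume input.
            *to_next++ = static_cast<char16_t>(0xD800 + (cp >> 10));
            state = 1;
        } else {
            // Step 2: same 4 bytes decoded again; write the low surrogate and consume.
            *to_next++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
            state = 0;
            from_next += 4;
        }
    }
    return result::ok;
}
```

With room for two output units, the loop goes around twice for the same 4 bytes and both surrogates come out of a single call. The second call is only needed when the output buffer has room for just one unit.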
Actually, mbstate_t is a POD type that should be initialized to 0. I must make sure that sizeof(mbstate_t) >= 2, and then I use it as temporary storage for the state.
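One possible way to express that size check and the storage, as a sketch only: read_state and write_state are hypothetical helper names, and a 16-bit state value is assumed. Copying with std::memcpy avoids casting the mbstate_t pointer to another type.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <cwchar>

// Compile-time version of the "sizeof(mbstate_t) >= 2" requirement.
static_assert(sizeof(std::mbstate_t) >= sizeof(std::uint16_t),
              "mbstate_t is too small to hold the conversion state");

// Hypothetical helper: read the 16-bit state from the first bytes of mbstate_t.
inline std::uint16_t read_state(const std::mbstate_t& s)
{
    std::uint16_t v;
    std::memcpy(&v, &s, sizeof v);
    return v;
}

// Hypothetical helper: store the 16-bit state into the first bytes of mbstate_t.
inline void write_state(std::mbstate_t& s, std::uint16_t v)
{
    std::memcpy(&s, &v, sizeof v);
}
```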
I'm not talking about that, I meant the reinterpret casting between uchar and uint_type, but actually I suppose they're the same, maybe just different signedness, so that should be somewhat ok. It's still not allowed by the strict aliasing rules though.
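A sketch of one way to avoid the pointer cast, assuming uchar and uint_type are distinct integer-like types of the same size. The aliases below are stand-ins rather than the real typedefs, and store_units is a hypothetical name. Each unit is converted by value, so no object is accessed through a pointer of the other type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins for illustration only; the real typedefs may differ.
using uchar = char16_t;
using uint_type = std::uint16_t;

// Hypothetical helper: copy code units computed as uint_type into a uchar
// buffer by converting each value, instead of reinterpret_cast on the pointer.
inline void store_units(const uint_type* src, std::size_t n, uchar* dst)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = static_cast<uchar>(src[i]);
}
```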