
Tilman Kuepper <kuepper <at> xgraphic.de> writes:
Hello world,
Hi Tilman,
I took a closer look at the UTF-8 codecvt facet which is part of the program_options library. A test program is attached.
The last assert (in the Read-function) fails with g++ (GCC) 3.3.3 (Debian 20040429).
After some debugging I think I found the problem:
Could you clarify where the problem is? Does it break program_options, or does it break some use of UTF-8 that you make?
The function utf8_codecvt_facet_wchar_t::do_in() converts only valid (complete) UTF-8 sequences into internal (wchar_t) characters. In case the input buffer ends with an incomplete UTF-8 character, do_in() returns codecvt_base::partial and points from_next at the beginning of this incomplete UTF-8 sequence.
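To make the position of from_next concrete, here is a minimal sketch, not the facet's actual code. Both function names are hypothetical, and the sketch only inspects lead bytes (it assumes continuation bytes are well formed). It computes how many leading bytes of a buffer form complete UTF-8 sequences, which is where from_next would end up under the behaviour described above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of program_options or libstdc++.
// Returns the total length in bytes of the sequence introduced by a
// UTF-8 lead byte, or 0 if the byte cannot start a sequence.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or invalid byte
}

// Hypothetical helper: number of leading bytes of buf that form complete
// sequences. Assumption: continuation bytes are not validated here.
inline std::size_t complete_prefix_length(const std::string& buf) {
    std::size_t pos = 0;
    while (pos < buf.size()) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(buf[pos]));
        // Stop at an invalid lead byte or at a sequence cut off by the
        // end of the buffer.
        if (len == 0 || pos + len > buf.size()) break;
        pos += len;
    }
    return pos;
}
```

For a buffer holding "a", a two-byte character and the first two bytes of a three-byte character, the result is 3: the last two bytes stay unconsumed until more input arrives.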
Oh... this 'partial' is a messy thing. I think I thought it meant 'partial character found', but later figured out it means something different. I think I even fixed a bug with an incorrectly returned 'partial' in that facet some time ago.
Obviously the library (libstdc++) is surprised by the fact that the codecvt facet stops the translation, although there is still room in the output buffer (i.e. to_next != to_end) and not all input characters have been processed (from_next != from_end).
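The pointer conditions in that paragraph can be written down as a small decision function. This is only a sketch of one possible reading of 'partial', not a statement of what libstdc++ does or what the standard requires. NextStep and classify are hypothetical names; only std::codecvt_base::result and its enumerators come from the standard library.

```cpp
#include <locale>

// Hypothetical labels for what a caller might do after do_in() returns.
enum class NextStep { Finished, RefillInput, DrainOutput, Fail };

// Hypothetical helper: interprets a do_in() result using the two pointer
// pairs. Assumption: 'partial' may mean either "output buffer full" or
// "input ends in an incomplete sequence".
inline NextStep classify(std::codecvt_base::result r,
                         const char* from_next, const char* from_end,
                         const wchar_t* to_next, const wchar_t* to_end) {
    switch (r) {
    case std::codecvt_base::ok:
    case std::codecvt_base::noconv:
        return NextStep::Finished;
    case std::codecvt_base::partial:
        // No room left for output: empty the output buffer and call again.
        if (to_next == to_end) return NextStep::DrainOutput;
        // Room left but input remains: the tail is presumably an
        // incomplete sequence, so keep it and read more bytes.
        if (from_next != from_end) return NextStep::RefillInput;
        // Assumption: nothing left to convert is treated as finished.
        return NextStep::Finished;
    default:
        return NextStep::Fail;  // codecvt_base::error
    }
}
```

The case reported here (to_next != to_end and from_next != from_end) falls into RefillInput under this reading. Whether the library is obliged to handle it that way is exactly the open question.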
As a consequence the for-loop in the test program stops too early (wifstream not "good" any longer) and assert(pos == wstr.size()) fails.
Is this a known issue with the GNU library or with the UTF-8 conversion facet? And what can be done?
Unless somebody else can shed some light, there are two choices:
1. You can wait until I'm back from vacation.
2. You can figure out the exact meaning of 'partial' and send a patch.

Thanks,
Volodya