[program_options / g++ 3.3.3] Problem with UTF-8 codecvt facet

Hello world,
I took a closer look at the UTF-8 codecvt facet which is part of the program_options library. A test program is attached. The last assert (in the Read-function) fails with g++ (GCC) 3.3.3 (Debian 20040429).
After some debugging I think I found the problem: The function utf8_codecvt_facet_wchar_t::do_in() converts only valid (complete) UTF-8 sequences into internal (wchar_t) characters. In case the input buffer ends with an incomplete UTF-8 character, do_in() returns codecvt_base::partial and points from_next at the beginning of this incomplete UTF-8 sequence.
Obviously the library (libstdc++) is surprised by the fact that the codecvt facet stops the translation, although there is still room in the output buffer (i. e. to_next != to_end) and not all input characters have been processed (from_next != from_end).
As a consequence the for-loop in the test program stops too early (wifstream not "good" any longer) and assert(pos == wstr.size()) fails.
Is this a known issue with the GNU library or with the UTF-8 conversion facet? And what can be done?
Best regards from Aachen,
Tilman
PS: You can find the codecvt facet cpp/hpp files in the folders /boost/boost/program_options/detail/ and in /boost/libs/program_options/src/
PPS: There seems to be no problem with VC7.1/Dinkumware.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
#include "utf8_codecvt_facet.hpp"
#include <string>
#include <iostream>
#include <fstream>
#include <locale>
#include <cassert>
using namespace std;
using namespace boost;
using namespace boost::program_options;
using namespace boost::program_options::detail;
namespace
{
    wstring wstr;
}
inline bool IsUnicode(wchar_t wch)
{
    if(wch >= 0x00D800 && wch < 0x00E000) return false;
    if(wch >= 0x00FFFE && wch < 0x010000) return false;
    if(wch >= 0x110000) return false;
    return true;
}
inline void MakeTestStr()
{
    const size_t loops = 1;
    wstr.clear();
    wstr.reserve(loops * 0x110000);
    for(size_t i = 0; i < loops; ++i)
        for(wchar_t wch = 0; wch < WCHAR_MAX; ++wch)
            if(IsUnicode(wch)) wstr += wch;
}
inline void Write()
{
    locale loc;
    locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
    wofstream f;
    f.imbue(utf8loc);
    f.open("test.utf8", ios::binary);
    f << wstr;
    assert(f);
}
inline void Read()
{
    locale loc;
    locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
    wifstream f;
    f.imbue(utf8loc);
    f.open("test.utf8", ios::binary);
    wchar_t wch;
    size_t pos = 0;
    for(; f.get(wch); ++pos)
        assert(wstr[pos] == wch);
    assert(pos == wstr.size()); // ***PROBLEM HERE***
}
int main()
{
    MakeTestStr();
    Write();
    Read();
}

Tilman Kuepper <kuepper <at> xgraphic.de> writes:
> Hello world,
Hi Tilman,
> I took a closer look at the UTF-8 codecvt facet which is part of the program_options library. A test program is attached.
> The last assert (in the Read-function) fails with g++ (GCC) 3.3.3 (Debian 20040429).
> After some debugging I think I found the problem:
Could you clarify where the problem is? Does it break program_options, or does it break some use of UTF-8 that you make?
> The function utf8_codecvt_facet_wchar_t::do_in() converts only valid (complete) UTF-8 sequences into internal (wchar_t) characters. In case the input buffer ends with an incomplete UTF-8 character, do_in() returns codecvt_base::partial and points from_next at the beginning of this incomplete UTF-8 sequence.
Oh.... this 'partial' is a messy thing. I think I thought it meant 'partial character found', but later figured out it means something different. I think I even fixed a bug with an incorrectly returned 'partial' in that facet some time ago.
> Obviously the library (libstdc++) is surprised by the fact that the codecvt facet stops the translation, although there is still room in the output buffer (i. e. to_next != to_end) and not all input characters have been processed (from_next != from_end).
> As a consequence the for-loop in the test program stops too early (wifstream not "good" any longer) and assert(pos == wstr.size()) fails.
> Is this a known issue with the GNU library or with the UTF-8 conversion facet? And what can be done?
Unless somebody else can shed some light, there are two choices:
1. You can wait until I'm back from vacation.
2. You can figure out the exact meaning of 'partial' and send a patch.
Thanks,
Volodya
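
One reading of 'partial' that fits the behaviour described in this thread is "the converter stopped before consuming all input, so the caller should keep the unconsumed bytes and supply more". The sketch below illustrates that contract with a small hand-written decoder. The names decode_utf8, decode_in_chunks and Result are hypothetical and are neither the Boost facet nor libstdc++ code; char32_t is used instead of wchar_t only to keep the example portable, and the decoder does not reject overlong forms or surrogates.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

enum class Result { ok, partial, error };

// Hypothetical stand-in for a codecvt-style do_in(): converts only complete
// UTF-8 sequences and leaves from_next at the first unconsumed byte.
Result decode_utf8(const char* from, const char* from_end, const char*& from_next,
                   char32_t* to, char32_t* to_end, char32_t*& to_next)
{
    from_next = from;
    to_next = to;
    while (from_next != from_end && to_next != to_end) {
        const unsigned char lead = static_cast<unsigned char>(*from_next);
        int len;
        char32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return Result::error;      // stray continuation or invalid lead byte

        // Incomplete sequence at the end of the input: report partial and
        // leave from_next at the start of that sequence.
        if (from_end - from_next < len) return Result::partial;

        for (int i = 1; i < len; ++i) {
            const unsigned char c = static_cast<unsigned char>(from_next[i]);
            if ((c & 0xC0) != 0x80) return Result::error;
            cp = (cp << 6) | (c & 0x3F);
        }
        *to_next++ = cp;
        from_next += len;
    }
    return from_next == from_end ? Result::ok : Result::partial;
}

// Caller side: feed the input in fixed-size chunks and carry the unconsumed
// tail over to the next round instead of treating partial as a failure.
std::u32string decode_in_chunks(const std::string& bytes, std::size_t chunk_size)
{
    if (chunk_size == 0) chunk_size = 1;
    std::u32string out;
    std::string pending;                // unconsumed bytes from earlier chunks
    std::size_t pos = 0;
    while (pos < bytes.size()) {
        const std::size_t n = std::min(chunk_size, bytes.size() - pos);
        pending.append(bytes, pos, n);
        pos += n;

        std::u32string buf(pending.size(), U'\0');  // never more code points than bytes
        const char* from_next = nullptr;
        char32_t* to_next = nullptr;
        const Result r = decode_utf8(pending.data(), pending.data() + pending.size(),
                                     from_next, &buf[0], &buf[0] + buf.size(), to_next);
        out.append(buf.data(), static_cast<std::size_t>(to_next - buf.data()));
        if (r == Result::error) break;
        pending.erase(0, static_cast<std::size_t>(from_next - pending.data()));
    }
    // Bytes still in `pending` here belong to a truncated final sequence
    // and are dropped in this sketch.
    return out;
}
```

With this kind of caller, the chunk size should not affect the decoded result, which is the property the test program in this thread checks.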

Goooood Morning,
> Does it break program_options, or does it break some use of UTF-8 that you make?
Don't worry... ;-) I wanted to use the UTF-8 converter in one of my own programs and decided to do some tests before... So far only my test program broke. The program_options library might be affected (not yet tested!), if options are read from a UTF-8 file which is longer than 8192 bytes and contains many "international characters" (i. e. UTF-8 sequences of two or more bytes).
> Oh.... this 'partial' is a messy thing.
Very true.
> Unless somebody else can shed some light, there are two choices:
> 1. You can wait until I'm back from vacation.
> 2. You can figure out the exact meaning of 'partial' and send a patch.
We can do both... ;-) I'll try to find out more about this 'partial' thing. The next step would be to make the current implementation of the conversion facet more robust (or to fix a bug in libstdc++...?!). If a new version of the UTF-8 facet is available it may find its way into other libraries, e. g. program_options, serialization etc.
Best regards,
Tilman
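
To see why a file longer than 8192 bytes with many multi-byte characters is the interesting case, it helps to check how many bytes at the end of a fixed-size buffer belong to a sequence that continues in the next buffer. The helper below is a rough sketch; incomplete_tail is a hypothetical name, and the function assumes the buffer otherwise holds well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns how many trailing bytes of buf form the start of a UTF-8 sequence
// whose remaining bytes are not in buf (0 if buf ends on a character boundary).
std::size_t incomplete_tail(const std::string& buf)
{
    const std::size_t n = buf.size();
    // A sequence is at most 4 bytes long, so an incomplete one has its lead
    // byte within the last 3 bytes of the buffer.
    for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(buf[n - back]);
        if ((c & 0xC0) == 0x80) continue;   // continuation byte, keep looking
        std::size_t need = 1;               // ASCII or invalid lead: treat as 1
        if ((c & 0xE0) == 0xC0) need = 2;
        else if ((c & 0xF0) == 0xE0) need = 3;
        else if ((c & 0xF8) == 0xF0) need = 4;
        return need > back ? back : 0;
    }
    return 0;
}
```

For example, a stream made only of three-byte sequences cannot end on a character boundary after 8192 bytes, because 8192 is not a multiple of 3; two bytes of the last character stay behind and have to be carried into the next buffer.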

On Monday 02 August 2004 16:33, Tilman Kuepper wrote:
> inline void Write()
> {
>     locale loc;
>     locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
>     wofstream f;
>     f.imbue(utf8loc);
>     f.open("test.utf8", ios::binary);
>     f << wstr;
>     assert(f);
> }
Without looking at the real case, the last assert is possibly useless: if there is still data in some internal buffer, only the stream going out of scope will flush that (and perhaps fail to do so). BTDT. Just add an 'f.flush()'.
Uli
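
The point about buffered output can be reproduced without any files. The sketch below uses a hypothetical stream buffer, FailingSinkBuf, that accepts characters into an internal array but fails whenever it is asked to write them out; it only illustrates why the stream state should be checked after flush(), and it is not taken from the test program.

```cpp
#include <ostream>
#include <streambuf>

// Hypothetical buffer: stores characters in a small array, but every attempt
// to drain that array (sync or overflow) reports failure.
class FailingSinkBuf : public std::streambuf {
public:
    FailingSinkBuf() { setp(buf_, buf_ + sizeof buf_); }

protected:
    int sync() override { return -1; }
    int_type overflow(int_type) override { return traits_type::eof(); }

private:
    char buf_[64];
};

// Writes a short string and reports the stream state, optionally after a flush.
bool stream_ok(bool flush_first)
{
    FailingSinkBuf sink;
    std::ostream os(&sink);
    os << "hello";                  // fits in the internal buffer, no error yet
    if (flush_first) os.flush();    // forces sync(), which fails here
    return static_cast<bool>(os);
}
```

Calling stream_ok(false) reports a good stream even though nothing could ever be written, while stream_ok(true) exposes the failure, so in Write() the assert is only meaningful after f.flush().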

Hi,
>> f << wstr; assert(f);
> Without looking at the real case, the last assert is possibly useless: if there is still data in some internal buffer, only the stream going out of scope will flush that (and perhaps fail to do so). BTDT. Just add an 'f.flush()'.
O. k., that's right. But the "real" problem remains: This particular libstdc++ (g++ 3.3.3) doesn't like the codecvt facet to return "partial". I am still not sure whether the codecvt facet or the library has to be fixed. Some regression tests should try to convert some "really large" chunks of data (which contain UTF-8 multi-byte encodings). It seems that libstdc++ uses internal buffers of 8 KB size; the current conversion facet causes problems only if you need more room.
Best regards,
Tilman
participants (3)
- Tilman Kuepper
- Ulrich Eckhardt
- Vladimir Prus