
Hello world,

I took a closer look at the UTF-8 codecvt facet which is part of the program_options library. A test program is attached. The last assert (in the Read function) fails with g++ (GCC) 3.3.3 (Debian 20040429).

After some debugging I think I found the problem: The function utf8_codecvt_facet_wchar_t::do_in() converts only valid (complete) UTF-8 sequences into internal (wchar_t) characters. If the input buffer ends with an incomplete UTF-8 character, do_in() returns codecvt_base::partial and points from_next at the beginning of this incomplete UTF-8 sequence.

Apparently the library (libstdc++) is surprised by the fact that the codecvt facet stops the translation although there is still room in the output buffer (i.e. to_next != to_end) and not all input characters have been processed (from_next != from_end). As a consequence, the for-loop in the test program stops too early (the wifstream is no longer "good") and assert(pos == wstr.size()) fails.

Is this a known issue with the GNU library or with the UTF-8 conversion facet? And what can be done?

Best regards from Aachen,
Tilman

PS: You can find the codecvt facet cpp/hpp files in the folders /boost/boost/program_options/detail/ and /boost/libs/program_options/src/

PPS: There seems to be no problem with VC7.1/Dinkumware.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

#include "utf8_codecvt_facet.hpp"

#include <string>
#include <iostream>
#include <fstream>
#include <locale>
#include <cassert>

using namespace std;
using namespace boost;
using namespace boost::program_options;
using namespace boost::program_options::detail;

namespace
{
    wstring wstr;
}

inline bool IsUnicode(wchar_t wch)
{
    if(wch >= 0x00D800 && wch < 0x00E000) return false;
    if(wch >= 0x00FFFE && wch < 0x010000) return false;
    if(wch >= 0x110000) return false;
    return true;
}

inline void MakeTestStr()
{
    const size_t loops = 1;
    wstr.clear();
    wstr.reserve(loops * 0x110000);
    for(size_t i = 0; i < loops; ++i)
        for(wchar_t wch = 0; wch < WCHAR_MAX; ++wch)
            if(IsUnicode(wch))
                wstr += wch;
}

inline void Write()
{
    locale loc;
    locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
    wofstream f;
    f.imbue(utf8loc);
    f.open("test.utf8", ios::binary);
    f << wstr;
    assert(f);
}

inline void Read()
{
    locale loc;
    locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
    wifstream f;
    f.imbue(utf8loc);
    f.open("test.utf8", ios::binary);
    wchar_t wch;
    size_t pos = 0;
    for(; f.get(wch); ++pos)
        assert(wstr[pos] == wch);
    assert(pos == wstr.size()); // ***PROBLEM HERE***
}

int main()
{
    MakeTestStr();
    Write();
    Read();
}
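
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

To make the do_in() behaviour described in the message easier to reproduce without a stream, here is a minimal stand-alone sketch. It is not the Boost facet and does not use std::codecvt: the names Result, SequenceLength, DecodeUtf8 and DemoTruncatedInput are hypothetical, char32_t is assumed for the output so the sketch does not depend on the width of wchar_t, and validation is deliberately minimal (no checks for overlong forms or surrogates). It only shows the situation in question: the input ends in the middle of a UTF-8 sequence, the converter reports "partial", leaves from_next at the lead byte of that sequence, and still has free room in the output buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the codecvt_base result codes.
enum class Result { ok, partial, error };

// Expected length of a UTF-8 sequence, judged by its lead byte
// (0 for a byte that cannot start a sequence).
inline std::size_t SequenceLength(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Converts only complete sequences. If the input ends inside a sequence,
// returns partial with from_next pointing at the start of that sequence.
inline Result DecodeUtf8(const char* from, const char* from_end,
                         const char*& from_next,
                         char32_t* to, char32_t* to_end, char32_t*& to_next)
{
    from_next = from;
    to_next = to;
    while (from_next != from_end && to_next != to_end) {
        const unsigned char lead = static_cast<unsigned char>(*from_next);
        const std::size_t len = SequenceLength(lead);
        if (len == 0)
            return Result::error;
        if (static_cast<std::size_t>(from_end - from_next) < len)
            return Result::partial;  // incomplete tail, nothing consumed from it

        // Payload bits of the lead byte, then 6 bits per continuation byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t i = 1; i < len; ++i) {
            const unsigned char c = static_cast<unsigned char>(from_next[i]);
            if ((c & 0xC0) != 0x80)
                return Result::error;
            cp = (cp << 6) | (c & 0x3F);
        }
        from_next += len;
        *to_next++ = cp;
    }
    // Input left over here means the output buffer is full.
    return from_next == from_end ? Result::ok : Result::partial;
}

inline void DemoTruncatedInput()
{
    // 'a', U+00E4 (C3 A4), then only the first two bytes of U+20AC (E2 82 AC).
    const char buf[] = { 'a', '\xC3', '\xA4', '\xE2', '\x82' };
    char32_t out[8];
    const char* from_next = 0;
    char32_t* to_next = 0;
    const Result r = DecodeUtf8(buf, buf + sizeof buf, from_next,
                                out, out + 8, to_next);
    assert(r == Result::partial);
    assert(from_next == buf + 3);   // stopped at the lead byte of the cut-off sequence
    assert(to_next == out + 2);     // output buffer is far from full
    assert(out[0] == 0x61 && out[1] == 0xE4);
}
```

A caller that wants to cope with this result has to keep the unconsumed bytes [from_next, from_end), append more input behind them and call the converter again, rather than treating "partial with room left in the output" as a failure.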