[boost][exception] Wide-character design considerations

Hello,
How do you guys deal with boost.exceptions and wide-character strings? Currently, I narrow all wide-character strings to their single-byte variants before passing them to boost::error_info tags. This, however, leads to
a) partially unreadable strings: when narrowing a boost::filesystem::wpath to fit into a boost::errinfo_file_name
b) potential throw from the narrowing code.
c) calls to narrowing code all around our code base.
Best regards, Christoph

On 6/22/2010 02:33, Christoph Heindl wrote:
Hello,
How do you guys deal with boost.exceptions and wide-character strings?
All my std::strings are utf-8. That fixes this problem:
a) partially unreadable strings: when narrowing a boost::filesystem::wpath to fit into a boost::errinfo_file_name
And it reduces this problem to out-of-memory errors:
b) potential throw from the narrowing code.
-- Rainer Deyke - rainerd@eldwood.com
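
A minimal sketch of the "keep everything as UTF-8" approach, assuming wchar_t holds UTF-16 code units where it is 2 bytes wide and UTF-32 code points elsewhere. The helper name wide_to_utf8 is hypothetical and not part of Boost. Malformed input is replaced with U+FFFD rather than reported, so memory allocation is the only remaining source of exceptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Boost): encode a wide string as UTF-8.
// Assumes UTF-16 code units when wchar_t is 2 bytes wide, UTF-32 otherwise.
// Malformed input becomes U+FFFD, so only memory allocation can throw.
std::string wide_to_utf8(const std::wstring& in)
{
    const bool utf16 = (sizeof(wchar_t) == 2);
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        if (utf16) cp &= 0xFFFFu;  // guard against sign extension
        if (utf16 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]) & 0xFFFFu;
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine a surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed as well
            }
        }
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            cp = 0xFFFD;  // unpaired surrogate or out-of-range value

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can be stored in any std::string-based error_info, such as boost::errinfo_file_name, without losing characters.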

On Tue, Jun 22, 2010 at 10:11 AM, Rainer Deyke wrote:
On 6/22/2010 02:33, Christoph Heindl wrote:
Hello,
How do you guys deal with boost.exceptions and wide-character strings?
All my std::strings are utf-8. That fixes this problem:
Probably not all strings, if you want to do string manipulations std::wstring is a better fit. Filenames, yes, I keep as utf-8 std::strings too.
a) partially unreadable strings: when narrowing a boost::filesystem::wpath to fit into a boost::errinfo_file_name
And it reduces this problem to out-of-memory errors
To be fair, out-of-memory conditions are probably less common than bad utf-8 strings (which are also not very common, I'd think.) You are making a valid point though: it is true that adding error_info to exceptions may throw. The benefit of this behavior is that the postcondition is that the error_info has been added to the exception, which means that at the catch site you can assert on missing error_info.
I still think that Christoph is raising a valid issue. Even if you keep the file names as utf-8 strings, diagnostic_information doesn't know about it. I think it is possible for boost::errinfo_file_name to deal with the situation better, it is on my todo list.
Thanks,
Emil Dotchevski
Reverge Studios, Inc.
http://www.revergestudios.com/reblog/index.php?n=ReCode

On Wed, Jun 23, 2010 at 12:15 AM, Emil Dotchevski wrote:
On Tue, Jun 22, 2010 at 10:11 AM, Rainer Deyke wrote:
On 6/22/2010 02:33, Christoph Heindl wrote:
Hello,
How do you guys deal with boost.exceptions and wide-character strings?
All my std::strings are utf-8. That fixes this problem:
Probably not all strings, if you want to do string manipulations std::wstring is a better fit. Filenames, yes, I keep as utf-8 std::strings too.
OK, that's a variant to consider. Off-topic: that would mean I have to decode those strings before passing them to boost::filesystem, right?
I still think that Christoph is raising a valid issue. Even if you keep the file names as utf-8 strings, diagnostic_information doesn't know about it. I think it is possible for boost::errinfo_file_name to deal with the situation better, it is on my todo list.
What do you have in mind? Currently the only idea I have is to treat all utf-8 characters < 128 as ascii characters and escape the rest.
Best regards, Christoph
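
The escaping idea could look roughly like the sketch below. The function name escape_non_ascii and the \xNN output format are assumptions for illustration; nothing here reflects what Boost Exception actually does:

```cpp
#include <string>

// Hypothetical helper: keep bytes below 128 unchanged and write every
// other byte as \xNN, so the result is plain ASCII whatever the encoding.
std::string escape_non_ascii(const std::string& in)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += "\\x";
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}
```

Note that this form is not reversible, because a literal backslash in the input is passed through unescaped.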

I still think that Christoph is raising a valid issue. Even if you keep the file names as utf-8 strings, diagnostic_information doesn't know about it. I think it is possible for boost::errinfo_file_name to deal with the situation better, it is on my todo list.
what do you have in mind? Currently the only idea I have is to treat all utf-8 characters < 128 as ascii characters and escape the rest.
Ha! More than a decade ago when I made my exception class (for Windows development) that dealt with adding key-value pairs in an extensible way, it was wchar_t from the get-go. I had to convince developers to use wide chars throughout, since that is what Windows wants. File names and registry keys are Unicode, and user input will be Unicode, and all that shows up in the error message.
IMO, the exception class should take UTF-16 as wstring or L literals and "deal with it" internally if need be. As long as it produces the original value again when queried, the internal format does not matter. (In my old implementation, performance when adding was a consideration though.)
--John
(beware of footer!)
TradeStation Group, Inc. is a publicly-traded holding company (NASDAQ GS: TRAD) of three operating subsidiaries, TradeStation Securities, Inc. (Member NYSE, FINRA, SIPC and NFA), TradeStation Technologies, Inc., a trading software and subscription company, and TradeStation Europe Limited, a United Kingdom, FSA-authorized introducing brokerage firm. None of these companies provides trading or investment advice, recommendations or endorsements of any kind. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.

On Wed, Jun 23, 2010 at 9:25 AM, John Dlugosz wrote:
I still think that Christoph is raising a valid issue. Even if you keep the file names as utf-8 strings, diagnostic_information doesn't know about it. I think it is possible for boost::errinfo_file_name to deal with the situation better, it is on my todo list.
what do you have in mind? Currently the only idea I have is to treat all utf-8 characters < 128 as ascii characters and escape the rest.
IMO, the exception class should take UTF-16 as wstring or L literals and "deal with it" internally if need be. As long as it produces the original value again when queried, the internal format does not matter.
The Boost Exception framework itself doesn't set a limitation on what you can stuff in exceptions. You can use something like:
typedef boost::error_info<struct tag_wfile_name, std::wstring> wfile_name;
and then you can stuff it in and recover it from exceptions just fine. The only issue is how boost::diagnostic_information (which returns a std::string) will display a wfile_name, and I'm sure whatever it does now isn't correct.
So this isn't a trivial problem. Perhaps the correct thing to do is document that boost::diagnostic_information returns a UTF-8 string, I kind of prefer this to the other possibility, to add a boost::wdiagnostic_information.
The Standard Library supplied with Visual C++ doesn't work with a UTF-8 "locale", and some versions give an error if you try to set that. The mblen string stuff in the source I've read is all designed around single byte or double byte characters with discernable prefixes, which works with the shift-JIS and other system code pages, but NOT with UTF-8. System functions take the "system code page" which might be different for file-name related functions, but did not support UTF-8 in the historical Windows line, but appears to be there for modern versions. But, lots of code was written to support Windows 95 and is still out there. Actually, I don't know if passing UTF-8 to the Windows API-A functions work! Normally, one uses the UTF-16 (-W) forms. Meanwhile, console output uses a different code page, and handles UTF-8 only if you set it up that way and supply a different font. That makes the regular shell stuff go funny though since it translates file names and such to use the "file name" code page mentioned earlier. So, making the human-reportable string be UTF-8 is simply not going to sit well with Windows programmers. Make it UTF-16 and I can pass it to wcout<< or call TextOut, OutputDebugString, etc. with no problem. --John

On Wed, Jun 23, 2010 at 5:04 PM, John Dlugosz wrote:
and then you can stuff it in and recover it from exceptions just fine. The only issue is how boost::diagnostic_information (which returns a std::string) will display a wfile_name, and I'm sure whatever it does now isn't correct.
So this isn't a trivial problem. Perhaps the correct thing to do is document that boost::diagnostic_information returns a UTF-8 string, I kind of prefer this to the other possibility, to add a boost::wdiagnostic_information.
The Standard Library supplied with Visual C++ doesn't work with a UTF-8 "locale", and some versions give an error if you try to set that.
With Boost Exception this won't lead to a compile error.
System functions take the "system code page" which might be different for file-name related functions, but did not support UTF-8 in the historical Windows line, but appears to be there for modern versions. But, lots of code was written to support Windows 95 and is still out there. Actually, I don't know if passing UTF-8 to the Windows API-A functions work! Normally, one uses the UTF-16 (-W) forms.
No, it won't work, the non-W Windows functions don't understand UTF-8, and the -W functions use UTF-16.
So, making the human-reportable string be UTF-8 is simply not going to sit well with Windows programmers. Make it UTF-16 and I can pass it to wcout<< or call TextOut, OutputDebugString, etc. with no problem.
Right, but UTF-16 won't sit well with non-Windows programmers. :)
Emil Dotchevski
Reverge Studios, Inc.
http://www.revergestudios.com/reblog/index.php?n=ReCode

System functions take the "system code page" which might be different for file-name related functions, but did not support UTF-8 in the historical Windows line, but appears to be there for modern versions. But, lots of code was written to support Windows 95 and is still out there. Actually, I don't know if passing UTF-8 to the Windows API-A functions work! Normally, one uses the UTF-16 (-W) forms.
No, it won't work, the non-W Windows functions don't understand UTF-8, and the -W functions use UTF-16.
Even when the code page is set to UTF-8? I thought the -A forms are wrappers that call MultiByteToWideChar on the arguments and then call the -W forms. So anything that function works with should work.
So, making the human-reportable string be UTF-8 is simply not going to sit well with Windows programmers. Make it UTF-16 and I can pass it to wcout<< or call TextOut, OutputDebugString, etc. with no problem.
Right, but UTF-16 won't sit well with non-Windows programmers. :)
After more reflection, I think a more complete answer is that the narrow-string function should not blindly return UTF-8, but should reflect the current locale setting. --John

On 6/23/2010 8:04 PM, John Dlugosz wrote:
and then you can stuff it in and recover it from exceptions just fine. The only issue is how boost::diagnostic_information (which returns a std::string) will display a wfile_name, and I'm sure whatever it does now isn't correct.
So this isn't a trivial problem. Perhaps the correct thing to do is document that boost::diagnostic_information returns a UTF-8 string, I kind of prefer this to the other possibility, to add a boost::wdiagnostic_information.
The Standard Library supplied with Visual C++ doesn't work with a UTF-8 "locale", and some versions give an error if you try to set that. The mblen string stuff in the source I've read is all designed around single byte or double byte characters with discernable prefixes, which works with the shift-JIS and other system code pages, but NOT with UTF-8.
If the UTF-8 string contains only 7-bit ASCII characters, Visual C++'s standard library should work with it as plain ANSI.
System functions take the "system code page" which might be different for file-name related functions, but did not support UTF-8 in the historical Windows line, but appears to be there for modern versions. But, lots of code was written to support Windows 95 and is still out there. Actually, I don't know if passing UTF-8 to the Windows API-A functions work! Normally, one uses the UTF-16 (-W) forms.
Again, if the UTF-8 characters are all 7-bit ASCII, the Windows API-A functions should work with a UTF-8 C string.
Meanwhile, console output uses a different code page, and handles UTF-8 only if you set it up that way and supply a different font. That makes the regular shell stuff go funny though since it translates file names and such to use the "file name" code page mentioned earlier.
So, making the human-reportable string be UTF-8 is simply not going to sit well with Windows programmers. Make it UTF-16 and I can pass it to wcout<< or call TextOut, OutputDebugString, etc. with no problem.
Wasn't someone working on a Boost Unicode library which could convert a UTF-8 stream to its equivalent UTF-16 stream ?

On Wed, Jun 23, 2010 at 7:14 PM, Edward Diener wrote:
Wasn't someone working on a Boost Unicode library which could convert a UTF-8 stream to its equivalent UTF-16 stream ?
The conversion isn't a problem, the question is how to integrate this stuff in the boost::diagnostic_information API.
Emil Dotchevski
Reverge Studios, Inc.
http://www.revergestudios.com/reblog/index.php?n=ReCode

On Thu, Jun 24, 2010 at 12:43 AM, Emil Dotchevski wrote:
The Boost Exception framework itself doesn't set a limitation on what you can stuff in exceptions. You can use something like:
typedef boost::error_info<struct tag_wfile_name, std::wstring> wfile_name;
and then you can stuff it in and recover it from exceptions just fine. The only issue is how boost::diagnostic_information (which returns a std::string) will display a wfile_name, and I'm sure whatever it does now isn't correct.
So this isn't a trivial problem. Perhaps the correct thing to do is document that boost::diagnostic_information returns a UTF-8 string, I kind of prefer this to the other possibility, to add a boost::wdiagnostic_information.
What about:
- having boost::diagnostic_information return a UTF-8 encoded string for all error_info containing a std::wstring
- adding a boost::wdiagnostic_information?
Christoph

On Wed, Jun 23, 2010 at 10:40 PM, Christoph Heindl wrote:
On Thu, Jun 24, 2010 at 12:43 AM, Emil Dotchevski wrote:
The Boost Exception framework itself doesn't set a limitation on what you can stuff in exceptions. You can use something like:
typedef boost::error_info<struct tag_wfile_name, std::wstring> wfile_name;
and then you can stuff it in and recover it from exceptions just fine. The only issue is how boost::diagnostic_information (which returns a std::string) will display a wfile_name, and I'm sure whatever it does now isn't correct.
So this isn't a trivial problem. Perhaps the correct thing to do is document that boost::diagnostic_information returns a UTF-8 string, I kind of prefer this to the other possibility, to add a boost::wdiagnostic_information.
What about: - having boost::diagnostic_information return a UTF-8 encoded string for all error_info containing a std::wstring
What is the difference between that and just documenting that boost::diagnostic_information returns a UTF-8 string *always*?
- adding a boost::wdiagnostic_information
If boost::diagnostic_information returns a UTF-8 string, boost::wdiagnostic_information is a trivial wrapper.
Emil Dotchevski
Reverge Studios, Inc.
http://www.revergestudios.com/reblog/index.php?n=ReCode

On Thu, Jun 24, 2010 at 7:53 AM, Emil Dotchevski wrote:
If boost::diagnostic_information returns a UTF-8 string, boost::wdiagnostic_information is a trivial wrapper.
True, what I thought about is converting std::string/utf8 to a std::wstring/utf16 or 32 depending on the wchar_t type. Besides, I'd prefer to have even a trivial wrapper in one place than in each individual project. Best regards, Christoph
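
The conversion described here, UTF-8 to UTF-16 or UTF-32 depending on the width of wchar_t, might be sketched as follows. The name utf8_to_wide is hypothetical, and the choice to replace malformed sequences with U+FFFD is an assumption made for this example. A wdiagnostic_information wrapper would apply such a function to the narrow result, provided that result really is UTF-8:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 into a std::wstring, producing UTF-16
// when wchar_t is 2 bytes wide and UTF-32 otherwise. Malformed sequences
// are replaced by U+FFFD, one input byte at a time.
std::wstring utf8_to_wide(const std::string& in)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0xFFFD;  // default for an invalid lead byte
        std::size_t len = 1;
        if (b0 < 0x80)                { cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }

        bool ok = (i + len <= in.size());  // false if the input is truncated
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;  // must be a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates and out-of-range values.
        if (!ok || cp < min_cp[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
            len = 1;
        }

        if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
            const std::uint32_t v = cp - 0x10000;  // split into a surrogate pair
            out += static_cast<wchar_t>(0xD800 + (v >> 10));
            out += static_cast<wchar_t>(0xDC00 + (v & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
        i += len;
    }
    return out;
}
```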

What about: - having boost::diagnostic_information return a UTF-8 encoded string for all error_info containing a std::wstring - adding a boost::wdiagnostic_information
No. The narrow form should be encoded based on the currently selected locale's settings. That is, after all, the whole point of having it. Other functions you pass it to will be expecting that. --John

On Thu, Jun 24, 2010 at 8:31 AM, John Dlugosz wrote:
What about: - having boost::diagnostic_information return a UTF-8 encoded string for all error_info containing a std::wstring - adding a boost::wdiagnostic_information
No. The narrow form should be encoded based on the currently selected locale's settings. That is, after all, the whole point of having it. Other functions you pass it to will be expecting that.
The current documented behavior of boost::diagnostic_information is that it converts error_info objects (of any type!) to string by calling a user-defined to_string function, or (if no such overload exists) by means of std::ostringstream (if that's not possible, a compile error is NOT issued anyway.) So I suppose the narrow form is encoded based on the currently selected locale unless the user interferes by providing a custom to_string overload.
So, the question is how does std::ostringstream handle std::wstrings. If the locale can be configured so that the std::wstring is converted to UTF-8, I think that there is no need to change anything in Boost Exception because (unless the user interferes) boost::diagnostic_information's format will depend on the locale.
Someone please correct me if I got this wrong!
Emil Dotchevski
Reverge Studios, Inc.
http://www.revergestudios.com/reblog/index.php?n=ReCode

So, the question is how does std::ostringstream handle std::wstrings. If the locale can be configured so that the std::wstring is converted to UTF-8, I think that there is no need to change anything in Boost Exception because (unless the user interferes) boost::diagnostic_information's format will depend on the locale.
My experience with MS compilers is that there are no wstring overloads for operator<< on the narrow ios. That might be version specific, though, or the truth might be more complicated. --John
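
The standard narrow streams declare no operator<< that accepts a std::wstring, so a locale-dependent narrow form has to be produced explicitly. One way to do that, sketched here with std::wcstombs from the C library, is shown below. The helper name narrow_with_locale is hypothetical, and the result depends entirely on whatever locale std::setlocale last selected:

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: narrow a wide string using the current C locale.
// Throws if a character cannot be represented in that locale. Conversion
// stops at the first embedded null, because std::wcstombs works on
// null-terminated strings.
std::string narrow_with_locale(const std::wstring& in)
{
    // MB_CUR_MAX is the largest number of bytes one character can need
    // in the current locale; add one byte for the terminating null.
    std::vector<char> buf(in.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(buf.data(), in.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in the current locale");
    return std::string(buf.data(), n);
}
```

This illustrates both sides of the argument in the thread: the output matches what other locale-aware functions expect, but the conversion can fail for characters outside the selected locale.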

On 6/24/2010 09:31, John Dlugosz wrote:
No. The narrow form should be encoded based on the currently selected locale's settings. That is, after all, the whole point of having it. Other functions you pass it to will be expecting that.
Which other function? Yours? Mine? The standard libraries? Third party libraries? My functions, at least, use utf-8 exclusively. The standard library generally (but not always) treats strings as opaque blobs of binary data. Third party libraries vary, but most of the libraries I use assume or at least support utf-8.
Using locale-dependent character encodings is just plain broken. It won't work if you deal with characters that are not in the current locale. It won't work if you save a file in one locale and load it from another. It won't work if you compile with one locale and run on the other.
Unfortunately, wide strings are /also/ broken. On some platforms, they are utf-32. On some platforms, they are utf-16. On some platforms they are UCS-2, and you have no way to encode characters past the BMP. You also have to worry about byte-order issues.
Utf-8 is the only sane way to deal with international text.
-- Rainer Deyke - rainerd@eldwood.com
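
The platform differences can be made concrete by counting code units for one character outside the BMP. This sketch assumes a compiler with the char16_t and char32_t literal types from C++11, which postdates this thread; the constant names are made up for the example:

```cpp
#include <cstddef>

// U+1F600 lies outside the Basic Multilingual Plane.
// sizeof counts the terminating null, hence the "- 1".
constexpr std::size_t utf8_units  = sizeof(u8"\U0001F600") - 1;
constexpr std::size_t utf16_units = sizeof(u"\U0001F600") / sizeof(char16_t) - 1;
constexpr std::size_t utf32_units = sizeof(U"\U0001F600") / sizeof(char32_t) - 1;
// Platform dependent: typically 2 where wchar_t is 16 bits, 1 where it is 32 bits.
constexpr std::size_t wide_units  = sizeof(L"\U0001F600") / sizeof(wchar_t) - 1;

static_assert(utf8_units == 4,  "four bytes in UTF-8");
static_assert(utf16_units == 2, "a surrogate pair in UTF-16");
static_assert(utf32_units == 1, "a single code unit in UTF-32");
```

Only wide_units varies between platforms, which is the portability problem with std::wstring described above.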

No. The narrow form should be encoded based on the currently selected locale's settings. That is, after all, the whole point of having it. Other functions you pass it to will be expecting that.
Which other function? Yours? Mine? The standard libraries? Third party libraries?
Standard library's, and third party that follows the example or calls std primitives to do the work.
Unfortunately, wide strings are /also/ broken. On some platforms, they are utf-32. On some platforms, they are utf-16. On some platforms they are UCS-2, and you have no way to encode characters past the BMP. You also have to worry about byte-order issues.
I've never had to deal with wide strings on other platforms; only the way in which they are pervasive in Windows. I do see that the new C++ draft standard addresses the issue by making explicit types for 16-bit and 32-bit strings encoded in UTF, as distinct from the rather vague "wide".
Utf-8 is the only sane way to deal with international text.
Too bad the early framers of the OS didn't simply add a UTF-8 code page rather than doubling the API. I would like to point out that using GB18030 is also a useful approach, since support for that is required by law to sell in mainland China, it is really supported by commercial software. And it is defined as a Unicode encoding format. So, you could select that as your code page and all the narrow-string Windows functions and stdlib functions would work properly. --John

participants (5)
- Christoph Heindl
- Edward Diener
- Emil Dotchevski
- John Dlugosz
- Rainer Deyke