[Process] List of small issues

Hello, I've made a micro review of Boost.Process; here are the small issues I've found:

- Stream buffer implementation, underflow(): you need to check errno for EINTR. Returning -1 and getting EINTR is not a problem, so you should retry.
- sync(): same as in underflow(), check for EINTR and retry. There is also a problem that you do not check that you have fully written the data. For example, when I write `out << std::flush`, sync() is called and I expect all the data to be written to the device; so if the return value is less than the size of the buffer, you should retry and write again until the buffer is empty, you get an error, or EOF.
- Windows and Unicode: you are using CreateProcessA. I would recommend always using the wide API and converting narrow strings to wide, similarly to what boost::filesystem::v3 does; so, for example, where the global locale has a UTF-8 facet, you would convert narrow strings to wide and run with those. Notes:
  1. You can also always assume that strings under Windows are UTF-8 and always convert them to wide strings before system calls. I think this is the better approach, but it is different from what most of Boost does.
  2. I do not recommend adding a wide API, as it makes the code much uglier; rather, convert narrow strings to wide strings before the system call.
- It would be a very good addition to implement full support of putback.

Additional point:
-----------------

I've noticed that you planned asynchronous notification and so on, but I think it is quite important to add a feature that provides the ability to wait for multiple processes to terminate, with a timeout. It can be done using sigtimedwait/sigwait and assigned signal handlers for SIGCHLD.

Artyom

P.S.: Good luck with the review; the library looks overall very nice.

Artyom wrote on Thursday, January 13, 2011 9:36 AM
I've made a micro review of Boost.Process, there are small issues I've found:
I'm not sure if the Boost.Process code base has been updated, but a change in Boost.System in v1.44 means that occurrences of boost::system::system_category need to be replaced by boost::system::system_category(), because the argument type to boost::system::system_error changed in v1.44. Just something I found when we upgraded to v1.45 recently.

Erik

On Thu, 13 Jan 2011 06:35:53 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote: [...]
Notes:
1. You can also always assume that strings under windows are UTF-8 and always convert them to wide string before system calls.
This is I think better approach, but it is different from what most of boost does. [...]
An interesting thought... I developed a set of ASCII/UTF-8/16/32 classes for my company not too long ago, and I became fairly familiar with the UTF-8 encoding scheme. There was only one issue that stopped me from treating all std::string types as UTF-8-encoded: what if the string *isn't* meant to be UTF-8 encoded, and contains characters with the high bit set?

There's nothing technically stopping that from happening, and there's no way to determine with complete certainty whether even a string that seems to be valid UTF-8 was intended that way, or whether the UTF-8-like characters are really meant as their high-ASCII values.

Maybe you know something I don't, that would allow me to change it? I hope so, it would simplify some of the code greatly. -- Chad Nelson Oak Circle Software, Inc. * * *

Hello All,

I wanted to talk about this for a loooooong time, however never got there.

-------------------------------------------------

Proposal Summary:
=================

- We need to treat std::string and char const * as UTF-8 strings on Windows and drop support for the so-called ANSI API.
- Optional but recommended: deprecate wide strings as an unportable API.

Basics:
=======

There is a big difference in handling Unicode between the Windows and POSIX platform APIs. It can be summarized as follows:

    OS                Modern Unix   Modern Windows
    ------------------------------------------------------
    char string:      UTF-8         Obsolete ANSI codepage (like 1251)
    wchar_t string:   UTF-32        UTF-16
    OS native API:    char          wchar_t
    Common encoding:  UTF-8         UTF-16

    Unicode support   Modern Unix   Modern Windows
    ----------------------------------------------
    char API          Full Unicode  Not supported
    wchar_t API       Does not exist  Full

Bottom line: you can't open or delete a file in a cross-platform way!

Suggestion:
===========

Char strings
------------

- Under POSIX platforms: treat them as byte sequences in the current locale; by default assume that they are UTF-8, because:
  a) the default locale on most OSes is a UTF-8 locale;
  b) the POSIX API does not care about encodings, so even if the locale is not UTF-8 you can still do everything correctly.
- Under Windows platforms:
  a) Treat them as UTF-8 strings; convert them to UTF-16 just before accessing system services.
  b) Never use the ANSI API; always use the wide API. It is the default internal encoding anyway.

Wide strings
------------

- Deprecate them, unless you have something tied to the Windows system API:
  a) They are not portable: no OS (except Windows) uses wide strings in its API.
  b) They are not well defined: they may be UTF-16 or UTF-32.

For more details read: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...

What problem would this solve for us?
=====================================

1. All standard APIs support Unicode naturally, as they are supposed to.
   - Want to open a boost::filesystem::fstream?
   - Want to pass parameters to another process?
   - Want to display a message?
   - Want to read XML or JSON?
   All work with Unicode by default because:
   a) it is Unicode by default on Unix;
   b) they are mapped to the wide API on Windows.
2. Portable programs should no longer worry about setting standard locale facets, etc. Programs become much more portable.
3. Fewer bugs related to Unicode handling.

Artyom

----- Original Message ----
Chad Nelson <chad.thecomfychair@gmail.com>
Artyom <artyomtnk@yahoo.com> wrote:
[...]
Notes:
1. You can also always assume that strings under windows are UTF-8 and always convert them to wide string before system calls.
This is I think better approach, but it is different from what most of boost does. [...]
An interesting thought... I developed a set of ASCII/UTF-8/16/32 classes for my company not too long ago, and I became fairly familiar with the UTF-8 encoding scheme. There was only one issue that stopped me from treating all std::string types as UTF-8-encoded: what if the string *isn't* meant to be UTF-8 encoded, and contains characters with the high bit set?
There's nothing technically stopping that from happening, and there's no way to determine with complete certainty whether even a string that seems to be valid UTF-8 was intended that way, or whether the UTF-8-like characters are really meant as their high-ASCII values.
Maybe you know something I don't, that would allow me to change it? I hope so, it would simplify some of the code greatly.

Hi, On Thu, Jan 13, 2011 at 8:21 PM, Artyom <artyomtnk@yahoo.com> wrote:
Hello All,
I wanted to talk about it for a loooooong time. however never got there.
-------------------------------------------------
Proposal Summary: ===================
- We need to treat std::string and char const * as UTF-8 strings on Windows and drop support for the so-called ANSI API.
- Optional but recommended:
Deprecate wide strings as unportable API.
Fully agree. Two years ago I would very probably be advocating some kind of TCHAR/wxChar/QChar/whatever-like character type switching, but since then I've spent a lot of time developing portable GUI applications and found out the hard way that it is better to dump all the ANSI CPXXXX / UTF-XY encodings and stick to UTF-8 and defer the conversion to whatever the native API uses until you make the actual call. a) UTF-16 in principle is ok but many implementations are not:
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...
b) UTF-32 is basically a waste of memory for most localizations.
[snip]
Suggestion: ===========
Char Strings ------------
- Under POSIX platform:
Treat them as byte sequences in the current locale; by default assume that they are UTF-8, because:
a) the default locale on most OSes is a UTF-8 locale;
b) the POSIX API does not care about encodings, so even if the locale is not UTF-8 you can still do everything correctly.
- Under Windows platform:
a) Treat them as UTF-8 strings; convert them to UTF-16 just before accessing system services.
b) Never use the ANSI API; always use the wide API. It is the default internal encoding anyway.
Wide String: ------------
- Deprecate them, unless you have something tied to Windows system API.
+1. IMO, having two APIs that are not seamlessly interchangeable in the code (at least not without macro trickery) is useless. [snip]
What problem this would solve for us? =====================================
1. All standard APIs support Unicode naturally, as they are supposed to.
- Want to open a boost::filesystem::fstream?
- Want to pass parameters to another process?
- Want to display a message?
- Want to read XML or JSON?
All work with Unicode by default because:
a) it is Unicode by default on Unix;
b) they are mapped to the wide API on Windows.
2. Portable programs should no longer worry about setting standard locale facets, etc.
Programs become much more portable.
3. Fewer bugs related to Unicode handling.
Artyom
+1, but from my experience it is easier said than done. My knowledge of Unicode and UTF-8 is little more than superficial and I haven't done a lot of char-by-char manipulation, but to do what you are proposing we need at least some straightforward (and efficient) way to convert the native strings to the required encoding at the call site. I'm not trying to nitpick on anyone's implementation of a Unicode library here, but having to instantiate ~10 transcoding-related classes just to call ShellExecuteW is not my idea of straightforward. :) [snip] BR, Matus

On Fri, Jan 14, 2011 at 4:42 AM, Matus Chochlik <chochlik@gmail.com> wrote:
b) UTF-32 is basically a waste of memory for most localizations.
I'm not an expert, so take this with a grain of salt. But couldn't it just as easily be said that UTF-8 is a waste of CPU? There are a number of operations that are constant time if you can assume a fixed size for a character that I would think would have to be linear for UTF-8, for example accessing the Nth character.

John B. Turpish wrote:
On Fri, Jan 14, 2011 at 4:42 AM, Matus Chochlik <chochlik@gmail.com> wrote:
b) UTF-32 is basically a waste of memory for most localizations.
I'm not an expert, so take this with a grain of salt. But couldn't it just as easily be said that UTF-8 is a waste of CPU? There are a number of operations that are constant time if you can assume a fixed size for a character that I would think would have to be linear for UTF-8, for example accessing the Nth character.
Yes, in principle, but: - you rarely, if ever, need to access the Nth character; - waste of space is also a waste of CPU due to more cache misses; - UTF-8 has the nice property that you can do things with a string without even decoding the characters; for example, you can sort UTF-8 strings as-is, or split them on a specific (7 bit) character, such as '.' or '/'. Typically, UTF/UCS-32 is only needed as an intermediate representation in very few places, the rest of the strings can happily stay UTF-8.

On Fri, Jan 14, 2011 at 04:54:05PM +0200, Peter Dimov wrote:
John B. Turpish wrote: - UTF-8 has the nice property that you can do things with a string without even decoding the characters; for example, you can sort UTF-8 strings as-is, or split them on a specific (7 bit) character, such as '.' or '/'. Please excuse me if I'm stating the obvious, but I feel I should mention that binary sorting is not collation.
"The basic principle to remember is: The position of characters in the Unicode code charts does not specify their sorting weight." -- http://unicode.org/reports/tr10/#Introduction

Any application that requires you to present a sorted list of strings to a user pretty much requires a collation algorithm; in that sense, the usefulness of the above-mentioned property of UTF-8 is limited. Again, sorry if I'm stating the obvious here. I've had to bring up that argument in character-encoding-related discussions more than once, and it's become a bit of a knee-jerk response by now ;)

For the application discussed, i.e. for passing strings to OS APIs, this really doesn't matter, though. Where it does matter slightly is when deciding whether or not to use UTF-8 internally in your application. The UCA maps code points to collation elements, or strings into lists of collation elements, and then binary-sorts those collation element lists instead of the original strings. My guess would be that using UCS/UTF-32 for that is likely to be cheaper, though I haven't actually run any comparisons here. If anyone has, I'd love to know.

All of this is mostly an aside, I guess :)

Jens -- 1.21 Jiggabytes of memory should be enough for anybody.

Jens Finkhäuser wrote:
Please excuse me if I'm stating the obvious, but I feel I should mention that binary sorting is not collation.
Yes, you're right. Sorting (lexicographically) UTF-8 strings as sequences of 8-bit unsigned integers gives the same result as sorting their UCS-32 equivalents as sequences of 32-bit unsigned integers.

2011/1/14 John B. Turpish <jbturp@gmail.com>:
I'm not an expert, so take this with a grain of salt. But couldn't it just as easily be said that UTF-8 is a waste of CPU? There are a number of operations that are constant time if you can assume a fixed size for a character that I would think would have to be linear for UTF-8, for example accessing the Nth character.
John,

As I understand it, the choice is between UTF-8 and UTF-16, since UTF-32 is a waste of memory. Given that, there is never a fixed size for a character, nor constant-time access: both UTF-8 and UTF-16 are variable-width encodings. Alexander Churanov

On Fri, Jan 14, 2011 at 1:36 PM, Alexander Churanov <alexanderchuranov@gmail.com> wrote:
John,
As I understand it, the choice is between UTF-8 and UTF-16, since UTF-32 is a waste of memory. Given that, there is never a fixed size for a character, nor constant-time access: both UTF-8 and UTF-16 are variable-width encodings.
Yes, my comment was in response to a comment about UTF-32 as an internal encoding. I'd only use UTF-16 if the APIs I used required it, and the conversion could be done at the interface (for example in a facade). What interests me is whether there's a good reason to use UTF-8 internally and give UTF-32 the same treatment as UTF-16, or vice versa. I do find the simplicity of a fixed-width encoding alluring. By the way, I disagree with Peter's assessment that "you rarely, if ever, need to access the Nth character," but I will gladly cede that this depends on your problem domain.

John B. Turpish wrote:
By the way, I disagree with Peter's assessment that, "you rarely, if ever, need to access the Nth character," but I will gladly cede that this depends on your problem domain.
It obviously depends on the problem domain :-) but, when talking about Unicode, you can't reliably access the Nth character, in general, even with UCS-32. (As far as I know.)

On 01/14/2011 02:05 PM, Peter Dimov wrote:
John B. Turpish wrote:
By the way, I disagree with Peter's assessment that, "you rarely, if ever, need to access the Nth character," but I will gladly cede that this depends on your problem domain.
It obviously depends on the problem domain :-) but, when talking about Unicode, you can't reliably access the Nth character, in general, even with UCS-32. (As far as I know.)

I don't understand. UCS-32 (I assume you meant encoded as UTF-32) is a fixed-width encoding, so the n-th character is just 4n away from the beginning of the string. Right?
Patrick

On Fri, Jan 14, 2011 at 9:35 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/14/2011 02:05 PM, Peter Dimov wrote:
John B. Turpish wrote:
By the way, I disagree with Peter's assessment that, "you rarely, if ever, need to access the Nth character," but I will gladly cede that this depends on your problem domain.
It obviously depends on the problem domain :-) but, when talking about Unicode, you can't reliably access the Nth character, in general, even with UCS-32. (As far as I know.)
I don't understand. UCS-32 (I assume you meant encoded as UTF-32) is a fixed width encoding so the n-th character is just 4n away from the beginning of the string. Right?
No. The nth code point is 4n bytes from the beginning of the string, but characters may be made of a combination of adjacent code points. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 01/14/2011 07:16 PM, Dave Abrahams wrote:
On Fri, Jan 14, 2011 at 9:35 PM, Patrick Horgan<phorgan1@gmail.com> wrote:
... elision ...

I don't understand. UCS-32 (I assume you meant encoded as UTF-32) is a fixed width encoding so the n-th character is just 4n away from the beginning of the string. Right?

No. The nth code point is 4n bytes from the beginning of the string, but characters may be made of a combination of adjacent code points.

Ahhhh! Of course this occurred to me moments after clicking send. lol! There should be a name for that phenomenon. Some correlation to staircase wit.
Patrick

From: Patrick Horgan <phorgan1@gmail.com> On 01/14/2011 02:05 PM, Peter Dimov wrote:
John B. Turpish wrote:
By the way, I disagree with Peter's assessment that, "you rarely, if ever, need to access the Nth character," but I will gladly cede that this depends on your problem domain.
It obviously depends on the problem domain :-) but, when talking about Unicode, you can't reliably access the Nth character, in general, even with UCS-32. (As far as I know.)
I don't understand. UCS-32 (I assume you meant encoded as UTF-32) is a fixed width encoding so the n-th character is just 4n away from the beginning of the string. Right?
No, the Nth position holds the Nth Unicode code point, not the Nth character. For example, the word "שָלוֹם" has 4 characters, "שָ", "ל", "וֹ", "ם", but 6 code points: ש ָ ל ו ֹ ם, where two of the code points are diacritic marks. Boost.Locale has a special character iterator to handle this: it works on characters, not code points. See: http://cppcms.sourceforge.net/boost_locale/html/tutorial.html#8e296a067a3756... Artyom

From: John B. Turpish On Fri, Jan 14, 2011 at 4:42 AM, Matus Chochlik <chochlik@gmail.com> wrote:
b) UTF-32 is basically a waste of memory for most localizations.
I'm not an expert, so take this with a grain of salt. But couldn't it just as easily be said that UTF-8 is a waste of CPU? There are a number of operations that are constant time if you can assume a fixed size for a character that I would think would have to be linear for UTF-8, for example accessing the Nth character.
IIUC you can't assume a fixed size for a character even with UTF-32. In UTF-32 only _code points_ have a fixed size, yet one character may be composed of several code points, e.g. a Latin letter followed by a diacritical mark making up one character (http://en.wikipedia.org/wiki/Combining_character). Best regards, Robert

On Fri, Jan 14, 2011 at 5:52 PM, Robert Kawulak <robert.kawulak@gmail.com> wrote:
IIUC you can't assume a fixed size for a character even with UTF-32. In UTF-32 only _codepoints_ have fixed size, yet one character may be composed of several codepoints, e.g. a latin letter followed by a diacritical mark, making up one character (http://en.wikipedia.org/wiki/Combining_character).
Best regards, Robert
I stand corrected. This sort of the thing is the reason I start with disclaimers like, "I'm not an expert, so take this with a grain of salt." Anyhow, thanks for the info.

On Fri, 14 Jan 2011 10:42:44 +0100, Matus Chochlik wrote:
Hi,
On Thu, Jan 13, 2011 at 8:21 PM, Artyom <artyomtnk@yahoo.com> wrote:
Hello All,
I wanted to talk about it for a loooooong time. however never got there.
-------------------------------------------------
Proposal Summary: ===================
- We need to treat std::string, char const * as UTF-8 strings on Windows and drop a support of so called ANSI API.
- Optuional but recommended:
Deprecate wide strings as unportable API.
Fully agree. Two years ago I would very probably be advocating some kind of TCHAR/wxChar/QChar/whatever-like character type switching, but since then I've spent a lot of time developing portable GUI applications and found out the hard way that it is better to dump all the ANSI CPXXXX / UTF-XY encodings and stick to UTF-8 and defer the conversion to whatever the native API uses until you make the actual call.
-1. I'm opposed to this strategy simply because it differs from the way existing libraries treat narrow strings, not least the STL. If you open an fstream with a narrow filename, for instance, this isn't treated as a UTF-8 string. It's treated as being in the local codepage.

What the Visual Studio implementation of the STL actually does is pretty much the same as how Boost.Filesystem v3 treats paths: it uses mbstowcs_s to convert the narrow string to the wchar_t form and then uses _wfsopen to open the file. Importantly, mbstowcs_s treats the narrow string as being in the local codepage, which on Windows _won't_ be UTF-8. If you tried to open an fstream by handing it a UTF-8 encoded string, you would end up with severe problems. For shits and giggles I tried to open a std::fstream with "שלום-سلام-pease-Мир.txt" as the filename. What it ends up doing is creating a file called "ש×××-Ø³ÙØ§Ù -pease-ÐиÑ.txt"!

While this behaviour isn't great, it is standard. I don't think we should make Boost produce UTF-8 narrow strings on Windows. A programmer would expect to be able to take such a string and pass it to STL functions. As you can see, that wouldn't work.

Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Alexander Lamaison wrote:
I'm opposed to this strategy simply because it differs from the way existing libraries treat narrow strings.
It differs from them because it's right, and existing libraries are wrong. Unfortunately, they'll continue being wrong for a long time, because of this same argument.

At Fri, 14 Jan 2011 17:07:20 +0200, Peter Dimov wrote:
Alexander Lamaison wrote:
I'm opposed to this strategy simply because it differs from the way existing libraries treat narrow strings.
It differs from them because it's right, and existing libraries are wrong. Unfortunately, they'll continue being wrong for a long time, because of this same argument.
Does the "right" strategy come with some policies/practices that can allow it to coexist with the existing "wrong" libraries? If so, I'm all +1 on it. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote:
At Fri, 14 Jan 2011 17:07:20 +0200, Peter Dimov wrote:
Alexander Lamaison wrote:
I'm opposed to this strategy simply because it differs from the way existing libraries treat narrow strings.
It differs from them because it's right, and existing libraries are wrong. Unfortunately, they'll continue being wrong for a long time, because of this same argument.
Does the "right" strategy come with some policies/practices that can allow it to coexist with the existing "wrong" libraries? If so, I'm all +1 on it.
Unfortunately not. A library that requires its input paths to be UTF-8 always gets bug reports from users who are accustomed to using another encoding for their narrow strings. There is plenty of precedent they can use to justify their complaint.

At Fri, 14 Jan 2011 17:50:02 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
At Fri, 14 Jan 2011 17:07:20 +0200, Peter Dimov wrote:
Alexander Lamaison wrote:
I'm opposed to this strategy simply because it differs from the way existing libraries treat narrow strings.
It differs from them because it's right, and existing libraries are wrong. Unfortunately, they'll continue being wrong for a long time, because of this same argument.
Does the "right" strategy come with some policies/practices that can allow it to coexist with the existing "wrong" libraries? If so, I'm all +1 on it.
Unfortunately not. A library that requires its input paths to be UTF-8 always gets bug reports from users who are accustomed to using another encoding for their narrow strings. There is plenty of precedent they can use to justify their complaint.
I don't see the problem you cited as an answer to my question. Let me try asking it differently: how do I program in an environment that has both "right" and "wrong" libraries? Also, is there any use in trying to get the difference into the type system, e.g. by using some kind of wrapper over std::string that gives it a distinct "utf-8" type? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Fri, 14 Jan 2011 10:59:09 -0500, Dave Abrahams wrote:
Also, is there any use in trying to get the difference into the type system, e.g. by using some kind of wrapper over std::string that gives it a distinct "utf-8" type?
I would love to see something like this because as things stand it is far too easy to forget that narrow strings (and wide strings for that matter) aren't all alike and often need converting even when the character width doesn't change. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Dave Abrahams wrote:
Let me try asking it differently: how do I program in an environment that has both "right" and "wrong" libraries?
There's really no good answer to that; it's, basically, a mess. You could use UTF-8 everywhere in your code, pass that to "right" libraries as-is, and only pass wchar_t[] to "wrong" libraries and the OS. This doesn't work when the "wrong" libraries or the OS don't have a wide API, though.

And there's no standard way of being wrong; some libraries use the OS narrow API, some convert to wchar_t[] internally and use the wide API, using a variety of encodings: the OS default (and there can be more than one), the C locale, the C++ locale, or a global encoding that can be set per-library. It's even more fun when supposedly portable libraries use different decoding strategies depending on the platform.
Also, is there any use in trying to get the difference into the type system, e.g. by using some kind of wrapper over std::string that gives it a distinct "utf-8" type?
This could help; a hybrid right+wrong library ought probably be able to take either utf8_string or non_utf8_string, with the latter using who-knows-what encoding. :-) The "bite the bullet" solution is just to demand "right" libraries and use UTF-8 throughout.

At Fri, 14 Jan 2011 19:37:23 +0200, Peter Dimov wrote:
Also, is there any use in trying to get the difference into the type system, e.g. by using some kind of wrapper over std::string that gives it a distinct "utf-8" type?
This could help; a hybrid right+wrong library ought probably be able to take either utf8_string or non_utf8_string, with the latter using who-knows-what encoding. :-)
The "bite the bullet" solution is just to demand "right" libraries and use UTF-8 throughout.
OK, thanks. Consider me +1 on whatever you recommend. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Fri, 14 Jan 2011 10:59:09 -0500 Dave Abrahams <dave@boostpro.com> wrote:
At Fri, 14 Jan 2011 17:50:02 +0200, Peter Dimov wrote:
Unfortunately not. A library that requires its input paths to be UTF-8 always gets bug reports from users who are accustomed to using another encoding for their narrow strings. There is plenty of precedent they can use to justify their complaint.
I don't see the problem you cited as an answer to my question. Let me try asking it differently: how do I program in an environment that has both "right" and "wrong" libraries?
Also, is there any use in trying to get the difference into the type system, e.g. by using some kind of wrapper over std::string that gives it a distinct "utf-8" type?
The system I'm now using for my programs might interest you. I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning one type to another automatically converts it to the target type during the copy. (Converting to ascii_t will throw an exception if a resulting character won't fit into eight bits.)

Each type has an internal storage type as well, based on the character size (ascii_t and utf8_t use std::string, utf16_t uses 16-bit characters, etc). You can access the internal storage type using operator* or operator->. For a utf8_t variable 'v', for example, *v gives you the UTF-8-encoded string.

An std::string is assumed to be ASCII-encoded. If you really do have UTF-8-encoded data to get into the system, you either assign it to a utf8_t using operator*, or use a static function utf8_t::precoded. std::wstring is assumed to be utf16_t- or utf32_t-encoded already, depending on the underlying character width for the OS.

A function is simply declared with parameters of the type that it needs. You can call it with whichever type you've got, and it will be auto-converted to the needed type during the call, so for the most part you can ignore the different types and use whichever one makes the most sense for your application.

I use utf8_t as the main internal string type for my programs. For portable OS-interface functions, there's a typedef (os::native_t) to the type that the OS's API functions need. For Linux-based systems, it's utf8_t; for Windows, utf16_t. There's also a typedef (os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but I'm not sure there's a need for that.

There are some parts of the code that could use polishing, but I like the overall design, and I'm finding it pretty easy to work with. Anyone interested in seeing the code? -- Chad Nelson Oak Circle Software, Inc. * * *

On Sat, 15 Jan 2011 10:08:22 -0500, Chad Nelson wrote:
On Fri, 14 Jan 2011 10:59:09 -0500 Dave Abrahams <dave@boostpro.com> wrote:
Also, is there any use in trying to get the difference into the type system, e.g. by using some kind of wrapper over std::string that gives it a distinct "utf-8" type?
The system I'm now using for my programs might interest you.
I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t.
... snip
There are some parts of the code that could use polishing, but I like the overall design, and I'm finding it pretty easy to work with. Anyone interested in seeing the code?
Yes please! This sounds roughly like the solution I'd been imagining where, for instance, boost::filesystem::path has string and wstring constructors that work as they do now but also has path(utf8_string) constructors that must be called like this:

    std::string system_encoded_text = some_non_utf8_aware_library_call();
    filesystem::path utf8_file_path(boost::utf8_string(system_encoded_text));

The utf8_string class would do the conversion from the system encoding. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Sat, 15 Jan 2011 17:04:24 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Sat, 15 Jan 2011 10:08:22 -0500, Chad Nelson wrote:
On Fri, 14 Jan 2011 10:59:09 -0500 Dave Abrahams <dave@boostpro.com> wrote:
Also, is there any use in trying to get the difference into the type system, e.g. by using some kind of wrapper over std::string that gives it a distinct "utf-8" type?
The system I'm now using for my programs might interest you.
I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t.
... snip
There are some parts of the code that could use polishing, but I like the overall design, and I'm finding it pretty easy to work with. Anyone interested in seeing the code?
Yes please!
http://www.oakcircle.com/toolkit.html I've released it under the Boost license, so anyone may use it as they wish. I think one part of os.cpp uses a class from the library that I didn't include, but it's minor and can easily be replaced with one of the Boost.Random classes instead. Everything else should work stand-alone. It's pretty well documented, but ask me if you have any questions.
This sounds roughly like the solution I'd been imagining where, for instance, boost::filesystem::path has string and wstring constructors that work as they do now but also has path(utf8_string) constructors that must be called like this:

    std::string system_encoded_text = some_non_utf8_aware_library_call();
    filesystem::path utf8_file_path(boost::utf8_string(system_encoded_text));
The utf8_string class would do the conversion from the system encoding.
That's how I designed it. :-) -- Chad Nelson Oak Circle Software, Inc. * * *

At Sun, 16 Jan 2011 09:58:00 -0500, Chad Nelson wrote:
http://www.oakcircle.com/toolkit.html
I've released it under the Boost license, so anyone may use it as they wish.
Care to submit it for review? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Sun, 16 Jan 2011 11:38:20 -0500 Dave Abrahams <dave@boostpro.com> wrote:
At Sun, 16 Jan 2011 09:58:00 -0500, Chad Nelson wrote:
http://www.oakcircle.com/toolkit.html
I've released it under the Boost license, so anyone may use it as they wish.
Care to submit it for review?
Have you looked at the code? ;-) Seriously, I don't think it's anywhere near Boost-quality. There are at least a few changes I'd want to make before I'd consider it even marginally ready, and I can't take the time away from paying work right now to do them. However, if someone else wants to run with it, I'm willing to donate what I've got, and help with the work. -- Chad Nelson Oak Circle Software, Inc. * * *

From: Chad Nelson http://www.oakcircle.com/toolkit.html
I've released it under the Boost license, so anyone may use it as they wish.
A very nice and useful utility. Anyway, I'll share some comments, just in case you want to hear some. ;-)

"Be warned, if you try to convert a UTF-coded value to ASCII, each decoded character must fit into an unsigned eight-bit type. If it doesn't, the library will throw an \c oakcircle::unicode::will_not_fit exception."

I think that exception is not always appropriate. A better solution would be a policy-based class design or an additional conversion function accepting an error policy. This way the user could tell the converter to use some "similarly looking" or "invalid" character instead of throwing when exact conversion is not possible.

"Note that, like pointers, they can hold a null value as well, created by passing \c boost::none to the type's constructor or setting it equal to that value."

I don't feel the interface with pointer semantics is the most suitable here. Are there any practical advantages from being able to have a null string? Even if so, one could use an actual pointer or boost::optional anyway. Moreover, it would be nice if the proper encoding of the underlying string was the classes' invariant. Currently the classes cannot guarantee this because they allow direct access to the value, which may be freely changed by the user with no respect to the encoding. Best regards, Robert

On Sun, 16 Jan 2011 20:10:57 +0100 Robert Kawulak <robert.kawulak@gmail.com> wrote:
From: Chad Nelson http://www.oakcircle.com/toolkit.html
I've released it under the Boost license, so anyone may use it as they wish.
A very nice and useful utility. Anyway, I'll share some comments, just in case you want to hear some. ;-)
I'm always interested in comments -- thanks!
"Be warned, if you try to convert a UTF-coded value to ASCII, each decoded character must fit into an unsigned eight-bit type. If it doesn't, the library will throw an \c oakcircle::unicode::will_not_fit exception."
I think that exception is not always appropriate. A better solution would be a policy-based class design or additional conversion function accepting an error policy. This way the user could tell the converter to use some "similarly looking" or "invalid" character instead of throwing when exact conversion is not possible.
And if I were going to submit it for review, that's exactly what I'd want too. That code was written solely for my own use, or other programmers working with my company's code later, despite how the documentation makes it look.
"Note that, like pointers, they can hold a null value as well, created by passing \c boost::none to the type's constructor or setting it equal to that value."
I don't feel the interface with pointer semantics is the most suitable here. Are there any practical advantages from being able to have a null string?
Nope. That's there solely so that certain functions can use it to return an error value, using the same semantics as Boost.Optional, without explicitly wrapping it in a Boost.Optional. If I were going to submit it for review, I'd probably remove that completely.
Even if so, one could use an actual pointer or boost::optional anyway.
I did use Boost.Optional at first, but for my code, I found it easier to build that into the classes.
Moreover, it would be nice if the proper encoding of the underlying string was the classes' invariant. Currently the classes cannot guarantee this because they allow for direct access to the value which may be freely changed by the user with no respect to the encoding.
As I said, this was written solely for my company's code. I know how to ensure that changes to the internal data are consistent with the type, and the design ensures that doing so is awkward enough to make people scrutinize the code doing it carefully, so a code-review should catch any problems easily. But again, if I were to submit it to Boost, I'd likely change that first. I'd also want to add full string emulation. Right now it only partly emulates a string, and for any real work you're likely to need to access the internal data. -- Chad Nelson Oak Circle Software, Inc. * * *

The system I'm now using for my programs might interest you.
I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning one type to another automatically converts it to the target type during the copy. (Converting to ascii_t will throw an exception if a resulting character won't fit into eight bits.)
If so (and this is what I see in code) ASCII is misleading. It should be called Latin1/ISO-8859-1 but not ASCII.
An std::string is assumed to be ASCII-encoded. If you really do have UTF-8-encoded data to get into the system, you either assign it to a utf8_t using operator*, or use a static function utf8_t::precoded. std::wstring is assumed to be utf16_t- or utf32_t-encoded already, depending on the underlying character width for the OS.
This is a very bad assumption. To be honest, I've written lots of code with direct UTF-8 strings in it (the Boost.Locale tests) and this worked perfectly well with the MSVC, GCC and Intel compilers (as long as I work with char *, not L""), and it works fine all the time. It is a bad assumption; the encoding should be a byte string which may or may not be UTF-8. There are two cases where we need to treat strings and encodings specially: 1. We handle human language or text - collation, formatting etc. 2. We want to access the Windows Wide API, which is not locale agnostic.
For portable OS-interface functions, there's a typedef (os::native_t) to the type that the OS's API functions need. For Linux-based systems, it's utf8_t; for Windows, utf16_t. There's also a typedef (os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but I'm not sure there's a need for that.
When you work with Linux and Unix you should not change the encoding at all. There were discussions about it. For example, the following code:

    #include <fstream>
    #include <cstdio>
    #include <assert.h>

    int main()
    {
        {
            std::ofstream t("\xFF\xFF.txt");
            if(!t) {
                // Not valid for this OS - Mac OS X
                return 0;
            }
            t << "test";
            t.close();
        }
        {
            std::ifstream t("\xFF\xFF.txt");
            std::string s;
            t >> s;
            assert( s=="test");
            t.close();
        }
        std::remove("\xFF\xFF.txt");
    }

This is valid code and works regardless of the current locale on POSIX platforms. Using your API it would fail, as it holds assumptions about the encoding.
There are some parts of the code that could use polishing, but I like the overall design, and I'm finding it pretty easy to work with. Anyone interested in seeing the code?
IMHO, I don't think that inventing new strings or new text containers is a way to go. std::string is perfectly fine as long as you code in consistent way. Artyom

On Sun, 16 Jan 2011 12:56:23 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
The system I'm now using for my programs might interest you.
I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning one type to another automatically converts it to the target type during the copy. (Converting to ascii_t will throw an exception if a resulting character won't fit into eight bits.)
If so (and this is what I see in code) ASCII is misleading. It should be called Latin1/ISO-8859-1 but not ASCII.
Probably, but latin1_t isn't very obvious, and iso_8859_1_t is a little awkward to type. ;-) As I've said, this code was written solely for my company, I'd make a number of changes if I were going to submit it to Boost.
An std::string is assumed to be ASCII-encoded. If you really do have UTF-8-encoded data to get into the system, you either assign it to a utf8_t using operator*, or use a static function utf8_t::precoded. std::wstring is assumed to be utf16_t- or utf32_t-encoded already, depending on the underlying character width for the OS.
This is a very bad assumption. To be honest, I've written lots of code with direct UTF-8 strings in it (the Boost.Locale tests) and this worked perfectly well with the MSVC, GCC and Intel compilers (as long as I work with char *, not L""), and it works fine all the time.
It is a bad assumption; the encoding should be a byte string which may or may not be UTF-8.
But if you assigned that byte string to a utf*_t type, how would you treat it? I had to either make some assumption, or disallow assigning from an std::string and char* entirely. And it's just too convenient to use those assignments, for things like constants, to give that up. The way I designed it, you're supposed to feed it only ASCII (or Latin-1, if you prefer) text when you make an assignment that way. If you have some differently-coded text, you'd feed it in through another class, one that knows its coding and is designed to decode to UTF-32 the way that utf8_t and utf16_t are, so that the templated conversion functions know how to handle it.
There are two cases we need to treat strings and encoding:
1. We handle human language or text - collation, formatting etc. 2. We want to access Windows Wide API that is not locale agnostic.
I'm not sure where you're coming from. Those are two broad categories of uses for that code, but arguably not the only two.
For portable OS-interface functions, there's a typedef (os::native_t) to the type that the OS's API functions need. For Linux-based systems, it's utf8_t; for Windows, utf16_t. There's also a typedef (os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but I'm not sure there's a need for that.
When you work with Linux and Unix at all you should not change encoding. There were discussions about it. [...] Using your API it would fail as it holds some assumptions on encoding.
Why would you feed "\xFF\xFF.txt" into a utf8_t type, if you didn't want it encoded as UTF-8? If you have a function that requires some different encoding, you'd use that encoding instead. For filenames, you'd treat the strings entered by the user or obtained from the file system as opaque blocks of bytes. In any case, all modern Linux OSes use UTF-8 by default, so I haven't seen any need to worry about other forms yet. I'm not even sure how I'd tell what code-page a Linux system is set to use, so far I've never needed to know that. Though if a Russian customer comes along and tells me my code doesn't work right on his Linux system, I'll re-think that.
There are some parts of the code that could use polishing, but I like the overall design, and I'm finding it pretty easy to work with. Anyone interested in seeing the code?
IMHO, I don't think that inventing new strings or new text containers is a way to go. std::string is perfectly fine as long as you code in consistent way.
I have to respectfully disagree. std::string says nothing about the encoding of the data within it. If you're using more than one type of encoding in your program, like Latin-1 and UTF-8, then using std::strings is like using void pointers -- no type safety, no way to automate conversions when necessary, and no way to select overloaded functions based on the encoding. A C++ solution pretty much requires that they be unique types. -- Chad Nelson Oak Circle Software, Inc. * * *

On Sun, 16 Jan 2011 21:41:25 -0500, Chad Nelson wrote:
On Sun, 16 Jan 2011 12:56:23 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
The system I'm now using for my programs might interest you.
I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning one type to another automatically converts it to the target type during the copy. (Converting to ascii_t will throw an exception if a resulting character won't fit into eight bits.)
If so (and this is what I see in code) ASCII is misleading. It should be called Latin1/ISO-8859-1 but not ASCII.
Probably, but latin1_t isn't very obvious, and iso_8859_1_t is a little awkward to type. ;-) As I've said, this code was written solely for my company, I'd make a number of changes if I were going to submit it to Boost.
I'm a little concerned by this talk of ASCII and Latin1. When, say, utf8_t is given a char*, does it not treat it as OS-default encoded rather than ASCII/Latin1? I've skimmed the code but haven't managed to work out how the classes treat this case. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Mon, 17 Jan 2011 11:14:26 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Sun, 16 Jan 2011 21:41:25 -0500, Chad Nelson wrote:
If so (and this is what I see in code) ASCII is misleading. It should be called Latin1/ISO-8859-1 but not ASCII.
Probably, but latin1_t isn't very obvious, and iso_8859_1_t is a little awkward to type. ;-) As I've said, this code was written solely for my company, I'd make a number of changes if I were going to submit it to Boost.
I'm a little concerned by this talk of ASCII and Latin1. When, say, utf8_t is given a char*, does it not treat it as OS-default encoded rather than ASCII/Latin1? I've skimmed the code but haven't managed to work out how the classes treat this case.
Right now, the utf*_t classes assume that any std::string fed directly into them is meant to be translated as-is. It's assumed to consist of characters that should be directly encoded as their unsigned values. That works perfectly for seven-bit ASCII text, but may be problematic for values with the high-bit set.

I've done some research, and it looks like it would require little effort to create an os::string_t type that uses the current locale, and assume all raw std::strings that contain eight-bit values are coded in that instead.

Design-wise, ascii_t would need to change slightly after this, to throw on anything that can't fit into a *seven*-bit value, rather than eight-bit. I'll add the default-character option to both types as well, and maybe make other improvements as I have time.

With this change, the os::native_t typedef would either be completely redundant or simply wrong, so I'll remove it. I should be able to find the time for that sometime this week, if all goes well.

Artyom, since you seem to have more experience with this stuff than I, what do you think? Would those alterations take care of your objections? -- Chad Nelson Oak Circle Software, Inc. * * *

On Mon, 17 Jan 2011 09:39:20 -0500, Chad Nelson wrote:
Right now, the utf*_t classes assume that any std::string fed directly into them is meant to be translated as-is. It's assumed to consist of characters that should be directly encoded as their unsigned values. That works perfectly for seven-bit ASCII text, but may be problematic for values with the high-bit set.
I've done some research, and it looks like it would require little effort to create an os::string_t type that uses the current locale, and assume all raw std::strings that contain eight-bit values are coded in that instead.
I'm not sure about the os namespace ;) What about just calling it native_t like your other class but in the same namespace as utf8_t etc.
Design-wise, ascii_t would need to change slightly after this, to throw on anything that can't fit into a *seven*-bit value, rather than eight-bit. I'll add the default-character option to both types as well, and maybe make other improvements as I have time.
Sounds good.
Artyom, since you seem to have more experience with this stuff than I, what do you think? Would those alterations take care of your objections?
Also, Artyom's Boost.Locale does very sophisticated encoding conversion but the unicode conversions done by utf*_t look (scarily?) small. Do they do as good a job or should these classes make use of the conversions in Boost.Locale? Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Mon, 17 Jan 2011 15:12:48 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
I've done some research, and it looks like it would require little effort to create an os::string_t type that uses the current locale, and assume all raw std::strings that contain eight-bit values are coded in that instead.
I'm not sure about the os namespace ;) What about just calling it native_t like your other class but in the same namespace as utf8_t etc.
If os::native_t were still going to be around, I wouldn't want something that potentially confusing. But since it's going away, I see no problem with that. I've updated my notes.
Artyom, since you seem to have more experience with this stuff than I, what do you think? Would those alterations take care of your objections?
Also, Artyom's Boost.Locale does very sophisticated encoding conversion but the unicode conversions done by utf*_t look (scarily?) small. Do they do as good a job or should these classes make use of the conversions in Boost.Locale?
They should probably use Boost.Locale. I just haven't looked at it yet. I'll check it out when I get some time to dig into that project again, likely later this week. -- Chad Nelson Oak Circle Software, Inc. * * *

I've done some research, and it looks like it would require little effort to create an os::string_t type that uses the current locale, and assume all raw std::strings that contain eight-bit values are coded in that instead.
Design-wise, ascii_t would need to change slightly after this, to throw on anything that can't fit into a *seven*-bit value, rather than eight-bit. I'll add the default-character option to both types as well, and maybe make other improvements as I have time.
Unfortunately this is not the correct approach either. For example, why do you think it is safe to pass the ASCII subset of UTF-8 to a current non-UTF-8 locale? Shift-JIS, which is in use for the Windows/ANSI API, has a different subset in the 0-127 range - it is not ASCII!

Also, if you want to use a std::codecvt facet... don't rely on them unless you know where they come from!

1. By default they are a no-op - in the default C locale.
2. Under most compilers they are not implemented properly.

    OS \ Compiler    MSVC    GCC     SunOS/stlport   SunOS/standard
    ---------------------------------------------------------------
    Windows          ok      none    -               -
    Linux            -       ok      ?               ?
    Mac OS X         -       none    -               -
    FreeBSD          -       none    -               -
    Solaris          -       none    buggy!          ok-but-non-standard

Bottom line: don't rely on the "current locale" :-)
Artyom, since you seem to have more experience with this stuff than I, what do you think? Would those alterations take care of your objections?
The rule of thumb is the following:

- When you handle strings as text storage, just use std::string.
- When you do a system call:
  a) on POSIX - pass it as is
  b) on Windows - convert to the Wide API from UTF-8
- When handling text as text (i.e. formatting, collation etc.), use a good library.

I would strongly recommend reading the answer of Pavel Radzivilovsky on Stackoverflow: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf... And he is a hard-core Windows programmer, designer, architect and developer, and still he has chosen UTF-8! The problem is that the issue is so complicated that the only way to make it both absolutely general and correct is to decide what you are working with and stick with it. In the CppCMS project I work on (and I developed Boost.Locale because of it) I stick by default with UTF-8 and use plain std::string - works like a charm. Inventing "special Unicode strings or storage" does not improve anybody's understanding of Unicode nor improve its handling. Best, Artyom

On Mon, Jan 17, 2011 at 1:09 PM, Artyom <artyomtnk@yahoo.com> wrote:
Artyom, since you seem to have more experience with this stuff than I, what do you think? Would those alterations take care of your objections?
<snip>
I would strongly recommend to read the answer of Pavel Radzivilovsky on Stackoverflow:
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...
I just want to say, as a note of encouragement, that I would absolutely *love* to see these problems addressed in Boost by people who really know this domain (and those willing to listen and work with them). I'm excited by the direction of this thread! -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Mon, 17 Jan 2011 17:35:54 -0500 Dave Abrahams <dave@boostpro.com> wrote:
[...] I just want to say, as a note of encouragement, that I would absolutely *love* to see these problems addressed in Boost by people who really know this domain (and those willing to listen and work with them). I'm excited by the direction of this thread!
It's starting to look like I'll have to improve the utf*_t classes anyway, for my company's current project. I'm almost certain that we'll be happy to donate the results to the Boost library. -- Chad Nelson Oak Circle Software, Inc. * * *

On Mon, 17 Jan 2011 10:09:13 -0800 (PST), Artyom wrote:
The problem is that the issue is so complicated that the only way to make it both absolutely general and correct is to decide what you are working with and stick with it.
In the CppCMS project I work on (and I developed Boost.Locale because of it) I stick by default with UTF-8 and use plain std::string - works like a charm.
Inventing "special Unicode strings or storage" does not improve anybody's understanding of Unicode nor improve its handling.
I don't understand how it could possibly not help. If I see an api function call_me(std::string arg) I know next to nothing about what it's expecting from the string (except that by convention it tends to mean 'string in OS-default encoding'). If I see call_me(boost::utf8_t arg), I know *exactly* what it's after. Further, assuming I know what format my own strings are in, I know how to provide it with what it expects. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Mon, 17 Jan 2011 23:14:48 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Mon, 17 Jan 2011 10:09:13 -0800 (PST), Artyom wrote:
[...] Inventing "special Unicode strings or storage" does not improve anybody's understanding of Unicode nor improve its handling.
I don't understand how it could possibly not help. If I see an api function call_me(std::string arg) I know next to nothing about what it's expecting from the string (except that by convention it tends to mean 'string in OS-default encoding'). If I see call_me(boost::utf8_t arg), I know *exactly* what it's after. Further, assuming I know what format my own strings are in, I know how to provide it with what it expects.
+1. +100. :-) That's exactly what I was aiming for. And as an added bonus, if you've got a string type that can translate itself to utf32_t, then it doesn't matter what kind of string the function wants because the classes can handle the conversions themselves. However, after looking into the matter further for this discussion, I see that he does have a valid point about locales and various encodings. My classes definitely don't handle those well enough yet, and the program we're currently developing (which I'm not at liberty to discuss until it's released) will almost certainly need that. I really wanted to avoid a dependency on the ICU library or anything similar if at all possible, but it looks like it might be inevitable. :-( -- Chad Nelson Oak Circle Software, Inc. * * *

On Mon, 17 Jan 2011 18:47:04 -0500, Chad Nelson wrote:
I really wanted to avoid a dependency on the ICU library or anything similar if at all possible, but it looks like it might be inevitable. :-(
You may well find that you can ;) Artyom's latest work on Boost.Locale allows you to select from a range of different backends giving varying levels of locale support. ICU gives the 'best' results but for my project Swish, for instance, I didn't need any of these advanced features so I just use the Win32 backend. This uses the Windows API to do the conversions etc. and freed me from the beast that is ICU. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Tue, 18 Jan 2011 02:47:54 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Mon, 17 Jan 2011 18:47:04 -0500, Chad Nelson wrote:
I really wanted to avoid a dependency on the ICU library or anything similar if at all possible, but it looks like it might be inevitable. :-(
You may well find that you can ;) Artyom's latest work on Boost.Locale allows you to select from a range of different backends giving varying levels of locale support. ICU gives the 'best' results but for my project Swish, for instance, I didn't need any of these advanced features so I just use the Win32 backend. This uses the Windows API to do the conversions etc. and freed me from the beast that is ICU.
Oh! I didn't realize that, thanks for the information! In that case, what would people say to not having any conversion code in the Unicode strings stuff at all (other than between the different UTF-* codings, and maybe to and from ASCII for convenience), and relying on Boost.Locale for that? Then the trade-offs are up to the developer using each. I'll have to see how painless I can make the boundaries between them. -- Chad Nelson Oak Circle Software, Inc. * * *

On Tue, 18 Jan 2011 09:51:12 -0500, Chad Nelson wrote:
On Tue, 18 Jan 2011 02:47:54 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Mon, 17 Jan 2011 18:47:04 -0500, Chad Nelson wrote:
I really wanted to avoid a dependency on the ICU library or anything similar if at all possible, but it looks like it might be inevitable. :-(
You may well find that you can ;) Artyom's latest work on Boost.Locale allows you to select from a range of different backends giving varying levels of locale support. ICU gives the 'best' results but for my project Swish, for instance, I didn't need any of these advanced features so I just use the Win32 backend. This uses the Windows API to do the conversions etc. and freed me from the beast that is ICU.
Oh! I didn't realize that, thanks for the information!
In that case, what would people say to not having any conversion code in the Unicode strings stuff at all (other than between the different UTF-* codings, and maybe to and from ASCII for convenience), and relying on Boost.Locale for that? Then the trade-offs are up to the developer using each.
I don't think the string classes should implement _any_ of the conversions themselves but should delegate them all to Boost.Locale. However, they should look like they're doing the conversions by hiding the Boost.Locale aspect from the caller as much as possible. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Tue, 18 Jan 2011 15:29:05 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Tue, 18 Jan 2011 09:51:12 -0500, Chad Nelson wrote:
In that case, what would people say to not having any conversion code in the Unicode strings stuff at all (other than between the different UTF-* codings, and maybe to and from ASCII for convenience), and relying on Boost.Locale for that? Then the trade-offs are up to the developer using each.
I don't think the string classes should implement _any_ of the conversions themselves but should delegate them all to Boost.Locale. However, they should look like they're doing the conversions by hiding the Boost.Locale aspect from the caller as much as possible.
Why delegate them to another library? Those classes already have efficient, flexible, and correct iterator-based template code for the conversions between the UTF-* types. I'd rather just farm out the stuff that those types are weak at, like converting to and from system-specific locales. -- Chad Nelson Oak Circle Software, Inc. * * *

On Tue, 18 Jan 2011 10:54:57 -0500, Chad Nelson wrote:
On Tue, 18 Jan 2011 15:29:05 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
I don't think the string classes should implement _any_ of the conversions themselves but should delegate them all to Boost.Locale. However, they should look like they're doing the conversions by hiding the Boost.Locale aspect from the caller as much as possible.
Why delegate them to another library? Those classes already have efficient, flexible, and correct iterator-based template code for the conversions between the UTF-* types. I'd rather just farm out the stuff that those types are weak at, like converting to and from system-specific locales.
If they can do that, that's great! The conversion code was so short that I assumed it wasn't a full, complete conversion algorithm. After all, something the size of ICU is apparently necessary for full Unicode support! Please forgive my scepticism :P Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Tue, 18 Jan 2011 16:04:29 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Tue, 18 Jan 2011 10:54:57 -0500, Chad Nelson wrote:
Why delegate them to another library? Those classes already have efficient, flexible, and correct iterator-based template code for the conversions between the UTF-* types. I'd rather just farm out the stuff that those types are weak at, like converting to and from system-specific locales.
If they can do that, that's great! The conversion code was so short that I assumed it wasn't a full, complete conversion algorithm.
They're complete, and accurate. The algorithms aren't overly complex, they just translate between different forms of the exact same data, after all.
After all, something the size of ICU is apparently necessary for full Unicode support!
Please forgive my scepticism :P
Of course! :-) It's an understandable confusion, full Unicode support involves a *lot* more than what those classes handle, or are meant to. -- Chad Nelson Oak Circle Software, Inc. * * *

On Tue, 18 Jan 2011 16:04:29 +0000 Alexander Lamaison<awl03@doc.ic.ac.uk> wrote:
On Tue, 18 Jan 2011 10:54:57 -0500, Chad Nelson wrote:
On 01/18/2011 08:23 AM, Chad Nelson wrote:
Why delegate them to another library? Those classes already have efficient, flexible, and correct iterator-based template code for the conversions between the UTF-* types. I'd rather just farm out the stuff that those types are weak at, like converting to and from system-specific locales. If they can do that, that's great! The conversion code was so short that I assumed it wasn't a full, complete conversion algorithm. They're complete, and accurate. The algorithms aren't overly complex, they just translate between different forms of the exact same data, after all.
If you can assume that the encoding is correct already, that's true. Most of the code to convert from utf-8 to utf-32 or utf-16, for example, is there to check that you don't have overly long encodings that cause security issues or other violations of the well-formedness table in the unicode spec. Otherwise, especially if you carry things around in utf-8 by preference, and do your checking in that encoding, you open yourself up to problems (http://capec.mitre.org/data/definitions/80.html). If you don't ever accept utf-8 encoded things from users, of course, you don't have to worry about this, but I would write the conversion defensively. I should say that I haven't read your code yet and you might very well do this correctly. The code conversion facet used by a lot of boost code doesn't. It was written to an older version of the spec for utf-8 and allows 5 and 6 character encodings. It does have these security concerns. I offered awhile back to replace it, but assumed that with the locale stuff coming up for review it would be better to go with that. I did write a replacement for utf8_codecvt_facet.cpp/utf8_codecvt_facet.hpp that could be dropped in for the use of serialization and passes the tests in that part of boost. Patrick
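Patrick's overlong-encoding concern is easy to see in code. Below is a minimal decoder sketch (the function name and structure are mine, not from any library under discussion): it tracks the smallest code point each sequence length may legally encode, so CAPEC-80-style overlong forms such as 0xC0 0xAF for '/' are rejected, as are the 5- and 6-byte lead bytes the older spec allowed.

```cpp
#include <cassert>
#include <string>

// Sketch: decode one UTF-8 code point starting at pos, rejecting the
// overlong forms Patrick describes.  Returns -1 on any malformed input.
long decode_one(const std::string& s, std::size_t pos)
{
    if (pos >= s.size()) return -1;
    unsigned char b0 = s[pos];
    if (b0 < 0x80) return b0;                        // 1-byte ASCII

    int len;
    long cp, min_cp;                                 // min_cp rejects overlongs
    if      ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80;    }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800;   }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return -1;                  // stray continuation byte or 5/6-byte lead

    if (pos + len > s.size()) return -1;             // truncated sequence
    for (int i = 1; i < len; ++i) {
        unsigned char b = s[pos + i];
        if ((b & 0xC0) != 0x80) return -1;           // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_cp) return -1;                      // overlong: e.g. C0 AF for '/'
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return -1;  // surrogates
    return cp;
}
```

The `min_cp` comparison is the whole defense: a two-byte sequence that decodes below U+0080 (or a three-byte one below U+0800, etc.) is exactly the attack pattern linked above.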

On Tue, 18 Jan 2011 13:18:58 -0800 Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/18/2011 08:23 AM, Chad Nelson wrote:
If they can do that, that's great! The conversion code was so short that I assumed it wasn't a full, complete conversion algorithm. They're complete, and accurate. The algorithms aren't overly complex, they just translate between different forms of the exact same data, after all. If you can assume that the encoding is correct already, that's true. Most of the code to convert from utf-8 to utf-32 or utf-16, for example, is there to check that you don't have overly long encodings that cause security issues or other violations of the well-formedness table in the unicode spec. Otherwise, especially if you carry things around in utf-8 by preference, and do your checking in that encoding, you open yourself up to problems.
(http://capec.mitre.org/data/definitions/80.html). If you don't ever accept utf-8 encoded things from users, of course, you don't have to worry about this, but I would write the conversion defensively.
The conversion code in those classes does exactly that, and will (at the moment) throw an exception on any problem. It is, again at the moment, possible for a programmer to get invalid encodings into the utf*_t strings, but it shouldn't be possible to ever get them from the conversion functions. The unit tests that I wrote for it (not included in the package) deliberately try to feed in invalid code, just to ensure that it's caught correctly.
I should say that I haven't read your code yet and you might very well do this correctly. The code conversion facet used by a lot of boost code doesn't. It was written to an older version of the spec for utf-8 and allows 5 and 6 character encodings. It does have these security concerns.
Then having freshly-written code, using the latest specifications, is an advantage. ;-)
I offered awhile back to replace it, but assume that with the locale stuff coming up for review it would be better to go with that. I did write a replacement for utf8_codecvt_facet.cpp utf8_codecvt_facet.hpp that could be dropped in for the use of serialization and passes the tests in that part of boost.
I saw your message, and your generous offer. It, and the silence that greeted it, is part of what convinced me that I needed to write my own conversion functions. -- Chad Nelson Oak Circle Software, Inc. * * *

On 01/18/2011 04:39 PM, Chad Nelson wrote:
...elision by patrick... The conversion code in those classes does exactly that, and will (at the moment) throw an exception on any problem. It is, again at the moment, possible for a programmer to get invalid encodings into the utf*_t strings, but it shouldn't be possible to ever get them from the conversion functions. The unit tests that I wrote for it (not included in the package) deliberately try to feed in invalid code, just to ensure that it's caught correctly.
It shouldn't be possible at all to have one with invalid encodings in it. Is it that you don't check in the constructors to make sure that the data passed in is valid for the encoding? I could just imagine someone ending up with user data from a web page in one of these strings. Could you get invalid data in there? If so, it's just a matter of a clever person looking for an exploit. You don't want to go passing around utf8_t strings that are invalid to trusting routines. If you _are_ going to have these types, their utility comes from being able to trust that they are what they say they are. If you can have one that isn't what it says it is, you might as well just have std::string. Patrick

On Tue, 18 Jan 2011 17:27:27 -0800 Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/18/2011 04:39 PM, Chad Nelson wrote:
It is, again at the moment, possible for a programmer to get invalid encodings into the utf*_t strings, but it shouldn't be possible to ever get them from the conversion functions. The unit tests that I wrote for it (not included in the package) deliberately tries to feed in invalid code, just to ensure that it's caught correctly.
It shouldn't be possible at all to have one with invalid encodings in it. Is it that you don't check in the constructors to make sure that the data passed in is valid for the encoding?
In the present incarnation, it's that the code using the classes can directly manipulate the internal storage if it wants to. For the purpose I designed those classes (use within my company), that's not a problem, but I'll certainly change it before offering it up for dissection by bloodthirsty Boost reviewers. ;-)
I could just imagine someone ending up with user data from a web page in one of these strings. Could you get invalid data in there?
Only if the program blindly puts it there -- a problem that our code-review system should prevent. In the hypothetical Boost version, you'd *have* to feed it into the class through something like the utf8_t::precoded function, and that function would confirm that it's all correct before allowing it in.
If so, it's just a matter of a clever person looking for an exploit. You don't want to go passing around utf8_t strings that are invalid to trusting routines. If you _are_ going to have these types their utility comes from being able to trust that they are what they say they are. If you can have one that isn't what it says it is you might as well just have std::string.
A valid point, and one I'll keep in mind for the next iteration of those classes. -- Chad Nelson Oak Circle Software, Inc. * * *

Alexander Lamaison wrote:
I don't understand how it could possibly not help. If I see an api function call_me(std::string arg) I know next to nothing about what it's expecting from the string (except that by convention it tends to mean 'string in OS-default encoding').
You should read the documentation of call_me (*). Yes, I know that in the real world the documentation often doesn't specify an encoding (worse - the encoding varies between platforms and even versions of the same library), but if the developer of call_me hasn't bothered to document the encoding of the argument, he won't bother to use a special UTF-8 type for the argument, either. :-)
(*) And the documentation should either say that call_me accepts UTF-8, or that call_me is encoding-agnostic, that is, it treats the string as a byte sequence.
I can think of one reason to use a separate type - if you want to overload on encoding:
void f( latin1_t arg ); void f( utf8_t arg );
In most such cases that spring to mind, however, what the user actually wants is:
void f( string arg, encoding_t enc );
or even
void f( string arg, string encoding );
In principle, as Chad Nelson says, it's useful to have separate types if the program uses several different encodings at once, fixed at compile time. I don't consider such a way of programming a good idea though. Strings should be either byte sequences or UTF-8; input can be of any encoding, possibly not known until runtime, but it should always be either processed as a byte sequence or converted to UTF-8 as a first step.
Regarding the OS-default encoding - if, on Windows, you ever encounter or create a string in the OS default encoding, you've already lost - this code can't be correct. :-)

On Mon, Jan 17, 2011 at 10:09 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Alexander Lamaison wrote:
I don't understand how it could possibly not help. If I see an api function call_me(std::string arg) I know next to nothing about what it's expecting from the string (except that by convention it tends to mean 'string in OS-default encoding').
You should read the documentation of call_me (*). Yes, I know that in the real world the documentation often doesn't specify an encoding (worse - the encoding varies between platforms and even versions of the same library), but if the developer of call_me hasn't bothered to document the encoding of the argument, he won't bother to use a special UTF-8 type for the argument, either. :-)
(*) And the documentation should either say that call_me accepts UTF-8, or that call_me is encoding-agnostic, that is, it treats the string as a byte sequence.
I can think of one reason to use a separate type - if you want to overload on encoding:
void f( latin1_t arg ); void f( utf8_t arg );
In most such cases that spring to mind, however, what the user actually wants is:
void f( string arg, encoding_t enc );
or even
void f( string arg, string encoding );
In principle, as Chad Nelson says, it's useful to have separate types if the program uses several different encodings at once, fixed at compile time. I don't consider such a way of programming a good idea though. Strings should be either byte sequences or UTF-8; input can be of any encoding, possibly not known until runtime, but it should always be either processed as a byte sequence or converted to UTF-8 as a first step.
DISCLAIMER: I have almost no experience with the details of this stuff. I only know a few general things about programming (fewer every day). I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land. The typical failures I've seen, where there is no such mechanism (e.g. in Python where there's no static typing), are caused because programmers lose track of whether what they're handling is encoded as utf-8 or not. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
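Dave's type-safety barrier can be sketched in a few lines. All names below are illustrative (this is not Chad's actual utf*_t interface): the point is only that the wrapper's single public way in from raw bytes is a checkpoint that validates, so any function taking the wrapper type downstream can trust the encoding without re-checking it.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical sketch of a type-safety barrier around UTF-8 text.
class utf8_string {
    std::string bytes_;
    explicit utf8_string(std::string b) : bytes_(std::move(b)) {}
public:
    // Boundary function: the only way in from untrusted bytes.
    static utf8_string from_bytes(const std::string& b)
    {
        if (!looks_like_utf8(b))         // stand-in for a full validator
            throw std::invalid_argument("not well-formed UTF-8");
        return utf8_string(b);
    }
    const std::string& bytes() const { return bytes_; }

    // Crude structural check: every continuation byte must follow a lead
    // byte that announced it (a real validator would also reject overlongs).
    static bool looks_like_utf8(const std::string& b)
    {
        int pending = 0;                 // continuation bytes still expected
        for (unsigned char c : b) {
            if (pending) {
                if ((c & 0xC0) != 0x80) return false;
                --pending;
            }
            else if (c < 0x80)           continue;
            else if ((c & 0xE0) == 0xC0) pending = 1;
            else if ((c & 0xF0) == 0xE0) pending = 2;
            else if ((c & 0xF8) == 0xF0) pending = 3;
            else return false;           // stray continuation or invalid lead
        }
        return pending == 0;             // reject truncated sequences
    }
};

// Code in "utf8-land" takes utf8_string and never worries about encoding.
std::size_t byte_length(const utf8_string& s) { return s.bytes().size(); }
```

This mirrors the failure mode Dave describes from Python: without the barrier type, nothing reminds you at the boundary that decoding or validation is still owed.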

On Mon, Jan 17, 2011 at 7:33 PM, Dave Abrahams <dave@boostpro.com> wrote:
On Mon, Jan 17, 2011 at 10:09 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Alexander Lamaison wrote:
I don't understand how it could possibly not help. If I see an api function call_me(std::string arg) I know next to nothing about what it's expecting from the string (except that by convention it tends to mean 'string in OS-default encoding').
You should read the documentation of call_me (*). Yes, I know that in the real world the documentation often doesn't specify an encoding (worse - the encoding varies between platforms and even versions of the same library), but if the developer of call_me hasn't bothered to document the encoding of the argument, he won't bother to use a special UTF-8 type for the argument, either. :-)
(*) And the documentation should either say that call_me accepts UTF-8, or that call_me is encoding-agnostic, that is, it treats the string as a byte sequence.
I can think of one reason to use a separate type - if you want to overload on encoding:
void f( latin1_t arg ); void f( utf8_t arg );
In most such cases that spring to mind, however, what the user actually wants is:
void f( string arg, encoding_t enc );
or even
void f( string arg, string encoding );
In principle, as Chad Nelson says, it's useful to have separate types if the program uses several different encodings at once, fixed at compile time. I don't consider such a way of programming a good idea though. Strings should be either byte sequences or UTF-8; input can be of any encoding, possibly not known until runtime, but it should always be either processed as a byte sequence or converted to UTF-8 as a first step.
DISCLAIMER: I have almost no experience with the details of this stuff. I only know a few general things about programming (fewer every day).
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land. The typical failures I've seen, where there is no such mechanism (e.g. in Python where there's no static typing), are caused because programmers lose track of whether what they're handling is encoded as utf-8 or not.
UTF-8 allows the use of char * for type erasure for strings, much like void * allows that in general. Using C++ type tags to discriminate between different data pointed by void pointers is mostly redundant except when type safety is postponed until run-time; and that's only marginally safer than using string tags. Emil Dotchevski Reverge Studios, Inc. http://revergestudios.com/reblog/index.php?n=ReCode.ReCode

At Mon, 17 Jan 2011 21:46:36 -0800, Emil Dotchevski wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land. The typical failures I've seen, where there is no such mechanism (e.g. in Python where there's no static typing), are caused because programmers lose track of whether what they're handling is encoded as utf-8 or not.
UTF-8 allows the use of char * for type erasure for strings, much like void * allows that in general.
Yes, that's exactly my point, although this isn't a property of UTF-8; it's a more general thing. In a dynamic language like Python everything is type-erased.
Using C++ type tags to discriminate between different data pointed by void pointers is mostly redundant
Exactly. I'm suggesting, essentially, to avoid the use of void pointers except where you're forced to, at the boundaries with "legacy" interfaces. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though. It's a bit like defining a separate integer type for nonnegative ints for type safety reasons - useful in theory, but nobody does it. If you're designing an interface that takes UTF-8 strings, it still may be worth it to have the parameters be of a utf8-specific type, if you want to force your users to think about the encoding of the argument each time they call one of your functions... this is a legitimate design decision. If you're in control of the whole program, though, it's usually not worth it - you just keep everything in UTF-8.

On Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
Which is exactly why it's necessary: everybody _isn't_ operating in UTF-8 land. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

From: Alexander Lamaison <awl03@doc.ic.ac.uk>
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in
UTF-8 land though.
Which is exactly why it's necessary: everybody _isn't_ operating in UTF-8 land.
The problem is that you need to pick some encoding, and UTF-8 is the most universal and useful. Otherwise you should:
1. Reinvent the string
2. Reinvent the standard library to use the new string
3. Reinvent 1001 other libraries to use the new string.
It is just neither feasible nor necessary. Artyom

At Tue, 18 Jan 2011 05:35:17 -0800 (PST), Artyom wrote:
From: Alexander Lamaison <awl03@doc.ic.ac.uk>
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in
UTF-8 land though.
Which is exactly why it's necessary: everybody _isn't_ operating in UTF-8 land.
The problem is that you need to pick some encoding and UTF-8 is the most universal and useful.
Why is that a problem?
Otherwise you should:
1. Reinvent the string
My idea is that you just wrap it. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Tue, 18 Jan 2011 05:35:17 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
From: Alexander Lamaison <awl03@doc.ic.ac.uk>
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
Which is exactly why it's necessary: everybody _isn't_ operating in UTF-8 land.
The problem is that you need to pick some encoding and UTF-8 is the most universal and useful.
I'll second that. Little wasted space, no byte-order problems, and very easy to work with (finding the first byte of a character, for instance, is child's play).
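The "child's play" property Chad refers to comes from UTF-8's self-synchronizing design: continuation bytes all match the bit pattern 10xxxxxx, so from any byte you can scan backwards to the start of the current character without decoding anything. A sketch (function name is mine):

```cpp
#include <cassert>
#include <string>

// Given any byte index into well-formed UTF-8, step back over
// continuation bytes (10xxxxxx) to the first byte of that code point.
std::size_t codepoint_start(const std::string& s, std::size_t i)
{
    while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
        --i;   // continuation bytes are never the start of a character
    return i;
}
```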
Otherwise you should:
1. Reinvent the string
Or at least wrap it. ;-)
2. Reinvent standard library to use new string
Not entirely necessary, for the same reason that very few changes to the standard library are needed when you switch from char strings to char16_t strings to char32_t strings -- the standard library, designed around the idea of iterators, is mostly type-agnostic. The utf*_t types provide fully functional iterators, so they'll work fine with most library functions, so long as those functions don't care that some characters are encoded as multiple bytes. It's just the ones that assume that a single byte represents all characters that you have to replace, and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
3. Reinvent 1001 other libraries to use the new string.
Again, seldom necessary. Just use a type system that can translate between your internal coding and what the library wants, at the boundaries. If the other library you want to use can't handle multi-byte encodings, you'd have to modify or reinvent it anyway.
It is just neither feasible nor necessary.
My code says it's perfectly feasible. ;-) Whether it's necessary or not is up to the individual developer, but the type-safety it offers is more in line with the design philosophy of C++ than using std::string for everything. I hate to harp on the same tired example, but why do you really need any pointer type other than void*? It's the same idea. -- Chad Nelson Oak Circle Software, Inc. * * *

Otherwise you should:
1. Reinvent the string
Or at least wrap it. ;-)
2. Reinvent standard library to use new string
Not entirely necessary, for the same reason that very few changes to the standard library are needed when you switch from char strings to char16_t strings to char32_t strings -- the standard library, designed around the idea of iterators, is mostly type-agnostic.
Ok... A few things:
1. UTF-32 is a waste of space - don't use it unless it is for something like handling code points (char32_t)
2. UTF-16 is too error-prone (See: UTF-16 considered harmful)
3. There is no special type char8_t distinct from char, so you can't use it.
The utf*_t types provide fully functional iterators,
Ok, let's think: what do you need iterators for? Accessing "characters"? If so, you are most likely doing something terribly wrong, as you ignore the fact that codepoint != character. I would say such an iterator is wrong by design unless you develop a Unicode algorithm that relates to code points.
so they'll work fine with most library functions, so long as those functions don't care that some characters are encoded as multiple bytes. It's just the ones that assume that a single byte represents all characters that you have to replace, and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
Ok... The paragraph above is inherently wrong; first of all, let's clean a few things up:
that some characters are encoded as multiple bytes
Characters are not code points.
the ones that assume that a single byte represents all characters
Please, I want to make this statement even clearer: C H A R A C T E R != C O D E P O I N T. Even in single-byte encodings - for example, windows-1255 is a single-byte encoding and may still represent a single character using 1, 2 or 3 bytes! Once again - when you work with strings you don't work with them as a series of characters; you work with text entities - text chunks.
and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
No, I would not, because I don't look at a string as a sequence of code points - by themselves they are meaningless. Code points are meaningful in terms of Unicode algorithms that know how to combine them. So if you want to handle text chunks you will have to use some Unicode-aware library.
It is just neither feasible nor necessary.
My code says it's perfectly feasible. ;-) Whether it's necessary or not is up to the individual developer, but the type-safety it offers is more in line with the design philosophy of C++ than using std::string for everything. I hate to harp on the same tired example, but why do you really need any pointer type other than void*? It's the same idea.
No, it isn't. A string is a text chunk. You can combine them, concatenate them, search for specific substrings or relate to ASCII characters (for example, like in HTML) and parse them, and this is perfectly doable within standard std::string regardless of whether it is UTF-8, Latin1 or another ISO-8859-* ASCII-compatible encoding. This is very different. Giving you a "utf-8" string or UTF-8 container would give you a false feeling that you're doing something right. Unicode is not about splitting strings into code points or iterating over them... It is a totally different thing. Artyom

From: Artyom Ok, let's think: what do you need iterators for? Accessing "characters"? If so, you are most likely doing something terribly wrong, as you ignore the fact that codepoint != character.
I would say such an iterator is wrong by design unless you develop a Unicode algorithm that relates to code points.
Now wouldn't it be nice if ascii_t (or whatever it's called) and utf*_t string classes had 3 kinds of iterators:
- storage iterator (char, wchar_t etc.),
- codepoint iterator,
- character iterator.
You could then reuse many existing algorithms to perform operations on a level that is sufficient in a given situation, like:
- bitwise copy: std::copy(utf8_1.storage_begin(), utf8_1.storage_end(), utf8_2.storage_begin())
- check if utf32 is a substring of utf8, codepoint-wise: std::search(utf8.codepoint_begin(), utf8.codepoint_end(), utf32.codepoint_begin(), utf32.codepoint_end())
- character-wise copy ascii_t to utf16_t, considering the codepage of the ascii object: utf16_t utf16(ascii.character_begin(), ascii.character_end())
- count codepoints: std::distance(utf8.codepoint_begin(), utf8.codepoint_end())
- count characters: std::distance(utf8.character_begin(), utf8.character_end())
- get the 5th codepoint: std::advance(utf8.codepoint_begin(), 5)
I don't know Unicode quirks enough to tell how useful this interface would be, but it seems interesting. What do you think? Best regards, Robert
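The code-point level of this proposal is straightforward to prototype. Below is a rough sketch of a forward iterator over a UTF-8 std::string (the class name is invented here, and it assumes well-formed input rather than validating), usable with std::distance as Robert suggests:

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Sketch of a code-point iterator over a UTF-8 std::string.
// Assumes the string is already well-formed UTF-8.
class cp_iterator {
    const std::string* s_;
    std::size_t pos_;
    static int seq_len(unsigned char b)
    {
        if (b < 0x80)           return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        return 4;
    }
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    cp_iterator(const std::string& s, std::size_t pos) : s_(&s), pos_(pos) {}
    cp_iterator& operator++() { pos_ += seq_len((*s_)[pos_]); return *this; }
    cp_iterator operator++(int) { cp_iterator t = *this; ++*this; return t; }
    char32_t operator*() const          // decode the current code point
    {
        unsigned char b0 = (*s_)[pos_];
        int n = seq_len(b0);
        if (n == 1) return b0;
        char32_t cp = b0 & (0xFF >> (n + 1));   // payload bits of the lead byte
        for (int i = 1; i < n; ++i)
            cp = (cp << 6) | ((*s_)[pos_ + i] & 0x3F);
        return cp;
    }
    bool operator==(const cp_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const cp_iterator& o) const { return pos_ != o.pos_; }
};
```

A character-level iterator would sit on top of this one, grouping code points by grapheme-cluster rules, which is where a full Unicode library earns its keep.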

On Wed, 19 Jan 2011 00:00:59 +0100 Robert Kawulak <robert.kawulak@gmail.com> wrote:
From: Artyom Ok, let's think: what do you need iterators for? Accessing "characters"? If so, you are most likely doing something terribly wrong, as you ignore the fact that codepoint != character.
I would say such an iterator is wrong by design unless you develop a Unicode algorithm that relates to code points.
Now wouldn't it be nice if ascii_t (or whatever it's called) and utf*_t string classes had 3 kinds of iterators:
- storage iterator (char, wchar_t etc.),
- codepoint iterator,
- character iterator.
The current iterators fall under the storage iterator category, but code-point iterators are easily possible. Character iterators may require help from a full-fledged Unicode library (I don't yet know whether there's a simple way to determine what code-points are combining ones, I doubt there is), but they should be doable too.
You could then reuse many existing algorithms to perform operations on a level that is sufficient in a given situation [...] I don't know Unicode quirks enough to tell how useful this interface would be, but it seems interesting.
And intriguing. When I get back to the Unicode string classes, I'll look into adding such iterators. -- Chad Nelson Oak Circle Software, Inc. * * *

On Tue, 18 Jan 2011 11:01:10 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
2. Reinvent standard library to use new string
Not entirely necessary, for the same reason that very few changes to the standard library are needed when you switch from char strings to char16_t strings to char32_t strings -- the standard library, designed around the idea of iterators, is mostly type-agnostic.
Ok... A few things:
1. UTF-32 is a waste of space - don't use it unless it is for something like handling code points (char32_t)
2. UTF-16 is too error-prone (See: UTF-16 considered harmful)
No argument with either assertion.
3. There is no special type char8_t distinct from char, so you can't use it.
That's why I wrote the utf8_t type. I'd have been quite happy to just use an std::basic_string<utf8_byte_t>, and I looked into the C++0x "opaque typedef" idea to see if it was possible. I couldn't find any elegant way to make it work, and the opaque typedef proposal was dropped from the spec, so I felt that I had to write the utf8_t class. However, I'm not sure what point you're trying to make with the above.
The utf*_t types provide fully functional iterators,
Ok, let's think: what do you need iterators for? Accessing "characters"? If so, you are most likely doing something terribly wrong, as you ignore the fact that codepoint != character.
In the current incarnation of the class, the iterators are for accessing the bytes, to make it trivially compatible with things like std::copy.
I would say such an iterator is wrong by design unless you develop a Unicode algorithm that relates to code points.
If that's needed (and it probably is), it's easy enough to add. It just wouldn't use the standard begin() and end() functions.
so they'll work fine with most library functions, so long as those functions don't care that some characters are encoded as multiple bytes. It's just the ones that assume that a single byte represents all characters that you have to replace, and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
Ok...
The paragraph above is inherently wrong
Oh?
first of all, let's clean a few things up:
that some characters are encoded as multiple bytes
Characters are not code points.
A semantic point. Correct, but irrelevant to the argument I was trying to make.
the ones that assume that a single byte represents all characters
Please, I want to make this statement even clearer
C H A R A C T E R != C O D E P O I N T
Even in single-byte encodings - for example, windows-1255 is a single-byte encoding and may still represent a single character using 1, 2 or 3 bytes!
std::copy, std::mismatch, std::equal, std::search, and several others would work equally well on UTF-8 strings. Functions that only allow you to specify a single element to work with, like std::find, would require a slightly different kind of iterator, one that operated on either characters or code-points. I don't see how that makes anything in the quoted paragraph inherently wrong.
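Why the subsequence algorithms Chad lists are safe on raw bytes: a well-formed UTF-8 needle can only match at a code-point boundary, because lead bytes and continuation bytes occupy disjoint ranges, so no false match can start mid-character. A small sketch (the helper name is mine, not from any library in this thread):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// std::search compares byte-for-byte, which is exactly what "same UTF-8
// substring" means -- no decoding needed to locate a multi-byte character.
std::size_t find_utf8_substring(const std::string& haystack,
                                const std::string& needle)
{
    auto it = std::search(haystack.begin(), haystack.end(),
                          needle.begin(), needle.end());
    return it == haystack.end()
        ? std::string::npos
        : static_cast<std::size_t>(it - haystack.begin());
}
```

std::find is the odd one out because it takes a single element, i.e. a single byte; naming a multi-byte character there requires the code-point iterator discussed above.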
Once again - when you work with strings you don't work with them as a series of characters; you work with text entities - text chunks.
That depends on what you're doing with them. If you're using them as translations for messages your program is sending out, then your statement is correct -- you treat them as opaque blobs. But if for instance you're parsing a file, you want tokens, which *are* merely an arbitrary series of characters. Or if your program allows the user to edit a file, you want something that gives you single characters, regardless of how many bytes or code-points they're encoded in.
and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
No, I would not, because I don't look at a string as a sequence of code points - by themselves they are meaningless.
Code points are meaningful in terms of Unicode algorithms that know how to combine them.
So if you want to handle text chunks you will have to use some Unicode aware library.
If you want to sort them, properly for the locale you're working with, you're correct. If you just want to write them out, or edit them, then barring things like messages in mixed left-to-right and right-to-left languages, it's fairly simple.
It is just neither feasible nor necessary.
My code says it's perfectly feasible. ;-) Whether it's necessary or not is up to the individual developer, but the type-safety it offers is more in line with the design philosophy of C++ than using std::string for everything. I hate to harp on the same tired example, but why do you really need any pointer type other than void*? It's the same idea.
No it isn't. A string is a text chunk.
You can combine them, concatenate them, search for specific substrings, or relate to ASCII characters (for example like in HTML) and parse them, and this is perfectly doable with a standard std::string, regardless of whether it is UTF-8, Latin1 or another ISO-8859-* ASCII-compatible encoding.
This is very different.
I'm trying to understand your point, but with no success so far. If you want something that gives you characters or code-points, then an std::string has no chance of working in any multi-byte encoding -- a UTF-whatever-specific type does.
Giving you a "utf-8" string or UTF-8 container would give you a false feeling that you are doing something right.
How?
Unicode is not about splitting a string into code points or iterating over them... It is a totally different thing.
I'm baffled by this statement. For doing anything interesting, Unicode or any other encoding *is* about iterating over characters (or code-points, if that's what you're looking for). Your point seems to be that the utf*_t classes are actively harmful in some way that I don't see, and using std::string somehow mitigates that by making you do more work. Or am I misunderstanding you? -- Chad Nelson Oak Circle Software, Inc. * * *

From: Chad Nelson <chad.thecomfychair@gmail.com> On Tue, 18 Jan 2011 11:01:10 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
2. Reinvent standard library to use new string
Not entirely necessary, for the same reason that very few changes to the standard library are needed when you switch from char strings to char16_t strings to char32_t strings -- the standard library, designed around the idea of iterators, is mostly type-agnostic.
OK... A few things:
1. UTF-32 is a waste of space - don't use it unless for something like handling individual code points (char32_t)
2. UTF-16 is too error-prone (see: "UTF-16 considered harmful")
No argument with either assertion.
3. There is no special type char8_t distinct from char, so you can't use it.
That's why I wrote the utf8_t type. I'd have been quite happy to just use an std::basic_string<utf8_byte_t>, and I looked into the C++0x "opaque typedef" idea to see if it was possible.
Even if opaque typedefs had been included in C++0x, it would still not be feasible to use them as strings. With a character type come many other goodies the standard library provides. For example, why does this work:

   std::basic_stringstream<wchar_t> ss;
   ss << 10.4;

and this does not:

   std::basic_stringstream<unsigned> ss;
   ss << 10.4;

This is not only because you have problems overloading << for unsigned both as a number and as a "character"; it is because when you try to write a double into the stream it calls:

   std::use_facet<std::num_put<unsigned> >(ss.getloc()).put(...)

And: 1. Such a facet is neither defined nor installed. 2. It may not even be possible to create one, because some facets are explicitly specialized for character types; for example the codecvt facet is specialized for char and wchar_t (and in C++0x char16_t, char32_t). I faced this problem when I tested char16_t/char32_t under gcc with a partial standard library implementation that hadn't specialized these classes, and I couldn't get many things working. This is a real problem. So it would just not work even if C++0x had opaque typedefs.
so they'll work fine with most library functions, so long as those functions don't care that some characters are encoded as multiple bytes. It's just the ones that assume that a single byte represents all characters that you have to replace, and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
Ok...
The paragraph above is inherently wrong
Oh?
I hope you are not offended, but I have just seen so many things go wrong because of such assumptions that I'm a little bit frustrated that such things come up again and again.
Once again - when you work with strings you don't work with them as a series of characters; you work with them as text entities - text chunks.
That depends on what you're doing with them. If you're using them as translations for messages your program is sending out, then your statement is correct -- you treat them as opaque blobs. But if for instance you're parsing a file, you want tokens, which *are* merely an arbitrary series of characters.
What are tokens constructed of? A series of characters, right; series of characters are represented as text chunks and they can be searched easily. In fact I have written some JSON and HTML parsers that are fully encoding- and UTF-8-aware without any need to access a specific code point. Note: I did validate the text - that it is valid UTF-8 - but that is a separate stage, and it does not require me to iterate over each code point.
Or if your program allows the user to edit a file, you want something that gives you single characters, regardless of how many bytes or code-points they're encoded in.
That is what great Unicode-aware toolkits like Qt, GtkMM and others do for you, with hundreds and thousands of lines of code. Of course you may use Boost.Locale, which provides character, word, sentence and line-break iterators over plain strings very well. See: http://cppcms.sourceforge.net/boost_locale/html/tutorial.html#8e296a067a3756...
I'm trying to understand your point, but with no success so far. If you want something that gives you characters or code-points, then an std::string has no chance of working in any multi-byte encoding -- a UTF-whatever-specific type does.
It works perfectly well. However, for text analysis you either:
1. Use a library like Boost.Locale
2. Restrict yourself to the ASCII subset of the text, which allows handling 99% of the various formats out there - you don't need code-point iterators for this.
Your point seems to be that the utf*_t classes are actively harmful in some way that I don't see, and using std::string somehow mitigates that by making you do more work. Or am I misunderstanding you?
My statement is the following:
- utf*_t would not give any real added value - that is what I was trying to show, and if you want to iterate over code points you can do it with an external iterator over std::string perfectly well. But in most cases you don't want to iterate over code points but rather over characters, words and other text entities, and code points would not help you with this.
- utf*_t would create trouble, as it would require constant conversions between utf*_t types and 1001 other libraries. And what is even more important, it can't be simply integrated into the existing C++ string framework.
- All I suggest is: when you work on Windows, don't use ANSI encodings; assume that std::string is UTF-8 encoded and convert it to UTF-16 on system call boundaries.
Basically - don't reinvent things, try to make current code work well. It has some design flaws, but overall C++/STL is totally fine for Unicode handling; it needs some things improved, but providing utf*_t classes is not the way to go.
This is **My** point of view.
Best Regards, Artyom

On Wed, 19 Jan 2011 00:44:39 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
From: Chad Nelson <chad.thecomfychair@gmail.com>
3. There is no special type char8_t distinct from char, so you can't use it.
That's why I wrote the utf8_t type. I'd have been quite happy to just use an std::basic_string<utf8_byte_t>, and I looked into the C++0x "opaque typedef" idea to see if it was possible.
Even if opaque typedefs were included in C++0x, it would still not be feasible to use them as strings. [...] I faced this problem when I tested char16_t/char32_t under gcc with a partial standard library implementation that hadn't specialized these classes, and I couldn't get many things working.
This is real problem.
So it would just not work even if C++0x had opaque typedefs.
I think there would have been ways around the problem. For the example you quoted, the most logical solution would probably be to just use basic_stringstream<wchar_t> and convert the string afterward. Not a very satisfying solution, but it would have worked. In any case, the point is moot, since opaque typedefs won't be in C++0x.
Ok...
The paragraph above is inherently wrong
Oh?
I hope you are not offended, but I have just seen so many things go wrong because of such assumptions that I'm a little bit frustrated that such things come up again and again.
'Fraid they'll continue to come up, because there are always new developers and there isn't a lot of information on the subject available where a developer would stumble into it by accident. Having a set of UTF string types with three different kinds of iterators would at least make some C++ programmers realize that the problem exists, when they wouldn't have before.
Or if your program allows the user to edit a file, you want something that gives you single characters, regardless of how many bytes or code-points they're encoded in.
That is what great Unicode aware toolkits like Qt, GtkMM and others with hundreds and thousands of lines of code do for you. [...]
Which is great, if you happen to be using a Qt-based or Gtk-based interface in your program, but useless if you're not. I'd prefer a solution that's not tied to monolithic libraries that try to deliver everything and the kitchen sink.
I'm trying to understand your point, but with no success so far. If you want something that gives you characters or code-points, then an std::string has no chance of working in any multi-byte encoding -- a UTF-whatever-specific type does.
It works perfectly well. However, for text analysis you either:
1. Use a library like Boost.Locale
2. Restrict yourself to the ASCII subset of the text, which allows handling 99% of the various formats out there - you don't need code-point iterators for this.
Why would you want to do either of those, when something like a utf8_t class could make the Boost.Locale interface easier and more intuitively obvious to use, and eliminate the ASCII restriction too?
Your point seems to be that the utf*_t classes are actively harmful in some way that I don't see, and using std::string somehow mitigates that by making you do more work. Or am I misunderstanding you?
My statement is following:
- utf*_t would not give any real added value - that is what I was trying to show, and if you want to iterate over code points you can do it with an external iterator over std::string perfectly well.
You can also handle strings perfectly well the C way, with manually allocated memory, strcpy, strlen, and the like. But you still see the benefits of using an std::string class.
But in most cases you don't want to iterate over code points but rather over characters, words and other text entities, and code points would not help you with this.
An explicit *character* iterator, over a UTF-type, would solve that problem.
- utf*_t would create trouble, as it would require constant conversions between utf*_t types and 1001 other libraries.
And what is even more important, it can't be simply integrated into the existing C++ string framework.
Oh? :-) The way I'm envisioning it, you could do something like this...

   utf8_t foo = some_utf8_text;
   cout << *native_t(foo);

...to send a string to stdout. It would be automatically transcoded to the system's current code page (probably using Boost.Locale) if the code page isn't already UTF-8, and the asterisk would provide an std::string in that type. Though of course, the utf*_t classes would be provided with an output operator of their own that would take care of that for you, so you wouldn't have to. If you needed to interface with a Windows API function...

   utf16_t bar = foo; // Automatic conversion
   DrawTextW(dc, bar->c_str(), bar->length(), rect, flags);

...that would do the trick, and would probably get buried in a library function of some sort that takes a utf16_t type. If you fed it an std::string, std::wstring, or utf32_t type, it would be automatically converted when the function is called. And if you fed it a utf16_t, of course, no conversion would be done, it would be used as-is. So while you might have to do some conversion to other string types to interface with different existing libraries (like Qt), the process is very simple and can probably be automated. *If* you decided to use the utf*_t types at all. And as I've said before, you can simply use std::string for any functions that are encoding-agnostic.
- All I suggest is: when you work on Windows, don't use ANSI encodings; assume that std::string is UTF-8 encoded and convert it to UTF-16 on system call boundaries.
Assumptions like that will cause problems for existing codebases, which are probably using std::strings in ways that would break. Can something as widely used as Boost afford to make a breaking change like that? On the other hand, with a set of UTF types, you could provide two overloads, one that blindly operates on std::strings as it does now, and one that works on the most convenient UTF form, which would automatically provide some guarantees about the content (such as, that it's valid). If, of course, the function you're using cares about the encoding. As you pointed out, most won't, and can be left using std::string with no problem. And if you, the function's author, want to move away from the std::string form, you just mark it deprecated and leave it there with a warning about when it will go away. The company using the library can make its own decision about whether to upgrade beyond that point or not. I don't foresee many authors with a need for that kind of thing, but for those that do, it would be nice if it were there.
Basically - don't reinvent things, try to make current code work well. It has some design flaws, but overall C++/STL is totally fine for Unicode handling; it needs some things improved, but providing utf*_t classes is not the way to go.
This is **My** point of view.
Thanks for making it clear. I have to disagree though. Most programmers don't want to delve into Unicode and learn about the intricacies of code-points and the like. They just want to use it. The UTF string types should let them do so, in most cases, with a much gentler learning curve than using ICU (or even Boost.Locale) directly. -- Chad Nelson Oak Circle Software, Inc. * * *

At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
But they won't be. That's not today's reality.
It's a bit like defining a separate integer type for nonnegative ints for type safety reasons - useful in theory, but nobody does it.
I refer you to Boost.Units
If you're designing an interface that takes UTF-8 strings,
...as we are...
it still may be worth it to have the parameters be of a utf8-specific type, if you want to force your users to think about the encoding of the argument each time they call one of your functions...
Or, you may want to use a UTF-8-specific type to force users of legacy char* interfaces (and ourselves) to think about decoding each time they call a legacy char* interface.
this is a legitimate design decision. If you're in control of the whole program, though, it's usually not worth it - you just keep everything in UTF-8.
By definition, since we're library designers, we don't have said control. And people *will* be using whatever Boost does with "legacy" non-UTF-8 interfaces. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Tue, 18 Jan 2011 08:48:59 -0500, Dave Abrahams wrote:
At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
But they won't be. That's not today's reality.
It's a bit like defining a separate integer type for nonnegative ints for type safety reasons - useful in theory, but nobody does it.
I refer you to Boost.Units
If you're designing an interface that takes UTF-8 strings,
...as we are...
it still may be worth it to have the parameters be of a utf8-specific type, if you want to force your users to think about the encoding of the argument each time they call one of your functions...
Or, you may want to use a UTF-8-specific type to force users of legacy char* interfaces (and ourselves) to think about decoding each time they call a legacy char* interface.
this is a legitimate design decision. If you're in control of the whole program, though, it's usually not worth it - you just keep everything in UTF-8.
By definition, since we're library designers, we don't have said control. And people *will* be using whatever Boost does with "legacy" non-UTF-8 interfaces.
+1 for every point. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Dave Abrahams wrote:
At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference between taking/returning a string and taking/returning an utf8_t is to force people to write an explicit conversion. This penalizes people who are already in UTF-8 land because it forces them to use utf8_t( s, encoding_utf8 ) and s.c_str( encoding_utf8 ) everywhere, without any gain or need. It's true that for people whose strings are not UTF-8, forcing those explicit conversions may be considered a good thing. So it depends on what your goals are. Do you want to promote the use of UTF-8 for all strings, or do you want to enable people to remain in non-UTF-8-land?
It's a bit like defining a separate integer type for nonnegative ints for type safety reasons - useful in theory, but nobody does it.
I refer you to Boost.Units
I'm sure that there are many libraries that use units in their interfaces, I just haven't heard of them. :-) There's also the additional consideration of utf8_t's invariant. Does it require valid UTF-8? One possible specification of fopen might be:

   FILE* fopen( char const* name, char const* mode );

The 'name' argument must be UTF-8 on Unicode-aware platforms and file systems such as Windows/NTFS and Mac OS X/HFS+. It can be an arbitrary byte sequence on encoding-agnostic platforms and file systems such as Linux and Solaris, but UTF-8 is recommended. On Windows, the UTF-8 sequence may be invalid due to the presence of UTF-16 surrogates encoded as single code points, but such use is discouraged.

On Tue, Jan 18, 2011 at 6:46 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Dave Abrahams wrote:
At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference between taking/returning a string and taking/returning an utf8_t is to force people to write an explicit conversion. This penalizes people who are already in UTF-8 land because it forces them to use utf8_t( s, encoding_utf8 ) and s.c_str( encoding_utf8 ) everywhere, without any gain or need. It's true that for people whose strings are not UTF-8, forcing those explicit conversions may be considered a good thing. So it depends on what your goals are. Do you want to promote the use of UTF-8 for all strings, or do you want to enable people to remain in non-UTF-8-land?
+1

Boost, as a cutting-edge C++ library, should try to enforce new standards and not dwell on old and obsolete ones. Today everybody is (maybe slowly) moving towards UTF-8, and creating a new string class/wrapper for UTF-8 that nobody uses, IMO, encourages the usage of the old ANSI encodings. Maybe a better course of action would be to create an ansi_str_t with encoding tags for the legacy ANSI-encoded strings, which could be obsoleted in the future, and use std::string as the default class for UTF-8 strings. We will have to do this transition anyway at some point, so why not do it now.

my 0.02€ regards Matus

At Tue, 18 Jan 2011 19:27:08 +0100, Matus Chochlik wrote:
On Tue, Jan 18, 2011 at 6:46 PM, Peter Dimov <pdimov@pdimov.com> wrote:
Dave Abrahams wrote:
At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference between taking/returning a string and taking/returning an utf8_t is to force people to write an explicit conversion. This penalizes people who are already in UTF-8 land because it forces them to use utf8_t( s, encoding_utf8 ) and s.c_str( encoding_utf8 ) everywhere, without any gain or need. It's true that for people whose strings are not UTF-8, forcing those explicit conversions may be considered a good thing. So it depends on what your goals are. Do you want to promote the use of UTF-8 for all strings, or do you want to enable people to remain in non-UTF-8-land?
+1
Boost, as a cutting-edge C++ library, should try to enforce new standards and not dwell on old and obsolete ones. Today everybody is (maybe slowly) moving towards UTF-8, and creating a new string class/wrapper for UTF-8 that nobody uses, IMO, encourages the usage of the old ANSI encodings.
Maybe a better course of action would be to create an ansi_str_t with encoding tags for the legacy ANSI-encoded strings, which could be obsoleted in the future, and use std::string as the default class for UTF-8 strings. We will have to do this transition anyway at some point, so why not do it now.
Now that's an interesting thought! -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Matus Chochlik wrote:
On Tue, Jan 18, 2011 at 6:46 PM, Peter Dimov <pdimov@pdimov.com>
Boost, as the cutting edge C++ library should try to enforce new standards and not dwell on old and obsolete ones.
A Boost library can't just make a change which makes it obsolete for those already using it. They are often built into large, real applications which can't constantly revisit every issue every release. Users have to know that using a Boost library will save them effort, not burden them with a new maintenance task.
Today everybody is (maybe slowly) moving towards UTF-8
It wasn't that long ago that "everybody" was moving to wchar/wstring to support unicode. And a lot of people did. You can't know the future and you can't impose your view of it on everyone else.
and creating a new string class/wrapper for UTF-8 that nobody uses,
lol - well no one is going to use it until it exists.
IMO, encourages the usage of the old ANSI encodings.
I'm not seeing this at all.
Maybe a better course of action would be to create ansi_str_t with the encoding tags for the legacy ANSI-encoded strings, which could be obsoleted in the future,
obsoleted by whom?
and use std::string as the default class for UTF-8 strings.
Thereby breaking millions (billions?) of lines of currently working programs
We will have to do this transition anyway at one point,
One can't know that
so why not do it now.
I confess I haven't followed this discussion in all its detail, so please bear with me if I'm repeating something someone said or have missed something obvious. To my way of thinking, the way std::string is used is often equivalent to vector<char>. It has extra sauce, but it's not all that much about manipulating text as it is about manipulating a string of bytes (named characters). So what's wrong with something like the following:

   struct utf8string : public std::string {
       struct iterator {
           const char * operator++(); // move to next code point
           utf8char operator*();      // return next utf8 char
           // ...
       };
       // maybe some other stuff - e.g. trap non-sensical operations
   };

and while you're at it

   struct ascii_string : public std::string {
       std::locale m_l;
       ascii_string & operator+=(char c) { assert(c < 128); }
       // etc.
   };

   struct jis_string : public std::string {
       // etc.
   };

and while you're at it, if you've got nothing else to do

   struct ebcdc_string : public std::string {
       ebcdc_string & operator+=(char c) { assert(c < 128); }
       // etc.
   };

Just a thought.

Robert Ramey

On Tue, Jan 18, 2011 at 8:03 PM, Robert Ramey <ramey@rrsd.com> wrote:
Matus Chochlik wrote:
On Tue, Jan 18, 2011 at 6:46 PM, Peter Dimov <pdimov@pdimov.com>
Boost, as the cutting edge C++ library should try to enforce new standards and not dwell on old and obsolete ones.
A Boost library can't just make a change which makes it obsolete for those already using it. They are often built into large, real applications which can't constantly revisit every issue every release. Users have to know that using a Boost library will save them effort, not burden them with a new maintenance task.
I did not mean to say that we just declare "from now on use std::string for utf-8" and that is all. I am aware that this would require some work to ensure as much backward compatibility as possible, or would even create interface-breaking changes.
Today everybody is (maybe slowly) moving towards UTF-8
It wasn't that long ago that "everybody" was moving to wchar/wstring to support unicode. And a lot of people did. You can't know the future and you can't impose your view of it on everyone else.
Yes but they never abandoned the ANSI encodings. This is why nearly every big C++ library has its XYstring class that uses ifdefs to switch between string/wstring.
and creating a new string class/wrapper for UTF-8 that nobody uses,
lol - well no one is going to use it until it exists.
Is it necessary to explain that I did not mean it that way? What I meant was that we can hardly expect that everybody will adopt utf8_t when Boost introduces it. As a consequence everybody will remain with std::string and ANSI encodings.
IMO, encourages the usage of the old ANSI encodings.
I'm not seeing this at all.
Maybe a better course of action would be to create ansi_str_t with the encoding tags for the legacy ANSI-encoded strings, which could be obsoleted in the future,
obsoleted by whom?
By its authors. The ansi_str_t would serve as a temporary buffer before everybody switches to utf-8.
and use std::string as the default class for UTF-8 strings.
Thereby breaking millions (billions?) of lines of currently working programs
As a few people (who know a lot more about Unicode than I do) pointed out, it will not be that tragic (again, I do *not* claim that this change involves no work).
We will have to do this transition anyway at one point,
One can't know that
Well, the whole string-encoding-related mess has to be resolved, and to me it seems that UTF-8 is the candidate that will do it - not because somebody says so but because it is already happening. Just look at the Web and at the new releases of the major database systems (I know this is not the whole IT sector, but it is a relevant part of it, and many more examples could be found).
so why not do it now.
I confess I haven't followed this discussion in all it's detail, so please bear with me if I'm repeating something someone said or have missed something obvious.
To my way of thinking, the way std::string is used is often equivalent to vector<char>. It has extra sauce, but it's not all that much about manipulating text as it is about manipulating a string of bytes (named characters). So what's wrong with something like the following:
struct utf8string : public std::string {
    struct iterator {
        const char * operator++(); // move to next code point
        utf8char operator*();      // return next utf8 char
        // ...
    };
    // maybe some other stuff - e.g. trap non-sensical operations
};

and while you're at it

struct ascii_string : public std::string {
    std::locale m_l;
    ascii_string & operator+=(char c) { assert(c < 128); }
    // etc.
};

struct jis_string : public std::string {
    // etc.
};

and while you're at it, if you've got nothing else to do

struct ebcdc_string : public std::string {
    ebcdc_string & operator+=(char c) { assert(c < 128); }
    // etc.
};
Just a thought.
That way, instead of the currently used 2 string classes, you'll end up with N string classes. That thought is not very appealing to me. BR Matus

struct utf8string : public std::string {
    struct iterator {
        const char * operator++(); // move to next code point
        utf8char operator*();      // return next utf8 char
        // ...
    };
    // maybe some other stuff - e.g. trap non-sensical operations
};

and while you're at it

struct ascii_string : public std::string {
    std::locale m_l;
    ascii_string & operator+=(char c) { assert(c < 128); }
    // etc.
};

struct jis_string : public std::string {
    // etc.
};

and while you're at it, if you've got nothing else to do

struct ebcdc_string : public std::string {
    ebcdc_string & operator+=(char c) { assert(c < 128); }
    // etc.
};
Just a thought.
That way, instead of the currently used 2 string classes, you'll end up with N string classes. That thought is not very appealing to me.
I don't think that's a fair statement. The above only has 4, and that's including EBCDIC. Sorry, I don't get the "2" above. In any case, one could start with just utf8_string and ansi_string (should be simple), put it into Boost and see how many people use it. If it's truly an improvement, usage of std::string would atrophy to the point of being irrelevant. If there are still reasons for using std::string directly, then it wouldn't, but no harm would be done. This has all the upside and none of the downside. If this were made,

On Tue, Jan 18, 2011 at 8:53 PM, Robert Ramey <ramey@rrsd.com> wrote:
Just a thought.
That way, instead of the currently used 2 string classes, you'll end up with N string classes. That thought is not very appealing to me.
I don't think that's a fair statement. The above only has 4 and that's including EBCDIC.
But those four are not the only widespread encoding schemes, what about KOI8, CPXYZ, etc.
Sorry, I don't get the "2" above.
I meant std::string and std::wstring which contains one string class too many IMO.
In any case, one could start with just utf8_string and ansi_string (should be simple), put it into Boost and see how many people use it. If it's truly an improvement, usage of std::string would atrophy to the point of being irrelevant. If there are still reasons for using std::string directly, then it wouldn't, but no harm would be done. This has all the upside and none of the downside.
If this were made,
One of the downsides is that C++ would be abandoning the nice name 'string' for the ugly 'utf8_t' or whatever.

On Wed, 19 Jan 2011 10:26:05 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
That way, instead of the currently used 2 string classes, you'll end up with N string classes. That thought is not very appealing to me.
I don't think that's a fair statement. The above only has 4 and that's including EBCDIC.
But those four are not the only widespread encoding schemes, what about KOI8, CPXYZ, etc.
There wouldn't be any need for special string types for them. They would be represented by native_t if the system is set to use them, and std::string types would just be assumed to be coded in that form.
In any case, one could start with just utf8_string and ansi_string (should be simple), put it into boost and see how many people use it. If it's truly an improvement, usage of std::string would atrophy to the point of being irrelevant. If there are still reasons for using std::string directly, then it wouldn't, but no harm would be done. This has all the upside and none of the downside.
If this were made,
One of the downsides is that C++ would be abandoning a nice name, 'string', for an ugly 'utf8_t' or whatever.
Believe it or not, you'd get used to it. :-) I thought wchar_t was the height of ugliness when I first saw it, but it seems perfectly acceptable now, even attractively descriptive. -- Chad Nelson Oak Circle Software, Inc. * * *

On Wed, Jan 19, 2011 at 2:07 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Wed, 19 Jan 2011 10:26:05 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
But those four are not the only widespread encoding schemes, what about KOI8, CPXYZ, etc.
There wouldn't be any need for special string types for them. They would be represented by native_t if the system is set to use them, and std::string types would just be assumed to be coded in that form.
Yes, my point was that there is no need to create ascii_string, jis_string and ebcdic_string in the first place, but to handle the conversion during the initialization of the-one-and-only-string-type-we-decide-to-use. :)
In any case, one could start with just utf8_string and ansi_string (should be simple), put it into boost and see how many people use it. If it's truly an improvement, usage of std::string would atrophy to the point of being irrelevant. If there are still reasons for using std::string directly, then it wouldn't, but no harm would be done. This has all the upside and none of the downside.
If this were made,
One of the downsides is that C++ would be abandoning a nice name, 'string', for an ugly 'utf8_t' or whatever.
Believe it or not, you'd get used to it. :-) I thought wchar_t was the height of ugliness when I first saw it, but it seems perfectly acceptable now, even attractively descriptive.
Yes, probably I would. But try to imagine that you are a novice deciding which language to learn. Would you pick a language that has three standard string-related classes (provided utf8_t becomes standard), not to mention the dozens of classes implemented by various libraries? Best, Matus

On Wed, 19 Jan 2011 14:23:25 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 19, 2011 at 2:07 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
There wouldn't be any need for special string types for them. They would be represented by native_t if the system is set to use them, and std::string types would just be assumed to be coded in that form.
Yes, my point was that there is no need to create ascii_string, jis_string and ebcdic_string in the first place, but to handle the conversion during the initialization of the-one-and-only-string-type-we-decide-to-use. :)
You'll get no argument against that from me. :-) But at least two utf*_t types will be useful even after that conversion: utf8_t (because it will automatically catch any encoding problems, including malicious ones), and utf16_t (because the Windows API requires its data in that form).
One of the downsides is that C++ would be abandoning a nice name, 'string', for an ugly 'utf8_t' or whatever.
Believe it or not, you'd get used to it. :-) I thought wchar_t was the height of ugliness when I first saw it, but it seems perfectly acceptable now, even attractively descriptive.
Yes, probably I would. But try to imagine that you are a novice deciding which language to learn. Would you pick a language that has three standard string-related classes (provided utf8_t becomes standard), not to mention the dozens of classes implemented by various libraries?
A novice isn't likely to pick C++ regardless. My nine-year-old nephew wanted to learn programming recently, and despite my own enthusiasm for C++, I had to recommend he start with Python -- he's plenty smart enough to dive directly into C++, but the learning curve would be a lot steeper, and the string-type problem is only one of the issues. -- Chad Nelson Oak Circle Software, Inc. * * *
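The "utf8_t automatically catches encoding problems" invariant mentioned earlier in this exchange can be sketched in a few lines. This is only an illustration, not Chad's actual class: the names are made up here, and the validator below checks sequence structure only (lead bytes and continuation bytes), not overlong forms or encoded surrogates, which a production validator would also reject.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Structural UTF-8 validation: every lead byte must be followed by the
// right number of 10xxxxxx continuation bytes. (Simplified: overlong
// forms and surrogate code points are not rejected.)
static bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = s[i];
        std::size_t extra;
        if (c < 0x80)            extra = 0;   // ASCII
        else if ((c >> 5) == 6)  extra = 1;   // 110xxxxx
        else if ((c >> 4) == 14) extra = 2;   // 1110xxxx
        else if ((c >> 3) == 30) extra = 3;   // 11110xxx
        else return false;                    // stray continuation or invalid lead
        if (i + extra >= s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) >> 6) != 2) return false;
        i += extra + 1;
    }
    return true;
}

// The invariant lives in the constructor: a utf8_t either holds valid
// UTF-8 or never comes into existence.
class utf8_t {
    std::string data_;
public:
    explicit utf8_t(const std::string& s) : data_(s) {
        if (!is_valid_utf8(data_))
            throw std::invalid_argument("utf8_t: invalid UTF-8");
    }
    const std::string& str() const { return data_; }  // encoding-agnostic access
};
```

With this shape, any function taking a utf8_t parameter can assume validity without re-checking, which is the type-safety argument made elsewhere in the thread.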

On Tue, Jan 18, 2011 at 2:18 PM, Matus Chochlik <chochlik@gmail.com> wrote:
and creating a new string class/wrapper for UTF-8 that nobody uses,
lol - well no one is going to use it until it exists.
Is it necessary to explain that I did not mean it that way? What I meant was that we can hardly expect that everybody will adopt utf8_t when Boost introduces it. As a consequence, everybody will remain with std::string and ANSI encodings.
I think maybe you underestimate our influence. It won't be immediate, but I believe we *could* produce the new lingua franca and get it widely-adopted. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

From: Dave Abrahams <dave@boostpro.com>
On Tue, Jan 18, 2011 at 2:18 PM, Matus Chochlik <chochlik@gmail.com> wrote:
and creating a new string class/wrapper for UTF-8 that nobody uses,
lol - well no one is going to use it until it exists.
Is it necessary to explain that I did not mean it that way? What I meant was that we can hardly expect that everybody will adopt utf8_t when Boost introduces it. As a consequence, everybody will remain with std::string and ANSI encodings.
I think maybe you underestimate our influence. It won't be immediate, but I believe we *could* produce the new lingua franca and get it widely-adopted.
I think you overestimate Boost's power a little bit :-) There are lots of "Unicode" strings: QString, gtkmm::ustring, icu::UnicodeString, wxString. But in the end the only string that exists everywhere is std::string. So maybe we should just stick with it and work on how we use it. Artyom

On Tue, Jan 18, 2011 at 8:57 PM, Dave Abrahams <dave@boostpro.com> wrote:
On Tue, Jan 18, 2011 at 2:18 PM, Matus Chochlik <chochlik@gmail.com> wrote:
and creating a new string class/wrapper for UTF-8 that nobody uses,
lol - well no one is going to use it until it exists.
Is it necessary to explain that I did not mean it that way? What I meant was that we can hardly expect that everybody will adopt utf8_t when Boost introduces it. As a consequence, everybody will remain with std::string and ANSI encodings.
I think maybe you underestimate our influence. It won't be immediate, but I believe we *could* produce the new lingua franca and get it widely-adopted.
I didn't mean to say that Boost does not have the influence to do that. It easily could. I think that it should instead use its influence in the C++ world to help phase out the ANSI encodings in favor of UTF-8, without sacrificing a 'flagship' class like std::string and introducing a new one into the already crowded club of string-handling classes. Regards Matus

...elision by patrick...
On 01/18/2011 10:27 AM, Matus Chochlik wrote:
Maybe a better course of action would be to create ansi_str_t with the encoding tags for the legacy ANSI-encoded strings, which could be obsoleted in the future, and use std::string as the default class for UTF-8 strings. We will have to do this transition anyway at some point, so why not do it now.
First, how annoying that the text mode on windows is called ANSI. It has nothing to do with ANSI. Second, I think you forget that it's a big world with a large number of single-byte and multibyte encodings that will be in strings. It's just self-defense. If someone gives you something in a utf-8 string type, you can make _some_ assumption that, absent error, it's supposed to be that encoding. Other than that you can't. If a std::string _can_ be many different things, then a std::string _will_ be many different things. Partitioning the space of things it can be and dealing with each of them correctly is a good thing, I think. Patrick

[Patrick Horgan]
First, how annoying that that text mode on windows is called ANSI. It has nothing to do with ANSI.
http://msdn.microsoft.com/en-us/goglobal/bb964658.aspx
ANSI: Acronym for the American National Standards Institute. The term "ANSI" as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft-which became International Organization for Standardization (ISO) Standard 8859-1. "ANSI applications" are usually a reference to non-Unicode or code page-based applications.
STL

----- Original Message ----
From: Peter Dimov <pdimov@pdimov.com>
Dave Abrahams wrote:
At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference between taking/returning a string and taking/returning an utf8_t is to force people to write an explicit conversion. This penalizes people who are already in UTF-8 land because it forces them to use utf8_t( s, encoding_utf8 ) and s.c_str( encoding_utf8 ) everywhere, without any gain or need. It's true that for people whose strings are not UTF-8, forcing those explicit conversions may be considered a good thing. So it depends on what your goals are. Do you want to promote the use of UTF-8 for all strings, or do you want to enable people to remain in non-UTF-8-land?
+1
There's also the additional consideration of utf8_t's invariant. Does it require valid UTF-8? One possible specification of fopen might be:
FILE* fopen( char const* name, char const* mode );
The 'name' argument must be UTF-8 on Unicode-aware platforms and file systems such as Windows/NTFS and Mac OS X/HFS+. It can be an arbitrary byte sequence on encoding-agnostic platforms and file systems such as Linux and Solaris, but UTF-8 is recommended.
+1 as well. Also, I would like to add a small note about the general design of C++ as a language: don't pay for what you don't need. And 95% of all uses of strings are encoding-agnostic! Artyom
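For readers who want to see concretely what Peter's hypothetical Unicode-aware fopen would have to do on Windows, the key step is converting the UTF-8 'name' to UTF-16 before delegating to _wfopen. Below is a minimal, hand-rolled sketch of that conversion step, for illustration only: unlike a production converter, it does not reject malformed input (overlong forms, truncated sequences, lone surrogates).

```cpp
#include <cstdint>
#include <string>

// Minimal UTF-8 -> UTF-16 conversion, the step a Unicode-aware fopen
// on Windows would perform before calling _wfopen. Illustrative only:
// malformed input is not diagnosed.
std::u16string utf8_to_utf16(const std::string& s) {
    std::u16string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i++]);
        std::uint32_t cp;
        std::size_t extra;
        if (c < 0x80)            { cp = c;        extra = 0; }  // ASCII
        else if ((c >> 5) == 6)  { cp = c & 0x1F; extra = 1; }  // 110xxxxx
        else if ((c >> 4) == 14) { cp = c & 0x0F; extra = 2; }  // 1110xxxx
        else                     { cp = c & 0x07; extra = 3; }  // 11110xxx
        for (; extra > 0 && i < s.size(); --extra)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                       // outside the BMP: surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows one would normally use MultiByteToWideChar(CP_UTF8, ...) instead of rolling this by hand; the point here is only to show that the conversion is cheap and mechanical, so the "UTF-8 names everywhere" specification is implementable.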

On Tue, 18 Jan 2011 19:46:41 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
There's also the additional consideration of utf8_t's invariant. Does it require valid UTF-8? One possible specification of fopen might be:
FILE* fopen( char const* name, char const* mode );
The 'name' argument must be UTF-8 on Unicode-aware platforms and file systems such as Windows/NTFS and Mac OS X/HFS+. It can be an arbitrary byte sequence on encoding-agnostic platforms and file systems such as Linux and Solaris, but UTF-8 is recommended.
On Windows, the UTF-8 sequence may be invalid due to the presence of UTF-16 surrogates encoded as single code points, but such use is discouraged.
Are you saying this is how it should be or this is how it is? Because, on Windows, 'name' certainly can't be UTF-8! The implementation takes 'name' to be in the default local codepage, uses mbstowchar to up-convert it to a UCS2 wchar_t string and delegates it to _wfopen (or similar - I'm doing this from memory). The up-conversion will turn multi-byte UTF-8 chars into gibberish. For example, fopen with 'name' being "שלום-سلام-peace-Мир.txt" creates a file called "שלו×-سلام-peace-Мир" Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Alexander Lamaison wrote:
On Tue, 18 Jan 2011 19:46:41 +0200, Peter Dimov wrote: ...
FILE* fopen( char const* name, char const* mode );
The 'name' argument must be UTF-8 on Unicode-aware platforms and file [...]
Are you saying this is how it should be or this is how it is?
This is one possible and very reasonable specification of a "how it should be" fopen.

Alexander Lamaison wrote: [about fopen]
Are you saying this is how it should be or this is how it is? Because, on Windows, 'name' certainly can't be UTF-8! The implementation takes 'name' to be in the default local codepage, uses mbstowchar to up-convert it to a UCS2 wchar_t string and delegates it to _wfopen (or similar - I'm doing this from memory).
For what it's worth, it doesn't, it calls CreateFileA. This may or may not produce the same result as what you describe, depending on whether the current C locale matches the ANSI code page.

On Wed, 19 Jan 2011 18:52:19 +0200, Peter Dimov wrote:
Alexander Lamaison wrote:
[about fopen]
Are you saying this is how it should be or this is how it is? Because, on Windows, 'name' certainly can't be UTF-8! The implementation takes 'name' to be in the default local codepage, uses mbstowchar to up-convert it to a UCS2 wchar_t string and delegates it to _wfopen (or similar - I'm doing this from memory).
For what it's worth, it doesn't, it calls CreateFileA. This may or may not produce the same result as what you describe, depending on whether the current C locale matches the ANSI code page.
You're quite right. I was mixing up with fstream. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

At Tue, 18 Jan 2011 19:46:41 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference between taking/returning a string and taking/returning an utf8_t is to force people to write an explicit conversion. This penalizes people who are already in UTF-8 land because it forces them to use utf8_t( s, encoding_utf8 ) and s.c_str( encoding_utf8 ) everywhere, without any gain or need. It's true that for people whose strings are not UTF-8, forcing those explicit conversions may be considered a good thing. So it depends on what your goals are. Do you want to promote the use of UTF-8 for all strings, or do you want to enable people to remain in non-UTF-8-land?
Oh, I get it. Nevermind :-) -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Tue, Jan 18, 2011 at 1:39 PM, Dave Abrahams <dave@boostpro.com> wrote:
At Tue, 18 Jan 2011 19:46:41 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
At Tue, 18 Jan 2011 13:27:29 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference between taking/returning a string and taking/returning an utf8_t is to force people to write an explicit conversion. This penalizes people who are already in UTF-8 land because it forces them to use utf8_t( s, encoding_utf8 ) and s.c_str( encoding_utf8 ) everywhere, without any gain or need. It's true that for people whose strings are not UTF-8, forcing those explicit conversions may be considered a good thing. So it depends on what your goals are. Do you want to promote the use of UTF-8 for all strings, or do you want to enable people to remain in non-UTF-8-land?
Oh, I get it. Nevermind :-)
On second thought... There are two ways this could go AFAICS: 1. We just use std::string for UTF-8 and eventually the whole world will catch up 2. We establish some other type for UTF-8 and *it* becomes the lingua franca Aren't things still enough of a mess out there that #2 is just as likely to work well? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

There are two ways this could go AFAICS:
1. We just use std::string for UTF-8 and eventually the whole world will catch up
This would be nice.
2. We establish some other type for UTF-8 and *it* becomes the lingua franca
If Boost abandons std::string in interfaces that expect UTF-8, does that mean I as a user need to sprinkle boost::to_utf_8(my_std_string,...) // in whatever form to_utf8 may be, all over my/our (quite gigantic) code base? Without doing so, I assume, will cause compilation errors, but for what gain? If some code was broken before, it will remain so after I've injected all those to_utf8 calls as well. To solve actual problems I need to track the origin of my std::string's content, which requires a traditional bug-hunting session anyway. No additional typed interface in the world will help me here IMO.
Aren't things still enough of a mess out there that #2 is just as likely to work well?
"Just as likely to work well" doesn't sound good enough for me, from a maintenance point of view. I can picture how the changeset looks on the poor branch that decides to upgrade to such a version of boost. The problem isn't the type, but the content. There are algorithms in the STL that have requirements on their input (sorted, usually); why is this different? I'm sure it wouldn't be supported with the introduction of a sorted_value_input_iterator that I can pass to the std::set_xxx functions. (?) What would be helpful, if doable, is to build boost with BOOST_TRACK_INVALID_UTF_8, also for release builds. This would cause an exception or a call to a user-defined function if boost code stumbles upon bad strings. - Christian

At Tue, 18 Jan 2011 14:50:51 -0600, Christian Holmquist wrote:
There are algorithms in stl that have requirements on their input (sorted, usually), why is this different?
The main difference is that strings tend to have static (often immutable) content, so it makes more sense to tie their properties to a type. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

At Tue, 18 Jan 2011 14:50:51 -0600, Christian Holmquist wrote:
There are two ways this could go AFAICS:
1. We just use std::string for UTF-8 and eventually the whole world will catch up
This would be nice.
2. We establish some other type for UTF-8 and *it* becomes the lingua franca
If Boost abandons std::string in interfaces that expect UTF-8, does that mean I as a user need to sprinkle boost::to_utf_8(my_std_string,...) // in whatever form to_utf8 may be all over my/our (quite gigantic) code base?
Only if you're going to adopt new, currently nonexistent boost interfaces that operate on the new utf-8 type, or if you decide to do a wholesale adoption of that type in place of std::string. The latter sounds like quite a huge investment for your codebase, so it's probably not a good idea in the short-term but it might be a good long-term move.
Without doing so, I assume, will cause compilation errors, but for what gain? If some code was broken before, it will remain so after I've injected all those to_utf8 calls as well.
I'm not talking about breaking any existing code.
To solve actual problems I need to track the origin of my std::string's content, which requires a traditional bug-hunting session anyway. No additional typed interface in the world will help me here IMO.
Help you where? It doesn't sound like you have a problem you want to solve. If you like the status quo, don't change anything.
Aren't things still enough of a mess out there that #2 is just as
likely to work well?
"Just as likely to work well" doesn't sound good enough for me, from a maintenance point of view.
Huh? If it works just as well as the alternative, it does. If it's a huge hassle compared to the alternative, it doesn't work. I don't claim to know the answer to my question, but if the answer turned out to be "it's just as likely to work well," I don't understand how you could object.
I can picture how the changeset looks on the poor branch that decides to upgrade to such a version of boost. The problem isn't the type, but the content.
There are algorithms in stl that have requirements on their input (sorted, usually), why is this different?
There are also types in the STL that guarantee sortedness. If you write an algorithm that has to do a set intersection and you're operating on std::vectors, you have to explicitly document that they're required to be sorted, and the user of your algorithm has to carefully conform to that requirement without help from the compiler. If you accept only std::sets, the requirement is implicit and enforced by the compiler.
I'm sure it wouldn't be supported with the introduction of a sorted_value_input_iterator that I can pass to the std::set_xxx functions. (?)
No, it wouldn't.
What would be helpful, if doable, is to build boost with BOOST_TRACK_INVALID_UTF_8, also for release builds. This would cause an exception or a call to a user-defined function if boost code stumbles upon bad strings.
In my experience with Python, which uses exactly that strategy, it works badly. The problem is that so many common strings are just ASCII, and thus are not changed by encoding/decoding in utf-8, so it's very easy to overlook a problem until very late in the game, and when it *is* detected that is often very far away from the code that should have done the encoding/decoding in the first place. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
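Dave's sorted-container analogy from this exchange can be made concrete. In the sketch below (plain standard C++, nothing Boost-specific, with illustrative function names), std::set_intersection over vectors relies on a documented precondition the compiler cannot check, while std::set carries the sortedness guarantee in the type itself, exactly as utf8_t would carry the encoding guarantee:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// With vectors, "input must be sorted" is only a documented precondition;
// the caller must uphold it by hand (here we defensively sort).
std::vector<int> intersect_vectors(std::vector<int> a, std::vector<int> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// With std::set, no manual discipline is needed: the container's
// invariant guarantees sorted iteration, enforced by the type.
std::vector<int> intersect_sets(const std::set<int>& a, const std::set<int>& b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

The parallel to the string debate: std::string is the "vector" case (encoding is a documented convention), utf8_t would be the "set" case (encoding is a type invariant).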

On Tue, 18 Jan 2011 14:50:51 -0600 Christian Holmquist <c.holmquist@gmail.com> wrote:
There are two ways this could go AFAICS: [...]
2. We establish some other type for UTF-8 and *it* becomes the lingua franca
If Boost abandons std::string in interfaces that expect UTF-8, does that mean I as a user need to sprinkle boost::to_utf_8(my_std_string,...) // in whatever form to_utf8 may be all over my/our (quite gigantic) code base?
Only for functions that need to know the encoding of a string. As Artyom has rightly pointed out, most functions operate perfectly well by treating strings as opaque blocks of data, or as individual bytes. It's only things like Boost.RegEx or some of the string-manipulation functions that might want to act a bit differently in the face of multi-byte characters. Or, of course, newly-written functions in user code, outside of the Boost library.
Without doing so, I assume, will cause compilation errors, but for what gain? If some code was broken before, it will remain so after I've injected all those to_utf8 calls as well. To solve actual problems I need to track the origin of my std::string's content, which requires a traditional bug-hunting session anyway. No additional typed interface in the world will help me here IMO.
Maybe. But having a function whose parameters or return type is explicitly utf8_t will tell you (and the compiler) exactly what kind of string it's expecting, right in the code, whereas something that takes or returns an std::string doesn't. If you have to look up that information in the documentation, you're a lot more likely to miss it.
[...] What would be helpful, if doable, is to build boost with BOOST_TRACK_INVALID_UTF_8, also for release builds. This would cause an exception or a call to a user-defined function if boost code stumbles upon bad strings.
Interesting idea, but it pushes the problem entirely to runtime. Having utf*_t types lets the compiler do at least some of the work for you. -- Chad Nelson Oak Circle Software, Inc. * * *

On Tue, 18 Jan 2011 19:46:41 +0200 "Peter Dimov" <pdimov@pdimov.com> wrote:
Dave Abrahams wrote:
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though.
But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference between taking/returning a string and taking/returning an utf8_t is to force people to write an explicit conversion. This penalizes people who are already in UTF-8 land because it forces them to use utf8_t( s, encoding_utf8 ) and s.c_str( encoding_utf8 ) everywhere, without any gain or need. [...]
It doesn't have to. So long as the utf8_t class can easily determine what encoding it's being fed, it can be set up to do the conversion itself if necessary. That's how my utf*_t classes are designed; feed a utf8_t to a function that interfaces with the Windows API and takes a utf16_t parameter, and the classes will transparently convert it. If that function returns a utf16_t, and your internal storage type is utf8_t, just assign it directly to the utf8_t. If you're still using std::string, then the UTF classes would have to either make some assumptions or force you to add explicit conversions. But only library functions that care about the encoding would need to be written with utf*_t parameters, everything else could be left using std::string without any problem. My utf8_t class lets you get the std::string with operator*, so it's easy to use with such encoding-agnostic functions as well. -- Chad Nelson Oak Circle Software, Inc. * * *
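The transparent utf8_t-to-utf16_t conversion Chad describes can be sketched with a conversion operator. Everything below is a guess at the design under discussion, not his actual code: the class shapes and name_length are illustrative, and to_utf16() is an ASCII-only placeholder standing in for a real converter.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Placeholder converter: ASCII-only. A real implementation would perform
// a full UTF-8 -> UTF-16 conversion here.
static std::u16string to_utf16(const std::string& utf8) {
    return std::u16string(utf8.begin(), utf8.end());
}

class utf16_t;

class utf8_t {
    std::string data_;
public:
    explicit utf8_t(std::string s) : data_(std::move(s)) {}
    const std::string& operator*() const { return data_; }  // encoding-agnostic view
    operator utf16_t() const;  // transparent widening, as described in the post
};

class utf16_t {
    std::u16string data_;
public:
    explicit utf16_t(std::u16string s) : data_(std::move(s)) {}
    const std::u16string& operator*() const { return data_; }
};

utf8_t::operator utf16_t() const { return utf16_t(to_utf16(data_)); }

// A Windows-facing function declared with a utf16_t parameter now accepts
// a utf8_t argument with no explicit conversion at the call site:
std::size_t name_length(const utf16_t& name) { return (*name).size(); }
```

The implicit conversion operator is what makes name_length(some_utf8_value) compile: the compiler inserts the UTF-8-to-UTF-16 conversion, so callers "in UTF-8 land" pay no syntactic tax at Windows API boundaries. Whether that implicitness is a feature or a hazard is precisely the operator* debate that follows.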

On Tue, 18 Jan 2011 20:37:44 -0500, Chad Nelson wrote:
My utf8_t class lets you get the std::string with operator*, so it's easy to use with such encoding-agnostic functions as well.
I meant to mention this: please, no ;) Can we make it .raw() or .str() or something, anything but an operator overload? Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Wed, 19 Jan 2011 12:10:35 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Tue, 18 Jan 2011 20:37:44 -0500, Chad Nelson wrote:
My utf8_t class lets you get the std::string with operator*, so it's easy to use with such encoding-agnostic functions as well.
I meant to mention this: please, no ;) Can we make it .raw() or .str() or something, anything but an operator overload?
operator* has a long history of providing the contents of a variable, even in C, and is a lot less typing to boot. But if you have any technical arguments against it, I'm listening. -- Chad Nelson Oak Circle Software, Inc. * * *

On Wed, 19 Jan 2011 08:55:38 -0500, Chad Nelson wrote:
On Wed, 19 Jan 2011 12:10:35 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Tue, 18 Jan 2011 20:37:44 -0500, Chad Nelson wrote:
My utf8_t class lets you get the std::string with operator*, so it's easy to use with such encoding-agnostic functions as well.
I meant to mention this: please, no ;) Can we make it .raw() or .str() or something, anything but an operator overload?
operator* has a long history of providing the contents of a variable, even in C, and is a lot less typing to boot. But if you have any technical arguments against it, I'm listening.
operator* is not about providing the contents of a variable; it is about dereferencing something that points to an address, be that a real pointer, a C++ iterator, etc. utf8_t is just a class holding its internal data as a std::string, in a similar way to std::stringstream, which gives access to its internal string via .str(), or std::string, which serves up a raw char* string via .c_str(). While we're questioning cosmetic details, I'm wondering about the choice of utf8_t vs utf8 or utf8_string. What does the _t signify? I was under the impression that _t meant typedef; for example, wchar_t started life as a typedef rather than a primitive data type. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Wed, 19 Jan 2011 15:16:48 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
I meant to mention this: please, no ;) Can we make it .raw() or .str() or something, anything but an operator overload?
operator* has a long history of providing the contents of a variable, even in C, and is a lot less typing to boot. But if you have any technical arguments against it, I'm listening.
operator* is not about providing the contents of a variable, it is about dereferencing something that points to an address, be that a real pointer, a C++ iterator, etc.
In C, that's the case. In C++, it isn't; very few iterators are implemented as raw pointers. Boost.Optional isn't either, and I believe I've seen other types that have adopted the same idea.
utf8_t is just a class holding its internal data as a std::string, in a similar way to std::stringstream, which gives access to its internal string via .str(), or std::string, which serves up a raw char* string via .c_str().
There's certainly room for something like .raw() or .str(), but I'm partial to keeping operator* too.
While we're questioning cosmetic details, I'm wondering about the choice of utf8_t vs utf8 or utf8_string. What does the _t signify? I was under the impression that _t meant typedef; for example wchar_t started life as a typedef rather than a primitive data type.
That may have been the original intended meaning, but wchar_t is a true type of its own nowadays, at least on the compilers I use. I read it as a shortened form of "_type" now. -- Chad Nelson Oak Circle Software, Inc. * * *

At Wed, 19 Jan 2011 14:19:57 -0500, Chad Nelson wrote:
I'm partial to keeping operator* too.
-1. I promoted this basic approach with boost.optional but I think I regret that now. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Wed, 19 Jan 2011 05:55:38 -0800, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Wed, 19 Jan 2011 12:10:35 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Tue, 18 Jan 2011 20:37:44 -0500, Chad Nelson wrote:
My utf8_t class lets you get the std::string with operator*, so it's easy to use with such encoding-agnostic functions as well.
I meant to mention this: please, no ;) Can we make it .raw() or .str() or something, anything but an operator overload?
operator* has a long history of providing the contents of a variable, even in C, and is a lot less typing to boot. But if you have any technical arguments against it, I'm listening.
Can we stick to std::string conventions as closely as possible? It makes using whatever new string library that much easier, clearer, and more maintainable. From usage, it's not readily apparent what operator* is supposed to do in the context of strings, i.e., utf8_t myStr(...); some_api_foo(*myStr); Even if I'm an experienced programmer, but a newbie to whatever library makes use of some_api_foo, I would be scratching my head at "*myStr", and I would be forced to look up utf8_t::operator* or some_api_foo to figure it out. What about: utf8_t::cu_str, where the last part stands for code-unit string. I'm a big fan of conveying your intent in code. For the same reason I strongly disagree with utf8_t::str. utf8_t is already a string class, and a generic-sounding "str" method off it doesn't convey what kind of string it returns. And whichever you choose, can we have one and only one way of doing it? Again, for the sake of code maintainability. My thoughts/suggestions. Mostafa

On Wed, 19 Jan 2011 14:32:31 -0800 Mostafa <mostafa_working_away@yahoo.com> wrote:
operator* has a long history of providing the contents of a variable, even in C, and is a lot less typing to boot. But if you have any technical arguments against it, I'm listening.
Can we stick to std::string conventions as closely as possible? It makes using whatever new string library that much easier, clearer, and more maintainable.
Is there a conventional way to get the data stored in an std::string? ;-)
From usage, it's not readily apparent what operator* is supposed to do in the context of strings, ie,
utf8_t myStr(...); some_api_foo(*myStr);
Even if I'm an experienced programmer, but a newbie to whatever library makes use of some_api_foo, I would be scratching my head at "*myStr"; and I would be forced to look up utf8_t::operator* or some_api_foo to figure it out.
I'd lean toward encoded(), or at least coded(), if operator* is out. If you know anything about UTF-8, it's sufficiently descriptive. If you don't, then nothing that's short enough to type on a regular basis is going to eliminate the need for documentation.
What about:
utf8_t::cu_str
where the last one stands for code-unit string.
If we need a code-point iterator, using anything based on the name code-unit might be confusingly similar to anyone not already very familiar with Unicode.
I'm a big fan of conveying your intent in code. For the same reason I strongly disagree with utf8_t::str. utf8_t is already a string class, and a generic-sounding "str" method off it doesn't convey what kind of string it returns.
While that's true (and I'm not a fan of str() in this context either), it does have the advantage of implying that it returns an std::string, based on the conventions of std::stringstream. -- Chad Nelson Oak Circle Software, Inc. * * *

On Wed, 19 Jan 2011 18:44:44 -0800, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Wed, 19 Jan 2011 14:32:31 -0800 Mostafa <mostafa_working_away@yahoo.com> wrote:
operator* has a long history of providing the contents of a variable, even in C, and is a lot less typing to boot. But if you have any technical arguments against it, I'm listening.
Can we stick to std::string conventions as closely as possible? It makes using whatever new string library that much easier, clearer, and more maintainable.
Is there a conventional way to get the data stored in an std::string? ;-)
std::string::c_str, or am I missing something? (BTW, that's why I suggested utf8_t::cu_str, sounds similar.)
I'm a big fan of conveying your intent in code. For the same reason I strongly disagree with utf8_t::str. utf8_t is already a string class, and a generic-sounding "str" method off it doesn't convey what kind of string it returns.
While that's true (and I'm not a fan of str() in this context either), it does have the advantage of implying that it returns an std::string, based on the conventions of std::stringstream.
I would argue utf8_t is analogous to std::string, not std::stringstream. One's in the storage category, the other's in the streaming category. Mostafa

On Wed, 19 Jan 2011 19:27:05 -0800 Mostafa <mostafa_working_away@yahoo.com> wrote:
Can we stick to std::string conventions as closely as possible? It makes using whatever new string library that much easier, clearer, and more maintainable.
Is there a conventional way to get the data stored in an std::string? ;-)
std::string::c_str, or am I missing something? (BTW, that's why I suggested utf8_t::cu_str, sounds similar.)
Ah, I hadn't considered that.
I'm a big fan of conveying your intent in code. For the same reason I strongly disagree with utf8_t::str. utf8_t is already a string class, and a generic-sounding "str" method off it doesn't convey what kind of string it returns.
While that's true (and I'm not a fan of str() in this context either), it does have the advantage of implying that it returns an std::string, based on the conventions of std::stringstream.
I would argue utf8_t is analogous to std::string, not std::stringstream. One's in the storage category, the other's in the streaming category.
But both, as prominent members of the STL, are standards that developers are likely to recognize. -- Chad Nelson Oak Circle Software, Inc. * * *

At Wed, 19 Jan 2011 21:44:44 -0500, Chad Nelson wrote:
On Wed, 19 Jan 2011 14:32:31 -0800 Mostafa <mostafa_working_away@yahoo.com> wrote:
operator* has a long history of providing the contents of a variable, even in C, and is a lot less typing to boot. But if you have any technical arguments against it, I'm listening.
Can we stick to std::string conventions as closely as possible? It makes using whatever new string library that much easier, clearer, and more maintainable.
Is there a conventional way to get the data stored in an std::string? ;-)
s.data(); see std::vector (in C++0x at least, if not before).
From usage, it's not readily apparent what operator* is supposed to do in the context of strings, ie,
utf8_t myStr(...); some_api_foo(*myStr);
Even if I'm an experienced programmer, but a newbie to whatever library makes use of some_api_foo, I would be scratching my head at "*myStr"; and I would be forced to look up utf8_t::operator* or some_api_foo to figure it out.
+1 -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Wed, 19 Jan 2011 18:44:44 -0800, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Wed, 19 Jan 2011 14:32:31 -0800 Mostafa <mostafa_working_away@yahoo.com> wrote:
operator* has a long history of providing the contents of a variable, even in C, and is a lot less typing to boot. But if you have any technical arguments against it, I'm listening.
Can we stick to std::string conventions as closely as possible? It makes using whatever new string library that much easier, clearer, and more maintainable.
Is there a conventional way to get the data stored in an std::string? ;-)
On second thought, is there really a need to access the underlying data of utf8_t? I argue that having a view of the underlying data via iterators accomplishes just as much(*), and is more in line with the STL tradition of containers and iterators, not to mention the better encapsulation it affords the interface. Do clients really need to know, and potentially develop a dependency on, the fact that utf8_t (for now?) is really just a wrapper for std::string?

(*) Granted, there are legacy/existing C and C++ interfaces that require C-style strings, and I guess that's what a potential utf8_t::c_str would be for.

Mostafa

On Thu, 20 Jan 2011 12:52:25 -0800 Mostafa <mostafa_working_away@yahoo.com> wrote:
Can we stick to std::string conventions as closely as possible? It makes using whatever new string library that much easier, clearer, and more maintainable.
Is there a conventional way to get the data stored in an std::string? ;-)
On second thought, is there really a need to access the underlying data of utf8_t? I argue that having a view of the underlying data via iterators accomplishes just as much(*), and is more in line with the STL tradition of containers and iterators, not to mention the better encapsulation it affords the interface. Do clients really need to know, and potentially develop a dependency on, the fact that utf8_t (for now?) is really just a wrapper for std::string?
For interaction with functions that require std::string, and presumably don't care about the encoding, it's convenient. Without it, using the UTF classes with such a function requires calling the std::string constructor with the element iterators. That's going to be a major need for the foreseeable future. -- Chad Nelson Oak Circle Software, Inc. * * *

On Thu, 20 Jan 2011 15:46:01 -0800, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Thu, 20 Jan 2011 12:52:25 -0800 Mostafa <mostafa_working_away@yahoo.com> wrote:
Can we stick to std::string conventions as closely as possible? It makes using whatever new string library that much easier, clearer, and more maintainable.
Is there a conventional way to get the data stored in an std::string? ;-)
On second thought, is there really a need to access the underlying data of utf8_t? I argue that having a view of the underlying data via iterators accomplishes just as much(*), and is more in line with the STL tradition of containers and iterators, not to mention the better encapsulation it affords the interface. Do clients really need to know, and potentially develop a dependency on, the fact that utf8_t (for now?) is really just a wrapper for std::string?
For interaction with functions that require std::string, and presumably don't care about the encoding, it's convenient. Without it, using the UTF classes with such a function requires calling the std::string constructor with the element iterators. That's going to be a major need for the foreseeable future.
Would that be a bad thing? That is, forcing clients to call the std::string constructors with element iterators. I don't see it as a need, but more of a want. It's just a convenience for the clients with drawbacks for the interface, namely exposing implementation detail that doesn't need to be exposed. Mostafa

On Thu, 20 Jan 2011 17:19:24 -0800 Mostafa <mostafa_working_away@yahoo.com> wrote:
On second thought, is there really a need to access the underlying data of utf8_t? [...]
For interaction with functions that require std::string, and presumably don't care about the encoding, it's convenient. Without it, using the UTF classes with such a function requires calling the std::string constructor with the element iterators. That's going to be a major need for the foreseeable future.
Would that be a bad thing? That is, forcing clients to call the std::string constructors with element iterators. I don't see it as a need, but more of a want. It's just a convenience for the clients with drawbacks for the interface, namely exposing implementation detail that doesn't need to be exposed.
It's a need if you want people to use the class. I, for instance, would take one look at the requirements you outline, and how often I'd have to use them in my current codebase, and say "no thanks" to the whole idea. -- Chad Nelson Oak Circle Software, Inc. * * *

On 01/20/2011 03:46 PM, Chad Nelson wrote:
... elision by patrick
For interaction with functions that require std::string, and presumably don't care about the encoding, it's convenient. Without it, using the UTF classes with such a function requires calling the std::string constructor with the element iterators. That's going to be a major need for the foreseeable future.
But if utf8_string, like std::string, derives from basic_string, wouldn't that just work?

Patrick

On Fri, 21 Jan 2011 01:43:08 -0800, Patrick Horgan wrote:
On 01/20/2011 03:46 PM, Chad Nelson wrote:
... elision by patrick
For interaction with functions that require std::string, and presumably don't care about the encoding, it's convenient. Without it, using the UTF classes with such a function requires calling the std::string constructor with the element iterators. That's going to be a major need for the foreseeable future.
But if utf8_string, like std::string, derives from basic_string, wouldn't that just work?
Rule 35 of C++ Coding Standards by Sutter and Alexandrescu: "Avoid inheriting from classes that were not designed to be classes" "Using a standalone class as a base is a serious design error" The reasons given include undefined behaviour when deleting a std::string* pointing to an instance of the subclass, slicing if extra data is added, and pointlessness, as the subclass doesn't get any more access to the superclass's implementation than it would have had if it just kept std::string as a member. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Alexander Lamaison wrote:
Rule 35 of C++ Coding Standards by Sutter and Alexandrescu: "Avoid inheriting from classes that were not designed to be classes"
^ base
"Using a standalone class as a base is a serious design error"
The reasons given include undefined behaviour when deleting a std::string* pointing to an instance of the subclass, slicing if extra data is added and pointlessness as the subclass doesn't get any more access to the superclass's implementation that it would have had if it just kept std::string as a member.
If there are no additional members, then slicing and failure to invoke the derived class's destructor cause no problems, while inheritance means the base class interface needn't be duplicated. In this case, one gets a new type name that behaves just like std::string at the cost of some forwarding constructors and assignment operators (for the right return type). That said, following Rule 35 is generally wise. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

At Fri, 21 Jan 2011 09:01:10 -0500, Stewart, Robert wrote:
Alexander Lamaison wrote:
Rule 35 of C++ Coding Standards by Sutter and Alexandrescu: "Avoid inheriting from classes that were not designed to be classes"
^ base
"Using a standalone class as a base is a serious design error"
The reasons given include undefined behaviour when deleting a std::string* pointing to an instance of the subclass, slicing if extra data is added and pointlessness as the subclass doesn't get any more access to the superclass's implementation that it would have had if it just kept std::string as a member.
If there are no additional members, then slicing and failure to invoke the derived class's destructor cause no problems
None other than undefined behavior. :-) The fact that most compilers and platforms will let you get away with it notwithstanding, I'd like to avoid that. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 01/20/2011 12:52 PM, Mostafa wrote:
... elision by patrick ...
On second thought, is there really a need to access the underlying data of utf8_t? I argue that having a view of the underlying data via iterators accomplishes just as much(*), and is more in line with the STL tradition of containers and iterators, not to mention the better encapsulation it affords the interface. Do clients really need to know, and potentially develop a dependency on, the fact that utf8_t (for now?) is really just a wrapper for std::string?

What type would be returned by operator* on the iterator for a utf8_string? char32_t? What do you do about combining characters? Return them one at a time and let the application deal with it? That's what I think; I don't see what else you could do.

There are a lot of other issues. Assuming it has the same interface as std::string, how would you do max_size()? How about the comparison operators? There's:

    template<typename charT, typename traits, typename Allocator>
    bool operator<=(const basic_string<charT, traits, Allocator>& lhs,
                    const charT* rhs);

What would the equivalent be for utf8_string? For the above, the rhs is in effect converted to basic_string for the comparison. For a utf8_string, what if the rhs doesn't convert to UTF-8? Should there be some conversion facet able to be specified for the rhs? std::string's comparison operators are supposed to take linear time. These would ...

capacity() is supposed to return the largest number of characters the string can hold without reallocation. Would you return that by considering that the smallest characters would only take one byte?

std::string's operator[] is supposed to work in constant time. This one couldn't. It would be fun to make it, but it would have to differ in some ways from the specification of std::string.

How about push_back or insert? What do they take for the argument? A char32_t encoded as UTF-32? Of course you'd have to insert combining characters one part at a time.

If you have LC_COLLATE set to en_US.utf8 then std::sort should just work. (Replace en_ with whatever is used in your locale.)

Patrick

On Fri, 21 Jan 2011 01:35:15 -0800, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/20/2011 12:52 PM, Mostafa wrote:
... elision by patrick ...
On second thought, is there really a need to access the underlying data of utf8_t? I argue that having a view of the underlying data via iterators accomplishes just as much(*), and is more in line with the STL tradition of containers and iterators, not to mention the better encapsulation it affords the interface. Do clients really need to know, and potentially develop a dependency on, the fact that utf8_t (for now?) is really just a wrapper for std::string?

What type would be returned by operator* on the iterator for a utf8_string? char32_t? What do you do about combining characters? Return them one at a time and let the application deal with it? That's what I think; I don't see what else you could do.

There are a lot of other issues. Assuming it has the same interface as std::string, how would you do max_size()? How about the comparison operators? There's:

    template<typename charT, typename traits, typename Allocator>
    bool operator<=(const basic_string<charT, traits, Allocator>& lhs,
                    const charT* rhs);

What would the equivalent be for utf8_string? For the above, the rhs is in effect converted to basic_string for the comparison. For a utf8_string, what if the rhs doesn't convert to UTF-8? Should there be some conversion facet able to be specified for the rhs? std::string's comparison operators are supposed to take linear time. These would ...
capacity() is supposed to return the largest number of characters the string can hold without reallocation. Would you return that by considering that the smallest characters would only take one byte?
The std::string's operator[] is supposed to work in constant time. This one couldn't. It would be fun to make it, but it would have to differ in some ways from the specification of std::string.
How about push_back or insert? What do they take for the argument? A char32_t encoded as utf-32? Of course you'd have to insert combining characters one part at a time.
If you have LC_COLLATE set to en_US.utf8 then std::sort should just work. (Replace en_ with whatever is used in your locale.)
Patrick
Interesting questions, but how do they relate to the sequence of posts you cited? Nevertheless, let me attempt to address some of them in the context of utf8_t and what I had posted.

I was thinking that utf8_t should just be considered a container, whose interface only deals with iterators when it comes to "element" access, and that there should be three types of such iterators: code unit iterators, code point iterators, and character iterators. The utf8_t API should not accept or return individual code unit types (i.e., an octet type), or individual code point types (i.e., a 32-bit type), and, obviously, individual character types, since there is no C++ type that can represent any Unicode character. Thus, insert() and push_back() would take a range of iterators, etc.

And does operator[] make sense for utf8_t, or should it be more aptly named:

    iterator_range character(size_t const ordinal_position)

Though, I would argue one wouldn't need either of the latter two methods if the aforementioned iterators are random access (and I don't see a reason why they shouldn't be).

Mostafa

On Fri, 21 Jan 2011 01:35:15 -0800 Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/20/2011 12:52 PM, Mostafa wrote:
On second thought, is there really a need to access the underlying data of utf8_t? I argue that having a view of the underlying data via iterators accomplishes just as much(*), and is more inline with the stl tradition of containers and iterators, not to mention the better encapsulation it affords the interface. Do clients really need to know, and potentially develop a dependency on, the fact that utf8_t (for now?) is really just a wrapper for std::string?
What type would be returned by operator* on the iterator for a utf8_string? [...]
Which iterator? ;-) As I'd envisioned it, there would be three: an element iterator using char, a code-point iterator using char32_t, and a true character iterator using a custom class. The custom class might be ugly and hard to work with, but would be guaranteed to do the right thing.
There's a lot of other issues. Assuming it has the same interface as std::string how would you do max_size()? How about the comparison operators? [...]
max_size would have to operate on char elements, as there's no other accurate answer. Comparison operators would either operate on code-points or, through Boost.Locale, characters.
What would the equivalent be for utf8_string? For the above, the rhs is in effect converted to basic_string for the comparison. For a utf8_string, what if the rhs doesn't convert to utf-8? Should there be some conversion facet able to be specified for the rhs?
The more people discuss it, the more I think automatic conversions from std::string to the UTF types is the wrong way to go about it. It would be convenient, and would do the right thing in 90% of cases -- but it would do absolutely the *wrong* thing in the other 10%, where the std::string does *not* contain the encoding that the UTF constructor assumes. And most developers wouldn't think about that until they ran into it the hard way, after their programs were in widespread use.
std::string's comparison operators are supposed to take linear time. [...]
Obviously the hypothetical boost::string would have some slight differences from std::string. It would have to. -- Chad Nelson Oak Circle Software, Inc. * * *

Dave Abrahams wrote:
I think the reason to use separate types is to provide a type-safety barrier between your functions that operate on utf-8 and system or 3rd-party interfaces that don't or may not. In principle, that should force you to think about encoding and decoding at all the places where it may be needed, and should allow you to code naturally and with confidence where everybody is operating in utf8-land.
On 01/18/2011 03:27 AM, Peter Dimov wrote:
Yes, in principle. It isn't terribly necessary if everybody is operating in UTF-8 land though. It's a bit like defining a separate integer type for nonnegative ints for type safety reasons - useful in theory, but nobody does it.
Are you saying that no one uses unsigned int for non-negative ints? I'm thinking I'm just misunderstanding you. I work with whole groups of people who are careful to declare things to match their use, to take advantage of the compiler diagnostics. Show me any large body of code where people are sloppy about this, and I'll turn on the appropriate warnings and find bugs for you by inspection.

My experience is that declaring everything int is something beginners do, but once they've been bitten by the inevitable subtle and not-so-subtle bugs, intermediate-level programmers learn to declare as unsigned things that will always be non-negative and for which it would be a mistake to ever be negative. In spite of being a good programmer with years of experience, I make a constant series of sloppy coding errors and am thankful for every category the compiler will tell me about. Everyone who has ever worked at a place that builds with warnings turned up and wants the warnings gone has gone through this and learned these lessons. That's why I think I'm probably misunderstanding you.
If you're designing an interface that takes UTF-8 strings, it still may be worth it to have the parameters be of a utf8-specific type, if you want to force your users to think about the encoding of the argument each time they call one of your functions... this is a legitimate design decision. If you're in control of the whole program, though, it's usually not worth it - you just keep everything in UTF-8.
It's exactly why you would do it. It gets the compiler involved and it will give you diagnostics that make it harder for you to do the wrong thing. If the converting constructors for the utf-8 specific type are all explicit, so you can't accidentally get rid of the warning and _still_ have incorrect code, all the better. Better to be correct by design when you can. Patrick

On Mon, 17 Jan 2011 10:09:13 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
I've done some research, and it looks like it would require little effort to create an os::string_t type that uses the current locale, and assume all raw std::strings that contain eight-bit values are coded in that instead.
Design-wise, ascii_t would need to change slightly after this, to throw on anything that can't fit into a *seven*-bit value, rather than eight-bit. I'll add the default-character option to both types as well, and maybe make other improvements as I have time.
Unfortunately this is not the correct approach as well.
For example, why do you think it is safe to pass the ASCII subset of UTF-8 to the current non-UTF-8 locale?
For example, Shift-JIS, which is in use in the Windows ANSI API, has a different character set in the 0-127 range - it is not ASCII!
Ah, I wasn't aware that there were character sets that redefined 0..127. That does change things a bit.
Also, if you want to use std::codecvt facets... Don't rely on them unless you know where they come from!
1. By default they are a no-op - in the default C locale
2. Under most compilers they are not implemented properly. [...]
I was planning to use MultiByteToWideChar and its opposite under Windows (which presumably would know how to translate its own code pages), and mbsrtowcs and its ilk under POSIX systems (which apparently have been well-implemented for at least seven versions under glibc [1], though I can't tell whether eglibc -- the fork that Ubuntu uses -- has the same level of capabilities). [1]: <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
Bottom line: don't rely on the "current locale" :-)
I hadn't wanted to add a dependency on ICU or iconv either. Though I may end up having to for the program I'm currently developing, on at least some platforms.
[...] I would strongly recommend to read the answer of Pavel Radzivilovsky on Stackoverflow:
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...
And he is a hard-core Windows programmer, designer, architect, and developer, and still he chose UTF-8!
Thanks, I'm familiar with it. In fact, reading that was one of the reasons that I started developing the utf*_t classes, so that I *could* keep strings in UTF-8 while still keeping track of the ones that aren't.
The problem is that the issue is so complicated that there is only one way to make it both absolutely general and right - decide what you are working with and stick with it.
In the CppCMS project I work on (and I developed Boost.Locale because of it) I stick by default with UTF-8 and use plain std::string - works like a charm.
To each his own. :-)
Inventing "special Unicode strings or storage" does not improve anybody's understanding of Unicode, nor does it improve its handling.
We'll have to agree to disagree there. The whole point to these classes was to provide the compiler -- and the programmer using them -- with some way for the string to carry around information about its encoding, and allow for automatic conversions between different encodings. If you're working with strings in multiple encodings, as I have to in one of the programs we're developing, it frees up a lot of mental stack space to deal with other issues. -- Chad Nelson Oak Circle Software, Inc. * * *

From: Chad Nelson <chad.thecomfychair@gmail.com>
Artyom <artyomtnk@yahoo.com> wrote:
I've done some research, and it looks like it would require little effort to create an os::string_t type that uses the current locale, and assume all raw std::strings that contain eight-bit values are coded in that instead.
Design-wise, ascii_t would need to change slightly after this, to throw on anything that can't fit into a *seven*-bit value, rather than eight-bit. I'll add the default-character option to both types as well, and maybe make other improvements as I have time.
Also, if you want to use std::codecvt facets... Don't rely on them unless you know where they come from!
1. By default they are a no-op - in the default C locale
2. Under most compilers they are not implemented properly. [...]
I was planning to use MultiByteToWideChar and its opposite under Windows (which presumably would know how to translate its own code pages),
Ok... First of all, I'd suggest taking a look at this code: http://cppcms.svn.sourceforge.net/viewvc/cppcms/boost_locale/trunk/libs/locale/src/encoding/wconv_codepage.hpp?revision=1462&view=markup What you would see is how painfully hard it is to use these functions correctly if you want to support things like skipping or replacing invalid characters. So if you use them, use them with SUPER care, and don't forget that there are differences between Windows XP and below and Windows Vista and above - to make your life even more interesting (a.k.a. miserable).
and mbsrtowcs and its ilk under POSIX systems (which apparently have been well-implemented for at least seven versions under glibc [1], though I can't tell whether eglibc -- the fork that Ubuntu uses -- has the same level of capabilities).
No... no... This is not the way to go. For example, what would be the result of:

    #include <stdlib.h>
    #include <wchar.h>

    int main()
    {
        wchar_t wide[32];
        const char *src = "שלום";
        size_t size = mbsrtowcs(wide, &src, sizeof(wide)/sizeof(wide[0]), NULL);
        /* ??? */
    }

when the current system locale is, let's say, en_US.UTF-8? The resulting size would be (size_t)(-1), indicating an error. You first need to call:

    setlocale(LC_ALL, "");

to set up the default locale, and only then would mbsrtowcs work. And how do you think the code below would behave after that call to setlocale(...)?

    FILE *f = fopen("point.csv", "w");
    fprintf(f, "%f,%f\n", 1.3, 4.5);
    fclose(f);

What would be the output? Would it succeed in creating a correct CSV? Answer: it depends on the locale. For example, in some locales like ru_RU.UTF-8 or Russian_Russia it would be "1,3,4,5" and not the expected "1.3,4.5". Nice, isn't it?! And believe me, 99.9% of developers would find it hard to understand what is wrong with this code. You can't use these functions!

Also there is another problem. What is the "current locale" on the current OS?

- Is it defined by the global OS settings of the environment variables LC_ALL, LC_CTYPE or LANG?
- Is it defined by the environment variable LC_ALL, LC_CTYPE or LANG in the current user environment?
- Is it defined by the environment variable LC_ALL, LC_CTYPE or LANG in the current process environment?
- Is it the locale defined by setlocale(LC_ALL, "My_Locale_Name.My_Encoding")?
- Is it the locale defined by std::locale::global(std::locale("My_Locale_Name.My_Encoding"))?

All answers are correct, and all users would probably expect each one of them to work. Don't bother trying to detect or convert to the "current locale" on a POSIX system - it is something that can be changed easily, or may even not be defined at all!
I hadn't wanted to add a dependency on ICU or iconv either. Though I may end up having to for the program I'm currently developing, on at least some platforms.
Under Unix it is more than justified to use iconv - it is a standard POSIX API; in fact, on Linux it is part of libc, while on some other platforms (like FreeBSD) it may be an independent library. Actually, Boost.Locale uses iconv by default under Linux, as it is a better API than ICU's (and faster, because it does not require passing via UTF-16).
We'll have to agree to disagree there. The whole point to these classes was to provide the compiler -- and the programmer using them -- with some way for the string to carry around information about its encoding, and allow for automatic conversions between different encodings.
This is a totally different problem. If so, you need a container like this:

    class specially_encoded_string {
    public:
        std::string encoding() const
        {
            return encoding_;
        }
        std::string to_utf8() const
        {
            return convert(content_, encoding_, "UTF-8");
        }
        void from_utf8(std::string const &input)
        {
            content_ = convert(input, "UTF-8", encoding_);
        }
        std::string const &raw() const
        {
            return content_;
        }
    private:
        std::string encoding_; // <----- VERY IMPORTANT
                               // may have values such as: ASCII, Latin1,
                               // ISO-8859-8, Shift-JIS or Windows-1255
        std::string content_;  // <----- the raw string
    };

Creating an "ascii_t" container, or anything that does not carry the REAL encoding name with it, would lead to bad things.
If you're working with strings in multiple encodings, as I have to in one of the programs we're developing, it frees up a lot of mental stack space to deal with other issues.
The best way is to convert the input encoding to an internal one and use it, then convert it back on output. I have written several programs that use different encodings:

1. BiDiTeX: LaTeX + BiDi for Hebrew - converts the input encoding to UTF-32 and then converts it back on output.

2. CppCMS: it allows using non-UTF-8 encodings, but the encoding information is carried with the std::locale::codecvt facet I created, and the encoding/locale is bound to the current request/response context. Every user input (and, BTW, output as well) is validated - for example, an HTML form validates the input encoding by default.

These are my solutions to my real problems. What you suggest is misleading and not well defined. Best Regards, Artyom

On Mon, 17 Jan 2011 23:50:18 -0800 (PST), Artyom wrote:
We'll have to agree to disagree there. The whole point to these classes was to provide the compiler -- and the programmer using them -- with some way for the string to carry around information about its encoding, and allow for automatic conversions between different encodings.
This is a totally different problem. If so, you need a container like this:
class specially_encoded_string { public: std::string encoding() const { return encoding_; } std::string to_utf8() const { return convert(content_,encoding_,"UTF-8"); } void from_utf8(std::string const &input) { content_ = convert(input,"UTF-8",encoding_); } std::string const &raw() const { return content_; } private: std::string encoding_; /// <----- VERY IMPORTANT /// may have values such as: ASCII, Latin1, /// ISO-8859-8, Shift-JIS or Windows-1255 std::string content_; /// <----- the raw string };
Creating an "ascii_t" container, or anything that does not carry the REAL encoding name with it, would lead to bad things.
I thought the point of using different types was to avoid tagging a string with an encoding name. In other words, a utf8_t would always hold a std::string content_ in UTF-8 format. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
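Alex's reading - the encoding as a compile-time property of the type - can be sketched like this (hypothetical types, not the actual classes under review):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Each type's invariant is "my bytes are in this one encoding", so the
// encoding travels with the C++ type instead of a runtime tag.
class ascii_t {
public:
    explicit ascii_t(const std::string& s) : bytes_(s) {
        for (unsigned char c : s)                  // enforce the invariant
            if (c > 0x7F) throw std::invalid_argument("not ASCII");
    }
    const std::string& raw() const { return bytes_; }
private:
    std::string bytes_;
};

class utf8_t {
public:
    explicit utf8_t(std::string s) : bytes_(std::move(s)) {}
    // Every ASCII string is already valid UTF-8, so this widening is free.
    utf8_t(const ascii_t& a) : bytes_(a.raw()) {}
    const std::string& raw() const { return bytes_; }
private:
    std::string bytes_;
};

// A function that requires UTF-8 input says so in its signature; passing a
// plain std::string of unknown encoding no longer compiles silently.
std::size_t utf8_size_bytes(const utf8_t& s) { return s.raw().size(); }
```

The conversion from ascii_t to utf8_t is implicit because it cannot fail, while the reverse would have to validate; that asymmetry is exactly the information the type system is being asked to carry.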

From: Alexander Lamaison <awl03@doc.ic.ac.uk>
On Mon, 17 Jan 2011 23:50:18 -0800 (PST), Artyom wrote:
We'll have to agree to disagree there. The whole point to these classes was to provide the compiler -- and the programmer using them -- with some way for the string to carry around information about its encoding, and allow for automatic conversions between different encodings.
This is a totally different problem. If so, you need a container like this:
class specially_encoded_string { [snip] }
Creating an "ascii_t" container, or anything that does not carry the REAL encoding name with it, would lead to bad things.
I thought the point of using different types was to avoid tagging a string with an encoding name. In other words, a utf8_t would always hold a std::string content_ in UTF-8 format.
I'm addressing this problem:
The whole point to these classes was to provide the compiler -- and the programmer using them -- with some way for the string to carry around information about its encoding
i.e. sometimes a string should come with its encoding. The point is that if the encoding you are using is not the **default** encoding in your program (i.e. UTF-8), then you may need to add an encoding "tag" to the text; otherwise just use UTF-8 with std::string. Artyom

On Mon, 17 Jan 2011 23:50:18 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
Also, if you want to use std::codecvt facets... Don't rely on them unless you know where they come from!
1. By default they are noop - in the default C locale
2. Under most compilers they are not implemented properly. [...]
I was planning to use MultiByteToWideChar and its opposite under Windows (which presumably would know how to translate its own code pages),
Ok...
First of all, I'd suggest taking a look at this code:
Pretty convoluted.
What you would see is how painfully hard it is to use these functions right if you want to support things like skipping or replacing invalid characters.
Sorry for the cheap shot, but: it's Microsoft. I *expect* it to be painful to use, from long experience. ;-)
So if you use it, use it with SUPER care, and don't forget that there are changes between Windows XP and below and Windows Vista and above - to make your life even more interesting (a.k.a. miserable)
As you might have seen in an earlier reply this morning, I didn't realize that it wasn't irretrievably tied to ICU; now that I know, I'd be completely happy letting Boost.Locale handle the code-page stuff. [...]
We'll have to agree to disagree there. The whole point to these classes was to provide the compiler -- and the programmer using them -- with some way for the string to carry around information about its encoding, and allow for automatic conversions between different encodings.
This is a totally different problem. If so, you need a container like this:
class specially_encoded_string { [...] private: std::string encoding_; /// <----- VERY IMPORTANT /// may have values such as: ASCII, Latin1, /// ISO-8859-8, Shift-JIS or Windows-1255 std::string content_; /// <----- the raw string };
If you want arbitrary encodings, yes. If you only want a subset of the possible encodings -- such as ASCII and the three main UTF types -- then all you need is some way to convert to and from an OS-specific encodings.
Creating an "ascii_t" container, or anything that does not carry the REAL encoding name with it, would lead to bad things.
Certainly, if you tried to use it for stuff that isn't really in that encoding. It wasn't meant for that.
If you're working with strings in multiple encodings, as I have to in one of the programs we're developing, it frees up a lot of mental stack space to deal with other issues.
The best way is to convert the input encoding to an internal one and use it, then convert it back on output.
I agree, for the most part. But if a large something comes in encoded with a particular coding, why waste a possibly-significant amount of processor time immediately recoding it to your internal format if you don't know that you're going to need to do anything with it? Or if it might well just be going out in that same external format again, without needing to be touched? Much better to hold onto it in whatever format it comes in, and only recode it when you need to, in my opinion -- if you can easily keep track of what format it's in, anyway.
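Chad's hold-it-and-recode-lazily approach can be sketched as follows (hypothetical class names; Latin-1 stands in for "whatever encoding the data arrived in" because its conversion to UTF-8 fits in a few lines):

```cpp
#include <optional>
#include <string>

// Latin-1 to UTF-8: bytes below 0x80 pass through, the rest become 2 bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Hold the bytes exactly as they arrived; recode to the internal encoding
// (UTF-8) only on first use, and cache the result. Data that merely passes
// through the program untouched is never recoded at all.
class lazy_utf8 {
public:
    explicit lazy_utf8(std::string latin1_bytes)
        : raw_(std::move(latin1_bytes)) {}
    const std::string& raw() const { return raw_; }   // pass-through path
    const std::string& utf8() const {                 // recode on demand
        if (!utf8_cache_) utf8_cache_ = latin1_to_utf8(raw_);
        return *utf8_cache_;
    }
private:
    std::string raw_;
    mutable std::optional<std::string> utf8_cache_;
};
```

For example, Latin-1 "caf\xE9" ("café") stays four bytes until utf8() is first called, at which point the cached five-byte UTF-8 form "caf\xC3\xA9" is produced.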
[...] 2. CppCMS: it allows using non-UTF-8 encodings, but the encoding information is carried with the std::locale::codecvt facet I created, and the encoding/locale is bound to the current request/response context. [...]
That sounds an awful lot like having a new string type that carries around its encoding. ;-)
These are my solutions of my real problems. What you suggest is misleading and not well defined.
I can see that parts of it are certainly not well defined yet, but I believe it's a fixable problem. -- Chad Nelson Oak Circle Software, Inc. * * *

From: Chad Nelson <chad.thecomfychair@gmail.com> Artyom <artyomtnk@yahoo.com> wrote:
I've done some research, and it looks like it would require little effort to create an os::string_t type that uses the current locale, and assume all raw std::strings that contain eight-bit values are coded in that instead.
Design-wise, ascii_t would need to change slightly after this, to throw on anything that can't fit into a *seven*-bit value, rather than eight-bit. I'll add the default-character option to both types as well, and maybe make other improvements as I have time. Also, if you want to use std::codecvt facets... Don't rely on them unless you know where they come from!
1. By default they are noop - in the default C locale
2. Under most compilers they are not implemented properly. [...]
I was planning to use MultiByteToWideChar and its opposite under Windows (which presumably would know how to translate its own code pages),
Ok... First of all, I'd suggest taking a look at this code: http://cppcms.svn.sourceforge.net/viewvc/cppcms/boost_locale/trunk/libs/locale/src/encoding/wconv_codepage.hpp?revision=1462&view=markup What you would see is how painfully hard it is to use these functions right if you want to support things like skipping or replacing invalid characters. So if you use them, use them with SUPER care, and don't forget that there are changes between Windows XP and below and Windows Vista and above - to make your life even more interesting (a.k.a. miserable).
and mbsrtowcs and its ilk under POSIX systems (which apparently have been well-implemented for at least seven versions under glibc [1], though I can't tell whether eglibc -- the fork that Ubuntu uses -- has the same level of capabilities).
This is the code that converts between encodings.
Bottom line: don't rely on the "current locale" :-)
I hadn't wanted to add a dependency on ICU or iconv either. Though I may end up having to for the program I'm currently developing, on at least some platforms.
[...] I would strongly recommend reading the answer of Pavel Radzivilovsky on Stack Overflow:
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...
And he is a hard-core Windows programmer, designer, architect and developer, and still he chose UTF-8!
Thanks, I'm familiar with it. In fact, reading that was one of the reasons that I started developing the utf*_t classes, so that I *could* keep strings in UTF-8 while still keeping track of the ones that aren't.
The problem is that the issue is so complicated that there is only one way to make it both absolutely general and correct - decide what you are working with and stick with it.
In the CppCMS project I work on (and I developed Boost.Locale because of it), I stick with UTF-8 by default and use plain std::string - works like a charm.
To each his own. :-)
Inventing "special Unicode strings or storage" does not improve anybody's understanding of Unicode, nor does it improve its handling.
We'll have to agree to disagree there. The whole point to these classes was to provide the compiler -- and the programmer using them -- with some way for the string to carry around information about its encoding, and allow for automatic conversions between different encodings. If you're working with strings in multiple encodings, as I have to in one of the programs we're developing, it frees up a lot of mental stack space to deal with other issues. -- Chad Nelson Oak Circle Software, Inc. * * *

-1
I'm opposed to this strategy simply because it differs from the way existing libraries treat narrow strings. Not least the STL. If you open an fstream with a narrow filename, for instance, this isn't treated as a UTF-8 string. It's treated as being in the local codepage.
First of all, neither in C++03 nor in C++0x can you open a file stream with a wide file name. MSVC provides a non-standard extension, but it does not exist in other compilers like GCC/MinGW. So using C++ you can't open a file called "שלום-سلام-pease-Мир.txt" under Microsoft Windows. You can use an OS-level API like _wfopen to do this job using a wide string. But you can't do this in C++. Period.

The idea is the following:

1. Provide replacements for system libraries that actually use text and relate to it as text in some encoding. For the STL and the standard C library this would be the filesystem API, so you need to provide something like boost::filesystem::fstream.

2. Make all Boost libraries use the Wide API only and never call the ANSI API.

3. Treat narrow strings as UTF-8 and convert them to wide prior to system calls.
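Point 3 - treat narrow strings as UTF-8 and convert them to wide just before the system call - needs a UTF-8 to UTF-16 conversion step. A portable sketch of that step (assumed helper name; for brevity it does not reject overlong forms or encoded surrogates, which a production version must; on Windows the result would be handed to wide APIs such as _wfopen or CreateProcessW):

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode UTF-8 into UTF-16 code units; throws on structurally malformed
// input. Code points above U+FFFF become a surrogate pair.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = in[i];
        std::uint32_t cp;
        std::size_t len;
        if      (b < 0x80) { cp = b;        len = 1; }
        else if (b < 0xC0) throw std::invalid_argument("stray continuation");
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else if (b < 0xF8) { cp = b & 0x07; len = 4; }
        else               throw std::invalid_argument("invalid lead byte");
        if (i + len > in.size()) throw std::invalid_argument("truncated");
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80) throw std::invalid_argument("bad continuation");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;
        if (cp <= 0xFFFF) {
            out += static_cast<char16_t>(cp);
        } else {                               // encode as a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

With this in place, a narrow-string API can stay UTF-8 everywhere and perform the widening only at the OS boundary.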
While this behaviour isn't great, it is standard.
If the standard is bad and leads to unportable, platform-incompatible code, it should not be used! You can always provide a fallback like boost::utf8_to_locale_encoding if you have to use the ANSI API. But generally you should just use something like boost::utf8_to_utf16 and always call the Wide API. You must not use the ANSI API under Windows. Artyom

On Fri, 14 Jan 2011 07:27:49 -0800 (PST), Artyom wrote:
-1
I'm opposed to this strategy simply because it differs from the way existing libraries treat narrow strings. Not least the STL. If you open an fstream with a narrow filename, for instance, this isn't treated as a UTF-8 string. It's treated as being in the local codepage.
First of all, neither in C++03 nor in C++0x can you open a file stream with a wide file name. MSVC provides a non-standard extension, but it does not exist in other compilers like GCC/MinGW.
So using C++ you can't open a file called: "שלום-سلام-pease-Мир.txt" under Microsoft Windows.
You can use an OS-level API like _wfopen to do this job using a wide string. But you can't do this in C++. Period.
The situation is abysmal, I grant you that.
The idea is following:
1. Provide replacement for system libraries that actually use text and relate to it as text in some encoding.
For STL and standard C library it would be filesystem API.
So you need to provide something like boost::filesystem::fstream
+1. Done already, I believe :)
2. Make all boost libraries use Wide API only and never call ANSI API.
+1
3. Treat narrow strings as UTF-8 and convert them to wide prior to system calls.
This is the part I have problems with: interpreting it as UTF-8 _by default_. Unless the programmer reads the docs really well, they would most likely expect to be able to use a narrow string as returned by other libraries and pass it straight to Boost libraries without first having to convert it. Boost.Filesystem v3 allows you to specify that the incoming string is UTF-8 encoded but doesn't _default_ to that. Is this insufficient? Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Thu, 13 Jan 2011 12:17:05 -0500, Chad Nelson wrote:
On Thu, 13 Jan 2011 06:35:53 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
[...]
Notes:
1. You can also always assume that strings under windows are UTF-8 and always convert them to wide string before system calls.
This is I think better approach, but it is different from what most of boost does. [...]
An interesting thought... I developed a set of ASCII/UTF-8/16/32 classes for my company not too long ago, and I became fairly familiar with the UTF-8 encoding scheme. There was only one issue that stopped me from treating all std::string types as UTF-8-encoded: what if the string *isn't* meant to be UTF-8 encoded, and contains characters with the high bit set?
There's nothing technically stopping that from happening, and there's no way to determine with complete certainty whether even a string that seems to be valid UTF-8 was intended that way, or whether the UTF-8-like characters are really meant as their high-ASCII values.
Maybe you know something I don't, that would allow me to change it? I hope so, it would simplify some of the code greatly.
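What can at least be checked mechanically is structural validity. A sketch of such a check (assumed helper name; note that it cannot resolve the ambiguity Chad describes - a byte sequence like "caf\xC3\xA9" is simultaneously valid UTF-8 and meaningful Latin-1, so a positive result proves nothing about intent):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Structural check only: 'true' means the bytes *could* be UTF-8, not that
// they were intended as UTF-8. Rejects stray/missing continuation bytes,
// overlong forms, surrogate code points and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = s[i];
        if (b < 0x80) { ++i; continue; }            // plain ASCII byte
        std::size_t len;
        std::uint32_t cp, min;
        if      (b < 0xC0) return false;            // stray continuation byte
        else if (b < 0xE0) { len = 2; cp = b & 0x1F; min = 0x80;    }
        else if (b < 0xF0) { len = 3; cp = b & 0x0F; min = 0x800;   }
        else if (b < 0xF8) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;                          // 0xF8..0xFF never appear
        if (i + len > s.size()) return false;       // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                 // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;            // beyond Unicode range
        i += len;
    }
    return true;
}
```

In practice most Latin-1 text fails this check (an isolated high byte is never valid UTF-8), so it is a useful heuristic - but only a heuristic.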
Most platforms have a notion of a 'default' encoding. On Linux, this is usually UTF-8 but isn't guaranteed to be. On Windows this is the active local codepage (i.e. *not* UTF-8) for char and UCS2 for wchar_t. The safest approach (and the one taken by the STL and Boost) is to assume the strings are in the OS's default encoding unless explicitly known to be otherwise. This means you can pass these strings around freely without worrying about their encoding because, eventually, they get passed to an OS call which knows how to handle them. Alternatively, if you need to manipulate the string, you can use the OS's character conversion functions to take your default-encoding string, convert it to something specific, manipulate the result and then convert it back. On Windows you would use MultiByteToWideChar/WideCharToMultiByte with the CP_ACP flag. HTH Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Most platforms have a notion of a 'default' encoding. On Linux, this is usually UTF-8 but isn't guaranteed to be. On Windows this is the active local codepage (i.e. *not* UTF-8) for char and UCS2 for wchar_t.
The safest approach (and the one taken by the STL and boost) is to assume the strings are in this OS's default encoding unless explicitly known to be otherwise.
Two problems with this approach:

- Even if the encoding under POSIX platforms is not UTF-8, you will still be able to open files, close them, stat them and do any other operations regardless of encoding, as the POSIX API is encoding-agnostic; this is why it works well.

- Under Windows, on the other hand, you CAN NOT do everything with narrow strings. For example, you can't create the file "שלום-سلام-pease-Мир.txt" using the char * API. And this has very bad consequences.
This means you can pass these strings around freely without worrying about their encoding because, eventually, they get passed to an OS call which knows how to handle them.
You can't under Windows... "ANSI" API is limited.
Alternatively, if you need to manipulate the string you can use the OS's character conversion functions to take your default-encoding string, convert it to something specific, manipulate the result and then convert it back. On Windows you would use MultibyteToWideChar/WideCharToMultibyte with the CP_ACP flag.
The CP_ACP flag can never be 65001 (UTF-8), so basically you are stuck with the same problem.
HTH
Alex
See my mail with a wider description. Artyom

On Fri, 14 Jan 2011 00:48:43 -0800 (PST), Artyom wrote:
Most platforms have a notion of a 'default' encoding. On Linux, this is usually UTF-8 but isn't guaranteed to be. On Windows this is the active local codepage (i.e. *not* UTF-8) for char and UCS2 for wchar_t.
The safest approach (and the one taken by the STL and boost) is to assume the strings are in this OS's default encoding unless explicitly known to be otherwise.
Two problems with this approach:
- Even if the encoding under POSIX platforms is not UTF-8, you will still be able to open files, close them, stat them and do any other operations regardless of encoding, as the POSIX API is encoding-agnostic; this is why it works well.
This isn't a problem, right? This is exactly why it _does_ work :D Assume the strings are in OS-default encoding, don't mess with them, hand them to the OS API which knows how to treat them.
- Under Windows, on the other hand you CAN NOT do everything with narrow strings. For example you can't create file "שלום-سلام-pease-Мир.txt" using char * API. And this has very bad consequences.
This is indeed true. I was just describing the situation where the string came from the result of one call and was being passed around. If you want to manipulate the strings, things become more tricky.
This means you can pass these strings around freely without worrying about their encoding because, eventually, they get passed to an OS call which knows how to handle them.
You can't under Windows... "ANSI" API is limited.
You've missed where I said "pass these strings around". I'm not suggesting you can change them. But you can take a narrow string returned by an OS call and pass it to another OS call without any problems.
Alternatively, if you need to manipulate the string, you can use the OS's character conversion functions to take your default-encoding string, convert it to something specific, manipulate the result and then convert it back. On Windows you would use MultiByteToWideChar/WideCharToMultiByte with the CP_ACP flag.
I omitted one important caveat here: if you manipulate the string once you've converted it to UTF-16, you may not be able to convert it back to the default encoding losslessly. For example, as in your string above, if you take the original string in Arabic, up-convert it and append a Russian word, you can't blindly convert this back, as the default encoding may not be able to represent these two character sets simultaneously. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Alexander Lamaison wrote:
On Fri, 14 Jan 2011 00:48:43 -0800 (PST), Artyom wrote: ... Two problems with this approach:
- Even if the encoding under POSIX platforms is not UTF-8, you will still be able to open files, close them, stat them and do any other operations regardless of encoding, as the POSIX API is encoding-agnostic; this is why it works well.
This isn't a problem, right? This is exactly why it _does_ work :D Assume the strings are in OS-default encoding, don't mess with them, hand them to the OS API which knows how to treat them.
It doesn't always work. On Mac OS X, the paths must be UTF-8; the OS isn't encoding-agnostic, because the HFS+ file system stores file names as UTF-16 (much like NTFS). You can achieve something similar on Linux by mounting a HFS+ or NTFS file system; the encoding is then specified at mount time and should also be observed. Of course, file systems that store file names as arbitrary null-terminated byte sequences are typically encoding-agnostic. For my own code, I've gradually reached the conclusion that I should always use UTF-8 encoded narrow paths. This may not be feasible for a library (yet) because people still insist on using other encodings on Unix-like OSes, usually koi8-r. :-) I'm anxiously awaiting the day everyone in the Linux/Unix world will finally switch to UTF-8 so we can be done with this question once and for all.

On Fri, 14 Jan 2011 16:09:02 +0200, Peter Dimov wrote:
Alexander Lamaison wrote:
On Fri, 14 Jan 2011 00:48:43 -0800 (PST), Artyom wrote: ... Two problems with this approach:
- Even if the encoding under POSIX platforms is not UTF-8, you will still be able to open files, close them, stat them and do any other operations regardless of encoding, as the POSIX API is encoding-agnostic; this is why it works well.
This isn't a problem, right? This is exactly why it _does_ work :D Assume the strings are in OS-default encoding, don't mess with them, hand them to the OS API which knows how to treat them.
It doesn't always work. On Mac OS X, the paths must be UTF-8; the OS isn't encoding-agnostic, because the HFS+ file system stores file names as UTF-16 (much like NTFS).
Presumably, Mac OS X returns paths in the same encoding as it expects to receive them? So just passing them around and eventually back to the OS will always work regardless of encoding? Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Alexander Lamaison wrote:
Presumably, Mac OS X returns paths in the same encoding as it expects to receive them? So just passing them around and eventually back to the OS will always work regardless of encoding?
Yes, it does (because it's UTF-8). It doesn't work on Windows in general - if the file name contains characters that can't be represented in the default code page, they are replaced by something else, typically '?', sometimes the character without the acute mark. Either way, the name can no longer be used to refer to the original file.

On Fri, 14 Jan 2011 17:04:19 +0200, Peter Dimov wrote:
Alexander Lamaison wrote:
Presumably, Mac OS X returns paths in the same encoding as it expects to receive them? So just passing them around and eventually back to the OS will always work regardless of encoding?
Yes, it does (because it's UTF-8). It doesn't work on Windows in general - if the file name contains characters that can't be represented in the default code page, they are replaced by something else, typically '?', sometimes the character without the acute mark. Either way, the name can no longer be used to refer to the original file.
Only if you modify the string! Windows can't give you a narrow string in the first place that it can't accept back. Even if you up-convert it to something like UTF-16 but don't modify it, you should always be able to down-convert back to the default codepage if you didn't modify the string.

Alexander Lamaison wrote:
Windows can't give you a narrow string in the first place that it can't accept back.
It can, and it does. You can have a file whose name can't be represented as a narrow string in the current code page. If you use an "ANSI" function to get its name, you can only receive an approximation of its real name. If you use the "wide" function, you get its real name.

On Fri, 14 Jan 2011 17:52:06 +0200, Peter Dimov wrote:
Alexander Lamaison wrote:
Windows can't give you a narrow string in the first place that it can't accept back.
It can, and it does. You can have a file whose name can't be represented as a narrow string in the current code page. If you use an "ANSI" function to get its name, you can only receive an approximation of its real name. If you use the "wide" function, you get its real name.
Ok, I see what you mean. The way I was looking at it, the string Windows gave you was wrong, but correctly encoded :P

I'm anxiously awaiting the day everyone in the Linux/Unix world will finally switch to UTF-8 so we can be done with this question once and for all.
Actually it has already happened:

1. All modern Linux distributions come with UTF-8 locales by default
2. FreeBSD uses UTF-8 locales by default
3. OpenSolaris uses UTF-8 locales by default
4. Mac OS X uses UTF-8 locales by default

Of course users can define other locales, but this is another story. Artyom

On Thu, 13 Jan 2011 15:35:53 +0100, Artyom <artyomtnk@yahoo.com> wrote: Hi Artyom,
[...] I've noticed that you planned asynchronous notification and so on, but I think it is quite important to add a feature that provides the ability to wait for multiple processes to terminate, with a timeout.
thanks for your micro review! I'll comment on your notes on the weekend. But which version of Boost.Process did you use? I wonder as meanwhile we have support for asynchronous operations in Boost.Process. You can download the latest version from http://www.highscore.de/boost/gsoc2010/process.zip and find the documentation at http://www.highscore.de/boost/gsoc2010/. Boris

Thanks for posting the links, Boris. By googling it, the latest I found was a 04/2009 version. Guess I can use this new one, then. Cheers, Greg On Thu, Jan 13, 2011 at 3:24 PM, Boris Schaeling <boris@highscore.de> wrote:
On Thu, 13 Jan 2011 15:35:53 +0100, Artyom <artyomtnk@yahoo.com> wrote:
Hi Artyom,
[...]I've noticed that you planned asynchronous notification and so on
but I think it is quite important to add a feature that provides the ability to wait for multiple processes to terminate, with a timeout.
thanks for your micro review! I'll comment on your notes on the weekend. But which version of Boost.Process did you use? I wonder as meanwhile we have support for asynchronous operations in Boost.Process. You can download the latest version from http://www.highscore.de/boost/gsoc2010/process.zip and find the documentation at http://www.highscore.de/boost/gsoc2010/.
Boris

From: Boris Schaeling <boris@highscore.de> On Thu, 13 Jan 2011 15:35:53 +0100, Artyom <artyomtnk@yahoo.com> wrote:
[...] I've noticed that you planned asynchronous notification and so on, but I think it is quite important to add a feature that provides the ability to wait for multiple processes to terminate, with a timeout.
thanks for your micro review! I'll comment on your notes on the weekend. But which version of Boost.Process did you use? I wonder as meanwhile we have support for asynchronous operations in Boost.Process. You can download the latest version from http://www.highscore.de/boost/gsoc2010/process.zip and find the documentation at http://www.highscore.de/boost/gsoc2010/.
Boris
I used the version from there: http://www.boost.org/community/review_schedule.html Artyom

On Thu, 13 Jan 2011 15:35:53 +0100, Artyom <artyomtnk@yahoo.com> wrote:
[...]- stream buffer implementation:
Thanks, I'll update the code! If you have any ideas what a test case to verify the fix could look like, let me know (given that there is no mocking framework in Boost yet).
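For reference, the retry behaviour the micro review asks for in sync() might look like this on POSIX (a sketch under an assumed name, not the actual Boost.Process patch):

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Keep writing until the whole buffer is out: restart after EINTR instead of
// reporting failure, and advance past partial writes. Returns true on
// success, false on a real error (errno is left set by write()).
bool write_all(int fd, const char* buf, std::size_t n) {
    while (n > 0) {
        ssize_t r = ::write(fd, buf, n);
        if (r < 0) {
            if (errno == EINTR) continue;   // interrupted, not an error: retry
            return false;                   // genuine I/O error
        }
        buf += r;                           // partial write: advance and loop
        n   -= static_cast<std::size_t>(r);
    }
    return true;
}
```

A stream buffer's sync() built on such a loop satisfies the std::flush expectation: either the whole buffer reaches the device, or a real error is reported.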
- Windows and Unicode. [...] 1. You can also always assume that strings under windows are UTF-8 and always convert them to wide string before system calls. This is I think better approach, but it is different from what most of boost does.
2. I do not recommend adding a wide API - it makes the code much uglier; rather, convert normal strings to wide strings before the system call.
I'd appreciate a Boost-wide solution or guideline. And I think this thread has already turned into a Unicode discussion? :) The interface of Boost.Process in that aspect has definitely evolved without clear direction - I think neither I nor other Boost.Process developers spent time trying to solve this problem in this library.
- It may be very good addition to implement full support of putback.
If you have a patch just drop me a mail! :)
[...]P.S.: Good luck with the review library looks overall very nice.
Thanks! I have some more patches waiting to be applied. There will be definitely another update (of the implementation only) before the review starts. Boris
participants (21)

- Alexander Churanov
- Alexander Lamaison
- Artyom
- Boris Schaeling
- Chad Nelson
- Christian Holmquist
- Dave Abrahams
- Emil Dotchevski
- Gregory Dai
- Jeff Flinn
- Jens Finkhäuser
- John B. Turpish
- Matus Chochlik
- Mostafa
- Nelson, Erik - 2
- Patrick Horgan
- Peter Dimov
- Robert Kawulak
- Robert Ramey
- Stephan T. Lavavej
- Stewart, Robert