[general] What will string handling in C++ look like in the future [was Always treat ... ]

The string-encoding-related discussion boils down for me to the following: what will string handling in C++ look like in the (maybe not immediate) future?

*Scenario A:* We will pick a widely-accepted char-based encoding that is able to handle all the writing scripts and alphabets that we can think of, has enough reserved space for future additions or is easily extensible, and use that with std::strings, which will become the one and only text string 'container' class. All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).

*Scenario B:* We will add yet another string class named utf8_t to the already crowded set named above. Then:

library a: will stick to the ANSI encodings with std::strings. It has worked in the past, it will work in the future, right?
library b[oost]: will use utf8_t instead and provide the (seamless and straightforward) conversions between utf8_t and std::string and std::wstring. Some (many but not all) others will follow.
library c: will use std::strings with UTF-8 ...
library [.]n[et]: will use the String class ...
library q[t]: will use Qstrings ...
library w[xWidgets]: will use wxStrings and wxChar* ...
library wi[napi]: will use TCHAR* ...
library z: will use const char* in an encoding-agnostic way.

Now an application using libraries [a..z] will become the developer's nightmare. What string should he use for the class members and constructor parameters? What to do when the conversions do not work so seamlessly? Also, half of the CPU time assigned to running that application will be wasted on useless string transcoding, and half of the memory will be occupied with useless transcoding-related code and data.

*Scenario C:* This is basically the status quo; a mix of the above. A sad and unsatisfactory state of things.

*Consequences of A:*
- Interface-breaking changes, which will require some fixing in the library client code and some work in the libraries themselves. These should be made as painless as possible with *temporary* utilities or convenience classes that would, for example, handle the transcoding from UTF-8 to UCS-2/UTF-16 in WINAPI and be no-ops on most POSIX systems.
- Silent introduction of bugs for those who still use std::string for ANSI CP####. This is worse than the above and will require some public-relations work on the part of Boost to make it clear that using std::strings with ANSI may be an error since Boost version x.y.z.
- We should finally accept the notion that one byte, word, or dword != one character, that there are code points and there are characters, and that both of them can have variable-length encodings, and devise tools to handle them as such conveniently.
- Once we overcome the troubled period of transition, everything will be just great. No headaches related to file encoding detection and transcoding. Think about what will happen after we accept IPv6 and drop IPv4. The process will be painful, but after it is done there will be no more NAT and co., and the whole network infrastructure will be simplified.

*Consequences of B:*
- No fixing of existing interfaces, which IMO means no or very slow movement toward a single encoding.
- Creating another string class, which, let us face it, not everybody will accept even with the Boost influence unless it becomes standard.
- We will abandon std::string and be stuck with utf8_t, which I *personally* already dislike :)
- People will probably start to use other programming languages (although this may be FUD).

*Consequences of C:* Here pick all the negatives of the above :)

*Note on the encoding to be used*
The best candidate for the widely-accepted and extensible encoding vaguely mentioned above is IMO UTF-8.
- It has been given a lot of thought.
- It is an already widely accepted standard.
- It is char-based, so there is no need to switch to std::basic_string<whatever_char_t>.
- It is extensible, so once we have done the painful transition we will not have to do it again. Currently UTF-8 uses 1-4 (or 1-6) byte sequences to encode code points, but the scheme is transparently extensible to 1-N bytes (unlike UCS-X, and I'm not sure about UTF-16/32).

So, [dark-sarcasm] even if we dig out the stargate or join the United Federation of Planets and captain Kirk, every time he returns home, brings a truckload of new writing scripts to support, UTF-8 will be able to handle it. Just my 0.02 strips of gold-pressed latinum :) [/dark-sarcasm]

Best regards, Matus
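To illustrate the bytes-vs-code-points bullet above: counting code points in well-formed UTF-8 means counting the bytes that are not continuation bytes, which is not the same as std::string::size(). A minimal sketch, assuming the input is already valid UTF-8:

#include <cstddef>
#include <string>

// Count code points in well-formed UTF-8 by skipping
// continuation bytes (those of the form 10xxxxxx).
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++count;
    return count;
}

// "a\xC3\xA1" is 'a' plus U+00E1: size() == 3 bytes,
// while utf8_code_points(...) == 2 code points.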

On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
The string-encoding-related discussion boils down for me to the following: what will string handling in C++ look like in the (maybe not immediate) future?
*Scenario A:*
[..]
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
This is simply not going to happen. How could MS even go about doing this in Windows? It would make every single piece of Windows software incompatible with the next version! Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Wed, Jan 19, 2011 at 1:16 PM, Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
The string-encoding-related discussion boils down for me to the following: what will string handling in C++ look like in the (maybe not immediate) future?
*Scenario A:*
[..]
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
This is simply not going to happen. How could MS even go about doing this in Windows? It would make every single piece of Windows software incompatible with the next version!
This is where the convenience classes would be used. For Windows it may take a while to get rid of them, and maybe a converter from std::string to WINAPI string will have to exist for a long time. IMO even Microsoft is finally realizing that the dual interface is crap and they will have to do something about it. Many new additions to WINAPI already use only WCHAR* and do not provide the ANSI version. What this means for Boost is that we would be using std::string with UTF-8 and, when using WINAPI as backend (in Filesystem, Interprocess, etc.), we should, as Artyom already suggested, use only the "wide-char interface". Matus
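As a minimal sketch of that wide-char backend idea, assuming the std::string really holds UTF-8 (error handling kept deliberately simple):

#include <stdexcept>
#include <string>
#include <windows.h>

// Convert a UTF-8 std::string to UTF-16 so it can be passed to the
// wide-char ("W") WINAPI functions.
std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8.data(), static_cast<int>(utf8.size()), 0, 0);
    if (n == 0) throw std::runtime_error("invalid UTF-8");
    std::wstring wide(n, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()), &wide[0], n);
    return wide;
}

// Usage: CreateFileW(utf8_to_wide(path).c_str(), ...);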

On Wed, 19 Jan 2011 12:16:59 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote: [..]
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
This is simply not going to happen. How could MS even go about doing this in Windows? It would make every single piece of Windows software incompatible with the next version!
That has never stopped them before -- see Windows 2.0 -> 3.0, Windows 3.x -> Windows 95 (only partial compatibility), various versions of WinCE/Windows Mobile/whatever-marketingspeak-name-they're-using-this-year... ;-) But you're right, they'll probably stick with UTF-16, despite its problems. -- Chad Nelson Oak Circle Software, Inc. * * *

On Wed, 19 Jan 2011 09:00:17 -0500, Chad Nelson wrote:
On Wed, 19 Jan 2011 12:16:59 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote: [..]
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
This is simply not going to happen. How could MS even go about doing this in Windows? It would make every single piece of Windows software incompatible with the next version!
That has never stopped them before -- see Windows 2.0 -> 3.0, Windows 3.x -> Windows 95 (only partial compatibility), various versions of WinCE/Windows Mobile/whatever-marketingspeak-name-they're-using-this-year... ;-)
I'm not convinced you're right about this. You only have to read The Old New Thing to see some of the remarkable (insane?) things MS do to retain backward compatibility. I believe only the 64-bit versions of Windows Vista/7 ditch 16-bit program compatibility - so you should be able to crack out those Windows 3 programs on Windows 7 x86 and watch them run! :D Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Wed, 19 Jan 2011 15:22:34 +0000 Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
This is simply not going to happen. How could MS even go about doing this in Windows? It would make every single piece of Windows software incompatible with the next version!
That has never stopped them before -- see Windows 2.0 -> 3.0, Windows 3.x -> Windows 95 (only partial compatibility), various versions of WinCE/Windows Mobile/whatever-marketingspeak-name-they're-using-this-year... ;-)
I'm not convinced you're right about this. You only have to read The Old New Thing to see some of the remarkable (insane?) things MS do to retain backward compatibility. I believe only the 64-bit versions of Windows Vista/7 ditch 16-bit program compatibility - so you should be able to crack out those Windows 3 programs on Windows 7 x86 and watch them run! :D
Yes, that answer was meant tongue-in-cheek. Microsoft got the backward-compatibility religion (for the desktop, at least) around the time they introduced Windows 95, because the only way to convince people to buy it was to show them that their old programs would continue to run. A few years ago they seemed to start drifting away from that again, but they seem to have rediscovered the need for it. -- Chad Nelson Oak Circle Software, Inc. * * *

On Jan 19, 2011, at 7:16 AM, Alexander Lamaison wrote:
On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
The string-encoding-related discussion boils down for me to the following: what will string handling in C++ look like in the (maybe not immediate) future?
*Scenario A:*
[..]
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
This is simply not going to happen. How could MS even go about doing this in Windows? It would make every single piece of Windows software incompatible with the next version!
There is a straightforward way for Microsoft to migrate Windows to this future: If they add UTF-8 support to their narrow character interface (I am avoiding calling it ANSI due to the negative connotations that has) and add narrow character APIs for all wide character APIs that lack a narrow counterpart, then I believe we could treat POSIX and Windows identically from an encoding point of view. Then Microsoft would be free to deprecate their wide character interface over an extended period of time, if they so chose. Ian Emmons

On Wed, 19 Jan 2011 09:06:52 -0500, Ian Emmons wrote:
On Jan 19, 2011, at 7:16 AM, Alexander Lamaison wrote:
On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
The string-encoding-related discussion boils down for me to the following: what will string handling in C++ look like in the (maybe not immediate) future?
*Scenario A:*
[..]
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
This is simply not going to happen. How could MS even go about doing this in Windows? It would make every single piece of Windows software incompatible with the next version!
There is a straightforward way for Microsoft to migrate Windows to this future: If they add UTF-8 support to their narrow character interface (I am avoiding calling it ANSI due to the negative connotations that has) and add narrow character APIs for all wide character APIs that lack a narrow counterpart, then I believe we could treat POSIX and Windows identically from an encoding point of view.
It would break any programs currently using the narrow API with any 'exotic' codepage (i.e. pretty much anything except 7-bit ASCII). That said, perhaps it's worth it. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Alexander Lamaison wrote:
There is a straightforward way for Microsoft to migrate Windows to this future: If they add UTF-8 support to their narrow character interface (I am avoiding calling it ANSI due to the negative connotations that has) and add narrow character APIs for all wide character APIs that lack a narrow counterpart, then I believe we could treat POSIX and Windows identically from an encoding point of view.
It would break any programs currently using the narrow API with any 'exotic' codepage (i.e. pretty much anything except 7-bit ASCII).
It will only break programs that depend on a specific code page. Programs that use the narrow API but do not require a specific code page (or a single byte code page - the exact opposite of exotic) will work fine - they'll simply see an ANSI code page of 65001. It will still cause a fair amount of breakage, of course, but in principle, the transition path is obvious and straightforward.

On Jan 19, 2011, at 11:30 AM, Peter Dimov wrote:
Alexander Lamaison wrote:
There is a straightforward way for Microsoft to migrate Windows to this future: If they add UTF-8 support to their narrow character interface (I am avoiding calling it ANSI due to the negative connotations that has) and add narrow character APIs for all wide character APIs that lack a narrow counterpart, then I believe we could treat POSIX and Windows identically from an encoding point of view.
It would break any programs currently using the narrow API with any 'exotic' codepage (i.e. pretty much anything except 7-bit ASCII).
It will only break programs that depend on a specific code page. Programs that use the narrow API but do not require a specific code page (or a single byte code page - the exact opposite of exotic) will work fine - they'll simply see an ANSI code page of 65001. It will still cause a fair amount of breakage, of course, but in principle, the transition path is obvious and straightforward.
What I intended here (but forgot to say explicitly -- sorry) was that Microsoft could allow a process (or thread) to set its local character set to UTF-8. Then all existing code that pays attention to the narrow representation would find that it is UTF-8 and deal correctly with it. Naturally, this migration would take time -- but Microsoft has done that before. They successfully transitioned a large developer base off 16-bit Windows and onto 32-bit Windows (and, incidentally, introduced the wide character API at the same time).

On 20.01.2011 23:58, Ian Emmons wrote:
What I intended here (but forgot to say explicitly -- sorry) was that Microsoft could allow a process (or thread) to set its local character set to UTF-8. Then all existing code that pays attention to the narrow representation would find that it is UTF-8 and deal correctly with it.
Naturally, this migration would take time -- but Microsoft has done that before. They successfully transitioned a large developer base off 16-bit Windows and onto 32-bit Windows (and, incidentally, introduced the wide character API at the same time).
Won't happen. Follow the links here.
http://blogs.msdn.com/b/michkap/archive/2006/10/11/816996.aspx

On Wed, 19 Jan 2011 09:06:52 -0500 Ian Emmons <iemmons@bbn.com> wrote:
This is simply not going to happen. How could MS even go about doing this in Windows? It would make every single piece of Windows software incompatible with the next version!
There is a straightforward way for Microsoft to migrate Windows to this future: If they add UTF-8 support to their narrow character interface [...] then I believe we could treat POSIX and Windows identically from an encoding point of view. Then Microsoft would be free to deprecate their wide character interface over an extended period of time, if they so chose.
And if the developers at Microsoft controlled the company, this would probably already be underway, if not completed. :-) But Microsoft is controlled by management, which answers to investors, who want them to squeeze as much money out of their customers as they can. Interoperability that lets programmers take a Windows program and easily port it to some other OS is a very *bad* thing, from their point of view -- anything that locks people into using Windows is far preferable. Market forces might coerce them into allowing it someday, as they have coerced them into making Internet Explorer more standards-compliant, but they'll fight it tooth and nail. -- Chad Nelson Oak Circle Software, Inc. * * *

On Wed, 19 Jan 2011 11:33:02 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
The string-encoding-related discussion boils down for me to the following: what will string handling in C++ look like in the (maybe not immediate) future?
*Scenario A:*
We will pick a widely-accepted char-based encoding [...] and use that with std::strings which will become the one and only text string 'container' class.
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
Sounds like a little slice of heaven to me. Though you'll still have the pesky problem of having to verify that the UTF-8 code is valid all the time. More on that below.
*Scenario B:*
We will add yet another string class named utf8_t to the already crowded set named above. [...] Now an application using libraries [a..z] will become the developer's nightmare. What string should he use for the class members and constructor parameters? What to do when the conversions do not work so seamlessly?
How is that different from what we've got today, except that the utf*_t classes will make converting to and from different string types, and validating the UTF code, a little easier and more automatic?
Also half of the cpu time assigned to running that application will be wasted on useless string transcoding. And half of the memory will be occupied with useless transcoding-related code and data.
I think that's a bit of an exaggeration. :-) As more libraries move to the assumption that std::string == UTF-8, the need (and code) for transcoding will silently vanish. Eventually, utf8_t will just be a statement by the programmer that the data contained within is guaranteed to be valid UTF-8, enforced by the class -- something that would require at minimum an extra call if using std::string, one that could be forgotten and open up the program to exploits.
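As a minimal sketch of that statement-by-construction idea (utf8_t here is hypothetical, and is_valid_utf8 stands for an assumed byte-level validator):

#include <stdexcept>
#include <string>

// Assumed to exist: a well-formedness check per RFC 3629.
bool is_valid_utf8(const std::string& s);

// Hypothetical utf8_t: a thin wrapper whose invariant is
// "the bytes inside are always well-formed UTF-8".
class utf8_t
{
public:
    explicit utf8_t(const std::string& bytes) : data_(bytes)
    {
        if (!is_valid_utf8(data_))
            throw std::invalid_argument("utf8_t: ill-formed UTF-8");
    }

    // Read-only access; any mutating operation would have to
    // re-validate to preserve the invariant.
    const std::string& bytes() const { return data_; }

private:
    std::string data_;
};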
*Scenario C:*
This is basically the status quo; a mix of the above. A sad and unsatisfactory state of things.
Agreed.
*Consequences of A:*
[...] - Once we overcome the troubled period of transition everything will be just great. No headaches related to file encoding detection and transcoding.
It's the getting-there part that I'm concerned about.
Think about what will happen after we accept IPv6 and drop IPv4. The process will be painful, but after it is done there will be no more NAT and co., and the whole network infrastructure will be simplified.
That's a problem I've been watching carefully for many years now, and I don't see that happening. ISPs will switch to IPv6 (because they have to), and make it possible for their customers to stay on IPv4, so their customers *will* stay on IPv4 because it's cheaper. And if they stay with IPv4, there won't be any impetus for consumer electronics companies to make their equipment IPv6-compatible because consumers won't care about it. Without consumer demand, it won't get done for years, maybe a decade or more. That's what I see happening with std::string and UTF-8 as well.
*Consequences of B:*
- No fixing of existing interfaces, which IMO means no or very slow movement toward a single encoding.
Which, as stated above, I believe will happen anyway.
- Creating another string class, which, let us face it, not everybody will accept even with the Boost influence unless it becomes standard.
That's the beauty of it -- not everybody *needs* to accept it. Just the people who write code that isn't encoding-agnostic. Boost.FileSystem might provide a utf16_t overload for Windows, for instance, so that it can automatically convert strings in other UTF types. But I see no reason it would lose the existing interface.
- We will abandon std::string and be stuck with utf8_t which I *personally* already dislike :)
Any technical reason why, other than what you've already written?
- People will probably start to use other programming languages (although this may be FUD).
I hate to point this out, but people are *already* using other programming languages. :-) C++ isn't new or sexy, and has some pain-points (though many of the most egregious ones will be solved with C++0x). Unicode handling is one of them, and in my opinion, the utf*_t types will only ease that.
*Note on the encoding to be used*
The best candidate for the widely-accepted and extensible encoding vaguely mentioned above is IMO UTF-8. [...]
Apparently a growing number of people agree, as do I.
- It is extensible, so once we have done the painful transition we will not have to do it again. Currently UTF-8 uses 1-4 (or 1-6) byte sequences to encode code points, but the scheme is transparently extensible to 1-N bytes (unlike UCS-X, and I'm not sure about UTF-16/32). [...]
UTF-16 can't be extended any further than its current definition, not without a major reinterpretation. UTF-32 (and UTF-8) could go up to 0xFFFFFFFF codepoints, but the standards bodies involved have agreed that they'll never be extended past the current UTF-16 limitations. Of course, that's subject to change if circumstances change, though nobody foresees such a change right now. -- Chad Nelson Oak Circle Software, Inc. * * *
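To make that UTF-16 ceiling concrete, a minimal sketch of surrogate-pair decoding; the arithmetic shows why nothing above U+10FFFF is reachable:

#include <stdexcept>

// Decode one UTF-16 surrogate pair into a code point. High surrogates
// are 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF; each contributes
// 10 bits, so the maximum reachable code point is
// 0x10000 + 0xFFFFF = 0x10FFFF. That is the hard UTF-16 limit.
unsigned long decode_surrogate_pair(unsigned short hi, unsigned short lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        throw std::invalid_argument("not a surrogate pair");
    return 0x10000UL
        + ((static_cast<unsigned long>(hi) - 0xD800UL) << 10)
        + (static_cast<unsigned long>(lo) - 0xDC00UL);
}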

On Wed, Jan 19, 2011 at 2:39 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Wed, 19 Jan 2011 11:33:02 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
*Scenario A:*
Sounds like a little slice of heaven to me. Though you'll still have the pesky problem of having to verify that the UTF-8 code is valid all the time. More on that below.
I am a believer ;) and when people realize that UTF-8 is the way to go, the pesky problems will vanish. Believe me, today with ANSI I have to check/detect the encoding of input files created by users on different Windows machines and do the conversions. And checking if data is valid UTF-8 is IMO an easier task. Most people here use Windows-1252, which is not so different from ASCII, so even if something gets garbled it can be rescued. I can't imagine what it is like in countries that have to deal with Semitic languages, Chinese/Japanese/Korean ideograms, etc.
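For illustration, that validity check really is a short, self-contained routine; a minimal sketch of a strict validator along the lines of RFC 3629, offered as a sketch rather than production code:

#include <cstddef>
#include <string>

// Validate UTF-8 per RFC 3629: 1-4 byte sequences, no overlong
// forms, no UTF-16 surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0, n = s.size();
    while (i < n)
    {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1FU; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0FU; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07U; }
        else return false;             // stray continuation/invalid lead
        if (i + len > n) return false; // truncated sequence
        for (std::size_t j = 1; j < len; ++j)
        {
            unsigned char c = static_cast<unsigned char>(s[i + j]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation
            cp = (cp << 6) | (c & 0x3FU);
        }
        if ((len == 2 && cp < 0x80) ||        // overlong
            (len == 3 && cp < 0x800) ||       // overlong
            (len == 4 && cp < 0x10000) ||     // overlong
            (cp >= 0xD800 && cp <= 0xDFFF) || // surrogate
            cp > 0x10FFFF)                    // out of range
            return false;
        i += len;
    }
    return true;
}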
*Scenario B:*
How is that different from what we've got today, except that the utf*_t classes will make converting to and from different string types, and validating the UTF code, a little easier and more automatic?
Exactly, and I think that we agree that the current status is far from ideal. The automatic conversions would (probably) be OK but introducing yet another string class is not.
Also half of the cpu time assigned to running that application will be wasted on useless string transcoding. And half of the memory will be occupied with useless transcoding-related code and data.
I think that's a bit of an exaggeration. :-) As more libraries move to
Yes, sorry I could not resist :)
the assumption that std::string == UTF-8, the need (and code) for transcoding will silently vanish. Eventually, utf8_t will just be a statement by the programmer that the data contained within is guaranteed to be valid UTF-8, enforced by the class -- something that would require at minimum an extra call if using std::string, one that could be forgotten and open up the program to exploits.
Yes, but why not enforce it "organizationally", with the power and influence Boost has? Again, I know that it would break a lot of stuff, but really: are all those people that now use std::string ready to change all their code to use utf8_t instead? Which will involve more work? I'm convinced that it will be the latter, but I can be wrong. And many people already *do* use std::string for UTF-8 and are doing the "right" (sorry :)) thing; by introducing utf8_t we are "punishing" them, because we want them, for the sake of people who still dwell on ANSI, to change their code. IMO we should do the opposite.
[...] - Once we overcome the troubled period of transition everything will be just great. No headaches related to file encoding detection and transcoding.
It's the getting-there part that I'm concerned about.
Me too, but again, many other people have already pointed out that a large portion of the code is completely encoding-agnostic, so there would be no impact if we stayed with std::string. There would be if we added utf8_t.
Think about what will happen after we accept IPv6 and drop IPv4. The process will be painful, but after it is done there will be no more NAT and co., and the whole network infrastructure will be simplified.
That's a problem I've been watching carefully for many years now, and I don't see that happening. ISPs will switch to IPv6 (because they have to), and make it possible for their customers to stay on IPv4, so their customers *will* stay on IPv4 because it's cheaper. And if they stay with IPv4, there won't be any impetus for consumer electronics companies to make their equipment IPv6-compatible because consumers won't care about it. Without consumer demand, it won't get done for years, maybe a decade or more.
That's what I see happening with std::string and UTF-8 as well.
Yes, people (me included) are resistant to big changes, even for the better. But I've learned that I should always consider the long-term impact.
- Creating another string class, which, let us face it, not everybody will accept even with the Boost influence unless it becomes standard.
That's the beauty of it -- not everybody *needs* to accept it. Just the people who write code that isn't encoding-agnostic. Boost.FileSystem might provide a utf16_t overload for Windows, for instance, so that it can automatically convert strings in other UTF types. But I see no reason it would lose the existing interface.
So you suggest that in the STL there would be, for example, besides the existing fstream and wfstream, also a third "ufstream". I think that we actually should be reducing the interface, not expanding it (yes, I hear it ... "breaking changes!" :)).
- We will abandon std::string and be stuck with utf8_t which I *personally* already dislike :)
Any technical reason why, other than what you've already written?
Besides the ugly name and the fact that it is a new class? No :)
I hate to point this out, but people are *already* using other programming languages. :-) C++ isn't new or sexy, and has some pain-points (though many of the most egregious ones will be solved with C++0x). Unicode handling is one of them, and in my opinion, the utf*_t types will only ease that.
And the solution is long overdue. Creating utf8_t just puts the problem off; it does not really solve it.

On 1/19/2011 9:08 AM, Matus Chochlik wrote:
On Wed, Jan 19, 2011 at 2:39 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Wed, 19 Jan 2011 11:33:02 +0100 Matus Chochlik<chochlik@gmail.com> wrote:
*Scenario A:*
Sounds like a little slice of heaven to me. Though you'll still have the pesky problem of having to verify that the UTF-8 code is valid all the time. More on that below.
I am a believer ;) and when people realize that UTF-8 is the way to go, the pesky problems will vanish. Believe me, today with ANSI
I do not believe that UTF-8 is the way to go. In fact I know it is not, except perhaps for the very near future for some programmers (Linux advocates). Inevitably a Unicode standard will be adopted where every character of every language will be represented by a single fixed-length number of bits. Nobody will care any longer that this fixed-length set of bits "wastes space", as so many people today are hysterically fixated on. Whether UTF-32 can do this now I do not know, but this world where a character in some language on earth is represented by some arcane multi-byte encoding will end. If UTF-32 cannot do it, then UTF-nn inevitably will. I do not think that shoving UTF-8 down everybody's throats is the best solution even now; I think a good set of classes to convert between encoding standards is much better.

I do not believe that UTF-8 is the way to go. In fact I know it is not, except perhaps for the very near future for some programmers ( Linux advocates ).
:-) Just for the record, I'm not a Linux advocate any more than I'm a Windows advocate. I use both ... I'm writing this on a Windows machine. What I would like is the whole encoding madness/dysfunction (including but not limited to the dual TCHAR/whateverchar-based interfaces) to stop. Everywhere.
Inevitably a Unicode standard will be adapted where every character of every language will be represented by a single fixed length number of bits. Nobody will care any longer that this fixed length set of bits "wastes space", as so many people today hysterically are fixated on. Whether or not UTF-32 can do this now or not I do not know but this world where a character in some language on earth is represented by some arcane multi-byte encoding will end. If UTF-32 can not do it then UTF-nn inevitably will.
And then the HUGE codebase written in C/C++ that currently uses char will be reimplemented using some utfNN_char_t. Sorry, but I don't see that happening. Best, Matus

On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:
I do not believe that UTF-8 is the way to go. In fact I know it is not, except perhaps for the very near future for some programmers ( Linux advocates ).
:-) Just for the record, I'm not a Linux advocate any more than I'm a Windows advocate. I use both ... I'm writing this on a Windows machine. What I would like is the whole encoding madness/dysfunction (including but not limited to the dual TCHAR/whateverchar-based interfaces) to stop. Everywhere.
Even if I bought the UTF-8ed-Boost idea, what would we do about the STL implementation on Windows which expects local-codepage narrow strings? Are we hoping MS etc. change these to match? Because otherwise we'll be converting between narrow encodings for the rest of eternity. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Wed, Jan 19, 2011 at 4:34 PM, Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:
Even if I bought the UTF-8ed-Boost idea, what would we do about the STL implementation on Windows which expects local-codepage narrow strings? Are we hoping MS etc. change these to match? Because otherwise we'll be converting between narrow encodings for the rest of eternity.
Actually this is the biggest problem I see with the whole transition, and it also concerns other systems. But AFAIK POSIX OSes are moving to UTF-8, so Windows is the only one where this is a real issue. But is it possible that Windows does the same thing that POSIX did? Some time ago on Unix the sk_SK locale came with the ISO-8859-2 encoding. Since then the default has become sk_SK.UTF-8 with UTF-8 encoding. Is there any major obstacle that would prevent Microsoft from doing this?
Alex
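A minimal sketch, assuming a POSIX system, of how a program can ask which narrow encoding the current locale selects - on a machine configured with an sk_SK.UTF-8 style locale this prints UTF-8:

#include <clocale>
#include <cstdio>
#include <langinfo.h> // POSIX

int main()
{
    // Adopt the user's environment locale instead of the default "C".
    std::setlocale(LC_ALL, "");

    // CODESET names the narrow encoding of that locale, e.g. "UTF-8"
    // for sk_SK.UTF-8 or "ISO-8859-2" for the older setup.
    std::printf("narrow encoding: %s\n", nl_langinfo(CODESET));
    return 0;
}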

On Wed, 19 Jan 2011 16:44:29 +0100, Matus Chochlik wrote:
On Wed, Jan 19, 2011 at 4:34 PM, Alexander Lamaison <awl03@doc.ic.ac.uk> wrote: On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:
Even if I bought the UTF-8ed-Boost idea, what would we do about the STL implementation on Windows which expects local-codepage narrow strings? Are we hoping MS etc. change these to match? Because otherwise we'll be converting between narrow encodings for the rest of eternity.
Actually this is the biggest problem I see with the whole transition, and it also concerns other systems. But AFAIK POSIX OSes are moving to UTF-8, so Windows is the only one where this is a real issue.
'Only'? :P While it may be one OS, it probably has more code written for it than all the rest combined.
But is it possible that Windows does the same thing that POSIX did? Some time ago on Unix the sk_SK locale came with the ISO-8859-2 encoding. Since then the default has become sk_SK.UTF-8 with UTF-8 encoding. Is there any major obstacle that would prevent Microsoft from doing this?
I know some Microsoft guys hang out here and may be listening to this (Stephan L, are you about?). Do you guys have any input on this UTF-8 issue? Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

From: Alexander Lamaison <awl03@doc.ic.ac.uk>
On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:
I do not believe that UTF-8 is the way to go. In fact I know it is not, except perhaps for the very near future for some programmers ( Linux advocates ).
:-) Just for the record, I'm not a Linux advocate any more than I'm a Windows advocate. I use both ... I'm writing this on a Windows machine. What I would like is the whole encoding madness/dysfunction (including but not limited to the dual TCHAR/whateverchar-based interfaces) to stop. Everywhere.
Even if I bought the UTF-8ed-Boost idea, what would we do about the STL implementation on Windows which expects local-codepage narrow strings? Are we hoping MS etc. change these to match? Because otherwise we'll be converting between narrow encodings for the rest of eternity.
Alex
First of all, today there **is** a problem and STL code can't open files. Try to open "שלום-سلام-pease.txt" under Windows using GCC's std::fstream... You can't. I assume it happens with some other compilers as well. There **is** a problem; ignoring it will not help us.

How can we address the STL problem and UTF-8? Simply? Provide:

boost::basic_fstream
boost::fopen
boost::freopen
boost::remove
boost::rename

which use the same std::* classes on POSIX platforms and UTF-8-aware implementations for Windows. Take a look at this code: http://art-blog.no-ip.info/files/nowide.zip This is the code I use for my projects that implements what I'm talking about - simple, easy to use, straightforward.

Also don't forget two things:

1. Microsoft deprecated the ANSI API and does not recommend using it. If the only OS that gives us most of the encoding headaches has deprecated its ANSI API, I don't think that Boost should continue supporting it.

2. All the world has already moved to Unicode, and Microsoft did this as well. They did it in their incompatible-with-rest-of-the-world way... But still they did it too - so we can either continue ignoring the fact that UTF-8 is the ultimate encoding, or go forward.

Artyom

Artyom wrote:
How can we address the STL problem and UTF-8? Simply?
Provide:
boost::basic_fstream
boost::fopen
boost::freopen
boost::remove
boost::rename
which use the same std::* classes on POSIX platforms and UTF-8-aware implementations for Windows.
This is basically what I do as well: wrappers that on Windows translate UTF-8 into UTF-16 and call the corresponding _w* function.
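A minimal sketch of that wrapper pattern, assuming the narrow argument is UTF-8; utf8_to_wide stands for the kind of MultiByteToWideChar(CP_UTF8, ...) helper sketched earlier in the thread:

#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Assumed helper: UTF-8 -> UTF-16 (e.g. via MultiByteToWideChar).
std::wstring utf8_to_wide(const std::string& utf8);

std::FILE* fopen_utf8(const char* name, const char* mode)
{
    // On Windows, widen both arguments and call the _w* variant,
    // bypassing the ANSI code page entirely.
    return _wfopen(utf8_to_wide(name).c_str(),
                   utf8_to_wide(mode).c_str());
}
#else
std::FILE* fopen_utf8(const char* name, const char* mode)
{
    // On POSIX, narrow strings pass straight through; with a UTF-8
    // locale the file name is already in the right encoding.
    return std::fopen(name, mode);
}
#endif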

On 01/19/2011 09:50 AM, Artyom wrote:
... elision by patrick ... Take a look at this code:
http://art-blog.no-ip.info/files/nowide.zip
Building the tests failed on my Linux box with gcc 4.6. Is it supposed to work here?
2/2 Testing: test_streambuf
2/2 Test: test_streambuf
Command: "/usr/local/downloads/tmp/nowide/test_streambuf"
Directory: /usr/local/downloads/tmp/nowide
"test_streambuf" start time: Jan 19 20:58 PST
Output:
----------------------------------------------------------
Testing input device
Testing input device, small buffer
Testing output device
Testing output device, small buffer size
Testing output device, reset
Testing seek fault
Testing tell fault
Testing random access device
Error /usr/local/downloads/tmp/nowide/test/test_streambuf.cpp:231 int(io.tellg())==4
<end of output>
Test time = 0.01 sec
----------------------------------------------------------
Test Failed.
"test_streambuf" end time: Jan 19 20:58 PST
"test_streambuf" time elapsed: 00:00:00
----------------------------------------------------------
End testing: Jan 19 20:58 PST
This is the code I use for my projects that implements what I'm talking about - simple, easy to use, straightforward.
Also don't forget two things:
1. Microsoft deprecated the ANSI API and does not recommend using it.
If the only OS that gives us most of the encoding headaches has deprecated its ANSI API, I don't think that Boost should continue supporting it.
2. All the world has already moved to Unicode, and Microsoft did this as well.
They did it in their incompatible-with-rest-of-the-world way... But still they did it too - so we can either continue ignoring the fact that UTF-8 is the ultimate encoding, or go forward.
Artyom

From: Patrick Horgan <phorgan1@gmail.com>
On 01/19/2011 09:50 AM, Artyom wrote:
... elision by patrick ... Take a look at this code:
http://art-blog.no-ip.info/files/nowide.zip
Building the tests failed on my Linux box with gcc 4.6. Is it supposed to work here?
Actually, if it fails then there is either a bug in g++ 4.6 or, even more likely, a bug in my test :-) Because under Linux it is just something like: namespace nowide { using std::fstream; }
[snip]
----------------------------------------------------------
Test Failed.
"test_streambuf" end time: Jan 19 20:58 PST
"test_streambuf" time elapsed: 00:00:00
----------------------------------------------------------
End testing: Jan 19 20:58 PST
This is quite old code; the new version can be found as part of Booster - the Boost-like part of CppCMS. http://cppcms.svn.sourceforge.net/viewvc/cppcms/framework/trunk/booster/boos... http://cppcms.svn.sourceforge.net/viewvc/cppcms/framework/trunk/booster/lib/... I have fixed a few bugs since then. If there is interest I can extract the code once again; in any case the new code is tested on many platforms and compilers: http://art-blog.no-ip.info/files/nightly-build-report.html Artyom

On 01/19/2011 07:34 AM, Alexander Lamaison wrote:
On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:
I do not believe that UTF-8 is the way to go. In fact I know it is not, except perhaps for the very near future for some programmers (Linux advocates).
:-) Just for the record, I'm not a Linux advocate any more than I'm a Windows advocate. I use both ... I'm writing this on a Windows machine. What I would like is the whole encoding madness/dysfunction (including but not limited to the dual TCHAR/whateverchar-based interfaces) to stop. Everywhere.
Even if I bought the UTF-8ed-Boost idea, what would we do about the STL implementation on Windows which expects local-codepage narrow strings? Are we hoping MS etc. change these to match? Because otherwise we'll be converting between narrow encodings for the rest of eternity.
That's the reality already. As long as people use local narrow encodings we will be converting between them. If your code runs on Windows in Korea or in Spain, you'll get local-codepage narrow strings that are incompatible. At least if there were a utf8_string type, or a utf16_string type, or a utf32_string type, with documentation about how to implement templated conversions to them (code conversion facets), someone could write a library to use them, and everyone using all of these different local encodings would know what to do to use the library. The way it is today, it's much more difficult to figure out how to write a generic library that accepts text from a user. What does a char* or a std::string imply about encoding? Who knows what you'll get. A local 8-bit code page? Shift-JIS? UTF-8? EUC? This is just saying that, hey, here's one way to deal with this issue. This sort of scheme lets the Windows STL implementation exist, but says: here's what you need to do so that I know how to treat the text you pass to me as an argument. If it's in a local code page, you need to convert it to what I want. With validating string types that support the three UCS encodings, you can trust that the data is validly encoded, although all the normal issues about whether the content is meaningful to you still exist. If you use normal code conversion facets, as specified for C++ locales, for conversion from local code pages to your strings, then you can leverage existing work. Why reinvent the wheel? Patrick

19.01.2011 18:34, Alexander Lamaison wrote:
Even if I bought the UTF-8ed-Boost idea, what would we do about the STL implementation on Windows which expects local-codepage narrow strings? Are we hoping MS etc. change these to match? Because otherwise we'll be converting between narrow encodings for the rest of eternity.
The problems with MSVC and multilingual filenames are not Boost-related. Even the following code doesn't work correctly:
#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("%s", argv[1]);
    return 0;
}
1.exe asdfфыва asdfЇ√тр
As you can see, the Cyrillic characters are broken (this is an ANSI vs OEM issue and is not related to Unicode at all). Please note that the Cygwin compiler/libc has no such problems because it uses UTF-8 (by default, at least). Its fopen() uses UTF-8 for filenames, too.

So, we may choose one of the following:

1. Wait until MS fixes the problem on their side. For now, Windows users may use short filenames (i.e. GetShortPathName()) for multilingual filenames.
2. Provide a char* interface that will allow Windows developers to work with multilingual filenames.
3. Provide a WCHAR* interface specially for Windows developers and allow them to write non-portable code. Leave the char* interface unusable for Windows/MSVC and wait until MS fixes it on their side.
4. Create an almost-portable wchar_t* interface.
5. Create our own type (boost::native_t or boost::utf8_t) and conversion routines for it. Please note that independent libraries will NEVER use foreign non-standard types.

I think only the 2nd and 3rd options are realistic. -- Best regards, Sergey Cheban

On Thu, Jan 20, 2011 at 1:22 PM, Sergey Cheban <s.cheban@drweb.com> wrote:
The problems with MSVC and multilingual filenames are not Boost-related. Even the following code doesn't work correctly:
#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("%s", argv[1]);
    return 0;
}
1.exe asdfфыва asdfЇ√тр
You lost me. That example has nothing to do with filenames.
As you can see, the Cyrillic characters are broken (this is an ANSI vs OEM issue and is not related to Unicode at all).
Please note that the Cygwin compiler/libc has no such problems because it uses UTF-8 (by default, at least). Its fopen() uses UTF-8 for filenames, too.
So, we may choose one of the following:
1. Wait until MS fixes the problem on their side. For now, Windows users may use short filenames (i.e. GetShortPathName()) for multilingual filenames.
2. Provide a char* interface that will allow Windows developers to work with multilingual filenames.
3. Provide a WCHAR* interface specially for Windows developers and allow them to write non-portable code. Leave the char* interface unusable for Windows/MSVC and wait until MS fixes it on their side.
4. Create an almost-portable wchar_t* interface.
5. Create our own type (boost::native_t or boost::utf8_t) and conversion routines for it. Please note that independent libraries will NEVER use foreign non-standard types.
I think only the 2nd and 3rd options are realistic.
Why not just use Boost.Filesystem V3 for dealing with files and filenames? You can work with char strings in the encoding of your choice, including UTF-8. You can use wchar_t strings in UTF-16 encoding. If your compiler supports C++0x char16_t and char32_t, you will be able to also use strings based on those as C++0x support matures. Class boost::filesystem::path provides a single non-template class that works fine with all of those types and encodings. Your code can be written to be reasonably portable too, particularly if all you are concerned with is either Windows systems or POSIX-like systems that use UTF-8 for filenames. If you want wider portability, you would have to avoid narrow strings so that on POSIX-like systems the wide strings could be converted to whatever narrow encoding the system uses. --Beman
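A short usage sketch of the above; the file names are just examples, while the mixed construction and the observers are, as I understand it, the documented V3 interface:

#include <boost/filesystem.hpp>
#include <boost/filesystem/fstream.hpp>

namespace fs = boost::filesystem;

int main()
{
    // One non-template path class accepts narrow and wide sources.
    fs::path p1("readme.txt");        // narrow, native narrow encoding
    fs::path p2(L"\u0161\u0159.txt"); // wide, converted internally

    // Observers hand back either representation on demand.
    std::string  n = p2.string();     // converted to narrow
    std::wstring w = p2.wstring();    // converted to wide

    fs::ofstream out(p2);             // opens using the native encoding
    out << "hello\n";
    return 0;
}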

Beman Dawes wrote:
Why not just use Boost.Filesystem V3 for dealing with files and filenames? ... Your code can be written to be reasonably portable too, particularly if all you are concerned with is either Windows systems or POSIX-like systems that use utf-8 for filenames. If you want wider portability, you would have to avoid narrow strings so that on POSIX-like systems the wide strings could be converted to whatever narrow encoding the system uses.
No, if you want wider portability you want to avoid wide strings and use narrow strings (UTF-8 on Windows, no conversion on POSIX). You don't want to convert on POSIX because the narrow -> wide -> narrow conversion may not be exact. (*) On Windows, wide (UTF-16) -> narrow (UTF-8) -> wide (UTF-16) is exact. std::wstring is typically UTF-32 on POSIX, by the way, because wchar_t is 32 bits there - this is one additional source of non-portability if you go wide. (*) It will be exact if you use a single-byte encoding that maps every narrow character to a distinct Unicode code point, but then you can't use UTF-8 at the same time - you have to choose one or the other.

Beman Dawes wrote:
Why not just use Boost.Filesystem V3 for dealing with files and filenames?
The V3 path looks very reasonably designed and I can certainly understand why it's the way it is. However...

Let's take Windows. In the vast majority of the use cases that call for construction from a narrow string, this string is either (a) ANSI code page encoded, or (b) UTF-8 encoded. Of these, (b) are people doing the Right Thing, (a) are people doing the Wrong Thing or people who have to work with people doing the Wrong Thing (not that there's anything wrong with that).

v3::path has the following constructors:

path( Source );
path( Source, codecvt_type const & cvt );

The first one uses std::codecvt<wchar_t, char, mbstate_t> to do the conversion, which "converts between the native character sets for narrow and wide characters" according to the standard. In other words, nobody knows for sure what it does without consulting the source of the STL implementation du jour, but one might expect it to use the C locale via mbtowc. This is a reasonable approximation of what we need (to convert between ANSI and wide), but pedants wouldn't consider it portable or reliable. It's also implicit - so it makes it easy for people to do the wrong thing.

The second one allows me to use an arbitrary encoding, which is good in that I could pass it a utf8_codecvt or ansi_codecvt, if I can find some non-buggy versions on the Web or write them myself. But, since it considers all encodings equally valid, it makes it hard for people to do the right thing.

On Fri, Jan 21, 2011 at 5:21 AM, Peter Dimov <pdimov@pdimov.com> wrote:
Beman Dawes wrote:
Why not just use Boost.Filesystem V3 for dealing with files and filenames?
The V3 path looks very reasonably designed and I can certainly understand why it's the way it is. However...
Let's take Windows. In the vast majority of the use cases that call for construction from a narrow string, this string is either (a) ANSI code page encoded, (b) UTF-8 encoded. Of these, (b) are people doing the Right Thing, (a) are people doing the Wrong Thing or people who have to work with people doing the Wrong Thing (not that there's anything wrong with that).
Sure, but anything other than that would be untenable. Programmers will assume that the default is for a narrow string to be treated exactly the way it would be treated in a call to the C library's fopen(), and doing something different would cause endless real-world bugs.
v3::path has the following constructors:
path( Source );
path( Source, codecvt_type const & cvt );
The first one uses std::codecvt<wchar_t, char, mbstate_t> to do the conversion, which "converts between the native character sets for narrow and wide characters" according to the standard. In other words, nobody knows for sure what it does without consulting the source of the STL implementation du jour, but one might expect it to use the C locale via mbtowc. This is a reasonable approximation of what we need (to convert between ANSI and wide) but pedants wouldn't consider it portable or reliable. It's also implicit - so it makes it easy for people to do the wrong thing.
std::codecvt<wchar_t, char, mbstate_t> is the type, but for Windows the actual object used is a custom codecvt that uses Windows MultiByteToWideChar() for the ANSI or OEM codepage, as determined by AreFileApisANSI(). Your point is correct, but only if you believe defaulting to the platform's usual open()/fopen() behavior is the wrong thing.
The second one allows me to use an arbitrary encoding, which is good in that I could pass it a utf8_codecvt or ansi_codecvt, if I can find some non-buggy versions on the Web or write them myself. But, since it considers all encodings equally valid, it makes it hard for people to do the right thing.
What I'm suggesting is that people who want to use Unicode use wchar_t strings now, and char16_t or char32_t strings in C++0x. For general string use, rather than just paths, I'd like Boost to supply non-templated Unicode string classes:

* u8_string, u16_string, and u32_string, with guaranteed internal representations.
* utf_string, with an internal representation that is one of the above, but chosen at run time.

All would, like boost::path, supply member function templates that take any of the above, as well as std and UDT types. --Beman
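A declaration-level sketch of one possible shape for these; u8_string and utf_string are the names proposed above, and everything else here is an assumed design, not existing Boost code:

#include <string>

// Hypothetical: a non-templated string with a guaranteed
// internal representation (always UTF-8).
class u8_string
{
public:
    // Member templates would accept std:: strings, other u*_strings,
    // and user-defined string types, converting on the way in.
    template <class Source> u8_string(const Source& s);

    const std::string& bytes() const; // UTF-8, by construction
};

// Hypothetical: representation is one of UTF-8/16/32,
// but chosen at run time.
class utf_string
{
public:
    enum encoding { utf8, utf16, utf32 };

    explicit utf_string(encoding e);
    encoding current_encoding() const;
};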

Beman Dawes wrote:
std::codecvt<wchar_t, char, mbstate_t> is the type, but for windows the actual object used is a custom codecvt that uses Windows MultiByteToWideChar() for the ANSI or OEM codepage, as determined by AreFileApisANSI().
This is not what http://www.boost.org/doc/libs/1_45_0/libs/filesystem/v3/doc/reference.html#p... says. I can't find this custom codecvt in the v3 source either, but I haven't looked very hard.
But your point is correct, but only if you believe defaulting to the platform's usual open/fopen() behavior is the wrong thing.
It's not really a matter of belief. Using an "ANSI" path on Windows is objectively a wrong thing, unless you are forced to do so by a library. "ANSI" paths can't represent Windows paths properly. Now, it is not objectively a wrong thing to make a library default to ANSI when given a narrow string, because this is what 90% of programmers would expect. In fact, if a function's documentation doesn't state how it interprets a narrow path string, I would assume ANSI as well - this is how it is. Fine. But this design decision makes v3::path unsuitable for people who don't want their strings to be treated silently as ANSI paths (via the implicit conversion) because this hides logic errors. My current preference is for a Windows path class to provide path::from_ansi and path::from_utf8; the implicit constructor, if present, would default to UTF-8, although omitting it would be much less controversial, and I don't really insist on having it.
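A minimal sketch of that preference; path, from_ansi, and from_utf8 are the names proposed above, and the conversion helpers are assumed (thin wrappers over MultiByteToWideChar with CP_ACP and CP_UTF8 would be the obvious implementations):

#include <string>

// Hypothetical Windows path class storing the native wide form.
class path
{
public:
    // Named constructors make the caller state the encoding explicitly.
    static path from_ansi(const std::string& s)
    { return path(ansi_to_wide(s)); } // assumed: CP_ACP conversion

    static path from_utf8(const std::string& s)
    { return path(utf8_to_wide(s)); } // assumed: CP_UTF8 conversion

    const std::wstring& native() const { return value_; }

private:
    explicit path(const std::wstring& w) : value_(w) {}

    // Assumed helpers, e.g. over MultiByteToWideChar.
    static std::wstring ansi_to_wide(const std::string& s);
    static std::wstring utf8_to_wide(const std::string& s);

    std::wstring value_;
};

// Usage: CreateFileW(path::from_utf8(s).native().c_str(), ...);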
What I'm suggesting is that people who want to use Unicode use wchar_t strings now, and char16_t or char32_t strings in C++0x.
This is worse than using UTF-8 on Windows from a portability standpoint, as I've explained in my previous post.

On 01/21/2011 02:21 AM, Peter Dimov wrote:
Beman Dawes wrote:
Why not just use Boost.Filesystem V3 for dealing with files and filenames?
The V3 path looks very reasonably designed and I can certainly understand why it's the way it is. However...
Let's take Windows. In the vast majority of the use cases that call for construction from a narrow string, this string is either (a) ANSI code page encoded, (b) UTF-8 encoded. Of these, (b) are people doing the Right Thing, (a) are people doing the Wrong Thing or people who have to work with people doing the Wrong Thing (not that there's anything wrong with that).
There's nothing wrong with either one, although blindly dealing with it without knowing which one you have would be problematic. Of course this is an artificial distinction, and it might be Shift-JIS or EUC or others as well. These are all valid and are only a subset of widely used 8-bit encodings. In general, if someone works only within a particular language and whatever they're using works for them, they aren't much motivated to change. You won't easily convince them to switch to UTF-8. Although, nicely, UTF-8 isn't state-dependent like many others, people have long since solved the problems of dealing with whatever encoding they're used to for their region, and even though UTF-8 would be less problematic, they have already solved their problems. You're just asking them to take on a new set of problems.
v3::path has the following constructors:
path( Source );
path( Source, codecvt_type const & cvt );
The first one uses std::codecvt<wchar_t, char, mbstate_t> to do the conversion, which "converts between the native character sets for narrow and wide characters" according to the standard. In other words, nobody knows for sure what it does without consulting the source of the STL implementation du jour, but one might expect it to use the C locale via mbtowc. This is a reasonable approximation of what we need (to convert between ANSI and wide) but pedants wouldn't consider it portable or reliable. It's also implicit - so it makes it easy for people to do the wrong thing.
That is a frustration. The program should check the locale when run and use the current one. That locale will have the codecvt in it. Now, many operating systems don't provide a standard locale in the user's environment, so the default would be "C", i.e. 7-bit US-ASCII. You can run this little program in your environment to see what you get:

#include <iostream>
#include <locale>

int main()
{
    std::locale native("");
    std::locale c("C");
    std::locale global;
    std::cout << "native : " << native.name() << '\n';
    std::cout << "classic: " << std::locale::classic().name() << '\n';
    std::cout << "global : " << global.name() << '\n';
    std::cout << "c      : " << c.name() << '\n';
    return 0;
}

Also, although it's specified in the standard what will happen, VC++ doesn't always follow the standard exactly, sometimes just because the standards have left room for interpretation and different operating systems interpret the same part of the spec differently.
The second one allows me to use an arbitrary encoding, which is good in that I could pass it a utf8_codecvt or ansi_codecvt, if I can find some non-buggy versions on the Web or write them myself. But, since it considers all encodings equally valid, it makes it hard for people to do the right thing.
Writing a code conversion facet yourself isn't hard, but it's tricky to make sure all the corner cases work. It will be nicer with C++0x, because some standard code conversion facets come with it that are specified clearly enough that you can rely on them doing the same thing across operating systems. The truth is that there is a dearth of high-quality code conversion facets available as open source. Let's all fix that :) One of the problems has been differing interpretations of what a wchar_t is. That's another thing C++0x gets right with char32_t and char16_t. No more converting from UTF-8 to wchar_t and getting UTF-16 on one operating system and UTF-32 on another. Patrick
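A small sketch of those C++0x facilities in use - std::wstring_convert plus the <codecvt> header, as specified in the C++0x draft - converting UTF-8 to fixed-width char32_t identically on every platform:

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // UTF-8 bytes <-> UTF-32 code points in char32_t; the result is
    // the same everywhere, independent of wchar_t's width.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;

    // g, r, U+00FC, U+00DF: 4 code points encoded in 6 UTF-8 bytes.
    std::u32string cps = conv.from_bytes("gr\xC3\xBC\xC3\x9F");
    std::string bytes  = conv.to_bytes(cps); // exact round trip

    return cps.size() == 4 ? 0 : 1;
}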

Patrick Horgan wrote: ...
There's nothing wrong about either one, although blindly dealing with it without knowing which would be problematic. Of course this is an artificial distinction and it might be Shift-JIS or EUC or others as well. These are all valid and are only a subset of widely used 8-bit encodings.
The "ANSI code page" on Windows may well use Shift-JIS. "ANSI" is just a(n unfortunate choice of) name, the actual encoding is not fixed and has little to do with ANSI - it depends on the Windows locale.

On 01/21/2011 07:09 PM, Peter Dimov wrote:
The "ANSI code page" on Windows may well use Shift-JIS. "ANSI" is just a(n unfortunate choice of) name, the actual encoding is not fixed and has little to do with ANSI - it depends on the Windows locale.
My simple guideline is that any time someone uses the term "ANSI", "ASCII", or "Unicode" to refer to an encoding scheme they don't know what they're talking about. Seriously. - Marsh

On Tue, 08 Feb 2011 10:57:53 -0600 Marsh Ray <marsh@extendedsubset.com> wrote:
On 01/21/2011 07:09 PM, Peter Dimov wrote:
The "ANSI code page" on Windows may well use Shift-JIS. "ANSI" is just a(n unfortunate choice of) name, the actual encoding is not fixed and has little to do with ANSI - it depends on the Windows locale.
My simple guideline is that any time someone uses the term "ANSI", "ASCII", or "Unicode" to refer to an encoding scheme they don't know what they're talking about.
Seriously.
A good rule of thumb, but keep in mind that ASCII (or more formally "US-ASCII") is the colloquial name for the seven-bit ISO 646 encoding, and "ANSI" was used for Windows code-page 1252 because Microsoft based it on an early ISO-8859-1 draft.[1] (The name is still in use in the Windows API, but they say it's a "historical reference, but is nowadays a misnomer that continues to persist in the Windows community.") The blame for "Unicode encoding" can probably be laid on Microsoft too.[2] (Sorry to get pedantic on you, just taking a break before the hopefully-final coding session on my UTF string library, which includes converter classes for many common code-pages, including ascii (typedef of us_ascii) and windows_ansi (typedef of windows1252)... I've been swimming in this stuff for the last several weeks. ;-) ) [1]:<http://intranet.ipub.sil.org/cms/scripts/page.php?item_id=IWS-Chapter03> [2]:<http://msdn.microsoft.com/en-us/library/system.text.unicodeencoding.aspx> -- Chad Nelson Oak Circle Software, Inc. * * *

On 02/08/2011 04:17 PM, Chad Nelson wrote:
A good rule of thumb, but keep in mind that ASCII (or more formally "US-ASCII") is the colloquial name for the seven-bit ISO 646 encoding,
Everybody knows that. But which one? Not everyone agrees, and there is significant variation. You can't point to a single standard that even a majority of people accept as the official ASCII; it was revised many times over the years. So it's a bad way to refer to a spec. It's probably why GCC prints compiler warning messages using the backtick/grave and apostrophe as if they were paired single quotes. It's broken.
and "ANSI" was used for Windows code-page 1252 because Microsoft based it on an early ISO-8859-1 draft.[1] (The name is still in use in the Windows API, but they say it's a "historical reference, but is nowadays a misnomer that continues to persist in the Windows community.")
MS also used it to contrast with the "OEM" code page, which was their way of saying "ANSI" was for system stuff that didn't change (e.g. DLL names) and "OEM" was for UI and interoperable stuff that was deeply customized for foreign markets.
The blame for "Unicode encoding" can probably be laid on Microsoft too.[2]
Unicode was originally sold as a 16-bit fixed-width encoding, with perhaps just the minor variation for endianness. 64K characters ought to be enough for anybody, they said. But they just couldn't stop themselves from inflicting yet another endless variety of multibyte encodings on the world.
(Sorry to get pedantic on you, just taking a break before the
No, surely it was I who was trolling for pedantry. Sorry!
hopefully-final coding session on my UTF string library, which includes converter classes for many common code-pages, including ascii (typedef of us_ascii) and windows_ansi (typedef of windows1252)... I've been swimming in this stuff for the last several weeks. ;-) )
Oh I know. I worked on that stuff for many years while working on document printing and display software. The variations are endless. I used to keep a book on my desk that was over an inch thick of just code pages. Half of them were "ASCII" code pages. The other half were "EBCDIC". Perhaps you've seen this: http://en.wikipedia.org/wiki/ISO/IEC_646#National_variants I still can't figure out if "ISO 646 US" and "ANSI X3.4-1968" are the same as Unicode U+0000 - U+007F (for those 128 points). I think there are some differences. You can maybe get away with "US ASCII" in the US (other than Spanish speakers), Canada (other than Quebec), Australia and New Zealand. Maybe a few other places. But make sure you reference a modern, relevant standard for it. It'd probably be better if you just referenced the specific standards directly and avoided the imprecise term "ASCII". - Marsh

On Tue, Feb 8, 2011 at 14:17, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
The blame for "Unicode encoding" can probably be laid on Microsoft too.[2]
It certainly doesn't help that Notepad (on this XP system, at least, and I know "Unicode" is still there in 2008R2) allows you to select "ANSI, Unicode, Unicode big endian, or UTF-8" under "Encoding" in the Save As screen.

Edward Diener wrote:
Inevitably a Unicode standard will be adopted where every character of every language will be represented by a single fixed length number of bits.
This was the prevailing thinking once. First this number of bits was 16 (an incorrect assumption that claimed Microsoft and Java as victims), then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.

On 1/19/2011 11:33 AM, Peter Dimov wrote:
Edward Diener wrote:
Inevitably a Unicode standard will be adopted where every character of every language will be represented by a single fixed length number of bits.
This was the prevailing thinking once. First this number of bits was 16 (an incorrect assumption that claimed Microsoft and Java as victims), then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
"Eventually, people realized..." . This is just rhetoric, where "people" is just whatever your own opinion is. I do not understand the technical reason for it never happening. Are human "alphabets" proliferating so fast that we can not fit the notion of a character in any alphabet into a fixed size character ? In that case neither are we ever going to have multi-byte characters representing all of the possible characters in any language. But it is absurd to believe that. "Eventually people realized that making a fixed size character representing every character in every language was doable and they just did it." That sounds fairly logical to me, aside from the practicality of getting diverse people from different nationalities/character-sets to agree on things. Of course you can argue that having a variable number of bytes representing each possible character in any language is better than having a single fixed size character and I am willing to listen to that technical argument. But from a programming point of view, aside from the "waste of space" issue, it does seem to me that having a fixed size character has the obvious advantage of being able to access a character via some offset in the character array, and that all the algorithms for finding/inserting/deleting/changing characters become much easier and quicker with a fixed size character, as well as displaying and inputting.

Edward Diener wrote:
On 1/19/2011 11:33 AM, Peter Dimov wrote:
Edward Diener wrote:
Inevitably a Unicode standard will be adopted where every character of every language will be represented by a single fixed length number of bits.
This was the prevailing thinking once. First this number of bits was 16 (an incorrect assumption that claimed Microsoft and Java as victims), then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
"Eventually, people realized..." . This is just rhetoric, where "people" is just whatever your own opinion is.
I do not understand the technical reason for it never happening.
I'm not sure that I do, either. Nevertheless, people at the Unicode consortium have been working on that for... 20 years now? What technical obstacle that currently blocks their progress do you foresee disappearing in the future? Occam says that variable width characters are simply a better match for the problem domain, even when character width in bits is not a problem.

... elision by patrick ...
On 01/19/2011 09:52 AM, Peter Dimov wrote:
I'm not sure that I do, either. Nevertheless, people at the Unicode consortium have been working on that for... 20 years now? What technical obstacle that currently blocks their progress do you foresee disappearing in the future? Occam says that variable width characters are simply a better match for the problem domain, even when character width in bits is not a problem.
You lost me with the last line. I'm thinking you're talking about Occam's razor, which says that we should prefer simpler explanations for things except when a more complicated explanation does a better job of explaining the facts. I'm completely lost about how that would apply to choosing a variable width encoding over a fixed width encoding. Patrick

Peter Dimov wrote:
Edward Diener wrote:
Inevitably a Unicode standard will be adopted where every character of every language will be represented by a single fixed length number of bits.
This was the prevailing thinking once. First this number of bits was 16 (an incorrect assumption that claimed Microsoft and Java as victims), then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
Well put!!! This is the problem of trying to impose a view upon the future. No one really sees far enough ahead to do that. The best we can do is to allow proposals to be implemented so they can then be sorted out by "software evolution". It's the age-old argument of central planning, intelligent design, etc. vs. market capitalism, evolution, etc. Admittedly, it goes against the grain of control freaks like us, but we have to live with it. And things do get better. Imagine if somehow 32-bit Unicode had been imposed upon us! This is the essence of my argument that the only way forward is to propose and implement a better way forward and then try to sell it. Robert Ramey

On 01/19/2011 08:33 AM, Peter Dimov wrote:
Edward Diener wrote:
Inevitably a Unicode standard will be adopted where every character of every language will be represented by a single fixed length number of bits.
This was the prevailing thinking once. First this number of bits was 16 (an incorrect assumption that claimed Microsoft and Java as victims), then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
At 32 bits we can encode all current languages, all extinct languages, Klingon, and still have most of the space empty. You might want to read the Unicode spec, which talks clearly about this. If you just read through the end of Chapter 6 you'll have a great overall understanding of Unicode. It's available as a compressed pdf file at: http://www.unicode.org/versions/Unicode5.2.0/UnicodeStandard-5.2.zip Patrick

At Wed, 19 Jan 2011 09:58:13 -0500, Edward Diener wrote:
I do not think that shoving UTF-8 down everybody's throats is the best solution even now, I think a good set of classes to convert between encoding standards is much better.
Can we please tone down the rhetoric here? I could say, "I do not think that shoving a set of classes to convert between encoding standards down everybody's throats is the best solution..." but I don't think it would help anyone understand the issues better. Is there any harm in exploring the alternatives here in a calm and rational way? If we do that, and the approaches you oppose are truly inferior, that fact will become clear to everyone, I think. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote:
At Wed, 19 Jan 2011 09:58:13 -0500, Edward Diener wrote:
I do not think that shoving UTF-8 down everybody's throats is the best solution even now, I think a good set of classes to convert between encoding standards is much better.
Can we please tone down the rhetoric here?
It's OK. :-) Either way, not shoving _something_ down everybody's throats is not an option if you need to create a library that talks to the OS. You have to pick some type, and if you pick string or char*, you have to make a decision how to interpret it.

On Wed, 19 Jan 2011 09:58:13 -0500 Edward Diener <eldiener@tropicsoft.com> wrote:
I am a believer ;) and when people realize that UTF-8 is the way to go, the pesky problems will vanish. Believe me today with ANSI
I do not believe that UTF-8 is the way to go. In fact I know it is not, except perhaps for the very near future for some programmers ( Linux advocates ).
Inevitably a Unicode standard will be adopted where every character of every language will be represented by a single fixed length number of bits. [...]
I'm no Unicode expert, but the reason this hasn't happened might be combinatorial explosion. In which case it might never happen. But I could well be wrong. And I hope I am; the design you outline is something I'd love to see. -- Chad Nelson Oak Circle Software, Inc. * * *

On 01/19/2011 11:54 AM, Chad Nelson wrote:
On Wed, 19 Jan 2011 09:58:13 -0500 Edward Diener<eldiener@tropicsoft.com> wrote:
I am a believer ;) and when people realize that UTF-8 is the way to go, the pesky problems will vanish. Believe me today with ANSI
I do not believe that UTF-8 is the way to go. In fact I know it is not, except perhaps for the very near future for some programmers ( Linux advocates ).
Inevitably a Unicode standard will be adopted where every character of every language will be represented by a single fixed length number of bits. [...]
I'm no Unicode expert, but the reason this hasn't happened might be combinatorial explosion. In which case it might never happen. But I could well be wrong. And I hope I am; the design you outline is something I'd love to see.
It's already here and has been for a long time. That's just UCS encoded as UTF-32. UCS isn't a new thing. They started on the standard in the late 80s and the standard was first copyrighted in 1991. They've come a long way. All the common languages and many of the uncommon languages are supported. Many dead languages are already supported. Languages with support added in 5.1 and 5.2 were Cham, Kayah Li, Lepcha, Ol Chiki, Rejang, Saurashtra, Sundanese, Vai, Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, and Tai Viet.
Patrick

On Thu, 20 Jan 2011 00:05:47 -0800 Patrick Horgan <phorgan1@gmail.com> wrote:
Inevitably a Unicode standard will be adopted where every character of every language will be represented by a single fixed length number of bits. [...]
I'm no Unicode expert, but the reason this hasn't happened might be combinatorial explosion. In which case it might never happen. But I could well be wrong. And I hope I am; the design you outline is something I'd love to see.
It's already here and has been for a long time. That's just UCS encoded as UTF-32. [...]
The problem, in my uninformed view of it, is the idea of combining characters. Any time you can have a single character that requires more than one code-point, you can't assume that a fixed number of bits will be able to represent every character. I may be wrong, and I hope I am. If a character is guaranteed never to consist of more than X code-points, it would be simple to offer a fixed-width character type, even if the width is huge by comparison to the eight-bit char type. But from what I've seen, I don't think that's the case. -- Chad Nelson Oak Circle Software, Inc. * * *
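A concrete illustration of the combining-character problem, as a small sketch: the visible character "é" can be one code point or two, so even a 32-bits-per-element string does not give you one element per character.

#include <string>

int main() {
    std::u32string precomposed(U"\u00E9");      // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    std::u32string decomposed(U"\u0065\u0301"); // 'e' + U+0301 COMBINING ACUTE ACCENT
    // Both display as "é", but the element counts differ:
    // precomposed.size() == 1, decomposed.size() == 2.
    return precomposed.size() == 1 && decomposed.size() == 2 ? 0 : 1;
}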

I may be wrong, and I hope I am. If a character is guaranteed never to consist of more than X code-points, it would be simple to offer a fixed-width character type, even if the width is huge by comparison to the eight-bit char type. But from what I've seen, I don't think that's the case.
I assume there is some limit, but who knows what it is? Even in Hebrew (the language I speak) you can easily create a letter with 4 code points: - shin-basic, shin/sin mark, vowel, dagesh - Now I can also add some biblical marks (I think there may be two or three of them). And Hebrew is a relatively simple one. Now I have no idea about what happens in other languages and what happens with Unicode points that are going to be added in future Unicode releases. So I would suggest not assuming that there is a certain limit. Artyom
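One plausible spelling of the sequence Artyom describes, as a sketch (the code points are from the Unicode Hebrew block; the particular marks and their order here are illustrative, not necessarily the canonical combining order):

#include <string>

int main() {
    std::u32string shin = {
        0x05E9, // HEBREW LETTER SHIN
        0x05C1, // HEBREW POINT SHIN DOT
        0x05B8, // HEBREW POINT QAMATS (a vowel)
        0x05BC  // HEBREW POINT DAGESH OR MAPIQ
    };
    // Four code points, one user-perceived letter.
    return shin.size() == 4 ? 0 : 1;
}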

On Thu, 20 Jan 2011 06:30:48 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
I may be wrong, and I hope I am. If a character is guaranteed never to consist of more than X code-points, it would be simple to offer a fixed-width character type, even if the width is huge by comparison to the eight-bit char type. But from what I've seen, I don't think that's the case.
I assume there is some limit, but who knows what it is?
Even in Hebrew (the language I speak) you can easily create a letter with 4 code points: [...] And Hebrew is a relatively simple one. [...] So I would suggest not assuming that there is a certain limit.
<sigh> Yes, that's what I pretty much expected. -- Chad Nelson Oak Circle Software, Inc. * * *

On 20.01.2011 17:30, Artyom wrote:
Even in Hebrew (the language I speak) you can easily create a letter with 4 code points:
- shin-basic, shin/sin mark, vowel, dagesh - Now I can also add some biblical marks (I think there may be two or three of them)
And Hebrew is a relatively simple one.
Even in English, you have to combine letters to write text. There are some kerning-related issues, too. But I see no problem here.
Best regards, Sergey Cheban.

On 01/19/2011 06:58 AM, Edward Diener wrote:
... elision by patrick...
I do not believe that UTF-8 is the way to go. In fact I know it is not, except perhaps for the very near future for some programmers ( Linux advocates ).
Inevitably a Unicode standard will be adopted where every character of every language will be represented by a single fixed length number of bits. Nobody will care any longer that this fixed length set of bits "wastes space", as so many people today are hysterically fixated on. Whether or not UTF-32 can do this now I do not know, but this world where a character in some language on earth is represented by some arcane multi-byte encoding will end. If UTF-32 cannot do it then UTF-nn inevitably will.
UTF-32 is the only UCS fixed-width encoding.
UTF-16 can encode most of the basic multilingual plane in fixed width. That's most of the characters in the world. If you know your problem domain, and know that you are in the first code plane, then you can use UTF-16 as a fixed width encoding. If you know that you have to be able to handle any UCS character, then you can't. Currently 107,296 of the characters in UCS are defined, out of a total code space of 1,114,112 (0 to 0x10FFFF).
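The fixed-width property breaks exactly at the surrogate mechanism. A small sketch of how a code point above the BMP is split into a UTF-16 surrogate pair:

#include <cstdint>
#include <utility>

// Maps a code point in [0x10000, 0x10FFFF] to its UTF-16 surrogate pair.
std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    std::uint32_t u = static_cast<std::uint32_t>(cp) - 0x10000;
    char16_t high = static_cast<char16_t>(0xD800 + (u >> 10));   // lead surrogate
    char16_t low  = static_cast<char16_t>(0xDC00 + (u & 0x3FF)); // trail surrogate
    return std::make_pair(high, low);
}
// Example: U+1D11E (MUSICAL SYMBOL G CLEF) -> 0xD834, 0xDD1E.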
I do not think that shoving UTF-8 down everybody's throats is the best solution even now, I think a good set of classes to convert between encoding standards is much better.
I agree with you. Nobody should shove any one solution down anyone's throat. Instead, I wish that more people would understand the trade-offs of the different encodings and when each might be more desirable, instead of saying "Oh, we can never do that" or "Oh, we must always do that". The best thing is to understand your problem domain and what the implications of that domain are for each of the possible encodings.
The truth is that the web and xml apps all use Unicode, as do more and more applications. Nobody considers doing new international applications with anything other than Unicode. That means that you need to know about the three encodings, UTF-8, UTF-16 and UTF-32, and their trade-offs. If you're on a fast, lightly loaded machine with lots of memory, there could be real advantages to UTF-32. If you're running on a hand-held device with limited memory, UTF-8 could be a real winner. That's a simplistic view of a complex decision, but if you're doing the design for something you should educate yourself and make the complex decision with forethought.
You can get your own copy of the Unicode 5.2 standard as a zipped pdf file at http://www.unicode.org/versions/Unicode5.2.0/UnicodeStandard-5.2.zip The 6.0 standard is being worked on as we speak. Patrick
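To make the space trade-off just described concrete, a minimal UTF-8 encoder sketch (it assumes cp is a valid scalar value, i.e. at most 0x10FFFF and not a surrogate): ASCII costs one byte, most European letters two, most CJK three, and supplementary-plane characters four.

#include <string>

std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                                    // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                            // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                          // 3 bytes: rest of BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                            // 4 bytes: other planes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}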

On Wed, 19 Jan 2011 15:08:06 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
How is that different from what we've got today, except that the utf*_t classes will make converting to and from different string types, and validating the UTF code, a little easier and more automatic?
Exactly, and I think that we agree that the current status is far from ideal. The automatic conversions would (probably) be OK but introducing yet another string class is not.
Do you see another way to provide those conversions, and automatic verification of proper UTF coding? (Automatic verification is a very good thing, without it someone won't use it or will forget to, and open up their programs to exploitation.)
[As the world moves toward] the assumption that std::string == UTF-8, the need (and code) for transcoding will silently vanish. Eventually, utf8_t will just be a statement by the programmer that the data contained within is guaranteed to be valid UTF-8, enforced by the class -- something that would require at minimum an extra call if using std::string, one that could be forgotten and open up the program to exploits.
Yes, but why not enforce it "organizationally", with the power and influence Boost has? Again, I know that it would break a lot of stuff, but are all those people that now use std::string really ready to change all their code to use utf8_t instead? Which will involve more work? I'm convinced that it will be the latter, but I may be wrong.
If Boost comes out with a version that breaks existing programs, companies just won't upgrade to it. I can keep one of the companies that mine works with upgrading, because the group that I work with is the only one there using C++ and they listen to me, but most companies have a lot more invested in the existing system. Believe me, any breaking changes have to be eased in over many versions -- the "boiling a frog" approach. :-)
And many people already *do* use std::string for UTF-8 and are doing the "right" (sorry :)) thing; by introducing utf8_t we are "punishing" them, because we want them to change their code for the sake of people who still dwell on ANSI. IMO we should do the opposite.
If they're already using UTF-8 strings, then we provide something like BOOST_ALL_STD_STRINGS_ARE_UTF8 that they can define. The utf*_t classes configure themselves to accept std::strings as UTF-8-encoded, and any changes are completely transparent to those people. No punishment involved. For everyone else, we introduce the utf*_t API alongside the std::string one, for those classes and functions that are not encoding-agnostic. The std::string one can be deprecated in future versions if the library author desires. Again, no punishment involved.
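A hypothetical sketch of how such a switch could work; BOOST_ALL_STD_STRINGS_ARE_UTF8 and utf8_t are the names proposed in this thread, not an existing Boost API:

#include <string>

class utf8_t {
public:
#if defined(BOOST_ALL_STD_STRINGS_ARE_UTF8)
    // Users who already keep UTF-8 in std::string opt in to an
    // implicit (but still validating) conversion.
    utf8_t(std::string const & s) { assign_validated(s); }
#else
    // Everyone else must state the conversion explicitly.
    explicit utf8_t(std::string const & s) { assign_validated(s); }
#endif
private:
    void assign_validated(std::string const & s) {
        // Validation elided in this sketch; a real class would
        // throw (or substitute U+FFFD) on malformed input.
        data_ = s;
    }
    std::string data_;
};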
[...] - Once we overcome the troubled period of transition everything will be just great. No headaches related to file encoding detection and transcoding.
It's the getting-there part that I'm concerned about.
Me too, but again, many other people have already pointed out that a large portion of the code is completely encoding-agnostic, so there would be no impact if we stayed with std::string. There would be, if we added utf8_t.
Those portions of the code that are encoding-agnostic can continue using std::string, and nothing changes. It's only the functions that need to know the encoding that would change, and that change can be gradual.
Think about what will happen after we accept IPV6 and drop IPV4. The process will be painful but after it is done, there will be no more NAT, and co. and the whole network infrastructure will be simplified.
That's a problem I've been watching carefully for many years now, and I don't see that happening. [...]
Yes, people (me included) are resistant to big changes even for the better. But I've learned that I should always consider the long-term impact.
As have I. :-) I think the design I'm proposing is low-impact enough that people will adopt it. Slowly, but they will.
That's the beauty of it -- not everybody *needs* to accept it. Just the people who write code that isn't encoding-agnostic. Boost.FileSystem might provide a utf16_t overload for Windows, for instance, so that it can automatically convert strings in other UTF types. But I see no reason it would lose the existing interface.
So you suggest that, for example, in the STL there would be, besides the existing fstream and wfstream, also a third "ufstream". I think that we actually should be reducing the interface, not expanding it (yes, I hear it ... "breaking changes!" :)).
I don't expect that the utf*_t classes will make it into the standard. They definitely won't make it into the now-misnamed C++0x standard, and it'll likely be another ten years before another one is hashed out -- by then, the UTF-8 conversion should be complete, so there will be no need for it, except possibly to confirm that a string isn't malformed.
- We will abandon std::string and be stuck with utf8_t which I *personally* already dislike :)
Any technical reason why, other than what you've already written?
Besides the ugly name and the fact that it is a new class? No :)
If you can think of a more-acceptable-but-still-descriptive name for it, I'm all ears. :-)
I hate to point this out, but people are *already* using other programming languages. :-) C++ isn't new or sexy, and has some pain-points (though many of the most egregious ones will be solved with C++0x). Unicode handling is one of them, and in my opinion, the utf*_t types will only ease that.
And the solution is long overdue. And creating utf8_t just puts the problem off; it doesn't really solve it.
I see it as merely easing the transition. -- Chad Nelson Oak Circle Software, Inc. * * *

On Wed, Jan 19, 2011 at 8:50 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
Do you see another way to provide those conversions, and automatic verification of proper UTF coding? (Automatic verification is a very good thing, without it someone won't use it or will forget to, and open up their programs to exploitation.)
Yes: implementing it in std::string in some future standard.
If Boost comes out with a version that breaks existing programs, companies just won't upgrade to it. I can keep one of the companies that mine works with upgrading, because the group that I work with is the only one there using C++ and they listen to me, but most companies have a lot more invested in the existing system. Believe me, any breaking changes have to be eased in over many versions -- the "boiling a frog" approach. :-)
Of course this is a valid point, and what we should do is some potential damage evaluation. There have been breaking changes in Boost and the end-users eventually accepted them (even if complaining loudly). Boost is a cutting-edge library and such changes should be avoided if possible, but they should not be avoided completely. This would require a lot of PR and announcing the changes well in advance.
If they're already using UTF-8 strings, then we provide something like BOOST_ALL_STD_STRINGS_ARE_UTF8 that they can define. The utf*_t classes configure themselves to accept std::strings as UTF-8-encoded, and any changes are completely transparent to those people. No punishment involved.
OK this could work.
For everyone else, we introduce the utf*_t API alongside the std::string one, for those classes and functions that are not encoding-agnostic. The std::string one can be deprecated in future versions if the library author desires. Again, no punishment involved.
I don't expect that the utf*_t classes will make it into the standard. They definitely won't make it into the now-misnamed C++0x standard, and it'll likely be another ten years before another one is hashed out -- by then, the UTF-8 conversion should be complete, so there will be no need for it, except possibly to confirm that a string isn't malformed.
Besides the ugly name and the fact that it is a new class? No :)
If you can think of a more-acceptable-but-still-descriptive name for it, I'm all ears. :-)
I have an idea: what about boost::string, which could possibly become the next std::string in the future.
And the solution is long overdue. And creating utf8_t just puts the problem off; it doesn't really solve it.
I see it as merely easing the transition.
OK, if the long term plan is:
1) design and implement boost::string using UTF-8, doing all the things like code-point iteration, character iteration, convenience stuff like starts-with, ends-with, replace, trim, etc., etc., with as much backward compatibility with std::string as possible without hindering progress
2) try really hard to push it to the standard
then I'm on board with that. BR, Matus

On Thu, 20 Jan 2011 09:59:51 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
Do you see another way to provide those conversions, and automatic verification of proper UTF coding? (Automatic verification is a very good thing, without it someone won't use it or will forget to, and open up their programs to exploitation.)
Yes: implementing it in std::string in some future standard.
'Fraid that's a little beyond my current level of programming skill. ;-)
Besides the ugly name and the fact that it is a new class? No :)
If you can think of a more-acceptable-but-still-descriptive name for it, I'm all ears. :-)
I have an idea: what about boost::string, which could possibly become the next std::string in the future.
And string16 and string32? We'll have to support UTF-32, as the single-codepoint-per-element type, and UTF-16 (distasteful though it may be) is needed for Windows. Or are you suggesting the utf* types in addition to the boost::string type? If so, I believe the idea has merit.
And the solution is long overdue. And creating utf8_t just puts the problem off; it doesn't really solve it.
I see it as merely easing the transition.
OK, if the long term plan is:
1) design and implement boost::string using UTF-8 doing all the things like code-point iteration, character iteration, convenience stuff like starts-with, ends-with, replace, trim, etc., etc. with as much backward compatibility with std::string as possible without hindering progress
2) try really hard to push it to the standard
then I'm on board with that.
Some of those could be problematic (I've run across references implying that 0x20 isn't the universal word-separation character, so trim would at least need some extra parameters), but for the most part, I'd agree with it. -- Chad Nelson Oak Circle Software, Inc. * * *
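A sketch of what such a parameterized trim might look like, with the set of "space" code points supplied by the caller instead of a hard-coded 0x20 (the range type is anything with random-access iterators, e.g. std::u32string; this is illustrative, not a proposed interface):

#include <algorithm>

template <typename CodePointRange, typename IsSpace>
CodePointRange trim(CodePointRange const & text, IsSpace is_space) {
    // Drop leading and trailing code points for which is_space() is true.
    auto first = std::find_if_not(text.begin(), text.end(), is_space);
    auto last  = std::find_if_not(text.rbegin(), text.rend(), is_space).base();
    return first < last ? CodePointRange(first, last) : CodePointRange();
}

// Usage, treating U+0020 and U+00A0 (no-break space) as trimmable:
//   trim(std::u32string(U" \u00A0hi "),
//        [](char32_t c) { return c == U' ' || c == U'\u00A0'; });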

OK, if the long term plan is:
1) design and implement boost::string using UTF-8 doing all the things like code-point iteration, character iteration, convenience stuff like starts-with, ends-with, replace, trim, etc., etc. with as much backward compatibility with std::string as possible without hindering progress
2) try really hard to push it to the standard
then I'm on board with that.
Some of those could be problematic (I've run across references implying that 0x20 isn't the universal word-separation character, so trim would at least need some extra parameters), but for the most part, I'd agree with it.
And it is also locale-dependent. Unicode defines three kinds of text segment boundaries: grapheme cluster, word, and sentence: http://unicode.org/reports/tr29 There are also line break boundaries defined: http://www.unicode.org/reports/tr14/ Most of these are locale-dependent, as they require the use of dictionaries. So unless you want to carry locale information in the string, I don't think it is good to put these into the string itself. Artyom

On Thu, 20 Jan 2011 09:33:02 -0500, Chad Nelson wrote:
On Thu, 20 Jan 2011 09:59:51 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
Do you see another way to provide those conversions, and automatic verification of proper UTF coding? (Automatic verification is a very good thing, without it someone won't use it or will forget to, and open up their programs to exploitation.)
Yes: implementing it in std::string in some future standard.
'Fraid that's a little beyond my current level of programming skill. ;-)
Besides the ugly name and the fact that it is a new class? No :)
If you can think of a more-acceptable-but-still-descriptive name for it, I'm all ears. :-)
I have an idea: what about boost::string, which could possibly become the next std::string in the future.
And string16 and string32? We'll have to support UTF-32, as the single-codepoint-per-element type, and UTF-16 (distasteful though it may be) is needed for Windows.
I imagine you wouldn't have UTF-16 and UTF-32 strings being passed about as a matter of course. For instance, a UTF-16 string should only be used just before calling a Windows API call. If this is the case, it makes sense to make the common case (the UTF-8 string) have a nice name like boost::string, while the others, which are used for special situations, can have something less snappy like boost::u16string and boost::u32string. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On 01/20/2011 07:43 AM, Alexander Lamaison wrote:
I imagine you wouldn't have UTF-16 and UTF-32 strings being passed about as a matter of course. For instance, a UTF-16 string should only be used just before calling a Windows API call.
If this is the case, it makes sense to make the common case (the UTF-8 string) have a nice name like boost::string, while the others, which are used for special situations, can have something less snappy like boost::u16string and boost::u32string.
What would you use for a regular string where you just had, essentially, a vector of char, wchar_t, char8_t, char16_t, char32_t, or unsigned char, but didn't care about encoding? I want to differentiate between this case and the case where I know that there's a particular encoding. A lot of times you just know you got a string from one system call and you're passing it to another, and you don't care about encoding. If you used a validating utf-8 std::string and it turned out not to be utf-8, you'd throw or return an error for nothing. It's not a good fit for existing things; it would just be something going forward for things that knew they were using utf-8. Patrick

On Thu, 20 Jan 2011 23:26:35 -0800, Patrick Horgan wrote:
On 01/20/2011 07:43 AM, Alexander Lamaison wrote:
I imagine you wouldn't have UTF-16 and UTF-32 strings being passed about as a matter of course. For instance, a UTF-16 string should only be used just before calling a Windows API call.
If this is the case, it makes sense to make the common case (the UTF-8 string) have a nice name like boost::string, while the others, which are used for special situations, can have something less snappy like boost::u16string and boost::u32string.
What would you use for a regular string where you just had, essentially a vector of char, wchar_t, char8_t, char16_t, char32_t, or unsigned char, but didn't care about encoding? I want to differentiate between this case and the case where I know that there's a particular encoding. A lot of times you just know you got a string from one system call and you're passing it to another and you don't care about encoding. [..]
Good point! boost::u8string then? Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Fri, Jan 21, 2011 at 10:37 AM, Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Thu, 20 Jan 2011 23:26:35 -0800, Patrick Horgan wrote:
On 01/20/2011 07:43 AM, Alexander Lamaison wrote:
I imagine you wouldn't have UTF-16 and UTF-32 strings being passed about as a matter of course. For instance, a UTF-16 string should only be used just before calling a Windows API call.
If this is the case, it makes sense to make the common case (the UTF-8 string) have a nice name like boost::string, while the others, which are used for special situations, can have something less snappy like boost::u16string and boost::u32string.
What would you use for a regular string where you just had, essentially a vector of char, wchar_t, char8_t, char16_t, char32_t, or unsigned char, but didn't care about encoding? I want to differentiate between this case and the case where I know that there's a particular encoding. A lot of times you just know you got a string from one system call and you're passing it to another and you don't care about encoding. [..]
Good point! boost::u8string then?
Why not boost::string (explicitly stating in the docs that it is UTF-8-based)? The name u8string suggests to me that it is meant for some special case of character encoding and that the (encoding-agnostic/native) std::string is still the way to go. IMO we should send the message that UTF-8 is "normal"/"(semi-)standard"/"de-facto-standard" and the other encodings like the native_t (or even ansi_t, ibm_cp_xyz_t, string16_t, string32_t, ...) are the special cases and they should be treated as such. Matus

On Fri, 21 Jan 2011 10:54:15 +0100, Matus Chochlik wrote:
On Fri, Jan 21, 2011 at 10:37 AM, Alexander Lamaison <awl03@doc.ic.ac.uk> wrote: On Thu, 20 Jan 2011 23:26:35 -0800, Patrick Horgan wrote:
On 01/20/2011 07:43 AM, Alexander Lamaison wrote:
I imagine you wouldn't have UTF-16 and UTF-32 strings being passed about as a matter of course. For instance, a UTF-16 string should only be used just before calling a Windows API call.
If this is the case, it makes sense to make the common case (the UTF-8 string) have a nice name like boost::string, while the others, which are used for special situations, can have something less snappy like boost::u16string and boost::u32string.
What would you use for a regular string where you just had, essentially a vector of char, wchar_t, char8_t, char16_t, char32_t, or unsigned char, but didn't care about encoding? I want to differentiate between this case and the case where I know that there's a particular encoding. A lot of times you just know you got a string from one system call and you're passing it to another and you don't care about encoding. [..]
Good point! boost::u8string then?
Why not boost::string (explicitly stating in the docs that it is UTF-8-based)? The name u8string suggests to me that it is meant for some special case of character encoding and that the (encoding-agnostic/native) std::string is still the way to go.
That was the idea, was it not? We should be encoding agnostic wherever possible.
IMO we should send the message that UTF-8 is "normal"/"(semi-)standard"/"de-facto-standard" and the other encodings like the native_t (or even ansi_t, ibm_cp_xyz_t, string16_t, string32_t, ...) are the special cases and they should be treated as such.
Why? When a string doesn't need to be converted, why force it to be? Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Fri, Jan 21, 2011 at 2:35 PM, Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
Why not boost::string (explicitly stating in the docs that it is UTF-8-based)? The name u8string suggests to me that it is meant for some special case of character encoding and that the (encoding-agnostic/native) std::string is still the way to go.
That was the idea, was it not? We should be encoding agnostic wherever possible.
If we can (globally) agree upon an encoding that will be able to handle all imaginable writing systems, will be robust, etc., etc., we *will* end up being encoding agnostic. Today, what is called 'encoding agnostic' causes many problems. For example, you save a file with a name containing non-ASCII characters, even if it is latin with some accents, on one version of Windows, and when you ship it to a machine with another version of Windows using a different encoding, the name becomes garbled. Same thing with applications that use text files to exchange information. Either you pick a single encoding and stick to that, or you use whatever the current platform's native encoding is and do the encoding detection and transcoding on demand, and usually you lose some information in the process. In both cases you have to transcode the text explicitly. I don't see (besides support for legacy SW/HW) why so many people are saying that this is OK.
IMO we should send the message that UTF-8 is "normal"/"(semi-)standard"/"de-facto-standard" and the other encodings like the native_t (or even ansi_t, ibm_cp_xyz_t, string16_t, string32_t, ...) are the special cases and they should be treated as such.
Why? When a string doesn't need to be converted, why force it to be?
Already on many platforms you won't have to do any transcoding, precisely because those platforms have already adopted a single encoding: UTF-8. I can't imagine why any new SW would choose anything besides Unicode for text representation, and to support legacy apps and/or hardware that accept commands or print output in a specific encoding there are tools like iconv. Matus

On 01/21/2011 01:54 AM, Matus Chochlik wrote:
... elision by patrick...
Why not boost::string (explicitly stating in the docs that it is UTF-8-based)? The name u8string suggests to me that it is meant for some special case of character encoding and that the (encoding-agnostic/native) std::string is still the way to go.
I think that's the truth. std::string has some performance guarantees that a utf-8 based string wouldn't be able to keep. std::string can do things, and people do things with std::string, that a utf-8 based string can't do. If you set LC_COLLATE to en_US.utf8 or the equivalent (I hate the way locale names are not as standardized as you might like), then most of the standard algorithms will be locale-aware and operations on your string will be largely aware of the string encoding. By switching locales, you can then operate on strings with other encodings. utf-8_string isn't intended to operate like that. It's specialized.
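A small sketch of that locale-aware behavior (it assumes the en_US.utf8 locale is installed; the std::locale constructor throws std::runtime_error if it isn't):

#include <locale>
#include <string>

// Compares two strings using the collation rules of the named locale.
bool less_in_locale(std::string const & a, std::string const & b) {
    std::locale loc("en_US.utf8");
    std::collate<char> const & coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size()) < 0;
}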
IMO we should send the message that UTF-8 is "normal"/"(semi-)standard"/"de-facto-standard" and the other encodings like the native_t (or even ansi_t, ibm_cp_xyz_t, string16_t, string32_t, ...) are the special cases and they should be treated as such.
Why would people want to lose so much of the functionality of std::string? The only advantage of a utf8_string would be automatic and continual verification that it's a valid utf-8 encoded string that otherwise acts as much as possible like a std::string. For that you would give up a lot of other functionality.
Patrick

On Sat, Jan 22, 2011 at 12:36 AM, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/21/2011 01:54 AM, Matus Chochlik wrote:
... elision by patrick...
Why not boost::string (explicitly stating in the docs that it is UTF-8-based)? The name u8string suggests to me that it is meant for some special case of character encoding and that the (encoding-agnostic/native) std::string is still the way to go.
I think that's the truth. std::string has some performance guarantees that a utf-8 based string wouldn't be able to keep. std::string can do things, and people do things with std::string that a utf-8 based string can't do.
If this was really the case then what you describe would be already happening on all the platforms that use the UTF-8 encoding by default for any locale.
If you set LC_COLLATE to en_US.utf8 or the equivalent (I hate the way locale names are not as standardized as you might like), then most of the standard algorithms will be locale-aware and operations on your string will be largely aware of the string encoding. By switching locales, you can then operate on strings with other encodings. utf-8_string isn't intended to operate like that. It's specialized.
IMO we should send the message that UTF-8 is "normal"/"(semi-)standard"/"de-facto-standard" and the other encodings like the native_t (or even ansi_t, ibm_cp_xyz_t, string16_t, string32_t, ...) are the special cases and they should be treated as such.
Why would people want to lose so much of the functionality of std::string?
What functionality would they lose, exactly? Again, on many platforms the default encoding for all (or nearly all) locales already is UTF-8, so if you get a string from the OS API and store it into a std::string, then it is UTF-8 encoded. I do an equal share of programming on Windows and Linux platforms and I have yet to run into these problems you describe on Linux, where for some time now the default encoding has been UTF-8. Actually, today I encounter more problems on Windows, where I can't set the locale to use UTF-8 and consequently have to transcode data from socket connections or files manually. If you are talking about being able to have indexed random access to "logical characters", for example on Windows with some special encodings, then this is only platform-specific and unportable functionality. What I propose is to extend the interface so that it would allow you to handle the "raw byte sequences" that are now used to represent strings of logical characters in a platform-independent way, by using the Unicode standard.
The only advantage of a utf8_string would be automatic and continual verification that it's a valid utf-8 encoded string that otherwise acts as much as possible like a std::string. For that you would give up a lot of other functionality.
Again, what exactly would you give up? The gain is not only what you describe, but also that, for example, when writing text into a file in a portable application and sending the file to a different machine with a different platform, you can read the string on that other machine without explicit transcoding (which means picking a library/tool that can do the transcoding and using it explicitly everywhere you potentially handle data that may come from different platforms). BR, Matus

On 01/22/2011 11:53 AM, Matus Chochlik wrote:
On Sat, Jan 22, 2011 at 12:36 AM, Patrick Horgan<phorgan1@gmail.com> wrote:
On 01/21/2011 01:54 AM, Matus Chochlik wrote:
... elision by patrick... Why not boost::string (explicitly stating in the docs that it is UTF-8-based)? The name u8string suggests to me that it is meant for some special case of character encoding and that the (encoding-agnostic/native) std::string is still the way to go.
I think that's the truth. std::string has some performance guarantees that a utf-8 based string wouldn't be able to keep. std::string can do things, and people do things with std::string, that a utf-8 based string can't do.
If this was really the case then what you describe would be already happening on all the platforms that use the UTF-8 encoding by default for any locale.
No. They're using std::string. It works just fine for this as it does for other things. Its performance guarantees are in respect to the templated data type, not in terms of the encoding of the contents. std::string lets you walk through a JIS string and decode it. A utf-8 string would hurl chunks since, of course, it wouldn't be encoded as utf-8. I could go on and on, but perhaps if you'd refresh yourself on the interface of std::string and think about the implications on it if you had a validating utf-8 string, you'd see. I'm really in favor of a utf-8 string, I just wouldn't call it string, because that would be a lie. It wouldn't be a general string, but a special case of string.
... elision by me. What functionality would they lose, exactly? Again, on many platforms the default encoding for all (or nearly all) locales already is UTF-8, so if you get a string from the OS API and store it into a std::string, then it is UTF-8 encoded. I do an equal share of programming on Windows and Linux platforms and I have yet to run into these problems you describe on Linux, where for some time now the default encoding has been UTF-8. Actually, today I encounter more problems on Windows, where I can't set the locale to use UTF-8 and consequently have to transcode data from socket connections or files manually.
I never said that having utf-8 encoded characters in a std::string would cause you some kind of problems. I don't think I even said anything that would let you infer that. You're completely off base here. I was talking about a string specialized for utf-8 encoded characters. You're chasing a red herring. You're stalking a strawman target. I agree with you entirely: std::string does a great job of holding utf-8 encoded characters, as well as many other things.
If you are talking about being able to have indexed random-access to "logical characters" for example on Windows with some special encodings, then this is only a platform-specific and unportable functionality. What I propose, is to extend the interface so that it would allow you handle the "raw-byte-sequences" that are now used to represent strings of logical characters in a platform independent way by using the Unicode standard.
That's nice. I vote for that idea. Just don't call it std::string, because it won't be, and you won't be able to do everything with it that std::string does today.
The only advantage of a utf8_string would be automatic and continual verification that it's a valid utf-8 encoded string that otherwise acts as much as possible like a std::string. For that you would give up a lot of other functionality.
Again, what exactly would you give up? The gain is not only what you describe, but also that, for example, when writing text into a file in a portable application and sending the file to a different machine with a different platform, you can read the string on that other machine without explicit transcoding (which means picking a library/tool that can do the transcoding and using it explicitly everywhere you potentially handle data that may come from different platforms).
That's an advantage of a utf-8 encoded file. You don't need a special string type to write to that. Before writing to the file you can hold the data in memory in a std::string or a C string, or a chunk of mmap'd memory, and if they contain data encoded in utf-8 you have the same advantage.
There's a great advantage to a utf-8_string, in that as long as it always validates, you never have to check the data again for correctness. None of the routines with interfaces written in terms of it would have to do any validating for utf-8 encoding correctness and could worry solely about their own missions. I _would_ like to be able to choose between two types of behavior, because sometimes I would want it to throw on an invalid sequence, but at other times I'd like it to substitute an indicator of an invalid character, so I don't throw out the baby with the bath water. Both of these types of reactions are talked about in the utf-8 spec.
A string specialized to only hold utf-8 encoded data wouldn't be any good to someone not using utf-8 encoding. Even if they were using 7-bit ascii for a network application, like ftp, for example, they'd have to pay the penalty of validating that the 7-bit ascii was valid utf-8. If they're using it as a container to carry around encrypted strings, well, that wouldn't be possible at all. If a system call returned a name valid in that operating system that I would later pass to another system call, and it wasn't utf-8, what could I do? Break? Or corrupt the string? utf-8 encoding is useful, but it's not the majority of the text used in the world. Many applications in the world are not internationalized at all and never will be. They are written for one specific place on earth and that's all they want to do. In twenty years you'll see just as many applications using EUC as today. Shoot, COBOL hasn't gone away.
I like to use utf-8 as my interface to the world, for, just as you said, a portable file, or a web page. I don't use it internally in my code unless I'm just carrying something from one external interface to another. I would LOVE to have a validating utf-8_string. It would be really useful in a web app. Vayo con Diós, Patrick
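A sketch of the two recovery policies Patrick describes, over raw bytes; it checks only lead/continuation structure (a production decoder would also reject overlong forms and surrogate code points):

#include <stdexcept>
#include <string>

enum class on_error { throw_, replace };

std::u32string decode_utf8(std::string const & in, on_error policy) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char b = in[i];
        int len = b < 0x80 ? 1 : (b >> 5) == 0x6 ? 2
                : (b >> 4) == 0xE ? 3 : (b >> 3) == 0x1E ? 4 : 0;
        bool ok = len != 0 && i + len <= in.size();
        char32_t cp = ok ? (len == 1 ? b : b & (0x7F >> len)) : 0;
        for (int k = 1; ok && k < len; ++k) {
            unsigned char c = in[i + k];
            ok = (c & 0xC0) == 0x80;       // must be a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out += cp; i += len; }
        else if (policy == on_error::throw_)
            throw std::runtime_error("malformed UTF-8");
        else { out += U'\uFFFD'; ++i; }    // substitute REPLACEMENT CHARACTER
    }
    return out;
}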

On Sun, Jan 23, 2011 at 6:42 PM, Patrick Horgan <phorgan1@gmail.com> wrote: [snip/]
No. They're using std::string. It works just fine for this as it does for other things. Its performance guarantees are in respect to the templated data type, not in terms of the encoding of the contents. std::string lets you walk through a JIS string and decode it. A utf-8 string would hurl chunks since, of course, it wouldn't be encoded as utf-8. I could go on and on, but perhaps if you'd refresh yourself on the interface of std::string and think about the implications on it if you had a validating utf-8 string, you'd see. I'm really in favor of a utf-8 string, I just wouldn't call it string, because that would be a lie. It wouldn't be a general string, but a special case of string.
This whole debate is, at least for me, about what std::string is and what we want it to be:
a) a little more than a glorified std::vector<char> with a few extra operations for more convenient handling of the byte sequences stored inside, which can currently be interpreted in dozens if not hundreds of ways, depending on the current platform, the default or explicitly selected locale+encoding, etc., etc.
b) a container of byte sequences that represent human-readable text, where every single sequence (provided it is valid) can be translated into exactly one sequence of "logical" characters of the said text by a standardized mapping, but which also provides operations for handling the text character-by-character, not only byte-by-byte (portably). If the application wishes so, it can still treat the string only as the byte sequence, because this is of course a valid usage.
... elision by me. What functionality would they lose, exactly? Again, on many platforms the default encoding for all (or nearly all) locales already is UTF-8, so if you get a string from the OS API and store it into a std::string, then it is UTF-8 encoded. I do an equal share of programming on Windows and Linux platforms and I have yet to run into these problems you describe on Linux, where for some time now the default encoding has been UTF-8. Actually, today I encounter more problems on Windows, where I can't set the locale to use UTF-8 and consequently have to transcode data from socket connections or files manually.
I never said that having utf-8 encoded characters in a std::string would cause you some kind of problems. I don't think I even said anything that would let you infer that. You're completely off base here. I was talking about a string specialized for utf-8 encoded characters. You're chasing a red herring. You're stalking a strawman target. I agree with you entirely: std::string does a great job of holding utf-8 encoded characters, as well as many other things.
Why would people want to lose so much of the functionality of std::string? [/quote]
OK, [OT] I was referring to the [quote] part. I meant no offense; I merely said that I have yet to run into any problems (losing much of the functionality) on platforms where std::string is used to hold byte sequences in the UTF-8 encoding. I certainly don't need to be right or to have everyone agree with me; this is a discussion and I gladly let myself be educated by people who know more about the issue at hand than I do. [/OT]
If you are talking about being able to have indexed random-access to "logical characters" for example on Windows with some special encodings, then this is only a platform-specific and unportable functionality. What I propose, is to extend the interface so that it would allow you handle the "raw-byte-sequences" that are now used to represent strings of logical characters in a platform independent way by using the Unicode standard.
That's nice. I vote for that idea. Just don't call it std::string, because it won't be, and you won't be able to do everything with it that std::string does today.
Would you care to elaborate on what functionality you would lose? Even random access to individual characters could be implemented. Of course this would break the existing performance guarantees, which are however granted only on platforms that use std::string for single-byte encodings. It could also employ some caching mechanism to speed things up, but this is just an implementation detail and of course has its trade-offs. But if this helps us to (slowly) get rid of the necessity to handle various encodings that are relics from an age where every single byte of memory and every processor tick was a precious resource, then I am all for it. I imagine that the folks at the Unicode consortium have worked for the past 20+ years on the standard not only to create yet another encoding that would complement and live happily ever after with all the others, but to replace them eventually. Having said that, I *do not* want to "ban" or prevent anyone from using specific encodings where it is necessary or advantageous, but such usage should be considered a special case and not general usage as it is now. Many database systems, web browsers, web-content-creation tools, xml editors, etc., etc. consider UTF-8 to be the default, and yes, they let you work with other encodings, but as a special case. [snip/]
That's an advantage of a utf-8 encoded file. You don't need a special string type to write to that. Before writing to the file you can hold the data in memory in a std::string or a C string, or a chunk of mmap'd memory today, and if they contain data encoded in utf-8 you have the same advantage.
That is not only what I can do, but also what I already do, and I'm not very happy with the results, because if std::string is now expected by the OS/libraries to use a platform-specific encoding, then these two do not play together very well, unless you (of course) transcode them explicitly. I rarely use a memory-mapped file as a whole without trying to parse it and use the data, for example in a GUI. [snip/]
A string specialized to only hold utf-8 encoded data wouldn't be any good to someone not using the utf-8 encoding. Even if they were using 7-bit ASCII for a network application, like ftp, for example, they'd have to pay the penalty of validating that the 7-bit ASCII was valid utf-8. If they're using it as a container to carry around encrypted strings, well, that wouldn't be possible at all.
Let us have a "special_encoding_string" where we need to handle the legacy encodings ...
If a system call returned a name valid in that operating system that I would later pass to another system call, if it wasn't utf-8 what could I do? Break? Or corrupt the string?
... and a native_encoding_string or even let's use vector<char> for these two (they are valid, but IMO *special*) use-cases. [snip/]
Vaya con Dios, hasta la vista :)
Matus

On Thu, Jan 20, 2011 at 3:33 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
Besides the ugly name and the fact that it is a new class? No :)
If you can think of a more-acceptable-but-still-descriptive name for it, I'm all ears. :-)
I have an idea: what about boost::string, which could possibly become the next std::string in the future.
And string16 and string32? We'll have to support UTF-32, as the single-codepoint-per-element type, and UTF-16 (distasteful though it may be) is needed for Windows.
Or are you suggesting the utf* types in addition to the boost::string type? If so, I believe the idea has merit.
If boost::string uses utf-8 by default and I will be able to run sed 's/boost::string/std::string/g' over all my sources at some point in the distant future without breaking them (completely), we can have string16, string32, string_ucs2, string_ucs4, etc. for all I care :-). I am not against alternative string representations and encodings, but I would finally like to see a string class which I can, for example, write to a file on a Windows machine with cp1250 and open on Linux with utf-8 without doing explicit transcoding, which allows true code-point and character iteration, which supports the essential algorithms (which ones is open for debate), and which I can use as a type for parameters of my functions and member variables, etc.
OK, if the long term plan is:
1) design and implement boost::string using UTF-8 doing all the things like code-point iteration, character iteration, convenience stuff like starts-with, ends-with, replace, trim, etc., etc. with as much backward compatibility with std::string as possible without hindering progress
2) try really hard to push it to the standard
then I'm on board with that.
Some of those could be problematic (I've run across references implying that 0x20 isn't the universal word-separation character, so trim would at least need some extra parameters), but for the most part, I'd agree with it.
This is *exactly* why I would like to see them in a standard string (or string manipulation library), designed and implemented by true experts and not reinvented by an "expert" like me :) Matus

On 01/20/2011 06:33 AM, Chad Nelson wrote:
... elision by patrick ... "And string16 and string32? We'll have to support UTF-32, as the single-codepoint-per-element type, and UTF-16 (distasteful though it may be) is needed for Windows." basic_string is already templated on the character type, character traits, and allocator, no? So if you used basic_string with char16_t or char32_t (real types in current C++ specs) you would get them already. You just wouldn't know anything about encoding. "Or are you suggesting the utf* types in addition to the boost::string type? If so, I believe the idea has merit." Yay! +1.72
Patrick

At Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
The string-encoding-related discussion boils down for me to the following: What will string handling in C++ look like in the (maybe not immediate) future.
*Scenario A:*
We will pick a widely-accepted char-based encoding that is able to handle all the writing scripts and alphabets that we can think of, has enough reserved space for future additions or is easily extensible and use that with std::strings which will become the one and only text string 'container' class.
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
*Scenario B:*
We will add yet another string class named utf8_t to the already crowded set named above. Then:
library a: will stick to the ANSI encodings with std::strings. It has worked in the past, it will work in the future, right?
library b[oost]: will use utf8_t instead and provide the (seamless and straightforward) conversions between utf8_t and std::string and std::wstring. Some (many but not all) others will follow
library c: will use std::strings with utf-8 ...
library [.]n[et]: will use String class ...
library q[t]: will use Qstrings ...
library w[xWidgets]: will use wxStrings and wxChar* ...
library wi[napi]: will use TCHAR* ...
library z: will use const char* in an encoding-agnostic way
Now an application using libraries [a..z] will become the developer's nightmare. What string should he use for the class members, constructor parameters, and what to do when the conversions do not work so seamlessly?
Also half of the cpu time assigned to running that application will be wasted on useless string transcoding. And half of the memory will be occupied with useless transcoding-related code and data.
*Scenario C:*
This is basically the status quo; a mix of the above. A sad and unsatisfactory state of things.
*Scenario D:* We try for scenario A and people still use Qstrings, wxStrings, etc.
*Scenario E:* We add another string class and everyone adopts it.
-- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Wed, Jan 19, 2011 at 5:26 PM, Dave Abrahams <dave@boostpro.com> wrote:
At Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
*Scenario D:* We try for scenario A. and people still use Qstrings, wxStrings, etc.
'I think maybe you underestimate our influence.' :)
*Scenario E:* We add another string class and everyone adopts it
Ok, I admit that this is possible. But let me ask: How did the C world make the transition without abandoning char? BR, Matus

On Wed, 19 Jan 2011 17:34:27 +0100, Matus Chochlik wrote:
On Wed, Jan 19, 2011 at 5:26 PM, Dave Abrahams <dave@boostpro.com> wrote:
At Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
*Scenario D:* We try for scenario A. and people still use Qstrings, wxStrings, etc.
'I think maybe you underestimate our influence.' :)
*Scenario E:* We add another string class and everyone adopts it
Ok, I admit that this is possible. But let me ask: How did the C world make the transition without abandoning char?
They made the transition? I must have missed this. The Windows API _is_ C and has all the problems we've been talking about. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Wed, 19 Jan 2011 18:10:06 +0100, Matus Chochlik wrote:
They made the transition? I must have missed this.
The Windows API _is_ C and has all the problems we've been talking about.
OK, besides Microsoft C world :)
By changing the OS-default encoding to assume char* string was UTF-8. Same as for C++. This whole issue is about how to accommodate OSes that don't make that assumption. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Wed, Jan 19, 2011 at 6:21 PM, Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Wed, 19 Jan 2011 18:10:06 +0100, Matus Chochlik wrote:
They made the transition? I must have missed this.
The Windows API _is_ C and has all the problems we've been talking about.
OK, besides Microsoft C world :)
By changing the OS-default encoding to assume char* string was UTF-8. Same as for C++. This whole issue is about how to accommodate OSes that don't make that assumption.
Agreed, again: if Microsoft could move to UTF-8 by default for the various locales instead of the current encodings, then this whole discussion would be moot. For the time being we would need to do something like this even if a complete transcoding is not possible:

    std::string filepath(get_path_in_utf8());
    std::fstream file(utf8_to_locale_encoding(filepath));

everywhere the implementation (STL, etc.) expects native encoding. This is the ugliest part of the whole transition. Boost could hide this completely by using the wide-char interfaces and doing CreateFileW(utf8_to_winapi_wide(filepath), ...). It also could be an opportunity for alternate implementations of the STL which would handle it transparently. Matus
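[Editorial note: for what it's worth, a minimal sketch of what the utf8_to_winapi_wide helper mentioned above could look like. The helper name is Matus's placeholder; the Win32 call is real, but error handling is omitted here:]

    #include <string>
    #include <windows.h>

    // Convert a UTF-8 encoded std::string to the UTF-16 wstring that the
    // wide WINAPI calls expect. Assumes valid UTF-8 input; a real version
    // would check the return values of MultiByteToWideChar.
    std::wstring utf8_to_winapi_wide(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                        (int)utf8.size(), NULL, 0); // query size
        std::wstring wide(len, L'\0');
        ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                              (int)utf8.size(), &wide[0], len);     // convert
        return wide;
    }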

On Wed, 19 Jan 2011 18:41:22 +0100, Matus Chochlik wrote:
On Wed, Jan 19, 2011 at 6:21 PM, Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
On Wed, 19 Jan 2011 18:10:06 +0100, Matus Chochlik wrote:
They made the transition? I must have missed this.
The Windows API _is_ C and has all the problems we've been talking about.
OK, besides Microsoft C world :)
By changing the OS-default encoding to assume char* string was UTF-8. Same as for C++. This whole issue is about how to accommodate OSes that don't make that assumption.
Agreed, again if Microsoft could move by default to UTF-8 for the various locales instead of using the current encodings then this whole discussion would be moot.
For the time being we would need to do something like this even if a complete transcoding is not possible:
std::string filepath(get_path_in_utf8()); std::fstream file(utf8_to_locale_encoding(filepath));
everywhere the implementation (STL, etc.) expects native encoding. This is the ugliest part of the whole transition. Boost could hide this completely by using the wide-char interfaces and doing CreateFileW(utf8_to_winapi_wide(filepath), ...).
It also could be an opportunity for alternate implementations of STL which would handle it transparently.
Hmmmm ... I'm starting to come round to your std::string == UTF-8 point of view. The one thing that would still annoy me is that std::string's interface was clearly designed for single-byte == single-character/codepoint/whatever operation. I don't suppose anyone will be adding .begin_character()/.end_character() methods to std::string any time soon. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
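[Editorial note: not the std::string members Alex asks for, but as a sketch, code-point stepping over the raw bytes can already be written as a free function today. The name is mine; it assumes valid UTF-8 and steps by code point, not by the combining sequences discussed later in the thread:]

    #include <string>

    // Advance from one UTF-8 code point boundary to the next by skipping
    // continuation bytes (those of the form 10xxxxxx).
    std::string::const_iterator next_character(std::string::const_iterator it,
                                               std::string::const_iterator end)
    {
        if (it != end)
            do { ++it; }
            while (it != end && (static_cast<unsigned char>(*it) & 0xC0) == 0x80);
        return it;
    }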

On Wed, Jan 19, 2011 at 7:54 PM, Alexander Lamaison <awl03@doc.ic.ac.uk> wrote:
Agreed, again if Microsoft could move by default to UTF-8 for the various locales instead of using the current encodings then this whole discussion would be moot.
For the time being we would need to do something like this even if a complete transcoding is not possible:
std::string filepath(get_path_in_utf8()); std::fstream file(utf8_to_locale_encoding(filepath));
everywhere the implementation (STL, etc.) expects native encoding. This is the ugliest part of the whole transition. Boost could hide this completely by using the wide-char interfaces and doing CreateFileW(utf8_to_winapi_wide(filepath), ...).
It also could be an opportunity for alternate implementations of STL which would handle it transparently.
Hmmmm ... I'm starting to come round to your std::string == UTF-8 point-of view.
The one thing that would still annoy me is that std::string's interface was clearly designed for single-byte == single-character/codepoint/whatever operation. I don't suppose anyone will be adding .begin_character()/.end_character() methods to std::string any time soon.
This is where the (Boost.)Locale and (Boost.)Unicode libraries could provide insight into how to extend the std::string interface, or be the testbed for new additions to the standard library related to string manipulation. (Provided the standard adopts UTF-8 as a native encoding. Or does it already?) Matus

On Wed, Jan 19, 2011 at 18:33, Peter Dimov <pdimov@pdimov.com> wrote:
This was the prevailing thinking once. First this number of bits was 16, an incorrect assumption that claimed Microsoft and Java as victims; then it became 21 (or 22?). Eventually, people realized that this will never happen even if we allocate 32 bits per character, so here we are.
This is one more advantage of UTF-8 over UTF-16 and UTF-32. UTF-8 bit patterns can be extended indefinitely, even for 256 bit code-points. :-)
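[Editorial note: for reference, a sketch of the lead-byte patterns behind this claim. The 5- and 6-byte forms were part of the original design and were later disallowed by RFC 3629, which is why the scheme extends so naturally:]

    lead byte     sequence length   payload bits
    0xxxxxxx      1 byte             7
    110xxxxx      2 bytes           11
    1110xxxx      3 bytes           16
    11110xxx      4 bytes           21
    111110xx      5 bytes           26  (disallowed since RFC 3629)
    1111110x      6 bytes           31  (disallowed since RFC 3629)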

On 01/19/2011 11:15 AM, Matus Chochlik wrote:
...elision by patrick... This is where the (Boost.)Locale and (Boost.)Unicode libraries could provide insight into how to extend the std::string interface or be the testbed for new additions to the standard library related to string manipulation. (Provided the standard adopts UTF-8 as a native encoding. Or does it already?)
In recent C++ specs you can specify u8"a string to be considered encoded in utf-8". If a wide char string literal is next to a u8 string literal they don't concatenate, it's an error. There's of course all the locale stuff including codecvt_utf8. A byte must be large enough to hold an 8 bit utf-8 code unit. That's all that's in the C++ spec so far. Patrick
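[Editorial note: put together, a sketch of the pieces Patrick lists, assuming a C++0x compiler that already ships u8 literals and the <codecvt> facets:]

    #include <codecvt>
    #include <locale>
    #include <string>

    int main()
    {
        // A u8 literal is guaranteed to be UTF-8 encoded:
        std::string s = u8"a string to be considered encoded in utf-8";

        // std::wstring w = L"wide" u8"narrow"; // ill-formed: wide and u8
        // string literals do not concatenate, as Patrick says.

        // codecvt_utf8 converts between wide characters and UTF-8 bytes:
        std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
        std::wstring w = conv.from_bytes(s);
        (void)w;
        return 0;
    }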

Alexander Lamaison wrote:
By changing the OS-default encoding to assume char* string was UTF-8.
You keep talking about "OS-default encoding", but there's no such thing. POSIX operating systems do not assume anything about the encoding of char* (*). You have a global locale (**) in C/C++ programs, and the user can control it via environment variables (unless the program changes it), but the OS itself does not. (*) Except Mac OS X, which requires UTF-8 for paths. (**) Actually, you have two global locales - C and C++, not necessarily in sync with each other.

On Wed, 19 Jan 2011 19:42:25 +0200, Peter Dimov wrote:
Alexander Lamaison wrote:
By changing the OS-default encoding to assume char* string was UTF-8.
You keep talking about "OS-default encoding", but there's no such thing. POSIX operating systems do not assume anything about the encoding of char*
I was under the impression that Linux changed from interpreting char* as being in a multitude of different encodings to being in UTF-8 by default. Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Alexander Lamaison wrote:
I was under the impression that Linux changed from interpreting char* as being in a multitude of different encodings to being in UTF-8 by default.
Well, it probably depends on what part of Linux we're talking to, but most of the functions do not interpret char* as being in any encoding, neither do they have a default. They just treat it as a byte sequence.

Peter Dimov wrote:
Alexander Lamaison wrote:
I was under the impression that Linux changed from interpreting char* as being in a multitude of different encodings to being in UTF-8 by default.
Well, it probably depends on what part of Linux we're talking to, but most of the functions do not interpret char* as being in any encoding, neither do they have a default. They just treat it as a byte sequence.
hmmm - that's what I always considered std::string to be. There's no notion of locale in there. I'm still not seeing why we can't continue to consider std::string just a sequence of bytes with some extra sauce... and make a new class utf8_string, derived from std::string, which includes a code point iterator and a function to return a utf8 "character or codepoint or whatever it is". I just can't see anything wrong with this. It doesn't redefine the semantics (formal, intuitive, common usage) of std::string; utf8_string would let one use the special unicode sauces when needed. And it could be implicitly converted to std::string when passed as a function argument. Finally, given the history of this, I don't believe utf8 is the "end of the road". It still leaves open the possibility of the next greatest thing - whatever that turns out to be. To summarize:
std::string - a sequence of bytes
utf8_string - a sequence of "code points" implemented in terms of std::string (or at least convertible to std::string)
Robert Ramey
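[Editorial note: a minimal sketch of the utf8_string Robert outlines, here implemented in terms of std::string with an implicit conversion standing in for the derivation he suggests. The decoding details and the char32_t result type are assumptions of mine; see also Patrick's question below about what *iter should return:]

    #include <string>

    // Sketch: a UTF-8 string built on std::string, with a code point
    // iterator. Assumes the stored bytes are valid UTF-8 throughout.
    class utf8_string
    {
        std::string storage;
    public:
        explicit utf8_string(const std::string& s) : storage(s)
        { /* a real class would validate s here */ }

        operator const std::string&() const { return storage; }

        class const_iterator
        {
            std::string::const_iterator pos, end_;
        public:
            const_iterator(std::string::const_iterator p,
                           std::string::const_iterator e) : pos(p), end_(e) {}

            char32_t operator*() const // decode the code point starting at pos
            {
                static const unsigned char mask[4] = { 0x7F, 0x1F, 0x0F, 0x07 };
                unsigned char c = static_cast<unsigned char>(*pos);
                int extra = c < 0x80 ? 0 : c < 0xE0 ? 1 : c < 0xF0 ? 2 : 3;
                char32_t cp = c & mask[extra];
                std::string::const_iterator p = pos;
                for (int i = 0; i < extra; ++i)
                    cp = (cp << 6) | (static_cast<unsigned char>(*++p) & 0x3F);
                return cp;
            }
            const_iterator& operator++() // skip to the next lead byte
            {
                if (pos != end_)
                    do { ++pos; }
                    while (pos != end_ &&
                           (static_cast<unsigned char>(*pos) & 0xC0) == 0x80);
                return *this;
            }
            bool operator!=(const const_iterator& o) const { return pos != o.pos; }
        };

        const_iterator begin() const
        { return const_iterator(storage.begin(), storage.end()); }
        const_iterator end() const
        { return const_iterator(storage.end(), storage.end()); }
    };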

On 01/19/2011 12:56 PM, Robert Ramey wrote:
... elision by patrick ... std::string - a sequence of bytes utf8_string - a sequence of "code points" implemented in terms of std::string. With the ability to specify a conversion facet to convert from your local encoding to utf-8. The string would still validate the utf-8 received from the conversion facet.
What do you do about things that can validly be represented by one character, or by a basic character with one or more combining characters? For example, Ü can be represented by U+00DC, a capital U with diaeresis, or by the combining sequence U+0055 U+0308, a U and a combining diaeresis. Ü <- that one is done with combining characters and the previous one is just one character. The spec says that these must be considered absolutely equivalent. Will our utf8_string class always choose one representation over another? Certainly to make choices like this you'd need the character database from Unicode. So, if you're iterating the utf8_string with an iterator iter, what type does *iter return? It could _consume_ a lot of bytes. Is it a char32_t with the character in it, or is it another utf8_string with only one character in it? I'd say char32_t because that can hold anything in UCS. So then what about *iter = thechar: what type or types can thechar be? char32_t, char16_t, wchar_t, char, unsigned char, int, int32_t, a utf8_string with only one "character" to be copied in, or a utf8_string where we'll just take the first char? I'd probably use char32_t in both those cases. Food for thought. I agree I'd like to see it be derived from std::string so you can pass it to things that expect a std::string and don't care so much about encoding. Patrick
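[Editorial note: for concreteness, the two spellings of Ü as actual UTF-8 bytes; my illustration, with the byte values following from the encoding rules above:]

    #include <cassert>
    #include <string>

    int main()
    {
        const std::string precomposed("\xC3\x9C"); // U+00DC, one code point
        const std::string decomposed("U\xCC\x88"); // U+0055 U+0308, two code points
        // Canonically equivalent per Unicode, yet bytewise different, so a
        // naive std::string comparison says they are unequal:
        assert(precomposed != decomposed);
    }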

On Wed, Jan 19, 2011 at 2:42 PM, Alexander Lamaison <awl03@doc.ic.ac.uk>wrote:
On Wed, 19 Jan 2011 19:42:25 +0200, Peter Dimov wrote:
Alexander Lamaison wrote:
By changing the OS-default encoding to assume char* string was UTF-8.
You keep talking about "OS-default encoding", but there's no such thing. POSIX operating systems do not assume anything about the encoding of char*
I was under the impression that Linux changed from interpreting char* as being in a multitude of different encodings to being in UTF-8 by default.
Peter is correct, with a slight editorial clarification: "POSIX operating systems [API's] do not assume anything about the encoding of char*". When I was designing Boost Filesystem V3, several of the POSIX liaison folks were kind enough to confirm this with me in person. Some of the shell utilities do have notions of encoding, but not the API's. Linux is a "POSIX-like" operating system, but there are places where it deviates from the POSIX spec. So because Linux does something a certain way doesn't necessarily mean that POSIX specifies it that way. --Beman

On Wed, Jan 19, 2011 at 07:42:44PM +0000, Alexander Lamaison wrote:
I was under the impression that Linux changed from interpreting char* as being in a multitude of different encodings to being in UTF-8 by default.
What you might be thinking of is that most modern Linux distributions set the default locale to include UTF-8 encoding (usually en_US.UTF-8). -- 1.21 Jiggabytes of memory should be enough for anybody.

At Wed, 19 Jan 2011 17:34:27 +0100, Matus Chochlik wrote:
On Wed, Jan 19, 2011 at 5:26 PM, Dave Abrahams <dave@boostpro.com> wrote:
At Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
*Scenario D:* We try for scenario A. and people still use Qstrings, wxStrings, etc.
'I think maybe you underestimate our influence.' :)
Our influence, if we introduce new library components, is very great, because they're on a de-facto fast track to standardization, and an improved string library is exactly the sort of thing that would be adopted upstream. If we simply agree to a programming convention, that will have some impact, but much less.
*Scenario E:* We add another string class and everyone adopts it
Ok, I admit that this is possible. But let me ask: How did the C world make the transition without abandoning char?
The transition from what to what? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Wed, Jan 19, 2011 at 7:39 PM, Dave Abrahams <dave@boostpro.com> wrote:
Our influence, if we introduce new library components, is very great, because they're on a de-facto fast track to standardization, and an improved string library is exactly the sort of thing that would be adopted upstream. If we simply agree to a programming convention, that will have some impact, but much less.
OK, I see. But is there any chance that the standard itself would be updated so that it would first recommend using UTF-8 with C++ strings? After some period of time all other encodings would be deprecated and using them would cause undefined behavior. Could Boost be the driving force here? I really see all the obstacles that prevent us from just switching to UTF-8, but adding a new string class will not help, for the same reasons adding wstring did not help. As I already said elsewhere, I think that this is a problem that has to be solved "organizationally".
*Scenario E:* We add another string class and everyone adopts it
Ok, I admit that this is possible. But let me ask: How did the C world make the transition without abandoning char?
The transition from what to what?
I meant that, for example on POSIX OSes, the POSIX C API did not have to be changed or extended by a new set of functions doing the same things but using a new character type when they switched from the old encodings to UTF-8. To compare two strings you still use strcmp and not utf8strcmp; to collate strings you use strcoll and not utf8strcoll, etc. I must admit that the previous statement is an oversimplification and that things also rely on the C/C++ locale, etc.
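[Editorial note: a tiny illustration of Matus's point, under the assumption that an en_US.UTF-8 locale is installed. The pre-Unicode C functions operate on UTF-8 data unchanged: strcmp compares bytes, strcoll collates according to the active locale:]

    #include <clocale>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        // Assumes this locale exists on the system; returns NULL otherwise.
        std::setlocale(LC_COLLATE, "en_US.UTF-8");
        const char* a = "\xC3\xA9tude"; // "étude", UTF-8 encoded
        const char* b = "etude";
        std::printf("strcmp: %d, strcoll: %d\n",
                    std::strcmp(a, b), std::strcoll(a, b));
        return 0;
    }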
-- ________________ ::matus_chochlik

At Wed, 19 Jan 2011 20:03:59 +0100, Matus Chochlik wrote:
On Wed, Jan 19, 2011 at 7:39 PM, Dave Abrahams <dave@boostpro.com> wrote:
Our influence, if we introduce new library components, is very great, because they're on a de-facto fast track to standardization, and an improved string library is exactly the sort of thing that would be adopted upstream. If we simply agree to a programming convention, that will have some impact, but much less.
OK, I see. But, is there any chance that the standard itself would be updated so that it first would recommend to use UTF-8 with C++ strings.
Well, never say "never," but... never. Such recommendations are not part of the standard's mission. It doesn't do things like that.
After some period of time all other encodings would be deprecated
By whom?
and using them would cause undefined behavior. Could Boost be the driving force here?
This doesn't seem like a very plausible scenario to me, based on my experience. Of course, others may disagree.
I really see all the obstacles that prevent us from just switching to UTF-8, but adding a new string class will not help for the same reasons adding wstring did not help.
I don't see the parallel at all. wstring is just another container of bytes, for all practical purposes. It doesn't imply any particular encoding, and does nothing to segregate the encoded from the raw.
As I already said elsewhere I think that this is a problem that has to be solved "organizationally".
Perhaps. The type system is one of our organizational tools, and Boost has an impact insofar as it produces components that people use, so if we aren't able to produce some flagship library components that help with the solution, we have little traction.
*Scenario E:* We add another string class and everyone adopts it
Ok, I admit that this is possible. But let me ask: How did the C world make the transition without abandoning char?
The transition from what to what?
I meant that for example on POSIX OSes the POSIX C-API did not have to be changed or extended by a new set of functions doing the same things, but using a new character type, when they switched from the old encodings to UTF-8.
...and people still have the problem that they lose track of what's "raw" and what's encoded as utf-8.
To compare two strings you still use strcmp and not utf8strcmp; to collate strings you use strcoll and not utf8strcoll, etc.
Yeah... but surely POSIX's strcmp only tells you whether the two strings have the same sequence of code points, not whether they have the same characters, right? And if you inadvertently compare a "raw" string with an equivalent utf-8-encoded string, what happens? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Wed, Jan 19, 2011 at 8:28 PM, Dave Abrahams <dave@boostpro.com> wrote:
OK, I see. But, is there any chance that the standard itself would be updated so that it first would recommend to use UTF-8 with C++ strings.
Well, never say "never," but... never. Such recommendations are not part of the standard's mission. It doesn't do things like that.
My view of what the standardizing committee is willing to do may be naive, but generally I don't see why this could not be done. Other major languages (Java, C#, etc.) picked a single "standard" encoding, and in those languages you treat text with other encodings as a special case. If C++ recommended the use of UTF-8, this would probably kickstart the OS and compiler vendors to follow, or at least to fix their implementations of the standard library and the OS APIs to accept UTF-8 by default (if we agree that this is a good idea).
After some period of time all other encodings would be deprecated
By whom?
By the same committee that made the recommendation in the first place.
I really see all the obstacles that prevent us from just switching to UTF-8, but adding a new string class will not help for the same reasons adding wstring did not help.
I don't see the parallel at all. wstring is just another container of bytes, for all practical purposes. It doesn't imply any particular encoding, and does nothing to segregate the encoded from the raw.
Maybe wstring is not officially UTF-16 or UTF-32 or UCS, but on most platforms it is at least treated as "the unicode string", regardless of this being a vague term. What I am afraid of is that, just like the use of wchar_t and wstring spawned the dual interface used by Winapi and followed by many others (including myself in the past), introducing a third (semi-)standard string class will spawn a "ternary" interface (but I may be wrong or mixing up the order of the events mentioned above).
As I already said elsewhere I think that this is a problem that has to be solved "organizationally".
Perhaps. The type system is one of our organizational tools, and Boost has an impact insofar as it produces components that people use, so if we aren't able to produce some flagship library components that help with the solution, we have little traction.
I believe in strong typing, but... OK, for the sake of argument: where do we imagine utf8_t (or whatever its name will be) will be used, and what is our long-term plan for std::string? If I design a library or an application, should I use utf8_t everywhere? As the type of class member variables and of parameters of functions and constructors? Or should I stick to std::string (or perhaps wstring) for maximum compatibility with the rest of the world?
*Scenario E:* We add another string class and everyone adopts it
I meant that for example on POSIX OSes the POSIX C-API did not have to be changed or extended by a new set of functions doing the same things, but using a new character type, when they switched from the old encodings to UTF-8.
...and people still have the problem that they lose track of what's "raw" and what's encoded as utf-8.
Yes, but in the end they will get used to it. There are many dangerous things in C++ (like dereferencing a null or dangling pointer, doing C pointer arithmetic in the presence of inheritance, etc.) that you should not do, and mixing UTF-8 and other encodings would be one of them. It is a breaking change, but it would not be the first one in C++'s history.
To compare two strings you still use strcmp and not utf8strcmp; to collate strings you use strcoll and not utf8strcoll, etc.
Yeah... but surely POSIX's strcmp only tells you whether the two strings have the same sequence of code points, not whether they have the same characters, right? And if you inadvertently compare a "raw" string with an equivalent utf-8-encoded string, what happens?
Undefined behavior: your application segfaults, aborts, silently fails... (what happens if you dereference a dangling pointer?) BR, Matus

Matus Chochlik wrote:
On Wed, Jan 19, 2011 at 8:28 PM, Dave Abrahams <dave@boostpro.com> wrote:
OK, I see. But, is there any chance that the standard itself would be updated so that it first would recommend to use UTF-8 with C++ strings.
Well, never say "never," but... never. Such recommendations are not part of the standard's mission. It doesn't do things like that.
My view of what the standardizing committee is willing to do may be naive, but generally I don't see why this could not be done. Other major languages (Java, C#, etc.) picked a single "standard" encoding, and in those languages you treat text with other encodings as a special case.
These are not good examples, as they are single company, single platform languages. C++ is supposed to be much more than that.
If C++ recommended the use of UTF-8 this would probably kickstart the OS and compiler vendors to follow or at least to fix their implementations of the standard libary and the OS API's to accept UTF-8 by default (if we agree that this is a good idea).
To name one set of obstacles: IBM, z/OS, EBCDIC. Will work when all the legacy Cobol code has been converted to C++, but not before that. :-) Latest estimate was 200+ billion lines left to process. Bo Persson

From: Dave Abrahams <dave@boostpro.com>
Matus Chochlik wrote:
On Wed, Jan 19, 2011 at 5:26 PM, Dave Abrahams <dave@boostpro.com> wrote:
Matus Chochlik wrote:
*Scenario D:* We try for scenario A. and people still use Qstrings,
wxStrings, etc.
'I think maybe you underestimate our influence.' :)
Our influence, if we introduce new library components, is very great, because they're on a de-facto fast track to standardization, and an improved string library is exactly the sort of thing that would be adopted upstream. If we simply agree to a programming convention, that will have some impact, but much less.
Dave,

Most existing projects and frameworks have decided on a single useful encoding:

- C++:
  + Qt: UTF-16, using QString
  + GtkMM: UTF-8, using ustring
  + MFC: UTF-16, using CString (when compiled in "Unicode mode")
  + ICU: UTF-16, using UnicodeString
- C:
  + Gtk: UTF-8 strings
- Java: UTF-16 String
- C#: UTF-16 string
- Vala: UTF-8 strings, using "char *"

And so on... If you take a look at all the C++ frameworks, they all have a way to convert their string to std::string and back. C++ hasn't picked yet, but C++ has a string, and a very good one. And every existing project has an interface to it. The problem is we haven't decided about its encoding. Yes, we can't make the standard say std::string is UTF-8, but we can say other things. Just as the standard deprecated auto_ptr (which I think is a crime, but that is another story), it should deprecate all non-Unicode-aware uses of std::string and say the default is UTF-8. It already has u8"שלום" that creates a UTF-8 string using "char *"; the only remaining thing is to adopt it. All frameworks decided how they use Unicode and what string they use. Boost can and **should** decide: we use Unicode, and we use UTF-8, as all frameworks did. Decide and cut it, just as Boost decided not to use tabs in source code, or to use the BSL for its entire code base. This would do only good. Sometimes it is bad to support every bad decision that was made. Many Boost developers and users enjoy the fact that Boost is in constant evolution, so we can evolve and decide: on Windows, char*/std::string etc. is UTF-8; if you don't agree, don't use Boost. Artyom

At Wed, 19 Jan 2011 12:50:17 -0800 (PST), Artyom wrote:
Most of existing projects and frameworks had decided about 1 single and useful encoding:
- C++:
+ Qt UTF-16 using QString + GtkMM UTF-8 using ustring + MFC UTF-16 using CString /when compiled in "Unicode mode" + ICU UTF-16 using UnicodeString
- C: + Gtk UTF-8 string
- Java: UTF-16 String - C#: UTF-16 string - Vala: UTF-8 String/using "char *"
And so on...
If you take a look on All C++ frameworks they all have a way to convert their string to std::string and backwards.
C++ hadn't picked yet, but C++ has string and very good one.
I guess whether std::string is a good design could be considered a matter of opinion.
And every existing project has an interface to it.
The problem we hadn't decided about its encoding.
Yes, we can't say to standard std::string is UTF-8 but we can say other things.
Like what?
As standard deprecated auto_ptr (which I think is crime but this is other story) it should deprecate all non-unicode aware uses of std::string and say default is UTF-8.
The standard can't deprecate usage patterns, just language (which includes the standard library) features.
Boost can and **should** decide - we use Unicode - and we use UTF-8 as all frameworks did.
Except for all the UTF-16 frameworks you cited above?
Decide and cut it.
Cut what?
As Boost had decided not to use tabs in source code or use BSL for all its code base.
Those were easy to do without breaking code.
This would do only good.
Sometimes it is bad to support every bad decision that was made.
No argument there.
As many Boost Developers and Users enjoy the fact that Boost is in constant evolution so we can evolve and decide:
On windows char */std::string etc is UTF-8 if you don't agree, don't use Boost.
Yes we can. It would break code, I'm pretty sure. I am not opposed to breaking code when the benefits are worth it, but in this case I am not yet convinced that there isn't an equally-good alternative that doesn't break code. We're still exploring those alternatives. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Boost can and **should** decide - we use Unicode - and we use UTF-8 as all frameworks did.
Except for all the UTF-16 frameworks you cited above?
Short reminder: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...
- Qt supports UTF-16 only from version 4; before that, Qt3 supported only UCS-2! (And that wasn't long ago)
- Java supports UTF-16 from 1.5, before that UCS-2
- Windows somehow supports UTF-16 starting from XP
- MS SQL Server does not support UTF-16 yet (only UCS-2)
I can continue... UTF-16 is a "historical mistake": some (long) time ago Unicode was supposed to be 16 bit, and in those days a 16-bit character was very reasonable, but it didn't work out, so UTF-16 was invented. No modern project should pick it, as it gives more problems than benefits. Not to mention that until char16_t is supported in all compilers it would be hard to support it in C++ (and no, wchar_t is not good for UTF-16). Just a small point before we may think of picking UTF-16. My $0.02 Artyom

On 01/19/2011 09:39 PM, Artyom wrote:
... elision by patrick ... Short reminder: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...
Dueling reminder: http://www.joelonsoftware.com/articles/Unicode.html
Patrick

From: Patrick Horgan <phorgan1@gmail.com> On 01/19/2011 09:39 PM, Artyom wrote:
... elision by patrick ... Short reminder:
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf... Dueling reminder: http://www.joelonsoftware.com/articles/Unicode.html
Patrick
I know this article well. The problem is that it is outdated, has many errors, and is written from Microsoft's view on Unicode, which apparently led to all the TCHAR crap. In any case I wouldn't give it to anybody as a good reference for Unicode. Artyom

Artyom wrote :
If you take a look on All C++ frameworks they all have a way to convert their string to std::string and backwards.
C++ hadn't picked yet, but C++ has string and very good one.
Please allow me to question this last statement. I'm struggling to follow this thread, but there is one thing that emerges from this effort: encoded strings are dangerous beasts you don't want to touch; you should pass them as-is or use an expert library to analyze or modify them. Do you think that std::string and its fairly open interface reflect this good practice? As a user, I would like to see encoded strings as something a little bit more opaque. Ivan.

Dave Abrahams wrote:
*Scenario D:* We try for scenario A. and people still use Qstrings, wxStrings, etc.
*Scenario E:* We add another string class and everyone adopts it
The problem with using a Unicode string, be it QString or utf8_string, to represent paths is that it forces you to pick an encoding under POSIX. When the OS gives you a file name as char*, to store it in your Unicode string, you have to interpret it. Then, to give it back to the OS, you have to de-interpret it. This forces you to choose between two evils: you can opt to use a single-byte encoding such as ISO-8859-1, which gives you a perfect round-trip, but leads to the problem that people can enter a Cyrillic file name in your Unicode-enabled GUI and see something odd happen on disk, even when their shell is configured as UTF-8 and can show Cyrillic names. Or, you can choose to use UTF-8, in which case the OS can give you a name which you can't decode properly, because it's invalid UTF-8. There is no single good answer to this, of course; even if you go with my recommended approach of treating paths as byte sequences unless and until you need to display them (in which case you treat them as UTF-8), there'll still be paths that won't show up properly on the screen. But the program will be able to work with them, even if they are undisplayable. To give a simple example:

    int my_main( int ac, char const* av[] ) { my_fopen( av[1] ); }

Since files can have arbitrary byte sequences as names under POSIX (Mac OS X excluded), if my_fopen insists on taking valid UTF-8, it will refuse to open the file.

At Wed, 19 Jan 2011 19:09:48 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
*Scenario D:* We try for scenario A. and people still use Qstrings, wxStrings, etc.
*Scenario E:* We add another string class and everyone adopts it
The problem with using an Unicode string, be it QString or utf8_string, to represent paths is that it forces you to pick an encoding under POSIX. When the OS gives you a file name as char*, to store it in your Unicode string, you have to interpret it. Then, to give it back to the OS, you have to de-interpret it.
Nonono; if you don't want to choose an encoding, you store it as a raw_string, (a.k.a. std::string, for example)! The whole point is to separate by type the things we know how to interpret from the things we don't. Please tell me if I'm missing something that's still important below after my explanation above. I only skimmed because it mostly seemed to be based on a misinterpretation of my proposal.
This forces you to choose between two evils: you can opt to use a single byte encoding such as ISO-8859-1, which gives you perfect round-trip, but leads to the problem that people can enter a Cyrillic file name in your Unicode-enabled GUI and see something odd happen on disk, even when their shell is configured as UTF-8 and can show Cyrillic names. Or, you can choose to use UTF-8, in which case the OS can give you a name which you can't decode properly, because it's invalid UTF-8.
There is no single good answer to this, of course; even if you go with my recommended approach as treating paths as byte sequences unless and until you need to display them (in which case you treat them as UTF-8), there'll still be paths that won't show up properly on the screen. But the program will be able to work with them, even if they are undisplayable.
To give a simple example:
int my_main( int ac, char const* av[] ) { my_fopen( av[1] ); }
Since files can have arbitrary byte sequences as names under POSIX (Mac OS X excluded), if my_fopen insists on taking valid UTF-8, it will refuse to open the file.
-- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote:
At Wed, 19 Jan 2011 19:09:48 +0200, Peter Dimov wrote:
The problem with using an Unicode string, be it QString or utf8_string, to represent paths is that it forces you to pick an encoding under POSIX. When the OS gives you a file name as char*, to store it in your Unicode string, you have to interpret it. Then, to give it back to the OS, you have to de-interpret it.
Nonono; if you don't want to choose an encoding, you store it as a raw_string, (a.k.a. std::string, for example)!
OK. You're designing a portable library that talks to the OS. It has the following functions: T get_path( ... ); void process_path( T ); What do you use for T? string or utf8_string?

At Wed, 19 Jan 2011 22:15:10 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
At Wed, 19 Jan 2011 19:09:48 +0200, Peter Dimov wrote:
The problem with using an Unicode string, be it QString or utf8_string, to represent paths is that it forces you to pick an encoding under POSIX. When the OS gives you a file name as char*, to store it in your Unicode string, you have to interpret it. Then, to give it back to the OS, you have to de-interpret it.
Nonono; if you don't want to choose an encoding, you store it as a raw_string, (a.k.a. std::string, for example)!
OK. You're designing a portable library that talks to the OS. It has the following functions:
T get_path( ... ); void process_path( T );
What do you use for T? string or utf8_string?
I'm even less of an expert on encodings at the OS boundary than I am an expert on encodings in general, but I'll take a shot at this one. OK, according to all the experts (like you), we should be trafficking in UTF-8 everywhere, so I guess I'd say T is utf8_string (well, T is boost::filesystem::path, but that begs the same questions, ultimately). -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote:
Peter Dimov wrote:
Dave Abrahams wrote:
Nonono; if you don't want to choose an encoding, you store it as a raw_string, (a.k.a. std::string, for example)!
OK. You're designing a portable library that talks to the OS. It has the following functions:
T get_path( ... ); void process_path( T );
What do you use for T? string or utf8_string?
OK, according to all the experts (like you), we should be trafficking in UTF-8 everywhere, so I guess I'd say T is utf8_string (well, T is boost::filesystem::path, but that begs the same questions, ultimately).
I think it depends upon get_path() and process_path(). If get_path() returns an OS byte sequence and process_path() uses its argument for OS calls, then T is std::string as both functions are encoding agnostic. If process_path() is to do character-based processing, then it probably needs a UTF8 string and so it might expect a utf8_string which, presumably, would have a converting constructor from std::string assumed to have unknown encoding. (I have no idea whether it is possible to determine the encoding from the byte sequence, but I suppose it is.) In a system which assumes UTF8 encoding in std::strings, then utf8_string might be a typedef for std::string and the only concern is that all sources of such strings be UTF8. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

Dave Abrahams wrote: ...
OK. You're designing a portable library that talks to the OS. It has the following functions:
T get_path( ... ); void process_path( T );
What do you use for T? string or utf8_string?
I'm even less of an expert on encodings at the OS boundary than I am an expert on encodings in general, but I'll take a shot at this one.
OK, according to all the experts (like you), we should be trafficking in UTF-8 everywhere, so I guess I'd say T is utf8_string (well, T is boost::filesystem::path, but that begs the same questions, ultimately).
My answer is different. T is std::string, and: - on POSIX OSes, this string is taken directly from the OS and given directly to the OS, without any conversion; - on Windows, this string is UTF-8 and is converted to UTF-16 before being given to the OS, and converted from UTF-16 after being received from it. This conversion should tolerate broken UTF-16 because the OS does so as well.
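[Editorial note: a sketch of the split Peter describes. The os_get_path_utf16/os_get_path_bytes stand-ins for the actual OS calls are hypothetical, and winapi_wide_to_utf8 is assumed as the converse of the helper sketched earlier:]

    #include <string>

    std::wstring os_get_path_utf16();                       // hypothetical OS call (Windows)
    std::string  os_get_path_bytes();                       // hypothetical OS call (POSIX)
    std::string  winapi_wide_to_utf8(const std::wstring&);  // converse of utf8_to_winapi_wide

    std::string get_path()
    {
    #ifdef _WIN32
        // Windows: the OS speaks UTF-16; convert to UTF-8 at the boundary,
        // tolerating broken UTF-16 as Peter suggests.
        return winapi_wide_to_utf8(os_get_path_utf16());
    #else
        // POSIX: pass the bytes through untouched, with no interpretation.
        return os_get_path_bytes();
    #endif
    }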

At Wed, 19 Jan 2011 23:02:02 +0200, Peter Dimov wrote:
Dave Abrahams wrote: ...
OK. You're designing a portable library that talks to the OS. It has the following functions:
T get_path( ... ); void process_path( T );
What do you use for T? string or utf8_string?
I'm even less of an expert on encodings at the OS boundary than I am on an expert on encodings in general, but I'll take a shot at this one.
OK, according to all the experts (like you), we should be trafficking in UTF-8 everywhere, so I guess I'd say T is utf8_string (well, T is boost::filesystem::path, but that begs the same questions, ultimately).
My answer is different. T is std::string, and:
- on POSIX OSes, this string is taken directly from the OS and given directly to the OS, without any conversion;
- on Windows, this string is UTF-8 and is converted to UTF-16 before being given to the OS, and converted from UTF-16 after being received from it. This conversion should tolerate broken UTF-16 because the OS does so as well.
A fine answer if: a. you think the interface to std::string is a good one for posterity, and b. every other std::string that might be used along with your portable library is guaranteed to be utf-8 encoded. But I don't agree with a), and the interface to std::string makes a future where b) holds look highly unlikely to me. I prefer to have semantic constraints/invariants like "this is UTF-8 encoded" represented in the type system and enforced by public library interfaces. I'm arguing for a future like that. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

Dave Abrahams wrote:
At Wed, 19 Jan 2011 23:02:02 +0200, Peter Dimov wrote:
My answer is different. T is std::string, and:
- on POSIX OSes, this string is taken directly from the OS and given directly to the OS, without any conversion;
- on Windows, this string is UTF-8 and is converted to UTF-16 before being given to the OS, and converted from UTF-16 after being received from it. This conversion should tolerate broken UTF-16 because the OS does so as well.
...
I prefer to have semantic constraints/invariants like "this is UTF-8 encoded" represented in the type system and enforced by public library interfaces. I'm arguing for a future like that.
But the semantics I outlined above only have this constraint under Windows.

At Thu, 20 Jan 2011 00:07:18 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
At Wed, 19 Jan 2011 23:02:02 +0200, Peter Dimov wrote:
My answer is different. T is std::string, and:
- on POSIX OSes, this string is taken directly from the OS and given directly to the OS, without any conversion; - on Windows, this string is UTF-8 and is converted to UTF-16 before being given to the OS, and converted from UTF-16 after being received from it. This conversion should tolerate broken UTF-16 because the OS does so as well.
...
I prefer to have semantic constraints/invariants like "this is UTF-8 encoded" represented in the type system and enforced by public library interfaces. I'm arguing for a future like that.
But the semantics I outlined above only have this constraint under Windows.
Sorry, I don't understand what you're saying here. But let me say a little more about my point; maybe that will help. If I get a std::string from "somewhere", I don't know what encoding it's in, if any. The abstraction presented by std::string is essentially "sequence of individually addressable and mutable chars that by convention represents text in some unspecified way." It has lots of interface that is aimed at manipulating the raw sequence of chars, and none that helps with an interpretation of those chars. IIUC, you're talking about changing the abstraction presented by std::string to "sequence of individually addressable and mutable chars that by convention represents text encoded as utf-8." I would prefer to be handling something that presents the abstraction "character string." I'm not sure exactly what that looks like, but I'm pretty sure the "individually addressable and mutable chars" part should go. I'd like to see an interface that prevents corrupting the underlying data such that it no longer represents a valid sequence of characters (or at least makes it highly unlikely that such corruption could happen accidentally). Furthermore, there are lots of string-y things I'd want to do that aren't provided—or aren't provided well—by std::string, e.g. if (s1.starts_with(s2)) {...} Does this make more sense? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Thu, Jan 20, 2011 at 12:13 PM, Dave Abrahams <dave@boostpro.com> wrote:
I would prefer to be handling something that presents the abstraction "character string." I'm not sure exactly what that looks like, but I'm pretty sure the "individually addressable and mutable chars" part should go. I'd like to see an interface that prevents corrupting the underlying data such that it no longer represents a valid sequence of characters (or at least makes it highly unlikely that such corruption could happen accidentally). Furthermore, there are lots of string-y things I'd want to do that aren't provided—or aren't provided well—by std::string, e.g. if (s1.starts_with(s2)) {...}
Does this make more sense?
This discussion is interesting for a lot of reasons. However, I think it's time to address the root cause of the problem with strings in C++: that the way we think of strings right now is broken. Everything follows from this basic problem. It's time to call a spade a spade: std::string is not as well thought out as everybody might seem to think. I think since we've had something like 20 years to think about this problem, it's time to consider revolution instead of evolution. Immutable strings with lazy operations seem to be the most effective way to deal with strings from a design/implementation perspective. Encoding is just a matter of rendering, or in fusion/mpl parlance is a view of the data. String mutation is a concurrency hindrance, encourages bad programming practice, and is generally an overrated feature that makes designing efficient strings still revert to pointers and value twiddling. In this day and age with all the idioms in C++ we already know, we should really be thinking about changing the way people think about strings. Of course it shouldn't be as drastic as wiping out std::string from the face of all programs -- but something that allows for taking data from an std::string and becoming immutable, allowing lazy operations on it, and overall making a crazy efficient string implementation should be the goal first before we think about dealing with encodings and what not. Maybe it's time someone formalizes a string calculus and implements a string type that's worthy of being called a modern string. Going-back-to-just-watching'ly yours, -- Dean Michael Berris about.me/deanberris
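[Editorial note: a minimal sketch of the immutable-strings-with-lazy-operations direction Dean describes. All names here are hypothetical, and std::shared_ptr is assumed from C++0x (or substitute boost::shared_ptr). Copies and substrings share one never-mutated buffer; bytes are only materialized on demand:]

    #include <cstddef>
    #include <memory>
    #include <string>

    // An immutable string with lazy substring: all copies share one
    // never-mutated buffer, so substr is O(1) and reads are thread-safe.
    class istring
    {
        std::shared_ptr<const std::string> data_; // shared, never mutated
        std::size_t off_, len_;                   // this view's window

        istring(std::shared_ptr<const std::string> d,
                std::size_t o, std::size_t l) : data_(d), off_(o), len_(l) {}
    public:
        explicit istring(const std::string& s)
            : data_(std::make_shared<const std::string>(s)),
              off_(0), len_(s.size()) {}

        std::size_t size() const { return len_; }

        // Lazy: no bytes are copied, just a new view over the same buffer.
        istring substr(std::size_t pos, std::size_t n) const
        { return istring(data_, off_ + pos, n); }

        // Materialize bytes only when something actually needs them.
        std::string str() const { return data_->substr(off_, len_); }
    };

[Encodings then become, as Dean says, a matter of rendering: a UTF-8 or UTF-16 view could decode the same immutable bytes on demand.]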

Dave Abrahams wrote:
IIUC, you're talking about changing the abstraction presented by std::string to "sequence of individually addressable and mutable chars that by convention represents text encoded as utf-8."
Something like that. string is just char[] with value semantics. It doesn't necessarily hold a valid UTF-8 sequence.
I would prefer to be handling something that presents the abstraction "character string." I'm not sure exactly what that looks like, but I'm pretty sure the "individually addressable and mutable chars" part should go. I'd like to see an interface that prevents corrupting the underlying data such that it no longer represents a valid sequence of characters (or at least makes it highly unlikely that such corruption could happen accidentally). Furthermore, there are lots of string-y things I'd want to do that aren't provided—or aren't provided well—by std::string, e.g. if (s1.starts_with(s2)) {...}
Does this make more sense?
It makes sense in the abstract. But there is no way to protect against corruption without also setting an invariant that the sequence is not corrupted (represents valid UTF-8), and I don't usually need such a string in the interfaces we're discussing, although it can certainly be useful on its own. The interfaces that talk to the OS need to be able to carry arbitrary char sequences (in the POSIX case). Even an interface that displays the string, one that by necessity must interpret it as UTF-8, should preferably handle invalid UTF-8 and display some placeholders instead of the invalid subsequence - it's better for the user to see parts of the string than nothing at all. It's even worse to abort the whole operation with an invalid_utf8 exception. I don't particularly like string's mutable chars, but they don't mutate themselves without my telling them to, so things tend to work out fine. :-)

At Thu, 20 Jan 2011 06:43:48 +0200, Peter Dimov wrote:
Dave Abrahams wrote:
IIUC, you're talking about changing the abstraction presented by std::string to "sequence of individually addressable and mutable chars that by convention represents text encoded as utf-8."
Something like that. string is just char[] with value semantics. It doesn't necessarily hold a valid UTF-8 sequence.
Right.
I would prefer to be handling something that presents the abstraction "character string." I'm not sure exactly what that looks like, but I'm pretty sure the "individually addressable and mutable chars" part should go. I'd like to see an interface that prevents corrupting the underlying data such that it no longer represents a valid sequence of characters (or at least makes it highly unlikely that such corruption could happen accidentally). Furthermore, there are lots of string-y things I'd want to do that aren't provided—or aren't provided well—by std::string, e.g. if (s1.starts_with(s2)) {...}
Does this make more sense?
It makes sense in the abstract. But there is no way to protect against corruption without also setting an invariant that the sequence is not corrupted (represents valid UTF-8), and I don't usually need such a string in the interfaces we're discussing, although it can certainly be useful on its own. The interfaces that talk to the OS need to be able to carry arbitrary char sequences (in the POSIX case).
Yup. Then they should be handling raw_string, right?
Even an interface that displays the string, one that by necessity must interpret it as UTF-8, should preferably handle invalid UTF-8 and display some placeholders instead of the invalid subsequence - it's better for the user to see parts of the string than nothing at all.
Yep. Then I guess that should be handling raw_string, too.
It's even worse to abort the whole operation with an invalid_utf8 exception.
Yowp. So you want a "resilient utf-8 string:" something that can represent any sequence of chars and, when interpretation is necessary, will interpret them as utf-8, using some kind of best-effort error recovery to avoid hard errors. Then you can have an is_valid_utf_8() routine that is used to check for validity when/if you need it. I can understand the argument that there's not much to be gained from the type system here. I still like the idea of using something with a real string interface:

    namespace boost {
      struct text
      {
          explicit text(std::string);
          operator std::string const&() const { return storage; }
          ...
          bool startswith(text const& s) const;
          bool endswith(text const& s) const;
          text trim() const;
          ...
      private:
          std::string storage;
      };
    }

but I do wonder whether it's worth writing (or paying for the copy in) x.startswith(text(some_std_string)), and in general whether the cost of copying std::strings into text::storage is too high. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 19/01/2011 22:02, Peter Dimov wrote:
- on Windows, this string is UTF-8 and is converted to UTF-16 before being given to the OS, and converted from UTF-16 after being received from it. This conversion should tolerate broken UTF-16 because the OS does so as well.
I see no need to tolerate bad practices if they cause obvious problems.

Mathias Gaunard wrote:
On 19/01/2011 22:02, Peter Dimov wrote:
- on Windows, this string is UTF-8 and is converted to UTF-16 before being given to the OS, and converted from UTF-16 after being received from it. This conversion should tolerate broken UTF-16 because the OS does so as well.
I see no need to tolerate bad practices if they cause obvious problems.
It is possible to create a file whose name is not a valid UTF-16 sequence on Windows, so the library ought to be able to work with it. You could go either way in this specific case though, since such names are extremely rare in practice.
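On the Windows side, the lenient conversion can lean on the system converter. A sketch (the helper name is mine, not an established API); omitting the MB_ERR_INVALID_CHARS flag asks MultiByteToWideChar not to fail on malformed input, though exactly how invalid bytes are represented varies by Windows version, with newer versions substituting U+FFFD:

#include <windows.h>
#include <string>

// Convert UTF-8 to UTF-16 for the wide-character Win32 API.
// The first call computes the required length, the second converts.
std::wstring utf8_to_utf16(std::string const& s)
{
    if (s.empty())
        return std::wstring();
    int n = ::MultiByteToWideChar(CP_UTF8, 0, s.data(),
                                  static_cast<int>(s.size()), NULL, 0);
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, s.data(),
                          static_cast<int>(s.size()), &w[0], n);
    return w;
}

The reverse direction (UTF-16 from the OS back to UTF-8) is where tolerating broken UTF-16, such as unpaired surrogates in file names, actually matters; WideCharToMultiByte is similarly lenient unless WC_ERR_INVALID_CHARS is passed.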

20.01.2011 2:50, Peter Dimov wrote:
It is possible to create a file whose name is not a valid UTF-16 sequence on Windows, so the library ought to be able to work with it. You could go either way in this specific case though, since such names are extremely rare in practice.
On Windows, it is possible to create the filenames with \0 in the middle. But I don't think that Boost should support them.

On 20 January 2011 13:54, Sergey Cheban <s.cheban@drweb.com> wrote:
On Windows, it is possible to create the filenames with \0 in the middle. But I don't think that Boost should support them.
Er, why not? As a developer, it would mean that I can't use Boost to write reliable code to handle files. Boost cannot dictate what a filesystem can and cannot handle.
--
Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

Nevin Liber wrote:
On 20 January 2011 13:54, Sergey Cheban <s.cheban@drweb.com> wrote:
On Windows, it is possible to create the filenames with \0 in the middle. But I don't think that Boost should support them.
Er, why not?
We can afford not to support those because the official Win32 API doesn't (undocumented kernel interfaces do). Not supporting file names that the user can easily create is another matter. :-)

On 21 January 2011 14:47, Peter Dimov <pdimov@pdimov.com> wrote:
Nevin Liber wrote:
On 20 January 2011 13:54, Sergey Cheban <s.cheban@drweb.com> wrote:
On Windows, it is possible to create the filenames with \0 in the middle. But I don't think that Boost should support them.
Er, why not?
We can afford not to support those because the official Win32 API doesn't (undocumented kernel interfaces do). Not supporting file names that the user can easily create is another matter. :-)
Agreed (to both).
Nevin :-)
--
Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On 19/01/2011 11:33, Matus Chochlik wrote:
The string-encoding-related discussion boils down for me to the following: What will string handling in C++ look like in the (maybe not immediate) future?
*Scenario A:*
We will pick a widely-accepted char-based encoding that is able to handle all the writing scripts and alphabets that we can think of, has enough reserved space for future additions or is easily extensible and use that with std::strings which will become the one and only text string 'container' class.
All the wstrings, wxString, Qstrings, utf8strings, etc. will be abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out with the help of convenience classes like ansi_str_t and ucs2_t that will be made obsolete and finally dropped (after the transition).
*Scenario B:*
We will add yet another string class named utf8_t to the already crowded set named above. Then:
library a: will stick to the ANSI encodings with std::strings. It has worked in the past, it will work in the future, right?
library b[oost]: will use utf8_t instead and provide the (seamless and straightforward) conversions between utf8_t and std::string and std::wstring. Some (many but not all) others will follow.
library c: will use std::strings with utf-8 ...
library [.]n[et]: will use String class ...
library q[t]: will use QStrings ...
library w[xWidgets]: will use wxStrings and wxChar* ...
library wi[napi]: will use TCHAR* ...
library z: will use const char* in an encoding-agnostic way
Now an application using libraries [a..z] will become the developer's nightmare. What string should he use for class members and constructor parameters? What is he to do when the conversions do not work so seamlessly?
Also half of the cpu time assigned to running that application will be wasted on useless string transcoding. And half of the memory will be occupied with useless transcoding-related code and data.
*Scenario C:*
This is basically the status quo; a mix of the above. A sad and unsatisfactory state of things.
*Scenario D:*
Use Ranges, don't care whether it's std::string, whatever_string, etc. This also allows maximum efficiency, with lazy concatenation, transformations, conversion, filtering etc.
My Unicode library works with arbitrary ranges, and you can adapt a range in an encoding into a range in another encoding. This can be used to lazily perform encoding conversion as the range is iterated; such conversions may even be pipelined.
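To illustrate the style (these names are not Mathias's actual API, just a sketch), an algorithm written against arbitrary ranges is indifferent to which string class it is handed:

#include <boost/range/begin.hpp>
#include <boost/range/end.hpp>
#include <boost/range/iterator.hpp>

// A range-based starts_with: works unchanged on std::string,
// std::wstring, std::vector<char>, iterator_range, arrays, ...
template<class Range1, class Range2>
bool starts_with(Range1 const& text, Range2 const& prefix)
{
    typename boost::range_iterator<Range1 const>::type
        f1 = boost::begin(text), l1 = boost::end(text);
    typename boost::range_iterator<Range2 const>::type
        f2 = boost::begin(prefix), l2 = boost::end(prefix);
    for (; f2 != l2; ++f1, ++f2)
        if (f1 == l1 || *f1 != *f2)   // prefix longer, or mismatch
            return false;
    return true;
}

An encoding-converting adaptor could then be layered over either argument without the algorithm changing, which is where the lazy, pipelined conversions come in. (One caveat: a string literal passed as a range includes its terminating '\0'.)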

On 01/19/2011 08:51 AM, Mathias Gaunard wrote:
... elision by patrick ...
*Scenario D:*
Use Ranges, don't care whether it's std::string, whatever_string, etc. This also allows maximum efficiency, with lazy concatenation, transformations, conversion, filtering etc.
My Unicode library works with arbitrary ranges, and you can adapt a range in an encoding into a range in another encoding. This can be used to lazily perform encoding conversion as the range is iterated; such conversions may even be pipelined.
Sounds interesting. Of course ranges could be used with strings of whatever sort. Is the intelligence about the encoding in the ranges? As you iterate a range, does it move byte by byte or character by character? Does it deal with compositions? Is it available to read?
Patrick

On 20/01/2011 05:38, Patrick Horgan wrote:
On 01/19/2011 08:51 AM, Mathias Gaunard wrote:
My Unicode library works with arbitrary ranges, and you can adapt a range in an encoding into a range in another encoding. This can be used to lazily perform encoding conversion as the range is iterated; such conversions may even be pipelined.
Sounds interesting. Of course ranges could be used with strings of whatever sort. Is the intelligence about the encoding in the ranges?
I've chosen not to attach encoding information to ranges, as this could make my Unicode library quite intrusive. It's a design by contract; your input ranges must satisfy certain criteria, such as encoding, depending on the function you choose to call. If the criteria are not satisfied, you either get undefined behaviour or an exception, depending on the version of the function you choose to call.
As you iterate a range, does it move byte by byte or character by character?
You can adapt a range of code units into a range of code points or into a range of ranges of code points (combining character sequences, graphemes, words, sentences, etc.)
does it deal with compositions?
It can. My library doesn't really have string algorithms; it's up to you to make sure you call those algorithms using the correct adapters. For example, to search for a substring in a string, both of which are in UTF-8, while taking combining characters into account, there are different strategies:
- Decode both to UTF-32, normalize them, segment them as combining character sequences, and perform a substring search on that.
- Decode both to UTF-32, normalize them, re-encode them both in UTF-8, perform a substring search at the byte level, and ignore matches that do not lie on the utf8_combining_boundary (which checks whether we're at a UTF-8 code point boundary, decodes to UTF-32, and checks whether we're at a combining character boundary).
You might want to avoid the normalization step if you know your data is already normalized. The second strategy is likely to be quite a bit faster than the first, because you spend most of the time working on chars in actual memory, which can be optimized quite aggressively. Both approaches are doable directly in a couple of lines by combining Boost.StringAlgo and my Unicode library in various ways, and all conversions can happen lazily or not, as one wishes. Boost.StringAlgo isn't that good, however (it only provides naive O(n*m) algorithms, doesn't support right-to-left search well, and certainly is unable to vectorize the cases where the range is made of built-in types contiguous in memory), so eventually it might have to be replaced.
Is it available to read?
Somewhat dated docs are at <http://mathias.gaunard.com/unicode/doc/html/>. A presentation is planned for BoostCon 2011, and a submission for review before that.
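The code-point-boundary half of the boundary check described above is cheap in UTF-8, because continuation bytes are recognizable in isolation; that is what makes the byte-level search strategy workable. A sketch (the combining-character half, which needs actual decoding and Unicode data, is omitted):

// In UTF-8, a byte begins a code point iff it is not a continuation
// byte (10xxxxxx), so a candidate match that starts or ends inside a
// multi-byte sequence can be rejected with a single bit test.
inline bool is_utf8_code_point_boundary(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}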

On 01/19/2011 02:33 AM, Matus Chochlik wrote:
... elision by patrick ...
- It is extensible, so once we have done the painful transition we will not have to do it again. Currently utf-8 uses 1-4 (or 1-6) byte sequences to encode code points, but the scheme is transparently extensible to 1-N bytes (unlike UCS-X and i'm not sure about UTF-16/32).
The 5- and 6-byte sequences are from early versions of UTF-8 and have known negative security implications. You should never use them in your encoding, nor should you ever accept them as valid UTF-8. The entire Unicode code space (U+0000 through U+10FFFF) is encodable in 4-byte, standards-compliant UTF-8. Please see RFC 3629, "UTF-8, a transformation format of ISO 10646", F. Yergeau, November 2003 (also published as STD 63), and Table 3-7, "Well-Formed UTF-8 Byte Sequences", in version 5.2 of the Unicode Standard. I can't emphasize this enough: there have been real, serious problems, costing people money, from following the older naive spec.
If you extended it, then it would not be UTF-8, which is an encoding of UCS.
So, [dark-sarcasm] even if we dig out the stargate or join the United Federation of Planets and captain Kirk, every time he returns home, brings a truckload of new writing scripts to support, UTF-8 will be able to handle it.
Well, most of the code space is still unused. There's plenty of room; over a million code points is a lot.
just my 0.02 strips of gold pressed latinum :) [/dark-sarcasm]
Best regards,
Matus
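A strict validity check in the spirit of RFC 3629 and Table 3-7 is short enough to sketch here (the function name is made up). Note how lead bytes 0xF5-0xFF, which would begin the old 5- and 6-byte forms, are rejected outright, along with overlong encodings and surrogates:

#include <cstddef>

// Validate a byte sequence against the well-formed UTF-8 table of the
// Unicode Standard: each lead byte fixes the sequence length and the
// allowed range of the second byte; later bytes must be 10xxxxxx.
bool is_valid_utf8(unsigned char const* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; )
    {
        unsigned char b = p[i];
        std::size_t len;
        unsigned char lo = 0x80, hi = 0xBF;   // default second-byte range
        if (b <= 0x7F) { ++i; continue; }     // ASCII
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; }
        else if (b == 0xE0) { len = 3; lo = 0xA0; }   // excludes overlongs
        else if (b >= 0xE1 && b <= 0xEC) { len = 3; }
        else if (b == 0xED) { len = 3; hi = 0x9F; }   // excludes surrogates
        else if (b >= 0xEE && b <= 0xEF) { len = 3; }
        else if (b == 0xF0) { len = 4; lo = 0x90; }   // excludes overlongs
        else if (b >= 0xF1 && b <= 0xF3) { len = 4; }
        else if (b == 0xF4) { len = 4; hi = 0x8F; }   // excludes > U+10FFFF
        else return false;  // 0x80..0xC1 and 0xF5..0xFF are never valid leads
        if (i + len > n) return false;                // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return false;
        for (std::size_t k = 2; k < len; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;
        i += len;
    }
    return true;
}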
participants (23)
- Alexander Lamaison
- Artyom
- Beman Dawes
- Bo Persson
- Chad Nelson
- Dave Abrahams
- Dean Michael Berris
- Edward Diener
- Ian Emmons
- Ivan Le Lann
- Jens Finkhäuser
- Marsh Ray
- Mathias Gaunard
- Matus Chochlik
- Nevin Liber
- Patrick Horgan
- Peter Dimov
- Robert Ramey
- Scott McMurray
- Sebastian Redl
- Sergey Cheban
- Stewart, Robert
- Yakov Galka