Re: [boost] UTF-8 conversion etc. [was: [String algorithm] is_any_of has inefficient implementation]

Felipe Magno de Almeida wrote:
On Fri, Feb 15, 2008 at 3:54 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
This week I have been writing some UTF-8 encoding and decoding and Unicode<->iso8859 conversion algorithms. They seem to be faster than the libc implementations which is satisfying especially as I haven't even started on the serious optimisations yet. This will be part of the strings-tagged-with-character-sets stuff that I have described before. Anyone interested?
Sure. Though I'm most interested in all charset conversions. But the most usual is enough to speed up my application *a lot*.
Thanks to everyone who expressed an interest. I will attempt to have some sort of documentation and code available in the next few days. Pester me if I don't produce anything. If you can describe your typical data, i.e. which character set pairs you're converting between, or even better send me some test data to benchmark with, that would give a target to work towards. Cheers, Phil.

Phil Endecott wrote:
Felipe Magno de Almeida wrote:
On Fri, Feb 15, 2008 at 3:54 PM, Phil Endecott wrote:
This week I have been writing some UTF-8 encoding and decoding and Unicode<->iso8859 conversion algorithms. They seem to be faster than the libc implementations which is satisfying especially as I haven't even started on the serious optimisations yet. This will be part of the strings-tagged-with-character-sets stuff that I have described before. Anyone interested?
Sure. Though I'm most interested in all charset conversions. But the most usual is enough to speed up my application *a lot*.
Thanks to everyone who expressed an interest.
I will attempt to have some sort of documentation and code available in the next few days. Pester me if I don't produce anything.
OK, the code is here: http://svn.chezphil.org/libpbe/trunk/include/charset/ and there are some very basic docs here: http://svn.chezphil.org/libpbe/trunk/doc/charsets/ (Have a look at intro.txt for the feature list.) This code is not yet Boostified (namespaces, directory layout etc.) Most of it compiles but it has hardly been exercised at all. The functionality includes conversion between UTF-8, UCS-2, UCS-4, ASCII and ISO-8859-*. Things I'd appreciate feedback on: - What should the cs_string look like? Basically everywhere that std::string uses an integer position I have the choice of a character position, a unit position, or an iterator - or not providing that function. - What character sets are people interested in using (a) at the "edges" of their programs, and (b) in the "core"? Regards, Phil.

On Tuesday 19 February 2008 07:40 am, Phil Endecott wrote:
This code is not yet Boostified (namespaces, directory layout etc.) Most of it compiles but it has hardly been exercised at all. The functionality includes conversion between UTF-8, UCS-2, UCS-4, ASCII and ISO-8859-*.
Things I'd appreciate feedback on: - What should the cs_string look like? Basically everywhere that std::string uses an integer position I have the choice of a character position, a unit position, or an iterator - or not providing that function. - What character sets are people interested in using (a) at the "edges" of their programs, and (b) in the "core"?
I don't have a lot of experience using non-ascii strings in my internal code, aside from occasional forays into utf-8 for special characters, but wouldn't using ucs-4 for the "core" encoding be the sane thing to do? With a ucs-4 encoding, you could use a basic_string<wchar_t> and continue using the familiar api without worrying about the complications and confusion caused by variable length encodings. -- Frank

At 1:41 PM -0500 2/20/08, Frank Mori Hess wrote:
I don't have a lot of experience using non-ascii strings in my internal code, aside from occasional forays into utf-8 for special characters, but wouldn't using ucs-4 for the "core" encoding be the sane thing to do? With a ucs-4 encoding, you could use a
basic_string<wchar_t>
and continue using the familiar api without worrying about the complications and confusion caused by variable length encodings.
You are making an unwarranted assumption - that wchar_t is big enough to hold a UCS-4 code point (or, in fact, that wchar_t has any particular size at all). This is incorrect. On some compilers sizeof(wchar_t) == 2, while on others sizeof(wchar_t) == 4. (Other compilers may use other values as well, but I've never seen them.) -- Marshall Clow, Idio Software <mailto:marshall@idio.com>

Frank Mori Hess wrote:
I don't have a lot of experience using non-ascii strings in my internal code, aside from occasional forays into utf-8 for special characters, but wouldn't using ucs-4 for the "core" encoding be the sane thing to do? With a ucs-4 encoding, you could use a
basic_string<wchar_t>
and continue using the familiar api without worrying about the complications and confusion caused by variable length encodings.
The sane thing, perhaps. But take a look at Mozilla, for example, who deal with character data a lot. Currently they're evaluating the memory and speed effects of switching from UTF-16 to UTF-8 for everything. The reasoning is that even on web pages that consist mostly of exotic characters, there's still a lot of ASCII around (not counting tag names): URIs, IDs, classes, names, etc. Thus, the space savings could be considerable. (Current benchmarks record an average saving of a few percent on an unfortunately not representative set of pages, if I remember correctly.) Can you imagine what these developers would think of switching to UTF-32, where 11 bits are guaranteed to be wasted simply because all Unicode 5 planes can be represented with 21 bits? Sebastian

Phil Endecott wrote:
Things I'd appreciate feedback on: - What should the cs_string look like? Basically everywhere that std::string uses an integer position I have the choice of a character position, a unit position, or an iterator - or not providing that function.
I think emulating std::string doesn't work. It has a naive design based on the assumption of fixed-width encodings. I think that a tagged string is the best place to really start over with a string design and produce a string that is lean, rather than bloated. I think the string type should offer minimal manipulation facilities - either completely read-only or append as the only manipulation function. A string buffer type could be written as a mutable alternative, as is the design in Java and C#. However, I'm not sure how much of that interface is needed, either. I'd love to have some empirical data on string usage.
- What character sets are people interested in using (a) at the "edges" of their programs?
As many as possible. Theoretically, a program might have to deal with any and all encodings out there. Realistically, there's probably a dozen or two that are relevant. You'd need empirical data.
- and (b) in the "core"?
ASCII, UTF-8 and UTF-16. Sebastian

On Mon, Feb 25, 2008 at 8:09 AM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Phil Endecott wrote:
Things I'd appreciate feedback on: - What should the cs_string look like? Basically everywhere that std::string uses an integer position I have the choice of a character position, a unit position, or an iterator - or not providing that function.
I think emulating std::string doesn't work. It has a naive design based on the assumption of fixed-width encodings. I think that a tagged string is the best place to really start over with a string design and produce a string that is lean, rather than bloated.
I agree.
I think the string type should offer minimal manipulation facilities - either completely read-only or append as the only manipulation function.
I would like to have at least a modifiable string. But only through iterators (insert and erase). That should suffice all my algorithm needs.
A string buffer type could be written as a mutable alternative, as is the design in Java and C#. However, I'm not sure how much of that interface is needed, either.
A modifiable iterator interface (with insert and erase) is, IMO, as concise and extensible as possible.
I'd love to have some empirical data on string usage.
I do some string manipulations on email. And it is usually better to do all manipulations in the codepage received, instead of converting back and forth.
- What character sets are people interested in using (a) at the "edges" of their programs?
As many as possible. Theoretically, a program might have to deal with any and all encodings out there. Realistically, there's probably a dozen or two that are relevant. You'd need empirical data.
- and (b) in the "core"?
Unfortunately I need all supported by MIME.
and (b) in the "core"?
ASCII, UTF-8 and UTF-16.
ISO-8859-1 ?
Sebastian
-- Felipe Magno de Almeida

Felipe Magno de Almeida wrote:
On Mon, Feb 25, 2008 at 8:09 AM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Phil Endecott wrote:
Things I'd appreciate feedback on: - What should the cs_string look like? Basically everywhere that std::string uses an integer position I have the choice of a character position, a unit position, or an iterator - or not providing that function.
I think emulating std::string doesn't work. It has a naive design based on the assumption of fixed-width encodings. I think that a tagged string is the best place to really start over with a string design and produce a string that is lean, rather than bloated.
I agree.
Hmmm. I hear what you're saying, but things that are too revolutionary don't get used because they're too different from what people are used to. I'd like to offer something that's close to a drop-in replacement for std::string that will let people painlessly upgrade their code to proper character set support. However, most of the work that I have done has been at a lower level and can be easily built upon to enable a new class with a different interface as well. So you can have your cake and eat it! Comments about both are welcome.
I think the string type should offer minimal manipulation facilities - either completely read-only or append as the only manipulation function.
I would like to have at least a modifiable string. But only through iterators (insert and erase). That should suffice all my algorithm needs.
Try this: temporarily replace all your strings with list<character> and see what's missing.
A string buffer type could be written as a mutable alternative, as is the design in Java and C#. However, I'm not sure how much of that interface is needed, either.
I'm unfamiliar with what Java and C# do, but my lower-level code (e.g. character_output_iterator) makes it simple to write e.g. UTF-8 into arbitrary memory.
A modifiable iterator interface (with insert and erase) is, IMO, as concise and extensible as possible.
I'd love to have some empirical data on string usage.
I do some string manipulations on email. And it is usually better to do all manipulations in the codepage received, instead of converting back and forth.
One issue that I'm currently thinking about with this sort of usage is compile-time character set tagging vs. run-time character set tagging. In fact, I've been wondering whether there is some general pattern for providing both, e.g.

template <charset_t cset> void foo(int x);

and

void foo(charset_t cset, int x);

You can obviously forward from the first to the second but that may lose some compile-time-constant optimisations; forwarding from the second to the first needs a horrible case statement. I was wondering about a macro that would define both... any ideas anyone?
- What character sets are people interested in using (a) at the "edges" of their programs?
As many as possible. Theoretically, a program might have to deal with any and all encodings out there. Realistically, there's probably a dozen or two that are relevant. You'd need empirical data.
- and (b) in the "core"?
I have looked at the charsets in all my email, but the results are thrown by the spam.
Unfortunately I need all supported by MIME.
Falling back using e.g. iconv() for the otherwise-unsupported ones is my plan. I'm unlikely to have the energy to write code for more than a couple of the exotic sets myself. If anyone would like to help, please get in touch.
and (b) in the "core"?
ASCII, UTF-8 and UTF-16.
ISO-8859-1 ?
Cheers, Phil.

On Mon, Feb 25, 2008 at 6:06 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Felipe Magno de Almeida wrote:
On Mon, Feb 25, 2008 at 8:09 AM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
[snip]
I think emulating std::string doesn't work. It has a naive design based on the assumption of fixed-width encodings. I think that a tagged string is the best place to really start over with a string design and produce a string that is lean, rather than bloated.
I agree.
Hmmm. I hear what you're saying, but things that are too revolutionary don't get used because they're too different from what people are used to. I'd like to offer something that's close to a drop-in replacement for std::string that will let people painlessly upgrade their code to proper character set support.
I would very much use it. And I am not very concerned if some algorithms would have to change. I'm now using ICU directly, and it is quite a PITA.
However, most of the work that I have done has been at a lower level and can be easily built upon to enable a new class with a different interface as well. So you can have your cake and eat it! Comments about both are welcome.
You could create a bloated_utf8 as a drop-in replacement for std::string, and at the same time discouraging its use. :P
I think the string type should offer minimal manipulation facilities - either completely read-only or append as the only manipulation function.
I would like to have at least a modifiable string. But only through iterators (insert and erase). That should suffice all my algorithm needs.
Try this: temporarily replace all your strings with list<character> and see what's missing.
I did (not *all*, but in very significant places). The first problem I hit was unnecessarily requiring RandomAccessIterators, like using operator+ instead of std::advance. Other places use std::string::size_type and operator[]. But I can say these are easily correctable.
A string buffer type could be written as a mutable alternative, as is the design in Java and C#. However, I'm not sure how much of that interface is needed, either.
I'm unfamiliar with what Java and C# do, but my lower-level code (e.g. character_output_iterator) make it simple to write e.g. UTF-8 into arbitrary memory.
Good.
A modifiable iterator interface (with insert and erase) is, IMO, as concise and extensible as possible.
I'd love to have some empirical data on string usage.
I do some string manipulations on email. And it is usually better to do all manipulations in the codepage received, instead of converting back and forth.
One issue that I'm currently thinking about with this sort of usage is compile-time character set tagging vs. run-time character set tagging. In fact, I've been wondering whether there is some general pattern for providing both e.g.
template <charset_t cset> void foo(int x); and void foo(charset_t cset, int x);
I can say I won't be using compile-time tagged strings much. But, I guess you could do:

template <typename Char, typename Charset>
struct compiletime_string;

template <typename Char>
struct string
{
  template <typename Charset>
  string(compiletime_string<Char, Charset> const& s);
};

And then you can have compile-time tagged strings and runtime tagged strings work together seamlessly.
You can obviously forward from the first to the second but that may lose some compile-time-constant optimisations; forwarding from the second to the first needs a horrible case statement. I was wondering about a macro that would define both.... any ideas anyone?
I guess a macro wouldn't be a very good idea. You can just do some ifs in the runtime-tagged function and forward to the compile-time function for the charsets where you have an optimized compile-time version. For all others, just execute a common function (based on iconv maybe), passing the character set name. You could have a map from compile-time character set to C-string character set name.
- What character sets are people interested in using (a) at the "edges" of their programs?
As many as possible. Theoretically, a program might have to deal with any and all encodings out there. Realistically, there's probably a dozen or two that are relevant. You'd need empirical data.
- and (b) in the "core"?
I have looked at the charsets in all my email, but the results are thrown by the spam.
Unfortunately I need all supported by MIME.
Falling back using e.g. iconv() for the otherwise-unsupported ones is my plan.
That's good enough to me. [snip]
Cheers,
Phil.
Regards, -- Felipe Magno de Almeida

Phil Endecott wrote:
I'd like to offer something that's close to a drop-in replacement for std::string that will let people painlessly upgrade their code to proper character set support.
Something that might be worth a look is Glib::ustring, in glibmm-2. http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustrin... Matt

Matt Gruenke wrote:
Phil Endecott wrote:
I'd like to offer something that's close to a drop-in replacement for std::string that will let people painlessly upgrade their code to proper character set support.
Something that might be worth a look is Glib::ustring, in glibmm-2.
http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustrin...
Thanks for the link; this is indeed the sort of thing that I have in mind. Some observations: - It doesn't seem to offer any complexity guarantees. - It uses size_type to pass and return positions in the string, and doesn't specify whether these are byte or character positions; I get the impression that they're character positions but I could be wrong. - It offers implicit conversion to and from std::string. Is this desirable? Regards, Phil.

On Tue, Mar 4, 2008 at 6:11 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Matt Gruenke wrote:
[snip]
Something that might be worth a look is Glib::ustring, in glibmm-2.
http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustrin...
Thanks for the link; this is indeed the sort of thing that I have in mind. Some observations:
- It doesn't seem to offer any complexity guarantees.
- It uses size_type to pass and return positions in the string, and doesn't specify whether these are byte or character positions; I get the impression that they're character positions but I could be wrong.
- It offers implicit conversion to and from std::string. Is this desirable?
Not at all desirable. But a str() function member is very welcome.
Regards, Phil.
Regards, -- Felipe Magno de Almeida

Phil Endecott wrote:
Matt Gruenke wrote:
Something that might be worth a look is Glib::ustring, in glibmm-2.
http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustrin...
- It uses size_type to pass and return positions in the string, and doesn't specify whether these are byte or character positions; I get the impression that they're character positions but I could be wrong.
Given that they highlight the pitfalls of byte-based addressing in std::string when instantiated with multibyte characters (std::string::operator[] might return a byte in the middle of a character, and std::string::length() returns the number of bytes rather than characters), and that virtually everywhere size_type is used (where there's documentation) they explicitly state that it describes a number of characters, I think it's character-based. Also, they provide a bytes() member function, in addition to size() and length().
- It offers implicit conversion to and from std::string. Is this desirable?
I feel the danger and potential for confusion easily outweigh the benefits. I agree that something like a str() member function is the way to go. Matt

Phil Endecott a écrit : I've been using it for quite some time now, so here are my remarks :
- It uses size_type to pass and return positions in the string, and doesn't specify whether these are byte or character positions; I get the impression that they're character positions but I could be wrong.
It is character positions.
- It offers implicit conversion to and from std::string. Is this desirable?
In my experience, it is very bad. It is not a conversion, but a reinterpretation. Even a str() function would seem bad, unless it has a parameter that states in which encoding the resulting string is desired. A conversion from/to std::wstring would seem more useful in most cases. -- Loïc

On Tue, Mar 04, 2008 at 02:05:03PM +0100, Loïc Joly wrote:
Even a str() function would seem bad, unless it has a parameter that states in which encoding the resulting string is desired. A conversion from/to std::wstring would seem more useful in most cases.
Why do you require a specific encoding for std::string but none for std::wstring? Jens

Jens Seidel a écrit :
On Tue, Mar 04, 2008 at 02:05:03PM +0100, Loïc Joly wrote:
Even a str() function would seem bad, unless it has a parameter that states in which encoding the resulting string is desired. A conversion from/to std::wstring would seem more useful in most cases.
Why do you require a specific encoding for std::string but none for std::wstring?
I require one too. I just believe that a default value will be enough in most cases with std::wstring. -- Loïc

On Tuesday 04 March 2008 04:11 am, Phil Endecott wrote:
Matt Gruenke wrote:
Phil Endecott wrote:
I'd like to offer something that's close to a drop-in replacement for std::string that will let people painlessly upgrade their code to proper character set support.
Something that might be worth a look is Glib::ustring, in glibmm-2.
http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html
Thanks for the link; this is indeed the sort of thing that I have in mind. Some observations:
There is also QString from Qt, if you're not already aware of it. I'm not saying it's necessarily the way you want to go, but it does handle conversions between different encodings (see toUtf8() and fromUtf8() for example). http://doc.trolltech.com/4.3/qstring.html -- Frank

Frank Mori Hess wrote:
There is also QString from qt, if you're not already aware of it. I'm not saying it's necessarily the way you want to go, but it does handle conversions between different encodings (see toUtf8() and fromUtf8() for example).
Have you seen trotter lib? http://www.assembla.com/wiki/show/trotter-libs Regards, -- Shunsuke Sogame

Phil Endecott wrote:
OK, the code is here: http://svn.chezphil.org/libpbe/trunk/include/charset/
Playing around with implementing a very simple immutable string with a shared representation based on the concepts, I found that your current charset traits cannot deal with shift encodings. Basically, the shift state is missing from all the relevant functions: skip_forward/backward_char, encode, decode and char_length. Also, const_unit_ptr_t is not a very good template parameter name. It's misleading, since the user might assume the type has to be a pointer, where an iterator suffices. Sebastian

Sebastian Redl wrote:
Phil Endecott wrote:
OK, the code is here: http://svn.chezphil.org/libpbe/trunk/include/charset/
Playing around with implementing a very simple immutable string with a shared representation based on the concepts, I found that your current charset traits cannot deal with shift encodings. Basically, the shift state is missing from all the relevant functions: skip_forward/backward_char, encode, decode and char_length.
My original hope was that only the actual conversion function would need to track the shift state, while the decoding of the units would be independent of it. Unfortunately this isn't true of, for example, iso-2022; you need to track the shift state to know whether you're looking at a 1-byte or a 2-byte character. So, yes, this interface will need to change.
Also, const_unit_ptr_t is not a very good template parameter name. It's misleading, since the user might assume the type has to be a pointer, where an iterator suffices.
True. I'm sure you'll find many more such issues. Thanks for the feedback. Phil.

Phil Endecott wrote:
My original hope was that only the actual conversion function would need to track the shift state, while the decoding of the units would be independent of it. Unfortunately this isn't true of, for example, iso-2022; you need to track the shift state to know whether you're looking at a 1-byte or a 2-byte character. So, yes, this interface will need to change.
It gets worse. I've tried to implement a very simple "kinda-shift" encoding, UTF-16VE. That is, a UTF-16 form that expects a BOM to determine endianness. This encoding uses the shift state to remember which endianness it is in. (No dynamic switching.)

Trying to implement this, I've found that it is apparently logically impossible to provide bidirectional iterators for shift encodings, like ISO 2022-based encodings. These encodings rely on state that can only be known by sequentially scanning the string from front to back. Any attempt to iterate backwards would first have to mark the switch positions and the modes they switch from. This can be worked around for my UTF-16VE, but not for true shift encodings. Thus, the charset traits probably need a flag that designates the set as a shift encoding and makes the iterator adapter forward-only.

On a side note, Shift-JIS, EUC-JP and ISO-2022-JP are all absurdly complex. UTF-8 is so much easier! Sebastian Redl

Sebastian Redl wrote:
It gets worse. I've tried to implement a very simple "kinda-shift" encoding, UTF-16VE. That is, a UTF-16 form that expects a BOM to determine endianness. This encoding uses the shift state to remember what endian it is in. (No dynamic switching.)
The common case is that you have a BOM at the start, and if there are any other BOMs they'll be the same. But what I don't know is what the Unicode specs allow in this respect, and whether it's sensible to provide explicit support for that limited case as well as the more general case. (Do the IANA character sets names that I'm using as the basis for the charset_t enum have any way of distinguishing these cases, for example? I think the answer is no.)
Trying to implement this, I've found that it is apparently logically impossible to provide bidirectional iterators for shift encodings, like ISO 2022-based encodings. These encodings rely on state that can only be known by sequentially scanning the string from front to back.
Yes. You may be able to argue in some cases that you can predict the state during backward traversal IF there are no redundant shifts and if there are only two states. Again, I don't know whether that's useful in practice (and I suspect not).
Any attempt to iterate backwards would first have to mark the switch positions and what modes they switch from.
This can be worked around for my UTF-16VE, but not for true shift encodings. Thus, the charset traits probably need a flag that designates the set as a shift encoding and makes the iterator adapter be forward-only.
We could detect the case when skip_forward_char is not implemented. There are various factors that influence the adapted iterator traversal tag. For example, I wanted to say that the character iterator has the same traversal tag as the unit iterator, except that it's not random access; i.e. min(unit_iter_t,bidirectional). Is there any existing code anywhere for doing operations like this on iterator traversal category tags?
On a side note, Shift-JIS, EUC-JP and ISO-2022-JP are all absurdly complex. UTF-8 is so much easier!
Agreed :-( Phil.

Phil Endecott wrote:
The common case is that you have a BOM at the start, and if there are any other BOMs they'll be the same. But what I don't know is what the Unicode specs allow in this respect, and whether it's sensible to provide explicit support for that limited case as well as the more general case. (Do the IANA character sets names that I'm using as the basis for the charset_t enum have any way of distinguishing these cases, for example? I think the answer is no.)
IANA registers UTF-16BE, UTF-16LE and UTF-16. BE and LE are the fixed-endian variants. UTF-16 depends on context: if the base unit is a 16-bit entity, UTF-16 is simply endian-agnostic. If it's an 8-bit entity, I believe UTF-16 requires a BOM. I don't think flipping endians in the middle of a string is useful. I can't imagine what twisted tool would generate such code. Come to think of it, if I'm not careful, *my* code will generate it, namely when you concatenate a BE and a LE string. Concatenating shift encodings is *not* fun. Neither is substringing them.
Trying to implement this, I've found that it is apparently logically impossible to provide bidirectional iterators for shift encodings, like ISO 2022-based encodings. These encodings rely on state that can only be known by sequentially scanning the string from front to back.
Yes. You may be able to argue in some cases that you can predict the state during backward traversal IF there are no redundant shifts and if there are only two states. Again, I don't know whether that's useful in practice (and I suspect not).
Not really. The only shift encodings that ever found use are those of the ISO 2022 family, which have a two different shift state sets, one with four and one with three states, for a total of 12 shift states, not to mention the character set selection capabilities. Have I mentioned that the complexity of this stuff is absurd?
We could detect the case when skip_forward_char is not implemented.
What I'm currently doing is detecting if state_t is an empty class. Much, much easier than detecting if a function is implemented or not, especially if you have a base class that provides a default for the function.
There are various factors that influence the adapted iterator traversal tag. For example, I wanted to say that the character iterator has the same traversal tag as the unit iterator, except that it's not random access; i.e. min(unit_iter_t,bidirectional). Is there any existing code anywhere for doing operations like this on iterator traversal category tags?
Not that I know of. I had something like this around for old style categories, but when I tried to adapt it to the new ones, I realized that it didn't actually work. (I ended up never using it.) Sebastian Redl

Phil Endecott wrote:
Sebastian Redl wrote:
It gets worse. I've tried to implement a very simple "kinda-shift" encoding, UTF-16VE. That is, a UTF-16 form that expects a BOM to determine endianness. This encoding uses the shift state to remember what endian it is in. (No dynamic switching.)
The common case is that you have a BOM at the start, and if there are any other BOMs they'll be the same. But what I don't know is what the Unicode specs allow in this respect, and whether it's sensible to provide explicit support for that limited case as well as the more general case.
From memory, when I was implementing Unicode strings for my web framework, it goes something along these lines.

If an enclosing specification already tells us that it is Unicode and which encoding (i.e. HTTP and SMTP/MIME have this mechanism) then there shouldn't be a BOM. There also should never be a BOM anywhere other than the start of a string/stream/file (if you concatenate you should remove inner ones). I think some old applications may incorrectly use a BOM as a zero-width break too. You probably want to just filter out all BOMs and output them in streams etc. only when told to do so.

When decoding UTF-8 it is also useful to check that the character you just decoded is actually meant to use that number of UTF-8 bytes. For example, by zero padding you can encode an apostrophe as 2 bytes rather than 1. There are a number of security exploits centred around this; encountering one means you're dealing with a buggy Unicode encoder at best, but more likely your software is under attack. I throw an exception to stop all processing in its tracks if I see this.

K -- http://www.kirit.com/

Kirit Sælensminde wrote:
If an enclosing specification already tells us that it is Unicode and which encoding (i.e. HTTP and SMTP/MIME have this mechanism) then there shouldn't be a BOM.

Yes, if the mechanism tells us the endianness. Otherwise, the BOM is still needed.

There also should never be a BOM anywhere other than the start of a string/stream/file (if you concatenate you should remove inner ones). I think some old applications may incorrectly use a BOM as a zero width break too.

Not really incorrectly. 0xFEFF really was the zero-width non-breaking space originally, but its special zero-width property led people to use it as a BOM. Thus, a different character was designated as the new ZWNBSP, and 0xFEFF was officially made the BOM. So the usage is only incorrect in new applications.

When decoding UTF-8 it is also useful to check that the character you just decoded is actually meant to use that number of UTF-8 bytes. For example, by zero padding you can encode an apostrophe as 2 bytes rather than 1. There are a number of security exploits centred around this and getting one means you're dealing with a buggy Unicode encoder at best, but more likely your software is under attack. I throw an exception to stop all processing in its tracks if I see this.
Phil's code does that, too. Sebastian Redl

Phil Endecott wrote:
OK, the code is here: http://svn.chezphil.org/libpbe/trunk/include/charset/
and there are some very basic docs here: http://svn.chezphil.org/libpbe/trunk/doc/charsets/ (Have a look at intro.txt for the feature list.)
Another conceptual problem in your traits. Take a look at UTF-8's skip_forward_char:

template <typename char8_ptr_t>
static void skip_forward_char(char8_ptr_t& i) {
    do { ++i; } while (!char_start_byte(*i)); // Maybe hint this?
}

And this loop:

for (iterator it = cnt.begin(); it != cnt.end(); skip_forward_char(it)) {
}

This will always invoke undefined behaviour. Consider the case where it is just before end(), i.e. ++it == cnt.end(). Then skip_forward_char() will indeed do ++it, and then do *it, thus dereferencing the past-the-end iterator. Boom.

Compare with filter_iterator. skip_forward_char *must* take the end iterator, too, and stop when reaching it. This, in turn, makes the charset adapter iterator that much more complicated.

Sebastian Redl

Sebastian Redl wrote:
Phil Endecott wrote:
OK, the code is here: http://svn.chezphil.org/libpbe/trunk/include/charset/
and there are some very basic docs here: http://svn.chezphil.org/libpbe/trunk/doc/charsets/ (Have a look at intro.txt for the feature list.)
Another conceptual problem in your traits. Take a look at UTF-8's skip_forward_char:
template <typename char8_ptr_t>
static void skip_forward_char(char8_ptr_t& i) {
    do { ++i; } while (!char_start_byte(*i)); // Maybe hint this?
}
And this loop:
for (iterator it = cnt.begin(); it != cnt.end(); skip_forward_char(it)) {
}
This will always invoke undefined behaviour. Consider the case where it is just before end(), i.e. ++it == cnt.end(). Then skip_forward_char() will indeed do ++it, and then do *it, thus dereferencing the past-the-end iterator. Boom.
Yes, absolutely. I'm aware of this and similar problems. But please keep reporting them :-)

In this case the problem is slightly less serious if you write something more like

skip_forward_char(char8_ptr_t& i) {
    advance(i, char_length(*i));
}

In this case you don't dereference an invalid iterator if the input is valid and complete UTF-8. That might be useful in some circumstances, but computing char_length is actually harder than the loop with char_start_byte(). On the other hand my code does work for zero-terminated data, which is useful in the case of std::string::c_str(). I presume that the standard doesn't guarantee that dereferencing the byte after the end of a string returns 0, even though an implementation that provides c_str() in the obvious way would have to do so, right?

I'm not sure what the best solution to that problem is, but I have thought more about the converse case where you're storing UTF-8 using the output iterator and writing into a fixed-size buffer, e.g. (pseudo-code):

char* iso88591_data;
size_t iso88591_data_length;

// The UTF-8 data will take more space than the ISO-8859-1 data;
// maybe we know that in our case most bytes will be ASCII, so we allow
// a 10% overhead:
char* utf8_data = new char[iso88591_data_length * 1.1];
// In the rare case where that's insufficient we'll abort and retry with a
// larger buffer, or do the rest in another chunk or something.

// Iterator to store UTF-8:
character_output_iterator<utf8> utf8_it(utf8_data);

// First thought is to use a function with the same signature as std::copy:
seq_conv(iso88591_data, iso88591_data + iso88591_data_length, utf8_it);

// But that doesn't allow us to specify the end of the output buffer. So
// we make that an additional parameter:
character_output_iterator<utf8> utf8_end_it(utf8_data + utf8_length);
seq_conv(iso88591_data, iso88591_data + iso88591_data_length, utf8_it, utf8_end_it);

// But this may terminate either because it reached the end of the
// input or because it reached the end of the output. So perhaps it
// needs to return a pair<> of iterators reporting how far it got through
// each.

But I'm also concerned that the inner loops in these conversion algorithms shouldn't be doing more comparisons than is absolutely necessary, so I'm currently considering having both versions, with and without the destination-end iterator.

I've added functions (or maybe constants) to the charset_traits indicating the maximum number of units per character. The bounded version can then be implemented something like this (pseudo-code!):

seq_conv(in_start, in_end, out_start, out_end) {
    size_t out_length = out_end - out_start;
    max_chars = out_length / charset_traits<cset>::max_units_per_char();
    // We can safely copy max_chars from in to out without worrying about out_end:
    (in_next, out_next) = seq_conv(in_start, min(in_end, in_start + max_chars), out_start);
    // We do need to worry about out_end while copying the others:
    seq_conv(in_next, in_end, out_next, out_end);
}
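As a concrete illustration of the bounded, pair-returning form for the ISO-8859-1 to UTF-8 case (none of these names are from the library; this is a stand-alone sketch of the idea, not the proposed interface):

```cpp
#include <cstddef>
#include <utility>

// Convert ISO-8859-1 bytes to UTF-8, stopping at either the end of the
// input or the end of the output buffer, and reporting how far we got
// through each as a pair of iterators.
inline std::pair<const unsigned char*, char*>
latin1_to_utf8(const unsigned char* in, const unsigned char* in_end,
               char* out, char* out_end) {
    while (in != in_end) {
        unsigned char c = *in;
        // A Latin-1 byte needs at most 2 UTF-8 units:
        std::size_t need = (c < 0x80) ? 1 : 2;
        if (static_cast<std::size_t>(out_end - out) < need)
            break;  // output buffer exhausted
        if (need == 1) {
            *out++ = static_cast<char>(c);
        } else {
            *out++ = static_cast<char>(0xC0 | (c >> 6));
            *out++ = static_cast<char>(0x80 | (c & 0x3F));
        }
        ++in;
    }
    return std::make_pair(in, out);
}
```

The caller compares the returned input iterator against in_end to decide whether a retry with a larger buffer (or another chunk) is needed.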
Compare with filter_iterator. skip_forward_char *must* take the end iterator, too, and stop when reaching it. This, in turn, makes the charset adapter iterator that much more complicated.
Yes. filter_iterator is a good example; I would like to be consistent with existing practice when it's appropriate to do so. As you can see I'm progressing quite slowly with this work. This has the advantage that I have plenty of time to think about what I should do next before I implement it.... BTW I have just written a base64 decoding iterator adaptor. It also needs you to pass an iterator referring to the end of the data so that it can do the right thing at the end. Anyone interested? Phil.
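For reference, an end-aware variant along the lines Sebastian suggests might look like this. char_start_byte is reproduced here as a stand-in so the sketch is self-contained; in UTF-8 a start byte is any byte that is not a 10xxxxxx continuation byte:

```cpp
// Stand-in for the library's char_start_byte: true for any byte that
// is not a UTF-8 continuation byte (10xxxxxx).
inline bool char_start_byte(unsigned char b) {
    return (b & 0xC0) != 0x80;
}

// End-aware skip: advance past the current character, but never past
// 'end', so the past-the-end position is never dereferenced.
template <typename It>
void skip_forward_char(It& i, It end) {
    if (i == end) return;
    do {
        ++i;
    } while (i != end && !char_start_byte(static_cast<unsigned char>(*i)));
}
```

Calling it on an iterator already at end is a no-op, so the for-loop from the earlier message becomes safe at the cost of one extra comparison per unit.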

Phil Endecott wrote:
Sebastian Redl wrote:
Another conceptual problem in your traits. Take a look at UTF-8's skip_forward_char:
template <typename char8_ptr_t>
static void skip_forward_char(char8_ptr_t& i) {
    do { ++i; } while (!char_start_byte(*i)); // Maybe hint this?
}
And this loop:
for (iterator it = cnt.begin(); it != cnt.end(); skip_forward_char(it)) {
}
This will always invoke undefined behaviour. Consider the case where it is just before end(), i.e. ++it == cnt.end(). Then skip_forward_char() will indeed do ++it, and then do *it, thus dereferencing the past-the-end iterator. Boom.
Yes, absolutely. I'm aware of this and similar problems. But please keep reporting them :-)
In this case the problem is slightly less serious if you write something more like
skip_forward_char(char8_ptr_t& i) {
    advance(i, char_length(*i));
}
In this case you don't dereference an invalid iterator if the input is valid and complete UTF-8. That might be useful in some circumstances, but computing char_length is actually harder than the loop with char_start_byte().
The loop can also be done like this (I prefer unsigned char as the base for UTF-8 because of the bit fiddling):

unsigned char lb = *i;
++i;
if ((lb & 0x80) == 0)
    return;
lb <<= 1;
while ((lb & 0x80) != 0) {
    ++i;
    lb <<= 1;
}

This might be faster than first determining the character count. Depends on how you do it, and on the architecture. Profiling needed, also for testing against an end iterator.
On the other hand my code does work for zero-terminated data, which is useful in the case of std::string::c_str().

But not in the general case.

I presume that the standard doesn't guarantee that dereferencing the byte after the end of a string returns 0, even though an implementation that provides c_str() in the obvious way would have to do so, right?
No such guarantee, but yes, for the "obvious" implementation this would be correct. An implementation doesn't have to be obvious, though; there are still some left that aren't, I think.

We can implement UTF-8's and UTF-16's skip_forward by looking at the current byte. But does that work with all encodings? I think it doesn't work for shift encodings, unless you're willing to come to a stop on a shift character. I'm not: there's a rule for some shift encodings that they *must* end in the initial shift state, which means that there's a good chance that a shift character is the last thing in the string. This would mean, however, that if you increment an iterator that points to the last real character, it must scan past the shift character or it won't compare equal to the end iterator. Unless you're willing to scan past the shift in the equality test, another thing I wouldn't do.

Seems to me that shift encodings are a lot more pain than they're worth. I really have to wonder why anyone would ever have come up with them.
// But this may terminate either because it reached the end of the // input or because it reached the end of the output. So perhaps it // needs to return a pair<> of iterators reporting how far it got through // each.
I think that was the point where I gave up the last time I started developing a character conversion library. :-) Here's another thing to think about: it is not possible to construct an end iterator for any of the insert iterators. The without-end version is therefore absolutely necessary.
I've added functions (or maybe constants) to the charset_traits indicating the maximum number of units per character.
Another thing you need: a trait to calculate exactly the number of units needed to store a codepoint:

std::size_t units_for_char(char_t cp, shift_state s);

Returns e.g. between 1 and 4 for UTF-8. Actually, I think the shift_state should be passed by reference and be modified, so that you can run through a sequence of codepoints and calculate precisely the needed length:

typedef charset_traits<Enc> traits;
std::size_t length = 0;
typename traits::state_type state = typename traits::state_type();
for (auto cpit = codepoints.begin(); cpit != codepoints.end(); ++cpit) {
    length += traits::units_for_char(*cpit, state);
}
length += traits::units_for_finish(state);

The units_for_finish trait again refers to the requirement of some shift encodings to finish in the default shift state: it calculates the number of units needed to shift from the given state back to the initial state.
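For a stateless encoding such as UTF-8 the trait is straightforward; a sketch (the name is made up, and a real traits implementation would simply ignore the unused shift-state parameter):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of units_for_char for UTF-8. UTF-8 has no shift state, so the
// code-point value alone determines the number of code units.
inline std::size_t utf8_units_for_char(std::uint32_t cp) {
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;                    // up to U+10FFFF
}
```

Summing this over a sequence of code points gives exactly the buffer size needed, with units_for_finish trivially zero.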
BTW I have just written a base64 decoding iterator adaptor. It also needs you to pass an iterator referring to the end of the data so that it can do the right thing at the end. Anyone interested?
I'm sure someone could find a use for such a utility. Sebastian
participants (11)
-
Felipe Magno de Almeida
-
Frank Mori Hess
-
Jens Seidel
-
Kirit Sælensminde
-
Lassi Tuura
-
Loïc Joly
-
Marshall Clow
-
Matt Gruenke
-
Phil Endecott
-
Sebastian Redl
-
shunsuke