
Is there interest in having a Unicode codec library submitted to Boost? Here is what I have (only tested with GCC 4.3 on Debian, and VC++ 2008): http://svn.int64.org/viewvc/int64/snips/unicode.hpp

Right now it is pretty simple to use:

transcode<utf8, utf16le>(forwarditerator, forwarditeratorend, outputiterator, traits [, maximum]);
transcode<wchar_encoding, utf32be>(inputrange, outputrange, traits [, maximum]);

There is also a codecvt facet which supports any-to-any.

Supports UTF-8, UTF-16, and UTF-32, in little or big endian. Has a special wchar_encoding that maps to UTF-16 or UTF-32 depending on your platform. A traits class controls error handling.

-- Cory Nelson
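A minimal usage sketch of the calls quoted above, assuming the proposed header provides the encoding tags and a default-constructible traits type (here called error_traits purely as a placeholder); it will not compile without that header:

#include <string>
#include <vector>
#include <iterator>
// #include "unicode.hpp"   // the proposed header linked above

void convert(const std::string& utf8_text)
{
    std::vector<char> utf16_bytes;

    // Iterator form: decode UTF-8, write little-endian UTF-16 to an output iterator.
    // error_traits() stands in for whatever error-handling traits object the library expects.
    transcode<utf8, utf16le>(utf8_text.begin(), utf8_text.end(),
                             std::back_inserter(utf16_bytes), error_traits());

    // Range form, with the platform-dependent wchar_encoding:
    // transcode<wchar_encoding, utf32be>(input_range, output_range, error_traits());
}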

Cory Nelson wrote:
Is there interest in having a Unicode codec library submitted to Boost?
Here is what I have (only tested with GCC 4.3 on Debian, and VC++ 2008): http://svn.int64.org/viewvc/int64/snips/unicode.hpp
Right now it is pretty simple to use:
transcode<utf8, utf16le>(forwarditerator, forwarditeratorend, outputiterator, traits [, maximum]); transcode<wchar_encoding, utf32be>(inputrange, outputrange, traits [, maximum]);
There is also a codecvt facet which supports any-to-any.
Supports UTF-8, UTF-16, and UTF-32, in little or big endian. Has a special wchar_encoding that maps to UTF-16 or UTF-32 depending on your platform. A traits class controls error handling.
Hi Cory,

Yes, Boost definitely ought to have Unicode conversion. Yours is not the first proposal, and there is actually already one implementation hidden inside another Boost library. I wrote some UTF conversion code a while ago and was trying to build a more comprehensive character set library around it, but realised after a while that that was too mammoth a job. So I think that getting _just_ the UTF conversion into Boost, ASAP, is the right thing to do.

I've had a look at your code. I like that you have implemented what I called an "error policy". It is wasteful to continuously check the validity of input that comes from trusted code, but important to check it when it comes from an untrusted source. However, I'm not sure that your actual conversion is as efficient as it could be. I spent quite a while profiling my UTF8 conversion and came up with this:

http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh

I think that you could largely copy&paste bits of that into the right places in your algorithm and get a significant speedup.

Having said all that, I must say that I actually use the code that I wrote quite rarely. I now tend to use UTF8 everywhere and treat it as a sequence of bytes. Because of the properties of UTF8, I find it's rare to need to identify individual code points. For example, if I'm scanning for a matching " or ) I can just look for the next matching byte, without worrying about where the character boundaries are. If I were to use a special UTF8-decoding iterator to do that scan I would waste a lot of time doing unnecessary conversions.

I'm not sure what conclusion to draw from that: perhaps just that any "UTF8 string", or whatever, should come with a health warning that users should first learn how UTF8 works and review whether or not they actually need it.

Cheers, Phil.

On Thu, Feb 12, 2009 at 6:32 PM, Phil Endecott<spam_from_boost_dev@chezphil.org> wrote:
Cory Nelson wrote:
Is there interest in having a Unicode codec library submitted to Boost?
Here is what I have (only tested with GCC 4.3 on Debian, and VC++ 2008): http://svn.int64.org/viewvc/int64/snips/unicode.hpp
Right now it is pretty simple to use:
transcode<utf8, utf16le>(forwarditerator, forwarditeratorend, outputiterator, traits [, maximum]); transcode<wchar_encoding, utf32be>(inputrange, outputrange, traits [, maximum]);
There is also a codecvt facet which supports any-to-any.
Supports UTF-8, UTF-16, and UTF-32, in little or big endian. Has a special wchar_encoding that maps to UTF-16 or UTF-32 depending on your platform. A traits class controls error handling.
Hi Cory,
Yes, Boost definitely ought to have Unicode conversion. Yours is not the first proposal, and there is actually already one implementation hidden inside another Boost library. I wrote some UTF conversion code a while ago and was trying to build a more comprehensive character set library around it but realised after a while that that was too mammoth a job. So I think that getting _just_ the UTF conversion into Boost, ASAP, is the right thing to do.
I've had a look at your code. I like that you have implemented what I called an "error policy". It is wasteful to continuously check the validity of input that comes from trusted code, but important to check it when it comes from an untrusted source. However I'm not sure that your actual conversion is as efficient as it could be. I spent quite a while profiling my UTF8 conversion and came up with this:
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh
I think that you could largely copy&paste bits of that into the right places in your algorithm and get a significant speedup.
I finally found some time to do some optimizations of my own and have had some good progress using a small lookup table, a switch, and slightly reducing branches. See line 318: http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup

Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750MiB/s vs ~240MiB/s on my Core 2). I assume they are either using some gigantic lookup tables or SSE.

-- Cory Nelson http://int64.org

On Fri, Jul 17, 2009 at 20:02, Cory Nelson <phrosty@gmail.com> wrote:
On Thu, Feb 12, 2009 at 6:32 PM, Phil Endecott<spam_from_boost_dev@chezphil.org> wrote:
Cory Nelson wrote:
Is there interest in having a Unicode codec library submitted to Boost?
I've had a look at your code. I like that you have implemented what I called an "error policy". It is wasteful to continuously check the validity of input that comes from trusted code, but important to check it when it comes from an untrusted source. However I'm not sure that your actual conversion is as efficient as it could be. I spent quite a while profiling my UTF8 conversion and came up with this:
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh
I think that you could largely copy&paste bits of that into the right places in your algorithm and get a significant speedup.
I finally found some time to do some optimizations of my own and have had some good progress using a small lookup table, a switch, and slightly reducing branches. See line 318:
http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750MiB/s vs ~240MiB/s on my Core 2). I assume they are either using some gigantic lookup tables or SSE.
Dear Cory,

Though I'm not sure decoding this much UTF8-encoded data is often a bottleneck in practice, let me add my thoughts. I wrote a utf8_codecvt::do_in that uses lookup tables for a Unicode library that Graham and I started long ago (in the vault). I haven't actually compared performance of my implementation against yours, so mine may well be slower. However, the technique may be of interest.

From the first UTF-8 byte, the values and the valid ranges of the 0 to 3 following bytes are known (Table 3-7 in the Unicode standard, version 5.0). Therefore, you can put these in lookup tables (7 tables with 256 entries each, with a 32-bit code point value). A master table contains 256 entries mapping the first byte to 0 to 3 following tables. You find the code point values for consecutive bytes in the appropriate tables and OR them. The trick that I suspect gave me a speed-up compared to the function the Unicode Consortium publishes is that code point values for invalid bytes are 0xffffffff. After OR-ing all values, you need only one check to see whether there was an error. This reduces branches.

As I said, I'm not sure it's actually faster than your code, but it may be an interesting technique. The debug compilation of utf8_data.cpp, which contains the tables and nothing else, is 30kB on Linux gcc; hardly "gigantic". However, with a threefold speed increase it seems likely that Windows uses bigger tables or something smarter than I came up with.

To answer your original question, "is there any interest in Unicode codecs?", yes. It now seems to me that a full Unicode library would be hard to get accepted into Boost; it would be more feasible to get a UTF library submitted, which is more along the lines of your library. (A Unicode library could later be based on the same principles.)

Freestanding transcoding functions and codecvt facets are not the only thing I believe a UTF library would need, though. I'd add to the list:
- iterator adaptors (input and output);
- range adaptors;
- a code point string;
- compile-time encoding (meta-programming);
- documentation.

(The compile-time encoding thing may sound esoteric. I believe it would be useful for fast parsers. It's not that hard to do at any rate.) I suspect the string is a pain to design to please everyone. Iterator adaptors, I found, are a pain to attach error policies to and write them correctly. For example, with a policy equivalent to your "ReplaceCheckFailures", you need to produce the same code point sequence whether you traverse an invalid encoded string forward or backward. I've got code for UTF-8 that passes my unit tests, but the error checking and the one-by-one decoding make it much harder to optimise.

I believe that Mathias Gaunard is working on a library at <http://blogloufoque.free.fr/unicode/doc/html/>. I don't know how complete it is, but from the documentation it looks well thought-out so far. I'm looking forward to seeing where that's going!

Cheers, Rogier
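A compressed sketch of that table-driven scheme, for illustration only (this is not the utf8_data.cpp mentioned above): the tables are declared but not populated, since filling them from Table 3-7 is the bulk of the real work, and entries for bytes that are invalid in a given position would hold 0xffffffff so that one comparison after OR-ing detects any error.

#include <boost/cstdint.hpp>

// Hypothetical table layout in the spirit of the scheme described above.
struct lead_entry {
    boost::uint32_t value;      // bits contributed by the lead byte, already shifted
    unsigned char   length;     // total sequence length 1..4, or 0 for an invalid lead byte
    unsigned char   tables[3];  // which trailing-byte tables apply (the "master table")
};
extern const lead_entry      master[256];
extern const boost::uint32_t trailing[7][256]; // shifted contributions, or 0xffffffff

// Decode one sequence; returns 0xffffffff on any error and advances p.
inline boost::uint32_t decode_one(const unsigned char*& p, const unsigned char* end)
{
    const lead_entry& e = master[*p++];
    if (e.length == 0 || end - p < e.length - 1)
        return 0xffffffffu;                        // invalid lead byte or truncated input
    boost::uint32_t value = e.value;
    for (int k = 0; k < e.length - 1; ++k)
        value |= trailing[e.tables[k]][*p++];      // invalid trailing bytes OR in 0xffffffff
    return value;                                  // one check against 0xffffffff suffices
}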

Rogier van Dalen wrote:
Freestanding transcoding functions and codecvt facets are not the only thing I believe a UTF library would need, though.
I've personally purposely chosen not to use codecvt facets in my unicode library at all, but maybe I should provide them anyway for compatibility with the iostreams subsystem. I don't really find those practical to use.

I'd add to the list:
- compile-time encoding (meta-programming);
Didn't think of that.
Iterator adaptors, I found, are a pain to attach error policies to and write them correctly. For example, with a policy equivalent to your "ReplaceCheckFailures", you need to produce the same code point sequence whether you traverse an invalid encoded string forward or backward. I've got code for UTF-8 that passes my unit tests, but the error checking and the one-by-one decoding makes it much harder to optimise.
For now my iterator adaptors (and the codecs they're based on, for that matter) perform full checks, including checking that we don't go past the end of the input range (one way or the other). While I initially wanted both versions, with checks and without, only having one does make it easier to use. An error policy isn't really enough though, because to do full checks you need each iterator to know about the begin and the end of the range it's working on, which could be avoided altogether when trusting the input. They're fairly simple implementations and were never benchmarked (benchmarking my library isn't even scheduled at the moment), but they're quite correct (proper unit tests are in the works).
I believe that Mathias Gaunard is working on a library at <http://blogloufoque.free.fr/unicode/doc/html/>. I don't know how complete it is, but from the documentation it looks well thought-out so far. I'm looking forward to seeing where that's going!
Thanks! I'm currently writing several tutorials to make it easier to understand how it's designed. (Plus I still need to actually implement some stuff that is in that version of the docs.)

Mathias Gaunard wrote:
An error policy isn't really enough though, because to do full checks you need each iterator to know about the begin and the end of the range it's working on which could be avoided altogether when trusting the input.
The idea of an "iterator that knows where its end is" is something that comes up fairly often; do the Range experts have any comments about it? I think that in this case an iterator that can be incremented and dereferenced in some limited way beyond its end would be sufficient. For example, a std::string normally has a 0 byte beyond its end so that c_str() can work, so it is safe (for some value of safe!) to keep advancing a std::string::iterator until a 0 is seen, without looking for end(). A UTF-8 decoding algorithm that processes multi-byte characters by continuing until the top bit is not set would safely terminate in this case. For iterators that don't offer this sort of behaviour you can provide a wrapper that knows where end is and returns a sentinel 0 in that case. Just a thought.... Phil.

Phil Endecott wrote:
I think that in this case an iterator that can be incremented and dereferenced in some limited way beyond its end would be sufficient. For example, a std::string normally has a 0 byte beyond its end so that c_str() can work, so it is safe (for some value of safe!) to keep advancing a std::string::iterator until a 0 is seen, without looking for end().
A terminator is only specified for the result of c_str(), not for a string's internal representation.

_____
Rob Stewart robert.stewart@sig.com
Software Engineer, Core Software using std::disclaimer;
Susquehanna International Group, LLP http://www.sig.com

On Sat, Jul 18, 2009 at 15:20, Phil Endecott<spam_from_boost_dev@chezphil.org> wrote:
The idea of an "iterator that knows where its end is" is something that comes up fairly often; do the Range experts have any comments about it?
I think that in this case an iterator that can be incremented and dereferenced in some limited way beyond its end would be sufficient. For example, a std::string normally has a 0 byte beyond its end so that c_str() can work, so it is safe (for some value of safe!) to keep advancing a std::string::iterator until a 0 is seen, without looking for end(). A UTF-8 decoding algorithm that processes multi-byte characters by continuing until the top bit is not set would safely terminate in this case.
(I don't think I'm a Range expert.) I think there are problems with this example. Adding '\0' at the end is not mandated by the standard, right? Also, '\0' could also occur in the middle of the sequence. Also, I don't think iterators strictly speaking allow dereferencing the past-the-end element.
For iterators that don't offer this sort of behaviour you can provide a wrapper that knows where end is and returns a sentinel 0 in that case.
Wouldn't this end up requiring two if-statements? In general, a sentinel which is in the valid range of the value type would be an awkward sentinel.

However, I can see where you're coming from. Being able to tell from an iterator whether it's at the end of its range is often useful. Is operator*() the right place to implement this functionality? Wouldn't a free function is_at_end(Iterator) make more sense?

Cheers, Rogier

Rogier van Dalen wrote:
On Sat, Jul 18, 2009 at 15:20, Phil Endecott wrote:
The idea of an "iterator that knows where its end is" is something that comes up fairly often; do the Range experts have any comments about it?
I think that in this case an iterator that can be incremented and dereferenced in some limited way beyond its end would be sufficient. For example, a std::string normally has a 0 byte beyond its end so that c_str() can work, so it is safe (for some value of safe!) to keep advancing a std::string::iterator until a 0 is seen, without looking for end(). A UTF-8 decoding algorithm that processes multi-byte characters by continuing until the top bit is not set would safely terminate in this case.
(I don't think I'm a Range expert.) I think there are problems with this example. Adding '\0' at the end is not mandated by the standard, right?
My recollection is that the standard makes it hard to implement a std::string that does not have a 0 after the last element, or that has non-contiguous storage. These are assumptions that I would probably be happy to accept in my own internal-use code, but you are right to say that they are probably not appropriate for library code. For library code you would need to wrap the container with something that guarantees this behaviour in a more solid way.

Out of interest, does anyone know if a std::vector that has been reserve()d guarantees anything about dereferencing beyond-the-end iterators? It would be great if they were allowed to be undefined yet certain not to segfault.
Also, '\0' could also occur in the middle of the sequence.
That doesn't cause a problem for this application.
For iterators that don't offer this sort of behaviour you can provide a wrapper that knows where end is and returns a sentinel 0 in that case.
Wouldn't this end up requiring two if-statements? In general, a sentinel which is in the valid range of the value type would be an awkward sentinel.
No, the point is that it doesn't need any extra if statements at all. The code is already looking for a top-bit-clear byte to indicate the end of the multibyte character, and the 0 byte does that.
However, I can see where you're coming from. Being able to tell from an iterator whether it's at the end of its range is often useful. Is operator*() the right place to implement this functionality? Wouldn't a free function is_at_end(Iterator) make more sense?
Then you do have the extra if statement. Phil.

Phil Endecott wrote:
Rogier van Dalen wrote:
On Sat, Jul 18, 2009 at 15:20, Phil Endecott wrote:
The idea of an "iterator that knows where its end is" is something that comes up fairly often; do the Range experts have any comments about it?
I think that in this case an iterator that can be incremented and dereferenced in some limited way beyond its end would be sufficient. For example, a std::string normally has a 0 byte beyond its end so that c_str() can work, so it is safe (for some value of safe!) to keep advancing a std::string::iterator until a 0 is seen, without looking for end(). A UTF-8 decoding algorithm that processes multi-byte characters by continuing until the top bit is not set would safely terminate in this case.
(I don't think I'm a Range expert.) I think there are problems with this example. Adding '\0' at the end is not mandated by the standard, right?
My recollection is that the standard makes it hard to implement a std::string that does not have a 0 after the last element, or that has non-contiguous storage.
I believe that as of C++03, std::string is required to have contiguous storage. The null-termination is another thing. The only guarantee is that the char* returned by c_str() is required to be null terminated. No such guarantee is made for the sequence traversed by std::string's iterators. Dereferencing the end iterator is verboten.
These are assumptions that I would probably be happy to accept in my own internal-use code, but you are right to say that they are probably not appropriate for library code. For library code you would need to wrap the container with something that guarantees this behaviour in a more solid way. Out of interest, does anyone know if a std::vector that has been reserve()d guarantees anything about dereferencing beyond-the-end iterators? It would be great if they were allowed to be undefined yet certain not to segfault.
Undefined behavior. Don't do it.
Also, '\0' could also occur in the middle of the sequence.
That doesn't cause a problem for this application.
But not in general.
For iterators that don't offer this sort of behaviour you can provide a wrapper that knows where end is and returns a sentinel 0 in that case.
Wouldn't this end up requiring two if-statements? In general, a sentinel which is in the valid range of the value type would be an awkward sentinel.
No, the point is that it doesn't need any extra if statements at all. The code is already looking for a top-bit-clear byte to indicate the end of the multibyte character, and the 0 byte does that.
However, I can see where you're coming from. Being able to tell from an iterator whether it's at the end of its range is often useful. Is operator*() the right place to implement this functionality? Wouldn't a free function is_at_end(Iterator) make more sense?
Then you do have the extra if statement.
If you write a generic algorithm that takes a ForwardRange or two ForwardIterators, you're not allowed to assume that the sequence is terminated with a sentry. Why? Because that's not part of the ForwardRange/ForwardIterator concept.

You would be allowed to define a refinement, say ContiguousNullTerminatedIterator, a concept such that &*it is required to point to a contiguous block of memory that is terminated with a sentry at the position specified by the end iterator. You could also define a trait to detect conformance to the concept. Then you could use this trait to dispatch to an algorithm optimized for that trait.

This seems like a lot of work for little gain to me, though. I'd like to see performance numbers to justify a design like that.

-- Eric Niebler
BoostPro Computing
http://www.boostpro.com
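A sketch of the trait-plus-dispatch arrangement Eric outlines; the trait name and both overloads are hypothetical, and a type would opt in only if it genuinely guarantees contiguous storage with a 0 sentinel at the end position (a plain const char* is used below purely as an example of a type an application might decide to trust).

struct checked_tag {};
struct null_terminated_tag {};

// Hypothetical trait: which decoding strategy an iterator type supports.
template <class Iterator>
struct utf8_input_category { typedef checked_tag type; };

// Opt-in specialization for an iterator type known to point into
// contiguous, 0-terminated storage (illustrative choice only).
template <>
struct utf8_input_category<const char*> { typedef null_terminated_tag type; };

template <class Iterator, class OutputIterator>
OutputIterator decode_utf8_impl(Iterator first, Iterator last,
                                OutputIterator out, checked_tag)
{
    // generic path: must compare against 'last' before every byte read
    (void)first; (void)last; return out;       // body elided in this sketch
}

template <class Iterator, class OutputIterator>
OutputIterator decode_utf8_impl(Iterator first, Iterator last,
                                OutputIterator out, null_terminated_tag)
{
    // fast path: the 0 sentinel stops the continuation-byte loop, so the
    // inner loop never needs to test against 'last'
    (void)first; (void)last; return out;       // body elided in this sketch
}

template <class Iterator, class OutputIterator>
OutputIterator decode_utf8(Iterator first, Iterator last, OutputIterator out)
{
    typedef typename utf8_input_category<Iterator>::type category;
    return decode_utf8_impl(first, last, out, category());
}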

These are assumptions that I would probably be happy to accept in my own internal-use code, but you are right to say that they are probably not appropriate for library code. For library code you would need to wrap the container with something that guarantees this behaviour in a more solid way. Out of interest, does anyone know if a std::vector that has been reserve()d guarantees anything about dereferencing beyond-the-end iterators? It would be great if they were allowed to be undefined yet certain not to segfault.
Undefined behavior. Don't do it.
I'm having precisely this problem with a home-brewed strided iterator. Given a std::vector<int> 'a' containing [3,8,2,1,0,9,4,6,5,7], I would like to sort the sub-range defined by every second element starting at index 1. Conceptually something like:

ejg::striding_iterator<std::vector<int>::iterator> begin(a.begin()+1,2);
std::sort(begin,begin+5);

so that I obtain [3,1,2,6,0,7,4,8,5,9].

While it works fine under g++, on MSVC it segfaults when applying the prefix ++ operator, since it attempts to dereference the location one past the past-the-end position. Irritating beyond belief! As far as I can see there's no way around this with bare iterators. Any suggestions?

-ed

------------------------------------------------
"No more boom and bust." -- Dr. J. G. Brown, 1997

Eric Niebler wrote:
Phil Endecott wrote:
Rogier van Dalen wrote:
On Sat, Jul 18, 2009 at 15:20, Phil Endecott wrote:
The idea of an "iterator that knows where its end is" is something that comes up fairly often; do the Range experts have any comments about it?
I think that in this case an iterator that can be incremented and dereferenced in some limited way beyond its end would be sufficient. For example, a std::string normally has a 0 byte beyond its end so that c_str() can work, so it is safe (for some value of safe!) to keep advancing a std::string::iterator until a 0 is seen, without looking for end(). A UTF-8 decoding algorithm that processes multi-byte characters by continuing until the top bit is not set would safely terminate in this case.
(I don't think I'm a Range expert.) I think there are problems with this example. Adding '\0' at the end is not mandated by the standard, right?
My recollection is that the standard makes it hard to implement a std::string that does not have a 0 after the last element, or that has non-contiguous storage.
I believe that as of C++03, std::string is required to have contiguous storage. The null-termination is another thing. The only guarantee is that the char* returned by c_str() is required to be null terminated. No such guarantee is made for the sequence traversed by std::string's iterators. Dereferencing the end iterator is verboten.
Here is an outline of a zero-overhead wrapper for std::string that I hope guarantees that *end() == 0:

struct string_with_zero_beyond_end : std::string {
    typedef const char* const_iterator;
    const_iterator begin() const { return c_str(); }
    const_iterator end() const { return c_str() + length(); }
};
These are assumptions that I would probably be happy to accept in my own internal-use code, but you are right to say that they are probably not appropriate for library code. For library code you would need to wrap the container with something that guarantees this behaviour in a more solid way. Out of interest, does anyone know if a std::vector that has been reserve()d guarantees anything about dereferencing beyond-the-end iterators? It would be great if they were allowed to be undefined yet certain not to segfault.
Undefined behavior. Don't do it.
Something similar is possible.
Also, '\0' could also occur in the middle of the sequence.
That doesn't cause a problem for this application.
But not in general.
We're not concerned with "in general", we're concerned with this particular problem of traversing bytes of UTF-8 efficiently with the possibility of malformed (e.g. malicious) input, without having to constantly check for end-of-input.
For iterators that don't offer this sort of behaviour you can provide a wrapper that knows where end is and returns a sentinel 0 in that case.
Wouldn't this end up requiring two if-statements? In general, a sentinel which is in the valid range of the value type would be an awkward sentinel.
No, the point is that it doesn't need any extra if statements at all. The code is already looking for a top-bit-clear byte to indicate the end of the multibyte character, and the 0 byte does that.
However, I can see where you're coming from. Being able to tell from an iterator whether it's at the end of its range is often useful. Is operator*() the right place to implement this functionality? Wouldn't a free function is_at_end(Iterator) make more sense?
Then you do have the extra if statement.
If you write a generic algorithm that takes a ForwardRange or two ForwardIterators, you're not allowed to assume that the sequence is terminated with a sentry. Why? Because that's not part of the ForwardRange/ForwardIterator concept.
You would be allowed to define a refinement, say ContiguousNullTerminatedIterator concept such that &*it is required to point to a contiguous block of memory that is terminated with a sentry at the position specified by the end iterator. You could also define a trait to detect conformance to the concept. Then you could use this trait to dispatch to an algorithm optimized for that trait.
I had something to detect contiguous storage to do the trick of casting to an int and checking 4 or 8 top bits simultaneously.
This seems like a lot of work for little gain to me, though. I'd like to see performance numbers to justify a design like that.
Cory Nelson reported that Windows 7 can decode UTF-8 3 times faster than his code. I don't want Boost to end up with UTF-8 features that are all lovely and generic and full of concepts but that run 3 times slower on common types (strings, vector<char>, pointers) than they could if they were less generic.

Phil.

Phil Endecott wrote:
Eric Niebler wrote:
I believe that as of C++03, std::string is required to have contiguous storage. The null-termination is another thing. The only guarantee is that the char* returned by c_str() is required to be null terminated. No such guarantee is made for the sequence traversed by std::string's iterators. Dereferencing the end iterator is verboten.
Here is an outline of a zero-overhead wrapper for std::string that I hope guarantees that *end() == 0:
struct string_with_zero_beyond_end : std::string {
    typedef const char* const_iterator;
    const_iterator begin() const { return c_str(); }
    const_iterator end() const { return c_str() + length(); }
};

Not zero-overhead if the string implementation is one (of the non-existent ones) that doesn't store the NUL internally.
Sebastian

Sebastian Redl wrote:
Phil Endecott wrote:
Eric Niebler wrote:
I believe that as of C++03, std::string is required to have contiguous storage. The null-termination is another thing. The only guarantee is that the char* returned by c_str() is required to be null terminated. No such guarantee is made for the sequence traversed by std::string's iterators. Dereferencing the end iterator is verboten.
Here is an outline of a zero-overhead wrapper for std::string that I hope guarantees that *end() == 0:
struct string_with_zero_beyond_end: std::string { typedef const char* const_iterator; const_iterator begin() const { return c_str(); } const_iterator end() const { return c_str() + length(); } };
Not zero-overhead if the string implementation is one (of the non-existent ones) that doesn't store the NUL internally.
Worse: you'd be returning iterators to what may well be a temporary array of characters. The result of calling c_str() need not refer to the internal storage of the string.

There's also the issue of deriving from a concrete class, so that slicing will revert to std::string::begin()/end(). Then there's also the little problem of std::string implementations that don't use raw pointers for iterators, particularly those in a checked STL.

_____
Rob Stewart robert.stewart@sig.com
Software Engineer, Core Software using std::disclaimer;
Susquehanna International Group, LLP http://www.sig.com

Stewart, Robert wrote:
Sebastian Redl wrote:
Phil Endecott wrote:
Eric Niebler wrote:
I believe that as of C++03, std::string is required to have contiguous storage. The null-termination is another thing. The only guarantee is that the char* returned by c_str() is required to be null terminated. No such guarantee is made for the sequence traversed by std::string's iterators. Dereferencing the end iterator is verboten.
Here is an outline of a zero-overhead wrapper for std::string that I hope guarantees that *end() == 0:
struct string_with_zero_beyond_end: std::string { typedef const char* const_iterator; const_iterator begin() const { return c_str(); } const_iterator end() const { return c_str() + length(); } };
Not zero-overhead if the string implementation is one (of the non-existent ones) that doesn't store the NUL internally.
Is c_str() allowed to be > O(1) ?
Worse: you'd be returning iterators to what may well be a temporary array of characters. The result of calling c_str() need not refer to the internal storage of the string.
People are taking my "code" too literally. I am just trying to point out that a function for UTF-8 decoding can be significantly more efficient if it exploits various characteristics of common types, such as contiguous storage and null termination. I feel that this is something worth doing. Phil.

Phil Endecott wrote:
Stewart, Robert wrote:
Sebastian Redl wrote:
Phil Endecott wrote:
Here is an outline of a zero-overhead wrapper for std::string that I hope guarantees that *end() == 0:
struct string_with_zero_beyond_end: std::string { typedef const char* const_iterator; const_iterator begin() const { return c_str(); } const_iterator end() const { return c_str() + length(); } };
Not zero-overhead if the string implementation is one (of the non-existent ones) that doesn't store the NUL internally.
Is c_str() allowed to be > O(1) ?
Yes, since the entirety of C++98/03's requirements levied against c_str() are shown here:
From 21.3/5: It can invalidate outstanding iterators, references, and pointers.
From 21.3.6:
"1 Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements equal the corresponding elements of the string controlled by *this and whose last element is a null character specified by charT(). 2 Requires: The program shall not alter any of the values stored in the array. Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a nonconst member function of the class basic_string that designates the same object as this." _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

Phil Endecott wrote:
Is c_str() allowed to be > O(1) ?

Yes. In particular, this implementation is valid:

template <...>
class basic_string {
    Ch *real_data;
    mutable Ch *cstr_data;
    // ...
public:
    const Ch *c_str() const {
        if (!cstr_data) {
            cstr_data = allocate(size() + 1);
            copy(cstr_data, size());
            cstr_data[size()] = 0;
        }
        return cstr_data;
    }

    void any_modifying_function() {
        if (cstr_data) {
            deallocate(cstr_data);
            cstr_data = 0;
        }
    }
};

Sebastian

Sebastian Redl skrev:
Phil Endecott wrote:
Is c_str() allowed to be > O(1) ? Yes. In particular, this implementation is valid:
template <...> class basic_string { Ch *real_data; mutable Ch *cstr_data; // ...
public: const Ch *c_str() const { if (!cstr_data) { cstr_data = allocate(size() + 1); copy(cstr_data, size()); cstr_data[size()] = 0; } return cstr_data; }
void any_modifying_function() { if(cstr_data) { deallocate(cstr_data); cstr_data = 0; } } };
But does *any* implementation actually do that? The problem is, AFAIK, that str[size()] is valid when str is a const object. -Thorsten

On Jul 21, 2009, at 9:54 AM, Thorsten Ottosen wrote:
Sebastian Redl skrev:
Phil Endecott wrote:
Is c_str() allowed to be > O(1) ?
Yes. In particular, this implementation is valid:

template <...>
class basic_string {
    Ch *real_data;
    mutable Ch *cstr_data;
    // ...
public:
    const Ch *c_str() const {
        if (!cstr_data) {
            cstr_data = allocate(size() + 1);
            copy(cstr_data, size());
            cstr_data[size()] = 0;
        }
        return cstr_data;
    }

    void any_modifying_function() {
        if (cstr_data) {
            deallocate(cstr_data);
            cstr_data = 0;
        }
    }
};
But does *any* implementation actually do that? The problem is AFAIK that str[size] is valid when str is a const object.
Fwiw, the C++0X string will (probably) require the trailing null for non-const strings too: http://home.roadrunner.com/~hinnant/issue_review/lwg-active.html#876 (see A 2) And the committee knows of no implementations this will break. -Howard

On Mon, Jul 20, 2009 at 15:07, Phil Endecott<spam_from_boost_dev@chezphil.org> wrote:
Rogier van Dalen wrote:
On Sat, Jul 18, 2009 at 15:20, Phil Endecott wrote:
The idea of an "iterator that knows where its end is" is something that comes up fairly often; do the Range experts have any comments about it?
I mistakenly thought here that you meant an iterator knowing where its end is in general, rather than just a std::string iterator when decoding UTF-8 from the string.
Also, '\0' could also occur in the middle of the sequence.
That doesn't cause a problem for this application.
Fair enough. Cheers, Rogier

On Sat, Jul 18, 2009 at 07:14, Mathias Gaunard<mathias.gaunard@ens-lyon.org> wrote:
Rogier van Dalen wrote:
Freestanding transcoding functions and codecvt facets are not the only thing I believe a UTF library would need, though.
I've personally purposely chose not to use codecvt facets in my unicode library at all, but maybe I should provide them anyway for compatibility with the iostreams subsystem. I don't really find those practical to use.
I don't necessarily disagree, but I'm curious what alternative you have in mind.
Iterator adaptors, I found, are a pain to attach error policies to and write them correctly. For example, with a policy equivalent to your "ReplaceCheckFailures", you need to produce the same code point sequence whether you traverse an invalid encoded string forward or backward. I've got code for UTF-8 that passes my unit tests, but the error checking and the one-by-one decoding makes it much harder to optimise.
For now my iterator adaptors (and the codecs they're based on for that matter) perform full checks, including checking that we don't go past the end of the input range (one way or the other). While I wanted both versions with checks and without initially, only having one does make it easier to use.
Non-checking iterator adaptors can be faster. That would be useful when you know that a string is safe, for example, in a UTF string type that has a validity invariant.
An error policy isn't really enough though, because to do full checks you need each iterator to know about the begin and the end of the range it's working on which could be avoided altogether when trusting the input.
I think this means that all iterator adaptors can be constructed from 3 iterators (begin, position, end) and the ones that don't check the input can also be constructed from 1 iterator. For a checking forward iterator, only two iterators are necessary (position, end). This is how I implemented this, at any rate.
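As an illustration of the (position, end) construction just described, here is a skeleton of a checking forward iterator; the class name is a placeholder and the checked decode step is only declared, not written out, since the point is merely that the adaptor has to carry the end iterator in order to refuse to read past a truncated sequence.

#include <iterator>
#include <boost/cstdint.hpp>

// Hypothetical checked decode step: reads one sequence from [p, end),
// advances p, and applies whatever error policy is in force.
template <class Iterator>
boost::uint32_t decode_one_checked(Iterator& p, Iterator end);

template <class BaseIterator>
class checked_utf8_iterator
{
public:
    typedef boost::uint32_t           value_type;
    typedef std::forward_iterator_tag iterator_category;

    checked_utf8_iterator(BaseIterator position, BaseIterator end)
        : pos_(position), end_(end) {}

    value_type operator*() const
    {
        BaseIterator p = pos_;
        return decode_one_checked(p, end_);        // never reads past end_
    }

    checked_utf8_iterator& operator++()
    {
        decode_one_checked(pos_, end_);            // skip one whole sequence
        return *this;
    }

    bool operator==(const checked_utf8_iterator& other) const { return pos_ == other.pos_; }
    bool operator!=(const checked_utf8_iterator& other) const { return pos_ != other.pos_; }

private:
    BaseIterator pos_, end_;
};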
They're fairly simple implementations and were never benchmarked (benchmarking my library isn't even scheduled at the moment), but they're quite correct (proper unit tests are in the works).
It makes sense to design for correctness. It's probably worth keeping in mind, though, whether conceivable extensions and optimisations are possible in your design.

I like the idea of the Pipe and related concepts. I am wondering, however, whether the UTF-8 decoding iterator can be fast enough given the current specification. I think Pipe (or another concept) might have to support decoding of exactly one output element. Correct me if I'm wrong.

The actual implementation of extensions and optimisations can be delayed until the need appears. I'd be happy to contribute checking policies.

Cheers, Rogier

Rogier van Dalen wrote:
Non-checking iterator adaptors can be faster. That would be useful when you know that a string is safe, for example, in a UTF string type that has a validity invariant.
I suppose that type of string should probably use optimized iterators that make use of the fact it is stored on contiguous and properly aligned memory anyway, so it will need special code.
I think this means that all iterator adaptors can be constructed from 3 iterators (begin, position, end) and the ones that don't check the input can also be constructed from 1 iterator. For a checking forward iterator, only two iterators are necessary (position, end). This is how I implemented this, at any rate.
Indeed, that makes 3 cases per encoding and I'm only handling the broadest case for now.
It makes sense to design for correctness. It's probably worth keeping in mind, though, whether conceivable extensions and optimisations are possible in your design.
I suppose you could attach traits to select more optimal iteration methods.
I like the idea of the Pipe and related concepts. I am wondering, however, whether the UTF-8 decoding iterator can be fast enough given the current specification. I think Pipe (or another concept) might have to support decoding of exactly one output element. Correct me if I'm wrong.
I don't really understand what you mean. Calling Pipe::ltr or Pipe::rtl only decodes one "element" (utf8 decoding means a multibyte sequence is read and a code point is written, utf8 encoding means a code point is read and a multibyte sequence is written).
The actual implementation of extensions and optimisations can be delayed until the need appears. I'd be happy to contribute checking policies.
The mechanism to do so has yet to be defined unfortunately ;).

Mathias Gaunard wrote:
Rogier van Dalen wrote:
Non-checking iterator adaptors can be faster. That would be useful when you know that a string is safe, for example, in a UTF string type that has a validity invariant.
I suppose that type of string should probably use optimized iterators that make use of the fact it is stored on contiguous and properly aligned memory anyway, so it will need special code.
There are 2 orthogonal issues here:
1) whether a sequence is stored in contiguous memory
2) whether it is already guaranteed to be well-formed UTF-XX

Conflating the two will lead to bad design. I agree with Rogier. The routines should make checking a policy. Iterators should be non-checked. Checked iterators can be adaptors.

This is important to get right. You can build safe interfaces on top of fast ones, but not the other way around.

-- Eric Niebler
BoostPro Computing
http://www.boostpro.com

On Mon, Jul 20, 2009 at 18:42, Eric Niebler<eric@boostpro.com> wrote:
Mathias Gaunard wrote:
Rogier van Dalen wrote:
Non-checking iterator adaptors can be faster. That would be useful when you know that a string is safe, for example, in a UTF string type that has a validity invariant.
I suppose that type of string should probably use optimized iterators that make use of the fact it is stored on contiguous and properly aligned memory anyway, so it will need special code.
There are 2 orthogonal issues here: 1) whether a sequence is stored in contiguous memory 2) whether it is already guaranteed to be well-formed UTF-XX
I think where confusion could arise is this: even though these issues are orthogonal, if it's just about optimising, it might be acceptable to write code for a specific special case.

However, one policy that is sensible is that of "repairing" an invalid string: interpreting overlong sequences, and replacing uninterpretable code units by U+FFFD "Replacement character". This is similar to Cory's "ReplaceCheckFailures". Such a policy is necessary if a program needs to read a corrupted UTF file and make the most out of it.

On the other hand, the current behaviour of throwing an error at overlong or invalid sequences is also sensible. The one-to-one relation between encoded and decoded form makes it the safest choice. It can guarantee there are no NULLs in the decoded form that were not in the encoded form.

I think both of these policies (and possibly others that I haven't thought of) will need to be supported. Checking policies are therefore not just an optimisation.
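For concreteness, the two policies could look something like the sketch below, passed as a template parameter to the decoding routines; the single handle_invalid() hook is a placeholder interface of my own, not Cory's or Mathias's API.

#include <stdexcept>
#include <boost/cstdint.hpp>

// Strict policy: any malformed sequence is an error, preserving the
// one-to-one relation between encoded and decoded forms.
struct throw_on_invalid
{
    static boost::uint32_t handle_invalid()
    {
        throw std::runtime_error("invalid UTF sequence");
    }
};

// Repairing policy: each uninterpretable code unit becomes U+FFFD, so a
// corrupted file can still be read and the most made of it.
struct replace_invalid
{
    static boost::uint32_t handle_invalid()
    {
        return 0xFFFDu;
    }
};

// A decoder parameterised on the policy would call Policy::handle_invalid()
// wherever a check fails:
//   template <class Policy, class In, class Out>
//   Out decode_utf8(In first, In last, Out out);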
Conflating the two will lead to bad design. I agree with Rogier.
That is great...
The routines should make checking a policy. Iterators should be non-checked. Checked iterators can be adaptors.
... but I'm not sure I understand what you mean. I read this as "you can build a checking iterator adaptor on top of an non-checking iterator adaptor". I don't think this is true for decoding UTF. I suspect, therefore, that I misunderstand something. Cheers, Rogier

On Mon, Jul 20, 2009 at 17:50, Mathias Gaunard<mathias.gaunard@ens-lyon.org> wrote:
Rogier van Dalen wrote:
I like the idea of the Pipe and related concepts. I am wondering, however, whether the UTF-8 decoding iterator can be fast enough given the current specification. I think Pipe (or another concept) might have to support decoding of exactly one output element. Correct me if I'm wrong.
I don't really understand what you mean. Calling Pipe::ltr or Pipe::rtl only decodes one "element" (utf8 decoding means a multibyte sequence is read and a code point is written, utf8 encoding means a code point is read and a multibyte sequence is written).
Thanks, that explains it—I read the documentation "Reads part of the [begin, end[ range .., writes some elements .. to out" to mean that as much as possible would be read from the range. I guess I should just hold off commenting on these matters until it's finished! Cheers, Rogier

On Fri, Jul 17, 2009 at 4:29 PM, Rogier van Dalen<rogiervd@gmail.com> wrote:
On Fri, Jul 17, 2009 at 20:02, Cory Nelson <phrosty@gmail.com> wrote:
On Thu, Feb 12, 2009 at 6:32 PM, Phil Endecott<spam_from_boost_dev@chezphil.org> wrote:
Cory Nelson wrote:
Is there interest in having a Unicode codec library submitted to Boost?
I've had a look at your code. I like that you have implemented what I called an "error policy". It is wasteful to continuously check the validity of input that comes from trusted code, but important to check it when it comes from an untrusted source. However I'm not sure that your actual conversion is as efficient as it could be. I spent quite a while profiling my UTF8 conversion and came up with this:
http://svn.chezphil.org/libpbe/trunk/include/charset/utf8.hh
I think that you could largely copy&paste bits of that into the right places in your algorithm and get a significant speedup.
I finally found some time to do some optimizations of my own and have had some good progress using a small lookup table, a switch, and slightly reducing branches. See line 318:
http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750MiB/s vs ~240MiB/s on my Core 2). I assume they are either using some gigantic lookup tables or SSE.
Dear Cory,
Though I'm not sure decoding this much UTF8-encoded data is often a bottleneck in practice, let me add my thoughts. I wrote a utf8_codecvt::do_in that uses lookup tables for a Unicode library that Graham and I started long ago (in the vault). I haven't actually compared performance of my implementation against yours, so mine may well be slower. However, the technique may be of interest.
UTF-8 is the primary bottleneck in XML decoding. That's been my motivation thus far.
From the first UTF-8 byte, the values and the valid ranges of the 0 to 3 following bytes are known (Table 3-7 in the Unicode standard, version 5.0). Therefore, you can put these in lookup tables (7 tables with 256 entries each, with a 32-bit code point value). A master table contains 256 entries mapping the first byte to 0 to 3 following tables. You find the code point values for consecutive bytes in the appropriate tables and OR them. The trick that I suspect gave me a speed-up compared to the function the Unicode Consortium publishes is that code point values for invalid bytes are 0xffffffff. After OR-ing all values, you need only one check to see whether there was an error. This reduces branches.
Sounds interesting, I will try this out and see what happens.
As I said, I'm not sure it's actually faster than your code, but it may be an interesting technique. The debug compilation of utf8_data.cpp, which contains the tables and nothing else, is 30kB on Linux gcc; hardly "gigantic". However, with a threefold speed increase it seems likely that Windows uses bigger tables or something smarter than I came up with.
To answer your original question, "is there any interest in Unicode codecs?", yes.
It now seems to me that a full Unicode library would be hard to get accepted into Boost; it would be more feasible to get a UTF library submitted, which is more along the lines of your library. (A Unicode library could later be based on the same principles.)
Freestanding transcoding functions and codecvt facets are not the only thing I believe a UTF library would need, though. I'd add to the list:
- iterator adaptors (input and output);
- range adaptors;
- a code point string;
- compile-time encoding (meta-programming);
- documentation.
I agree, mostly. I'm not sure if a special string (as opposed to basic_string<utf16_t>) would be worthwhile -- what would you do with it that didn't require a full Unicode library supporting it?
(The compile-time encoding thing may sound esoteric. I believe it would be useful for fast parsers. It's not that hard to do at any rate.) I suspect the string is a pain to design to please everyone. Iterator adaptors, I found, are a pain to attach error policies to and write them correctly. For example, with a policy equivalent to your "ReplaceCheckFailures", you need to produce the same code point sequence whether you traverse an invalid encoded string forward or backward. I've got code for UTF-8 that passes my unit tests, but the error checking and the one-by-one decoding make it much harder to optimise.
I believe that Mathias Gaunard is working on a library at <http://blogloufoque.free.fr/unicode/doc/html/>. I don't know how complete it is, but from the documentation it looks well thought-out so far. I'm looking forward to seeing where that's going!
Cheers, Rogier
-- Cory Nelson http://int64.org

On Sat, Jul 18, 2009 at 15:34, Cory Nelson<phrosty@gmail.com> wrote:
On Fri, Jul 17, 2009 at 4:29 PM, Rogier van Dalen<rogiervd@gmail.com> wrote:
Though I'm not sure decoding this much UTF8-encoded data is often a bottleneck in practice,
UTF-8 is the primary bottleneck in XML decoding. That's been my motivation thus far.
And is it necessary to decode large stretches of UTF-8 rather than only the textual content? I imagined performance characteristics are quite different when you decode only short amounts of text at the same time. But I've never actually done this comparison, so I'm happy to take your word for it.
It now seems to me that a full Unicode library would be hard to get accepted into Boost; it would be more feasible to get a UTF library submitted, which is more along the lines of your library. (A Unicode library could later be based on the same principles.)
Freestanding transcoding functions and codecvt facets are not the only thing I believe a UTF library would need, though. I'd add to the list:
- iterator adaptors (input and output);
- range adaptors;
- a code point string;
- compile-time encoding (meta-programming);
- documentation.
I agree, mostly. I'm not sure if a special string (as opposed to basic_string<utf16_t>) would be worthwhile -- what would you do with it that didn't require a full Unicode library supporting it?
Good point. I am not able to come up with a use case, other than "use it as the base of a grapheme string". From the tactical perspective of getting something through a Boost review, though, it would help to flesh out the design of a code point string before writing a grapheme string in the same vein. I think. But I'm becoming less sure of it as I write it. Cheers, Rogier

Cory Nelson wrote:
I finally found some time to do some optimizations of my own and have had some good progress using a small lookup table, a switch, and slightly reducing branches. See line 318:
http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750MiB/s vs ~240MiB/s on my Core 2). I assume they are either using some gigantic lookup tables or SSE.
Hi Cory,

What is your test input? When the input is largely ASCII, a worthwhile optimisation is to cast groups of 4 (or 8) characters to ints and & with 0x80808080; if the answer is zero, no further conversion is needed.

In general I'm unsure of the performance issues of lookup tables compared to explicit bit-manipulation. Cache effects may be significant, and a benchmark will tend to warm up the cache better than a real application might.

I can't see how SSE could be applied to this problem, but it's not something I know much about.

I don't have much time to work on this right now, but if the algorithm plus test harness and test data were bundled up into something that I can just "make", I will try to compare it with my version.

Regards, Phil.
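A sketch of that ASCII fast path, assuming contiguous input; memcpy is used rather than a raw pointer cast to sidestep alignment and aliasing problems, and the tail that doesn't fill a whole word is finished byte by byte.

#include <cstring>
#include <boost/cstdint.hpp>

// Skip over a run of ASCII bytes four at a time: if no byte in the group
// has its top bit set, the group needs no per-byte decoding at all.
inline const char* skip_ascii_run(const char* p, const char* end)
{
    while (end - p >= 4)
    {
        boost::uint32_t word;
        std::memcpy(&word, p, 4);
        if (word & 0x80808080u)                    // some byte in this group is non-ASCII
            break;
        p += 4;
    }
    while (p != end && static_cast<unsigned char>(*p) < 0x80)
        ++p;                                       // finish the tail byte by byte
    return p;                                      // first non-ASCII byte, or end
}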

Cory Nelson wrote:
I finally found some time to do some optimizations of my own and have had some good progress using a small lookup table, a switch, and slightly reducing branches. See line 318:
http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750MiB/s vs ~240MiB/s on my Core 2). I assume they are either using some gigantic lookup tables or SSE.

How much cost are you incurring in the tests for whether the traits indicate that the error returns are valid? I'm wondering if there is a case for requiring that these be compile-time constants in the Traits class rather than flags in a Traits value.

And why is 'last' passed in to decode_unsafe?

Is there any indication that Duff's device will prevent aggressive inlining? I'm assuming you need this method to be fully inlined into the outer loop, and maybe it's not happening - ideally you'd want some loop unrolling too.

I suspect that, as noted, the lack of a special case for largely 7-bit ASCII input will tend to make it slow on most Western texts, though speedups for the multi-character case will need care on alignment-sensitive hardware: you'll need to fix that in the outermost loop.

On Sat, Jul 18, 2009 at 6:21 AM, James Mansion<james@mansionfamily.plus.com> wrote:
Cory Nelson wrote:
I finally found some time to do some optimizations of my own and have had some good progress using a small lookup table, a switch, and slightly reducing branches. See line 318:
http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750MiB/s vs ~240MiB/s on my Core 2). I assume they are either using some gigantic lookup tables or SSE.
How much cost are you incurring in the tests for whether the traits indicate that the error returns are valid?
I've played around with it and have not noticed any significant difference for this.
I'm wondering if there is a case for requiring that these be compile-time constants in the Traits class rather than flags in a Traits value.
And why is 'last' passed in to decode_unsafe?
Leftover from copy-paste, good catch.
Is there any indication that Duff's device will prevent aggressive inlining?
I have looked at the output of both GCC 4.4 and VC++ 2008 with optimization flags cranked up. Each is generating inlined code exactly how I want them to.
I'm assuming you need this method to be fully inlined into the outer loop, and maybe it's not happening - ideally you'd want some loop unrolling too.
I suspect that, as noted, the lack of a special case for largely 7-bit ASCII input will tend to make it slow on most Western texts, though speedups for the multi-character case will need care on alignment-sensitive hardware: you'll need to fix that in the outermost loop.
Indeed. I haven't done this because the code uses iterators, but I think some small specializations could be made to enable this in transcode() when the input is a raw pointer.

One thing I have been trying is in that decode_unsafe: it has fewer branches overall and compiles down to the optimal assembly I'd expect. For some reason, it runs slower. No clue why!

-- Cory Nelson http://int64.org

On Sat, Jul 18, 2009 at 5:02 AM, Phil Endecott<spam_from_boost_dev@chezphil.org> wrote:
Cory Nelson wrote:
I finally found some time to do some optimizations of my own and have had some good progress using a small lookup table, a switch, and slightly reducing branches. See line 318:
http://svn.int64.org/viewvc/int64/snips/unicode.hpp?view=markup
Despite these efforts, Windows 7 still decodes UTF-8 three times faster (~750MiB/s vs ~240MiB/s on my Core 2). I assume they are either using some gigantic lookup tables or SSE.
Hi Cory,
What is your test input?
I'm using a large file with a mix of many languages in it: JMDict, available here: http://www.csse.monash.edu.au/~jwb/jmdict.html
When the input is largely ASCII, a worthwhile optimisation is to cast groups of 4 (or 8) characters to ints and & with 0x80808080; if the answer is zero, no further conversion is needed.
It is something I've considered, but it is a bit harder to translate to working with generic iterators.
In general I'm unsure of the performance issues of lookup tables compared to explicit bit-manipulation. Cache effects may be significant, and a benchmark will tend to warm up the cache better than a real application might.
I can't see how SSE could be applied to this problem, but it's not something I know much about.
I believe there is an SSE algorithm out there, but it of course won't work with iterators.
I don't have much time to work on this right now, but if the algorithm plus test harness and test data were bundled up into something that I can just "make", I will try to compare it with my version.
I will try to upload my benchmarking code somewhere today. -- Cory Nelson http://int64.org
participants (11)
- Cory Nelson
- Edward Grace
- Eric Niebler
- Howard Hinnant
- James Mansion
- Mathias Gaunard
- Phil Endecott
- Rogier van Dalen
- Sebastian Redl
- Stewart, Robert
- Thorsten Ottosen