Re: [boost] UTF-8 conversion etc.

8 Mar 2008

Sebastian Redl wrote:
...
Phil Endecott wrote:
...
OK, the code is here:
   http://svn.chezphil.org/libpbe/trunk/include/charset/
and there are some very basic docs here:
   http://svn.chezphil.org/libpbe/trunk/doc/charsets/
(Have a look at intro.txt for the feature list.)
Another conceptual problem in your traits. Take a look at UTF-8's 
skip_forward_char:
template <typename char8_ptr_t>
  static void skip_forward_char(char8_ptr_t& i) {
    do {
      ++i;
    } while (!char_start_byte(*i));  // Maybe hint this?
  }
And this loop:
for(iterator it = cnt.begin(); it != cnt.end(); skip_forward_char(it)) {
}
This will always invoke undefined behaviour. Consider the case where it 
is just before end(), i.e. ++it == cnt.end(). Then skip_forward_char() 
will indeed do ++it, and then do *it, thus dereferencing the 
past-the-end iterator. Boom.
Yes, absolutely. I'm aware of this and similar problems.  But please 
keep reporting them :-)

In this case the problem is slightly less serious if you write 
something more like

template <typename char8_ptr_t>
  static void skip_forward_char(char8_ptr_t& i) {
    std::advance(i, char_length(*i));
  }

In this case you don't dereference an invalid iterator if the input is 
valid and complete UTF8.  That might be useful in some circumstances.  
But computing char_length is actually harder than the loop with char_start_byte().
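
For illustration, a minimal sketch of that char_length-based variant follows. The char_length helper here is an assumption, not the libpbe version: it derives the sequence length from the lead byte alone and treats stray continuation bytes as length 1 so that the iterator always advances. On truncated input std::advance can still step past the end, so this only helps when the data is valid and complete.

```cpp
#include <cstddef>
#include <iterator>

// Hypothetical helper: length in bytes of the UTF-8 sequence introduced by
// lead byte b. Continuation bytes and invalid lead bytes are treated as
// length 1 so that the caller always makes progress.
inline std::size_t char_length(unsigned char b) {
  if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;                          // stray continuation or invalid byte
}

// Reads only the lead byte, then jumps over the whole sequence without
// looking at the byte that follows it.
template <typename char8_ptr_t>
void skip_forward_char(char8_ptr_t& i) {
  std::advance(i, char_length(static_cast<unsigned char>(*i)));
}
```

The four comparisons in char_length are the extra cost compared with the single test in the char_start_byte() loop.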

On the other hand my code does work for zero-terminated data, which is 
useful in the case of std::string::c_str().  I presume that the 
standard doesn't guarantee that dereferencing the byte after the end of 
a string returns 0, even though an implementation that provides c_str() 
in the obvious way would have to do so, right?

I'm not sure what the best solution to that problem is, but I have 
thought more about the converse case where you're storing UTF8 using 
the output iterator, and you're writing into a fixed-size buffer, e.g. (pseudo-code)

char* iso88591_data;
size_t iso8859_data_length;

// The UTF8 data will take more space than the iso8859_1 data;
// maybe we know that in our case most bytes will be ASCII, so we allow
// a 10% overhead:
size_t utf8_length = iso8859_data_length + iso8859_data_length / 10;
char* utf8_data = new char[utf8_length];

// In the rare case where that's insufficient we'll abort and retry with a
// larger buffer, or do the rest in another chunk or something.

// Iterator to store utf8:
character_output_iterator<utf8> utf8_it(utf8_data);

// First thought is to use a function with the same signature as std::copy:
seq_conv(iso88591_data, iso88591_data+iso8859_data_length, utf8_it);

// But that doesn't allow us to specify the end of the output buffer.  So
// we make that an additional parameter:
character_output_iterator<utf8> utf8_end_it(utf8_data+utf8_length);
seq_conv(iso88591_data, iso88591_data+iso8859_data_length, utf8_it, utf8_end_it);

// But this may terminate either because it reached the end of the
// input or because it reached the end of the output.  So perhaps it
// needs to return a pair<> of iterators reporting how far it got through
// each.
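
One way the pair-returning, bounded form might look, sketched for the specific ISO-8859-1 to UTF-8 case on raw pointers rather than character_output_iterator. The name seq_conv_bounded and its signature are hypothetical, chosen only for this example.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical bounded conversion from ISO-8859-1 bytes to UTF-8 bytes.
// Stops at the end of the input, or when the next character would not fit
// in the output, and reports how far it got through each range.
inline std::pair<const unsigned char*, unsigned char*>
seq_conv_bounded(const unsigned char* in, const unsigned char* in_end,
                 unsigned char* out, unsigned char* out_end) {
  while (in != in_end) {
    const unsigned char c = *in;
    const std::ptrdiff_t needed = (c < 0x80) ? 1 : 2;
    if (out_end - out < needed) break;  // output full: stop before this char
    if (c < 0x80) {
      *out++ = c;                       // ASCII is copied unchanged
    } else {
      // 0x80..0xFF becomes a two-byte sequence 110000xx 10xxxxxx
      *out++ = static_cast<unsigned char>(0xC0 | (c >> 6));
      *out++ = static_cast<unsigned char>(0x80 | (c & 0x3F));
    }
    ++in;
  }
  return std::make_pair(in, out);
}
```

The caller can compare the first member of the result with in_end to tell whether the conversion finished or ran out of output space, and resume from both returned positions with a fresh buffer.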

But I'm also concerned that the inner loops in these conversion 
algorithms shouldn't be doing more comparisons than is absolutely 
necessary.  So I'm currently considering having both versions, with and 
without the destination-end iterator.  I've added functions (or maybe 
constants) to the charset_traits indicating the maximum number of units 
per character.  The bounded version can then be implemented something 
like this: (pseudo-code!)

seq_conv(in_start, in_end, out_start, out_end) {
   size_t out_length = out_end-out_start;
   max_chars = out_length / charset_traits<cset>::max_units_per_char();
   // We can safely copy max_chars from in to out without worrying about out_end:
   (in_next,out_next) = seq_conv(in_start, min(in_end, in_start+max_chars), out_start);
   // We do need to worry about out_end while copying the others:
   seq_conv(in_next, in_end, out_next, out_end);
}
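
A compilable sketch of that two-phase idea, again restricted to ISO-8859-1 to UTF-8 on raw pointers. The names put_utf8, seq_conv_checked and conv_result, and the constant max_units_per_char, are hypothetical stand-ins for illustration and are not the libpbe interfaces.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

typedef std::pair<const unsigned char*, unsigned char*> conv_result;

// Assumed stand-in for charset_traits<cset>::max_units_per_char(): a
// Latin-1 character needs at most 2 bytes in UTF-8.
const std::ptrdiff_t max_units_per_char = 2;

// Writes one Latin-1 character as UTF-8; the caller guarantees the room.
inline unsigned char* put_utf8(unsigned char c, unsigned char* out) {
  if (c < 0x80) {
    *out++ = c;
  } else {
    *out++ = static_cast<unsigned char>(0xC0 | (c >> 6));
    *out++ = static_cast<unsigned char>(0x80 | (c & 0x3F));
  }
  return out;
}

// Fast path: no comparison against the end of the output.
inline conv_result seq_conv(const unsigned char* in,
                            const unsigned char* in_end,
                            unsigned char* out) {
  for (; in != in_end; ++in) out = put_utf8(*in, out);
  return conv_result(in, out);
}

// Slow path: checks the remaining space before every character.
inline conv_result seq_conv_checked(const unsigned char* in,
                                    const unsigned char* in_end,
                                    unsigned char* out,
                                    unsigned char* out_end) {
  for (; in != in_end; ++in) {
    const std::ptrdiff_t needed = (*in < 0x80) ? 1 : 2;
    if (out_end - out < needed) break;
    out = put_utf8(*in, out);
  }
  return conv_result(in, out);
}

// Bounded version: one unchecked pass over as many characters as are
// certain to fit, then a checked pass over whatever input remains.
inline conv_result seq_conv(const unsigned char* in,
                            const unsigned char* in_end,
                            unsigned char* out, unsigned char* out_end) {
  const std::ptrdiff_t safe_chars = (out_end - out) / max_units_per_char;
  const unsigned char* fast_end =
      in + std::min<std::ptrdiff_t>(in_end - in, safe_chars);
  const conv_result r = seq_conv(in, fast_end, out);
  return seq_conv_checked(r.first, in_end, r.second, out_end);
}
```

The sketch takes the minimum of two counts rather than of two pointers, so in_start+max_chars is never formed when it would lie beyond in_end. With a mostly-ASCII input and an output buffer close to the input size, roughly half the characters go through the unchecked loop.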
...
Compare with filter_iterator. skip_forward_char *must* take the end 
iterator, too, and stop when reaching it. This, in turn, makes the 
charset adapter iterator that much more complicated.
Yes.  filter_iterator is a good example; I would like to be consistent 
with existing practice when it's appropriate to do so.  As you can see 
I'm progressing quite slowly with this work.  This has the advantage 
that I have plenty of time to think about what I should do next before 
I implement it....
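
A sketch of what an end-aware skip_forward_char could look like, in the spirit of filter_iterator. It assumes a char_start_byte that treats any byte other than a 10xxxxxx continuation byte as the start of a character; count_chars is a hypothetical helper added only to show the loop from earlier in the thread running without touching the past-the-end position.

```cpp
#include <cassert>
#include <cstddef>

// Assumed definition: every byte except a continuation byte starts a char.
inline bool char_start_byte(unsigned char b) {
  return (b & 0xC0) != 0x80;
}

// Advances i to the start of the next character, or to end if there is
// none. The end test comes first, so end itself is never dereferenced.
template <typename Iter>
void skip_forward_char(Iter& i, Iter end) {
  assert(i != end);  // precondition: there is a character to skip
  do {
    ++i;
  } while (i != end && !char_start_byte(static_cast<unsigned char>(*i)));
}

// Hypothetical helper: counts characters by stepping through the range.
template <typename Iter>
std::size_t count_chars(Iter first, Iter last) {
  std::size_t n = 0;
  for (; first != last; skip_forward_char(first, last)) ++n;
  return n;
}
```

The price is one extra comparison per byte, plus an adaptor that has to carry the end iterator around with it.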

BTW I have just written a base64 decoding iterator adaptor.  It also 
needs you to pass an iterator referring to the end of the data so that 
it can do the right thing at the end.  Anyone interested?

Phil.