
Sebastian Redl wrote:
Phil Endecott wrote:
OK, the code is here: http://svn.chezphil.org/libpbe/trunk/include/charset/
and there are some very basic docs here: http://svn.chezphil.org/libpbe/trunk/doc/charsets/ (Have a look at intro.txt for the feature list.)
Another conceptual problem in your traits. Take a look at UTF-8's skip_forward_char:
    template <typename char8_ptr_t>
    static void skip_forward_char(char8_ptr_t& i) {
      do {
        ++i;
      } while (!char_start_byte(*i));  // Maybe hint this?
    }
And this loop:
for(iterator it = cnt.begin(); it != cnt.end(); skip_forward_char(it)) { }
This will always invoke undefined behaviour. Consider the case where it is just before end(), i.e. ++it == cnt.end(). Then skip_forward_char() will indeed do ++it, and then do *it, thus dereferencing the past-the-end iterator. Boom.
Yes, absolutely. I'm aware of this and similar problems. But please keep reporting them :-)

In this case the problem is slightly less serious if you write something more like

    skip_forward_char(char8_ptr_t& i) {
      advance(i, char_length(*i));
    }

In this case you don't dereference an invalid iterator if the input is valid and complete UTF8. That might be useful in some circumstances. But computing char_length is actually harder than the loop with char_start_byte().

On the other hand my code does work for zero-terminated data, which is useful in the case of std::string::c_str(). I presume that the standard doesn't guarantee that dereferencing the byte after the end of a string returns 0, even though an implementation that provides c_str() in the obvious way would have to do so, right?

I'm not sure what the best solution to that problem is, but I have thought more about the converse case where you're storing UTF8 using the output iterator, and you're writing into a fixed-size buffer, e.g. (pseudo-code)

    char* iso88591_data;
    size_t iso88591_data_length;

    // The UTF8 data will take more space than the iso8859_1 data;
    // maybe we know that in our case most bytes will be ASCII, so we allow
    // a 10% overhead:
    size_t utf8_length = iso88591_data_length * 1.1;
    char* utf8_data = new char[utf8_length];
    // In the rare case where that's insufficient we'll abort and retry with a
    // larger buffer, or do the rest in another chunk or something.

    // Iterator to store utf8:
    character_output_iterator<utf8> utf8_it(utf8_data);

    // First thought is to use a function with the same signature as std::copy:
    seq_conv(iso88591_data, iso88591_data + iso88591_data_length, utf8_it);

    // But that doesn't allow us to specify the end of the output buffer. So
    // we make that an additional parameter:
    character_output_iterator<utf8> utf8_end_it(utf8_data + utf8_length);
    seq_conv(iso88591_data, iso88591_data + iso88591_data_length,
             utf8_it, utf8_end_it);

    // But this may terminate either because it reached the end of the
    // input or because it reached the end of the output. So perhaps it
    // needs to return a pair<> of iterators reporting how far it got through
    // each.

But I'm also concerned that the inner loops in these conversion algorithms shouldn't be doing more comparisons than is absolutely necessary. So I'm currently considering having both versions, with and without the destination-end iterator. I've added functions (or maybe constants) to the charset_traits indicating the maximum number of units per character. The bounded version can then be implemented something like this: (pseudo-code!)

    seq_conv(in_start, in_end, out_start, out_end) {
      size_t out_length = out_end - out_start;
      max_chars = out_length / charset_traits<cset>::max_units_per_char();
      // We can safely copy max_chars from in to out without worrying about out_end:
      (in_next, out_next) = seq_conv(in_start, min(in_end, in_start + max_chars), out_start);
      // We do need to worry about out_end while copying the others:
      seq_conv(in_next, in_end, out_next, out_end);
    }
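As a rough illustration of the two-phase idea, here is a minimal C++ sketch specialised to ISO-8859-1 input and UTF-8 output on raw char pointers. It is not the libpbe code: put_utf8 and the max_units_per_char constant are hypothetical stand-ins for the character_output_iterator and charset_traits pieces described above, and this seq_conv signature is only one possible shape for the pair-returning version.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for charset_traits<utf8>::max_units_per_char()
// when the source is ISO-8859-1: every code point is below U+0100,
// so it needs at most two UTF-8 bytes.
constexpr std::size_t max_units_per_char = 2;

// Encode one ISO-8859-1 byte as UTF-8; returns the advanced output pointer.
inline char* put_utf8(unsigned char c, char* out) {
    if (c < 0x80) {
        *out++ = static_cast<char>(c);
    } else {
        *out++ = static_cast<char>(0xC0 | (c >> 6));
        *out++ = static_cast<char>(0x80 | (c & 0x3F));
    }
    return out;
}

// Bounded conversion: stops at the end of the input or when the next
// character would not fit, and reports how far it got in each sequence.
inline std::pair<const char*, char*>
seq_conv(const char* in, const char* in_end, char* out, char* out_end) {
    assert(in <= in_end && out <= out_end);

    // Phase 1: this many characters fit even in the worst case,
    // so the inner loop only tests the input bound.
    const std::size_t safe = std::min<std::size_t>(
        static_cast<std::size_t>(in_end - in),
        static_cast<std::size_t>(out_end - out) / max_units_per_char);
    for (const char* stop = in + safe; in != stop; ++in)
        out = put_utf8(static_cast<unsigned char>(*in), out);

    // Phase 2: check the remaining space before each character.
    for (; in != in_end; ++in) {
        const unsigned char c = static_cast<unsigned char>(*in);
        const std::size_t need = c < 0x80 ? 1 : 2;
        if (static_cast<std::size_t>(out_end - out) < need) break;
        out = put_utf8(c, out);
    }
    return {in, out};
}
```

If the returned input pointer is not in_end, the caller knows the output filled up first and can retry with a larger buffer or continue in another chunk from that position.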
Compare with filter_iterator. skip_forward_char *must* take the end iterator, too, and stop when reaching it. This, in turn, makes the charset adapter iterator that much more complicated.
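A minimal sketch of what an end-aware version might look like, in the spirit of filter_iterator. The char_start_byte definition here is an assumption (any byte that is not a 10xxxxxx continuation byte), and count_chars is a hypothetical helper added only to show the loop from earlier in the thread running without touching the past-the-end position.

```cpp
#include <cassert>
#include <cstddef>

// Assumed definition: a byte starts a character unless it is a
// UTF-8 continuation byte of the form 10xxxxxx.
inline bool char_start_byte(unsigned char b) {
    return (b & 0xC0) != 0x80;
}

// Advance i to the start of the next character, or to end if there is
// none. The end test comes first, so end itself is never dereferenced.
// Precondition: i != end.
template <typename Iter>
void skip_forward_char(Iter& i, Iter end) {
    assert(i != end);
    do {
        ++i;
    } while (i != end && !char_start_byte(static_cast<unsigned char>(*i)));
}

// Hypothetical helper: count the characters in [begin, end).
template <typename Iter>
std::size_t count_chars(Iter begin, Iter end) {
    std::size_t n = 0;
    for (Iter it = begin; it != end; skip_forward_char(it, end))
        ++n;
    return n;
}
```

The cost is one extra comparison per byte, plus an adapter iterator that has to store the end position alongside the current one.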
Yes. filter_iterator is a good example; I would like to be consistent with existing practice when it's appropriate to do so.

As you can see I'm progressing quite slowly with this work. This has the advantage that I have plenty of time to think about what I should do next before I implement it....

BTW I have just written a base64 decoding iterator adaptor. It also needs you to pass an iterator referring to the end of the data so that it can do the right thing at the end. Anyone interested?

Phil.