Re: [boost] UTF-8 conversion etc.

25 Feb 2008


On Mon, Feb 25, 2008 at 6:06 PM, Phil Endecott
<spam_from_boost_dev@chezphil.org> wrote:
...
Felipe Magno de Almeida wrote:
...
On Mon, Feb 25, 2008 at 8:09 AM, Sebastian Redl
<sebastian.redl@getdesigned.at> wrote:
[snip]
...
...
...
I think emulating std::string doesn't work. It has a naive design based
 on the assumption of fixed-width encodings. I think that a tagged string
 is the best place to really start over with a string design and produce
 a string that is lean, rather than bloated.
I agree.
Hmmm.  I hear what you're saying, but things that are too revolutionary
 don't get used because they're too different from what people are used
 to.  I'd like to offer something that's close to a drop-in replacement
 for std::string that will let people painlessly upgrade their code to
 proper character set support.
I would very much like to use it, and I am not very concerned if some
algorithms have to change. I'm now using ICU directly, and it is
quite a PITA.
...
However, most of the work that I have done has been at a lower level
 and can be easily built upon to enable a new class with a different
 interface as well.   So you can have your cake and eat it!  Comments
 about both are welcome.
You could create a bloated_utf8 as a drop-in replacement for
std::string, and at the same time discourage its use. :P
...
...
...
I think the string type should offer minimal manipulation facilities -
 either completely read-only or append as the only manipulation function.
I would like to have at least a modifiable string, but only through
iterators (insert and erase).
That should suffice for all my algorithm needs.
Try this: temporarily replace all your strings with list<character> and
 see what's missing.
I did (not *all*, but in very significant places).
The first problem I hit was code unnecessarily requiring
RandomAccessIterators, e.g. using operator+ instead of std::advance.
Other places use std::string::size_type and operator[].
But I can say these are easily corrected.
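
To illustrate the kind of correction meant here, the following is a
minimal sketch (C++11) of code that positions and edits only through
iterators, so it compiles against std::string and std::list<char>
alike. The helper names nth and erase_all are hypothetical and do not
come from the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Hypothetical helper: iterator to the n-th element of any container.
// Writing `c.begin() + n` would compile only for random-access iterators;
// std::advance works for all of them (O(1) or O(n) as appropriate).
template <typename Container>
typename Container::const_iterator nth(const Container& c, std::size_t n)
{
    assert(n <= c.size());
    typename Container::const_iterator it = c.begin();
    std::advance(it, static_cast<typename Container::difference_type>(n));
    return it;
}

// Hypothetical helper: remove every occurrence of ch, using only the
// iterator-based erase() instead of size_type indices and operator[].
template <typename Container>
void erase_all(Container& c, typename Container::value_type ch)
{
    for (auto it = c.begin(); it != c.end(); )
        it = (*it == ch) ? c.erase(it) : std::next(it);
}
```

Both helpers take the container type as a template parameter, so
swapping std::string for list<character> in calling code needs no
further change.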
...
...
...
A string buffer type could be written as a mutable alternative, as is
 the design in Java and C#. However, I'm not sure how much of that
 interface is needed, either.
I'm unfamiliar with what Java and C# do, but my lower-level code (e.g.
 character_output_iterator) makes it simple to write e.g. UTF-8 into
 arbitrary memory.
Good.
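
The character_output_iterator mentioned above is not shown in this
thread, so the following is only a rough sketch of the same idea as a
free function: encode one code point as UTF-8 through any byte output
iterator, whether that is a raw char* into arbitrary memory or a
std::back_inserter. The name write_utf8 is an assumption made for
illustration.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical helper: writes the UTF-8 encoding of cp through `out`
// and returns the advanced output iterator.
template <typename ByteOut>
ByteOut write_utf8(char32_t cp, ByteOut out)
{
    // Only Unicode scalar values are encodable (no surrogates).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Returning the advanced iterator lets a caller chain writes into a
fixed buffer without the function knowing anything about the
destination.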
...
...
A modifiable iterator interface (with insert and erase) is, IMO, as
concise and extensible as possible.
...
I'd love to have some empirical data on string usage.
I do some string manipulation on email, and it is usually better to
do all manipulation in the code page in which the message was
received, instead of converting back and forth.
One issue that I'm currently thinking about with this sort of usage is
 compile-time character set tagging vs. run-time character set tagging.
 In fact, I've been wondering whether there is some general pattern for
 providing both e.g.
template <charset_t cset> void foo(int x);
 and
 void foo(charset_t cset, int x);
I can say I won't be using compile-time tagged strings much.
But, I guess you could do:

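// Compile-time tagged string: the charset is part of the type.
template <typename Char, typename Charset> struct compiletime_string;

// Run-time tagged string: constructible from any compile-time tagged one.
template <typename Char> struct string
{
  template <typename Charset>
  string(compiletime_string<Char, Charset> const& s);
};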

And then you can have compile-time tagged strings and runtime tagged
strings work together seamlessly.
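
A slightly fuller sketch of that interplay, under assumptions not
stated in the thread: the tag types utf8_tag and latin1_tag, their
name() members, and the data/charset fields are all hypothetical, and
the run-time tagged type is called runtime_string here to avoid
clashing with std::string.

```cpp
#include <cassert>
#include <string>

// Hypothetical charset tags; name() maps the compile-time tag to the
// C-string charset name that a run-time tagged string stores.
struct utf8_tag   { static const char* name() { return "UTF-8"; } };
struct latin1_tag { static const char* name() { return "ISO-8859-1"; } };

// Compile-time tagged string: the charset lives in the type only.
template <typename Char, typename Charset>
struct compiletime_string
{
    std::basic_string<Char> data;
};

// Run-time tagged string: the charset is carried as a value.
template <typename Char>
struct runtime_string
{
    std::basic_string<Char> data;
    std::string charset;

    // Implicit conversion from any compile-time tagged string; the tag
    // is turned into its run-time name at this point.
    template <typename Charset>
    runtime_string(compiletime_string<Char, Charset> const& s)
        : data(s.data), charset(Charset::name())
    {
        assert(!charset.empty());
    }
};
```

With this shape, a function taking runtime_string<char> const& accepts
a compiletime_string<char, utf8_tag> argument directly.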
...
You can obviously forward from the first to the second but that may
 lose some compile-time-constant optimisations; forwarding from the
 second to the first needs a horrible case statement.  I was wondering
 about a macro that would define both.... any ideas anyone?
I guess a macro wouldn't be a very good idea.
You can just do some ifs in the run-time tagged function and forward to
the compile-time function for cases where you have an optimized
compile-time version for those charsets. For all others, just
execute a common function (based on iconv, maybe), passing the
character set name.
You could have a map from each compile-time character set to its
C-string character set name.
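
A minimal sketch of that dispatch, assuming hypothetical tag types and
a hypothetical char_count operation (neither appears in the thread).
The generic fallback is only a placeholder that throws; an
iconv-based implementation would go there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical tags; name() is the compile-time to C-string name map.
struct utf8_tag  { static const char* name() { return "UTF-8"; } };
struct ascii_tag { static const char* name() { return "US-ASCII"; } };

// Compile-time tagged version, specialised per charset.
template <typename Charset>
std::size_t char_count(const std::string& bytes);

template <>
inline std::size_t char_count<ascii_tag>(const std::string& bytes)
{
    return bytes.size();                     // one byte per character
}

template <>
inline std::size_t char_count<utf8_tag>(const std::string& bytes)
{
    std::size_t n = 0;
    for (char c : bytes)                     // skip continuation bytes 10xxxxxx
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
    return n;
}

// Placeholder for the common path (e.g. built on iconv); not implemented.
inline std::size_t char_count_generic(const char* charset, const std::string&)
{
    throw std::runtime_error(std::string("no generic path for ") + charset);
}

// Run-time tagged version: a few ifs forward to the optimised
// compile-time versions, everything else goes to the common function.
inline std::size_t char_count(const char* charset, const std::string& bytes)
{
    assert(charset != nullptr);
    if (std::strcmp(charset, utf8_tag::name()) == 0)
        return char_count<utf8_tag>(bytes);
    if (std::strcmp(charset, ascii_tag::name()) == 0)
        return char_count<ascii_tag>(bytes);
    return char_count_generic(charset, bytes);
}
```

A real version would presumably compare charset names
case-insensitively and handle aliases; the exact strcmp here is a
simplification.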
...
...
...
...
- What character sets are people interested in using (a) at the "edges"
of their programs,
 As many as possible. Theoretically, a program might have to deal with
 any and all encodings out there. Realistically, there's probably a dozen
 or two that are relevant. You'd need empirical data.
I have looked at the charsets in all my email, but the results are
 thrown off by the spam.
...
Unfortunately I need all the ones supported by MIME.
Falling back using e.g. iconv() for the otherwise-unsupported ones is
 my plan.
That's good enough for me.

[snip]
...
Cheers,
Phil.
Regards,
-- 
Felipe Magno de Almeida