Re: [boost] UTF-8 conversion etc.

25 Feb 2008

Felipe Magno de Almeida wrote:
...
On Mon, Feb 25, 2008 at 8:09 AM, Sebastian Redl
<sebastian.redl@getdesigned.at> wrote:
...
Phil Endecott wrote:
...
Things I'd appreciate feedback on:
- What should the cs_string look like?  Basically everywhere that
std::string uses an integer position I have the choice of a character
position, a unit position, or an iterator - or not providing that function.
I think emulating std::string doesn't work. It has a naive design based
 on the assumption of fixed-width encodings. I think that a tagged string
 is the best place to really start over with a string design and produce
 a string that is lean, rather than bloated.
I agree.
Hmmm.  I hear what you're saying, but things that are too revolutionary 
don't get used because they're too different from what people are used 
to.  I'd like to offer something that's close to a drop-in replacement 
for std::string that will let people painlessly upgrade their code to 
proper character set support.

However, most of the work that I have done has been at a lower level 
and can be easily built upon to enable a new class with a different 
interface as well.   So you can have your cake and eat it!  Comments 
about both are welcome.
...
...
I think the string type should offer minimal manipulation facilities -
 either completely read-only or append as the only manipulation function.
I would like to have at least a modifiable string. But only through
iterators (insert and erase).
That should suffice for all my algorithm needs.
Try this: temporarily replace all your strings with list<character> and 
see what's missing.
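
To make that experiment concrete, here is a minimal sketch of a replace operation written only in terms of iterator-based erase and insert. It assumes plain char as a stand-in for a real character type, and the names char_list and replace_char are invented for illustration; neither comes from the proposed library.

```cpp
#include <cassert>
#include <list>
#include <string>

// Stand-in for list<character>; a real character type would replace char.
typedef std::list<char> char_list;

// Replace every occurrence of `from` with the sequence `to`,
// using only iterator-based erase() and insert().
inline void replace_char(char_list& s, char from, const std::string& to) {
    for (char_list::iterator it = s.begin(); it != s.end(); ) {
        if (*it == from) {
            it = s.erase(it);                    // iterator to the element after the erased one
            s.insert(it, to.begin(), to.end());  // inserts before it; it stays valid for a list
        } else {
            ++it;
        }
    }
}
```

What is visibly missing compared with std::string is operator[], member find and substr, and c_str(); whether real code misses them is the empirical question.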
...
...
A string buffer type could be written as a mutable alternative, as is
 the design in Java and C#. However, I'm not sure how much of that
 interface is needed, either.
I'm unfamiliar with what Java and C# do, but my lower-level code (e.g. 
character_output_iterator) makes it simple to write e.g. UTF-8 into 
arbitrary memory.
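
As a rough illustration of the idea, and not the actual character_output_iterator interface, an output iterator that encodes code points as UTF-8 into any underlying byte output iterator might look like the sketch below. The names utf8_output_iterator, code_point and base() are assumptions made for this example, and it does no validation of surrogates or out-of-range values.

```cpp
#include <cassert>
#include <iterator>

// Hypothetical code point type; wide enough for all of Unicode.
typedef unsigned long code_point;

// Wraps a byte output iterator and encodes each assigned code point as UTF-8.
template <typename ByteOut>
class utf8_output_iterator {
public:
    typedef std::output_iterator_tag iterator_category;
    typedef void value_type;
    typedef void difference_type;
    typedef void pointer;
    typedef void reference;

    explicit utf8_output_iterator(ByteOut out) : out_(out) {}

    // Assignment does the work: emit 1 to 4 bytes for the code point.
    utf8_output_iterator& operator=(code_point c) {
        if (c < 0x80) {
            put(c);
        } else if (c < 0x800) {
            put(0xC0 | (c >> 6));
            put(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            put(0xE0 | (c >> 12));
            put(0x80 | ((c >> 6) & 0x3F));
            put(0x80 | (c & 0x3F));
        } else {
            put(0xF0 | (c >> 18));
            put(0x80 | ((c >> 12) & 0x3F));
            put(0x80 | ((c >> 6) & 0x3F));
            put(0x80 | (c & 0x3F));
        }
        return *this;
    }

    // Dereference and increment are no-ops, as is usual for output iterators.
    utf8_output_iterator& operator*() { return *this; }
    utf8_output_iterator& operator++() { return *this; }
    utf8_output_iterator& operator++(int) { return *this; }

    // Current position in the underlying byte sequence.
    ByteOut base() const { return out_; }

private:
    void put(code_point b) {
        *out_ = static_cast<unsigned char>(b);
        ++out_;
    }
    ByteOut out_;
};
```

With ByteOut = unsigned char* this writes straight into a raw buffer; with a std::back_insert_iterator it appends to a container instead.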
...
A modifiable iterator interface (with insert and erase) is, IMO, as
concise and extensible as possible.
...
I'd love to have some empirical data on string usage.
I do some string manipulations on email. And it is usually better to
do all manipulations in the codepage received, instead of converting
back and forth.
One issue that I'm currently thinking about with this sort of usage is 
compile-time character set tagging vs. run-time character set tagging.  
In fact, I've been wondering whether there is some general pattern for 
providing both e.g.

template <charset_t cset> void foo(int x);
and
void foo(charset_t cset, int x);

You can obviously forward from the first to the second but that may 
lose some compile-time-constant optimisations; forwarding from the 
second to the first needs a horrible case statement.  I was wondering 
about a macro that would define both...  Any ideas, anyone?
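
One possible shape for such a macro, offered only as a sketch: keep a single list of charsets and expand it three times, once for the enum, once for per-charset traits, and once for the case labels that forward the run-time call to the compile-time one. The charset names, the max_bytes values and the meaning given to foo (worst-case bytes for x characters) are all assumptions made for the example.

```cpp
#include <cassert>

// Single source of truth: charset name and worst-case bytes per character.
#define CHARSET_LIST(X) \
    X(ascii,  1)        \
    X(latin1, 1)        \
    X(utf8,   4)        \
    X(utf16,  4)

// Expansion 1: the run-time tag type.
#define CHARSET_ENUM(name, max_bytes) name,
enum charset_t { CHARSET_LIST(CHARSET_ENUM) charset_count };
#undef CHARSET_ENUM

// Expansion 2: compile-time properties, one specialisation per charset.
template <charset_t cset> struct charset_traits;
#define CHARSET_TRAITS(name, bytes) \
    template <> struct charset_traits<name> { static const int max_bytes = bytes; };
CHARSET_LIST(CHARSET_TRAITS)
#undef CHARSET_TRAITS

// Compile-time tagged version: cset is a constant, so the multiply can be folded.
template <charset_t cset>
int foo(int x) { return x * charset_traits<cset>::max_bytes; }

// Run-time tagged version: the switch is generated from the same list.
int foo(charset_t cset, int x) {
    switch (cset) {
#define CHARSET_CASE(name, max_bytes) case name: return foo<name>(x);
        CHARSET_LIST(CHARSET_CASE)
#undef CHARSET_CASE
        default: return -1;  // not a valid charset tag
    }
}
```

Both foo<utf8>(3) and foo(utf8, 3) then go through the same template body. The case statement is still there, but it is written once and cannot drift out of step with the enum.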
...
...
...
- What character sets are people interested in using (a) at the "edges"
of their programs,
 As many as possible. Theoretically, a program might have to deal with
 any and all encodings out there. Realistically, there are probably a dozen
 or two that are relevant. You'd need empirical data.
I have looked at the charsets in all my email, but the results are 
skewed by the spam.
...
Unfortunately I need all of those supported by MIME.
Falling back using e.g. iconv() for the otherwise-unsupported ones is 
my plan.

I'm unlikely to have the energy to write code for more than a couple of 
the exotic sets myself.  If anyone would like to help, please get in touch.
...
...
...
and (b) in the "core"?
ASCII, UTF-8 and UTF-16.
ISO-8859-1 ?
Cheers,

Phil.