Re: [boost] Strings tagged with their character set

26 Sep 2007


      On 9/26/07, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
...
Joseph Gauterin wrote:
[snip]
...
...
I've noticed that there are frequent requests/proposals for some sort
of boost unicode/string encoding library. I've thought about the
problem and it seems too big for one person to handle in their spare
time.
Let me say "part time" rather than "spare time"...
Sorry to jump into the discussion, but I've been watching it since the
start, and I'm interested in this project too. Though I'm a little
swamped with work right now, I do work with exactly one of the use cases
raised in the thread: e-mail parsing (and related tasks).
And tagging is exactly the right approach for it.
The hardest part is: how do we compare external strings with e-mail text?
So far, the safest approach I have found is to convert everything to
Unicode and then compare both (if they don't have the same encoding).
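
To make the "convert everything to Unicode, then compare" idea concrete,
here is a minimal sketch in standard C++11. The names decode_latin1,
decode_utf8 and same_text are made up for illustration and are not part
of any existing or proposed Boost interface. It only handles ISO-8859-1
and UTF-8, and it compares raw code points without any Unicode
normalization, so treat it as an outline rather than a real solution.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: ISO-8859-1 bytes map one-to-one onto the
// Unicode code points with the same numeric value.
std::u32string decode_latin1(const std::string& bytes) {
    std::u32string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) out.push_back(static_cast<char32_t>(b));
    return out;
}

// Hypothetical helper: decode 1- to 4-byte UTF-8 sequences.
// Throws on bad lead bytes, bad continuation bytes, or truncation.
// For brevity it does not reject overlong forms or surrogates.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Compare a Latin-1 string with a UTF-8 string by code point.
bool same_text(const std::string& latin1, const std::string& utf8) {
    return decode_latin1(latin1) == decode_utf8(utf8);
}
```

For example, same_text("caf\xE9", "caf\xC3\xA9") compares the Latin-1
and UTF-8 spellings of the same word after decoding both.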
...
...
- perhaps a group of us should get together to discuss working on
one? I'd be happy to participate.
I would definitely encourage breaking the work up into smaller chunks.
IMHO "smaller is better" for Boost libraries; there have been a number
of occasions when I've discovered that a feature I want is hidden as an
internal component of a Boost library, and I've felt that it should
have been a stand-alone public entity.  So let's think about how this
work can be split up:
This seems like a very good approach.
...
- A charset_trait class.  I have started on this.  The missing piece is
a way to look up traits of character sets that are known at run-time;
input would be appreciated.
- Compile-time and run-time tagged strings.  The basics of this are
straightforward and done.
Not as easy if a "universal string" class is to be achieved. But we
can probably leave it out for now.
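
As a rough illustration of the two items above (a traits class plus
compile-time and run-time tagged strings), something along these lines
could work. Every name here (utf8_tag, charset_traits, charset_info,
find_charset, tagged_string, rt_tagged_string) is hypothetical; this is
not Phil's actual interface, and the run-time table is only one possible
answer to the lookup question.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical tag types, one per character set known at compile time.
struct utf8_tag {};
struct latin1_tag {};

// Compile-time traits, specialised per tag.
template <typename Tag> struct charset_traits;

template <> struct charset_traits<utf8_tag> {
    using unit_type = char;
    static constexpr bool variable_width = true;
    static const char* name() { return "UTF-8"; }
};

template <> struct charset_traits<latin1_tag> {
    using unit_type = char;
    static constexpr bool variable_width = false;
    static const char* name() { return "ISO-8859-1"; }
};

// Run-time description of a character set, for names that only
// become known while parsing (e.g. from a MIME header).
struct charset_info {
    const char* name;
    std::size_t unit_size;
    bool variable_width;
};

// Look a character set up by name; returns nullptr if it is unknown.
// The lookup is an exact, case-sensitive match to keep the sketch short.
inline const charset_info* find_charset(const std::string& name) {
    static const charset_info table[] = {
        {"UTF-8", 1, true},
        {"ISO-8859-1", 1, false},
    };
    for (const charset_info& c : table)
        if (name == c.name) return &c;
    return nullptr;
}

// Compile-time tagged string: the character set is part of the type,
// so mixing two encodings by accident is a compile error.
template <typename Tag>
class tagged_string {
public:
    using traits = charset_traits<Tag>;
    using storage = std::basic_string<typename traits::unit_type>;
    explicit tagged_string(storage s) : data_(std::move(s)) {}
    const storage& units() const { return data_; }
private:
    storage data_;
};

// Run-time tagged string: raw bytes plus a pointer to their description.
struct rt_tagged_string {
    std::string bytes;
    const charset_info* charset;
};
```

With this shape, tagged_string<utf8_tag> and tagged_string<latin1_tag>
are unrelated types, while rt_tagged_string carries whatever
find_charset returned for the name seen at run time.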
...
- Conversions.  My approach at present is to use iconv via a functor
that I wrote a while ago.  I believe iconv is widely available;
however, some implementations may support only a small set of character
sets.  Alternatives would be interesting.
I use ICU extensively; I have never used iconv.
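
For readers who have not seen it, a conversion functor built on iconv
might look roughly like the sketch below. This is not Phil's functor;
the class name iconv_converter is invented here. It assumes a POSIX
iconv whose input parameter is declared as char** (as in glibc; some
platforms declare it const char**, and some need linking with -liconv),
and that the implementation knows the two charset names passed in.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical functor: converts whole strings between two charsets.
class iconv_converter {
public:
    iconv_converter(const char* from, const char* to)
        : cd_(iconv_open(to, from)) {  // note: iconv_open takes (to, from)
        if (cd_ == (iconv_t)-1)
            throw std::runtime_error("iconv_open failed");
    }
    ~iconv_converter() { iconv_close(cd_); }
    iconv_converter(const iconv_converter&) = delete;
    iconv_converter& operator=(const iconv_converter&) = delete;

    std::string operator()(const std::string& input) const {
        // Reset the descriptor to its initial shift state.
        iconv(cd_, nullptr, nullptr, nullptr, nullptr);

        std::string in = input;  // iconv wants a non-const char*
        char* in_ptr = &in[0];
        std::size_t in_left = in.size();
        std::string out;
        char buf[256];

        while (in_left > 0) {
            char* out_ptr = buf;
            std::size_t out_left = sizeof buf;
            std::size_t rc = iconv(cd_, &in_ptr, &in_left, &out_ptr, &out_left);
            int err = errno;  // save before anything else can change it
            out.append(buf, sizeof buf - out_left);
            // E2BIG only means the output buffer filled up; loop again.
            if (rc == (std::size_t)-1 && err != E2BIG)
                throw std::runtime_error("iconv conversion failed");
        }

        // Flush any pending shift sequence for stateful encodings.
        char* out_ptr = buf;
        std::size_t out_left = sizeof buf;
        iconv(cd_, nullptr, nullptr, &out_ptr, &out_left);
        out.append(buf, sizeof buf - out_left);
        return out;
    }

private:
    iconv_t cd_;
};
```

Usage would be something like iconv_converter to_utf8("ISO-8859-1",
"UTF-8"); followed by to_utf8(some_latin1_string). An ICU-based functor
with the same call signature could sit behind the same interface.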
...
- Variable width iterators, including the issue that you raised above.
boost.iterator makes this job quite easy.
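
To show what a variable-width iterator has to do, here is a hand-written
sketch that walks a UTF-8 std::string one code point at a time. It uses
only the standard library so that it stands alone; the class name
utf8_code_point_iterator is invented, and the code assumes well-formed
UTF-8 (no validation). boost::iterator_facade could presumably remove
most of the boilerplate typedefs and operators below.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical read-only iterator over the code points of a UTF-8 string.
// Tagged as an input iterator because operator* returns by value.
class utf8_code_point_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    explicit utf8_code_point_iterator(std::string::const_iterator pos)
        : pos_(pos) {}

    // Decode the code point that starts at the current position.
    char32_t operator*() const {
        unsigned char lead = static_cast<unsigned char>(*pos_);
        std::size_t len = sequence_length(lead);
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(pos_[k]) & 0x3F);
        return cp;
    }

    // Advance by however many bytes the current sequence occupies.
    utf8_code_point_iterator& operator++() {
        pos_ += sequence_length(static_cast<unsigned char>(*pos_));
        return *this;
    }
    utf8_code_point_iterator operator++(int) {
        utf8_code_point_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return !(a == b);
    }

private:
    // Number of bytes in a sequence, derived from its lead byte.
    static std::size_t sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xE) return 3;
        return 4;
    }

    std::string::const_iterator pos_;
};
```

Counting characters then becomes std::distance over a pair of these
iterators built from s.begin() and s.end(), rather than s.size().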
...
- Interaction with locales, internationalisation, and system APIs.
I'm not an IOStreams expert, but I'm very used to working with the Windows API.
...
and no doubt more.  Thinking about the interfaces between these areas
and the user would be a good place to start.
There were also some interesting discussions about Unicode in the
past, though they didn't seem to reach any conclusion.
But very important concerns w.r.t. internationalization were raised in them.
...
Regards,
Phil.
Thanks Phil,
-- 
Felipe Magno de Almeida