
On Mon, Feb 25, 2008 at 6:06 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Felipe Magno de Almeida wrote:
On Mon, Feb 25, 2008 at 8:09 AM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
[snip]
I think emulating std::string doesn't work. It has a naive design based on the assumption of fixed-width encodings. I think that a tagged string is the best place to really start over with a string design and produce a string that is lean, rather than bloated.
I agree.
Hmmm. I hear what you're saying, but things that are too revolutionary don't get used because they're too different from what people are used to. I'd like to offer something that's close to a drop-in replacement for std::string that will let people painlessly upgrade their code to proper character set support.
I would very much use it, and I am not very concerned if some algorithms have to change. I'm now using ICU directly, and it is quite a PITA.
However, most of the work that I have done has been at a lower level and can be easily built upon to enable a new class with a different interface as well. So you can have your cake and eat it! Comments about both are welcome.
You could create a bloated_utf8 as a drop-in replacement for std::string, and at the same time discourage its use. :P
I think the string type should offer minimal manipulation facilities - either completely read-only or append as the only manipulation function.
I would like to have at least a modifiable string, but modifiable only through iterators (insert and erase). That should suffice for all my algorithm needs.
Try this: temporarily replace all your strings with list<character> and see what's missing.
I did (not *all*, but in very significant places). The first problem I hit was code unnecessarily requiring RandomAccessIterators, such as using operator+ instead of std::advance. Other places use std::string::size_type and operator[]. But I can say these are easily correctable.
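A minimal sketch of the kind of correction described above, assuming a hypothetical helper named take_prefix (it is not from anyone's code in this thread). Writing s.begin() + n would only compile for random-access iterators; std::advance works for both std::string and std::list<char>.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Hypothetical helper: copy the first n elements of any sequence container.
// It needs only forward iteration, so it accepts std::list<char> as well
// as std::string.
template <typename Container>
Container take_prefix(const Container& s, std::size_t n)
{
    if (n > s.size()) n = s.size();  // clamp so we never step past end()
    typename Container::const_iterator last = s.begin();
    // "s.begin() + n" would require RandomAccessIterators;
    // std::advance falls back to repeated ++ where it has to.
    std::advance(last, static_cast<std::ptrdiff_t>(n));
    return Container(s.begin(), last);
}

// Illustrative use with both container types.
inline void take_prefix_demo()
{
    std::string s = "hello";
    std::list<char> l(s.begin(), s.end());
    assert(take_prefix(s, 2).size() == 2);
    assert(take_prefix(l, 2).size() == 2);
}
```

On a random-access iterator std::advance is still constant time, so the generic form costs nothing for std::string.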
A string buffer type could be written as a mutable alternative, as is the design in Java and C#. However, I'm not sure how much of that interface is needed, either.
I'm unfamiliar with what Java and C# do, but my lower-level code (e.g. character_output_iterator) makes it simple to write e.g. UTF-8 into arbitrary memory.
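To illustrate the general idea of an encoding output iterator, here is a rough sketch. It is not the character_output_iterator mentioned above, whose interface is not shown in this thread; the name utf8_output_iterator, its base() member and the to_utf8 helper are all assumptions made for the example. It wraps any byte output iterator (a raw char*, a back_inserter, ...) and encodes each assigned code point as UTF-8. It does not validate its input (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adaptor: assigning a code point writes its UTF-8 bytes
// through the wrapped byte output iterator.
template <typename ByteOut>
class utf8_output_iterator {
public:
    typedef std::output_iterator_tag iterator_category;
    typedef void value_type;
    typedef std::ptrdiff_t difference_type;
    typedef void pointer;
    typedef void reference;

    explicit utf8_output_iterator(ByteOut out) : out_(out) {}

    utf8_output_iterator& operator=(char32_t cp) {
        if (cp < 0x80) {                       // 1 byte
            put(cp);
        } else if (cp < 0x800) {               // 2 bytes
            put(0xC0 | (cp >> 6));
            put(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            put(0xE0 | (cp >> 12));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            put(0xF0 | (cp >> 18));
            put(0x80 | ((cp >> 12) & 0x3F));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
        return *this;
    }

    // Dereference and increment are no-ops; all work happens on assignment.
    // Post-increment returns a reference so "*it++ = cp" advances this
    // object rather than a temporary copy.
    utf8_output_iterator& operator*()     { return *this; }
    utf8_output_iterator& operator++()    { return *this; }
    utf8_output_iterator& operator++(int) { return *this; }

    ByteOut base() const { return out_; }     // current write position

private:
    void put(char32_t b) {
        *out_ = static_cast<char>(static_cast<unsigned char>(b));
        ++out_;
    }
    ByteOut out_;
};

// Illustrative helper: encode a UTF-32 string into a std::string.
inline std::string to_utf8(const std::u32string& s) {
    std::string out;
    std::back_insert_iterator<std::string> sink = std::back_inserter(out);
    utf8_output_iterator<std::back_insert_iterator<std::string> > it(sink);
    for (std::size_t i = 0; i < s.size(); ++i) *it++ = s[i];
    return out;
}
```

The same adaptor writes into a plain buffer when instantiated as utf8_output_iterator<char*>; the caller is then responsible for making the buffer large enough.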
Good.
A modifiable iterator interface (with insert and erase) is, IMO, as concise and extensible as possible.
I'd love to have some empirical data on string usage.
I do some string manipulations on email. And it is usually better to do all manipulations in the codepage received, instead of converting back and forth.
One issue that I'm currently thinking about with this sort of usage is compile-time character set tagging vs. run-time character set tagging. In fact, I've been wondering whether there is some general pattern for providing both e.g.
    template <charset_t cset> void foo(int x);

and

    void foo(charset_t cset, int x);
I can say I won't be using compile-time tagged strings much. But I guess you could do:

    template <typename Char, typename Charset>
    struct compiletime_string;

    template <typename Char>
    struct string
    {
        template <typename Charset>
        string(compiletime_string<Char, Charset> const& s);
    };

And then you can have compile-time tagged strings and runtime tagged strings work together seamlessly.
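A compilable version of this idea might look like the sketch below. The tag types utf8_tag and latin1_tag, their name() members, and the data/charset fields are assumptions added for illustration; the runtime-tagged class is called runtime_string here only to avoid confusion with std::string.

```cpp
#include <cassert>
#include <string>

// Hypothetical charset tags; each one knows its own run-time name.
struct utf8_tag   { static const char* name() { return "UTF-8"; } };
struct latin1_tag { static const char* name() { return "ISO-8859-1"; } };

// Charset is part of the type: mismatches are caught at compile time.
template <typename Char, typename Charset>
struct compiletime_string {
    std::basic_string<Char> data;
};

// Charset is stored as a value: one type can hold any encoding.
template <typename Char>
struct runtime_string {
    std::basic_string<Char> data;
    std::string charset;

    // Converting constructor: the compile-time tag becomes a run-time name.
    template <typename Charset>
    runtime_string(const compiletime_string<Char, Charset>& s)
        : data(s.data), charset(Charset::name()) {}
};

// Illustrative use: a compile-time tagged string passed where a
// runtime-tagged one is expected.
inline std::string charset_of(const runtime_string<char>& s) {
    return s.charset;
}

inline void tagged_string_demo() {
    compiletime_string<char, latin1_tag> s = { "abc" };
    assert(charset_of(s) == "ISO-8859-1");  // implicit conversion
}
```

The conversion only goes one way in this sketch; going from a runtime tag back to a compile-time tag needs a run-time check of the stored name.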
You can obviously forward from the first to the second but that may lose some compile-time-constant optimisations; forwarding from the second to the first needs a horrible case statement. I was wondering about a macro that would define both.... any ideas anyone?
I guess a macro wouldn't be a very good idea. You can just do some ifs in the runtime-tagged function and forward to the compile-time function for the charsets where you have an optimized compile-time version. For all others, just execute a common function (based on iconv maybe), passing it the character set name. You could have a map from compile-time character set to C-string character set name.
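A sketch of that forwarding scheme, under stated assumptions: the enumerators of charset_t, the function name char_count, and the fallback char_count_generic are all invented for this example. The fallback here simply assumes a single-byte charset; it stands in for the common (perhaps iconv-based) function and is not a real implementation of one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum charset_t { utf8, latin1, koi8r };  // hypothetical charset identifiers

// Compile-time tagged version. Only charsets with an optimized
// implementation get a specialization; the primary template is
// deliberately left undefined.
template <charset_t cset>
std::size_t char_count(const std::string& bytes);

template <>
inline std::size_t char_count<latin1>(const std::string& bytes) {
    return bytes.size();                  // one byte per character
}

template <>
inline std::size_t char_count<utf8>(const std::string& bytes) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if ((b & 0xC0) != 0x80) ++n;      // skip continuation bytes
    }
    return n;
}

// Stand-in for the common fallback. ASSUMPTION: every charset that
// reaches this function is single-byte; a real version would consult
// a table or a conversion library instead.
inline std::size_t char_count_generic(charset_t /*cset*/,
                                      const std::string& bytes) {
    return bytes.size();
}

// Runtime tagged version: a few ifs forward to the compile-time
// functions, everything else goes to the generic path.
inline std::size_t char_count(charset_t cset, const std::string& bytes) {
    if (cset == utf8)   return char_count<utf8>(bytes);
    if (cset == latin1) return char_count<latin1>(bytes);
    return char_count_generic(cset, bytes);
}
```

Callers that know the charset statically can still call char_count<utf8>(s) directly and skip the branch.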
- What character sets are people interested in using (a) at the "edges" of their programs,
As many as possible. Theoretically, a program might have to deal with any and all encodings out there. Realistically, there's probably a dozen or two that are relevant. You'd need empirical data.
I have looked at the charsets in all my email, but the results are thrown off by the spam.
Unfortunately I need all the charsets supported by MIME.
Falling back using e.g. iconv() for the otherwise-unsupported ones is my plan.
That's good enough for me. [snip]
Cheers,
Phil.
Regards, -- Felipe Magno de Almeida