
Felipe Magno de Almeida wrote:
On Mon, Feb 25, 2008 at 8:09 AM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Phil Endecott wrote:
Things I'd appreciate feedback on: - What should the cs_string look like? Basically everywhere that std::string uses an integer position I have the choice of a character position, a unit position, or an iterator - or not providing that function.
I think emulating std::string doesn't work. It has a naive design based on the assumption of fixed-width encodings. I think that a tagged string is the best place to really start over with a string design and produce a string that is lean, rather than bloated.
I agree.
Hmmm. I hear what you're saying, but things that are too revolutionary don't get used because they're too different from what people are used to. I'd like to offer something that's close to a drop-in replacement for std::string that will let people painlessly upgrade their code to proper character set support. However, most of the work that I have done has been at a lower level and can be easily built upon to enable a new class with a different interface as well. So you can have your cake and eat it! Comments about both are welcome.
I think the string type should offer minimal manipulation facilities - either completely read-only or append as the only manipulation function.
I would like to have at least a modifiable string, but only through iterators (insert and erase). That should suffice for all my algorithm needs.
Try this: temporarily replace all your strings with list<character> and see what's missing.
A string buffer type could be written as a mutable alternative, as is the design in Java and C#. However, I'm not sure how much of that interface is needed, either.
I'm unfamiliar with what Java and C# do, but my lower-level code (e.g. character_output_iterator) makes it simple to write, for example, UTF-8 into arbitrary memory.
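As a rough illustration of writing UTF-8 into arbitrary memory through an output iterator, here is a minimal sketch. It is not the character_output_iterator from Phil's code: the class name utf8_output_iterator, its base() accessor and the choice of char32_t as the character type are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical adaptor (not Phil's character_output_iterator): accepts
// char32_t code points and writes their UTF-8 bytes to an underlying
// byte output iterator, which may be a raw char* into any buffer.
template <typename ByteOut>
class utf8_output_iterator {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit utf8_output_iterator(ByteOut out) : out_(out) {}

    // Assigning a code point encodes it as 1 to 4 bytes.
    utf8_output_iterator& operator=(char32_t c) {
        // Surrogates and values above U+10FFFF are not valid code points.
        assert(c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF));
        if (c < 0x80) {
            put(c);
        } else if (c < 0x800) {
            put(0xC0 | (c >> 6));
            put(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            put(0xE0 | (c >> 12));
            put(0x80 | ((c >> 6) & 0x3F));
            put(0x80 | (c & 0x3F));
        } else {
            put(0xF0 | (c >> 18));
            put(0x80 | ((c >> 12) & 0x3F));
            put(0x80 | ((c >> 6) & 0x3F));
            put(0x80 | (c & 0x3F));
        }
        return *this;
    }

    // Dereference and increment are no-ops; assignment does the work.
    // Post-increment returns a reference so that a raw-pointer target
    // keeps advancing in the original iterator.
    utf8_output_iterator& operator*() { return *this; }
    utf8_output_iterator& operator++() { return *this; }
    utf8_output_iterator& operator++(int) { return *this; }

    // Current position of the underlying byte iterator.
    ByteOut base() const { return out_; }

private:
    void put(char32_t byte) {
        *out_ = static_cast<char>(static_cast<unsigned char>(byte));
        ++out_;
    }
    ByteOut out_;
};
```

With a char buffer buf, constructing utf8_output_iterator<char*> it(buf) and then writing *it++ = c for each code point fills the buffer, and it.base() points one past the last byte written. The caller is responsible for making the buffer large enough (at most 4 bytes per character).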
A modifiable iterator interface (with insert and erase) is, IMO, as concise and extensible as possible.
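To make the list<character> experiment concrete, here is a small sketch of an edit expressed purely through iterator-based insert and erase. The character and text aliases and the replace_char function are hypothetical names chosen for this example; nothing here comes from the proposed library.

```cpp
#include <list>

// Assumption for this sketch: a "character" is a single code point.
using character = char32_t;
using text = std::list<character>;

// Replace every occurrence of `from` with the sequence `to`, using only
// iterator-based insert and erase (no integer positions anywhere).
// `to` must be a different object from `s`.
void replace_char(text& s, character from, const text& to) {
    for (auto it = s.begin(); it != s.end(); ) {
        if (*it == from) {
            s.insert(it, to.begin(), to.end());  // inserts before `it`
            it = s.erase(it);                    // returns the next element
        } else {
            ++it;
        }
    }
}
```

Because the inserted characters land before the current iterator and erase returns the element after it, the replacement text is never rescanned, so replacing '&' with "&amp;" terminates as expected.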
I'd love to have some empirical data on string usage.
I do some string manipulations on email. And it is usually better to do all manipulations in the codepage received, instead of converting back and forth.
One issue that I'm currently thinking about with this sort of usage is compile-time character set tagging vs. run-time character set tagging. In fact, I've been wondering whether there is some general pattern for providing both, e.g.

    template <charset_t cset> void foo(int x);

and

    void foo(charset_t cset, int x);

You can obviously forward from the first to the second, but that may lose some compile-time-constant optimisations; forwarding from the second to the first needs a horrible case statement. I was wondering about a macro that would define both... Any ideas, anyone?
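One possible shape for this, offered only as a sketch: keep the real implementation in the template, and generate the run-time overload's case statement from a single macro list of character sets. The three-member charset_t, the CHARSET_LIST macro and the max_units function below are all invented for illustration and are not taken from Phil's code.

```cpp
#include <cassert>

// Single list of character sets (hypothetical); both the enum and the
// run-time dispatch switch are generated from it.
#define CHARSET_LIST(X) X(ascii) X(utf8) X(utf16)

enum class charset_t {
#define DECLARE_ENUMERATOR(name) name,
    CHARSET_LIST(DECLARE_ENUMERATOR)
#undef DECLARE_ENUMERATOR
};

// Compile-time tagged version: cset is a constant here, so the
// comparisons can be folded away by the compiler.
// Returns the maximum number of code units needed for n_chars characters.
template <charset_t cset>
int max_units(int n_chars) {
    return n_chars * (cset == charset_t::utf8  ? 4
                    : cset == charset_t::utf16 ? 2
                                               : 1);
}

// Run-time tagged version: forwards to the template. The "horrible case
// statement" is produced by the macro list instead of written by hand.
int max_units(charset_t cset, int n_chars) {
    switch (cset) {
#define DISPATCH_CASE(name) \
        case charset_t::name: return max_units<charset_t::name>(n_chars);
        CHARSET_LIST(DISPATCH_CASE)
#undef DISPATCH_CASE
    }
    assert(false && "unknown charset");
    return 0;
}
```

Code that knows its character set statically calls max_units<charset_t::utf8>(n), while code that only learns it at run time (for example from a MIME header) calls max_units(cs, n). Adding a character set then means touching only CHARSET_LIST and the template body.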
- What character sets are people interested in using (a) at the "edges" of their programs,
As many as possible. Theoretically, a program might have to deal with any and all encodings out there. Realistically, there are probably a dozen or two that are relevant. You'd need empirical data.
I have looked at the charsets in all my email, but the results are skewed by the spam.
Unfortunately, I need all of those supported by MIME.
Falling back using e.g. iconv() for the otherwise-unsupported ones is my plan. I'm unlikely to have the energy to write code for more than a couple of the exotic sets myself. If anyone would like to help, please get in touch.
and (b) in the "core"?
ASCII, UTF-8 and UTF-16.
ISO-8859-1 ?
Cheers, Phil.