
On Tue, Jan 25, 2011 at 5:34 AM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Mon, 24 Jan 2011 19:28:50 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Mon, Jan 24, 2011 at 3:04 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
[...] I'm with you here, but to be fair to Chad, you could add to that list a string of UTF-8-encoded characters. If a string contains data in a particular encoding, there's value in being able to track whether it's validly encoded. It may well be that a std::string is part of another type, or that there's some encoding wrapper that lets you see it as UTF-8, in the same way an external iterator lets you look at chars.
Sure, however I personally don't see the value of making the encoding an intrinsic property of a string object. [...]
Then I think we have different purposes, and I'll absent myself from this part of the discussion after this reply.
I don't think we have different purposes; I just think we're discussing two different levels. I, for one, want a string that is efficient and lightweight to use. Whether it encodes the data underneath as UTF-32 for convenience is of little consequence to me at that level. However, as I have already described in a different message, "viewing" a string in a given encoding is much more scalable as far as design is concerned, because it allows others to extend the view mechanism for each encoding being supported. This lets you write algorithms and views that adapt existing strings (std::string, QString, CString, std::wstring, <insert string implementation here>) and operate on them in a generic manner. The hypothetical `boost::string` can have implicit conversion constructors (?) that deal with the supported strings, which means you can look at that `boost::string` through the same views.
Before I go, I'll note in passing that I've started on the modifications to the UTF types, and I found that it made sense to omit many of the mutating functions from utf8_t and utf16_t, at least the ones that operate on anything other than the end of the string.
Actually, I think if you use the immutable string internally in your UTF-* "views", then you may be able to omit the mutating parts entirely, even those dealing with the end of the string. ;)
Are you saying that you try it as UTF-8, and if it doesn't decode you then try UTF-32 to see if that works? Because the same string couldn't validly be both. Or are you saying that the string has some underlying encoding but something lets it be viewed in other encodings? For example, it might actually be EUC, but external iterators let you view it as UTF-8 or UTF-16 or UTF-32, interpreting on the fly.
I'm saying the string could contain whatever it contains (which is largely of little consequence) but that you can give a "view" of the string as UTF-8 if it's valid UTF-8, or UTF-32 if it's valid UTF-32. [...]
For what it's worth, that's the basic concept that I've adopted for the utf*_t modifications. The utf*_t gives only a code-point iterator (you can also get a char/char16_t/char32_t iterator from the type returned by the encoded() function). I plan to write a separate character iterator that will accept code-points and return actual Unicode characters.
I do suggest, however, that you implement/design algorithms first and build your iterators around the algorithms. I know that might sound counter-intuitive, but having concrete algorithms in mind will let you delineate the proper (or more effective) abstractions better than thinking of the iterators in isolation. -- Dean Michael Berris about.me/deanberris