[string] --> [text] ?

Gregory Crosswhite

27 Jan 2011

10:47 p.m.

Hey everyone,

Since there has been a lot of talk about what the name of a new immutable string class should be, may I toss the name "boost::text" into the ring?

The advantage of this name is that it explicitly conveys what it is meant for: working with human-readable text encoded in some implementation-specific form. The name "string" would then continue to have its current interpretation as a string of contiguous 8-bit chars.

It has also been suggested that different classes be created for different UTF encodings. I propose that the internal encoding of boost::text be an implementation (and potentially platform-specific) detail. Since at the end of a series of manipulations with the rope-like data structure one will have to do a final transformation to convert the text into a string of bytes anyway, that provides a natural point at which the desired encoding of the string of bytes can be specified.

That is, given a boost::text object "t", one could convert it into a UTF-8 string by calling "t.utf8_c_str()", a UTF-16 string by calling "t.utf16_c_str()", and so on, depending on what the underlying API is expecting. Some of these calls might require recoding the text to a different encoding, so the internal encoding of boost::text could be optimized to whatever is most likely to be needed on that platform, so that it is least likely to need recoding.

Alternatively, the encoding could be specified as a parameter to the constructor and be carried around as a runtime parameter, since nobody needs to know what it is until the final encoding of the string.

Thoughts?

Cheers,
Greg
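Editorial sketch of the proposal above, not code from the thread: the class `text` and the members `utf8_str()` and `utf16_str()` are hypothetical stand-ins for boost::text, "t.utf8_c_str()" and "t.utf16_c_str()". It assumes the internal form is a flat sequence of code points (the message mentions a rope-like structure, which is omitted here) and returns owning strings rather than pointers, which sidesteps the lifetime question a c_str-style call would raise.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical sketch only: `text` is not an existing Boost class.
class text {
public:
    // Assumption: the caller supplies valid Unicode code points
    // (no surrogates, nothing above U+10FFFF); nothing is validated.
    explicit text(std::u32string code_points) : data_(std::move(code_points)) {}

    // Encode to UTF-8 only at the point where bytes are needed.
    std::string utf8_str() const {
        std::string out;
        for (char32_t cp : data_) {
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

    // Encode to UTF-16; code points above U+FFFF become surrogate pairs.
    std::u16string utf16_str() const {
        std::u16string out;
        for (char32_t cp : data_) {
            if (cp < 0x10000) {
                out += static_cast<char16_t>(cp);
            } else {
                const char32_t v = cp - 0x10000;
                out += static_cast<char16_t>(0xD800 + (v >> 10));
                out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
            }
        }
        return out;
    }

private:
    std::u32string data_;  // internal encoding, invisible to callers
};
```

A caller would write `text t(U"...")` once and pick `t.utf8_str()` or `t.utf16_str()` depending on what the receiving API expects; swapping the private member for another representation would not change that calling code.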


Dean Michael Berris

28 Jan 2011

7:56 a.m.


On Fri, Jan 28, 2011 at 6:47 AM, Gregory Crosswhite <gcross@phys.washington.edu> wrote:

...

Since there has been a lot of talk about what the name of a new immutable string class should be, may I toss the name "boost::text" into the ring?

Hmm... Unfortunately it denotes the wrong thing for my case.

...

The advantage of this name is that it explicitly conveys what it is meant for: working with human-readable text encoded in some implementation-specific form. The name "string" would then continue to have its current interpretation as a string of contiguous 8-bit chars.

Right, so then I can keep saying 'string' and meaning it in the computer science context. :)

...

It has also been suggested that different classes be created for different UTF encodings. I propose that boost::text have the internal encoding be an implementation (and potentially platform-specific) detail. Since at the end of a series of manipulations with the rope-like data structure one will have to do a final transformation to convert the text into a string of bytes anyway, that provides a natural point at which the desired encoding of the string of bytes can be specified.

This was the point for my 'view' template idea. That the view would give some semblance of encoding appropriately.

...

That is, given a boost::text object "t", one could convert it into a UTF-8 string by calling "t.utf8_c_str()", a UTF-16 string by calling "t.utf16_c_str()", and so on, depending on what the underlying API is expecting.

And then you run into the problem of having a ton of member functions that do encapsulate the logic instead of having multiple types to do the conversion instead. The member functions idea will not scale appropriately and would be a hell to manage.

...

Some of these calls might require recoding the text to a different encoding, so the internal encoding of boost::text could be optimized to whatever is most likely to be needed on that platform so that it is least likely to need recoding. Alternatively, the encoding could be specified as a parameter to the constructor and be carried around as a runtime parameter since nobody needs to know what it is until the final encoding of the string.

Hmmm... So why isn't boost::text just a typedef to `view<some_encoding>`?

And more to the point, why do you need to make the final encoding a runtime choice when it can easily be made a compile-time choice? Even if you needed to switch appropriately you can always linearize it into a character buffer at some point in time.

--
Dean Michael Berris
about.me/deanberris

Sebastian Redl

10:02 a.m.


On 28.01.2011 08:56, Dean Michael Berris wrote:

...

On Fri, Jan 28, 2011 at 6:47 AM, Gregory Crosswhite <gcross@phys.washington.edu> wrote:

...
Since there has been a lot of talk about what the name of a new immutable string class should be, may I toss the name "boost::text" into the ring?

Hmm... Unfortunately it denotes the wrong thing for my case.

That's why "text" is the proposed name for the other case. +1 from me.

This was the point for my 'view' template idea. That the view would give some semblance of encoding appropriately.

I really don't like the name "view". It has strong connotations of non-ownership. It's not meaningful for the actual purpose of a text type: storing text. A text type should store text, not provide a view on a raw sequence of bytes. A view<some_encoding> would be something I would look for if I wanted to get the bytes that make up a text in some_encoding. Not something I would look for if I wanted to store the text.

Calling a text type "view<utf_8>" feels very much to me like calling int "view<little_endian_32_bit>". As I said before, encoding is a property of interfacing with things external to my code. 3rd party libraries, files, network protocols.

...

...
That is, given a boost::text object "t", one could convert it into a UTF-8 string by calling "t.utf8_c_str()", a UTF-16 string by calling "t.utf16_c_str()", and so on, depending on what the underlying API is expecting.

And then you run into the problem of having a ton of member functions that do encapsulate the logic instead of having multiple types to do the conversion instead. The member functions idea will not scale appropriately and would be a hell to manage.

True. How about t.c_str<desired_encoding>()? Put the actual logic for the conversion into the encoding type.
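Editorial sketch of the "t.c_str<desired_encoding>()" suggestion, not code from the thread: one member template replaces the per-encoding members, and each encoding type carries its own conversion logic. The names `utf8`, `utf16`, `string_type`, `append` and `str` are all hypothetical, and the member returns an owning string instead of a C-string pointer.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical encoding types: each one knows how to append a single
// code point to its own string type. Input is assumed to be valid.
struct utf8 {
    using string_type = std::string;
    static void append(string_type& out, char32_t cp) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
};

struct utf16 {
    using string_type = std::u16string;
    static void append(string_type& out, char32_t cp) {
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            const char32_t v = cp - 0x10000;
            out += static_cast<char16_t>(0xD800 + (v >> 10));
            out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
        }
    }
};

// Hypothetical text type: it has no per-encoding members at all.
class text {
public:
    explicit text(std::u32string code_points) : data_(std::move(code_points)) {}

    // The encoding is a compile-time parameter of the call,
    // and the conversion itself lives in the Encoding type.
    template <class Encoding>
    typename Encoding::string_type str() const {
        typename Encoding::string_type out;
        for (char32_t cp : data_) Encoding::append(out, cp);
        return out;
    }

private:
    std::u32string data_;
};
```

Under this arrangement, supporting another encoding means adding one more struct with `string_type` and `append`; the `text` class itself stays unchanged, which addresses the scaling concern about a growing list of member functions.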

...

...
Some of these calls might require recoding the text to a different encoding, so the internal encoding of boost::text could be optimized to whatever is most likely to be needed on that platform so that it is least likely to need recoding. Alternatively, the encoding could be specified as a parameter to the constructor and be carried around as a runtime parameter since nobody needs to know what it is until the final encoding of the string.

Hmmm... So why isn't boost::text just a typedef to `view<some_encoding>`?

boost::text should store text. The encoding of the underlying bytes in memory shouldn't matter so much.

And more to the point, why do you need to make the final encoding a runtime choice when it can easily be made a compile-time choice?

Various situations may be most efficient with different encodings. If the text type hides the actual encoding from the user and can switch at runtime, it can adapt to the situation.

Sebastian
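Editorial sketch of the runtime-encoding idea discussed above, not code from the thread: a text type that remembers at runtime which encoding it currently holds and recodes only when a caller asks for a different one. The enum `encoding`, the class `text` and its members are hypothetical, only two encodings are modelled, and the UTF-16 input is assumed to be well formed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical: the set of internal encodings the type can switch between.
enum class encoding { utf16, utf32 };

class text {
public:
    // The constructor records the encoding as a runtime value.
    explicit text(std::u16string s) : enc_(encoding::utf16), u16_(std::move(s)) {}
    explicit text(std::u32string s) : enc_(encoding::utf32), u32_(std::move(s)) {}

    encoding internal_encoding() const { return enc_; }

    // Returns UTF-32; recodes only if the stored form is something else.
    std::u32string utf32_str() const {
        if (enc_ == encoding::utf32) return u32_;  // already in the right form
        std::u32string out;
        for (std::size_t i = 0; i < u16_.size(); ++i) {
            const char32_t u = u16_[i];
            if (u >= 0xD800 && u <= 0xDBFF && i + 1 < u16_.size()) {
                // High surrogate: combine with the following low surrogate.
                const char32_t lo = u16_[++i];
                out += static_cast<char32_t>(
                    0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
            } else {
                out += u;
            }
        }
        return out;
    }

private:
    encoding enc_;
    std::u16string u16_;  // used when enc_ == encoding::utf16
    std::u32string u32_;  // used when enc_ == encoding::utf32
};
```

The trade-off raised in the thread shows up directly: the caller never names the internal encoding, but every conversion pays for a runtime branch that a compile-time encoding parameter would avoid.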

Matus Chochlik

10:21 a.m.


On Fri, Jan 28, 2011 at 11:02 AM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:

...

On 28.01.2011 08:56, Dean Michael Berris wrote:

...
On Fri, Jan 28, 2011 at 6:47 AM, Gregory Crosswhite <gcross@phys.washington.edu> wrote:

...
Since there has been a lot of talk about what the name of a new immutable string class should be, may I toss the name "boost::text" into the ring?

Hmm... Unfortunately it denotes the wrong thing for my case.

That's why "text" is the proposed name for the other case. +1 from me.

...
This was the point for my 'view' template idea. That the view would give some semblance of encoding appropriately.

I really don't like the name "view". It has strong connotations of non-ownership. It's not meaningful for the actual purpose of a text type: storing text. A text type should store text, not provide a view on a raw sequence of bytes. A view<some_encoding> would be something I would look for if I wanted to get the bytes that make up a text in some_encoding. Not something I would look for if I wanted to store the text.

Calling a text type "view<utf_8>" feels very much to me like calling int "view<little_endian_32_bit>".

*Exactly*

...

As I said before, encoding is a property of interfacing with things external to my code. 3rd party libraries, files, network protocols.

...
...
That is, given a boost::text object "t", one could convert it into a UTF-8 string by calling "t.utf8_c_str()", a UTF-16 string by calling "t.utf16_c_str()", and so on, depending on what the underlying API is expecting.

And then you run into the problem of having a ton of member functions that do encapsulate the logic instead of having multiple types to do the conversion instead. The member functions idea will not scale appropriately and would be a hell to manage.

True. How about t.c_str<desired_encoding>()? Put the actual logic for the conversion into the encoding type.

+1, although I would not be against c_str<encoding_tag>(my_text) if someone shows that this is better than the member function.

NOTE.1: But I would like to see a special encoding tag for the native encoding, i.e. something like native_char_encoding, native_wchar_encoding or platform_encoding_tag<char>/platform_encoding_tag<wchar_t>.

NOTE.2: UTF-8 is assumed by default.
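Editorial sketch of the free-function form with a native encoding tag, not code from the thread. `encode`, `utf16_tag` and `native_wchar_tag` are hypothetical names, `text` is reduced to an alias for a code-point string, and the native tag rests on a stated assumption: a 16-bit wchar_t holds UTF-16 and a wider wchar_t holds UTF-32, which matches common platforms but is not guaranteed by the C++ standard.

```cpp
#include <cassert>
#include <string>

// Stand-in for the proposed text type (assumption): plain code points.
using text = std::u32string;

// Hypothetical tag for a fixed, named encoding.
struct utf16_tag {
    using char_type = char16_t;
    static void append(std::u16string& out, char32_t cp) {
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            const char32_t v = cp - 0x10000;
            out += static_cast<char16_t>(0xD800 + (v >> 10));
            out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
        }
    }
};

// Hypothetical tag for "whatever wchar_t means on this platform".
// Assumption: 16-bit wchar_t is UTF-16, wider wchar_t is UTF-32.
struct native_wchar_tag {
    using char_type = wchar_t;
    static void append(std::wstring& out, char32_t cp) {
        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            out += static_cast<wchar_t>(cp);
        } else {
            const char32_t v = cp - 0x10000;
            out += static_cast<wchar_t>(0xD800 + (v >> 10));
            out += static_cast<wchar_t>(0xDC00 + (v & 0x3FF));
        }
    }
};

// Free-function counterpart of the member template: the tag is the
// explicit template argument, the text is an ordinary argument.
template <class Tag>
std::basic_string<typename Tag::char_type> encode(const text& t) {
    std::basic_string<typename Tag::char_type> out;
    for (char32_t cp : t) Tag::append(out, cp);
    return out;
}
```

A call then reads `encode<native_wchar_tag>(my_text)`; because the function is not a member, new tags can be added without touching the text type, and calls inside templates do not need the `template` disambiguator.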

...

boost::text should store text. The encoding of the underlying bytes in memory shouldn't matter so much.

Yes, I basically don't care what the internal encoding of the string is if the interface 'plays' with Unicode/UTF-8.

[snip/]

Matus
