
Dear list, following the whole string encoding discussion I would like to make some suggestions.
From the whole debate it is becoming clear that an instant switch from the encoding-agnostic/platform-native std::string to a UTF-8-encoded std::string is not likely to happen.
Then it was proposed that we create a utf8_t string type that would be used *together* (for all eternity) with the standard basic_string<>. While I see the advantages here, I have (as I already said elsewhere) the following problem with this approach: using a name like utf8_t, u8string, string_utf8, etc. suggests, at least to me (and I've consulted this off-list with several people), that UTF-8 is still something special, and IMO it also sends the message that it is OK to remain forever with the various encodings and with std::string as it is today. We should *IMO* endorse the opposite.

My suggestion is the following: let us create a class called boost::string that will have all the properties that a string-handling class in 2011+ A.D. should have; basically, what std::string should have been. Then there are two alternatives:

a) When all the zillions of lines of legacy code in FORTRAN, COBOL, BASIC, LOGO, etc. :) are fixed / ported / abandoned, when UTF-8 becomes a true standard for text encoding widely accepted by the whole IT industry and markets, and when all the issues that prevent us from doing the transition now are resolved, this string becomes the standard, like many other things from Boost in the past, and replaces the current std::string.

b) As some (having much more insight into how the standardization committee works than I do) have pointed out, it will never become a true standard. But with Boost's influence it at least becomes a de-facto standard for strings and it is (hopefully) adopted by the libraries that currently feel the need (with good reason) to invent string classes themselves.

Also, I've uploaded into the vault the file string_proposal.zip containing my (naive and inexpert) idea of what the interface for boost::string and the related classes could look like (it still needs some work and it is completely un-optimized, un-beautified, etc.).
/me ducks and covers :)

The idea is: let std::string/wstring be platform-specifically encoded as it is now, but also let boost::string handle the conversions as transparently as possible, so that if the standard adopts it, std::string would become a synonym for boost::string. It is only partially implemented and there are two examples showing how things could work, but the real UTF-8 validation, transcoding, error handling, etc. is of course missing. Remember, at this point it is aimed at the design of the interfaces.

If you have the time, have a look, and if my suggestions and/or the code look completely wrong, please feel free to slash them to pieces :) and, if you feel up to it, propose something better. If this, or something completely different and much better that comes out of it, is agreed upon, we could set up a dedicated git repository for Boost.String and maybe try whether the newly suggested collaborative development in per-boost-component repositories really works. :) If some of the people who are skilled with Unicode would join or lead the effort, it would be awesome.

Best, Matus

On Fri, Jan 21, 2011 at 7:25 PM, Matus Chochlik <chochlik@gmail.com> wrote:
Dear list,
following the whole string encoding discussion I would like to make some suggestions.
[snip]
Let us create a class called boost::string that will have all the properties that a string handling class in 2011+ A.D. should have, basically what std::string should have been.
+1 [snip]
Also, I've uploaded into the vault the file string_proposal.zip containing my (naive and inexpert) idea of what the interface for boost::string and the related classes could look like (it still needs some work and it is completely un-optimized, un-beautified, etc.).
/me ducks and covers :)
Maybe if you had a publicly available Git repository -- maybe on GitHub -- we'd have a better discussion going?

Mostly I'm interested in seeing a string class that is:

1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it.

2. Has real value semantics. This means, once you've copied it, that's really copied. No funky copy-on-write reference-counting mumbo-jumbo.

3. Has all the algorithms that apply to it defined externally.

4. Looks like a real STL container, except the iterator type is smarter than your average iterator.

Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
The idea is: let std::string/wstring be platform-specifically encoded as it is now, but also let boost::string handle the conversions as transparently as possible, so that if the standard adopts it, std::string would become a synonym for boost::string.
Hmmm... I'd much rather have a string class that works first and can later be decorated to be assumed/verified to be in a given encoding. [snip]
If this or something completely different and much better that comes out of it, will be agreed upon, we could set up a dedicated git repository for Boost.String and maybe try if the new suggested collaborative development in per-boost-component repositories really works. :) If some of the people that are skilled with unicode would join or lead the effort it would be awesome.
+1 -- collaborative development FTW. :)

I think strings are different from the encoding they're interpreted as. Let's fix the problem of a string data structure first, then tack on encoding/decoding as something that depends on the string abstraction.

HTH -- Dean Michael Berris about.me/deanberris

On 1/21/2011 7:07 PM, Dean Michael Berris wrote:
Mostly I'm interested in seeing a string class that is:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it.
2. Has real value semantics. This means, once you've copied it, that's really copied. No funky copy-on-write reference-counting mumbo-jumbo.
Uh, if the string is immutable, then two strings can transparently share the same data. There is no "write" in "copy-on-write". That's the definition of immutable. :-) -- Eric Niebler BoostPro Computing http://www.boostpro.com

On Fri, Jan 21, 2011 at 8:37 PM, Eric Niebler <eric@boostpro.com> wrote:
On 1/21/2011 7:07 PM, Dean Michael Berris wrote:
Mostly I'm interested in seeing a string class that is:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it.
2. Has real value semantics. This means, once you've copied it, that's really copied. No funky copy-on-write reference-counting mumbo-jumbo.
Uh, if the string is immutable, then two strings can transparently share the same data. There is no "write" in "copy-on-write". That's the definition of immutable. :-)
Ha! You're right. :D Now if you can make this work without reference counting, that'd be perfect. :D -- Dean Michael Berris about.me/deanberris

On 1/21/2011 7:45 PM, Dean Michael Berris wrote:
On Fri, Jan 21, 2011 at 8:37 PM, Eric Niebler <eric@boostpro.com> wrote:
On 1/21/2011 7:07 PM, Dean Michael Berris wrote:
Mostly I'm interested in seeing a string class that is:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it.
2. Has real value semantics. This means, once you've copied it, that's really copied. No funky copy-on-write reference-counting mumbo-jumbo.
Uh, if the string is immutable, then two strings can transparently share the same data. There is no "write" in "copy-on-write". That's the definition of immutable. :-)
Ha! You're right. :D Now if you can make this work without reference counting, that'd be perfect. :D
Why? Immutable data is exactly the situation (and the only situation IMO) when you'd want to share data. The potential savings, both in time and space, are huge. What's wrong with a ref-counted string impl? -- Eric Niebler BoostPro Computing http://www.boostpro.com

On Fri, Jan 21, 2011 at 9:23 PM, Eric Niebler <eric@boostpro.com> wrote:
On 1/21/2011 7:45 PM, Dean Michael Berris wrote:
On Fri, Jan 21, 2011 at 8:37 PM, Eric Niebler <eric@boostpro.com> wrote:
Uh, if the string is immutable, then two strings can transparently share the same data. There is no "write" in "copy-on-write". That's the definition of immutable. :-)
Ha! You're right. :D Now if you can make this work without reference counting, that'd be perfect. :D
Why? Immutable data is exactly the situation (and the only situation IMO) when you'd want to share data. The potential savings, both in time and space, are huge. What's wrong with a ref-counted string impl?
Good question. Here are some reasons, off the top of my head, why a ref-counted string implementation might not be desirable:

1. The overhead of maintaining that table of reference counts may be significant if you have a lot of strings created in your application. The only way I know this would be implemented is with some static intrusive container of these string chunks. Notice I say "created", not "copied around", because the cost of copying whole strings around in memory would dwarf the cost of having that static intrusive container for string chunks.

2. In the case that you do have strings copied around, consider multi-threaded applications where you have different threads running on different cores. Although there won't be any mutations, updating a reference count that's shared across cores will cause some "unnecessary" cache ping-pong across cores. You won't have this if each thread were just operating on a copy of the data and not needing to update reference counts.

3. The reference count will almost always be implemented with atomic counters to be efficient. In cases where the hardware doesn't support these (in embedded environments), the reference count would be implemented with a spin-lock if the platform supports threading, OS-level semaphores, etc. -- none of which would be necessary if you just copied the string to begin with.

For the majority of cases an interning strategy, so that strings are able to share the same storage, would be a good thing. Maybe turning off the reference-counting "default" might be an option as well. Of course I'm just listing the reasons why a ref-counted implementation wouldn't be desirable -- in the majority of cases, I'd say reference counting may be a good thing. ;)

-- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote:
1. The overhead of maintaining that table of reference counts may be significant if you have a lot of strings created in your application.
No table is needed; the reference count is just placed before the string data in the allocated memory.
3. The reference count almost always will be implemented with atomic counters to be efficient. In cases where the hardware doesn't support these (in embedded environments) ...
Copying the string requires a memory allocation, which also grabs a lock. You're not likely to have a lock-free allocator if you don't have atomic operations.
2. In the case that you do have strings copied around, consider the case in multi-threaded applications where you have different threads running on different cores.
Copying the string data outweighs the reference count update because of the allocation.

On Sat, Jan 22, 2011 at 12:19 AM, Peter Dimov <pdimov@pdimov.com> wrote:
Dean Michael Berris wrote:
1. The overhead of maintaining that table of reference counts may be significant if you have a lot of strings created in your application.
No table is needed; the reference count is just placed before the string data in the allocated memory.
Interesting. That can work, although you'll run into all sorts of load/store issues and alignment requirements. Of course that's much like anything that needs to be shared across threads, so it's not that bad.
3. The reference count almost always will be implemented with atomic counters to be efficient. In cases where the hardware doesn't support these (in embedded environments)
...
Copying the string requires a memory allocation, which also grabs a lock. You're not likely to have a lock-free allocator if you don't have atomic operations.
Yeah, I agree. So never mind that limitation, it's going to be a limitation anyway with anything that has to be synchronized. :)
2. In the case that you do have strings copied around, consider the case in multi-threaded applications where you have different threads running on different cores.
Copying the string data outweighs the reference count update because of the allocation.
Right. So I guess I can be convinced that ref counting in strings would be just fine. I can't wait to get that so that I can make cpp-netlib just use these smart strings throughout. :D -- Dean Michael Berris about.me/deanberris

2. In the case that you do have strings copied around, consider the case in multi-threaded applications where you have different threads running on different cores.
Copying the string data outweighs the reference count update because of the allocation.
Right.
So I guess I can be convinced that ref counting in strings would be just fine.
I can't wait to get that so that I can make cpp-netlib just use these smart strings throughout. :D
If it's just about a std::string-compatible class implemented the way you like it (COW, value semantics, reference counting, plain old data copying, etc.), you might want to look at flex_string, a policy-based std::string equivalent (here: http://loki-lib.svn.sourceforge.net/viewvc/loki-lib/trunk/include/loki/flex/ ; it's also part of Wave).

Regards Hartmut --------------- http://boost-spirit.com

On Sat, Jan 22, 2011 at 12:52 AM, Hartmut Kaiser <hartmut.kaiser@gmail.com> wrote:
I can't wait to get that so that I can make cpp-netlib just use these smart strings throughout. :D
If it's just for a std::string compatible class being implemented the way you like it (COW, value semantics, reference counting, or plain old data copying, etc.), you might want to look at flex_string, a policy based std::string equivalent (here: http://loki-lib.svn.sourceforge.net/viewvc/loki-lib/trunk/include/loki/flex/ , also part of wave).
I'm not sure though whether that string is immutable. What I really want is something like what D has for strings, which are immutable with true value semantics -- and are thought of as ranges (which, last time I checked, are also UTF-8 encoded by default or something like that). It's not just about being std::string-compatible; defining lazy operations on it and doing things *right* as a string data type is what I'm interested in.

Thanks for the pointer though, I just might use flex_string in cpp-netlib. :D

-- Dean Michael Berris about.me/deanberris

At Fri, 21 Jan 2011 20:07:51 +0800, Dean Michael Berris wrote:
On Fri, Jan 21, 2011 at 7:25 PM, Matus Chochlik <chochlik@gmail.com> wrote:
Dear list,
following the whole string encoding discussion I would like to make some suggestions.
[snip]
Let us create a class called boost::string that will have all the properties that a string handling class in 2011+ A.D. should have, basically what std::string should have been.
+1
[snip]
Also, I've uploaded into the vault the file string_proposal.zip containing my (naive and inexpert) idea of what the interface for boost::string and the related classes could look like (it still needs some work and it is completely un-optimized, un-beautified, etc.).
/me ducks and covers :)
Maybe if you had a publicly available Git repository -- maybe on GitHub -- we'd have a better discussion going?
Mostly I'm interested in seeing a string class that is:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it.
Do you want to prevent

1. wholesale mutation, such as

    x = y
    x += y

or just

2. per-char mutation, such as

    x[10] = 'a'

?

Eliminating #2 does a lot for implementation flexibility (e.g. allowing refcounts or GC to be used cleanly), and can be useful for thread safety if there's no "small string optimization," because the buffers holding the chars are truly immutable.

However, preventing #1 is a more serious matter...
2. Has real value semantics. This means, once you've copied it, that's really copied. No funky copy-on-write reference-counting mumbo-jumbo.
I guess you're talking about just per-char mutation, then, because value semantics implies assignability.
3. Has all the algorithms that apply to it defined externally.
4. Looks like a real STL container except the iterator type is smarter than your average iterator.
Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
What does it iterate over? chars? code points? characters? Something else? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Sat, Jan 22, 2011 at 12:55 AM, Dave Abrahams <dave@boostpro.com> wrote:
At Fri, 21 Jan 2011 20:07:51 +0800, Dean Michael Berris wrote:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it.
Do you want to prevent 1. wholesale mutation such as
x = y
x += y
or just
2. per-char mutation such as
x[10] = 'a'
?
eliminating #2 does a lot for implementation flexibility (e.g. allowing refcounts or GC to be used cleanly), and can be useful for thread safety if there's no "small string optimization," because the buffers holding the chars are truly immutable.
However, preventing #1 is a more serious matter...
I want to prevent #2 but not #1. :)

And actually, I would have phrased the concept of #1 as:

    x = "Some string";
    x = x ^ "... and another string";

because adding two strings isn't the same as joining two strings in concatenation. ;)
2. Has real value semantics. This means, once you've copied it, that's really copied. No funky copy-on-write reference-counting mumbo-jumbo.
I guess you're talking about just per-char mutation, then, because value semantics implies assignability.
Yep.
3. Has all the algorithms that apply to it defined externally.
4. Looks like a real STL container except the iterator type is smarter than your average iterator.
Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
What does it iterate over? chars? code points? characters? Something else?
I can see basically a way of saying what you want when you want to get an iterator from it -- by default, though, a call to '.begin()' will return an iterator over characters (just so you don't break compatibility with std::string). The iterator can store a reference to the original string and, when advanced, can do the appropriate interpretation of the string in context. If you wanted a code point iterator, you'd get the code point iterator. If you wanted a character based on a certain encoding, then you can have a special iterator for that. An iterator would also know whether it was out of bounds.

This allows people to write code that deals with code points, characters (based on the encoding), and raw data if absolutely necessary.

-- Dean Michael Berris about.me/deanberris

At Sat, 22 Jan 2011 01:14:38 +0800, Dean Michael Berris wrote:
4. Looks like a real STL container except the iterator type is smarter than your average iterator.
Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
What does it iterate over? chars? code points? characters? Something else?
I can see basically a way of saying what you want when you want to get an iterator from it -- by default, though, a call to '.begin()' will return an iterator over characters (just so you don't break compatibility with std::string).
Then you mean an iterator over chars, not characters.
The iterator can store a reference to the original string and when advanced, can do the appropriate interpretation of the string in context. If you wanted a code point iterator, you'd get the code point iterator. If you wanted a character based on a certain encoding then you can have a special iterator for that. An iterator would also know whether it was out of bounds.
This allows people to write code that dealt with code points, characters (based on the encoding), and raw data if absolutely necessary.
Hmm, I'm just not sure whether these are useful. The iterators to be supplied (if any) should IMO be dictated by the needs of real algorithms. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Sat, Jan 22, 2011 at 1:51 AM, Dave Abrahams <dave@boostpro.com> wrote:
At Sat, 22 Jan 2011 01:14:38 +0800, Dean Michael Berris wrote:
4. Looks like a real STL container except the iterator type is smarter than your average iterator.
Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
What does it iterate over? chars? code points? characters? Something else?
I can see basically a way of saying what you want when you want to get an iterator from it -- by default, though, a call to '.begin()' will return an iterator over characters (just so you don't break compatibility with std::string).
Then you mean an iterator over chars, not characters.
Yeah, over chars. :)
The iterator can store a reference to the original string and when advanced, can do the appropriate interpretation of the string in context. If you wanted a code point iterator, you'd get the code point iterator. If you wanted a character based on a certain encoding then you can have a special iterator for that. An iterator would also know whether it was out of bounds.
This allows people to write code that dealt with code points, characters (based on the encoding), and raw data if absolutely necessary.
Hmm, I'm just not sure whether these are useful. The iterators to be supplied (if any) should IMO be dictated by the needs of real algorithms.
I thought about it a little more too, and there should be a way of just crafting the appropriate iterator from the outside -- much like how the current Iterators library allows you to create different kinds of iterators. Algorithms that deal with text, like rendering characters for example in a GUI, would basically need to iterate over code points or glyphs. Typesetting algorithms would pretty much need the same kind of traversal. Also things like instance counting (building a histogram based on character counts) for example for compression and all the cool things like that would need to have access to individual "elements" of a given text -- in the pre-Unicode days this was just a simple table of 255 characters, unfortunately it's gotten a lot more complex than that ;). -- Dean Michael Berris about.me/deanberris

On 21 January 2011 10:55, Dave Abrahams <dave@boostpro.com> wrote:
Do you want to prevent 1. wholesale mutation such as
x = y
x += y
or just
2. per-char mutation such as
x[10] = 'a'
?
Eliminating #2 but not #1 would force c_str() to make a (possibly tracked) copy, to avoid #2 on its internal buffer (due to the trailing '\0' byte). -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On Fri, 21 Jan 2011 15:31:16 -0600, Nevin Liber wrote:
On 21 January 2011 10:55, Dave Abrahams <dave@boostpro.com> wrote:
Do you want to prevent 1. wholesale mutation such as
x = y
x += y
or just
2. per-char mutation such as
x[10] = 'a'
?
Eliminating #2 but not #1 would force c_str() to make a (possibly tracked) copy, to avoid #2 on its internal buffer (due to the trailing '\0' byte).
Why does the trailing '\0' affect anything? It could store the immutable string data internally in a buffer that has an extra '\0'. The only time it considers this '\0' to be part of the data is on a call to c_str().

Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

On Fri, Jan 21, 2011 at 4:31 PM, Nevin Liber <nevin@eviloverlord.com> wrote:
On 21 January 2011 10:55, Dave Abrahams <dave@boostpro.com> wrote:
Do you want to prevent 1. wholesale mutation such as
x = y
x += y
or just
2. per-char mutation such as
x[10] = 'a'
?
Eliminating #2 but not #1 would force c_str() to make a (possibly tracked) copy, to avoid #2 on its internal buffer (due to the trailing '\0' byte).
Who says we need a c_str()? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

From: Dave Abrahams <dave@boostpro.com> On Fri, Jan 21, 2011 at 4:31 PM, Nevin Liber <nevin@eviloverlord.com> wrote:
On 21 January 2011 10:55, Dave Abrahams <dave@boostpro.com> wrote:
Do you want to prevent 1. wholesale mutation such as
x = y
x += y
or just
2. per-char mutation such as
x[10] = 'a'
?
Eliminating #2 but not #1 would force c_str() to make a (possibly tracked) copy, to avoid #2 on its internal buffer (due to the trailing '\0' byte).
Who says we need a c_str()?
Almost everybody who uses any kind of API that does not directly support this string -- and that is almost every API around...

Artyom

On Mon, Jan 24, 2011 at 7:50 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dave Abrahams <dave@boostpro.com> On Fri, Jan 21, 2011 at 4:31 PM, Nevin Liber <nevin@eviloverlord.com> wrote:
Eliminating #2 but not #1 would force c_str() to make a (possibly tracked) copy, to avoid #2 on its internal buffer (due to the trailing '\0' byte).
Who says we need a c_str()?
Almost everybody who uses any kind of API that does not directly support this string -- and that is almost every API around...
Right, but what's keeping that person from creating an std::string from this hypothetical `boost::string` if they really need a c_str()? -- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote:
On Mon, Jan 24, 2011 at 7:50 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dave Abrahams <dave@boostpro.com>
Who says we need a c_str()?
Almost everybody who uses any kind of API that does not directly support this string -- and that is almost every API around...
Right, but what's keeping that person from creating an std::string from this hypothetical `boost::string` if they really need a c_str()?
Do you mean besides poorer performance and greater complexity of use?

_____
Rob Stewart robert.stewart@sig.com
Software Engineer, Core Software using std::disclaimer;
Susquehanna International Group, LLP http://www.sig.com

On Mon, Jan 24, 2011 at 9:51 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
On Mon, Jan 24, 2011 at 7:50 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dave Abrahams <dave@boostpro.com>
Who says we need a c_str()?
Almost everybody who uses any kind of API that does not directly support this string -- and that is almost every API around...
Right, but what's keeping that person from creating an std::string from this hypothetical `boost::string` if they really need a c_str()?
Do you mean besides poorer performance and greater complexity of use?
Sure, but a conversion operator implemented by the hypothetical boost::string should suffice, no? For interfaces that take an std::string, that'd be all you need, really. For those that need a c_str(), I guess that's the price you pay for a crazy-efficient string that you can use in the innards of your application: you only pay for the conversion when you actually need a c_str(). :D

Of course I'm half-joking there, but I think that's a reasonable price to pay, much like how everyone just deals with the fact that at some point you absolutely have to linearize your data structures into a void* to send them through network APIs that take a void* and a size. Maybe people want zero-copy APIs, but even Linux doesn't provide that just yet, at least to user-space applications, so I don't see why having just an extra string copy to an std::string would be a big deal.

-- Dean Michael Berris about.me/deanberris

On 24 January 2011 08:20, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Mon, Jan 24, 2011 at 9:51 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
On Mon, Jan 24, 2011 at 7:50 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dave Abrahams <dave@boostpro.com>
Who says we need a c_str()?
Almost everybody who uses any kind of API that does not directly support this string -- and that is almost every API around...
Right, but what's keeping that person from creating an std::string from this hypothetical `boost::string` if they really need a c_str()?
Do you mean besides poorer performance and greater complexity of use?
Isn't it too early to discuss optimization tricks, before the interface has started to take form?

I looked at the string_proposal, and if Boost is going for a distinct utf8 type (in whatever form), I think it should be built on something not directly tied to utf8.

namespace boost {

template <class Encoding>
class string
{
public:
    typedef some_iterator_base<Encoding> char/code_point/character_iterator;
    // Disclaimer: I'm not familiar enough with unicode terminology to know
    // what kinds of iterators are needed, and how they relate to properties
    // of the encoding.

    template <class Iterator, class OtherEncoding>
    string(Iterator first, Iterator last, OtherEncoding);
    // Add other convenience constructors from ranges here, taking the
    // encoding as a separate parameter.

    template <class OtherEncoding>
    string(string<OtherEncoding> const&);

    string& operator=(string const&); // Only assignment of the same encoding allowed.

    template <class Iterator, class OtherEncoding>
    void assign(Iterator first, Iterator last, OtherEncoding);

    template <class OtherEncoding>
    std::string to_string(OtherEncoding const&) const;
};

} // namespace boost

Usage:

namespace encoding = boost::string_encoding; // Or something cleaner

boost::string<encoding::utf8> s_utf8("Hello world", encoding::latin1());
boost::string<encoding::latin1> s_latin1(s_utf8);

FILE* f = boost::fopen(s_utf8); // overload taking boost::string<encoding::utf8> const&

Boost could have as policy that interfaces should take boost::utf8 and not bother with implementing any other encoding (but still support extending the Encoding concept, so that other encodings can be expressed). In the future, a theoretical layer built upon the proposed Boost.Locale could expose more encodings to use with boost::string.

/me also ducks and covers (:

- Christian

On Mon, Jan 24, 2011 at 12:59 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Mon, Jan 24, 2011 at 7:50 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dave Abrahams <dave@boostpro.com> On Fri, Jan 21, 2011 at 4:31 PM, Nevin Liber <nevin@eviloverlord.com> wrote:
Eliminating #2 but not #1 would force c_str() to make a (possibly tracked) copy, to avoid #2 on its internal buffer (due to the trailing '\0' byte).
Who says we need a c_str()?
Almost everybody who uses any kind of API that does not directly support this string -- and that is almost every API around...
Right, but what's keeping that person from creating an std::string from this hypothetical `boost::string` if they really need a c_str()?
In the last couple of days I browsed through some of the code that I wrote or that I have access to, and this is not some marginal use-case. I don't say that the sources are statistically representative or anything, but using strings to interact with the OS's APIs is actually one of the most prevalent use-cases. I don't think it is a good idea to focus on getting the basic string manipulation to be uber-efficient at the expense of the performance of the string in a "real-world" context. Your program does not always look like:

int main(void)
{
    boost::string your_string;
    do(something(really(cool(and(efficient(with(your_string)))))));
    return 0;
}

Many times you do things like:

a) read a file, parse its contents, create instances from the data
b) get a string from a socket, manipulate it, display it in the GUI
c) get a string from a GUI, save it into a config file
d) take a string literal, localize/translate it by gettext, show it

etc. In order for this thing to be widely adopted (which is one of my goals) it has to be nice to the existing APIs and, let's face it, most of them expect std::string or just plain ol' char*.

BR, Matus

On Wed, Jan 26, 2011 at 4:09 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Mon, Jan 24, 2011 at 12:59 PM, Dean Michael Berris
Right, but what's keeping that person from creating an std::string from this hypothetical `boost::string` if they really need a c_str()?
In the last couple of days I browsed through some of the code that I wrote or that I have access to and this is not some marginal use-case.
I can say the same thing, although I'm largely concerned about "changing the way people think about strings" with an implementation that actually works and can be adapted in real-world situations rather than trying to support the IMO borked status quo. ;)
I don't say that the sources are statistically representative or anything, but using strings to interact with the OS's APIs is actually one of the most prevalent use-cases. I don't think it is a good idea to focus on getting the basic string manipulation to be uber-efficient at the expense of the performance of the string in a "real-world" context.
Well, see here's the key phrase there that's important: "string manipulation" I actually want to wipe that phrase off the face of the C++ world and have everyone think in terms of "building strings" rather than manipulating them because in reality, there really is no manipulating something that's unique and immutable -- unless you create a new one from it.
Your program does not always look like:
int main(void) { boost::string your_string; do(something(really(cool(and(efficient(with(your_string))))))); return 0; }
FWIW, I think programs written that way are just really ugly and broken. ;)
Many times you do things like: a) read a file, parse its contents, create instances from the data
So, how does immutability change this? Why can't you create an immutable string from a buffer and parse that immutable string and create instances of other types from the data?
b) get string from a socket, manipulate it, display it in the GUI
What's stopping you from building another string from an immutable string to be displayed in the GUI? If you need something that would be mutable, you don't use a string to represent it -- use a different data structure that implies mutability like a vector<char> or similar data structure, build a string from that and display that string to the GUI.
c) get a string from a GUI, save it into a config file
So what's the problem with an immutable string in this context?
d) take a string literal, localize/translate by gettext, show it etc.
Yes (except the gettext part which I largely don't understand), so what's wrong with building an immutable string from a string literal? Or for that matter building an immutable string from gettext output?
In order for this thing to be widely adopted (which is one of my goals) it has to be nice to the existing APIs and let's face it, most of them expect std::string or just plain ol' char*.
Sure, but still I don't see why you need to add c_str() to an immutable string when you're fine to create an std::string from that immutable string and call c_str() from the std::string instance instead? -- Dean Michael Berris about.me/deanberris

On Wed, Jan 26, 2011 at 9:34 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 4:09 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Mon, Jan 24, 2011 at 12:59 PM, Dean Michael Berris
[snip/]
I can say the same thing, although I'm largely concerned about "changing the way people think about strings" with an implementation that actually works and can be adapted in real-world situations rather than trying to support the IMO borked status quo. ;)
I don't say that the sources are statistically representative or anything, but using strings to interact with the OS's APIs is actually one of the most prevalent use-cases. I don't think it is a good idea to focus on getting the basic string manipulation to be uber-efficient at the expense of the performance of the string in a "real-world" context.
Well, see here's the key phrase there that's important:
"string manipulation"
I actually want to wipe that phrase off the face of the C++ world and have everyone think in terms of "building strings" rather than manipulating them because in reality, there really is no manipulating something that's unique and immutable -- unless you create a new one from it.
OK, I think we think the same thing :) Looking at the aforementioned sources, I found that doing things like mystr[i] = whatever() was very rare, and while I used the term manipulation I didn't have this specific case in mind. I'm completely OK, if we all agree that what you propose is the way to go, with your idea of how strings should be manip... er, scratch that... built.
Your program does not always look like:
int main(void) { boost::string your_string; do(something(really(cool(and(efficient(with(your_string))))))); return 0; }
FWIW, I think programs written that way are just really ugly and broken. ;)
Many times you do things like: a) read a file, parse its contents, create instances from the data
So, how does immutability change this? Why can't you create an immutable string from a buffer and parse that immutable string and create instances of other types from the data?
b) get string from a socket, manipulate it, display it in the GUI
What's stopping you from building another string from an immutable string to be displayed in the GUI? If you need something that would be mutable, you don't use a string to represent it -- use a different data structure that implies mutability like a vector<char> or similar data structure, build a string from that and display that string to the GUI.
c) get a string from a GUI, save it into a config file
So what's the problem with an immutable string in this context?
d) take a string literal, localize/translate by gettext, show it etc.
The immutability *does not* have anything to do with the problems in the use-cases described above. Encoding *does*.
Yes (except the gettext part which I largely don't understand), so what's wrong with building an immutable string from a string literal? Or for that matter building an immutable string from gettext output?
In order for this thing to be widely adopted (which is one of my goals) it has to be nice to the existing APIs and let's face it, most of them expect std::string or just plain ol' char*.
Sure, but still I don't see why you need to add c_str() to an immutable string when you're fine to create an std::string from that immutable string and call c_str() from the std::string instance instead?
One word. Performance :) BR, Matus

Excuse my ignorance, but can someone explain to me why people are so keen on immutable strings? Aren't they basically the same as 'shared_ptr<const std::string>'?

I follow these discussions, and I must admit that I already use std::string in my projects with utf8 encoding assumed by default. What matters to me is the lack of a "standard" way to manipulate those strings. I.e.:

1) Convert them to and from other APIs' encodings: SetWindowTextW(to_utf16(my_string));
2) Iterate through the codepoints, characters, words etc., like this: for(char32_t cp : codepoints(my_string)) ...;

The original proposal (in the other thread) was to use the type of the string to ensure at compile time that the above code is valid. I understand that it is needed in the current world where not everybody uses utf8. It's fine for me. But why

On Fri, Jan 21, 2011 at 13:25, Matus Chochlik <chochlik@gmail.com> wrote:
create a class called boost::string that will have all the properties that a string handling class in 2011+ A.D. should have, basically what std::string should have been.
? What are those properties? Isn't std::string already what it should have been? Do you mean that you want to put in there every possible algorithm you can imagine? IMO std::string is just a container of bytes with two useful convenience methods (c_str() and substr()) and a utf8 encoding that should have been assumed by default but unfortunately isn't. Everything else should be generic algorithms that work with sequences of characters in some encoding. So, maybe it's better to focus on designing something like boost::iterator_range with an encoding associated with it, and algorithms that work with these ranges? -- Yakov

On Wed, Jan 26, 2011 at 10:37 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
Excuse my ignorance, but can someone explain to me why people are so keen on immutable strings? Aren't they basically the same as 'shared_ptr<const std::string>'?
I'm fairly neutral on the immutability issue, I do not oppose it if someone shows why it is a superior design, provided it does not break everything horribly (from the backward compatibility perspective).
I follow these discussions, and I must admit that I already use std::string in my projects with utf8 encoding assumed by default. What matters for me is the lack of a "standard" way to manipulate those strings. I.e.: 1) Convert them to and from other APIs' encoding: SetWindowTextW(to_utf16(my_string)); 2) Iterate through the codepoints, characters, words etc.. like this: for(char32_t cp : codepoints(my_string)) ...;
+1
The original proposal (in the other thread) was to use the type of the string to ensure at compile time that the above code is valid. I understand that it is needed in the current world where not everybody uses utf8. It's fine for me. But why
On Fri, Jan 21, 2011 at 13:25, Matus Chochlik <chochlik@gmail.com> wrote:
create a class called boost::string that will have all the properties that a string handling class in 2011+ A.D. should have, basically what std::string should have been.
The original proposal was to keep the existing string but to switch to UTF-8 as the default encoding. This is still my long-term goal. The whole discussion changed my opinion on how to get there. I personally would not have any problem with doing the instant switch... but many other people would, and with good reasons.
? What are those properties? Isn't std::string already what it should have been? Do you mean that you want to put in there every possible algorithm you can imagine?
What I was talking about is basically adding some more convenience member functions, many of which are currently implemented by the string_algo Boost library, to the string's interface, and more importantly extending the string's interface with 'Unicode-functionality', i.e. the ability to traverse the string not just as a sequence of bytes but as a sequence of Unicode code-points and, if possible, even "logical characters".
IMO std::string is just a container of bytes with two useful convenience methods (c_str() and substr()) and a utf8 encoding that had to be assumed by default but unfortunately isn't. Everything else should be generic algorithms that work with sequences of characters in some encoding. So, maybe it's better to focus on designing something like boost::iterator_range with an encoding associated with it and algorithms that work with these ranges?
If that is to succeed, it has to be backward-compatible with the existing APIs, however borked they seem to us (me included). There are lots of string implementations that are *cool* but unusable by anything except algorithms specifically designed for them. Matus

On Wed, Jan 26, 2011 at 11:54, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 10:37 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
Excuse my ignorance, but can someone explain to me why people are so keen on immutable strings? Aren't they basically the same as 'shared_ptr<const std::string>'?
I'm fairly neutral on the immutability issue, I do not oppose it if someone shows why it is a superior design, provided it does not break everything horribly (from the backward compatibility perspective).
Me too, but it definitely will break existing code: string.resize(91); [...]
? What are those properties? Isn't std::string already what it should have been? Do you mean that you want to put in there every possible algorithm you can imagine?
What I was talking about is basically adding some more convenience member functions, many of which are currently implemented by the string_algo Boost library, to the string's interface, and more importantly extending the string's interface with 'Unicode-functionality', i.e. the ability to traverse the string not just as a sequence of bytes but as a sequence of Unicode code-points and, if possible, even "logical characters".
IMO std::string is just a container of bytes with two useful convenience methods (c_str() and substr()) and a utf8 encoding that should have been assumed by default but unfortunately isn't. Everything else should be generic algorithms that work with sequences of characters in some encoding. So, maybe it's better to focus on designing something like boost::iterator_range with an encoding associated with it, and algorithms that work with these ranges?

If that is to succeed, it has to be backward-compatible with the existing APIs, however borked they seem to us (me included). There are lots of string implementations that are *cool* but unusable by anything except algorithms specifically designed for them.
I can't exactly understand what has to be backward compatible with what... Can you please provide a few code snippets that mustn't break so I could think about that?

My point is that 'Unicode-functionality' should be separate from the string implementation. This code: for(char32_t cp : codepoints(my_string)); should work with any type of my_string whose encoding is known. I'm not against adding convenience functions into the string. It makes the code more readable when you concatenate operations. However, it violates this: http://www.drdobbs.com/184401197

-- Yakov

On Wed, Jan 26, 2011 at 12:42 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Wed, Jan 26, 2011 at 11:54, Matus Chochlik <chochlik@gmail.com> wrote: [snip/]
I'm fairly neutral on the immutability issue, I do not oppose it if someone shows why it is a superior design, provided it does not break everything horribly (from the backward compatibility perspective).
Me too, but it definitely will break existing code: string.resize(91);
This is just one of the examples. The append/prepend/etc. are others. The question is: do we allow them for the sake of backward compatibility and implement them using immutable semantics? Even resize could be implemented this way. Another matter is whether it makes sense. [snip/]
My point is that 'Unicode-functionality' should be separate from the string implementation. This code for(char32_t cp : codepoints(my_string)); should work with any type of my_string whose encoding is known.
If you need just this, then why not use std::string as it is now for my_string and use any of the Unicode libraries around? What I would like is a string with which I *can* forget that there ever was anything like encodings other than Unicode, except for those cases where it is completely impossible. And even in those cases, like when calling an OS API function, I don't want to specify exactly what encoding I want, but just to say: give me a representation (or "view" if you like) of the string that is in the "native" encoding of the currently selected locale for the desired character type. Something like this:

whatever_the_string_class_name_will_be cmd = init();
system(cmd.native<char>().c_str());
ShellExecute(..., cmd.native<TCHAR>().c_str(), ...);
ShellExecuteW(..., cmd.native<wchar_t>().c_str(), ...);
wxExecute(cmd.native<wxChar>());

or

whatever_the_string_class_name_will_be caption = get_non_ascii_string();
new wxFrame(parent, wxID_ANY, caption.native<wxChar>(), ...);

In many cases the above could be a no-op, depending on the *internal* encoding used by this string class. It could be UTF-8 by default and maybe UTF-16 on Windows. Specifying *exactly* (like with iso_8859_2_cp_tag, or utf32_cp_tag, ...) which encoding I want should be done only when absolutely necessary and *not* every time I want to do something with the string.

Also, there should be iterators allowing you to do this, again without specifying exactly what encoding you want:

// cp_begin returning a "code-point-iterator"
auto i = str.cp_begin(), e = str.cp_end();
if(i != e && *i == code_point(0x0123))
    do_something();

or even (if this is possible):

// cr_begin returning a character iterator
auto i = str.cr_begin(), e = str.cr_end();
// if the first character is A with acute ...
if(i != e && *i == unicode_character({0x0041, 0x0301}))
    do_something();
I'm not against adding convenience functions into the string. It makes the code more readable when you concatenate operations. However, it violates this: http://www.drdobbs.com/184401197
I do not want to overuse the "breaking" of the encapsulation by adding new non-static or friend functions. If we can take advantage of it in the implementation of, say, trim(str) we may, but we don't have to. This is an implementation detail (if we decide that the usage is trim(str) and not str.trim()).
[snip/]
If that is to succeed, it has to be backward-compatible with the existing APIs, however borked they seem to us (me included). There are lots of string implementations that are *cool* but unusable by anything except algorithms specifically designed for them.
I can't exactly understand what has to be backward compatible with what... Can you please provide a few code snippets that mustn't break so I could think about that?
Maybe you have something different in mind, but what I was talking about is that you cannot pass an iterator_range *directly* to a WINAPI (or any other OS API that I know of) call. BR, Matus

On Wed, Jan 26, 2011 at 15:04, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 12:42 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Wed, Jan 26, 2011 at 11:54, Matus Chochlik <chochlik@gmail.com> wrote: [snip/]
I'm fairly neutral on the immutability issue, I do not oppose it if someone shows why it is a superior design, provided it does not break everything horribly (from the backward compatibility perspective).
Me too, but it definitely will break existing code: string.resize(91);
This is just one of the examples. The append/prepend/etc. are others. The question is: do we allow them for the sake of backward compatibility and implement them using immutable semantics? Even resize could be implemented this way. Another matter is whether it makes sense.
Fine. If immutable strings with backward compatibility result in changing string.resize(91); to string = string.resize(91); I vote against immutability. Even if you throw compatibility away, no-one has yet explained why immutable strings are better. For me it smells like "modern language here" influence.
[snip/]
My point is that 'Unicode-functionality' should be separate from the string implementation. This code: for(char32_t cp : codepoints(my_string)); should work with any type of my_string whose encoding is known.
If you need just this, then why not use std::string as it is now for my_string and use any of the Unicode libraries around? What I would like is a string with which I *can* forget that there ever was anything like encodings other than Unicode, except for those cases where it is completely impossible.
And even in those cases, like when calling an OS API function, I don't want to specify exactly what encoding I want, but just to say: give me a representation (or "view" if you like) of the string that is in the "native" encoding of the currently selected locale for the desired character type.
Let me try to explain myself in other words. I propose the iterator_range idea as a means by which you (we) achieve our goal. It's more like a C++0x concept. By "my_string whose encoding is known" I meant that strings like u8string should map to string_ranges with typename encoding == utf_8 (for example). As a result you *won't* need to specify the exact encoding, because it will be deduced from the context. The only place you will write the encoding explicitly is at the boundaries between your code and legacy APIs. Look at the code you provided:
Something like this:
whatever_the_string_class_name_will_be cmd = init();
system(cmd.native<char>().c_str());
ShellExecute(..., cmd.native<TCHAR>().c_str(), ...);
ShellExecuteW(..., cmd.native<wchar_t>().c_str(), ...);
wxExecute(cmd.native<wxChar>());
or
whatever_the_string_class_name_will_be caption = get_non_ascii_string();
new wxFrame(parent, wxID_ANY, caption.native<wxChar>(), ...);
The ShellExecuteW, wxExecute and wxFrame calls are actually *more verbose than they have to be*. wxString is documented to be utf16-encoded, as is LPCWSTR on Windows. So, providing a mapping from wxString to the string_range concept, you could write it as:

wxExecute(cmd); // creates utf16 wxString
new wxFrame(parent, wxID_ANY, caption, ...); // creates utf16 wxString

As a result *less* code will be affected when switching to utf8.
In many cases the above could be a no-op, depending on the *internal* encoding used by this string class. It could be UTF-8 by default and maybe UTF-16 on Windows.
Specifying *exactly* (like with iso_8859_2_cp_tag, or utf32_cp_tag, ...) which encoding I want, should be done only when absolutely necessary and *not* every time when I want to do something with the string.
Also, there should be iterators allowing you to do this, again without specifying what encoding you want exactly:
This is what I meant.
// cp_begin returning a "code-point-iterator"
auto i = str.cp_begin(), e = str.cp_end();
if(i != e && *i == code_point(0x0123))
    do_something();
or even (if this is possible):
// cr_begin returning a character iterator
auto i = str.cr_begin(), e = str.cr_end();
// if the first character is A with acute ...
if(i != e && *i == unicode_character({0x0041, 0x0301}))
    do_something();
I prefer:

auto i = codepoints(str).begin(), e = codepoints(str).end();
auto i = characters(str).begin(), e = characters(str).end();

So 1) we can extend the syntax uniformly to words, sentences etc... 2) str may be of any type that maps to the string_range concept, be it boost::string or (when a switch to utf8 occurs) std::string or a string literal. If str is not mapped to string_range then the programmer must specify the encoding explicitly:

std::string str = "hi";
const char* str2 = exception.what();
auto i = codepoints(treat_as<utf_8>(str)).begin();  // no-copy, no-op, just a cast.
auto i = codepoints(treat_as<utf_8>(str2)).begin(); // works
auto i = codepoints(str).begin(); // error: string is of unknown encoding.
                                  // Compiles in 20 years when everyone uses utf8.

boost::string (whatever the name) will be just an std::string mapped to string_range in utf_8 encoding. [...]
-- Yakov

On Wed, Jan 26, 2011 at 3:06 PM, Yakov Galka <ybungalobill@gmail.com> wrote: [snip/]
Fine. If immutable strings with backward compatibility result in changing string.resize(91); to string = string.resize(91);
I don't see why this should be needed.
I vote against immutability. Even if you throw compatibility away, no-one has yet explained why immutable strings are better. For me it smells like "modern language here" influence.
[snip/]
If you need just this, then why not use std::string as it is now for my_string and use any of the Unicode libraries around? What I would like is a string with which I *can* forget that there ever was anything like encodings other than Unicode, except for those cases where it is completely impossible.
And even in those cases, like when calling an OS API function, I don't want to specify exactly what encoding I want, but just to say: give me a representation (or "view" if you like) of the string that is in the "native" encoding of the currently selected locale for the desired character type.
Let me try to explain myself in other words. I propose the iterator_range idea as a means by which you (we) achieve our goal. It's more like a C++0x concept. By "my_string whose encoding is known" I meant that strings like u8string should map to string_ranges with typename encoding == utf_8 (for example). As a result you *won't* need to specify the exact encoding, because it will be deduced from the context. The only place you will write the encoding explicitly is at the boundaries between your code and legacy APIs.
Look at the code you provided:
Something like this:
whatever_the_string_class_name_will_be cmd = init();
system(cmd.native<char>().c_str());
ShellExecute(..., cmd.native<TCHAR>().c_str(), ...);
ShellExecuteW(..., cmd.native<wchar_t>().c_str(), ...);
wxExecute(cmd.native<wxChar>());
or
whatever_the_string_class_name_will_be caption = get_non_ascii_string();
new wxFrame(parent, wxID_ANY, caption.native<wxChar>(), ...);
The ShellExecuteW, wxExecute and wxFrame calls are actually *more verbose than they have to be*. wxString is documented to be utf16-encoded, as is LPCWSTR on Windows. So, providing a mapping from wxString to the string_range concept, you could write it as:
wxExecute(cmd); // creates utf16 wxString
new wxFrame(parent, wxID_ANY, caption, ...); // creates utf16 wxString
As a result *less* code will be affected when switching to utf8.
OK, if this is doable in the context of Boost, then you certainly will not hear any complaining from me. [snip/]
This is what I meant.
// cp_begin returning a "code-point-iterator"
auto i = str.cp_begin(), e = str.cp_end();
if(i != e && *i == code_point(0x0123))
    do_something();
or even (if this is possible):
// cr_begin returning a character iterator
auto i = str.cr_begin(), e = str.cr_end();
// if the first character is A with acute ...
if(i != e && *i == unicode_character({0x0041, 0x0301}))
    do_something();
I prefer:

auto i = codepoints(str).begin(), e = codepoints(str).end();
auto i = characters(str).begin(), e = characters(str).end();
I really don't insist on cr_begin, etc. to be member functions (nor on calling them cr_begin, ..., for that matter).
So 1) we can extend the syntax uniformly to words, sentences etc... 2) str may be of any type that maps to the string_range concept, be it boost::string or (when a switch to utf8 occurs) std::string or a string literal.
If str is not mapped to string_range then the programmer must specify the encoding explicitly:

std::string str = "hi";
const char* str2 = exception.what();
auto i = codepoints(treat_as<utf_8>(str)).begin();  // no-copy, no-op, just a cast.
auto i = codepoints(treat_as<utf_8>(str2)).begin(); // works
auto i = codepoints(str).begin(); // error: string is of unknown encoding.
                                  // Compiles in 20 years when everyone uses utf8.
boost::string (whatever name) will be just an std::string mapped to string_range in utf_8 encoding.
If we can wrap the treat_as<utf_8> into something that does not refer to any encoding whatsoever in the cases where you don't have to, then *thumbs up*. OK, Matus

On Wed, Jan 26, 2011 at 5:54 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 10:37 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
Excuse my ignorance, but can someone explain to me why people are so keen on immutable strings? Aren't they basically the same as 'shared_ptr<const std::string>'?
I'm fairly neutral on the immutability issue, I do not oppose it if someone shows why it is a superior design, provided it does not break everything horribly (from the backward compatibility perspective).
I think I missed the part where the immutable string has to be backward compatible. Backward compatible to what? -- Dean Michael Berris about.me/deanberris

On Wed, Jan 26, 2011 at 5:37 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
Excuse my ignorance, but can someone explain to me why people are so keen on immutable strings? Aren't they basically the same as 'shared_ptr<const std::string>'?
Nope. For one, an immutable string doesn't have to be stored in a contiguous chunk of memory. It can share common parts across instances, it can -- and probably should -- be implemented as a segmented container, and interning immutable strings is a popular way of getting crazy performance gains in different kinds of applications at relatively modest cost. These are things you don't get with just a shared_ptr<std::string const>. :) Add to that the capability to use/define a "sane" memory management solution and you'd have a pretty good competitor to good old std::string. :) -- Dean Michael Berris about.me/deanberris

Yakov Galka wrote :
Excuse my ignorance, but can someone explain to me why people are so keen on immutable strings? Aren't they basically the same as 'shared_ptr<const std::string>'?
<snip>
create a class called boost::string that will have all the properties that a string handling class in 2011+ A.D. should have, basically what std::string should have been.
? What are those properties? Isn't std::string *is* what it should have been? Do you mean that you want to put there in any possible algorithm you can imagine?
Do you really consider a structure allowing anybody to change any byte in its internal representation, eventually breaking its validity, a suitable candidate for a publicly used, standard, encoded string? If the message is "Don't mess with my bytes, use my vendor's high-level API to access me", we should not expect C++ developers to get it by providing a hazardous backward-compatible API. I see the immutable string proposal as a way to express the definitely needed breaking change in our string-handling habits. A kind of shortcut, though, but quite a good one. Regards, Ivan.

On Wed, Jan 26, 2011 at 5:10 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 9:34 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
Well, see here's the key phrase there that's important:
"string manipulation"
I actually want to wipe that phrase off the face of the C++ world and have everyone think in terms of "building strings" rather than manipulating them because in reality, there really is no manipulating something that's unique and immutable -- unless you create a new one from it.
OK, I think we think the same thing :) Looking at the aforementioned sources, I found that doing things like mystr[i] = whatever() was very rare, and while I used the term manipulation I didn't have this specific case in mind. I'm completely OK, if we all agree that what you propose is the way to go, with your idea of how strings should be manip... er, scratch that... built.
Agreed then as long as we stop thinking about manipulating strings along the way. :D
The immutability *does not* have anything to do with the problems in the use-cases described above. Encoding *does*.
Right, so why should the encoding be part of the string then? I say the encoding should be external to the string (which I've been saying for the Nth time, I think) and just a transformation on an input string. The transformation doesn't even have to be immediate -- it could and probably should be lazy. When you have immutable strings, the lazy application of transformations is really a game changer, especially compared to the way people (at least in C++) are used to dealing with strings.
Sure, but still I don't see why you need to add c_str() to an immutable string when you're fine to create an std::string from that immutable string and call c_str() from the std::string instance instead?
One word. Performance :)
So you're saying c_str() is a performance enhancing feature? Did you consider that the reason why std::string concatenation is a bad performance killer is precisely because it has to support c_str()? I'd argue that converting an immutable string object (which doesn't have to be stored as a contiguous chunk of memory unlike how C strings are handled) to an std::string can be the (amortized constant) cost of a segmented traversal of the same string. So really, for performance reasons, I'd say an immutable string converted to an std::string will cost you once -- and I'd say probably just once for the lifetime of one immutable string especially if we bolt on interning or similar goodies -- and the performance would be pretty much predictable. Much like how you have to deal with linearizing your data anyway when stuffing it into a socket, it's just one of those things you've got to pay for at some point. ;) -- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote:
On Wed, Jan 26, 2011 at 5:10 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 9:34 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
The immutability *does not* have a thing with the problems in the use-cases described above. Encoding *does*.
Right, so why should the encoding be part of the string then? I say the encoding should be external to the string (which I've been saying for the Nth time I think) and just a transformation on an input string. The transformation doesn't even have to be immediate -- it could and probably should be lazy. When you have immutable strings the lazy application of transformations is really a game changer especially in the way people (at least in C++) are used to when dealing with strings.
Based upon previous discussion, I think you need to present your case better for immutability. Others consider mutability to be an intrinsic and beneficial characteristic of a string class. You are proposing to drop that and assume all others implicitly understand why that would be good for all.
Sure, but still I don't see why you need to add c_str() to an immutable string when you're fine to create an std::string from that immutable string and call c_str() from the std::string instance instead?
One word. Performance :)
So you're saying c_str() is a performance enhancing feature? Did you consider that the reason why std::string concatenation is a bad performance killer is precisely because it has to support c_str()?
That's an interesting viewpoint. Note, however, that it is extremely common to build a string, piecewise, and then use it as an array of characters with some OS API. Those APIs won't change, so c_str(), in some form, is definitely needed. Furthermore, the piecewise assembly seems likely to be inefficient given an immutable string, especially one from which a contiguous array of characters is needed. Can you illustrate how that would be done and how it can be as efficient or more so than the status quo?
I'd argue that converting an immutable string object (which doesn't have to be stored as a contiguous chunk of memory unlike how C strings are handled) to an std::string can be the (amortized constant) cost of a segmented traversal of the same string.
That's quite interesting, but I'd argue that creating a std::string, which allocates a buffer on the free store to hold a duplicate of the sequence in the immutable string object, can be unnecessary overhead should the immutable string already hold a contiguous array of characters. Thus, the sequence you suggest -- immutable string object to std::string to contiguous array of characters -- may be unnecessarily inefficient. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

On Wed, Jan 26, 2011 at 10:46 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
On Wed, Jan 26, 2011 at 5:10 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 9:34 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
The immutability *does not* have a thing with the problems in the use-cases described above. Encoding *does*.
Right, so why should the encoding be part of the string then? I say the encoding should be external to the string (which I've been saying for the Nth time I think) and just a transformation on an input string. The transformation doesn't even have to be immediate -- it could and probably should be lazy. When you have immutable strings the lazy application of transformations is really a game changer especially in the way people (at least in C++) are used to when dealing with strings.
Based upon previous discussion, I think you need to present your case better for immutability.
Right, I think I need to spend a little more time to hash that out. With something close to 100 messages on the thread already I think it's about time I did that. Expect a different thread starter then in the next, oh, few minutes or so.
Others consider mutability to be an intrinsic and beneficial characteristic of a string class. You are proposing to drop that and assume all others implicitly understand why that would be good for all.
I don't think I'm assuming that others will implicitly understand -- I was more hoping that those reading or at least are interested in the proposition would likely try to work it out in their heads and maybe discuss what's unclear to them. ;) That said, it falls on me to be clearer I agree. :D
Sure, but still I don't see why you need to add c_str() to an immutable string when you're fine to create an std::string from that immutable string and call c_str() from the std::string instance instead?
One word. Performance :)
So you're saying c_str() is a performance enhancing feature? Did you consider that the reason why std::string concatenation is a bad performance killer is precisely because it has to support c_str()?
That's an interesting viewpoint. Note, however, that it is extremely common to build a string, piecewise, and then use it as an array of characters with some OS API. Those APIs won't change, so c_str(), in some form, is definitely needed. Furthermore, the piecewise assembly seems likely to be inefficient given an immutable string, especially one from which a contiguous array of characters is needed. Can you illustrate how that would be done and how it can be as efficient or more so than the status quo?
Right. Here's one attempt at presenting how you would: build an immutable string, perform (lazy) transformations of that string, and have a means of linearizing that string into a `char const *` (which in the end is really what c_str() is). To build a string, let's borrow from the sstream library in the STL:

    boost::ostringstream ss;
    ss << "This is a string literal"
       << L"this is another literal"
       << instance.some_char_const_ptr()
       << foo_that_returns_an_std_string_perhaps() // can be moved
       << etcetera();                              // what it returns can be copied
    boost::string s = ss.str(); // #1

In #1 above, str() returns an immutable string already, which means that the ostringstream would be building a segmented data structure that packs chunks together, potentially using Boost.Intrusive data structures -- to make it simple, it could be statically-sized chunks that lay things out in a memory page's worth of memory. So really there's no cost to copy -- well, there will be an increment on the reference count in the string's metadata block. We can also implement a function in ss called 'move' which will actually move the string already built to the holder. We can implement output iterators that deal with boost::ostringstream objects to allow existing STL algorithms to stuff data into a boost::ostringstream using the familiar output iterator type. Once we have that immutable string, we can start copying it around and not worry about having to manually synchronize or ensure that the data is actually copied when dealing with it in different threads. The reference counting can happen transparently in true RAII fashion, which is a good thing IMO. Let me try to describe "lazy" transformations then.
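A minimal compilable sketch of the builder idea above, with every name (chunk_builder, frozen_string) invented for illustration: pieces are collected as shared chunks, and str() moves the chunk list into the immutable result, so "freezing" the string costs no concatenation copy.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical "build, then freeze" sketch. Real chunking would use
// fixed-size blocks; here each appended piece simply becomes a chunk.
class frozen_string {
public:
    explicit frozen_string(std::vector<std::shared_ptr<const std::string>> chunks)
        : chunks_(std::move(chunks)) {}

    std::size_t size() const {
        std::size_t n = 0;
        for (const auto& c : chunks_) n += c->size();
        return n;
    }

private:
    std::vector<std::shared_ptr<const std::string>> chunks_;
};

class chunk_builder {
public:
    chunk_builder& operator<<(const std::string& piece) {
        chunks_.push_back(std::make_shared<const std::string>(piece));
        return *this;
    }
    frozen_string str() {  // moves the chunk list; the builder is spent
        return frozen_string(std::move(chunks_));
    }

private:
    std::vector<std::shared_ptr<const std::string>> chunks_;
};
```

Copying a frozen_string afterwards only copies shared_ptrs (reference-count bumps), which is the "no cost to copy" behaviour described above.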
Let's consider the case of obtaining a substring of the original string as a lazy transformation:

    boost::string substring = substr(substr(s, 0, 10), -5, 0);

In a mutable string implementation, substr would *have* to create another string object in the call to substr(s, 0, 10), then apply substr(/* temporary string */, -5, 0) on the temporary, to create yet another temporary that gets assigned to substring -- just so that you preserve the invariants on the existing string (s), which could change at any given point in this nesting of operations. Now consider the immutable case where you don't have to make the copies at all: encapsulate the bounds in an unspecified type that supports the string API as well; then, later on, once the actual assignment is performed, you copy the resulting bounds information and voila, you have a substring from characters (or code points) 5..10 *of the original immutable string that you can still refer to in the end*. Of course you would probably want to refer to just the blocks of the original string that contain these characters, or if the resulting string is short enough (an optimization point) you can actually break it off as a copy in a different memory location. I'm not even going as far as I can go with a DSEL for the string, which would allow you (with something like Proto) to determine at compile time whether it was possible to reduce the nested substrings into a single application of the substring transformation on the string. The last case I promised to show was how to linearize an immutable string into something accessible through a `char const *`. Now that you can see that a string's innards can be implemented as a segmented data structure, you can implement an algorithm external to the string that traverses these segments to turn out a potentially interned `char const *` that's unique to the immutable string.
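The zero-copy substring idea can be sketched with a shared buffer plus bounds; the istring name and layout below are hypothetical, not a proposed API:

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Sketch: a substring is the same shared, immutable buffer with new
// bounds. No character data is copied until someone asks for it.
class istring {
public:
    explicit istring(std::string s)
        : data_(std::make_shared<const std::string>(std::move(s))),
          off_(0), len_(data_->size()) {}

    // substr shares the buffer: only the bounds differ.
    istring substr(std::size_t pos, std::size_t n) const {
        istring r = *this;   // refcount bump, no character copy
        r.off_ += pos;
        r.len_ = n;
        return r;
    }

    // Materialize a copy only when an owned, contiguous string is needed.
    std::string to_std() const { return data_->substr(off_, len_); }

    std::size_t size() const { return len_; }
    long use_count() const { return data_.use_count(); }  // for demonstration

private:
    std::shared_ptr<const std::string> data_;
    std::size_t off_, len_;
};
```

Nesting substr calls here just adjusts offsets, which is exactly the "reduce nested substrings to one application" effect described above, achieved at run time rather than with Proto.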
This interned `char const *` can be referred to in the metadata block of the string and will only ever have to be built once. The interface of that would look something like:

    template <class String>
    char const * linearize(String s) {
        return interned(s);
    }

You can do all sorts of lock-free implementations (potentially leveraging TLS where it matters -- for some C APIs TLS is actually preferable) for the assignment of the linearized block into the metadata block of s, or in the worst case you can just have a different data structure to hold the linearized string. Another interface that would be thread-friendly would be:

    template <class String>
    char const * linearize(String s, void * buf, size_t buf_len) {
        // linearize to the buffer, then
        return static_cast<char const *>(buf);
    }

Since s will never ever change, this doesn't need to synchronize access to s in any thread.
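A sketch of the "linearize once, cache it in the metadata" idea; std::call_once stands in here for whatever lock-free scheme an implementation would really use, and all names are invented:

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical segmented string with a lazily built, cached flat view.
// The object's logical value never changes; only the cache is filled in,
// exactly once, on first request.
class segmented_string {
public:
    explicit segmented_string(std::vector<std::string> segs)
        : segs_(std::move(segs)) {}

    // c_str()-style access: pays the segment traversal exactly once.
    const char* linearize() const {
        std::call_once(once_, [this] {
            for (const auto& s : segs_) flat_ += s;
        });
        return flat_.c_str();
    }

private:
    std::vector<std::string> segs_;
    mutable std::once_flag once_;   // guards the one-time build
    mutable std::string flat_;      // the cached "interned" block
};
```

Repeated calls return the same pointer, so the amortized cost over the string's lifetime is a single traversal, matching the performance argument above.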
I'd argue that converting an immutable string object (which doesn't have to be stored as a contiguous chunk of memory unlike how C strings are handled) to an std::string can be the (amortized constant) cost of a segmented traversal of the same string.
That's quite interesting, but I'd argue that creating a std::string, which allocates a buffer on the free store to hold a duplicate of the sequence in the immutable string object can be unnecessary overhead should the immutable string already hold a contiguous array of characters. Thus, the sequence you suggest -- immutable string object to std::string to contiguous array of characters -- may be unnecessarily inefficient.
That really depends on the interface to the linearization function. If all you needed was to be able to control where you would linearize the immutable string to, then maybe my example above allows you to put the string in a *gulp* stack-based char array. It doesn't necessarily have to be to a std::string if you don't want to put the data there. ;) HTH -- Dean Michael Berris about.me/deanberris

On Wed, Jan 26, 2011 at 3:46 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
[snip/]
Sure, but still I don't see why you need to add c_str() to an immutable string when you're fine to create an std::string from that immutable string and call c_str() from the std::string instance instead?
One word. Performance :)
So you're saying c_str() is a performance enhancing feature? Did you consider that the reason why std::string concatenation is a bad performance killer is precisely because it has to support c_str()?
Of course I have considered it; I have played with SGI's rope class in the past and it is great when you need to do a lot of concatenations and then you *do not* call anything related to c_str(). But c_str() is, at least in my world, something that I (unfortunately) have to call, until somebody ports WINAPI, POSIX, etc. to C++ :)
That's an interesting viewpoint. Note, however, that it is extremely common to build a string, piecewise, and then use it as an array of characters with some OS API. Those APIs won't change, so c_str(), in some form, is definitely needed. Furthermore, the piecewise assembly seems likely to be inefficient given an immutable string, especially one from which a contiguous array of characters is needed. Can you illustrate how that would be done and how it can be as efficient or more so than the status quo?
+1 I can imagine how various libraries could take great advantage of "rope-like" string implementations, but there are unfortunately very few that actually do so. I would *love* this and other string-related or more specifically text-handling-related things to change in the foreseeable future but I am not going to hold my breath :)
I'd argue that converting an immutable string object (which doesn't have to be stored as a contiguous chunk of memory unlike how C strings are handled) to an std::string can be the (amortized constant) cost of a segmented traversal of the same string.
Of course it can. But will your string come with a new version of CryptoAPI, etc., etc. ad nauseam :) that *will* take advantage of it? Until that happens, I say we need at least a reasonably efficient implementation of c_str().
That's quite interesting, but I'd argue that creating a std::string, which allocates a buffer on the free store to hold a duplicate of the sequence in the immutable string object can be unnecessary overhead should the immutable string already hold a contiguous array of characters. Thus, the sequence you suggest -- immutable string object to std::string to contiguous array of characters -- may be unnecessarily inefficient.
Exactly. Best, Matus

Hello, To all discussing how to create an "ultimate" string, I'd like to remind you of the following "ultimate" strings already out there:

1. QString: http://doc.qt.nokia.com/latest/qstring.html
2. Glib::ustring: http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html
3. wxString: http://docs.wxwidgets.org/trunk/classwx_string.html
4. icu::UnicodeString: http://icu-project.org/apiref/icu4c/classUnicodeString.html
5. CString: http://msdn.microsoft.com/en-us/library/ms174288(v=VS.100).aspx

Now Questions:
--------------
1. Why do YOU think you'll be able to create something "better"?
2. Why do YOU think boost::string would be adopted in favor of std::string or one of the currently widely used QString/ustring/wxString/UnicodeString/CString?
3. What painful problems are you going to solve that would make it so much better than the widely used and adopted std::string? Iterators? Mutability? Performance? (Clue: there are no painful problems with std::string)

Now Suggestion:
---------------
1. Accept that there is quite a small chance that something that is not std::string would be widely accepted.
2. Try to solve existing "string" problems by using the same std::string and adding a few things to handle it better. Clue: take a look at what Boost.Locale does.

Best Regards, Artyom

Artyom wrote :
(Clue: there is no painful problems with std::string)
So you do not consider the fact that the following code compiles (and runs!) cleanly a painful problem?

    std::string s = get_utf8_string();
    s[42] = '?';

I consider that std::string is cool as an internal for a Boost.Locale implementor, not as a standard API for encoded strings that should be used by all C++ developers. Ivan.
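To make Ivan's complaint concrete, here is a small structural UTF-8 validator (illustrative only; it checks byte-sequence shape, not overlong forms, and get_utf8_string is stand-in data, not a real API) showing how a single byte-level write leaves the string invalid:

```cpp
#include <cstddef>
#include <string>

// Structural UTF-8 check: every lead byte must be followed by the right
// number of continuation bytes (10xxxxxx), and no continuation byte may
// appear on its own. Overlong/surrogate forms are not checked here.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        int extra = (b < 0x80) ? 0 : (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2
                  : (b >= 0xC0) ? 1 : -1;          // -1: stray continuation
        if (extra < 0 || i + extra >= s.size()) return false;
        for (int k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;
        i += extra + 1;
    }
    return true;
}
```

Overwriting the lead byte of a two-byte sequence with '?' leaves its continuation byte stranded, which is exactly what operator[] lets you do without complaint.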

So you do not consider the fact that the following code compiles (and runs!) cleanly a painful problem?
    std::string s = get_utf8_string();
    s[42] = '?';
It is fine code. Why? Because if you assign to s[42] you probably know why you are doing it, and you most likely know that there is some ASCII character there. For example:

    s = get_some_utf8_text_code();
    for (size_t i = 0; i < s.size(); i++) {
        switch (s[i]) {
        case '\r':
        case '\n':
        case '\t':
            s[i] = ' ';
        }
    }

Something very common and useful for text parsing. Is this code wrong? No! In fact there is lots of code that works fine like this. Removing operator[] does not solve **any** problem.
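Artyom's loop is indeed UTF-8-safe, because every byte of a multi-byte UTF-8 sequence has its high bit set, so ASCII values like '\r' can never match inside one. The same operation written with std::replace_if (a sketch, not anyone's proposed API):

```cpp
#include <algorithm>
#include <string>

// Byte-wise replacement of ASCII whitespace is UTF-8-safe: continuation
// and lead bytes of multi-byte sequences are all >= 0x80, so '\r', '\n'
// and '\t' only ever match real ASCII characters.
void flatten_whitespace(std::string& s) {
    std::replace_if(s.begin(), s.end(),
        [](char c) { return c == '\r' || c == '\n' || c == '\t'; }, ' ');
}
```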
I consider that std::string is cool as an internal for a Boost.Locale implementor, not as a standard API for encoded strings that should be used by all C++ developers.
It seems to me you haven't seen the Boost.Locale code and what it does. Artyom

To all discussing how to create an "ultimate" string, I'd like to remind you of the following "ultimate" strings already out there:
1. QString:
http://doc.qt.nokia.com/latest/qstring.html
2. Glib::ustring:
http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html
3. wxString:
http://docs.wxwidgets.org/trunk/classwx_string.html
4. icu::UnicodeString
http://icu-project.org/apiref/icu4c/classUnicodeString.html
5. CString
http://msdn.microsoft.com/en-us/library/ms174288(v=VS.100).aspx
Now Questions: --------------
1. Why do YOU think you'll be able to create something "better"?
2. Why do YOU think boost::string would be adopted in favor of std::string or one of the current widely used QString/ustring/wxString/UnicodeString/CString?
3. What painful problems are you going to solve that would make it so much better than the widely used and adopted std::string? Iterators? Mutability? Performance?
(Clue: there are no painful problems with std::string)
Now Suggestion: ---------------
1. Accept that there is quite a small chance that something that is not std::string would be widely accepted
2. Try to solve existing "string" problems by using the same std::string and adding a few things to handle it better.
Clue: take a look on what Boost.Locale does.
And we are back at the beginning ... IMO at this point it would be useful if as many people as possible clearly expressed their opinion about this. I basically agree with everything that Artyom said in this post. We want to create a better string class? Fine, but let us do it in a way that makes it as likely as possible that it would be adopted as the next std::string, AND let us also consider the encoding, not only its performance and interface specification. BR, Matus

On Thu, Jan 27, 2011 at 4:09 PM, Artyom <artyomtnk@yahoo.com> wrote:
Hello,
To all discussing how to create an "ultimate" string, I'd like to remind you of the following "ultimate" strings already out there:
Who are these people discussing how to create an "ultimate" string? Oh you mean me who wants to create an "immutable" string? I hardly called it an "ultimate" string so I think you're throwing a strawman red herring here. At any rate, I'll indulge you. [snip all the stupid non-ultimate strings quoted] They're all broken. Is that what you wanted me to say? :D
Now Questions: --------------
1. Why do YOU think you'll be able to create something "better"?
I don't know about "better". I do know "different" though. Better is largely a matter of perspective.
2. Why do YOU think boost::string would be adopted in favor of std::string or one of the current widely used QString/ustring/wxString/UnicodeString/CString?
I don't. That wasn't the point though. It's not some delusion of grandeur or some messianic vision that came from some voice in my head asking me to etch it in stone. There's an opportunity to implement strings in a different way and I think it's worth doing regardless of whether it will be adopted in favor of anything that's already out there. People said COBOL is ugly but people to this day still write programs in it -- so I have no intention of asking others to use the immutable string if they don't want to.
3. What painful problems are you going to solve that would make it so much better than the widely used and adopted std::string? Iterators? Mutability? Performance?
(Clue: there are no painful problems with std::string)
Sorry, but for someone who's dealt with std::string for *a long time (close to 8 years)*, here are a few real painful problems with it:

1. Because of COW implementations you can't deal with it properly in multiple threads without explicitly creating a copy of the string. SSO makes it a lot more unpredictable. There are all sorts of concurrency problems with std::string that are not addressed by the interface.

2. Temporaries abound. Operations return temporaries and copies of data in the string. Substring operations create new strings to avoid the problem of concurrent mutation of the underlying data.

3. It has to be contiguous in memory, because the assumption that data() and c_str() should give a view of the string's data at any given time across mutations makes extending, shrinking, and generally manipulating it a pain, not only from an interface/performance perspective but also from the point of view of an implementor. Then you look at the resource issues this raises, with potential fragmentation when you keep mutating and growing strings.

4. Because of the mutability of std::string, your iterators *may* be invalidated when the string changes. This is crucial for efficiency-concerned code that deals with strings.

5. Because of the contiguity requirement, using it for any "text" larger than a memory page's worth of data will kill your cache coherency -- and when you modify parts of it you can thank your virtual memory manager once the modifications are done. Then you see that you would have to implement your own segmented data structure to act as a string, and you realize you're better off not using std::string for situations where the amount of data is potentially larger than a cache line.
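Point 4 is easy to demonstrate: on typical SSO implementations, growing a short std::string past its capacity moves its storage, so earlier pointers and iterators into it dangle. A hedged sketch (the helper name is invented; exact capacities vary by implementation):

```cpp
#include <string>

// Sketch of pain point 4: pointers/iterators into a std::string are
// invalidated when mutation forces reallocation. Returns true if the
// buffer moved. On common implementations a 5-character string sits in
// the SSO buffer, and appending 1000 characters pushes it to the heap,
// so the storage always moves here; the standard only guarantees that
// the old pointer *may* be invalid, which is exactly the hazard.
bool buffer_moved_after_append() {
    std::string s = "short";
    const char* before = s.data();
    s.append(1000, 'x');   // exceeds any plausible SSO capacity
    // 'before' must no longer be dereferenced as the string's contents.
    return before != s.data();
}
```

With an immutable string this class of bug cannot exist, since no operation ever relocates a live string's storage.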
Now Suggestion: ---------------
1. Accept that there is quite a small chance that something that is not std::string would be widely accepted
So, what if the chances are small that it'd be widely accepted? That never stopped a lot of people -- heck it never stopped me.
2. Try to solve existing "string" problems by using the same std::string and adding a few things to handle it better.
Sorry, but the existing string problems are precisely because of the way std::string is designed/implemented. No if's buts about it. We can agree to disagree on this one.
Clue: take a look on what Boost.Locale does.
I did, I like how it's designed, and it solves what it solves. However I don't think an immutable string and Boost.Locale are mutually exclusive. You choose to deal with std::string while I OTOH would like to give an alternative interpretation of strings. I'll leave it at that. ;) -- Dean Michael Berris about.me/deanberris

To all discussing how to create an "ultimate" string, I'd like to remind you of the following "ultimate" strings already out there:
Who are these people discussing how to create an "ultimate" string? Oh you mean me who wants to create an "immutable" string?
I hardly called it an "ultimate" string so I think you're throwing a strawman red herring here. At any rate, I'll indulge you.
[snip all the stupid non-ultimate strings quoted]
They're all broken. Is that what you wanted me to say? :D
Then don't call this thread [boost][string] proposal and don't call it boost::string. Maybe boost::immutable_string -- which is fine, but not string. Best, Artyom

On Thu, Jan 27, 2011 at 7:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
To all discussing how to create an "ultimate" string, I'd like to remind you of the following "ultimate" strings already out there:
Who are these people discussing how to create an "ultimate" string? Oh you mean me who wants to create an "immutable" string?
I hardly called it an "ultimate" string so I think you're throwing a strawman red herring here. At any rate, I'll indulge you.
[snip all the stupid non-ultimate strings quoted]
They're all broken. Is that what you wanted me to say? :D
Then don't call this thread [boost][string] proposal and don't call it boost::string
Why not? If it's different from std::string why wouldn't boost::string be a proper name? If I think *I* can muster the courage to say "strings should be done this way" why on earth wouldn't I call that a string? :D <emphasis> If you were to ask *me* and *me alone*, of course *I* think that *my* vision for boost::string *should* be the way strings are dealt with. Of course that's ego-maniacal and self-centered of me to say so, but if I had to be explicit about it and take a position I would say exactly that: std::string is broken and it doesn't deserve to be the string implementation that C++ programmers have to use. </emphasis> So why would I not want to call it boost::string? ;)
Maybe boost::immutable_string - which is fine, but not string.
Who gave you the monopoly on what `string` should mean? :P Seriously though, the point is this: Boost has an opportunity to influence some, if not a big part of the C++ community at large. What would be the point of doing another string that everybody else has done before when there's a chance that a different take on it can be potentially better than what's already there? I mean seriously, the world has flex+yacc -- imagine if Joel thought to himself and said "well, it works, that's fine, but it's ugly and I can deal with it so... forget this funky EDSL for generating parsers in C++" then I personally would think the world would be a really sad place without Spirit. There's also the MPL, people were getting by with just runtime polymorphism and OOP goodness when some enterprising people thought about a different way of doing it and doing computations at compile time. There are lots of examples of these in the Boost libraries -- I am always surprised that the smart pointers have been written about ad nauseam by countless journalists and book authors and still the one best implementation of a shared pointer is the one in Boost. With peace and love in my heart, I HTH :) -- Dean Michael Berris about.me/deanberris

From: Dean Michael Berris <mikhailberis@gmail.com> To: boost@lists.boost.org Sent: Thu, January 27, 2011 1:22:17 PM Subject: Re: [boost] [string] proposal
On Thu, Jan 27, 2011 at 7:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
To all discussing how to create an "ultimate" string, I'd like to remind you of the following "ultimate" strings already out there:
Who are these people discussing how to create an "ultimate" string? Oh you mean me who wants to create an "immutable" string?
I hardly called it an "ultimate" string so I think you're throwing a strawman red herring here. At any rate, I'll indulge you.
[snip all the stupid non-ultimate strings quoted]
They're all broken. Is that what you wanted me to say? :D
Then don't call this thread [boost][string] proposal and don't call it boost::string
Why not? If it's different from std::string why wouldn't boost::string be a proper name?
Because, as you mentioned below, many things from Boost go to std, e.g. shared_ptr, function, bind and many others. So it shouldn't be "string", as C++ already has one.
<emphasis> [snip] I would say exactly that: std::string is broken and it doesn't deserve to be the string implementation that C++ programmers have to use. </emphasis>
You think it is broken, others:
- Some think it is fine.
- Some think its API may be improved keeping it backward compatible.
- Some think that algorithms that use string may be improved.
- And some do think it is broken.
Who gave you the monopoly on what `string` should mean? :P
Nobody gave me a monopoly; however, such a monopoly has been given to the C++ standard committee, which defined what string means in C++'s standard namespace. It is fine to have other strings, but IMHO they should not be called boost::string.
Seriously though, the point is this: Boost has an opportunity to influence some, if not a big part of the C++ community at large. What would be the point of doing another string that everybody else has done before when there's a chance that a different take on it can be potentially better than what's already there?
Different, not better, as better is a matter of taste -- I personally love std::string, especially GCC's implementation with COW.
I mean seriously, the world has flex+yacc -- imagine if Joel thought to himself and said "well, it works, that's fine, but it's ugly and I can deal with it so...
Believe me you don't want to start Spirit vs Yacc+Bison :-)
still the one best implementation of a shared pointer is the one in Boost.
Yes, and now it is in TR1! And if you want string to come to TR1 you need to either:
1. Make it fully backward compatible with std::string, or
2. Call it by a different name.
My $0.02, Artyom

On Thu, Jan 27, 2011 at 8:16 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dean Michael Berris <mikhailberis@gmail.com> To: boost@lists.boost.org Sent: Thu, January 27, 2011 1:22:17 PM Subject: Re: [boost] [string] proposal
On Thu, Jan 27, 2011 at 7:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
Then don't call this thread [boost][string] proposal and don't call it boost::string
Why not? If it's different from std::string why wouldn't boost::string be a proper name?
Because, as you mentioned below, many things from Boost go to std, e.g. shared_ptr, function, bind and many others.
So it shouldn't be "string" as C++ already has one.
By this logic, interprocess' containers shouldn't be called vector, map, list, set, unordered_set, unordered_map. That doesn't make sense.
<emphasis> [snip] I would say exactly that: std::string is broken and it doesn't deserve to be the string implementation that C++ programmers have to use. </emphasis>
You think it is broken, others:
- Some think it is fine.
- Some think its API may be improved keeping it backward compatible.
- Some think that algorithms that use string may be improved.
- And some do think it is broken.
Okay, so that's important because... ? Like I said (pretty much over and over), I see no need for a boost::string implementation to retain backward compatibility interface-wise to std::string. As in 0 need especially because it's a different string implementation period. To those who think std::string is fine, then keep using it! To those who think its API may be improved and keep it backward compatible then good luck with that. The algorithms improvement, sure we always need better algorithms. And to those who think it's broken like me, then let's do something about it.
Who gave you the monopoly on what `string` should mean? :P
Nobody gave me a monopoly, however such a monopoly was given to the C++ standard committee, which defined what string means in C++'s standard namespace.
It is fine to have other strings but IMHO they should not be called boost::string.
And IMHO std::string's current interface can be deprecated by a suitably convinced standard committee. It's like std::auto_ptr being deprecated along with the interfaces of dozens of other libraries. If boost::string is a really well implemented string that does things really really well, then I don't see why std::string can't be deprecated in favor of an arguably better but certainly different string paradigm.
Seriously though, the point is this: Boost has an opportunity to influence some, if not a big part of the C++ community at large. What would be the point of doing another string that everybody else has done before when there's a chance that a different take on it can be potentially better than what's already there?
Different, not better, as better is a matter of taste - I personally love std::string, especially GCC's implementation with COW.
To each his own then. ;)
I mean seriously, the world has flex+yacc -- imagine if Joel thought to himself and said "well, it works, that's fine, but it's ugly and I can deal with it so...
Believe me you don't want to start Spirit vs Yacc+Bison :-)
You missed the point. The point I was making was that Spirit has its place in Boost because it does something different from the norm. It was developed with one thing in mind: make defining parsers in pure C++ using an embedded DSL possible. It's a different way of doing it and whether it's better is not relevant -- that it's there and being used by people no matter how many/few is what's important. For the record though I personally think the Spirit way is the better way, but of course that's IMHO.
still the one best implementation of a shared pointer is the one in Boost.
Yes and now it is in Tr1! And if you want string to come to tr1 you need either:
1. Make it fully backward compatible with std::string. 2. Call it by a different name.
Nope, I disagree with both. std::string can be deprecated if the standards body agrees that there's cause for it to be deprecated. And the different name is frankly just unnecessary. HTH -- Dean Michael Berris about.me/deanberris

On Thu, Jan 27, 2011 at 1:52 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 8:16 PM, Artyom <artyomtnk@yahoo.com> wrote:
So it shouldn't be "string" as C++ already has one.
By this logic, interprocess' containers shouldn't be called vector, map, list, set, unordered_set, unordered_map. That doesn't make sense.
No, because there are *std*::vector & co and there are *interprocess*::vector & co and I've never heard that there was an intention to replace the former with the latter. However there *is* an intention to replace (in my view extend) std::string.
Like I said (pretty much over and over), I see no need for a boost::string implementation to retain backward compatibility interface-wise to std::string. As in 0 need especially because it's a different string implementation period.
IMO zero (as in 0) backward compatibility => zero (as in 0) chance of adoption by the standard. People (including me) who have proposed to just switch to UTF-8 have met a lot of resistance, just because that would break some things at run time. And I personally believe that most (more than 50%) use-cases of string are encoding-agnostic, so it would basically not break anything. If the standard committee somehow adopted the completely different string you propose, that would totally *W*H*A*C*K* pretty much all existing C++ code. std::string is not std::auto_ptr.
And IMHO std::string's current interface can be deprecated by a suitably convinced standard committee.
And IMHO this will happen only if you, besides the new string, have also invented a mind-control death-ray :)
It's like std::auto_ptr being deprecated along with the interfaces of dozens of other libraries. If boost::string is a really well implemented string that does things really really well, then I don't see why std::string can't be deprecated in favor of an arguably better but certainly different string paradigm.
No, std::auto_ptr is/was nowhere near std::string when considering the "frequency of usage".
1. Make it fully backward compatible with std::string. 2. Call it by a different name.
+1 BR, Matus

FYI
http://mail.python.org/pipermail/python-dev/2011-January/107641.html
BTW, "Unicode" strings in Python are IMHO a total mess:
1. They may be UTF-16 or UTF-32 depending on compilation options!?@!
2. Python switched the default string from normal to "Unicode" in version 3 and many people around are not thrilled with this.
Artyom

On Thu, Jan 27, 2011 at 9:24 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 1:52 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 8:16 PM, Artyom <artyomtnk@yahoo.com> wrote:
So it shouldn't be "string" as C++ already has one.
By this logic, interprocess' containers shouldn't be called vector, map, list, set, unordered_set, unordered_map. That doesn't make sense.
No, because there are *std*::vector & co and there are *interprocess*::vector & co and I've never heard that there was an intention to replace the former with the latter. However there *is* an intention to replace (in my view extend) std::string.
See, but then boost::string can very well be a typedef to boost::strings::string -- I don't see why calling it 'string' is such a bad thing. And the only way BTW an std::string would be replaced is if the replacement does things in a different way -- otherwise what's the point of replacing std::string if it's the same (in my assertion, broken) string?!
Like I said (pretty much over and over), I see no need for a boost::string implementation to retain backward compatibility interface-wise to std::string. As in 0 need especially because it's a different string implementation period.
IMO zero (as in 0) backward compatibility => zero (as in 0) chance of adoption by the standard.
Huh? std::auto_ptr was dropped in favor of std::unique_ptr. They didn't need to deprecate auto_ptr but they chose to do that and introduce unique_ptr in its place.
People (including me) who have proposed to just switch to UTF-8 have met a lot of resistance, just because that would break some things at run time. And I personally believe that most (more than 50%) use-cases of string are encoding-agnostic so it would not basically break anything.
Silently breaking at runtime is just unacceptable. Breaking at compile time is way better.
If the standard commission somehow adopted the completely different string you propose, that would totally *W*H*A*C*K* pretty much all existing C++ code. std::string is not std::auto_ptr.
So what's wrong with deprecation again? And what makes std::string so different from std::auto_ptr when they're both defined in the standard library? I don't see the point you're trying to make.
And IMHO std::string's current interface can be deprecated by a suitably convinced standard committee.
And IMHO this will happen only if you, besides the new string, have also invented a mind-control death-ray :)
Well that remains to be seen right? :)
It's like std::auto_ptr being deprecated along with the interfaces of dozens of other libraries. If boost::string is a really well implemented string that does things really really well, then I don't see why std::string can't be deprecated in favor of an arguably better but certainly different string paradigm.
No, std::auto_ptr is/was nowhere near std::string when considering the "frequency of usage".
Deprecation wasn't about frequency of usage. It was about fixing things that were deemed broken. -- Dean Michael Berris about.me/deanberris

On Jan 27, 2011, at 8:24 AM, Matus Chochlik wrote:
On Thu, Jan 27, 2011 at 1:52 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
[snip]
And IMHO std::string's current interface can be deprecated by a suitably convinced standard committee.
And IMHO this will happen only if you, besides the new string, have also invented a mind-control death-ray :)
LOL! Yes, I think one has to penetrate the minds of old CS/Engineering farts a little more before expressing such (hopeful?) beliefs. We are not "Ruby crazy" having no problem changing the rules at a whim, breaking old code. We care about and cherish old stuff :-)
It's like std::auto_ptr being deprecated along with the interfaces of dozens of other libraries. If boost::string is a really well implemented string that does things really really well, then I don't see why std::string can't be deprecated in favor of an arguably better but certainly different string paradigm.
No, std::auto_ptr is/was nowhere near std::string when considering the "frequency of usage".
1. Make it fully backward compatible with std::string. 2. Call it by a different name.
+1
+1 here as well /David

Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 8:16 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dean Michael Berris <mikhailberis@gmail.com> On Thu, Jan 27, 2011 at 7:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
Then don't call this thread [boost][string] proposal and don't call it boost::string
Why not? If it's different from std::string why wouldn't boost::string be a proper name? [snip] It is fine to have other strings but IMHO they should not be called boost::string.
And IMHO std::string's current interface can be deprecated by a suitably convinced standard committee.
It's like std::auto_ptr being deprecated along with the interfaces of dozens of other libraries. If boost::string is a really well implemented string that does things really really well, then I don't see why std::string can't be deprecated in favor of an arguably better but certainly different string paradigm.
That's a bad analogy. std::auto_ptr was deprecated, but not in favor of another std::auto_ptr; rather, it was replaced by std::shared_ptr. To do what you're hoping for, the committee would need to deprecate std::string and promote std::some_other_name. Thus you should create boost::some_other_name. (Artyom might accept the addition of the latter, but clearly hopes for the survival of a slightly improved std::string.)
The point I was making was that Spirit has its place in Boost because it does something different from the norm. It was developed with one thing in mind: make defining parsers in pure C++ using an embedded DSL possible.
It's a different way of doing it and whether it's better is not relevant -- that it's there and being used by people no matter how many/few is what's important. For the record though I personally think the Spirit way is the better way, but of course that's IMHO.
Notice, however, that it was not named for an existing tool. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

AMDG On 1/27/2011 6:11 AM, Stewart, Robert wrote:
It's like std::auto_ptr being deprecated along with the interfaces of dozens of other libraries. If boost::string is a really well implemented string that does things really really well, then I don't see why std::string can't be deprecated in favor of an arguably better but certainly different string paradigm.
That's a bad analogy. std::auto_ptr was deprecated, but not in favor of another std::auto_ptr but rather by std::shared_ptr.
Err, unique_ptr is closer to auto_ptr. In Christ, Steven Watanabe

From: Dean Michael Berris <mikhailberis@gmail.com> On Thu, Jan 27, 2011 at 8:16 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dean Michael Berris <mikhailberis@gmail.com> On Thu, Jan 27, 2011 at 7:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
Then don't call this thread [boost][string] proposal and don't call it boost::string
Why not? If it's different from std::string why wouldn't boost::string be a proper name?
Because as you mentioned below, many things from boost go to std, i.e. shared_ptr, function, bind and many others.
So it shouldn't be "string" as C++ already has one.
By this logic, interprocess' containers shouldn't be called vector, map, list, set, unordered_set, unordered_map. That doesn't make sense.
They are intended to remain in the interprocess namespace and have the same functionality as the standard list/set etc. This is a different case.
Who gave you the monopoly on what `string` should mean? :P
Nobody gave me a monopoly, however such a monopoly was given to the C++ standard committee, which defined what string means in C++'s standard namespace.
It is fine to have other strings but IMHO they should not be called boost::string.
And IMHO std::string's current interface can be deprecated by a suitably convinced standard committee.
It's like std::auto_ptr being deprecated along with the interfaces of dozens of other libraries. If boost::string is a really well implemented string that does things really really well, then I don't see why std::string can't be deprecated in favor of an arguably better but certainly different string paradigm.
Deprecated, not changed in backward incompatible way.
1. Make it fully backward compatible with std::string 2. Call it by different name.
Nope, I disagree with both.
std::string can be deprecated if the standards body agrees that there's cause for it to be deprecated. And the different name is frankly just unnecessary.
It may be deprecated, not removed or broken. Personally I don't think that immutability is worth breaking 95% of existing code. So just call it boost::text, boost::unicode_string, boost::immutable_string or whatever you want, but boost::string is a bad idea. Artyom

On Thu, Jan 27, 2011 at 10:37 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dean Michael Berris <mikhailberis@gmail.com>
By this logic, interprocess' containers shouldn't be called vector, map, list, set, unordered_set, unordered_map. That doesn't make sense.
They are intended to remain in the interprocess namespace and have the same functionality as the standard list/set etc.
Actually, it's going to move to Boost.Containers IIRC -- but that's largely beside the point I know.
This is a different case.
Yes I realize that now after recalling that it is after all following and somewhat backporting the C++0x interface to C++03 for the containers and allocators.
And IMHO std::string's current interface can be deprecated by a suitably convinced standard committee.
It's like std::auto_ptr being deprecated along with the interfaces of dozens of other libraries. If boost::string is a really well implemented string that does things really really well, then I don't see why std::string can't be deprecated in favor of an arguably better but certainly different string paradigm.
Deprecated, not changed in backward incompatible way.
Right. I realize that as well. :)
std::string can be deprecated if the standards body agrees that there's cause for it to be deprecated. And the different name is frankly just unnecessary.
It may be deprecated, not removed or broken. Personally I don't think that immutability is worth breaking 95% of existing code.
So just call it boost::text, boost::unicode_string, boost::immutable_string or whatever you want, but boost::string is a bad idea.
Right. Maybe `istring` would be succinct enough to convey the idea. Let me think about that a little bit more -- and if others have better ideas than `istring` I'd love to hear them. :) -- Dean Michael Berris about.me/deanberris

On Thu, Jan 27, 2011 at 1:52 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 8:16 PM, Artyom <artyomtnk@yahoo.com> wrote:
Yes and now it is in Tr1! And if you want string to come to tr1 you need either:
1. Make it fully backward compatible with std::string. 2. Call it by a different name.
Nope, I disagree with both.
std::string can be deprecated if the standards body agrees that there's cause for it to be deprecated. And the different name is frankly just unnecessary.
Hmm, is the following a dumb idea? Implement the immutable string. Implement the old interface for backward compatibility. Even things like clear(), at(...) or operator[] could be implemented by using the new semantics. Get it into the standard, then deprecate the old interface and give the users some time to adapt to the new one. Yes, things like at(...) would have terrible performance, which would "force" people to migrate to the new interface without actually forcing them (like in: their code would fail to compile if they didn't). After enough time has passed the old interface could be dropped completely. BR, Matus
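To make the idea concrete, here is a purely hypothetical sketch (the `compat_string` name and its members are illustrative only, not anything proposed in this thread): an immutable core whose legacy-compatible mutators rebuild the whole buffer, so old code still compiles but pays an obvious cost.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical transition type: immutable storage, shared between
// copies, plus legacy shims that "mutate" by rebuilding the buffer.
class compat_string {
    std::shared_ptr<const std::string> data_;  // never modified in place
public:
    compat_string() : data_(std::make_shared<const std::string>()) {}
    explicit compat_string(std::string s)
        : data_(std::make_shared<const std::string>(std::move(s))) {}

    // New-style interface: read-only access, cheap copies.
    char at(std::size_t i) const { return data_->at(i); }
    std::size_t size() const { return data_->size(); }

    // Legacy compatibility shims: each call copies the whole buffer,
    // turning amortized O(1) push_back into O(n) -- the gentle "force"
    // pushing users toward the new interface.
    void push_back(char c) {
        std::string tmp(*data_);
        tmp.push_back(c);
        data_ = std::make_shared<const std::string>(std::move(tmp));
    }
    void clear() { data_ = std::make_shared<const std::string>(); }
};
```

Of course, as noted below, the standard's complexity guarantees make such shims non-conforming; this only illustrates the migration mechanics.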

On 27 January 2011 10:34, Matus Chochlik <chochlik@gmail.com> wrote:
Hmm, is the following a dumb idea?
Implement the old interface for backward compatibility.
Can't. Run-time complexity guarantees are part of the interface, as far as the standard is concerned. I'd like to see this broken up into three discussions:
1. Immutable strings.
2. utf8 strings.
3. Unrealistic pipe dream about replacing std::string.
-- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

I don't want to rain on anybody's parade or bike-shed convention but, it is not like we lack such code: it is ready in the SOC09 folder. Mathias Gaunard did extensive work on the encoding problem: https://svn.boost.org/svn/boost/sandbox/SOC/2009/unicode/

On Fri, Jan 28, 2011 at 4:27 AM, Joel Falcou <joel.falcou@lri.fr> wrote:
I don't want to rain on anybody's parade or bike-shed convention but, it is not like we lack such code: it is ready in the SOC09 folder. Mathias Gaunard did extensive work on the encoding problem:
+1 This already deals with std::string as it is. And when we do have that immutable string that allows for exposing ranges, then I think these would be golden as well. -- Dean Michael Berris about.me/deanberris

On Thu, 27 Jan 2011 19:22:17 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
<emphasis> If you were to ask *me* and *me alone*, of course *I* think that *my* vision for boost::string *should* be the way strings are dealt with. Of course that's ego-maniacal and self-centered of me to say so, but if I had to be explicit about it and take a position I would say exactly that: std::string is broken and it doesn't deserve to be the string implementation that C++ programmers have to use. </emphasis>
Question: if you replaced std::string with your immutable string, how would you build strings one character at a time for it? std::back_inserter wouldn't be possible. A large number of current uses for std::string require that, all the way up to std::copy. Building them in an array or vector<char> would be less efficient due to an extra copy. I'm not objecting to the basic idea -- you made an excellent case for it in the message this is a reply to, and it convinced me. I just can't see any way that it could replace mutable strings, as you're asserting.
So why would I not want to call it boost::string? ;)
Because it isn't a string, in the accepted C++ sense? :-) -- Chad Nelson Oak Circle Software, Inc. * * *

On Thu, Jan 27, 2011 at 10:05 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Thu, 27 Jan 2011 19:22:17 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
<emphasis> If you were to ask *me* and *me alone*, of course *I* think that *my* vision for boost::string *should* be the way strings are dealt with. Of course that's ego-maniacal and self-centered of me to say so, but if I had to be explicit about it and take a position I would say exactly that: std::string is broken and it doesn't deserve to be the string implementation that C++ programmers have to use. </emphasis>
Question: if you replaced std::string with your immutable string, how would you build strings one character at a time for it?
Good question. See answer below.
std::back_inserter wouldn't be possible. A large number of current uses for std::string require that, all the way up to std::copy. Building them in an array or vector<char> would be less efficient due to an extra copy.
Actually, what std::copy requires is an iterator to model the OutputIterator concept. This means, std::back_inserter would somehow be analogous to std::ostream_iterator<>, and while we're at it, you build strings with a "stream" instead of modifying an already created string. :)
I'm not objecting to the basic idea -- you made an excellent case for it in the message this is a reply to, and it convinced me. I just can't see any way that it could replace mutable strings, as you're asserting.
The first step is to think about a string as something that, once constructed and is considered "live", would not need to be modified. Then the question becomes "so how do we build strings from other strings" and there are two possible answers: 1) build it using the ostringstream model 2) concatenate strings to build new strings. Either approach yields different performance characteristics and you end up with having clearly defined semantics for constructing strings.
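A rough, hypothetical sketch of the two models (illustrative names only, nothing final): in the "ostringstream model" all mutation happens inside a builder that is frozen exactly once, while concatenation derives new immutable strings from existing ones.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Illustrative immutable string: storage is shared and never mutated.
struct immutable_string {
    std::shared_ptr<const std::string> data;
    std::size_t size() const { return data ? data->size() : 0; }

    // Model (2): concatenation builds a brand-new string; both
    // operands remain valid and unchanged.
    friend immutable_string operator+(const immutable_string& a,
                                      const immutable_string& b) {
        auto s = std::make_shared<std::string>(*a.data + *b.data);
        return {std::move(s)};
    }
};

// Model (1): the "ostringstream model" -- mutation is confined to the
// builder; freeze() hands ownership of the buffer to the result.
class string_builder {
    std::string buf_;
public:
    string_builder& append(char c) { buf_.push_back(c); return *this; }
    immutable_string freeze() {
        return {std::make_shared<const std::string>(std::move(buf_))};
    }
};
```

Here concatenation copies for simplicity; a real design could share the underlying storage instead, which is exactly where the performance characteristics of the two models diverge.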
So why would I not want to call it boost::string? ;)
Because it isn't a string, in the accepted C++ sense? :-)
Well one thing is certain: I suck at names. So if there's a suitable name that better models my notion of a string, then I'm open to suggestions. Although I maintain, what *I* think an immutable string representation is what will make std::string's brokenness way more obvious. As much as I would like to call it just `boost::string` I may be in the minority on this point so I'm willing to be convinced of using a different name. :) -- Dean Michael Berris about.me/deanberris

On Thu, 27 Jan 2011 22:16:14 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
Question: if you replaced std::string with your immutable string, how would you build strings one character at a time for it?
[...] Actually, what std::copy requires is an iterator to model the OutputIterator concept. This means, std::back_inserter would somehow be analogous to std::ostream_iterator<>, and while we're at it, you build strings with a "stream" instead of modifying an already created string. :)
In other words, essentially the same as using a mutable buffer, then copying the data to an immutable string. I'll be interested in the code you propose for it.
So why would I not want to call it boost::string? ;)
Because it isn't a string, in the accepted C++ sense? :-)
Well one thing is certain: I suck at names. So if there's a suitable name that better models my notion of a string, then I'm open to suggestions. [...]
The "istring" you mentioned in a later message is good enough for this discussion, though it might not be descriptive enough for a final name. -- Chad Nelson Oak Circle Software, Inc. * * *

On Thu, Jan 27, 2011 at 10:48 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Thu, 27 Jan 2011 22:16:14 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
Question: if you replaced std::string with your immutable string, how would you build strings one character at a time for it?
[...] Actually, what std::copy requires is an iterator to model the OutputIterator concept. This means, std::back_inserter would somehow be analogous to std::ostream_iterator<>, and while we're at it, you build strings with a "stream" instead of modifying an already created string. :)
In other words, essentially the same as using a mutable buffer, then copying the data to an immutable string. I'll be interested in the code you propose for it.
Actually, it won't be a copy. :) If the mutable buffer was a sequence of reference-counted fixed-sized blocks, then you can imagine building an istring that just referred to these reference-counted fixed-sized blocks. The other way is to just move the ownership of the blocks to the built istring. Your "stream" interface would pretty much use the same interface as the output stream operations, and underneath it can benefit from discontiguous chunks of memory used as buffers.
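A minimal sketch of that scheme, under the stated assumptions (the `block_buffer`/`istring` names and the block size are hypothetical): freezing transfers the reference-counted blocks themselves, so no character data is ever copied.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

constexpr std::size_t block_size = 64;  // illustrative, not a proposal
using block = std::array<char, block_size>;
using block_ptr = std::shared_ptr<block>;

// Immutable result: holds the same blocks the buffer filled in.
struct istring {
    std::vector<block_ptr> blocks;
    std::size_t length;
    char operator[](std::size_t i) const {
        return (*blocks[i / block_size])[i % block_size];
    }
};

// Mutable staging area built from discontiguous fixed-size blocks.
class block_buffer {
    std::vector<block_ptr> blocks_;
    std::size_t length_ = 0;
public:
    void push_back(char c) {
        if (length_ % block_size == 0)          // current block is full
            blocks_.push_back(std::make_shared<block>());
        (*blocks_.back())[length_ % block_size] = c;
        ++length_;
    }
    // Freezing moves only the block *pointers*; the character data
    // stays where it is -- no copy.
    istring freeze() { return {std::move(blocks_), length_}; }
};
```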
So why would I not want to call it boost::string? ;)
Because it isn't a string, in the accepted C++ sense? :-)
Well one thing is certain: I suck at names. So if there's a suitable name that better models my notion of a string, then I'm open to suggestions. [...]
The "istring" you mentioned in a later message is good enough for this discussion, though it might not be descriptive enough for a final name.
I agree. Help wanted on a better name for the immutable string. :) -- Dean Michael Berris about.me/deanberris

On Thu, 27 Jan 2011 22:56:13 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
[...] Actually, what std::copy requires is an iterator to model the OutputIterator concept. This means, std::back_inserter would somehow be analogous to std::ostream_iterator<>, and while we're at it, you build strings with a "stream" instead of modifying an already created string. :)
In other words, essentially the same as using a mutable buffer, then copying the data to an immutable string. I'll be interested in the code you propose for it.
Actually, it won't be a copy. :)
If the mutable buffer was a sequence of reference-counted fixed-sized blocks, then you can imagine building an istring that just referred to these reference-counted fixed-sized blocks. [...]
Ahh... I'm starting to warm to the idea. You've obviously put some thought into it.
The "istring" you mentioned in a later message is good enough for this discussion, though it might not be descriptive enough for a final name.
I agree. Help wanted on a better name for the immutable string. :)
What about immutable::string? You'll probably need a namespace to put the supporting code into anyway. -- Chad Nelson Oak Circle Software, Inc. * * *

On Thu, Jan 27, 2011 at 3:56 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 10:48 PM, Chad Nelson [snip/]
The "istring" you mentioned in a later message is good enough for this discussion, though it might not be descriptive enough for a final name.
I agree. Help wanted on a better name for the immutable string. :)
Half-jokingly, but: what about 'chain' ? :) Matus

On Fri, Jan 28, 2011 at 5:58 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 3:56 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 10:48 PM, Chad Nelson [snip/]
The "istring" you mentioned in a later message is good enough for this discussion, though it might not be descriptive enough for a final name.
I agree. Help wanted on a better name for the immutable string. :)
Half-jokingly, but: what about 'chain' ? :)
LOL -- chain I like. It's one less character than `string`. It involves immutability, concatenation as a means of building it, and it's usually segmented. Brilliant I say. Any other ideas? If not then I like the sound of Boost.Chains :D -- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote:
On Fri, Jan 28, 2011 at 5:58 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 3:56 PM, Dean Michael Berris
Help wanted on a better name for the immutable string. :)
Half-jokingly, but: what about 'chain' ? :)
LOL -- chain I like. It's one less character than `string`. It involves immutability, concatenation as a means of building it, and it's usually segmented.
+1 The name is unlike any other container names I've heard of, doesn't connote text, and does convey useful associations like segmentation and composition. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

On Fri, Jan 28, 2011 at 11:58, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 3:56 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 10:48 PM, Chad Nelson [snip/]
The "istring" you mentioned in a later message is good enough for this discussion, though it might not be descriptive enough for a final name.
I agree. Help wanted on a better name for the immutable string. :)
Half-jokingly, but: what about 'chain' ? :)
Why not just call it boost::rope then? Its implementation will be a rope (almost certainly) and nobody here has to remain compatible with SGI.
-- Yakov

On Fri, Jan 28, 2011 at 11:20 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Fri, Jan 28, 2011 at 11:58, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 3:56 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 10:48 PM, Chad Nelson [snip/]
The "istring" you mentioned in a later message is good enough for this discussion, though it might not be descriptive enough for a final name.
I agree. Help wanted on a better name for the immutable string. :)
Half-jokingly, but: what about 'chain' ? :)
Why not just call it boost::rope then? Its implementation will be a rope (almost certainly) and nobody here has to remain compatible with SGI.
I believe that some people might object to that. The original rope has a different interface than Dean proposes. Matus

On Fri, Jan 28, 2011 at 6:29 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Jan 28, 2011 at 11:20 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
Why not just call it boost::rope then? Its implementation will be a rope
(almost certainly) and nobody here has to remain compatible with SGI.
I believe that some people might object to that. The original rope has a different interface than Dean proposes.
Rope is not bad, it's actually shorter than 'chain'. I don't remember though whether the SGI rope implementation implied immutability. I know it uses concatenation trees of immutable substrings and substring operations rely on maintaining a reference to the original string. Somehow though I think it exposes a similar API to std::string and thus had to compete with it in terms of "mutability" by creating new chunks copying the modified chunks into new storage (and re-creating the concatenation tree). It also had some balancing algorithms implemented which I think could be addressed in a different manner. So the interface I was thinking about (and suggesting) is a lot more minimal than what rope or std::string have exposed. I think when I do finish that design document (with rationale) it would be clear why I would like to keep it immutable and why I would prefer it still be called a string. Let me finish that document -- expect something over the weekend. :) -- Dean Michael Berris about.me/deanberris
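For readers unfamiliar with SGI's rope, a toy concatenation-tree sketch (illustrative only; the real SGI implementation is far more elaborate, with rebalancing and lazy substring nodes): leaves hold immutable text, and concatenation allocates a single node sharing both subtrees, making it O(1) regardless of string length.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Toy concatenation tree: a node is either a leaf (text) or an
// internal node joining two shared, immutable subtrees.
struct rope_node {
    std::shared_ptr<const rope_node> left, right;  // null for leaves
    std::string leaf;        // used only when left/right are null
    std::size_t length;
};
using rope = std::shared_ptr<const rope_node>;

inline rope make_leaf(std::string s) {
    auto len = s.size();
    return std::make_shared<const rope_node>(
        rope_node{nullptr, nullptr, std::move(s), len});
}

// O(1) concatenation: one new node, both operands shared, not copied.
inline rope concat(rope a, rope b) {
    auto len = a->length + b->length;
    return std::make_shared<const rope_node>(rope_node{a, b, {}, len});
}

// Indexing walks the tree, descending left or right by length.
inline char char_at(const rope& r, std::size_t i) {
    if (!r->left) return r->leaf[i];
    return i < r->left->length ? char_at(r->left, i)
                               : char_at(r->right, i - r->left->length);
}
```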

On Fri, Jan 28, 2011 at 7:20 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
So the interface I was thinking about (and suggesting) is a lot more minimal than what rope or std::string have exposed. I think when I do finish that design document (with rationale) it would be clear why I would like to keep it immutable and why I would prefer it still be called a string.
Let me finish that document -- expect something over the weekend. :)
And I stopped before I write too much -- the initial version is already up: https://github.com/downloads/mikhailberis/cpp-string-theory/cpp-string-theor... -- I'll give it more information and the actual interfaces and implementation as soon as I get some Z's. :) -- Dean Michael Berris about.me/deanberris

C++ String Theory? :-) I like it, perhaps it could transcend scientific fields, this theory. /David

On Jan 28, 2011, at 12:59 PM, Dean Michael Berris wrote:
On Fri, Jan 28, 2011 at 7:20 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
So the interface I was thinking about (and suggesting) is a lot more minimal than what rope or std::string have exposed. I think when I do finish that design document (with rationale) it would be clear why I would like to keep it immutable and why I would prefer it still be called a string.
Let me finish that document -- expect something over the weekend. :)
And I stopped before I write too much -- the initial version is already up: https://github.com/downloads/mikhailberis/cpp-string-theory/cpp-string-theor... -- I'll give it more information and the actual interfaces and implementation as soon as I get some Z's. :)
-- Dean Michael Berris about.me/deanberris

From: Dean Michael Berris <mikhailberis@gmail.com>

On Fri, Jan 28, 2011 at 7:20 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
So the interface I was thinking about (and suggesting) is a lot more minimal than what rope or std::string have exposed. I think when I do finish that design document (with rationale) it would be clear why I would like to keep it immutable and why I would prefer it still be called a string.
Let me finish that document -- expect something over the weekend. :)
And I stopped before I write too much -- the initial version is already up: https://github.com/downloads/mikhailberis/cpp-string-theory/cpp-string-theor...
-- I'll give it more information and the actual interfaces and implementation as soon as I get some Z's. :)
I'm sorry but this "document" full of mistakes and misses serious points:

1. "Contiguity"

Continuity and c_str() is one of the most important properties of C++ string (that is BTW required by C++0x)

Reason: c_str() is a boundary to almost every library existing in C++ and C world. So removing this "bad" feature makes it useless for vast majority of string users.

Note: all strings around in all languages are continuous for the reasons.

2. Efficiency - have you forgotten about std::string::reserve?

3. non-uniform-memory-architecture

Give me a break... Who uses NUMA for string processing?!

4. About string builder. Most languages require is as they don't have "reserve" also if you want efficient string builder use std::ostream with nice stream buffer. Don't copy paradigms that do not belong to C++!

5. Makeing all operations lazy you bring more segmentation to memory as it is not recycled, also it reduces performance as does not have "liner" location in cache/

6. Encoding is extrinsic to strings ?!?!?! All the discussion in started because we need UTF-8 in strings now we are back to the beginning?

This is classic example of how trying to do something "cool" gives us theoretically interesting and cool things that are useless in real world where simple and straight forward things actually work a way better. So you are welcome to propose overcomplicated interface that tries to optimize some corner cases and finally makes it useless.

Sorry, But what you had written has nothing to do to reality. SGI had ropes... Where they are today?

Artyom

On Sat, Jan 29, 2011 at 4:26 AM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dean Michael Berris <mikhailberis@gmail.com>

On Fri, Jan 28, 2011 at 7:20 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
And I stopped before I write too much -- the initial version is already up: https://github.com/downloads/mikhailberis/cpp-string-theory/cpp-string-theor...
-- I'll give it more information and the actual interfaces and implementation as soon as I get some Z's. :)
I'm sorry but this "document" full of mistakes and misses serious points:
1. "Contiguity"
Continuity and c_str() is one of the most important properties of C++ string (that is BTW required by C++0x)
It is the single most problematic feature bar none.
Reason: c_str() is a boundary to almost every library existing in C++ and C world. So removing this "bad" feature makes it useless for vast majority of string users.
Note: all strings around in all languages are continuous for the reasons.
You didn't read the whole thing. I present an algorithm to linearize a string. What else do you need?
2. Efficiency - have you forgotten about std::string::reserve?
It's still contiguous. And then when it grows past the reserved size, explain to me what happens.
3. non-uniform-memory-architecture
Give me a break... Who uses NUMA for string processing?!
If you're using a modern Intel Core i5/i7, you're using NUMA for *everything*. Xeon 5400's are NUMA. AMD with HyperTransport is NUMA. NUMA is an architecture and if you're running your programs on a NUMA machine, well, you're using NUMA. Google NUMA and see what I mean.
4. About string builder. Most languages require is as they don't have "reserve" also if you want efficient string builder use std::ostream with nice stream buffer.
Don't copy paradigms that do not belong to C++!
Who are you to say what paradigms don't belong to C++? And std::ostream is not a string builder -- std::ostringstream is the string builder as it is now, but it uses a string buffer, which is also contiguous and has the same problems as std::string.
5. Makeing all operations lazy you bring more segmentation to memory as it is not recycled, also it reduces performance as does not have "liner" location in cache/
Are you F'n kidding me? Lazy operations don't bring more segmentation; they delay the application of operations until the data is required.
6. Encoding is extrinsic to strings
?!?!?!
All the discussion in started because we need UTF-8 in strings now we are back to the beginning?
No, the discussion started because we need a UTF-8 view of data. You missed the point I was making. And you didn't understand the document I wrote. It's obvious you have an idea of what a string should be and I have a different one. So I don't see any point in trying to convince you otherwise when you've already made up your mind that std::string is fine when the whole point of my document builds around why std::string is broken.
This is classic example of how trying to do something "cool" gives us theoretically interesting and cool things that are useless in real world where simple and straight forward things actually work a way better.
Simple and straight forward translates to naive and inefficient most of the time. This isn't meant to be "cool" it was meant to address a problem that has already been identified repeatedly.
So you are welcome to propose overcomplicated interface that tries to optimize some corner cases and finally makes it useless.
Thank you. I shall try to prove you wrong then -- or better yet, guess what, I don't really care what *you* think because I don't think I need to convince *you* that std::string is broken when you obviously think it's fine.
Sorry,
No need to apologize, it's apparent you missed the whole point anyway.
But what you had written has nothing to do to reality. SGI had ropes... Where they are today?
Ropes are getting into TR2. Read the same document referred to with regards to COW with strings becoming non-standard compliant in C++0x.

-- Dean Michael Berris about.me/deanberris

On Fri, Jan 28, 2011 at 9:46 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Sat, Jan 29, 2011 at 4:26 AM, Artyom <artyomtnk@yahoo.com> wrote: [snip/]
6. Encoding is extrinsic to strings
?!?!?!
All the discussion in started because we need UTF-8 in strings now we are back to the beginning?
No, the discussion started because we need a UTF-8 view of data. You missed the point I was making. And you didn't understand the document I wrote.
Sorry, but no. The discussion started by the proposal that we should by default treat std::strings as if they were UTF-8 encoded. Artyom should know because he was the one who did the original proposal. The whole 'view' idea was brought up only much later.

[snip/]

Matus

Matus Chochlik wrote:
On Fri, Jan 28, 2011 at 9:46 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Sat, Jan 29, 2011 at 4:26 AM, Artyom <artyomtnk@yahoo.com> wrote:
6. Encoding is extrinsic to strings
?!?!?!
All the discussion in started because we need UTF-8 in strings now we are back to the beginning?
No, the discussion started because we need a UTF-8 view of data. You missed the point I was making. And you didn't understand the document I wrote.
Sorry, but no. The discussion started by the proposal that we should by default treat std::strings as if they were UTF-8 encoded. Artyom should know because he was the one who did the original proposal. The whole 'view' idea was brought up only much later.
Yes, the discussion began with that proposal, but Dean suggested steering things another way. Two different ways of viewing (no pun intended) the same problem.

_____
Rob Stewart robert.stewart@sig.com
Software Engineer, Core Software using std::disclaimer;
Susquehanna International Group, LLP http://www.sig.com

IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

On Sat, Jan 29, 2011 at 5:13 AM, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Jan 28, 2011 at 9:46 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Sat, Jan 29, 2011 at 4:26 AM, Artyom <artyomtnk@yahoo.com> wrote: [snip/]
6. Encoding is extrinsic to strings
?!?!?!
All the discussion in started because we need UTF-8 in strings now we are back to the beginning?
No, the discussion started because we need a UTF-8 view of data. You missed the point I was making. And you didn't understand the document I wrote.
Sorry, but no. The discussion started by the proposal that we should by default treat std::strings as if they were UTF-8 encoded. Artyom should know because he was the one who did the original proposal. The whole 'view' idea was brought up only much later.
And the point I was making was that, doing precisely this was the "wrong" way of doing it. Assuming a default encoding is "unnecessary" as an encoding is largely a matter of interpretation of data ultimately.

I was attempting to solve the problem that is std::string. In the process I'm moving the issue away from the underlying data and moving it to a matter of interpretation. To do that in a manner that would make sense as how I see it, that means moving it into a view of the data that is held in a string. The string would be the data structure, the view an interpretation of it.

I never precluded that the string can hold UTF-8 encoded data, but saying that is the default achieves nothing and is ultimately unnecessary. In the design I've been proposing the point of the matter is, interpreting data in a given encoding is separate from how the data is actually stored. Now let's say you have a UTF-8 string builder, what else would that write in memory aside from UTF-8 encoded data? It will though still yield a string, which could be interpreted many different ways -- I just don't see the encoding as something intrinsic to the string. That means a string can hold UTF-8 encoded data and I can wrap that in a view for UTF-16 and see that it will not validate correctly -- unless I wrap the string with a view for UTF-8 first then pass that into a view for UTF-16 and transcoding can happen on the fly.

Writing algorithms that deal with strings is different from writing algorithms that deal with encoded text. That's two different levels.

This explaining, and trying to explain again, the whole point of the matter makes me sound like a broken record. If you still don't get what I'm saying then I guess I'm going to have to try a different route and just show what I mean in terms of code at some point in time.

HTH

-- Dean Michael Berris about.me/deanberris

On Fri, Jan 28, 2011 at 10:31 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Sat, Jan 29, 2011 at 5:13 AM, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Jan 28, 2011 at 9:46 PM, Dean Michael Berris
All the discussion in started because we need UTF-8 in strings now we are back to the beginning?
No, the discussion started because we need a UTF-8 view of data. You missed the point I was making. And you didn't understand the document I wrote.
Sorry, but no. The discussion started by the proposal that we should by default treat std::strings as if they were UTF-8 encoded. Artyom should know because he was the one who did the original proposal. The whole 'view' idea was brought up only much later.
And the point I was making was that, doing precisely this was the "wrong" way of doing it. Assuming a default encoding is "unnecessary" as an encoding is largely a matter of interpretation of data ultimately.
I was attempting to solve the problem that is std::string. In the process I'm moving the issue away from the underlying data and moving it to a matter of interpretation. To do that in a manner that would make sense as how I see it, that means moving it into a view of the data that is held in a string. The string would be the data structure, the view an interpretation of it.
I never precluded that the string can hold UTF-8 encoded data, but saying that is the default achieves nothing and is ultimately unnecessary. In the design I've been proposing the point of the matter is, interpreting data in a given encoding is separate from how the data is actually stored. Now let's say you have a UTF-8 string builder, what else would that write in memory aside from UTF-8 encoded data? It will though still yield a string, which could be interpreted many different ways -- I just don't see the encoding as something intrinsic to the string. That means a string can hold UTF-8 encoded data and I can wrap that in a view for UTF-16 and see that it will not validate correctly -- unless I wrap the string with a view for UTF-8 first then pass that into a view for UTF-16 and transcoding can happen on the fly.
Writing algorithms that deal with strings, is different from writing algorithms that deal with encoded text. That's two different levels.
This explaining, and trying to explain again, the whole point of the matter makes me sound like a broken record. If you still don't get what I'm saying then I guess I'm going to have to try a different route and just show what I mean in terms of code at some point in time.
Dean, believe me, I got what you said the first time you said it, like 200 posts ago. I know that the string data is ultimately stored in the memory as a sequence of bytes. But then you proposed to solve my problem by suggesting the view<Encoding> template. Then like 50 posts ago we finally agreed on typedef-ing and naming it 'text' since using something called view<encoding_tag> is not acceptable for me.

Now, if this

typedef view<utf8_encoding_tag> text;

is the only line of code where I see the encoding and I'll be able to do all the text handling, i.e.: searching for code points/characters (not only bytes), searching for words, concatenation, splitting, writing it into a file, socket, etc. and reading it from file, socket, etc., using it with some c_str-like adapter with C APIs, etc., basically doing (nearly) everything that I was able to do with std::string *without* ever mentioning the encoding again, then you already have me convinced. If I cannot do those things without specifying the encoding (unless necessary) then this is useless for me for text handling.

Peace, Love, Best regards,

Matus

On 01/28/2011 12:46 PM, Dean Michael Berris wrote:
... elision by patrick ... Ropes are getting into TR2. Read the same document referred to with regards to COW with strings becoming non-standard compliant in C++0x.
I think you're referring to n2668 from 2008, which concluded with:

Recovering the Loss

The largest potential loss in performance due to a switch away from copy-on-write implementations is the increased consumption of memory for applications with very large read-mostly strings. However, we believe that for those applications ropes are a better technical solution, and recommend a rope proposal be considered for inclusion in Library TR2.

That's not the same as saying that it will be. We don't know yet what will be in TR2 as far as I know. If you go to the working group's page: http://www.open-std.org/jtc1/sc22/wg21/ you see a link to the draft of TR1, but nothing for TR2 yet.

Patrick

Artyom wrote:
From: Dean Michael Berris <mikhailberis@gmail.com>
And I stopped before I write too much -- the initial version is already up: <https://github.com/downloads/mikhailberis/cpp-string-theory/cpp-string-theory.pdf>
I'm sorry but this "document" full of mistakes and misses serious points:
1. "Contiguity"
Continuity and c_str() is one of the most important properties of C++ string (that is BTW required by C++0x)
"Contiguity" is the correct word. "Continuous" is not.
Reason: c_str() is a boundary to almost every library existing in C++ and C world. So removing this "bad" feature makes it useless for vast majority of string users.
Dean must address this use case in the document, but he has suggested ways to accomplish it in this thread.
2. Efficiency - have you forgotten about std::string::reserve?
That isn't automatic and many don't know to use it.
4. About string builder. Most languages require is as they don't have "reserve" also if you want efficient string builder use std::ostream with nice stream buffer.
Don't copy paradigms that do not belong to C++!
Just because it hasn't been used in C++ doesn't mean it doesn't make sense.
5. Makeing all operations lazy you bring more segmentation to memory as it is not recycled, also it reduces performance as does not have "liner" location in cache/
The lazy approach means that until the result of the operations are needed, very little work needs to be done. When the result is needed, many operations can be combined into fewer net operations. The result is faster processing and better memory/cache usage.
6. Encoding is extrinsic to strings
?!?!?!
All the discussion in started because we need UTF-8 in strings now we are back to the beginning?
This is disingenuous. Dean has advocated that the encoding should be separate from the storage all along. In that sense, yes, he's going back to the beginning. Just because you think he should have been convinced by now that your view is correct doesn't mean that he's wrong.
This is classic example of how trying to do something "cool" gives us theoretically interesting and cool things that are useless in real world where simple and straight forward things actually work a way better.
Simple and straightforward can always be implemented atop the "cool" if that's your only problem.

_____
Rob Stewart robert.stewart@sig.com
Software Engineer, Core Software using std::disclaimer;
Susquehanna International Group, LLP http://www.sig.com

At Fri, 28 Jan 2011 12:26:59 -0800 (PST), Artyom wrote:
I'm sorry but this "document" full of mistakes and misses serious points:
I don't think sarcasm is warranted here. It most certainly is a document.
1. "Contiguity"
Continuity and c_str() is one of the most important properties of C++ string (that is BTW required by C++0x)
Reason: c_str() is a boundary to almost every library existing in C++ and C world. So removing this "bad" feature makes it useless for vast majority of string users.
Note: all strings around in all languages are continuous for the reasons.
Eliminating c_str() doesn't mean there's no easy way to produce a contiguous NTBS.
3. non-uniform-memory-architecture
Give me a break... Who uses NUMA for string processing?!
Anyone running on a multiprocessor system with AMD Opterons or Intel Nehalem or Tukwila processors. You don't always get to choose the kind of architecture your code will run on, and those systems are all NUMA. But even when you do get to choose, some very large problems that would be appropriate to NUMA involve lots of strings.
4. About string builder. Most languages require is as they don't have "reserve" also if you want efficient string builder use std::ostream with nice stream buffer.
There's nothing efficient about std::ostream, no matter what buffer you put on it.
Don't copy paradigms that do not belong to C++!
So we have nothing to learn from other languages?
6. Encoding is extrinsic to strings
?!?!?!
All the discussion in started because we need UTF-8 in strings now we are back to the beginning?
This, I believe, is a persistent misunderstanding. IIUC, Dean is only suggesting to avoid giving UTF-8 any special status in the string's interface. He's not arguing against using UTF-8 storage in the implementation.
This is classic example of how trying to do something "cool" gives us theoretically interesting and cool things that are useless in real world where simple and straight forward things actually work a way better.
That may ultimately turn out to be true, but your reaction here seems so over-the-top and premature as to make that conclusion very unconvincing.

-- Dave Abrahams
BoostPro Computing
http://www.boostpro.com

On 01/28/2011 01:01 PM, Dave Abrahams wrote:
At Fri, 28 Jan 2011 12:26:59 -0800 (PST), Artyom wrote:
Give me a break... Who uses NUMA for string processing?!
Anyone running on a multiprocessor system with AMD Opterons or Intel Nehalem or Tukwila processors. You don't always get to choose the kind of architecture your code will run on, and those systems are all NUMA. But even when you do get to choose, some very large problems that would be appropriate to NUMA involve lots of strings.
What hasn't been explained is how NUMA is relevant to the design of this string type. As far as I understand, the main consideration required for NUMA systems is to try to have each processor use the portion of memory that it can access most efficiently.

Potentially this suggests that if a large string is to be shared read-only between two threads running on different NUMA nodes, it might be advantageous for it to be copied into memory that can be accessed quickly by the second thread (assuming it is initially in memory optimal for the first thread). Thus, it seems it would be useful for a string/buffer type that is by default reference-counted/COW to also support explicit copying. This could also be useful for a reference-counted string type with a compile-time policy of using non-locking, non-atomic operations to handle the reference counting (for use only by a single thread), as atomic operations often incur significant overhead compared to a non-atomic operation.

However, as far as whether to use contiguous virtual memory or break the string up into non-contiguous chunks, I don't see the relevance.

1. "Contiguity"
Continuity and c_str() is one of the most important properties of C++ string (that is BTW required by C++0x)
Eliminating c_str() doesn't mean there's no easy way to produce a contiguous NTBS.
Yes, just it can't be really "char const *c_str() CONST" or would require extra stuff like linearization. It would turn away 90% of users.
3. non-uniform-memory-architecture
Give me a break... Who uses NUMA for string processing?!
Anyone running on a multiprocessor system with AMD Opterons or Intel Nehalem or Tukwila processors. You don't always get to choose the kind of architecture your code will run on, and those systems are all NUMA. But even when you do get to choose, some very large problems that would be appropriate to NUMA involve lots of strings.
1. The locality of cache or private processor cache does not makes them "NUMA" 2. In such case it would be even better to have non-shared strings
4. About string builder. Most languages require is as they don't have "reserve" also if you want efficient string builder use std::ostream with nice stream buffer.
There's nothing efficient about std::ostream, no matter what buffer you put on it.
I beg your pardon? It is efficient as all functions are as efficient as memcpy with exceptions of overflow/underflow happens which require some virtual functions calls which are pretty fast as well... Also 99% of issues are just solved with reserve. (and I work with text parsing, combining and processing a lot)
6. Encoding is extrinsic to strings
?!?!?!
All the discussion in started because we need UTF-8 in strings now we are back to the beginning?
This, I believe, is a persistent misunderstanding. IIUC, Dean is only suggesting to avoid giving UTF-8 any special status in the string's interface. He's not arguing against using UTF-8 storage in the implementation.
The entire "buzz" started with the fact that under windows we have problems with string encoding not being UTF-8
This is classic example of how trying to do something "cool" gives us theoretically interesting and cool things that are useless in real world where simple and straight forward things actually work a way better.
That may ultimately turn out to be true, but your reaction here seems so over-the-top and premature as to make that conclusion very unconvincing.
This article written from wrong understanding of real problems - instead of solving a problem it suggests some idea for some cases not looking to the problem in hole. Starting from "std::string is broken" statement... Artyom

On Sat, Jan 29, 2011 at 3:02 PM, Artyom <artyomtnk@yahoo.com> wrote:
1. "Contiguity"
Continuity and c_str() is one of the most important properties of C++ string (that is BTW required by C++0x)
Eliminating c_str() doesn't mean there's no easy way to produce a contiguous NTBS.
Yes, just it can't be really "char const *c_str() CONST" or would require extra stuff like linearization.
It would turn away 90% of users.
It might turn you away because you obviously love std::string. Generalizing is a different matter and is largely a hot-air blowing exercise that is futile for convincing anybody.
3. non-uniform-memory-architecture
Give me a break... Who uses NUMA for string processing?!
Anyone running on a multiprocessor system with AMD Opterons or Intel Nehalem or Tukwila processors. You don't always get to choose the kind of architecture your code will run on, and those systems are all NUMA. But even when you do get to choose, some very large problems that would be appropriate to NUMA involve lots of strings.
1. The locality of cache or private processor cache does not makes them "NUMA"
What makes an architecture NUMA is when each CPU manages memory by embedding the memory manager in the CPU. By not having a single memory controller in the system, effectively a CPU's access to the whole available memory is non-uniform because it will have faster access to some memory while having to go through other CPUs if it needs to access memory that's controlled by another CPU. It looks like you don't know what NUMA is from what you're saying.
2. In such case it would be even better to have non-shared strings
Weh?
4. About string builder. Most languages require is as they don't have "reserve" also if you want efficient string builder use std::ostream with nice stream buffer.
There's nothing efficient about std::ostream, no matter what buffer you put on it.
I beg your pardon? It is efficient as all functions are as efficient as memcpy with exceptions of overflow/underflow happens which require some virtual functions calls which are pretty fast as well...
Also 99% of issues are just solved with reserve. (and I work with text parsing, combining and processing a lot)
And you obviously don't work with systems that have to do this multiple thousand times in one second to not know what the effects of NUMA are and why allocating a contiguous amount of memory is the performance killer that it is.
This, I believe, is a persistent misunderstanding. IIUC, Dean is only suggesting to avoid giving UTF-8 any special status in the string's interface. He's not arguing against using UTF-8 storage in the implementation.
The entire "buzz" started with the fact that under windows we have problems with string encoding not being UTF-8
No, the entire buzz started when people like you suggested treating std::string as UTF-8 by default which I have maintained already is largely unnecessary from a design perspective.
This is classic example of how trying to do something "cool" gives us theoretically interesting and cool things that are useless in real world where simple and straight forward things actually work a way better.
That may ultimately turn out to be true, but your reaction here seems so over-the-top and premature as to make that conclusion very unconvincing.
This article written from wrong understanding of real problems - instead of solving a problem it suggests some idea for some cases not looking to the problem in hole.
I think you mean "in whole".

The article was written from the understanding that the real problem stems from how std::string is broken. It already identifies why it's broken. It seems that you're just happy to attack people and the work they do more than you are interested in solving problems. If you disagree with what's being said, argue on the merits of "why". Mud-slinging and sitting on a high horse and just saying "blech, you're wrong" is not helping solve any technical problems.

I've already pointed out some of the problems why I think std::string is broken. Now if you disagree with the things I pointed out that's fine, have it your way. I'm not here to please you, I'm here to make a technical point and a contribution that others may very well welcome.
Starting from "std::string is broken" statement...
You obviously love it the way it is, so I don't think I need to be convincing you otherwise. Good luck with that.

-- Dean Michael Berris about.me/deanberris

From: Dean Michael Berris <mikhailberis@gmail.com>
On Sat, Jan 29, 2011 at 3:02 PM, Artyom <artyomtnk@yahoo.com> wrote:
1. "Contiguity"
Contiguity and c_str() are among the most important properties of the C++ string (and contiguity is, BTW, required by C++0x)
Eliminating c_str() doesn't mean there's no easy way to produce a contiguous NTBS.
Yes, but then it can't really be "char const *c_str() CONST", or it would require extra machinery like linearization.
It would turn away 90% of users.
It might turn you away because you obviously love std::string. Generalizing is a different matter and is largely a hot-air blowing exercise that is futile for convincing anybody.
Let me put it more clearly:

1. All users that use C libraries and need c_str() at the boundaries. This is a huge number of users who need to communicate with modules that are already working and ready but written in C. And that is about half of the libraries out there; C is the lowest-level API that allows easy bindings to all languages.

2. All users of GUI toolkits like GTK, Qt, wxWidgets, MFC, as they require conversion of "boost::what_ever_is_it_called" to QString, ustring, wxString, CString, and it is done via a C string.

3. All users who actually use operating system APIs that use C strings and require char *.

Plenty? Isn't it?

Please take a look at the frequent cases of string usage and you'll see how much you indeed need a rope-like structure and how much a normal string. Don't forget that almost all string implementations in all languages are contiguous single memory chunks.
3. non-uniform-memory-architecture
Give me a break... Who uses NUMA for string processing?!
2. In such case it would be even better to have non-shared strings
Weh?
Because of memory locality - think of parts of the string referencing "other memory"
4. About the string builder: most languages require it because they don't have "reserve". Also, if you want an efficient string builder, use std::ostream with a nice stream buffer.
There's nothing efficient about std::ostream, no matter what buffer you put on it.
I beg your pardon? It is efficient, as all the functions are as efficient as memcpy, except when overflow/underflow happens, which requires some virtual function calls that are pretty fast as well...
Also 99% of issues are just solved with reserve. (and I work with text parsing, combining and processing a lot)
And you obviously don't work with systems that have to do this many thousands of times per second, or you would know what the effects of NUMA are and why allocating a contiguous chunk of memory is the performance killer that it is.
I know, but I hadn't suggested that streambuf should use single memory chunk.
This, I believe, is a persistent misunderstanding. IIUC, Dean is only suggesting to avoid giving UTF-8 any special status in the string's interface. He's not arguing against using UTF-8 storage in the implementation.
The entire "buzz" started with the fact that under Windows we have problems with the string encoding not being UTF-8
[snip]
This article was written from a wrong understanding of the real problems - instead of solving a problem it suggests some idea for some cases, not looking at the problem in hole.
[snip]
The article was written from the understanding that the real problem stems from how std::string is broken. It already identifies why it's broken. It seems that you're just happy to attack people and the work they do more than you are interested in solving problems.
If you disagree with what's being said argue on the merits of "why". Mud-slinging and sitting on a high horse and just saying "blech, you're wrong" is not helping solve any technical problems.
I'm sorry, but I think that the much more real problems are these:

- My father-in-law can't use Thunderbird because he defined a non-ASCII user name and Thunderbird fails to open the profile. So he needs to create a new account, because half of the other programs are broken when Unicode paths are used!
- That Acrobat Reader can't open files with the Unicode file names that users have (at least it was so the last time I tried it)
- That you can't write cross-platform Unicode-aware code using a simple std::string/char *, or whatever encoding-non-aware string there is.

What you are doing is a classic example of micro-optimization concentrated on string storage.

Note: all the strings in all the toolkits around don't do much beyond what std::string does in terms of storage; (QString, ustring, UnicodeString, wxString) all use the same storage model - some immutable, which I can accept - but all:

1. Are Unicode aware
2. Use a single memory chunk

There is a very good reason for this, but it seems that you just don't get why all strings are designed around this same principle.

Take a look at these fundamental operations you had written:

- Concatenation (generally ok)
- Substring - should be Unicode aware in most cases
- Filtration - should be Unicode and locale aware
- Tokenization - should be Unicode and locale aware
- Search/Pattern Matching - should be Unicode and locale aware

So please, if you don't understand these fundamental operations and why the string should relate to encoding, then you need to reread this thread.

A new string that does not solve any of the Unicode issues has no place - and this is the *real* problem.

Please don't write theories on the C++ string if you do not see what a string is - text, human-readable text, that is much more complex than a set of byte chunks.

Artyom

On Sat, Jan 29, 2011 at 5:24 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dean Michael Berris <mikhailberis@gmail.com>
On Sat, Jan 29, 2011 at 3:02 PM, Artyom <artyomtnk@yahoo.com> wrote:
It would turn away 90% of users.
It might turn you away because you obviously love std::string. Generalizing is a different matter and is largely a hot-air blowing exercise that is futile for convincing anybody.
Let me put it more clearly:
1. All users that use C libraries and need c_str() at the boundaries. This is a huge number of users who need to communicate with modules that are already working and ready but written in C.
And that is about half of the libraries out there; C is the lowest-level API that allows easy bindings to all languages.
But c_str() doesn't have to be part of the string's interface.
2. All users of GUI toolkits like GTK, Qt, wxWidgets, MFC, as they require conversion of "boost::what_ever_is_it_called" to QString, ustring, wxString, CString, and it is done via a C string.
So, what was the point again?
3. All users who actually use Operating System API that uses C strings and require char *.
Plenty? Isn't it?
I know, so what is your point?
Please take a look at the frequent cases of string usage and you'll see how much you indeed need a rope-like structure and how much a normal string.
Don't forget that almost all string implementations in all languages are contiguous single memory chunks.
So what if all other string implementations in all languages are contiguous? Does that mean that's the *only* way to do it?

Look, in my paper -- if you read and *understood* it -- I pointed out that linearizing a string is an algorithm that deals with a string. Much like how std::copy is an algorithm that is external to a container, I see linearization as something not part of the string interface. That was towards the end part. I never said that a string shouldn't be linearizable.
2. In such case it would be even better to have non-shared strings
Weh?
Because of memory locality - think of parts of the string referencing "other memory"
Memory locality is solved by making it available to the cache. If you have a contiguous chunk of 4kb *that never ever changes* then accessing that memory from all the cores in a NUMA machine is largely a matter of the cache reading part of that and making it available. Making copies of the string is *unnecessarily wasteful*.
I beg your pardon? It is efficient, as all the functions are as efficient as memcpy, except when overflow/underflow happens, which requires some virtual function calls that are pretty fast as well...
Also 99% of issues are just solved with reserve. (and I work with text parsing, combining and processing a lot)
And you obviously don't work with systems that have to do this many thousands of times per second, or you would know what the effects of NUMA are and why allocating a contiguous chunk of memory is the performance killer that it is.
I know, but I hadn't suggested that streambuf should use single memory chunk.
So then what's the point of making strings use a single contiguous memory chunk if it's not necessary?
This article was written from a wrong understanding of the real problems - instead of solving a problem it suggests some idea for some cases, not looking at the problem in hole.
[snip]
The article was written from the understanding that the real problem stems from how std::string is broken. It already identifies why it's broken. It seems that you're just happy to attack people and the work they do more than you are interested in solving problems.
If you disagree with what's being said argue on the merits of "why". Mud-slinging and sitting on a high horse and just saying "blech, you're wrong" is not helping solve any technical problems.
I'm sorry, but I think that the much more real problems are these:
- My father-in-law can't use Thunderbird because he defined a non-ASCII user name and Thunderbird fails to open the profile. So he needs to create a new account, because half of the other programs are broken when Unicode paths are used!
So fix Thunderbird.
- That Acrobat Reader can't open files with the Unicode file names that users have (at least it was so the last time I tried it)
So go work for Adobe and fix Acrobat Reader.
- That you can't write cross-platform Unicode-aware code using a simple std::string/char *, or whatever encoding-non-aware string there is.
You can, there are already libraries for that sort of thing if you insist on using std::string.
What you are doing is a classic example of micro-optimization concentrated on string storage.
No. If you really did read the document and understood it, I was talking at a high level about how to fix the problem by going through the rationale for immutable strings. The storage issue is a necessary component of the implementation efficiency concerns.

If you design something without thinking about the efficiency of the solution, then you're doing art, not engineering. I'm not an artist and I sure want to think I'm an engineer by training and by trade.

Notice that I haven't mentioned any micro-optimizations or micro-benchmarks in the document as well, so I don't know where you're coming from when you say I'm micro-optimizing anything.
Note: all the strings in all the toolkits around don't do much beyond what std::string does in terms of storage; (QString, ustring, UnicodeString, wxString) all use the same storage model - some immutable, which I can accept - but all:
1. Are Unicode aware
2. Use a single memory chunk
There is a very good reason for this, but it seems that you just don't get why all strings are designed around this same principle.
I do know why it's designed the same way: because that's the naive thing to do. Someone thought "oh well, we can malloc a chunk of memory and put a \0 in the end and call that a string". That worked up to a certain level, and then when people started to look at a better way of doing things, they saw that this isn't enough.

Notice how the applications you mention use a segmented data structure when dealing with things like edit buffers or similar things. Strings are largely reserved for "short" data, and anytime you need anything "longer" you'd use something else -- the question I'm trying to address is why can't you use one data structure that will be efficient for both cases? That's the point of the title, which aims to come up with a singular way of explaining how strings should be so that they're suitable for short and long "strings of characters".
Take a look on these fundamental operations you had written:
- Concatenation (generally ok)
- Substring - should be Unicode aware in most cases
- Filtration - should be Unicode and locale aware
- Tokenization - should be Unicode and locale aware
- Search/Pattern Matching - should be Unicode and locale aware
So please, if you don't understand these fundamental operations and why the string should relate to encoding, then you need to reread this thread.
But these are *algorithms* that should be aware of the encoding, *not the string*. If you don't understand that point then you need to read the document *again*.

The point is that if you viewed a string a given way, then that's largely an implementation of the view. I haven't gotten to the explanation of the view, but I hinted that interpretation is a matter of composition. So if you "wrap" a string and say that it should be interpreted one way, then that's the whole point of enforcing an encoding on the view of the string. The string itself doesn't *have* to be encoded a certain way.

Now if strings are just values, then the encoding in which they come is largely a matter of implementation. Think of an int -- you don't really know if it's big/little endian -- or a float -- whether it's IEEE xxx or yyy. All that matters though is how the operations on these strings are defined. The point of the abstraction is that you have one way of dealing with the string as a value and write algorithms around that abstraction.
A new string that does not solve any of the Unicode issues has no place - and this is the *real* problem.
You missed the point. It's not the string you want, it's the view of the string you want when you're talking about encoding.
Please don't write theories on the C++ string if you do not see what a string is - text, human-readable text, that is much more complex than a set of byte chunks.
See, I defined what a string is in that document. If you don't agree with that definition then I can't help you. As much as humans want to think that computers see the world the same way, unfortunately that's not the case.

A string is a data structure. How you view a string in a given encoding is a matter of algorithm. If you don't see that then I'm sorry for you.

-- Dean Michael Berris about.me/deanberris

It might turn you away because you obviously love std::string. Generalizing is a different matter and is largely a hot-air blowing exercise that is futile for convincing anybody.
Let me put it more clearly:
1. All users that use C libraries and need c_str() at the boundaries. This is a huge number of users who need to communicate with modules that are already working and ready but written in C.
And that is about half of the libraries out there; C is the lowest-level API that allows easy bindings to all languages.
But c_str() doesn't have to be part of the string's interface.
What is better:

  fd = creat(file.c_str(), 0666);

or

  fd = creat(boost::string::linearize(file), 0666);

Isn't it obvious?
So, what was the point again? [snip] I know, so what is your point?
See above!
I beg your pardon? It is efficient as all functions are as efficient as memcpy with exceptions of overflow/underflow happens which require some virtual functions calls which are pretty fast as well...
Also 99% of issues are just solved with reserve. (and I work with text parsing, combining and processing a lot)
And you obviously don't work with systems that have to do this multiple thousand times in one second to not know what the effects of NUMA are and why allocating a contiguous amount of memory is the performance killer that it is.
I know, but I hadn't suggested that streambuf should use single memory chunk.
So then what's the point of making strings use a single contiguous memory chunk if it's not necessary?
When you create large text chunks it makes sense, and then you create a single string from it if you want.
I'm sorry, but I think that the much more real problems are these:
- My father-in-law can't use Thunderbird because he defined a non-ASCII user name and Thunderbird fails to open the profile. So he needs to create a new account, because half of the other programs are broken when Unicode paths are used!
So fix Thunderbird.
- That Acrobat Reader can't open files with the Unicode file names that users have (at least it was so the last time I tried it)
So go work for Adobe and fix Acrobat Reader.
- That you can't write cross-platform Unicode-aware code using a simple std::string/char *, or whatever encoding-non-aware string there is.
You can, there are already libraries for that sort of thing if you insist on using std::string.
If the tool causes bugs and problems in every second program, then it seems that there is a problem in the tool. You can't fix all programs, but you can make the tools better so fewer issues arise.

So basically what you are saying is that "there is no problem with encodings in the current C++ world". I'm sorry, but you are wrong.

You know what... I'd really like your data structure if you were not calling it a string, but rather a byte chunk or an immutable byte array. What you are suggesting has nothing to do with text, and I don't understand how you fail to see this.

Artyom

On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
But c_str() doesn't have to be part of the string's interface.
What is better
fd = creat(file.c_str(),0666)
or
fd = creat(boost::string::linearize(file),0666)
Isn't it obvious?
No, it's not obvious. Here's why:

  fd = creat(file.c_str(), 0666);

What does c_str() here imply? It implies that there's a buffer somewhere that is a `char const *`, which is created and returned and then held internally by whatever 'file' is. Now let's say file changed in a different thread -- what happens to the buffer pointed to by the pointer returned by c_str()? Explain that to me, because *it is possible and it can happen in real code*.

  fd = creat(linearize(file), 0666); // rely on ADL

This is also bad, because linearize would be allocating the buffer for me, which I might not be able to control the size of or know whether it will be a given length -- worse, it might even throw. It may also mean that the buffer is static, or that I have to manage the pointer being returned.

I like this better:

  char * filename = (char *)calloc(256, sizeof(char)); // I know I want 255 characters max, plus the terminator
  if (filename == NULL) {
    // deal with the error here
  }
  linearize(substr(file, 0, 255), filename);
  fd = creat(filename, 0666);

It's explicit and I see all the memory that I need. I can even imagine a simple class that can do this. Oh wait, here it is:

  std::string filename = substr(file, 0, 255);
  fd = creat(filename.c_str(), 0666);

All I have to do is to create a conversion operator to std::string and I'll be fine. Or, actually:

  fd = creat(static_cast<std::string>(substr(file, 0, 255)).c_str(), 0666);

Now here file could be a view, or could be a raw underlying string. There are many ways to skin a cat (in this context, cut a string), but having c_str() as part of the interface puts too much of a burden on the potential efficiency of an immutable string implementation.
So then what's the point of making strings use a single contiguous memory chunk if it's not necessary?
When you create large text chunks it makes sense, and then you create a single string from it if you want.
And then the problem of the unnecessary contiguous buffer is still not addressed.
You can, there are already libraries for that sort of thing if you insist on using std::string.
If the tool causes bugs and problems in every second program, then it seems that there is a problem in the tool.
Weh?
You can't fix all programs, but you can make the tools better so fewer issues arise.
So basically what you are saying is that
"there is no problem with encodings in current C++ world"
I'm sorry but you are wrong.
I NEVER SAID THIS! You're arguing a strawman here. I said: encoding is largely external to a string data structure. That's why there's a view of a string in a given encoding.
You know what...
I'd really like your data structure if you were not calling it a string, but rather a byte chunk or an immutable byte array.
What you are suggesting has nothing to do with text, and I don't understand how you fail to see this.
I don't know if you're not a native English speaker or whether you just really think strings are just for text.

Strings are a data structure (look it up). Encoding is a way of representing or interpreting data in a certain way. I fail to see why encoding has anything to do with a data structure. So if I have data in a data structure, I should be able to apply an encoding on that data structure and "view" it a given way I want/need.

What I'm saying is, a string data structure should have clearly defined semantics -- hence the document going into the immutability, value semantics, etc. -- now encoding is largely a matter at a different level operating on strings. Encoding is an interpretation of strings.

*I* fail to see why *you* fail to understand this clear statement.

-- Dean Michael Berris about.me/deanberris

On Jan 29, 2011, at 7:33 AM, Dean Michael Berris wrote:
On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
[snip]
You know what...
I'd really like your data structure if you were not calling it a string, but rather a byte chunk or an immutable byte array.
What you are suggesting has nothing to do with text, and I don't understand how you fail to see this.
I don't know if you're not a native English speaker or whether you just really think strings are just for text.
First of all, in programming languages (at least the 20 or so that I master and in which I have developed software professionally), the notion of 'string' is that of text (and in some languages 'string' is nothing more than an alias for an array/vector of characters.)

But your intention is to use your "string" for other types of elements, i.e., to be what is called a 'vector' in C++, albeit immutable. No? So, why are you complaining when Artyom actually wants you to call it exactly what you yourself *claim* it is. It is you who bring confusion by:

1. sometimes arguing that it is nothing but a byte sequence,
2. sometimes arguing that *anything* can be stored in those sequences, and
3. sometimes talking about text and encodings - in the form of views - clearly indicating some very special use case for your byte sequence

Can you please clarify *which* notion you are after with your "string" proposal? So we understand the exact use case(s) for it? Since we (at least Artyom and myself) have this preconceived notion of what a 'string' is in a programming language, no matter how esoteric that preconception might be...
Strings are a data structure (look it up).
Yes, definitely. I asked you if you meant the computer-scientific "string" when you said something similar before, and you said 'NO'. But that is the definition and meaning you are alluding to now, is it not? If not, can you please provide a reference to the "string" you want Artyom to look up.

And, actually, the (CS) string is a proper approach to the problem of (textual...) string as well: a sequence of symbols (in our world, 'character' of some form.) This is very important: it is a (finite...) sequence of *symbols* (characters...) which in our case(s) would be actual characters used in a (natural or not) language. It is *not* a sequence of bytes happening to represent a sequence of characters.
Encoding is a way of representing or interpreting data in a certain way. I fail to see why encoding has anything to do with a data structure.
Encoding has nothing to do with the sequence of characters, except that in order to *represent* a (CS or 'textual') string one needs some type of encoding, and, yes, one that handles the characters in question (such as both Latin-1 and UTF-8 being able to handle the symbol 'Ä')
So if I have data in a data structure, I should be able to apply an encoding on that data structure and "view" it a given way I want/need.
There are four layers in play here:

1. The sequence of characters/symbols, as in the CS string; totally abstract but precise... (one such CS string can be represented as Unicode or Latin-1 at #2, for instance.)
2. The sequence of code points, in a given character set. Yes, one CS string (as in #1) can have multiple distinct manifestations at this level. They could be identical in integral sequences.
3. The sequence of code values, using an encoding form such as UTF-8 or UTF-16 for a Unicode code point.
4. The byte storage representing the code values; could be a contiguous sequence of bytes or chunks, etc.

It is quite clear that you are (in most posts, at least...) targeting #4 with your proposal. Is that not right? If so, two comments:

1. Why can't this byte storage type be used for all kinds of things? Is not 'string' a quite bad name for it, since it is neither a string according to most programming languages (see above) nor according to that CS definition that you are alluding to (unless you consider uninterpreted bytes to be symbols, but be quite aware that those 'symbols' would have nothing - or very little - to do with the symbols of the text represented through your construct.)
2. What is that 'view' notion of yours? It seems to involve a mixture of #2 and #3 above. In what way is it less unstable than reinterpret_cast<>? I.e., does it make sense to be able to switch views?
What I'm saying is, a string data structure should have clearly defined semantics -- hence the document going into the immutability, value semantics, etc. -- now encoding is largely a matter at a different level operating on strings. Encoding is an interpretation of strings.
No, encoding is a *representation* of a string (both in the 'text' sense and CS sense.) This difference is crucial. On the other hand: encoding is an interpretation of a byte sequence, *yielding* a string.
*I* fail to see why *you* fail to understand this clear statement.
Because it is false? Again: a 'string' is *not* a sequence of uninterpreted (i.e., detached from encoding) bytes, neither in most programming languages nor in CS. If you have any other definition for 'string' you can provide that, but rest assured that most people will have their preconceived notions firmly established in one (or both) of the above fields. /David

On Sat, Jan 29, 2011 at 9:43 PM, David Bergman <David.Bergman@bergmangupta.com> wrote:
On Jan 29, 2011, at 7:33 AM, Dean Michael Berris wrote:
On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
[snip]
You know what...
I'd really like your data structure if you were not calling it a string, but rather a byte chunk or an immutable byte array.
What you are suggesting has nothing to do with text, and I don't understand how you fail to see this.
I don't know if you're not a native English speaker or whether you just really think strings are just for text.
First of all, in programming languages (at least the 20 or so that I master and in which I have developed software professionally), the notion of 'string' is that of text (and in some languages 'string' is nothing more than an alias for an array/vector of characters.)
But your intention is to use your "string" for other types of elements, i.e., to be what is called a 'vector' in C++, albeit immutable. No?
NO. I wonder where you got that notion.

I framed the discussion around my definition of `string` to be a sequence. In that context (in an earlier post) I was basically saying "a string is a data structure for holding things, [FOR EXAMPLE] a string of events, a string of characters, ..." just to frame the definition properly and identify that I was talking about a data structure. I had no other intention of implementing a string of events, but mostly that data structure is already there.

Lifting the notion of what a string is, is what I did. And I explained it as well (I hate going meta on English like this, but hey...) that it was a linguistic tool used to set the stage for a discussion. It's more like "setting the basis on which your arguments will be based". I never thought this would be such an issue in presenting the idea of what a string *data structure* is.
So, why are you complaining when Artyom actually wants you to call it exactly what you yourself *claim* it is.
It is you who bring confusion by:
1. sometimes arguing that it is nothing but a byte sequence,
2. sometimes arguing that *anything* can be stored in those sequences, and
3. sometimes talking about text and encodings - in the form of views - clearly indicating some very special use case for your byte sequence
Can you please clarify *which* notion you are after with your "string" proposal? So we understand the exact use case(s) for it? Since we (at least Artyom and myself) have this preconceived notion of what a 'string' is in a programming language, no matter how esoteric that preconception might be...
A string is something that contains data, is immutable, and on which you can perform the primitive operations that define the whole "string calculus" I refer to in the paper. It's like asking me to define what a number is in math, when it's really just a value that plays along with a set of concepts and within a certain set of rules.

I can describe to you the concept of a string -- and that is general by design, so that we can talk about interfaces and all that design-goodness jazz. What you can't make me do is say "a string is a series of characters with an encoding", because that's not what describes the *concept* of a string.

Now, a string is a data structure in my mind, and I'm trying my best to explain how that data structure doesn't include an encoding. The *view* mechanism is what allows for *interpreting the data in a string (whatever that is)* and looking at it a certain way. An encoding is something you apply to data to make it look like a certain thing -- in your and Artyom's case that's just a UTF-8 encoded view of the underlying data in a string.
Strings are a data structure (look it up).
Yes, definitely. I asked you if you meant the computer-scientific "string" when you said something similar before, and you said 'NO'. But that is the definition and meaning you are alluding to now, is it not? If not, can you please provide a reference to the "string" you want Artyom to look up.
See, the answer I gave about whether it's a CS-string was clear: yes, it is a string data structure. And I thought I was clear.
And, actually, the (CS) string is a proper approach to the problem of (textual...) string as well: a sequence of symbols (in our world, 'character' of some form.) This is very important: it is a (finite...) sequence of *symbols* (characters...) which in our case(s) would be actual characters used in a (natural or not) language. It is *not* a sequence of bytes happening to represent a sequence of characters.
So on a computer, let's be frank, what is it that you deal with -- isn't it bytes? We can run around in circles about whether it's a character string, an event string, or a wide character string; in the end, they are *bytes*.

Now of course, in some encodings a character may be a minimum and (potentially, but not necessarily) a maximum number of bytes "per element" or "symbol" if you like. In other encodings it's a fixed size (as in ASCII). However the string stores it is largely inconsequential, as long as you view it in a consistent manner.
Encoding is a way of representing or interpreting data in a certain way. I fail to see why encoding has anything to do with a data structure.
Encoding has nothing to do with the sequence of characters, except that in order to *represent* a (CS or 'textual') string one needs some type of encoding, and, yes, one that handles the characters in question (such as both Latin-1 and UTF-8 being able to handle the symbol 'Ä')
An encoding is largely a function that you apply to input to yield output. Given a string 's' and a function 'encode', an encoded string is the result of `encode(s)`. What's so hard to understand there?
So if I have data in a data structure, I should be able to apply an encoding on that data structure and "view" it a given way I want/need.
There are four layers in play here:
1. The sequence of characters/symbols, as in CS string; totally abstract but precise... (one such CS string can be represented as Unicode or Latin-1 at #2, for instance.)
Yes, but it doesn't *have* to be *defaulted* to something which is my whole point all along.
2. The sequence of code points, in a given character set. Yes, one CS string (as in #1) can have multiple distinct manifestations at this level. They could be identical in integral sequences.
Okay.
3. The sequence of code values, using an encoding form such as UTF-8 or UTF-16 for a Unicode code point.
4. The byte storage representing the code values; could be a contiguous sequence of bytes or chunks, etc.
It is quite clear that you are (in most posts, at least...) targeting #4 with your proposal. Is that not right? If so, two comments:
No, you missed it too. I was presenting the foundation (#4) so that I can build upon it 1, 2, and 3. The approach is bottom up instead of top-down.
1. Why can't this byte storage type not be used for all kinds of things; is not 'string' a quite bad name for it, since it is neither a string according to most programming languages (see above) nor according to that CS definition that you are alluding to (unless you consider uninterpreted bytes to be symbols, but be quite aware that those 'symbols' would have nothing - or very little - to do with the symbols of the text represented through your construct.)
It actually can. But that only makes sense if the operations make sense on it. For example if you put the raw byte sequences for a float into a string -- does concatenation make sense for that data? Maybe in your application yes, but what's *in* the string is largely inconsequential to the algorithms you apply to the string. Now if you had a view which wrapped this concatenated byte sequences of floats that yielded a pair of floats, isn't the abstraction still appropriate? I'll leave you to work that out on your own.
2. What is that 'view' notion of yours? It seems to involve a mixture of #2 and #3 above. In what way is it less unstable than reinterpret_cast<>? I.e., does it make sense to be able to switch views?
Because reinterpret_cast assumes that the data referred to in the pointer is contiguous. And I have already maintained that the string's implementation will explicitly be non-contiguous so that you cannot assume that the data it contains is contiguous. Now does that make sense?
What I'm saying is, a string data structure should have clearly defined semantics -- hence the document going into the immutability, value semantics, etc. -- now encoding is largely a matter at a different level operating on strings. Encoding is an interpretation of strings.
No, encoding is a *representation* of a string (both in the 'text' sense and CS sense.) This difference is crucial. On the other hand: encoding is an interpretation of a byte sequence, *yielding* a string.
Eh? Encoding is an algorithm (a transformation if you will) applied to *data* and what's yielded is an encoded result. So when you take a string (as how I define it) and you wrap it in a "raw view" then you get the raw data in the string as exposed by iterators. Then think about what happens when you view a string in a given encoding; let's say you have BOM markers in the beginning of a byte sequence, or somehow have data at the start of it that gives information about what the encoding is -- then you can make a generic 'view' that can handle this data appropriately. The possibilities are endless here. So when you see `view<utf8_encoded>` that tells you whatever string you wrap with this view will be viewed as a UTF-8 encoded "text" in your parlance -- actually what's going to happen is you have access to iterators that yield appropriately-typed code-points or "characters".
*I* fail to see why *you* fail to understand this clear statement.
Because it is false? Again: a 'string' is *not* a sequence of uninterpreted (i.e., detached from encoding) bytes, neither in most programming languages nor in CS. If you have any other definition for 'string' you can provide that, but rest assured that most people will have their preconceived notions firmly established in one (or both) of the above fields.
A string is a data structure that contains data, has defined operations on that data, and is largely viewed as a container -- in my definition, it is also immutable. Much like how numbers in math are defined by a concept, what I have presented is a higher-level concept of string that defines its interface and semantics. Also, whatever the encoding of the underlying data in a string may be is largely inconsequential to the operations defined on it. Truth is largely a matter of agreement, I find, so long as we all agree on what the definition of truth is. Sorry to go philosophical on you, but really, saying "it's false" is different from saying "because I disagree". -- Dean Michael Berris about.me/deanberris

You know what...
I'd really like your data structure if you were not calling it a string, but rather a byte chunk or an immutable byte array.
What you are suggesting has nothing to do with text, and I don't understand how you fail to see this.
I don't know if you're not a native English speaker or whether you just really think strings are just for text.
First of all, in programming languages (at least the 20 or so that I master
and in which I have developed software professionally), the notion of 'string' is that of text (and in some languages 'string' is nothing more than an alias for an array/vector of characters.)
But your intention is to use your "string" for other types of elements,
i.e., to be what is called a 'vector' in C++, albeit immutable. No?
NO.
I wonder where you got that notion. I framed the discussion around my definition of `string` to be a sequence. In that context (in an earlier post) I was basically saying "a string is a data structure for holding things, [FOR EXAMPLE] a string of events, a string of characters, ..." just to frame the definition properly and identify that I was talking about a data structure.
I'm sorry. Let's see:

- Java String - one meaning: text, UTF-16 encoded
- C# string - one meaning: text, UTF-16 encoded
- C++/GTKmm ustring - one meaning: text, UTF-8 encoded
- C++/Qt QString - one meaning: text, UTF-16 encoded
- C++/wxWidgets wxString - one meaning: text, Unicode (don't remember encoding type)
- Vala string - text, UTF-8 encoded
- Python 3 str[ing] - text, UTF-16 or UTF-32 encoded

Is this clear enough? When you say string you mean TEXT, not more, not less. And yes, in C++ you can store arbitrary data in char buffers or in std::string, but this does not change the meaning of the word "string" - it means "text". Don't try to reinvent the meaning of the word "string" in the CS context. It is not about English, it is about the concept.

Artyom

On Sat, Jan 29, 2011 at 11:44 PM, Artyom <artyomtnk@yahoo.com> wrote:
I wonder where you got that notion. I framed the discussion around my definition of `string` to be a sequence. In that context (in an earlier post) I was basically saying "a string is a data structure for holding things, [FOR EXAMPLE] a string of events, a string of characters, ..." just to frame the definition properly and identify that I was talking about a data structure.
And you snipped the part explaining that I used that statement for framing a discussion, as in laying down the foundations for arguments. *sigh*
I'm sorry:
Let's see:
- Java String - one meaning text, UTF-16 encoded
Nope, in Java a String is a data type that derives from Object which stores an immutable sequence of 16-bit characters. Not necessarily *text*, and it just so happens that it chooses the UTF-16 encoding. AFAIK you can still stuff arbitrary bytes when constructing a String object -- try reading from a binary file and see what I mean.
- C# string - one meaning text, UTF-16 encoded
Sequence of UTF-16 characters. Not sure if it's immutable.
- C++/GTKmm ustring - one meaning text, UTF-8 encoded
Sequence of characters, just so happens to be UTF-8 encoded.
- C++/Qt QString - one meaning text, UTF-16 encoded
Sequence of UTF-16 characters. May or may not be *text* i.e. I can still fill the buffer with *garbage*.
- C++/wxWidgets wxString - one meaning text, Unicode (don't remember encoding type)
Same thing, sequence of characters. Just so happens to choose a default encoding, but still you can fill the thing with garbage.
- Vala string - text UTF-8 encoded
I have no idea what a Vala string is.
- Python 3 str[ing] - text, UTF-16 or UTF-32 encoded
In Python 3 it chose to deal with strings as UTF-16 or UTF-32 (there's a move to make this largely transparent to the user depending on the platform) characters. I can still fill a str with garbage even in Python.
Is this clear enough?
I didn't get the point. You were enumerating these data types... to convince me that 'string' only denotes text?
When you say string you mean TEXT not more not less.
No, maybe *you* say string when TEXT is what you mean. I OTOH view a string as a sequence of characters for whatever suitable meaning of character exists. Also, TEXT can be represented in many different ways as well, not just with strings. And TEXT is largely a human idiom referring to letters and words visible on some medium. This has nothing to do with computers because to computers, guess what: it's all *bytes*.
And yes, in C++ you can store arbitrary data in char buffers or in std::string, but this does not change the meaning of the word "string" - it means "text"
No, sorry. I think really it's either bad computer science or bad English (or bad translation of concepts, FWIW).
Don't try to reinvent the meaning of string word in CS context. It is not about English, it is about concept.
And the concept of a string in computer science is a sequence of characters for whatever suitable definition of characters exist. -- Dean Michael Berris about.me/deanberris

I'm sorry:
Let's see:
- Java String - one meaning text, UTF-16 encoded
Nope, in Java a String is a data type that derives from Object which stores an immutable sequence of 16-bit characters. Not necessarily *text*, and it just so happens that it chooses the UTF-16 encoding. AFAIK you can still stuff arbitrary bytes when constructing a String object -- try reading from a binary file and see what I mean.
http://download.oracle.com/javase/6/docs/api/java/lang/String.html

Some methods:
- toLowerCase()
- toUpperCase()
- trim()
- equalsIgnoreCase()

They don't make any sense for non-text storage. The fact that you can fill it with garbage does not change the fact that it is text-oriented storage.
- C# string - one meaning text, UTF-16 encoded
Sequence of UTF-16 characters. Not sure if it's immutable.
http://msdn.microsoft.com/en-us/library/system.string.aspx

Some methods:
- IsNormalized()
- Normalize()
- ToLower()
- ToUpper()
etc.

They don't make any sense for non-text storage. The fact that you can fill it with garbage does not change the fact that it is text-oriented storage.
- C++/GTKmm ustring - one meaning text, UTF-8 encoded
Sequence of characters, just so happens to be UTF-8 encoded.
http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html

Some member functions:
- collate_key()
- normalize()
- uppercase()
- lowercase()
- casefold()

They don't make any sense for non-text storage. The fact that you can fill it with garbage does not change the fact that it is text-oriented storage.

As you probably understand, the same goes for QString, wxString and the others.
No, maybe *you* say string when TEXT is what you mean. [...] to computers, guess what: it's all *bytes*.
No, I mean that when you say "String" you mean "Text", and not only me but all the language developers around who have developed so many tools for string/text processing. It is no accident that String means Text in the CS context. You may not agree, or say the English is wrong or the translation is wrong - but the fact remains: String is text storage in the most common CS context. If you're still not sure about it, make a poll and ask your colleagues what a string means to them - a sequence of objects, or text.

Artyom

From: Dean Michael Berris <mikhailberis@gmail.com> On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
But c_str() doesn't have to be part of the string's interface.
What is better
fd = creat(file.c_str(),0666)
or
fd = creat(boost::string::linearize(file), 0666)
Isn't it obvious?
No, it's not obvious. Here's why:
fd = creat(file.c_str(), 0666);
What does c_str() here imply? It implies that there's a buffer somewhere that is a `char const *` which is either created and returned and then held internally by whatever 'file' is.
It implies that the const file owns a const buffer that holds a null-terminated string which can be passed to a "char const *" API.
Now, let's say file changed in a different thread: what happens to the buffer pointed to by the pointer returned by c_str()? Explain that to me, because *it is possible and it can happen in real code*.
I'm sorry, but a string, like anything else, has value semantics, that is:

- safe for "const" access from multiple threads
- safe for mutable access from a single thread

I don't see why a string should be different from any other value type like "int", because the following

x += y + y

is not safe for an integer either. The code I have shown is **perfectly** safe with a string that has value semantics (which std::string has).
fd = creat(linearize(file), 0666); // rely on ADL
This is also bad because linearize would be allocating the buffer for me which I might not be able to control the size of or know whether it will be a given length -- worse it might even throw.
Exactly! c_str() never throws, as it is a "const" member function in C++ string semantics.

So, for example, this code is fine with a const c_str():

bool create_two_lock_files(string const &f1, string const &f2) {
    int fd1 = creat(f1.c_str(), O_EXCL /* ... */);
    if (fd1 == -1)
        return false;
    int fd2 = creat(f2.c_str(), O_EXCL /* ... */);
    if (fd2 == -1) {
        unlink(f1.c_str());
        close(fd1);
        return false;
    }
    close(fd1);
    close(fd2);
    return true;
}

It would not work with all the linearize stuff, because it would not be exception safe and would require me to create a temporary variable to store f1 linearized.
It also may mean that the buffer is static, or I have to manage the pointer being returned.
I think we both, and 95% of the C++ programmers who use the STL, know what the semantics of std::string::c_str() are.
I like this better:
char *filename = (char *)calloc(255, sizeof(char)); // I know I want 255 characters max
if (filename == NULL) {
    // deal with the error here
}
linearize(substr(file, 0, 255), filename);
fd = creat(filename, 0666);
Sorry? Is this better than:

fd = creat(filename.substr(0, 256).c_str(), O_EXCL /* ... */);

Which, by the way, is 100% thread safe as well (but still may throw). Even though I can't see any reason to cut 256 bytes before creat.
It's explicit and I see all the memory that I need. I can even imagine a simple class that can do this. Oh wait, here it is:
std::string filename = substr(file, 0, 255);
fd = creat(filename.c_str(), 0666);
You don't need to create an explicit temporary filename; C++ keeps the temporary alive until creat() has completed.
All I have to do is to create a conversion operator to std::string and I'll be fine. Or, actually:
fd = creat(static_cast<std::string>(substr(file, 0, 255)).c_str(), 0666);
Now here file could be a view, could be a raw underlying string.
As above: it throws and is extremely verbose.
There are many ways to skin a cat (in this context, cut a string) but having c_str() as part of the interface puts too much of a burden on the potential efficiency of an immutable string implementation.
So then what's the point of making strings use a single contiguous memory chunk if it's not necessary?
When you create large text chunks it makes sense, and then you can create a single string from them if you want.
And then the problem of the unnecessary contiguous buffer is still not addressed.
It may be a good idea to have some non-linear data storage, but it should be used in very specific cases. Also, what is a really large string for you, one that would have a performance advantage from not being stored linearly? Talk to me in numbers.
What you are suggesting has nothing to do with text, and I don't understand how you fail to see this.
I don't know if you're not a native English speaker or whether you just really think strings are just for text.
There was another great answer by David Bergman on this, which you have probably already read. I can only say +1 to his answer; it can't be said better.

Artyom

On Sat, Jan 29, 2011 at 11:25 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dean Michael Berris <mikhailberis@gmail.com> On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
No, it's not obvious. Here's why:
fd = creat(file.c_str(), 0666);
What does c_str() here imply? It implies that there's a buffer somewhere that is a `char const *` which is either created and returned and then held internally by whatever 'file' is.
It implies that the const file owns a const buffer that holds a null-terminated string which can be passed to a "char const *" API.
Yes, which is the problem in the first place. Every instance of the string would then need to have that same buffer, even if that string is just a temporary or, worse, just a copy.
Now, let's say file changed in a different thread: what happens to the buffer pointed to by the pointer returned by c_str()? Explain that to me, because *it is possible and it can happen in real code*.
I'm sorry but string as anything else has value semantics that is:
- safe for "const" access from multiple threads - safe for mutable access from single thread
I don't see why string should be different from any other value type like "int" because following
x+=y + y
is not safe for integer as well.
The code I have shown is **perfectly** safe with a string that has value semantics (which std::string has).
Now the point I was making was, in the case of a string that is immutable, you don't worry about the string changing *ever*, don't need a contiguous buffer for something that isn't explicitly required. It's value semantics *plus* immutability. std::string *is* mutable and now that means the original pointer returned by a call to c_str() may be invalid by the time the C API accesses that pointer because the original string may have already changed (potentially changing the pointer returned by c_str() at a later time).
fd = creat(linearize(file), 0666); // rely on ADL
This is also bad because linearize would be allocating the buffer for me which I might not be able to control the size of or know whether it will be a given length -- worse it might even throw.
Exactly!
c_str() never throws, as it is a "const" member function in C++ string semantics.
So, for example, this code is fine with a const c_str():
bool create_two_lock_files(string const &f1, string const &f2) {
    int fd1 = creat(f1.c_str(), O_EXCL /* ... */);
    if (fd1 == -1)
        return false;
    int fd2 = creat(f2.c_str(), O_EXCL /* ... */);
    if (fd2 == -1) {
        unlink(f1.c_str());
        close(fd1);
        return false;
    }
    close(fd1);
    close(fd2);
    return true;
}
It would not work with all the linearize stuff, because it would not be exception safe and would require me to create a temporary variable to store f1 linearized.
But see, the point I was making was (which apparently you missed, or I was unclear):

f1.c_str()

can return a pointer at the time of the call, and by the time the C API goes and accesses that pointer, f1 could have already changed underneath it and the previous buffer location would no longer be valid. That's unless f1 is a temporary, but note that the same API can be called with a non-const `std::string`. As in:

std::string f1 = "foo", f2 = "bar";
// start threads that may deal with f1 as an lvalue reference
create_two_lock_files(f1, f2);
// f1.c_str() may be invalidated after c_str() is called.

In a world where you couldn't change a string, this use case would be simplified and all requirements to linearize a string would be made explicit -- so you would know exactly when you need to have the data linearized. Also, when copying immutable strings around, the cost would potentially be the same as copying a shared_ptr<> -- incrementing a reference count and copying a pointer. It doesn't need to copy the contents anymore for safety, because *it will never change*.
It also may mean that the buffer is static, or I have to manage the pointer being returned.
I think we both, and 95% of the C++ programmers who use the STL, know what the semantics of std::string::c_str() are.
I like this better:
char *filename = (char *)calloc(255, sizeof(char)); // I know I want 255 characters max
if (filename == NULL) {
    // deal with the error here
}
linearize(substr(file, 0, 255), filename);
fd = creat(filename, 0666);
Sorry? Is this better than:
fd = creat(filename.substr(0, 256).c_str(), O_EXCL /* ... */);
Which, by the way, is 100% thread safe as well (but still may throw). Even though I can't see any reason to cut 256 bytes before creat.
No, not better because std::string's substr() will return a temporary, which means it will be a copy -- meaning another allocation and a call to memcpy(...). You don't get the benefit of COW on this one because you need to cut the string down to a "maximum size". I don't know if you know, but them C APIs from OSes have a defined maximum on the lengths of filenames and things like that...
It's explicit and I see all the memory that I need. I can even imagine a simple class that can do this. Oh wait, here it is:
std::string filename = substr(file, 0, 255);
fd = creat(filename.c_str(), 0666);
You don't need to create an explicit temporary filename; C++ keeps the temporary alive until creat() has completed.
I don't need to but I *can* which makes it explicit and clear that: 1) I am linearizing an immutable string. 2) I do need a linearized buffer.
All I have to do is to create a conversion operator to std::string and I'll be fine. Or, actually:
fd = creat(static_cast<std::string>(substr(file, 0, 255)).c_str(), 0666);
Now here file could be a view, could be a raw underlying string.
As above: it throws and is extremely verbose.
This would throw because of... an out of memory exception? At that point then all bets are off. And verbosity is exactly what you want when making explicit conversions and explicit operations.
And then the problem of the unnecessary contiguous buffer is still not addressed.
It may be a good idea to have some non-linear data storage, but it should be used in very specific cases.
Also, what is a really large string for you, one that would have a performance advantage from not being stored linearly?
Talk to me in numbers.
Say on a machine+OS combo that has 4kb pages, a large string would be something that spans more than one memory page -- i.e. >4kb. Now a "short" string is one that can fit within a page. For concatenated strings, it's fine to compact/copy the substrings into a growable/shrinkable block. The overhead for a short string would be constant, as the concatenation tree would be a pointer and a length with a reference count integer. If your string (whether long or short) is copied around, then reference counts kick in on copy constructors and destructors -- the way shared_ptr does it. This means there are no extra memory allocations (aside from the actual string object, which would be a pointer's width).

Now your long strings become interesting. For a string that spans more than one page, you have the overhead of potentially a page's worth of the concatenation tree data structure -- so if a tree node contains three pointers (block pointer, pointer to left node, pointer to right node), a reference count (inner nodes can be referred to by other higher-level concatenation nodes), and a length, then doing some math on it (12 bytes + 4 bytes + 2 bytes = 18 bytes) you have 1 page of overhead for every 220-page-long string, if you pack these 220 nodes in a single page (which is, really, an array-based tree).

What you get in return is: constant-time-dereference bidirectional iterators, logarithmic-time 'char' random access, cheap copies, and thread safety.
What you are suggesting has nothing to do with text, and I don't understand how you fail to see this.
I don't know if you're not a native English speaker or whether you just really think strings are just for text.
There was another great answer by David Bergman on this, which you have probably already read.
I can only say +1 to his answer; it can't be said better.
Okay. -- Dean Michael Berris about.me/deanberris

On 01/28/2011 09:59 AM, Dean Michael Berris wrote:
On Fri, Jan 28, 2011 at 7:20 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
So the interface I was thinking about (and suggesting) is a lot more minimal than what rope or std::string have exposed. I think when I do finish that design document (with rationale) it would be clear why I would like to keep it immutable and why I would prefer it still be called a string.
Let me finish that document -- expect something over the weekend. :)
And I stopped before I write too much -- the initial version is already up: https://github.com/downloads/mikhailberis/cpp-string-theory/cpp-string-theor... -- I'll give it more information and the actual interfaces and implementation as soon as I get some Z's. :)
You mention that your string is thread safe by design, but you only solve the problem of mutating the data of a string; your references to the pieces from which you compose a string are not thread safe, since they are mutable, right? Or is it that when you paste another chunk on the end you get a whole new container and the old one is thrown away? Otherwise, if two threads are trying to attach a new piece to the end of the string, you could get out of whack without keeping them from dealing with the same data structures.

Patrick

On Fri, Jan 28, 2011 at 6:36 PM, Gregory Crosswhite <gcross@phys.washington.edu> wrote:
On 1/27/11 6:56 AM, Dean Michael Berris wrote:
I agree. Help wanted on a better name for the immutable string. :)
How about "twine", "cord", or "yarn"?
Right about now I wish there was an easy way to take a poll somewhere about the name of an immutable finite byte sequence data structure. ;) -- Dean Michael Berris about.me/deanberris

On 1/28/11 3:21 AM, Dean Michael Berris wrote:
Right about now I wish there was an easy way to take a poll somewhere about the name of an immutable finite byte sequence data structure. ;)
Your wish is my command: https://catalyst.uw.edu/webq/survey/gcross/123312 Cheers, Gregory Crosswhite

On Sat, Jan 29, 2011 at 4:51 AM, Gregory Crosswhite <gcross@phys.washington.edu> wrote:
On 1/28/11 3:21 AM, Dean Michael Berris wrote:
Right about now I wish there was an easy way to take a poll somewhere about the name of an immutable finite byte sequence data structure. ;)
Your wish is my command:
Wow. You sir, are awesome.
Cheers, Gregory Crosswhite
Thank you very much! :) Now I look forward to people participating. :D PS. I think this is a product waiting to happen. You should consider monetizing this. :D -- Dean Michael Berris about.me/deanberris

On Jan 27, 2011, at 5:52 AM, Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 4:09 PM, Artyom <artyomtnk@yahoo.com> wrote:
[snip]
3. What painful problems are you going to solve that would make it so much better than the widely used and adopted std::string? Iterators? Mutability? Performance?
(Clue: there are no painful problems with std::string)
Sorry, but for someone who's dealt with std::string for *a long time (close to 8 years)*, here are a few real painful problems with it:
I actually agree with you here, DMB (which are also my initials...), although I am still in shock of this "new IT world" where 8 years is a long time :-) Some old farts here (me included) have dealt with std::string for 20+ years, and, yes, we share your pain :-)

Artyom is a bit silly here, since "having to know the encoded length of each character and then use some calculus to get to the index to use when setting a character" is indeed a problem for a client of a class - one does want that intelligence to be embedded in a proper string class. Yes, there are cases for UTF-8, specifically, where we can just watch for a certain byte coming up, but that is not a stable pattern and applies only to certain operations, such as his searching for whitespace.

Additionally, not only having dealt with std::string for more than 20 years but also with Java, C#, Ruby etc. [sS]trings for the last 10-15 years, it is quite clear that not having a string class in the standard library that can at least handle UTF-8 properly is a wart and an embarrassment when trying to lure "soft programmers" into C++. One has to come up with defenses like those of Artyom's to salvage the situation...

So, I applaud your fight here, DMB, seriously. I just happen to disagree with your specific focus.

What I would do would be to focus not on a class per se but on the (GP...) concept and the associated iterators needed. Then we can see if we can also produce a proper model for that, or if we can find a sub concept (even though it would only be "sub" specification-wise but actually "super" set-wise...) that would make the current std::string a model.

And, I would drop the "immutable" part since this thread was supposed to be about a new thingie that could potentially replace std::string in C++3x..

/David

On Fri, Jan 28, 2011 at 1:43 AM, David Bergman <David.Bergman@bergmangupta.com> wrote:
On Jan 27, 2011, at 5:52 AM, Dean Michael Berris wrote:
Sorry, but for someone who's dealt with std::string for *a long time (close to 8 years)*, here are a few real painful problems with it:
I actually agree with you here, DMB (which are also my initials...), although I am still in shock of this "new IT world" where 8 years is a long time :-) Some old farts here (me included) have dealt with std::string for 20+ years, and, yes, we share your pain :-)
:D (I think I should have said "long *enough* time" :D)
Artyom is a bit silly here, since "having to know the encoded length of each character and then use some calculus to get to the index to use when setting a character" is indeed a problem for a client of a class - one does want that intelligence to be embedded in a proper string class. Yes, there are cases for UTF-8, specifically, where we can just watch for a certain byte coming up, but that is not a stable pattern and applies only to certain operations, such as his searching for whitespace.
Additionally, not only having dealt with std::string for more than 20 years but also with Java, C#, Ruby etc. [sS]trings for the last 10-15 years, it is quite clear that not having a string class in the standard library that can at least handle UTF-8 properly is a wart and an embarrassment when trying to lure "soft programmers" into C++. One has to come up with defenses like those of Artyom's to salvage the situation...
So, I applaud your fight here, DMB, seriously. I just happen to disagree with your specific focus.
What I would do would be to focus not on a class per se but on the (GP...) concept and the associated iterators needed. Then we can see if we can also produce a proper model for that, or if we can find a sub concept (even though it would only be "sub" specification-wise but actually "super" set-wise...) that would make the current std::string a model.
Right, I agree. But the context of the discussion was simply around the class. I have in my mind an almost thoroughly thought-out picture of how strings can be dealt with from an algorithm perspective. It's complete to the point of just needing to be written down as a document that everybody can chime in on and maybe review for comment.

I did introduce the concept a while back when I asked for a different string container that had its own semantics, different from the way std::string does it -- I think I called it a string_handle, and that name didn't stick, nor did it evoke enough responses to merit a (bikeshed?) discussion like this one has. There I proposed:

1. A simple expression-templates-based interface for concatenation and performing lazy transformations.
2. Range-based algorithms that may be aware of certain encodings (not just Unicode).
3. A foundation from which a family of string algorithms can be built.

Having done more GP myself recently, more and more I think the forces that shape the abstractions and the algorithms should be "equal" to come up with any reasonably effective solution. Much of this, though, revolves around what you really want to achieve with the family of algorithms and the abstractions you want to model.
And, I would drop the "immutable" part since this thread was supposed to be about a new thingie that could potentially replace std::string in C++3x..
Unfortunately that's a deal breaker for me. I see a world where immutable strings would be the real solution to a lot of the issues we're facing. And I would think if I (and others who believe) can make it happen, then why shouldn't it be a replacement to std::string in C++4x? :D Thanks David for the encouragement and feedback. /me hunkers down and writes the document. -- Dean Michael Berris about.me/deanberris

On 01/27/2011 02:52 AM, Dean Michael Berris wrote:
... elision by patrick ...
Sorry, but for someone who's dealt with std::string for *a long time (close to 8 years)* here are a few real, painful problems with it:
1. Because of COW implementations you can't deal with it properly in multiple threads without explicitly creating a copy of the string. SSO makes it a lot more unpredictable. There are all sorts of concurrency problems with std::string that are not addressed by the interface.
2. Temporaries abound. Operations return temporaries and copies of data in the string. Substring operations create new strings to avoid the problem with concurrent mutation of the underlying data.
3. It has to be contiguous in memory. The assumption that data() and c_str() should give a view of the string's data at any given time, across mutations, makes extending, shrinking, and generally manipulating it a pain, not only from an interface/performance perspective but also from the point of view of an implementor. Then you look at the resource issues this raises, with potential fragmentation as you keep mutating and growing strings.
It doesn't have to be contiguous, but rather act as if it were. Of course everyone does it contiguously because the alternatives are all a lot worse.
4. Because of the mutability of std::string your iterators *may* be invalidated when the string changes. This is crucial for efficiency concerned code that deals with strings.
You _have_ to treat them as if they were invalidated.
5. Because of the contiguous requirement, using it for any "text" that's larger than a memory page's worth of data will kill your cache coherency -- and then when you modify parts of it you can thank your virtual memory manager when the modifications are done. Then you see that you would have to implement your own segmented data structure to act as a string, and then you realize you're better off not using std::string for situations where the amount of data you're going to deal with is potentially larger than a cache line.
Thank you! This is a nice discussion of some of the advantages of an immutable string vs a mutable string.

AMDG On 1/27/2011 2:52 AM, Dean Michael Berris wrote:
5. Because of the contiguous requirement, using it for any "text" that's larger than a memory page's worth of data will kill your cache coherency -- and then when you modify parts of it then you can thank your virtual memory manager when the modifications are done. Then you see that you would have to implement your own segmented data structure to act as a string and then you realize you're better off not using std::string for situations where the amount of data you're going to deal with is potentially larger than cache line.
I beg your pardon, but this makes no sense to me. Would you mind explaining exactly what kind of usage makes the hardware unhappy and why? I'm not even sure what this has to do with cache coherency (http://en.wikipedia.org/wiki/Cache_coherence). In Christ, Steven Watanabe

On Fri, Jan 28, 2011 at 1:30 PM, Steven Watanabe <watanabesj@gmail.com> wrote:
On 1/27/2011 2:52 AM, Dean Michael Berris wrote:
5. Because of the contiguous requirement, using it for any "text" that's larger than a memory page's worth of data will kill your cache coherency -- and then when you modify parts of it then you can thank your virtual memory manager when the modifications are done. Then you see that you would have to implement your own segmented data structure to act as a string and then you realize you're better off not using std::string for situations where the amount of data you're going to deal with is potentially larger than cache line.
I beg your pardon, but this makes no sense to me. Would you mind explaining exactly what kind of usage makes the hardware unhappy and why?
For multi-core set-ups where you have a NUMA architecture, having one thread allocate memory on a given memory controller (in a given core/CPU) that has to be a given size spanning multiple pages will give your OS a hard time finding at least two contiguous memory pages (especially when memory is a scarce resource). That's the virtual memory manager at work, and that's a performance killer on most modern (and even not so modern) platforms. This means, let's say you have a memory page that's 4kb long and you need a contiguous string that is, say, 9kb; then your OS would have to find 3 contiguous memory pages to fit a single string. Say that again: a single string needing 3 *contiguous* memory pages. If this allocation happens on one core and then the process is migrated for whatever reason to a different core that controls a different memory module, then just accessing that string that's even in L3 cache would be a problem. Then when you change anything in the pages, your write-through cache will see it, and the OS virtual memory manager will mark that page dirty. Next time you need to access memory in the same page on a different core, and especially if the data isn't in L3 cache yet, the OS sees a dirty page and does all the page management necessary to mark that page clean just for reading the data. Now that's fine if the string doesn't grow. But now let's say you concatenate a 4kb string to a 9kb string -- you then need 4 *contiguous* pages to fit all the 13kb of data in a single string. That's on top of the storage already required by either string. Then all hell breaks loose as your VMM has to find these 4 pages that sit right next to each other, potentially paging out stuff just so it can satisfy this simple concatenation request.
The time it takes to get something like that done is just unacceptable whenever you need to do something remotely interesting on a system that should be handling thousands of transactions per second -- or even just on a user's machine where you have absolutely no idea what other programs are running and what kind of architecture they have. :)
I'm not even sure what this has to do with cache coherency (http://en.wikipedia.org/wiki/Cache_coherence).
In the case where you have two strings trying to access the same string, and the string changes on a process that's working on one processor, the processor has to keep the caches of both up-to-date especially if it's in L1 cache. That means if you're building a string in one thread and simultaneously reading it from another, you're stressing the cache coherence mechanisms used by them multi-core machines. On single-core machines that's largely not too much of a problem unless you run into non-local memory access and have to swap things in and out of the cache -- the case for when you have large contiguous chunks of memory that have to be accessed randomly (like strings and vectors). HTH -- Dean Michael Berris about.me/deanberris

AMDG On 1/28/2011 12:49 AM, Dean Michael Berris wrote:
On Fri, Jan 28, 2011 at 1:30 PM, Steven Watanabe<watanabesj@gmail.com> wrote:
On 1/27/2011 2:52 AM, Dean Michael Berris wrote:
5. Because of the contiguous requirement, using it for any "text" that's larger than a memory page's worth of data will kill your cache coherency -- and then when you modify parts of it then you can thank your virtual memory manager when the modifications are done. Then you see that you would have to implement your own segmented data structure to act as a string and then you realize you're better off not using std::string for situations where the amount of data you're going to deal with is potentially larger than cache line.
I beg your pardon, but this makes no sense to me. Would you mind explaining exactly what kind of usage makes the hardware unhappy and why?
For multi-core set-ups where you have a NUMA architecture, having one thread allocate memory that's in a given memory controller (in a given core/CPU) that has to be a given size spanning multiple pages will give your OS a hard time finding at least two contiguous memory pages (especially when memory is a scarce resource). That's the virtual memory manager at work and that's a performance killer on most modern (and even not so modern) platforms.
I'm not sure what this has to do with the OS. The memory manager exists in user space and only goes to the OS when it can't fulfill the request with the memory that it has on hand.
This means, lets say that you have a memory page that's 4kb long, and you need a contiguous string that is say 9kb, then that means your OS would have to find 3 contiguous memory pages to be able to fit a single string. Say that again, a single string needing 3 *contiguous* memory pages. If this allocation happens on one core and then the process is migrated to a different core for whatever reason that controls a different memory module, then just accessing that string that's even in L3 cache would be a problem.
Humph. 9 KB is not all that much memory. Besides, how often do you have strings this big anyway?
Then when you change anything in the pages, your write-through cache will see it, and the OS virtual memory manager will mark that page dirty. Next time you need to access memory in the same page in a different core and especially if the data isn't in L3 cache yet, OS sees a dirty page and does all page management necessary to mark that page clean just for reading the data.
Now that's fine if the string doesn't grow. But now let's say you concatenate a 4kb string to a 9kb string -- you then need 4 *contiguous* pages to fit all the 13kb of data in a single string. That's on top of the storage already required by either string.
So, what you're saying is that requiring contiguous strings interferes with sharing memory between strings and thus increases memory usage. That I can understand, but you already covered this in a separate bullet point.
Then all hell breaks loose as your VMM will have to find these 4 pages that sit right next to each other, potentially paging out stuff just so that it can satisfy this simple concatenation request.
In my experience most programs don't come close enough to exhausting their address spaces to make this a big deal. I don't believe that this is as apocalyptic as you're making it out in general.
The time it takes to get something like that done is just unacceptable whenever you need to do something remotely interesting on a system that should be handling thousands of transactions per second -- or even just on a user's machine where you have absolutely no idea what other programs are running and what kind of architecture they have. :)
You need to be more clear about whether you're talking about physical addresses or virtual addresses. Contiguous pages do not have to be stored in contiguous physical memory. If what you're saying is true then std::deque should be preferable to std::vector, but this is exactly the opposite of what I've always heard.
I'm not even sure what this has to do with cache coherency (http://en.wikipedia.org/wiki/Cache_coherence).
In the case where you have two strings trying to access the same string, and the string changes on a process that's working on one processor, the processor has to keep the caches of both up-to-date especially if it's in L1 cache. That means if you're building a string in one thread and simultaneously reading it from another, you're stressing the cache coherence mechanisms used by them multi-core machines.
This sounds like a bad idea to begin with, never mind the performance implications.
On single-core machines that's largely not too much of a problem unless you run into non-local memory access and have to swap things in and out of the cache -- the case for when you have large contiguous chunks of memory that have to be accessed randomly (like strings and vectors).
Sure, but this happens regardless of whether the string is stored contiguously or not. In fact, I would think that your proposal would make this worse by allowing hidden sharing between apparently unrelated strings. In Christ, Steven Watanabe

On Sat, Jan 29, 2011 at 1:43 AM, Steven Watanabe <watanabesj@gmail.com> wrote:
AMDG
On 1/28/2011 12:49 AM, Dean Michael Berris wrote:
For multi-core set-ups where you have a NUMA architecture, having one thread allocate memory that's in a given memory controller (in a given core/CPU) that has to be a given size spanning multiple pages will give your OS a hard time finding at least two contiguous memory pages (especially when memory is a scarce resource). That's the virtual memory manager at work and that's a performance killer on most modern (and even not so modern) platforms.
I'm not sure what this has to do with the OS. The memory manager exists in user space and only goes to the OS when it can't fulfill the request with the memory that it has on hand.
Which OS are we talking about? At least for Linux I know for a fact that the Virtual Memory Manager is an OS-level service. I don't know Windows internals but I'm willing to say that the virtual memory manager exists as an OS-level service. That means the heap that's available to an application would have to be allocated by a VMM before the application even touches the CPU. Did I get that part wrong?
This means, lets say that you have a memory page that's 4kb long, and you need a contiguous string that is say 9kb, then that means your OS would have to find 3 contiguous memory pages to be able to fit a single string. Say that again, a single string needing 3 *contiguous* memory pages. If this allocation happens on one core and then the process is migrated to a different core for whatever reason that controls a different memory module, then just accessing that string that's even in L3 cache would be a problem.
Humph. 9 KB is not all that much memory. Besides, how often do you have strings this big anyway?
When you're downloading HTML pages and/or building HTML pages, or parsing HTTP requests, it happens a lot. 9kb is a number that's suitable to show that you need multiple contiguous pages for any meaningful discussion on what memory paging effects have something to do with strings and algorithms that deal with strings. Of course this assumes 4kb pages too, the formula would be the same though with just values changing for any platform. ;)
Then when you change anything in the pages, your write-through cache will see it, and the OS virtual memory manager will mark that page dirty. Next time you need to access memory in the same page in a different core and especially if the data isn't in L3 cache yet, OS sees a dirty page and does all page management necessary to mark that page clean just for reading the data.
Now that's fine if the string doesn't grow. But now let's say you concatenate a 4kb string to a 9kb string -- you now then need 5 *contiguous* pages to fit all the 13kb data in a single string. That's on top of the storage that's already required by either string.
So, what you're saying is that requiring contiguous strings interferes with sharing memory between strings and thus increases memory usage. That I can understand, but you already covered this in a separate bullet point.
But the effect of finding the 4 contiguous pages is what I was trying to highlight. Even if you could grow the original buffer of one of the strings, in a std::string it would still have to be a contiguous chunk of bytes anyway. :)
Then all hell breaks loose as your VMM will have to find these 4 pages that sit right next to each other, potentially paging out stuff just so that it can satisfy this simple concatenation request.
In my experience most programs don't come close enough to exhausting their address spaces to make this a big deal. I don't believe that this is as apocalyptic as you're making it out in general.
If we're going to go through "in my experience most programs" I'd say this happens pretty often. ;) Of course I'm talking about network services and basically having to deal with potentially huge data structures in memory along with text that has to be generated and streamed out to network connections. And this usually has to be done at pretty crazy rates (think several thousands of requests per second). This doesn't even consider the case where you not only have to build strings that are potentially huge but also competing with mmap'ed files in the same address space that span multiple contiguous memory pages as well. :)
The time it takes to get something like that done is just unacceptable whenever you need to do something remotely interesting on a system that should be handling thousands of transactions per second -- or even just on a user's machine where you have absolutely no idea what other programs are running and what kind of architecture they have. :)
You need to be more clear about whether you're talking about physical addresses or virtual addresses. Contiguous pages do not have to be stored in contiguous physical memory.
The fact that contiguous pages don't get stored in contiguous physical memory compounds the problem even more. You feel this a lot when your system starts thrashing and the VMM does the work of swapping pages and re-arranging page tables for you. The round-trips between a CPU page fault and the VMM page fault handler along with potentially invalidating L1/L2/L3 caches costs are significant in environments where you need to get as much things done per second as physically possible. If you can avoid having to go through this dance with a sane memory management strategy, I think that's a win even in the trivial case.
If what you're saying is true then std::deque should be preferable to std::vector, but this is exactly the opposite of what I've always heard.
It depends on what you're trying to do and what algorithms you're applying to your std::deque. This is the whole crux of the segmented iterators debate/paper by Matt Austern and others. Segmented versions of the algorithms would make using std::deque a lot more preferable if you really needed a data structure that doesn't promote heap fragmentation as much as a growing std::vector does.
In the case where you have two strings trying to access the same
I should have said "two threads trying"...
string, and the string changes on a process that's working on one processor, the processor has to keep the caches of both up-to-date especially if it's in L1 cache. That means if you're building a string in one thread and simultaneously reading it from another, you're stressing the cache coherence mechanisms used by them multi-core machines.
This sounds like a bad idea to begin with, never mind the performance implications.
Right. Now if you prevented this from even being possible by making a string immutable, I'd say it's a win, no? :)
On single-core machines that's largely not too much of a problem unless you run into non-local memory access and have to swap things in and out of the cache -- the case for when you have large contiguous chunks of memory that have to be accessed randomly (like strings and vectors).
Sure, but this happens regardless of whether the string is stored contiguously or not. In fact, I would think that your proposal would make this worse by allowing hidden sharing between apparently unrelated strings.
The hidden sharing will only be a problem if you're allowing mutation. Otherwise it's a good thing from a VMM/Cache perspective all-around (except for the cache line that has a reference count, but that's largely only touched when copies/temporaries get created/destroyed, and in sane systems would be an atomic increment/decrement anyway).
In Christ, Steven Watanabe
HTH -- Dean Michael Berris about.me/deanberris

On 28 January 2011 12:22, Dean Michael Berris <mikhailberis@gmail.com> wrote: The fact that contiguous pages don't get stored in contiguous physical
memory compounds the problem even more. You feel this a lot when your system starts thrashing and the VMM does the work of swapping pages and re-arranging page tables for you. The round-trips between a CPU page fault and the VMM page fault handler along with potentially invalidating L1/L2/L3 caches costs are significant in environments where you need to get as much things done per second as physically possible. If you can avoid having to go through this dance with a sane memory management strategy, I think that's a win even in the trivial case.
If I have to keep data in physical memory, the way to use the least number of pages is to store it contiguously with respect to virtual memory. Anything else will (except for degenerate cases) require *more* physical pages of RAM. There are many other reasons to break something up into a segmented data structure, but avoiding thrashing when storing the object is not one of them. Honestly, trying to argue that your new class is more efficient *on all fronts* is futile. You'd do far better showing the tradeoffs you are making; which things are more efficient, and being honest about which things are less efficient.
If what you're saying is true then std::deque should be preferable to std::vector, but this is exactly the opposite of what I've always heard.
It depends on what you're trying to do and what algorithms you're
applying to your std::deque. This is the whole crux of the segmented iterators debate/paper by Matt Austern and others. Segmented versions of the algorithms would make using std::deque a lot more preferable if you really needed a data structure that doesn't promote heap fragmentation as much as a growing std::vector does.
Given that a deque requires a new heap allocation for every N elements, how does it avoid heap fragmentation? And I think this is straying off-topic from Boost... -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On Sat, Jan 29, 2011 at 2:51 AM, Nevin Liber <nevin@eviloverlord.com> wrote:
On 28 January 2011 12:22, Dean Michael Berris <mikhailberis@gmail.com> wrote:
The fact that contiguous pages don't get stored in contiguous physical
memory compounds the problem even more. You feel this a lot when your system starts thrashing and the VMM does the work of swapping pages and re-arranging page tables for you. The round-trips between a CPU page fault and the VMM page fault handler along with potentially invalidating L1/L2/L3 caches costs are significant in environments where you need to get as much things done per second as physically possible. If you can avoid having to go through this dance with a sane memory management strategy, I think that's a win even in the trivial case.
If I have to keep data in physical memory, the way to use the least number of pages is to store it contiguously with respect to virtual memory. Anything else will (except for degenerate cases) require *more* physical pages of RAM.
More pages doesn't mean more contiguous pages. If you really needed contiguous pages you already have those containers in place. I never said that the segmented storage for strings will require less memory. ;)
There are many other reasons to break something up into a segmented data structure, but avoiding thrashing when storing the object is not one of them.
Hmmm... ? So how is avoiding the (mostly unnecessary) paging in/out by not requiring contiguous pages not a way of avoiding thrashing? (Too many negatives, too tired to edit that message).
Honestly, trying to argue that your new class is more efficient *on all fronts* is futile. You'd do far better showing the tradeoffs you are making; which things are more efficient, and being honest about which things are less efficient.
Really, look -- I didn't say my new class is more efficient on all fronts, I think you're arguing a strawman here. What was being discussed in this particular line of discourse is the effect of contiguous pages of memory in a data structure like std::vector or std::string. For the cases that you don't need these contiguous pages like in immutable strings, not using contiguous pages is a good thing, better than unnecessarily forcing the use of contiguous pages. It wasn't about the saving of memory, it was about limiting the potential for fragmentation and VMM paging involvement. Also, read the paper now that it's there. This is getting tiring really especially when a strawman red herring is thrown in once in a while.
It depends on what you're trying to do and what algorithms you're
applying to your std::deque. This is the whole crux of the segmented iterators debate/paper by Matt Austern and others. Segmented versions of the algorithms would make using std::deque a lot more preferable if you really needed a data structure that doesn't promote heap fragmentation as much as a growing std::vector does.
Given that a deque requires a new heap allocation for every N elements, how does it avoid heap fragmentation?
Consider the flip side with a std::vector, which doesn't have a reallocate option. It requires a *larger* heap allocation every time it grows. Now riddle me this: what happens to the memory the vector used to occupy? Doesn't having irregular-sized allocations promote more fragmentation than having regular-sized allocations? Having same-sized segments is already a good thing for cases where you have an optimal way of allocating pages -- and most modern OSes have these. Growing a std::deque introduces less fragmentation, and therefore less thrashing, by keeping the sizes of the segments tuned and aligned properly.
And I think this is straying off-topic from Boost...
I agree. Let's stop this talk now. -- Dean Michael Berris about.me/deanberris

AMDG On 1/28/2011 10:22 AM, Dean Michael Berris wrote:
On Sat, Jan 29, 2011 at 1:43 AM, Steven Watanabe<watanabesj@gmail.com> wrote:
I'm not sure what this has to do with the OS. The memory manager exists in user space and only goes to the OS when it can't fulfill the request with the memory that it has on hand.
Which OS are we talking about? At least for Linux I know for a fact that the Virtual Memory Manager is an OS-level service. I don't know Windows internals but I'm willing to say that the virtual memory manager exists as an OS-level service. That means the heap that's available to an application would have to be allocated by a VMM before the application even touches the CPU.
There are two layers. The application has a memory manager in the C library which services most requests. This allocator gets more memory from the OS as needed, but usually doesn't release it back to the OS immediately. In ptmalloc2, which is used by glibc, the default settings work something like this:
* If the requested size is at least 128 KB, use mmap to let the OS handle the request directly.
* If there is a free block available that's big enough, use it.
* Otherwise, try to use sbrk to expand the heap.
* If sbrk fails, use mmap to allocate a 1 MB chunk.
Note that the OS never gets an mmap request smaller than 128 KB and all the memory allocated via sbrk is always contiguous.
This means, lets say that you have a memory page that's 4kb long, and you need a contiguous string that is say 9kb, then that means your OS would have to find 3 contiguous memory pages to be able to fit a single string. Say that again, a single string needing 3 *contiguous* memory pages. If this allocation happens on one core and then the process is migrated to a different core for whatever reason that controls a different memory module, then just accessing that string that's even in L3 cache would be a problem.
Humph. 9 KB is not all that much memory. Besides, how often do you have strings this big anyway?
When you're downloading HTML pages and/or building HTML pages, or parsing HTTP requests, it happens a lot. 9kb is a number that's suitable to show that you need multiple contiguous pages for any meaningful discussion on what memory paging effects have something to do with strings and algorithms that deal with strings. Of course this assumes 4kb pages too, the formula would be the same though with just values changing for any platform. ;)
How big are the strings you're really dealing with? 9 KB is just insignificant even in a 32 (31?)-bit virtual address space. As far as I'm concerned, it isn't big until you're talking megabytes, at least. With 64-bit you can exhaust physical memory and use the entire hard drive for swap and still be an order of magnitude short of using up all the virtual address space, even when the hardware restricts you to 48-bit virtual addresses.
But the effect of finding the 5 contiguous pages is what I was trying to highlight. Even if you could grow the original buffers of one of the strings, in a std::string it would still have to be a contiguous chunk of bytes anyway. :)
The page size is simply irrelevant W.R.T. the effect of allocating contiguous memory.
Then all hell breaks loose as your VMM will have to find these 4 pages that sit right next to each other, potentially paging out stuff just so that it can satisfy this simple concatenation request.
In my experience most programs don't come close enough to exhausting their address spaces to make this a big deal. I don't believe that this is as apocalyptic as you're making it out in general.
If we're going to go through "in my experience most programs" I'd say this happens pretty often. ;)
Well, looking at my system right now, the most memory that any process is using is 420 MB. This is still a lot less than the 2 GB limit. The next memory user isn't even close (180 MB) and after that it drops off rapidly. If you're getting this close to running out of address space, contiguous strings are probably the least of your problems anyway. There's no way the 4 contiguous pages make any noticeable difference whatsoever.
You need to be more clear about whether you're talking about physical addresses or virtual addresses. Contiguous pages do not have to be stored in contiguous physical memory.
The fact that contiguous pages don't get stored in contiguous physical memory compounds the problem even more.
Huh?
You feel this a lot when your system starts thrashing and the VMM does the work of swapping pages and re-arranging page tables for you. The round-trips between a CPU page fault and the VMM page fault handler along with potentially invalidating L1/L2/L3 caches costs are significant in environments where you need to get as much things done per second as physically possible. If you can avoid having to go through this dance with a sane memory management strategy, I think that's a win even in the trivial case.
I understand perfectly well that avoiding swapping helps performance. What I still don't understand is how storing a string with contiguous virtual addresses increases the amount of swapping required. Unless you write your own allocator the odds are that splitting up the string into small chunks will spread it across /more/ pages, increasing your working set.
If what you're saying is true then std::deque should be preferable to std::vector, but this is exactly the opposite of what I've always heard.
It depends on what you're trying to do and what algorithms you're applying to your std::deque. This is the whole crux of the segmented iterators debate/paper by Matt Austern and others. Segmented versions of the algorithms would make using std::deque a lot more preferable if you really needed a data structure that doesn't promote heap fragmentation as much as a growing std::vector does.
I used to like the idea of segmented algorithms. However, I've more recently decided that they just make easy things hard and hard things impossible. The segmented implementation of for_each is bad enough. I don't want even to think about what any algorithm that couldn't easily be implemented in terms of for_each would look like.
string, and the string changes in a process that's running on one processor, the hardware has to keep both caches up-to-date, especially if it's in L1 cache. That means if you're building a string in one thread and simultaneously reading it from another, you're stressing the cache coherence mechanisms used by these multi-core machines.
This sounds like a bad idea to begin with, never mind the performance implications.
Right. Now if you prevented this from even being possible by making a string immutable, I'd say it's a win, no? :)
I disagree. Reading and writing the same data from multiple threads in parallel is a bad idea in general. There's nothing special about strings. If you want to share immutable data, declare it const. Problem solved. No need to force all strings to be immutable just for this.
In fact, I would think that your proposal would make this worse by allowing hidden sharing between apparently unrelated strings.
The hidden sharing will only be a problem if you're allowing mutation. Otherwise it's a good thing from a VMM/Cache perspective all-around (except for the cache line that has a reference count, but that's largely only touched when copies/temporaries get created/destroyed, and in sane systems would be an atomic increment/decrement anyway).
Okay. In Christ, Steven Watanabe

On Sat, Jan 29, 2011 at 6:53 AM, Steven Watanabe <watanabesj@gmail.com> wrote:
AMDG
On 1/28/2011 10:22 AM, Dean Michael Berris wrote:
On Sat, Jan 29, 2011 at 1:43 AM, Steven Watanabe<watanabesj@gmail.com> wrote:
I'm not sure what this has to do with the OS. The memory manager exists in user space and only goes to the OS when it can't fulfill the request with the memory that it has on hand.
Which OS are we talking about? At least for Linux I know for a fact that the Virtual Memory Manager is an OS-level service. I don't know Windows internals but I'm willing to say that the virtual memory manager exists as an OS-level service. That means the heap that's available to an application would have to be allocated by a VMM before the application even touches the CPU.
There are two layers. The application has a memory manager in the C library which services most requests.
Yes.
This allocator gets more memory from the OS as needed, but usually doesn't release it back to the OS immediately.
On Sun systems the implementation of malloc/free works like this. But on Linux and other OSes, as far as I can tell, a call to free will actually tell the system memory manager that the segment(s) or page(s) in question are available.
In ptmalloc2, which is used by glibc, the default settings work something like this: * If the requested size is at least 128 KB, use mmap to let the OS handle the request directly.
Yes.
* If there is a block available that's big enough, use it.
Yep.
* try to use sbrk to expand the heap.
Uhuh.
* If sbrk fails, use mmap to allocate a 1 MB chunk.
And there we go.
Note that the OS never gets a mmap request less than 128 KB and all the memory allocated via sbrk is always contiguous.
Yes, and this is the problem with sbrk: if you rely on your allocator to use sbrk, sbrk is not guaranteed to just expand the available segment in place -- which means the VMM, no matter what you do, will actually have to find the memory for you and give you that chunk of memory. This means potentially swapping pages in/out. You limit this likelihood by asking for page-sized and page-aligned chunks and using those properly; making data in these chunks immutable makes for much more sane "shareability" and a much more predictable performance profile.

The problem is then compounded when your program calls sbrk/brk on one core and fails -- then mmap is called and the memory returned by mmap will actually be in the memory module that's handled by another NUMA CPU. Now imagine using these chunks from another thread that's not running on the same core, and it goes downhill from there. That doesn't even factor in the cache hits/misses that can kill your performance, especially if the VMM has to page in/out memory that has been evicted from the L3 cache of the processors (if there's even an L3).

All of this stems from the fact that you cannot rely on there being enough memory at the start of the program at any given point. Thinking about it differently helps: if you assume that every memory allocation you make will cause a page fault and will potentially cause the VMM to allocate you a page, then you should be using this page of memory properly and saving on the amount of page faults you generate. In the context of playing well with the OS VMM, this means being wise about how you use your memory so that the VMM doesn't have to work too hard when you request a data segment of a given size. This works for sbrk as well, because asking for an extension of the memory available to your program as a "heap" will almost always cause page faults.
Humph. 9 KB is not all that much memory. Besides, how often do you have strings this big anyway?
When you're downloading HTML pages and/or building HTML pages, or parsing HTTP requests, it happens a lot. 9 KB is a number that's suitable to show that you need multiple contiguous pages for any meaningful discussion of what memory paging effects have to do with strings and algorithms that deal with strings. Of course this assumes 4 KB pages too; the formula would be the same for any platform, with just the values changing. ;)
How big are the strings you're really dealing with?
There's really no way to know up front especially with HTTP traffic that's coming in as chunk-encoded bytes. ;)
9 KB is just insignificant even in a 32 (31?)-bit virtual address space.
It is if you have 10,000 concurrent connections and you have to allocate somewhere around that same amount of data for each connection's "buffer".
As far as I'm concerned, it isn't big until you're talking megabytes, at least.
We're not just talking about one string -- I'm talking about several tens of thousands of these potentially large strings.
With 64-bit you can exhaust physical memory and use the entire hard drive for swap and still be an order of magnitude short of using up all the virtual address space, even when the hardware restricts you to 48-bit virtual addresses.
Yes, but the thing is, you don't have to get to that point for your VMM to start swapping things in/out of memory. Especially on NUMA architectures, how Linux behaves is pretty surprising, and that's another can of worms you don't wanna get into. ;)
But the effect of finding the 5 contiguous pages is what I was trying to highlight. Even if you could grow the original buffers of one of the strings, in a std::string it would still have to be a contiguous chunk of bytes anyway. :)
The page size is simply irrelevant W.R.T. the effect of allocating contiguous memory.
It actually is. Note that there's a hardware page size and a virtual page size. Bus bandwidths matter as well regarding how much data you can actually pull from main memory into a CPU's cache. These factors put a predictable upper bound on what you can do efficiently on a given machine. Of course these are things you cannot control a priori in your C++ application. This is also a matter of how well your OS does virtual memory paging.

BUT... what you can control is how you ask for memory from the system. This means that if you can ask for a page-aligned and page-sized chunk of memory -- by knowing sufficiently well how big a "page" is at runtime or a priori -- then that's better memory usage all around, playing well with the VMM and your caches.
If we're going to go through "in my experience most programs" I'd say this happens pretty often. ;)
Well, looking at my system right now, the most memory that any process is using is 420 MB. This is still a lot less than the 2 GB limit. The next memory user isn't even close (180 MB) and after that it drops off rapidly. If you're getting this close to running out of address space, contiguous strings are probably the least of your problems anyway. There's no way the 5 contiguous pages make any noticeable difference whatsoever.
Right, so given this context, you're saying that using a sane memory model wouldn't help? Also, when I said "my experience", I was referring to really demanding C++ applications that had to deal with thousands of transactions per second. Even if each of these transactions just dealt with a few KB worth of data that fits in a page, having a thousand of them happening in one second means having several thousand KB potentially paged in/out every second. :D It's in these kinds of applications that you would really want to avoid the cost of unnecessarily paging data in/out on the VMM side of the equation, or to have a sane allocator that knows how to play nicely with a VMM.
The fact that contiguous pages don't get stored in contiguous physical memory compounds the problem even more.
Huh?
On a NUMA machine with 4 memory controllers total sitting on 4 cores, imagine a page table where 4 contiguous pages of virtual memory are mapped evenly across each of these 4 controllers. Now if you're not making mutations, you're saving yourself the round-trip from the CPU's write-through cache to the appropriate memory controller, and potentially invalidating cached pages on those controllers as well. This is how NUMA changes the equation, how contiguous pages only give you the "illusion" of contiguity, what the hidden cost of mutability is in these situations, and how asking for contiguous virtual memory pages can kill your performance.

Try a simple benchmark: malloc a 1GB randomly generated file into a 64-bit multi-core Intel Nehalem based processor, spawn 1,000 threads on Linux, and on each thread randomly change a value at a randomly picked offset. This will show you how cache misses, VMM paging, and the cost of mutability of data will actually kill your efficiency. Now if you change the program to just randomly read from a part of memory, you'll see how immutability helps. Then change it yet again to malloc 1GB worth of data, but this time from multiple files, breaking the data up into smaller pieces. The dramatic improvement in performance should be evident there. :)
You feel this a lot when your system starts thrashing and the VMM does the work of swapping pages and re-arranging page tables for you. The round-trips between a CPU page fault and the VMM page fault handler, along with the cost of potentially invalidating L1/L2/L3 caches, are significant in environments where you need to get as many things done per second as physically possible. If you can avoid having to go through this dance with a sane memory management strategy, I think that's a win even in the trivial case.
I understand perfectly well that avoiding swapping helps performance. What I still don't understand is how storing a string with contiguous virtual addresses increases the amount of swapping required. Unless you write your own allocator the odds are that splitting up the string into small chunks will spread it across /more/ pages, increasing your working set.
The paper does assume an allocator that knows how to use pages wisely. Vicente Botet has a nice allocplus proposal which IMO is a good/better way to go than the currently constrained allocators defined in C++03. Maybe that will make it to C++0x, I haven't checked yet.
I used to like the idea of segmented algorithms. However, I've more recently decided that they just make easy things hard and hard things impossible. The segmented implementation of for_each is bad enough. I don't even want to think about what any algorithm that couldn't easily be implemented in terms of for_each would look like.
Of course from an implementor's perspective this is true. But from the user perspective there is absolutely 0 difference for the interface of the segmented for_each algorithm. :) We are after all writing libraries that developers (users) will use right? :D
Right. Now if you prevented this from even being possible by making a string immutable, I'd say it's a win, no? :)
I disagree. Reading and writing the same data from multiple threads in parallel is a bad idea in general.
Sure, so how do you discourage that from being encouraged if all the data structures you can share across threads are mutable?
There's nothing special about strings. If you want to share immutable data, declare it const. Problem solved. No need to force all strings to be immutable just for this.
const has not prevented anybody from const_cast<...>'ing anything. It's not sufficient that your immutability guarantee comes only from a language feature that can be overridden. And even if an object is const, an implementor of that object's type has the mutable keyword, giving you a false sense of security even when making calls to const member functions of that type.

Unfortunately, as much as I would like to agree with the statement "there's nothing special about strings", a large body of computer science algorithms dealing primarily with strings (think compression algorithms, distance algorithms, regex matching, parsing, etc.) shows otherwise. Strings are an appropriate abstraction for some kinds of data, and this is especially true if you're dealing with text-based network protocols and user-generated data.

HTH -- Dean Michael Berris about.me/deanberris

On Sat, Jan 29, 2011 at 5:25 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
Try a simple benchmark: malloc a 1GB randomly generated file into a 64-bit multi-core Intel nehalem based processor, spawn 1,000 threads on Linux and randomly on each thread change a value in a randomly picked offset. This will show you how cache misses, VMM paging, and the cost of mutability of data will actually kill your efficiency. Now if you change the program to just randomly read from a part of memory, you'll see how immutability helps. Then change it yet again to malloc 1GB worth of data but this time on multiple files breaking them files up into smaller pieces. The dramatic improvement in performance should be evident there. :)
s/malloc/mmap/g :D -- Dean Michael Berris about.me/deanberris

On 01/29/2011 01:25 AM, Dean Michael Berris wrote:
On Sat, Jan 29, 2011 at 6:53 AM, Steven Watanabe<watanabesj@gmail.com> wrote: ... elision by patrick...
Note that the OS never gets a mmap request less than 128 KB and all the memory allocated via sbrk is always contiguous.
Yes, and this is the problem with sbrk: if you rely on your allocator to use sbrk, sbrk is not guaranteed to just expand the available segment in place -- which means the VMM no matter what you do will actually have to find the memory for you and give you that chunk of memory. This means potentially swapping pages in/out. You limit this likelihood by asking for page-sized and page-aligned chunks and using those properly; making data in these chunks immutable makes for much more sane "shareability" and a much more predictable performance profile.
The problem is then compounded when your program calls sbrk/brk on one core and fails -- then mmap is called and the memory returned by mmap will actually be in the memory module that's handled by another NUMA CPU. Now imagine using these chunks from another thread that's not running on the same core, and it goes downhill from there. That doesn't even factor in the cache hits/misses that can kill your performance, especially if the VMM has to page in/out memory that has been evicted from the L3 cache of the processors (if there's even an L3).
All of this stems from the fact that you cannot rely on there being enough memory at the start of the program at any given point. Thinking about it differently helps this way: if you assume that every memory allocation you make will cause a page fault and will potentially cause the VMM to allocate you a page, then you should be using this page of memory properly and saving on the amount of page faults you generate. In the context of playing well with the OS VMM, this means being wise about how you use your memory so that the VMM doesn't have to work too hard when you request for a data segment of a given size. This works for sbrk as well because asking for an extension in the amount of memory available to your program as a "heap" will cause page faults almost always.
Just food for thought. Have you guys thought about how this changes on a small memory constrained embedded device like a controller or a phone? Patrick

On Sun, Jan 30, 2011 at 1:17 AM, Patrick Horgan <phorgan1@gmail.com> wrote:
Just food for thought. Have you guys thought about how this changes on a small memory constrained embedded device like a controller or a phone?
I have, and in those environments it's going to be a trade-off between:

* Whether you want to be able to deal with long strings and concatenate them without having to have all that data be contiguous in memory.
* Whether you just really need to use an in-memory buffer to deal with data that's in storage, where you also need it to be mutable.

In the first case, you can still use the `chain` type and still get the benefits of cheap concatenation. In the second case, use a vector<char>. Of course, if your phone is using Linux (like on Android) or FreeBSD (like on Apple (? wild guess here, not knowing anything about the internals and just basing it on Mac OS X being based on FreeBSD)), I'd say you have pretty much the same VMM tuned for these devices and you'd get the same benefits even if it isn't NUMA. HTH -- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote:
Yes, and this is the problem with sbrk: if you rely on your allocator to use sbrk, sbrk is not guaranteed to just expand the available segment in place -- which means the VMM no matter what you do will actually have to find the memory for you and give you that chunk of memory. This means potentially swapping pages in/out.
Let me try to understand that; are you saying the same as this: sbrk is not guaranteed to expand into memory that is physically contiguous with the previously-allocated memory. Which means that the OS will have to spend time studying its memory layout data structures to find a suitable area of physical memory to map for you. If that's what you're saying, I believe it, though I'm not convinced that the time taken to do this is often a bottleneck. References would be appreciated. But you might mean something else.
This means potentially swapping pages in/out.
That's the crucial bit. Why do you believe that this allocation could possibly lead to swapping? Do you mean that only on systems that are already low on RAM, or are you suggesting that this could happen even when the overall amount of free RAM is reasonable?
You limit this likelihood by asking for page-sized and page-aligned chunks
When you say "page-sized", do you really mean "page-sized", or do you mean "multiple of the page size"? If you really mean "page-sized", you're suggesting that there should be one call to sbrk or mmap for each 4 kbytes of RAM, right? That seems very wrong to me. [snip]
Note that there's a hardware page size and a virtual page size.
Really? Can you give an example?
Try a simple benchmark: [mmap] a 1GB randomly generated file into a 64-bit multi-core Intel nehalem based processor, spawn 1,000 threads on Linux and randomly on each thread change a value in a randomly picked offset. This will show you how cache misses, VMM paging, and the cost of mutability of data will actually kill your efficiency.
Right, it will be slow.
Now if you change the program to just randomly read from a part of memory, you'll see how immutability helps.
Right, it will be much faster. But if your program actually needs to change the data then it needs to change the data; you would need to compare e.g.

    std::string s; ... s[n]=c;

vs.

    immutable_string s; ... s = s.substr(0,n) + c + s.substr(n+1);

Now if you can demonstrate that being faster, I'll be impressed.
Then change it yet again to [mmap] 1GB worth of data but this time on multiple files breaking them files up into smaller pieces. The dramatic improvement in performance should be evident there. :)
That's an interesting suggestion. Have you actually done this? Can you post some numbers? I would try it myself, but I don't have a suitable NUMA test system right now. Presumably the idea is that the files are distributed randomly between the different processors' memory controllers. Would you get a similar benefit in the case of the monolithic 1 GB file by mmapping it in e.g. 4 or 8 chunks? In any case, I'm not convinced that this is relevant to the memory allocation issue. Regards, Phil.

On Sun, Jan 30, 2011 at 1:23 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Dean Michael Berris wrote:
Yes, and this is the problem with sbrk: if you rely on your allocator to use sbrk, sbrk is not guaranteed to just expand the available segment in place -- which means the VMM no matter what you do will actually have to find the memory for you and give you that chunk of memory. This means potentially swapping pages in/out.
Let me try to understand that; are you saying the same as this:
sbrk is not guaranteed to expand into memory that is physically contiguous with the previously-allocated memory. Which means that the OS will have to spend time studying its memory layout data structures to find a suitable area of physical memory to map for you.
If that's what you're saying, I believe it, though I'm not convinced that the time taken to do this is often a bottleneck. References would be appreciated. But you might mean something else.
Pretty much the same thing. I'm not sure where that quote came from, but just doing `man sbrk` on Linux will give a lot of information on what `sbrk` will do in cases where it can't grow the current segment for the application.

Also, finding suitable memory to use isn't the *only* problem: on systems where you have many threads competing to allocate/deallocate memory, since threads share the address space of the spawning process, you can end up with each thread trying to call sbrk and growing the heap more times than you really want -- although I'm positive that ptmalloc does some locking to make sure this happens sanely on a process' address space, stranger things have happened before. ;)
This means potentially swapping pages in/out.
That's the crucial bit. Why do you believe that this allocation could possibly lead to swapping? Do you mean that only on systems that are already low on RAM, or are you suggesting that this could happen even when the overall amount of free RAM is reasonable?
So let's do a thought experiment here: let's say you have an application that's already built and you run it on any given system. The best case is that it's the only thing running on the system, in which case you have absolute control and you can use all the memory available to you -- games that run on consoles are like this. Even in this situation, where you know the physical limits and characteristics of each machine, you may or may not be talking to a virtual memory manager (I don't know whether Sony PS3s, Xboxes, etc. have an OS service handling memory, but let's assume they do). Even in this best case scenario your VMM can choose to page in/out areas of memory, mapping the virtual address space to physical addresses, depending on the time of day (random) or, in the case of multi-threaded code, the actual order in which operations happen to be scheduled. The best thing you can do even in this best-case scenario is to be prudent with the use of memory -- and if you have a library that does that for you in the best-case scenario, imagine how it would work in the worst-case scenario. ;)

Now let's take the worst-case scenario: your application is actually run on a system that has very little RAM available, and it's not the only program running. Worse, the VMM isn't very smart about handling multiple applications that require RAM in general. If you design for this case and it actually works, imagine how it would work in the best case scenario? :D
You limit this likelihood by asking for page-sized and page-aligned chunks
When you say "page-sized", do you really mean "page-sized", or do you mean "multiple of the page size"? If you really mean "page-sized", you're suggesting that there should be one call to sbrk or mmap for each 4 kbytes of RAM, right? That seems very wrong to me.
No, see, when you call sbrk you typically don't know how much RAM is actually given to your application. And you usually only call sbrk (in the allocator implementation) when you don't have any more available heap space to deal with. :)

What I'm suggesting is that, given you already have enough heap to deal with, make your data structure use page-sized and page-aligned chunks, so that:

1) since a chunk is page-boundary aligned, you know that a page's worth of data doesn't spill into another (contiguous) page, and
2) when the VMM pages data in/out between physical RAM and the cache, it has all the data it needs to deal with in that page.

And even if you don't have enough heap and you get more (with sbrk or mmap), you use the same strategy from a higher level, so that your data structures exploit the cache and the paging mechanism available from the VMM and the hardware you're running on.

Does that make sense? We can split off this discussion if it proves to be more general in the context of a string implementation. ;)
[snip]
Note that there's a hardware page size and a virtual page size.
Really? Can you give an example?
http://en.wikipedia.org/wiki/Page_size -- too lazy to give a specific example, I hope this helps.
Try a simple benchmark: [mmap] a 1GB randomly generated file into a 64-bit multi-core Intel nehalem based processor, spawn 1,000 threads on Linux and randomly on each thread change a value in a randomly picked offset. This will show you how cache misses, VMM paging, and the cost of mutability of data will actually kill your efficiency.
Right, it will be slow.
Now if you change the program to just randomly read from a part of memory, you'll see how immutability helps.
Right, it will be much faster.
But if your program actually needs to change the data then it needs to change the data; you would need to compare e.g.
std::string s; .... s[n]=c;
vs.
immutable_string s; ... s = s.substr(0,n) + c + s.substr(n+1);
Now if you can demonstrate that being faster, I'll be impressed.
Note that in the immutable string case, you won't actually be allocating much more memory -- what you just built is a concatenation tree that refers to the appropriate immutable data blocks or internal tree nodes. You may need to allocate an extra block (or use a "free store" block for individual character storage), but since what you're really doing is building a new string, it's not really a good representative of the cost of mutation, right? :D If you really need to mutate data, the point is you don't use an immutable data structure to do it. :D
Then change it yet again to [mmap] 1GB worth of data but this time on multiple files breaking them files up into smaller pieces. The dramatic improvement in performance should be evident there. :)
That's an interesting suggestion. Have you actually done this?
Yep. And so do other open-source projects like MongoDB. :)
Can you post some numbers? I would try it myself, but I don't have a suitable NUMA test system right now. Presumably the idea is that the files are distributed randomly between the different processors' memory controllers.
Not just that -- what you're doing is giving the VMM a chance to find an appropriate set of smaller virtual page table chunks, rather than the single 1GB allocation you'd be asking for in one go. :)
Would you get a similar benefit in the case of the monolithic 1 GB file by mmapping it in e.g. 4 or 8 chunks? In any case, I'm not convinced that this is relevant to the memory allocation issue.
Assuming that you have an SSD storage device where random access to memory is pretty much "uniform" -- basically having the bus speed and the controller's throughput as the limit on the throughput you can get for moving data from "disk" to main memory -- what you think about is really the virtual page table that's reserved for the 1GB file. If you mmap it in 4 or 8 chunks, what's going to happen is the kernel will see it's the same FD and just give you different addresses that are still contiguous -- and what's even going to happen is it's going to try and do the math and see that "well, you're using the same fd, so I'll treat these mmaps the same way I would normally treat a single mmap". From the OS's view it's an optimization. From a developer/user perspective, in some cases it might be good, in others it might be bad.

However, when you pass in different fd's, the kernel can avoid the synchronization it would do in the single-fd case. And now your contiguous virtual memory chunks have pretty much unbridled "equal opportunity" access to more memory with very little synchronization. That's why systems like MongoDB typically have a chunk size of 200MB or thereabouts: if you want to read data in a given chunk, you get almost unsynchronized access to that 200MB chunk, as compared to mapping a whole XXGB file. ~200MB is not just a magic number either; at least on Linux there is a way of doing a "slab allocation" with the kernel's cooperation, where you can get large chunks of address space for your process without giving other processes too much of a hassle.

This is an interesting discussion and is largely very low-level and system-specific, but the wisdom in the whole discussion really is: prudence with memory use pays dividends, especially if you work with the VMM instead of against it.
There's a good write-up that goes along this same line at the ACM Queue written by Poul-Henning Kamp espousing the way we should be thinking about effectively using memory pages that are given to applications by the OS. It's an interesting read which you can find here: http://queue.acm.org/detail.cfm?id=1814327 HTH :) -- Dean Michael Berris about.me/deanberris

AMDG On 1/29/2011 10:15 AM, Dean Michael Berris wrote:
On Sun, Jan 30, 2011 at 1:23 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Dean Michael Berris wrote:
Note that there's a hardware page size and a virtual page size.
Really? Can you give an example?
http://en.wikipedia.org/wiki/Page_size -- too lazy to give a specific example, I hope this helps.
It doesn't. I see no such distinction on that page. I don't even see how it's possible, since the page tables are what the system uses to map virtual memory to physical memory. In Christ, Steven Watanabe

Dean Michael Berris wrote: [snip lots!]
Does that make sense?
No, sorry. I'm trying to decide whether:

(a) You're describing general problems that I should know about but don't.
(b) You're describing problems that apply to advanced architectures (NUMA) that I have limited experience with.
(c) You're confused, and these problems don't exist at all.

Have you ever had the experience of talking to someone, maybe in a job interview, and thinking "this guy is either a genius or an idiot but I don't know which"? That's what I'm getting here. I hope that wasn't too honest! Sadly, I don't have enough time to delve any deeper. Regards, Phil.

On Jan 29, 2011, at 3:53 PM, Phil Endecott wrote:
Dean Michael Berris wrote: [snip lots!]
Does that make sense?
No, sorry. I'm trying to decide whether
(a) You're describing general problems that I should know about but don't. (b) You're describing problems that apply to advanced architectures (NUMA) that I have limited experience with. (c) You're confused, and these problems don't exist at all.
Have you ever had the experience of talking to someone, maybe in a job interview, and thinking "this guy is either a genius or an idiot but I don't know which"? That's what I'm getting here. I hope that wasn't too honest! Sadly, I don't have enough time to delve any deeper.
I have been trying to stay away from those waters. But... Dean: you do have a way of jumping here and there, a bit at random, pulling a straw here and another there. Why don't you (Dean, that is) just focus on proposing a really good immutable (or partially immutable...) byte sequence type and we see where we go from there? Trying to either convince people that most text handling (on top of such a byte-based sequence type) would benefit from the immutability, performance- or concurrency-wise or that this byte sequence type is somehow entangled with text handling will either fail or confuse and subsequently fail. Being an old FP aficionado (currently working mostly in Clojure, actually), I welcome immutable types. But as Steven Watanabe pointed out, your structure is only partially immutable; I can't remember anymore, but I trust him there. So: please go ahead and suggest an immutable_byte_sequence type and we might be able to (ab-)use it for a lot of cool stuff. /David

On Sun, Jan 30, 2011 at 5:04 AM, David Bergman <David.Bergman@bergmangupta.com> wrote:
I have been trying to stay away from those waters. But... Dean: you do have a way of jumping here and there, a bit at random, pulling a straw here and another there. Why don't you (Dean, that is) just focus on proposing a really good immutable (or partially immutable...) byte sequence type and we see where we go from there?
Okay. :)
Trying to either convince people that most text handling (on top of such a byte-based sequence type) would benefit from the immutability, performance- or concurrency-wise or that this byte sequence type is somehow entangled with text handling will either fail or confuse and subsequently fail.
Okay. :)
Being an old FP aficionado (currently working mostly in Clojure, actually), I welcome immutable types. But as Steven Watanabe pointed out, your structure is only partially immutable; I can't remember anymore, but I trust him there.
So: please go ahead and suggest an immutable_byte_sequence type and we might be able to (ab-)use it for a lot of cool stuff.
Okay. :) /me shutting up now. ;) -- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote: [snip lots!]
Does that make sense?
No, sorry. I'm trying to decide whether
(a) You're describing general problems that I should know about but don't. (b) You're describing problems that apply to advanced architectures (NUMA) that I have limited experience with. (c) You're confused, and these problems don't exist at all.
Have you ever had the experience of talking to someone, maybe in a job interview, and thinking "this guy is either a genius or an idiot but I don't know which"? That's what I'm getting here. I hope that wasn't too honest! Sadly, I don't have enough time to delve any deeper.
<offtopic> At that point, while grading exams, I usually give up trying to understand and award my students with a 'sympathy point'. That saves time on my end and honors the student for trying :-P </offtopic> Sorry, couldn't resist. Regards Hartmut --------------- http://boost-spirit.com

On 1/30/11 5:27 AM, Hartmut Kaiser wrote:
Dean Michael Berris wrote: [snip lots!]
Does that make sense?
No, sorry. I'm trying to decide whether
(a) You're describing general problems that I should know about but don't. (b) You're describing problems that apply to advanced architectures (NUMA) that I have limited experience with. (c) You're confused, and these problems don't exist at all.
Have you ever had the experience of talking to someone, maybe in a job interview, and thinking "this guy is either a genius or an idiot but I don't know which"? That's what I'm getting here. I hope that wasn't too honest! Sadly, I don't have enough time to delve any deeper.
<offtopic> At that point, while grading exams, I usually give up trying to understand and award my students with a 'sympathy point'. That saves time on my end and honors the student for trying :-P </offtopic>
FWIW, I am an advocate of immutable strings and views. I've been advocating that for a long time now (looking back at messages in this mailing list from years back). The very design of immutable sequences is embodied in the design of fusion, for example, where operations return new sequences or views instead of direct manipulation. I won't go into this thread though and I won't go as far as saying that std::string is broken. I believe both should co-exist (immutable and mutable). I also believe that immutable strings and views can be built on top of, and should accommodate, any existing string including the C string through traits and customization points. I think Dean has very valid points that are unfortunately drowned under heavy and wordy rationales which have the tendency of having more holes which are potential reasons for misunderstandings and disagreements. Bottom line: just do it and the work will speak out for itself. Talk is cheap. Output and results (e.g. code and benchmarks) speak better. Regards, -- Joel de Guzman http://www.boostpro.com http://spirit.sf.net

On Sun, Jan 30, 2011 at 4:53 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Dean Michael Berris wrote: [snip lots!]
Does that make sense?
No, sorry.
That's alright. I'll try with less words next time. ;)
I'm trying to decide whether
(a) You're describing general problems that I should know about but don't. (b) You're describing problems that apply to advanced architectures (NUMA) that I have limited experience with. (c) You're confused, and these problems don't exist at all.
Probably (c). ;)
Have you ever had the experience of talking to someone, maybe in a job interview, and thinking "this guy is either a genius or an idiot but I don't know which"?
Yes, and I usually flip a coin or ask someone else. :) Although I don't mind if you consider me an idiot just so that you don't have to think about it anymore. :D
That's what I'm getting here. I hope that wasn't too honest!
I don't know how being honest can get more honest. ;) That's alright though. :)
Sadly, I don't have enough time to delve any deeper.
No worries. Have a good one Phil. :) -- Dean Michael Berris about.me/deanberris

AMDG On 1/29/2011 1:25 AM, Dean Michael Berris wrote:
On Sat, Jan 29, 2011 at 6:53 AM, Steven Watanabe<watanabesj@gmail.com> wrote:
This allocator gets more memory from the OS as needed, but usually doesn't release it back to the OS immediately.
On Sun the implementations of malloc/free are like this. But on Linux and other OSes, as far as I can tell, a call to free will actually tell the system memory manager that the segment(s) or page(s) in question are available.
Wrong. It will only do so if it can, which is definitely not all the time. If you try a naive benchmark, it probably will look like it always releases the memory.
In ptmalloc2, which is used by glibc, the default settings work something like this: * If the requested size is at least 128 KB, use mmap to let the OS handle the request directly.
Yes.
* If there is a block available that's big enough, use it.
Yep.
* try to use sbrk to expand the heap.
Uhuh.
* If sbrk fails, use mmap to allocate a 1 MB chunk.
And there we go.
??? 1 MB is usually a lot bigger than a page. The allocator itself is asking the system for chunks far bigger than the sizes that you seem to be so worried about. I don't understand what you're trying to point out here.
Note that the OS never gets a mmap request less than 128 KB and all the memory allocated via sbrk is always contiguous.
Yes, and this is the problem with sbrk: if you rely on your allocator to use sbrk, sbrk is not guaranteed to just expand the available segment in place
Yes it is. Its interface makes it impossible for it to do anything else.
#include <unistd.h>
int brk(void *addr);
void *sbrk(intptr_t increment);
brk() and sbrk() change the location of the program break, which defines the end of the process's data segment (i.e., the program break is the first location after the end of the uninitialized data segment).
-- which means the VMM no matter what you do will actually have to find the memory for you and give you that chunk of memory. This means potentially swapping pages in/out. You limit this likelihood by asking for page-sized and page-aligned chunks and using those properly;
Why?
making data in these chunks immutable makes for much more sane "shareability" and a much more predictable performance profile.
This is irrelevant to the discussion at hand. I don't want to get into anything except the effects of allocating contiguous buffers, since that seems to be quite enough.
The problem is then compounded when your program calls sbrk/brk on one core and fails -- then mmap is called and the memory returned by mmap will actually be in the memory module that's handled by a NUMA CPU. Now imagine using these chunks from another thread that's not running on the same core and you go downhill from there. That doesn't even factor in the cache hits/misses that can kill your performance, especially if the VMM has to page in/out memory that has been evicted from the L3 cache of the processors (if there's even an L3).
irrelevant.
All of this stems from the fact that you cannot rely on there being enough memory at the start of the program at any given point. Thinking about it differently helps this way: if you assume that every memory allocation you make will cause a page fault and will potentially cause the VMM to allocate you a page, then you should be using this page of memory properly and saving on the amount of page faults you generate.
This assumption doesn't necessarily hold. In fact, the library's allocator probably tries to avoid it. Anyway, even with this assumption we have a single 5 page allocation -> 5 page faults, 5 separate 1 page allocations -> 5 page faults. How does allocating only a page at a time save you anything?
Humph. 9 KB is not all that much memory. Besides, how often do you have strings this big anyway?
When you're downloading HTML pages and/or building HTML pages, or parsing HTTP requests, it happens a lot. 9 KB is a number that's suitable to show that you need multiple contiguous pages for any meaningful discussion of what memory paging effects have to do with strings and the algorithms that deal with them. Of course this assumes 4 KB pages; the formula would be the same on any platform, with just the values changing. ;)
How big are the strings you're really dealing with?
There's really no way to know up front especially with HTTP traffic that's coming in as chunk-encoded bytes. ;)
9 KB is just insignificant even in a 32 (31?)-bit virtual address space.
It is if you have 10,000 concurrent connections and you have to allocate somewhere around that same amount of data for each connection's "buffer".
So why is it a problem if the buffers are contiguous? I don't see why the memory manager can't handle this in a reasonably efficient way.
As far as I'm concerned, it isn't big until you're talking megabytes, at least.
We're not just talking about one string -- I'm talking about several tens of thousands of these potentially large strings.
With 64-bit you can exhaust physical memory and use the entire hard drive for swap and still be an order of magnitude short of using up all the virtual address space, even when the hardware restricts you to 48-bit virtual addresses.
Yes, but the thing is, you don't have to get to that point for your VMM to start swapping things in/out of memory. Especially on NUMA architectures, how Linux behaves is pretty surprising, and that's another can of worms you don't wanna get into. ;)
Of course, but my point was that there will always be a sufficiently large contiguous address range.
But the effect of finding the 5 contiguous pages is what I was trying to highlight. Even if you could grow the original buffers of one of the strings, in a std::string it would still have to be a contiguous chunk of bytes anyway. :)
The page size is simply irrelevant W.R.T. the effect of allocating contiguous memory.
It actually is. Note that there's a hardware page size and a virtual page size. Bus bandwidths matter as well regarding how much data you can actually pull from main memory into a CPU's cache. These factors put a predictable upper bound on what you can do efficiently on a given machine. Of course these are things you cannot control a priori in your C++ application. This is also a matter of how well your OS does virtual memory paging.
You still have to allocate just as much memory whether it's contiguous or not. The page size is not directly correlated to what the allocator can handle efficiently.
BUT... what you can control is how you ask for memory from the system. This means if you can ask for a page-aligned and page-sized worth of memory -- by knowing sufficiently well enough how big a "page" is at runtime or a-priori -- then that's better memory usage all around playing well with the VMM and your caches.
You keep asserting that this is better, but you still haven't explained why, except from a flawed understanding of how the memory manager works. If you just use malloc or new, allocating 4096 bytes (or whatever the page size is) will probably not give you what you want, because the memory manager adds its own header, so the actual memory block will take up slightly more than a page (in an implementation-dependent way).
If we're going to go through "in my experience most programs" I'd say this happens pretty often. ;)
Well, looking at my system right now, the most memory that any process is using is 420 MB. This is still a lot less than the 2 GB limit. The next memory user isn't even close (180 MB) and after that it drops off rapidly. If you're getting this close to running out of address space, contiguous strings are probably the least of your problems anyway. There's no way the 5 contiguous pages make any noticeable difference whatsoever.
Right, so given this context, you're saying that using a sane memory model wouldn't help?
I disagree with what you consider sane.
The fact that contiguous pages don't get stored in contiguous physical memory compounds the problem even more.
Huh?
On a NUMA machine with 4 memory controllers total sitting on 4 cores, imagine a page table where 4 contiguous pages of virtual memory are mapped across each of these 4 controllers evenly. Now if you're not making mutations you're saving yourself the round-trip from the CPU's write-through cache to the appropriate memory controller, and potentially invalidating cached pages on those controllers as well. This is how NUMA changes the equation and how contiguous pages give you the "illusion" of contiguity.
This is the hidden cost of mutability in these situations, and how asking for contiguous virtual memory pages will kill your performance.
What does this have to do with contiguity?
Try a simple benchmark: mmap a 1 GB randomly generated file on a 64-bit multi-core Intel Nehalem-based machine, spawn 1,000 threads on Linux, and on each thread randomly change a value at a randomly picked offset. This will show you how cache misses, VMM paging, and the cost of mutability of data will actually kill your efficiency. Now if you change the program to just randomly read from a part of memory, you'll see how immutability helps. Then change it yet again to mmap 1 GB worth of data, but this time from multiple files, breaking the files up into smaller pieces. The dramatic improvement in performance should be evident there. :)
a) Why use a file, instead of simply allocating that much memory? b) Intentionally bad memory usage is not a good benchmark. c) I might actually try it if you provide source code so I can be sure that we're talking about the same thing.
I understand perfectly well that avoiding swapping helps performance. What I still don't understand is how storing a string with contiguous virtual addresses increases the amount of swapping required. Unless you write your own allocator the odds are that splitting up the string into small chunks will spread it across /more/ pages, increasing your working set.
The paper does assume an allocator that knows how to use pages wisely.
What do you mean by "use pages wisely?" It isn't just a matter of having a reasonable allocator. You would need an allocator specifically tuned for your usage.
Vicente Botet has a nice allocplus proposal which IMO is a good/better way to go than the currently constrained allocators defined in C++03. Maybe that will make it to C++0x, I haven't checked yet.
You mean Ion, I think. Anyway, I don't see how this is relevant.
Right. Now if you prevented this from even being possible by making a string immutable, I'd say it's a win, no? :)
I disagree. Reading and writing the same data from multiple threads in parallel is a bad idea in general.
Sure, so how do you discourage that from being encouraged if all the data structures you can share across threads are mutable?
If you want to use a pure functional language, go ahead. I do not believe in restricting features because they /can/ be abused under /some/ circumstances. Strings don't have to be shared across threads. Besides, from what I saw, your string isn't really immutable. It just doesn't allow mutable access to characters.
There's nothing special about strings. If you want to share immutable data, declare it const. Problem solved. No need to force all strings to be immutable just for this.
const has not prevented anybody from const_cast<...>'ing anything. It's not sufficient that your immutability guarantee only comes from a language feature that can even be overridden.
Sorry, but I have little sympathy for those who abuse const_cast.
And even if an object is const, an implementor of that object's type has the mutable keyword giving you a false sense of security in making calls even to const member functions of that type.
Well don't do that then. Declaring that a type is immutable doesn't make you any safer against this.
Unfortunately as much as I would like to agree with the statement "there's nothing special about strings", a large body of computer science algorithms dealing with strings primarily (think compression algorithms, distance algorithms, regex matching, parsing, etc.) show otherwise. Strings are an appropriate abstraction for some kinds of data and this is especially true if you're dealing with text-based network protocols and user-generated data.
Huh? Why does that mean that strings should be treated differently from everything else? In Christ, Steven Watanabe

On Sun, Jan 30, 2011 at 4:39 AM, Steven Watanabe <watanabesj@gmail.com> wrote:
AMDG
On 1/29/2011 1:25 AM, Dean Michael Berris wrote:
On Sat, Jan 29, 2011 at 6:53 AM, Steven Watanabe<watanabesj@gmail.com> wrote:
This allocator gets more memory from the OS as needed, but usually doesn't release it back to the OS immediately.
On Sun the implementations of malloc/free are like this. But on Linux and other OSes, as far as I can tell, a call to free will actually tell the system memory manager that the segment(s) or page(s) in question are available.
Wrong. It will only do so if it can, which is definitely not all the time. If you try a naive benchmark, it probably will look like it always releases the memory.
I wasn't implying all the time that it will release the memory. Only if the segment is actually unused will it be done so.
* If sbrk fails, use mmap to allocate a 1 MB chunk.
And there we go.
??? 1 MB is usually a lot bigger than a page. The allocator itself is asking the system for chunks far bigger than the sizes that you seem to be so worried about. I don't understand what you're trying to point out here.
Right. I didn't know what I was thinking.
Note that the OS never gets a mmap request less than 128 KB and all the memory allocated via sbrk is always contiguous.
Yes, and this is the problem with sbrk: if you rely on your allocator to use sbrk, sbrk is not guaranteed to just expand the available segment in place
Yes it is. Its interface makes it impossible for it to do anything else.
#include <unistd.h>
int brk(void *addr); void *sbrk(intptr_t increment);
brk() and sbrk() change the location of the program break, which defines the end of the process's data segment (i.e., the program break is the first location after the end of the uninitialized data segment).
You're right, I'm wrong.
-- which means the VMM no matter what you do will actually have to find the memory for you and give you that chunk of memory. This means potentially swapping pages in/out. You limit this likelihood by asking for page-sized and page-aligned chunks and using those properly;
Why?
Why will page-sized/page-aligned chunks avoid the likelihood of (unnecessarily) swapping pages in and out? Because if you didn't take care, your data could spill over a page boundary, and then the number of pages you're touching is one more than you need to touch.
making data in these chunks immutable makes for much more sane "shareability" and a much more predictable performance profile.
This is irrelevant to the discussion at hand. I don't want to get into anything except the effects of allocating contiguous buffers, since that seems to be quite enough.
Ok.
The problem is then compounded when your program calls sbrk/brk on one core and fails -- then mmap is called and the memory returned by mmap will actually be in the memory module that's handled by a NUMA CPU. Now imagine using these chunks from another thread that's not running on the same core and you go downhill from there. That doesn't even factor in the cache hits/misses that can kill your performance, especially if the VMM has to page in/out memory that has been evicted from the L3 cache of the processors (if there's even an L3).
irrelevant.
Ok.
All of this stems from the fact that you cannot rely on there being enough memory at the start of the program at any given point. Thinking about it differently helps this way: if you assume that every memory allocation you make will cause a page fault and will potentially cause the VMM to allocate you a page, then you should be using this page of memory properly and saving on the amount of page faults you generate.
This assumption doesn't necessarily hold. In fact, the library's allocator probably tries to avoid it. Anyway, even with this assumption we have a single 5 page allocation -> 5 page faults, 5 separate 1 page allocations -> 5 page faults. How does allocating only a page at a time save you anything?
If you didn't take care to allocate right-sized chunks, you might be allocating memory that spills over to more than just one page. This means you may be touching more pages than you want, causing one more page fault than necessary.
It is if you have 10,000 concurrent connections and you have to allocate somewhere around that same amount of data for each connection's "buffer".
So why is it a problem if the buffers are contiguous?
It's not if it fits in one page or in the minimum amount of pages. You limit the page faults and thus the paging.
I don't see why the memory manager can't handle this in a reasonably efficient way.
If you touch two pages instead of one, the likelihood that you incur 100% more page faults is higher.
Yes, but the thing is, you don't have to get to that point for your VMM to start swapping things in/out of memory. Especially on NUMA architectures, how Linux behaves is pretty surprising, and that's another can of worms you don't wanna get into. ;)
Of course, but my point was that there will always be a sufficiently large contiguous address range.
Okay.
It actually is. Note that there's a hardware page size and a virtual page size. Bus bandwidths matter as well regarding how much data you can actually pull from main memory into a CPU's cache. These factors put a predictable upper bound on what you can do efficiently on a given machine. Of course these are things you cannot control a priori in your C++ application. This is also a matter of how well your OS does virtual memory paging.
You still have to allocate just as much memory whether it's contiguous or not. The page size is not directly correlated to what the allocator can handle efficiently.
Yes, now if your allocations didn't play along with the page sizes you could be touching more pages than you need to. Let's say you have a std::list of things and each node straddled a page boundary: then touching each node would cause two page faults instead of just one.
BUT... what you can control is how you ask for memory from the system. This means if you can ask for a page-aligned and page-sized worth of memory -- by knowing sufficiently well enough how big a "page" is at runtime or a-priori -- then that's better memory usage all around playing well with the VMM and your caches.
You keep asserting that this is better, but you still haven't explained why, except from a flawed understanding of how the memory manager works. If you just use malloc or new, allocating 4096 bytes (or whatever the page size is) will probably not give you what you want, because the memory manager adds its own header, so the actual memory block will take up slightly more than a page (in an implementation-dependent way).
It's really simple. If your allocator is stupid, then you can't do much about that, right? But if your allocator is smart and you ask for memory in chunk sizes that are equal to or close to the page size, it can give you a page-aligned address. That makes a world of difference in limiting the number of potential page faults, compared to getting data in a non-page-aligned manner in a larger-than-a-page chunk. Of course if you really wanted contiguous data then by all means go get a contiguous block of N pages. But that's not the case I'm talking about, because I'm interested in the case when I don't *need* a contiguous block of pages.
Right, so given this context, you're saying that using a sane memory model wouldn't help?
I disagree with what you consider sane.
Alright, fair enough.
This is the hidden cost of mutability in these situations, and how asking for contiguous virtual memory pages will kill your performance.
What does this have to do with contiguity?
Page-boundary and cross-page access potentially causing more page faults than actually necessary.
Try a simple benchmark: mmap a 1 GB randomly generated file on a 64-bit multi-core Intel Nehalem-based machine, spawn 1,000 threads on Linux, and on each thread randomly change a value at a randomly picked offset. This will show you how cache misses, VMM paging, and the cost of mutability of data will actually kill your efficiency. Now if you change the program to just randomly read from a part of memory, you'll see how immutability helps. Then change it yet again to mmap 1 GB worth of data, but this time from multiple files, breaking the files up into smaller pieces. The dramatic improvement in performance should be evident there. :)
a) Why use a file, instead of simply allocating that much memory?
Because mmap does demand-loading by default -- it's not until you actually access the memory that a page fault actually happens. This gives you an idea how the page faults actually affect the solution.
b) Intentionally bad memory usage is not a good benchmark.
Random access to a huge chunk of memory is not intentionally bad memory usage -- that's a pathological case for caches and storage services like RDBMS etc.
c) I might actually try it if you provide source code so I can be sure that we're talking about the same thing.
I just might do that later on. ;)
The paper does assume an allocator that knows how to use pages wisely.
What do you mean by "use pages wisely?" It isn't just a matter of having a reasonable allocator. You would need an allocator specifically tuned for your usage.
Use pages wisely: lay out a tree, for example, on a page of memory or in as few pages as possible. Put chunks of data in a page-aligned block. Things like these. If you can tell the allocator to give you a page-aligned or page-sized chunk of data then you can go ahead and do the right thing with the data, limiting the chances that you're causing unnecessary page faults.
Vicente Botet has a nice allocplus proposal which IMO is a good/better way to go than the currently constrained allocators defined in C++03. Maybe that will make it to C++0x, I haven't checked yet.
You mean Ion, I think. Anyway, I don't see how this is relevant.
Yes, I meant Ion. Sorry about that. If that allocator knew how long a page is at runtime and allowed for page-aligned chunks to be reserved as a "free store" with some sort of bin-packing heuristic, then it could limit the page faults induced when users of the data in the allocator actually access memory in these areas.
Sure, so how do you discourage that from being encouraged if all the data structures you can share across threads are mutable?
If you want to use a pure functional language, go ahead. I do not believe in restricting features because they /can/ be abused under /some/ circumstances. Strings don't have to be shared across threads. Besides, from what I saw, your string isn't really immutable. It just doesn't allow mutable access to characters.
There are many things that don't have to be shared across threads but unfortunately they do get shared. And no I don't want to use a pure functional language. If the type didn't allow mutable access to characters, how is it not an immutable string?
There's nothing special about strings. If you want to share immutable data, declare it const. Problem solved. No need to force all strings to be immutable just for this.
const has not prevented anybody from const_cast<...>'ing anything. It's not sufficient that your immutability guarantee only comes from a language feature that can even be overridden.
Sorry, but I have little sympathy for those who abuse const_cast.
You and I both. But then it's a language feature.
And even if an object is const, an implementor of that object's type has the mutable keyword giving you a false sense of security in making calls even to const member functions of that type.
Well don't do that then. Declaring that a type is immutable doesn't make you any safer against this.
Which isn't what I'm going to do, you're right. But I was pointing out that nobody's stopping anybody from const_casting and making things mutable.
Unfortunately as much as I would like to agree with the statement "there's nothing special about strings", a large body of computer science algorithms dealing with strings primarily (think compression algorithms, distance algorithms, regex matching, parsing, etc.) show otherwise. Strings are an appropriate abstraction for some kinds of data and this is especially true if you're dealing with text-based network protocols and user-generated data.
Huh? Why does that mean that strings should be treated differently from everything else?
Because there are special cases and algorithms that only make sense for strings? -- Dean Michael Berris about.me/deanberris

AMDG On 1/29/2011 1:18 PM, Dean Michael Berris wrote:
On Sun, Jan 30, 2011 at 4:39 AM, Steven Watanabe<watanabesj@gmail.com> wrote:
On 1/29/2011 1:25 AM, Dean Michael Berris wrote:
-- which means the VMM no matter what you do will actually have to find the memory for you and give you that chunk of memory. This means potentially swapping pages in/out. You limit this likelihood by asking for page-sized and page-aligned chunks and using those properly;
Why?
Why will page-sized/page-aligned chunks avoid the likelihood of (unnecessarily) swapping pages in and out? Because if you didn't take care, your data could spill over a page boundary, and then the number of pages you're touching is one more than you need to touch.
One more comment, and then I'm done with this thread. I agree that page-alignment will help reduce swapping. What I remain unconvinced about are the benefits of allocating only single pages. In Christ, Steven Watanabe

On Sun, Jan 30, 2011 at 6:32 AM, Steven Watanabe <watanabesj@gmail.com> wrote:
AMDG
On 1/29/2011 1:18 PM, Dean Michael Berris wrote:
On Sun, Jan 30, 2011 at 4:39 AM, Steven Watanabe<watanabesj@gmail.com> wrote:
On 1/29/2011 1:25 AM, Dean Michael Berris wrote:
-- which means the VMM no matter what you do will actually have to find the memory for you and give you that chunk of memory. This means potentially swapping pages in/out. You limit this likelihood by asking for page-sized and page-aligned chunks and using those properly;
Why?
Why will page-sized/page-aligned chunks reduce the likelihood of (unnecessarily) swapping pages in and out? Because if you don't take care, your data spills over a page boundary, and the number of pages you're touching is one more than you actually need to touch.
One more comment, and then I'm done with this thread. I agree that page-alignment will help reduce swapping. What I remain unconvinced about are the benefits of allocating only single pages.
Right. When I say "allocate", I use it as in "ask a C++ allocator to give me a chunk" -- as in call the allocate member of the allocator instance. The thought really is prudence: ask only for what you need and make the most of what you have.

In the string context: for concatenation nodes, laying them out in a page-aligned/aware manner and packing as many of them into a single page as possible is generally a good thing -- better than allocating willy-nilly from anywhere, without regard for page boundaries. For block nodes, this means having a "usable block" that can be shared by multiple strings and referred to by concatenation nodes.

In general, since you cannot control the growth factor for heap spaces from within a C++ application, using predictably sized units (like page-sized chunks of memory) for *segmented* data structures is a good thing. Again, if you knew beforehand that you need contiguous memory *and* don't really care about page alignment, then you don't need this scheme. When growing segmented data structures, having a predictable (preferably constant) growth characteristic or segment size is usually a good approach. This requires cooperation from an allocator that knows how to reserve/grow/shrink appropriately sized memory that is, again, page-aligned for maximum efficiency.

HTH. FWIW, I'm done with this thread too now. -- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote:
For multi-core set-ups where you have a NUMA architecture, having one thread allocate, from a given memory controller (on a given core/CPU), a chunk whose size spans multiple pages will give your OS a hard time finding at least two contiguous memory pages (especially when memory is a scarce resource). That's the virtual memory manager at work, and that's a performance killer on most modern (and even not-so-modern) platforms.
Any chance you could cite some references for that? I can't help feeling that perhaps you have read something about finding contiguous _physical_ pages, which certainly could be a problem for e.g. kernel code but does not apply here, and have misinterpreted it as referring to _virtual_ pages. Thanks, Phil.

Artyom wrote:
1. Why do YOU think you'll be able to create something "better"?
Why does anyone think they can improve upon the past? Perhaps they will and perhaps not. Should that preclude the attempt?
2. Why do YOU think boost::string would be adopted in favor of std::string or one of the current widely used QString/ustring/wxString/UnicodeString/CString?
The obvious answer is that such adoption would occur if it was sufficiently superior in interface, functionality, performance, etc., or some combination thereof. There is a tradeoff one must make: the pain of switching versus the benefits that accrue. Obviously, the more similar the interface of boost::string (which, in Dean's vision, might just be a typedef for view<some_default_encoding_like_utf_8>) to the existing string types, the more readily it can be switched. However, that may, or may not, be a good thing. If the behavior is sufficiently different, such as throwing exceptions (or no longer throwing exceptions) in specific use cases, runtime behavior will change silently. In such cases, a completely different interface would be better as it would force the programmer to examine each use separately.
3. What painful problems are you going to solve that would make it so much better than the widely used and adopted std::string? Iterators? Mutability? Performance?
(Clue: there are no painful problems with std::string)
Clue: there are many painful problems with std::string. It is infinitely better than having no string class in the standard, but it has problems. Dean was fairly exhaustive on this point, so I'll not say more here.
1. Accept it that there is quite small chance that something that is not std::string would be widely accepted
I disagree. std::string has a significantly better chance than any arbitrary tom_dick_and_harry::string, all things being equal. However, the goal of this thread, and others, is to consider how to create a better string, not just another. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

On Fri, 21 Jan 2011 20:07:51 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
[...] Mostly I'm interested in seeing a string class that is:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it. [...]
I'm confused by this. You want the basic type to always act as if it's const, with no way to modify the string at ALL after it's been created?
I think strings are different from the encoding they're interpreted as. Let's fix the problem of a string data structure first, then tack on encoding/decoding as something that depends on the string abstraction.
That gets back to the problem that I was originally trying to solve with the UTF types: that a string needs a way to carry around its encoding. A UTF-8 type could be built on such a thing very easily. -- Chad Nelson Oak Circle Software, Inc. * * *

On Sat, Jan 22, 2011 at 1:37 AM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Fri, 21 Jan 2011 20:07:51 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
[...] Mostly I'm interested in seeing a string class that is:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it. [...]
I'm confused by this. You want the basic type to always act as if it's const, with no way to modify the string at ALL after it's been created?
Yep. No changing arbitrary content in the string. Concatenation is a process of creating new strings.
I think strings are different from the encoding they're interpreted as. Let's fix the problem of a string data structure first, then tack on encoding/decoding as something that depends on the string abstraction.
That gets back to the problem that I was originally trying to solve with the UTF types: that a string needs a way to carry around its encoding. A UTF-8 type could be built on such a thing very easily.
Hmm... I OTOH don't think the encoding should be part of the string. The encoding is really external to the string, more like a function that is applied to the string. If you can wrap the string in a UTF-8, UTF-16, UTF-32 encoder/decoder then that should be the way to go. However building it into the string is not something that will scale in case there are other encodings that would be supported -- think about not just Unicode, but things like Base64, Zip, <insert encoding here>. Ultimately the underlying string should be efficient and could be operated upon in a predictable manner. It should be lightweight so that it can be referred to in many different situations and there should be an infinite number of possibilities for what you can use a string for. -- Dean Michael Berris about.me/deanberris

At Sat, 22 Jan 2011 01:56:36 +0800, Dean Michael Berris wrote:
On Sat, Jan 22, 2011 at 1:37 AM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Fri, 21 Jan 2011 20:07:51 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
[...] Mostly I'm interested in seeing a string class that is:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it. [...]
I'm confused by this. You want the basic type to always act as if it's const, with no way to modify the string at ALL after it's been created?
Yep.
No changing arbitrary content in the string. Concatenation is a process of creating new strings.
But you're allowing assignment. That's not acting "as if it's const, with no way to modify the string" -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Sat, Jan 22, 2011 at 3:18 AM, Dave Abrahams <dave@boostpro.com> wrote:
At Sat, 22 Jan 2011 01:56:36 +0800, Dean Michael Berris wrote:
No changing arbitrary content in the string. Concatenation is a process of creating new strings.
But you're allowing assignment. That's not acting "as if it's const, with no way to modify the string"
Unless you frame assignment in terms of a "move". x = "This is the original"; x = "Not anymore"; What's happening here is that you're really making x refer to a different string. In essence, x is what you might call a proxy. You can change what the proxy refers to, but what it refers to you cannot change -- if that makes any sense. If you're reading or dealing with x, basically you're dealing with the proxy. So when you're doing concatenation, what's really happening is you're building a new string and making the proxy refer to that new string. x = "Hello,"; x = x ^ " World!"; Note this doesn't violate the value semantics of the object much like how pointers provide value semantics (because they are values). -- Dean Michael Berris about.me/deanberris

At Sat, 22 Jan 2011 03:40:44 +0800, Dean Michael Berris wrote:
On Sat, Jan 22, 2011 at 3:18 AM, Dave Abrahams <dave@boostpro.com> wrote:
At Sat, 22 Jan 2011 01:56:36 +0800, Dean Michael Berris wrote:
No changing arbitrary content in the string. Concatenation is a process of creating new strings.
But you're allowing assignment. That's not acting "as if it's const, with no way to modify the string"
Unless you frame assignment in terms of a "move".
x = "This is the original"; x = "Not anymore";
What's happening here is that you're really making x refer to a different string. In essence, x is what you might call a proxy. You can change what the proxy refers to, but what it refers to you cannot change -- if that makes any sense. If you're reading or dealing with x, basically you're dealing with the proxy.
Sorry, no, that's not value semantics. Value semantics are a subset of Stepanov's "Regular Type" concept. See EOP. Let me be clear: when you assign into x, you are modifying its value. If that can happen when x is const, x doesn't have proper value semantics. Implementation details like underlying buffers and refcounting are irrelevant. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Sat, Jan 22, 2011 at 3:56 AM, Dave Abrahams <dave@boostpro.com> wrote:
At Sat, 22 Jan 2011 03:40:44 +0800, Dean Michael Berris wrote:
On Sat, Jan 22, 2011 at 3:18 AM, Dave Abrahams <dave@boostpro.com> wrote:
At Sat, 22 Jan 2011 01:56:36 +0800, Dean Michael Berris wrote:
No changing arbitrary content in the string. Concatenation is a process of creating new strings.
But you're allowing assignment. That's not acting "as if it's const, with no way to modify the string"
Unless you frame assignment in terms of a "move".
x = "This is the original"; x = "Not anymore";
What's happening here is that you're really making x refer to a different string. In essence, x is what you might call a proxy. You can change what the proxy refers to, but what it refers to you cannot change -- if that makes any sense. If you're reading or dealing with x, basically you're dealing with the proxy.
Sorry, no, that's not value semantics. Value semantics are a subset of Stepanov's "Regular Type" concept. See EOP.
Right. So in the above, 'x' is assignable, copyable, and equality-comparable. It can enforce an ordering (total or partial) and it can be default constructed. I don't see why it doesn't support the "regular type" concept or the value semantics "protocol".
Let me be clear: when you assign into x, you are modifying its value. If that can happen when x is const, x doesn't have proper value semantics. Implementation details like underlying buffers and refcounting are irrelevant.
Agreed. So it would be ill-formed to do something like this: boost::string const x("This is the initial value"); x = "Another value"; // compile error, X const doesn't have an assignment operator I didn't mean to imply that a `boost::string const` should be assignable -- I was mostly thinking about the case where it is a non-const lvalue. In which case I don't see why a proxy object wouldn't qualify as something that follows the value semantics protocol. HTH -- Dean Michael Berris about.me/deanberris

On Sun, Jan 23, 2011 at 9:08 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
What's happening here is that you're really making x refer to a different string. In essence, x is what you might call a proxy. You can change what the proxy refers to, but what it refers to you cannot change -- if that makes any sense. If you're reading or dealing with x, basically you're dealing with the proxy.
Sorry, no, that's not value semantics. Value semantics are a subset of Stepanov's "Regular Type" concept. See EOP.
Right.
So in the above, 'x' is assignable, copyable, and equality-comparable. It can enforce an ordering (total or partial) and it can be default constructed. I don't see why it doesn't support the "regular type" concept or the value semantics "protocol".
Let me be clear: when you assign into x, you are modifying its value. If that can happen when x is const, x doesn't have proper value semantics. Implementation details like underlying buffers and refcounting are irrelevant.
Agreed. So it would be ill-formed to do something like this:
boost::string const x("This is the initial value"); x = "Another value"; // compile error, X const doesn't have an assignment operator
I didn't mean to imply that a `boost::string const` should be assignable -- I was mostly thinking about the case where it is a non-const lvalue. In which case I don't see why a proxy object wouldn't qualify as something that follows the value semantics protocol.
Sorry, this was just a miscommunication, because you were talking about implementation and I was talking about interface. Once you said "refers to" all my reference-semantics bells went off. I was just trying to get you to nail down the desired semantics (i.e. no implementation details). -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 21 January 2011 11:56, Dean Michael Berris <mikhailberis@gmail.com> wrote:
No changing arbitrary content in the string. Concatenation is a process of creating new strings.
Ultimately the underlying string should be efficient
Please describe how you are going to make this efficient, if concatenation effectively requires an allocation (once past the small string optimization) and a copy. It is one thing to be in a garbage collected world where the cost of allocation is relatively cheap, but that world is not C++. -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On Sat, Jan 22, 2011 at 5:38 AM, Nevin Liber <nevin@eviloverlord.com> wrote:
On 21 January 2011 11:56, Dean Michael Berris <mikhailberis@gmail.com> wrote:
No changing arbitrary content in the string. Concatenation is a process of creating new strings.
Ultimately the underlying string should be efficient
Please describe how you are going to make this efficient, if concatenation effectively requires an allocation (once past the small string optimization) and a copy.
It is one thing to be in a garbage collected world where the cost of allocation is relatively cheap, but that world is not C++.
Correct. So this implementation pre-supposes that there is an efficient "arena" allocator, not dissimilar from Boost.Pool, or one that allows "growing" of buffers (via a call similar to realloc), or even something akin to what Interprocess has in the form of a `managed_heap_memory`. You can then implement an optimization for concatenation if you know that the string to be concatenated is referred to by only one referrer/proxy, that the RHS is a temporary, and that the resulting string will only be assigned. This implies that you need some smarts in the implementation -- potentially an EDSL just for string concatenation/operations. Needless to say, Proto would be a good fit for this EDSL, so that you have a way of knowing whether strings are temporaries (generated in-place from literal strings) or lvalues. HTH -- Dean Michael Berris about.me/deanberris

On Sat, 22 Jan 2011 01:56:36 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
I think strings are different from the encoding they're interpreted as. Let's fix the problem of a string data structure first, then tack on encoding/decoding as something that depends on the string abstraction.
That gets back to the problem that I was originally trying to solve with the UTF types: that a string needs a way to carry around its encoding. A UTF-8 type could be built on such a thing very easily.
Hmm... I OTOH don't think the encoding should be part of the string. The encoding is really external to the string, more like a function that is applied to the string.
It's a property of the string. It may change, but some encoding (even if it's just "none") should be associated with a particular string throughout its existence. Otherwise you might as well use the existing std::string.
If you can wrap the string in a UTF-8, UTF-16, UTF-32 encoder/decoder then that should be the way to go. However building it into the string is not something that will scale in case there are other encodings that would be supported -- think about not just Unicode, but things like Base64, Zip, <insert encoding here>.
I assume that there is some unique identification for each language and encoding, or that one could be created. But that's too big a task for one volunteer developer, so my UTF classes are intended only to handle the three types that can encode any Unicode code-point.
Ultimately the underlying string should be efficient and could be operated upon in a predictable manner. It should be lightweight so that it can be referred to in many different situations and there should be an infinite number of possibilities for what you can use a string for.
You've just described std::string. Or alternately, std::vector<char>. -- Chad Nelson Oak Circle Software, Inc. * * *

On Sat, Jan 22, 2011 at 10:43 AM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Sat, 22 Jan 2011 01:56:36 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
I think strings are different from the encoding they're interpreted as. Let's fix the problem of a string data structure first, then tack on encoding/decoding as something that depends on the string abstraction.
That gets back to the problem that I was originally trying to solve with the UTF types: that a string needs a way to carry around its encoding. A UTF-8 type could be built on such a thing very easily.
Hmm... I OTOH don't think the encoding should be part of the string. The encoding is really external to the string, more like a function that is applied to the string.
It's a property of the string. It may change, but some encoding (even if it's just "none") should be associated with a particular string throughout its existence. Otherwise you might as well use the existing std::string.
I think I disagree with this. A string is by definition a sequence of something -- a string of integers, a string of events, a string of characters. Encoding is not an intrinsic property of a string.

As for using the existing std::string, I think the problem *is* std::string and the way it's implemented. In particular, allowing mutation of individual arbitrary elements makes users who don't need that mutation pay for the cost of having it. Because of this requirement, things like SSO, copy-on-write optimizations(?), and all the other algorithmic baggage that comes with the std::string implementation make it a really bad basic string for the language. In a world where individual element mutation is a requirement, std::string may well be an acceptable implementation. In cases where you don't need to mutate any character already in the string, it's a really bad string implementation. For the purpose of interpreting a string as something else, you don't need mutation -- and hence you gain a lot by having a string that is immutable but interpretable in many different ways.

Consider the case where, for example, I want to interpret the same string as UTF-8 and then later on as UTF-32. In your proposal I would need to copy the type that has a UTF-8 encoding into another type that has a UTF-32 encoding. If the copy were trivial enough not to give any programmer pause, that would be a good thing -- which is why an immutable string is something your implementation would benefit from, from a "plumbing" perspective.
If you can wrap the string in a UTF-8, UTF-16, UTF-32 encoder/decoder then that should be the way to go. However building it into the string is not something that will scale in case there are other encodings that would be supported -- think about not just Unicode, but things like Base64, Zip, <insert encoding here>.
I assume that there is some unique identification for each language and encoding, or that one could be created. But that's too big a task for one volunteer developer, so my UTF classes are intended only to handle the three types that can encode any Unicode code-point.
Sure, but that doesn't mean that you can't design it in a way that others can extend it appropriately. This was/is the beauty of how the iterator/range abstraction works out for generic code.
Ultimately the underlying string should be efficient and could be operated upon in a predictable manner. It should be lightweight so that it can be referred to in many different situations and there should be an infinite number of possibilities for what you can use a string for.
You've just described std::string. Or alternately, std::vector<char>.
Except these are mutable containers which are exactly what I *don't* want. -- Dean Michael Berris about.me/deanberris

On Jan 23, 2011, at 9:34 PM, Dean Michael Berris wrote:
On Sat, Jan 22, 2011 at 10:43 AM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Sat, 22 Jan 2011 01:56:36 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
I think strings are different from the encoding they're interpreted as. Let's fix the problem of a string data structure first, then tack on encoding/decoding as something that depends on the string abstraction.
That gets back to the problem that I was originally trying to solve with the UTF types: that a string needs a way to carry around its encoding. A UTF-8 type could be built on such a thing very easily.
Hmm... I OTOH don't think the encoding should be part of the string. The encoding is really external to the string, more like a function that is applied to the string.
It's a property of the string. It may change, but some encoding (even if it's just "none") should be associated with a particular string throughout its existence. Otherwise you might as well use the existing std::string.
I think I disagree with this. A string is by definition a sequence of something -- a string of integers, a string of events, a string of characters. Encoding is not an intrinsic property of a string.
Ok... it feels like you are changing the rules as we play, instead of admitting "defeat" ;-) Or, did you indeed talk about *generic sequences* this whole time? If so, why the focus on encoding strategies for characters? /David

On Mon, Jan 24, 2011 at 11:51 AM, David Bergman <David.Bergman@bergmangupta.com> wrote:
On Jan 23, 2011, at 9:34 PM, Dean Michael Berris wrote:
I think I disagree with this. A string is by definition a sequence of something -- a string of integers, a string of events, a string of characters. Encoding is not an intrinsic property of a string.
Ok... it feels like you are changing the rules as we play, instead of admitting "defeat" ;-)
Or, did you indeed talk about *generic sequences* this whole time? If so, why the focus on encoding strategies for characters?
Huh? I've always been pointing out that strings should just be immutable and agnostic of the encoding and have the encoding enforced externally to the string. Are you confusing me for someone else? My assertion has been from the beginning: 1. Let's focus on a string class first that is (arguably) better than std::string by making it efficient, immutable, and does proper value semantics. 2. Once we have this then let's build upon it to allow for multiple ways of interpreting the *contents* of the string. I'm inclined to think you're confusing me for someone else while replying to my message above. -- Dean Michael Berris about.me/deanberris

On Jan 24, 2011, at 12:13 AM, Dean Michael Berris wrote:
On Mon, Jan 24, 2011 at 11:51 AM, David Bergman <David.Bergman@bergmangupta.com> wrote:
On Jan 23, 2011, at 9:34 PM, Dean Michael Berris wrote:
I think I disagree with this. A string is by definition a sequence of something -- a string of integers, a string of events, a string of characters. Encoding is not an intrinsic property of a string.
Ok... it feels like you are changing the rules as we play, instead of admitting "defeat" ;-)
Or, did you indeed talk about *generic sequences* this whole time? If so, why the focus on encoding strategies for characters?
Huh?
Why did you suddenly mention string as being an alias for what we usually denote a 'sequence' (of which std::vector is a model, by the way)?
I've always been pointing out that strings should just be immutable and agnostic of the encoding and have the encoding enforced externally to the string.
Are you confusing me for someone else?
No, not at all. There were two - for me - pretty awkward statements made by you, indicating a lack of coherence:

1. Your answer regarding what you meant by 'smarter iterator', which was a tautology adding no information at all: "The way I was thinking about it, "smarter" would mean something along the lines of "knows more than your average <thing>" where <thing> is a bare iterator." Yes, that touches the definition of 'smarter'... but surely you understand that we (or he...) wondered *in what way* it was smarter; you *did* in fact expand a little on that later, I can give you that...

2. Your sudden proclamation that a string is a sequence of anything, indicating that you have been talking about a new sequence concept (variant) all this time, capable of holding stuff quite distinct from characters. Yes, that is one meaning of 'string' in a strict sense, but (I hope it was clear) it is not the meaning used in this specific discussion; so that switch of interpretation of the term probably does not make the discussion more focused.
My assertion has been from the beginning:
1. Let's focus on a string class first that is (arguably) better than std::string by making it efficient, immutable, and does proper value semantics.
2. Once we have this then let's build upon it to allow for multiple ways of interpreting the *contents* of the string.
I'm inclined to think you're confusing me for someone else while replying to my message above.
No, I did not. Sorry. By what you said above, you also add this point, which de-coheres the picture quite a bit: 3. This string class should be able to manifest sequences of anything, including events or arbitrary objects. /David

On Mon, Jan 24, 2011 at 1:41 PM, David Bergman <David.Bergman@bergmangupta.com> wrote:
On Jan 24, 2011, at 12:13 AM, Dean Michael Berris wrote:
On Mon, Jan 24, 2011 at 11:51 AM, David Bergman <David.Bergman@bergmangupta.com> wrote:
On Jan 23, 2011, at 9:34 PM, Dean Michael Berris wrote:
I think I disagree with this. A string is by definition a sequence of something -- a string of integers, a string of events, a string of characters. Encoding is not an intrinsic property of a string.
Ok... it feels like you are changing the rules as we play, instead of admitting "defeat" ;-)
Or, did you indeed talk about *generic sequences* this whole time? If so, why the focus on encoding strategies for characters?
Huh?
Why did you suddenly mention string as being an alias for what we usually denote a 'sequence' (of which std::vector is a model, by the way)?
Hmmm... In English, this is usually done to frame a discussion: you say what you *think* is being discussed before you lay your argument down. This is normal for making a sound discourse/argument for or against a given position. I was merely saying that a string of something is a collection of something -- stating a basis for the arguments that follow. Also, I wasn't even going down the "generic sequence" route. The reason basic_string is a template in the STL is precisely that you can instantiate it with the "character" type of your choice. It's perfectly fine if someone came up with a character-like type that you can feed into basic_string<> (something like wchar_t, or your own "character-like" type). I wasn't suggesting replacing std::vector, because what I was interested in solving is the problem of dealing with strings of characters -- or, in the context of the discussion, "text" -- which just so happens to be concerned (quite narrowly, I believe) with encoding/decoding of data in a certain format.
I've always been pointing out that strings should just be immutable and agnostic of the encoding and have the encoding enforced externally to the string.
Are you confusing me for someone else?
No, not at all. There were two - for me - pretty awkward statements made by you, indicating a lack of coherence:
1. Your answer regarding what you meant by 'smarter iterator', which was a tautology adding no information at all:
"The way I was thinking about it, "smarter" would mean something along the lines of "knows more than your average <thing>" where <thing> is a bare iterator."
Yes, that touches at the definition of 'smarter'... but surely you understand that we (or he...) wondered *in what way* it was smarter; you *did* in fact extend a little bit on that later, I can give you that...
So I don't understand what you're saying here... that my trying to define what I mean by 'smarter' is... awkward? Also, smarter can be interpreted many different ways -- it could mean clever, contrived, has more capabilities, etc. I was simply stating that I used the word 'smarter' in a sense that implies it knows more than the average iterator. And I did make the point of what a smarter iterator would look like. So now, I think *I'm* the one confused about you saying I'm changing the rules, when everything I've said has been consistently toward a different implementation (and semantics) for what a string should be.
2. Your sudden proclamation that a string is a sequence of anything; indicating that you have been talking about a new sequence concept (variant) all this time, capable of holding stuff that are quite distinct from characters.
Yes, ok, that is one meaning of 'string' in a strict sense, but (I hope it was clear) it is not the meaning used in this specific discussion; so that switch of interpretation of the term probably does not make the discussion more focused.
I wasn't implying that the discussion go anywhere other than the string which has to do with "characters", for whatever definition of the word "character" there is. And if you look at my points, it's been towards the definition of "an std::string that is immutable, has value semantics, lightweight, and can be the basis of encoding/decoding algorithms". You took one sentence, chose to read it out of context, and then assumed that I've somehow been "defeated"?
My assertion has been from the beginning:
1. Let's focus on a string class first that is (arguably) better than std::string by making it efficient, immutable, and does proper value semantics.
2. Once we have this then let's build upon it to allow for multiple ways of interpreting the *contents* of the string.
I'm inclined to think you're confusing me for someone else while replying to my message above.
No, I did not. Sorry. By what you said above, you also add this point, which de-coheres the picture quite a bit:
3. This string class should be able to manifest sequences of anything, including events or arbitrary objects.
I didn't say that it *should* be able to manifest sequences of anything -- I was pointing out *one* definition of *string*. I was doing this to point out that nowhere in that definition does "encoding" become intrinsic -- and even if you look at the string in the context of std::string, neither is encoding intrinsic to that string. Also, to point, UTF doesn't even imply *strings*, it implies characters and character encodings, so I don't see why the encoding of a string of characters should be considered an inherent property of a string *type*. Maybe you're assuming I'm making the #3 point above when in fact I was just framing the discussion to assert that the encoding of the data encapsulated by a string is not intrinsic to the string's type. -- Dean Michael Berris about.me/deanberris

On 01/23/2011 07:51 PM, David Bergman wrote:
On Jan 23, 2011, at 9:34 PM, Dean Michael Berris wrote:
... elision by patrick... I think I disagree with this. A string is by definition a sequence of something -- a string of integers, a string of events, a string of characters. Encoding is not an intrinsic property of a string. Ok... it feels like you are changing the rules as we play, instead of admitting "defeat" ;-)
Or, did you indeed talk about *generic sequences* this whole time? If so, why the focus on encoding strategies for characters?
David, there are three related conversations (threads) going on. One is about a particular set of UTF encoded strings. One is about replacing std::string with a string specialized for UTF-8, and the third, this one, is about a string type that is immutable. They keep leaking into each other a bit, since they're all about strings. Patrick


On 01/23/2011 06:34 PM, Dean Michael Berris wrote:
... elision by patrick ... I think I disagree with this. A string is by definition a sequence of something -- a string of integers, a string of events, a string of characters. Encoding is not an intrinsic property of a string.
I'm with you here, but to be fair to Chad, you could add to that list a string of utf-8 encoded characters. If a string contains things with a particular encoding there's value in being able to keep track of whether it's validly encoded. It may very well be that a std::string is part of another type, or that there's some encoding wrapper that lets you see it as utf-8 in the same way an external iterator lets you look at chars.
As for using the existing std::string, I think the problem *is* std::string and the way it's implemented. In particular I think allowing for mutation of individual arbitrary elements makes users that don't need this mutation pay for the cost of having it. Because of this requirement things like SSO, copy-on-write optimizations(?), and all the other algorithm baggage that comes with the std::string implementation makes it really a bad basic string for the language.
So you're saying that there _also_ needs to be an immutable string type that wouldn't pay this penalty.
In a world where individual element mutation is a requirement, std::string may very well be an acceptable implementation. In other cases where you really don't need to be mutating any character in the string that's already there, well it's a really bad string implementation.
So what's wrong with having two different strings?
For the purpose of interpreting a string as something else, you don't need mutation -- and hence you gain a lot by having a string that is immutable but interpretable in many different ways. Consider the case where for example I want to interpret the same string as UTF-8 and then later on as UTF-32.
Are you saying that you try it as utf-8, it doesn't decode and then you try utf-32 to see if it works? Cause the same string couldn't be both. Or are you saying that the string has some underlying encoding but something lets it be viewed in other encodings, for example it might actually be EUC, but external iterators let you view it as utf-8 or utf-16 or utf-32 interpreting on the fly?
In your proposal I would need to copy the type that has a UTF-8 encoding into another type that has a UTF-32 encoding. If somehow the copy was trivial and doesn't need to give any programmer pause to do that, then that would be a good thing -- which is why an immutable string is something that your implementation would benefit from in a "plumbing" perspective. You could imagine:
utf-8_string u8s;
utf-32_string u32s;
// some code that gives a value to u32s
u8s = u32s; // this would use a converting _copy_ constructor

That would be cool. But what if someone had one of these that represented an edit buffer and was doing a global search and replace? I suppose then the underlying string would not be able to be the immutable one. Perhaps the std::string or std::immutable_string would be a template argument to basic_utf_string<encoding,stringtype>.
If you can wrap the string in a UTF-8, UTF-16, UTF-32 encoder/decoder then that should be the way to go. However building it into the string is not something that will scale in case there are other encodings that would be supported -- think about not just Unicode, but things like Base64, Zip, <insert encoding here>.
I assume that there is some unique identification for each language and encoding, or that one could be created. But that's too big a task for one volunteer developer, so my UTF classes are intended only to handle the three types that can encode any Unicode code-point.
Sure, but that doesn't mean that you can't design it in a way that others can extend it appropriately. This was/is the beauty of how the iterator/range abstraction works out for generic code.
That's a wonderful idea, you could design it to work with stateful encodings like JIS and EUC and non-stateful encodings like the UTF encodings.
Ultimately the underlying string should be efficient and could be operated upon in a predictable manner. It should be lightweight so that it can be referred to in many different situations and there should be an infinite number of possibilities for what you can use a string for.
You've just described std::string. Or alternately, std::vector<char>.
Except these are mutable containers which are exactly what I *don't* want.
But of course as you said before that if you _do_ want mutability then std::string is acceptable. It seems that we just need a lighter weight immutable addition to the fold. Patrick

On Mon, Jan 24, 2011 at 3:04 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/23/2011 06:34 PM, Dean Michael Berris wrote:
... elision by patrick ... I think I disagree with this. A string is by definition a sequence of something -- a string of integers, a string of events, a string of characters. Encoding is not an intrinsic property of a string.
I'm with you here, but to be fair to Chad, you could add to that list a string of utf-8 encoded characters. If a string contains things with a particular encoding there's value in being able to keep track of whether it's validly encoded. It may very well be that a std::string is part of another type, or that there's some encoding wrapper that lets you see it as utf-8 in the same way an external iterator lets you look at chars.
Sure, however I personally don't see the value of making the encoding an intrinsic property of a string object. I still personally think that encoding/decoding are algorithms applied on data, in which case I would like string to just be the data. I see the encoding being a different concern from the type of a data structure -- where I see a string as basically an encapsulation that represents a collection of "characters" for any suitable definition of "character".
As for using the existing std::string, I think the problem *is* std::string and the way it's implemented. In particular I think allowing for mutation of individual arbitrary elements makes users that don't need this mutation pay for the cost of having it. Because of this requirement things like SSO, copy-on-write optimizations(?), and all the other algorithm baggage that comes with the std::string implementation makes it really a bad basic string for the language.
So you're saying that there _also_ needs to be an immutable string type that wouldn't pay this penalty.
Yes -- and I would argue that the string type that is immutable is a better start for building algorithms around it than those that are mutable (which std::string is one example). I personally think that building a string is a different concern from dealing with a string that is already built -- putting both in the same abstract data type is a little misguided. That's not the case though for other data types like trees, vectors, stacks, etc.
In a world where individual element mutation is a requirement, std::string may very well be an acceptable implementation. In other cases where you really don't need to be mutating any character in the string that's already there, well it's a really bad string implementation.
So what's wrong with having two different strings?
Nothing -- which is why I think if we were going to create a boost::string, it should be the string that is immutable, because if you wanted a mutable string, every other string implementation out there (including std::string) is already mutable. ;)
For the purpose of interpreting a string as something else, you don't need mutation -- and hence you gain a lot by having a string that is immutable but interpretable in many different ways.
Consider the case where for example I want to interpret the same string as UTF-8 and then later on as UTF-32.
Are you saying that you try it as utf-8, it doesn't decode and then you try utf-32 to see if it works? Cause the same string couldn't be both. Or are you saying that the string has some underlying encoding but something lets it be viewed in other encodings, for example it might actually be EUC, but external iterators let you view it as utf-8 or utf-16 or utf-32 interpreting on the fly?
I'm saying the string could contain whatever it contains (which is largely of little consequence) but that you can give a "view" of the string as UTF-8 if it's valid UTF-8, or UTF-32 if it's valid UTF-32. I think that encoding/decoding on the fly would be terribly inefficient, therefore describing precisely what kind of interpretation you need at the point of interpretation would be a much more scalable approach. Consider the following:

template <class String>
void needs_utf8(String const & s) {
    view<utf8_encoded> utf8_string(s);
    if (!valid(utf8_string))
        throw invalid_string("I need a UTF-8 string.");
}

template <class String>
void needs_utf16(String const & s) {
    view<utf16_encoded> utf16_string(s);
    if (!valid(utf16_string))
        throw invalid_string("I need a UTF-16 string.");
}

I would say you have four choices when implementing `view` and `valid`:

1. view converts, and valid is a no-op.
2. view doesn't convert, and valid does the validation on the underlying string.
3. view converts, and valid does the validation on the underlying string.
4. view doesn't convert, but valid checks the validation on the view.

I'm leaning towards #2.
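A minimal sketch of what option #2 above could look like, assuming the view merely wraps the raw bytes and validation runs against them. All names here (`view`, `utf8_encoded`, `valid`) are hypothetical, not part of any proposed interface, and the UTF-8 check is deliberately simplified: it verifies lead/continuation byte structure only and does not reject overlong forms or surrogate values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding tag and non-converting view: the view just holds
// a handle to the raw bytes -- no copy, no conversion at construction.
struct utf8_encoded {};

template <class Encoding>
class view {
public:
    explicit view(std::string const& s) : raw_(&s) {}
    std::string const& raw() const { return *raw_; }
private:
    std::string const* raw_;
};

// Validation operates on the underlying string directly (option #2),
// independent of whatever iteration interface the view might offer.
inline bool valid(view<utf8_encoded> const& v) {
    std::string const& s = v.raw();
    for (std::size_t i = 0; i < s.size();) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        int extra = (c < 0x80)       ? 0   // 0xxxxxxx: ASCII
                  : (c >> 5) == 0x6  ? 1   // 110xxxxx: 2-byte sequence
                  : (c >> 4) == 0xE  ? 2   // 1110xxxx: 3-byte sequence
                  : (c >> 3) == 0x1E ? 3   // 11110xxx: 4-byte sequence
                  : -1;                    // invalid lead byte
        if (extra < 0 || i + extra >= s.size()) return false;
        for (int j = 1; j <= extra; ++j)       // continuation bytes: 10xxxxxx
            if ((static_cast<unsigned char>(s[i + j]) >> 6) != 0x2) return false;
        i += extra + 1;
    }
    return true;
}
```

With this shape `valid` never needs to go through the view's interface at all, which is the property argued for in favor of #2 later in the thread.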
In your proposal I would need to copy the type that has a UTF-8 encoding into another type that has a UTF-32 encoding. If somehow the copy was trivial and doesn't need to give any programmer pause to do that, then that would be a good thing -- which is why an immutable string is something that your implementation would benefit from in a "plumbing" perspective.
You could imagine:
utf-8_string u8s;
utf-32_string u32s;
// some code that gives a value to u32s
u8s = u32s; // this would use a converting _copy_ constructor
Actually, if you didn't do any "immediate" enforcement of the UTF invariant on the strings, then the assignment would amount to a pointer copy.
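To illustrate why the assignment can be that cheap: with immutable character data the buffer can be shared, so a copy just adds another owner. This is a made-up sketch, not anyone's actual proposal; `immutable_string` and its members are hypothetical names, and a production design would likely use intrusive reference counting rather than a std::shared_ptr around std::string.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Bare-bones immutable string: the character data is allocated once,
// never written to afterwards, and shared by all copies.
class immutable_string {
public:
    explicit immutable_string(std::string s)
        : data_(std::make_shared<const std::string>(std::move(s))) {}

    char const* c_str() const { return data_->c_str(); }
    std::size_t size() const { return data_->size(); }

    // Exposed only to demonstrate the sharing: copies alias one buffer.
    bool shares_buffer_with(immutable_string const& other) const {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<const std::string> data_;  // copy = pointer copy + refcount bump
};
```

The copy constructor and assignment here are the compiler-generated ones: O(1), no allocation, no character copying, which is exactly the "pointer copy" behavior described above.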
That would be cool. But what if someone had one of these that represented an edit buffer and was doing a global search and replace? I suppose then the underlying string would not be able to be the immutable one. Perhaps the std::string or std::immutable_string would be a template argument to basic_utf_string<encoding,stringtype>.
I think with an immutable string, you would go about it a different way -- instead of dealing with the underlying string directly, I would say that you would represent the edit buffer as a raw buffer of bytes.
From the UI perspective (assuming a GUI application) you can do the rendering based on user preferences, in which case you didn't directly deal with immutable string objects. That would allow the editing to happen in a different buffer as compared to having it apply on string objects (which is a bad way to go about it IMO).
The strings would only come into the picture if you're applying algorithms on the string data -- and/or viewing the strings in a given encoding -- for example when saving the file, or loading files from streams.
If you can wrap the string in a UTF-8, UTF-16, UTF-32 encoder/decoder then that should be the way to go. However building it into the string is not something that will scale in case there are other encodings that would be supported -- think about not just Unicode, but things like Base64, Zip,<insert encoding here>.
I assume that there is some unique identification for each language and encoding, or that one could be created. But that's too big a task for one volunteer developer, so my UTF classes are intended only to handle the three types that can encode any Unicode code-point.
Sure, but that doesn't mean that you can't design it in a way that others can extend it appropriately. This was/is the beauty of how the iterator/range abstraction works out for generic code.
That's a wonderful idea, you could design it to work with stateful encodings like JIS and EUC and non-stateful encodings like the UTF encodings.
And if you get the efficiency win of the immutable strings along with it, then that's a double win IMO. :)
Ultimately the underlying string should be efficient and could be operated upon in a predictable manner. It should be lightweight so that it can be referred to in many different situations and there should be an infinite number of possibilities for what you can use a string for.
You've just described std::string. Or alternately, std::vector<char>.
Except these are mutable containers which are exactly what I *don't* want.
But of course as you said before that if you _do_ want mutability then std::string is acceptable. It seems that we just need a lighter weight immutable addition to the fold.
Yup, which is what I think boost::string should be in the first place. ;) -- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote:
Consider the following:
template <class String>
void needs_utf8(String const & s) {
    view<utf8_encoded> utf8_string(s);
    if (!valid(utf8_string))
        throw invalid_string("I need a UTF-8 string.");
}

template <class String>
void needs_utf16(String const & s) {
    view<utf16_encoded> utf16_string(s);
    if (!valid(utf16_string))
        throw invalid_string("I need a UTF-16 string.");
}
I would say you have four choices when implementing `view` and `valid`:
1. view converts, and valid is a no-op.
2. view doesn't convert, and valid does the validation on the underlying string.
3. view converts, and valid does the validation on the underlying string.
4. view doesn't convert, but valid checks the validation on the view.
I'm leaning towards #2.
#1 and #3 would be wasteful for cases when the string is already known to have the desired encoding, so they are non-starters. I'm not sure I understand the distinction or reason for the distinction you imply by #2 versus #4. #2's wording suggests that you mean valid() accesses the underlying string through the view, but why is that better or worse than just using the view as in #4?
_____
Rob Stewart robert.stewart@sig.com
Software Engineer, Core Software using std::disclaimer;
Susquehanna International Group, LLP http://www.sig.com

On Mon, Jan 24, 2011 at 9:48 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
Consider the following:
template <class String>
void needs_utf8(String const & s) {
    view<utf8_encoded> utf8_string(s);
    if (!valid(utf8_string))
        throw invalid_string("I need a UTF-8 string.");
}

template <class String>
void needs_utf16(String const & s) {
    view<utf16_encoded> utf16_string(s);
    if (!valid(utf16_string))
        throw invalid_string("I need a UTF-16 string.");
}
I would say you have four choices when implementing `view` and `valid`:
1. view converts, and valid is a no-op.
2. view doesn't convert, and valid does the validation on the underlying string.
3. view converts, and valid does the validation on the underlying string.
4. view doesn't convert, but valid checks the validation on the view.
I'm leaning towards #2.
#1 and #3 would be wasteful for cases when the string is already known to have the desired encoding, so they are non-starters.
I'm not sure I understand the distinction or reason for the distinction you imply by #2 versus #4. #2's wording suggests that you mean valid() accesses the underlying string through the view, but why is that better or worse than just using the view as in #4?
In #2, you can have valid be implemented like this:

template <template <class> class View, class Encoding>
bool valid(View<Encoding> const & encoded_view) {
    if (!valid_length(encoded_view.raw(), Encoding())) // use static tag-dispatch
        return false;
    // ... do other validity checking based on just the raw data
    // like BOM checking, character-by-character check on whether
    // there are invalid characters not within range, consider Base64
    // and/or hex-encodings aside from just Unicode, etc.
}

Which you really would want to have for performance reasons -- case in point, if the underlying string doesn't have a valid length for UTF-16 or UTF-32 strings, you get a win by just doing some math on the length check for validity. Some libraries even make these parts compile to vectorized code, use OpenMP, or might even do things like GPU-assisted validation.

For #4 though, this would be unnecessarily limited by the interface provided by the view, which may mean that the only way you could write a validator would be to get an iterator from the view and essentially wait for a dereference of an iterator to fail through some mechanism -- maybe throw on dereference, or something like that.

By doing it through the #2 approach you can write a general validation routine that can be specialized on the specific encoding. You get the tag-dispatch goodness whenever, for example, you have a specialized routine for validation in a given encoding, with some room for partial/full specialization, etc. -- Dean Michael Berris about.me/deanberris
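As an aside, the length pre-check described above can be this cheap because, for the fixed-width code units of UTF-16 and UTF-32, a byte count that isn't a multiple of the code-unit size can never be valid, so rejection is pure arithmetic. A sketch with hypothetical tag types (the real `valid_length` in the post is equally hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Encoding tags used for static tag-dispatch; names are illustrative only.
struct utf8_encoding {};
struct utf16_encoding {};
struct utf32_encoding {};

// UTF-8 code points are variable width (1-4 bytes), so no byte count can
// be ruled out by arithmetic alone -- deeper checks are always needed.
inline bool valid_length(std::size_t, utf8_encoding)  { return true; }

// UTF-16 data is a sequence of 16-bit code units: byte count must be even.
inline bool valid_length(std::size_t bytes, utf16_encoding) { return bytes % 2 == 0; }

// UTF-32 data is a sequence of 32-bit code units: byte count must be a
// multiple of four.
inline bool valid_length(std::size_t bytes, utf32_encoding) { return bytes % 4 == 0; }
```

The overload set is what the post means by "static tag-dispatch": the `Encoding()` temporary selects the right check at compile time, with no virtual calls and no runtime branching on an encoding enum.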

On Mon, 24 Jan 2011 19:28:50 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Mon, Jan 24, 2011 at 3:04 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
[...] I'm with you here, but to be fair to Chad, you could add to that list a string of utf-8 encoded characters. If a string contains things with a particular encoding there's value in being able to keep track of whether it's validly encoded. It may very well be that a std::string is part of another type, or that there's some encoding wrapper that lets you see it as utf-8 in the same way an external iterator lets you look at chars.
Sure, however I personally don't see the value of making the encoding an intrinsic property of a string object. [...]
Then I think we have different purposes, and I'll absent myself from this part of the discussion after this reply. Before I go, I'll note in passing that I've started on the modifications to the UTF types, and I found that it made sense to omit many of the mutating functions from utf8_t and utf16_t, at least the ones that operate on anything other than the end of the string.
Are you saying that you try it as utf-8, it doesn't decode and then you try utf-32 to see if it works? Cause the same string couldn't be both. Or are you saying that the string has some underlying encoding but something lets it be viewed in other encodings, for example it might actually be EUC, but external iterators let you view it as utf-8 or utf-16 or utf-32 interpreting on the fly?
I'm saying the string could contain whatever it contains (which is largely of little consequence) but that you can give a "view" of the string as UTF-8 if it's valid UTF-8, or UTF-32 if it's valid UTF-32. [...]
For what it's worth, that's the basic concept that I've adopted for the utf*_t modifications. The utf*_t gives only a code-point iterator (you can also get a char/char16_t/char32_t iterator from the type returned by the encoded() function). I plan to write a separate character iterator that will accept code-points and return actual Unicode characters. -- Chad Nelson Oak Circle Software, Inc. * * *

At Mon, 24 Jan 2011 16:34:54 -0500, Chad Nelson wrote:
Sure, however I personally don't see the value of making the encoding an intrinsic property of a string object. [...]
Then I think we have different purposes, and I'll absent myself from this part of the discussion after this reply.
I think there *might* be a miscommunication here. IIUC, Dean is saying that he doesn't necessarily see a reason that a string's encoding needs to be exposed in its interface. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Tue, Jan 25, 2011 at 9:43 AM, Dave Abrahams <dave@boostpro.com> wrote:
At Mon, 24 Jan 2011 16:34:54 -0500, Chad Nelson wrote:
Sure, however I personally don't see the value of making the encoding an intrinsic property of a string object. [...]
Then I think we have different purposes, and I'll absent myself from this part of the discussion after this reply.
I think there *might* be a miscommunication here. IIUC, Dean is saying that he doesn't necessarily see a reason that a string's encoding needs to be exposed in its interface.
+1 And not only that, that viewing a string in a given encoding is largely a concern for a different type. I'm largely looking at how Fusion does it with sequences and views. -- Dean Michael Berris about.me/deanberris

On Mon, 24 Jan 2011 20:43:56 -0500 Dave Abrahams <dave@boostpro.com> wrote:
Sure, however I personally don't see the value of making the encoding an intrinsic property of a string object. [...]
Then I think we have different purposes, and I'll absent myself from this part of the discussion after this reply.
I think there *might* be a miscommunication here. IIUC, Dean is saying that he doesn't necessarily see a reason that a string's encoding needs to be exposed in its interface.
Yes, and I think I see what he has in mind, vaguely. But it's a different vision than what I'm developing -- better or worse, I don't know, just different. -- Chad Nelson Oak Circle Software, Inc. * * *

On Tue, Jan 25, 2011 at 5:34 AM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Mon, 24 Jan 2011 19:28:50 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Mon, Jan 24, 2011 at 3:04 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
[...] I'm with you here, but to be fair to Chad, you could add to that list a string of utf-8 encoded characters. If a string contains things with a particular encoding there's value in being able to keep track of whether it's validly encoded. It may very well be that a std::string is part of another type, or that there's some encoding wrapper that lets you see it as utf-8 in the same way an external iterator lets you look at chars.
Sure, however I personally don't see the value of making the encoding an intrinsic property of a string object. [...]
Then I think we have different purposes, and I'll absent myself from this part of the discussion after this reply.
I don't think we have different purposes, I just think we're discussing two different levels. I for one want a string that is efficient and lightweight to use. Whether it encodes the data underneath as UTF-32 for convenience is largely of little consequence to me at that level. However, as I have already described in a different message, "viewing" a string in a given encoding is much more scalable as far as design is concerned, as it allows others to extend the view mechanism to be unique to the encoding being supported. This allows you to write algorithms and views that adapt existing strings (std::string, QString, CString, std::wstring, <insert string implementation here>) and operate on them in a generic manner. The hypothetical `boost::string` can have implicit conversion constructors (?) that deal with the supported strings, and that means you are able to view that `boost::string` instead in the view.
Before I go, I'll note in passing that I've started on the modifications to the UTF types, and I found that it made sense to omit many of the mutating functions from utf8_t and utf16_t, at least the ones that operate on anything other than the end of the string.
Actually, I think if you have the immutable string as something you use internally in your UTF-* "views", then you may be able to omit even the mutating parts, including those dealing with the end of the string. ;)
Are you saying that you try it as utf-8, it doesn't decode and then you try utf-32 to see if it works? Cause the same string couldn't be both. Or are you saying that the string has some underlying encoding but something lets it be viewed in other encodings, for example it might actually be EUC, but external iterators let you view it as utf-8 or utf-16 or utf-32 interpreting on the fly?
I'm saying the string could contain whatever it contains (which is largely of little consequence) but that you can give a "view" of the string as UTF-8 if it's valid UTF-8, or UTF-32 if it's valid UTF-32. [...]
For what it's worth, that's the basic concept that I've adopted for the utf*_t modifications. The utf*_t gives only a code-point iterator (you can also get a char/char16_t/char32_t iterator from the type returned by the encoded() function). I plan to write a separate character iterator that will accept code-points and return actual Unicode characters.
I do suggest however that you implement/design algorithms first and build your iterators around the algorithms. I know that might sound counter-intuitive but having concrete algorithms in mind will allow you to delineate the proper (or more effective) abstractions better than thinking of the iterators in isolation. -- Dean Michael Berris about.me/deanberris

On Tue, 25 Jan 2011 12:27:02 +0800 Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Tue, Jan 25, 2011 at 5:34 AM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
Sure, however I personally don't see the value of making the encoding an intrinsic property of a string object. [...]
Then I think we have different purposes, and I'll absent myself from this part of the discussion after this reply.
I don't think we have different purposes, I just think we're discussing two different levels.
I suspect that's the case.
I for one want a string that is efficient and lightweight to use. Whether it encodes the data underneath as UTF-32 for convenience is largely of little consequence to me at that level.
I think I see where you're coming from: that if you had a set of conversion iterators that would provide a view of the type in whatever coding you want, the underlying coding matters very little to you. And that's effectively what I'm aiming for with the current UTF string revisions: they all provide a code-point iterator (which works the same regardless of the underlying type: it always provides the 21-bit code-point as a 32-bit value), as well as a way to access the encoded data.
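For illustration, the decoding step such a code-point iterator performs on UTF-8 input might look like the following. This is a simplified sketch that assumes the input is already known to be valid UTF-8 (no error handling), and `next_code_point` is a made-up helper, not part of anyone's proposed interface.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode one UTF-8 sequence starting at index i into a 32-bit value
// carrying the 21-bit Unicode code point, and advance i past it.
inline std::uint32_t next_code_point(std::string const& s, std::size_t& i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;                 // 1-byte sequence: ASCII
    int extra = (c >> 5) == 0x6 ? 1         // 110xxxxx: one trailing byte
              : (c >> 4) == 0xE ? 2         // 1110xxxx: two trailing bytes
              : 3;                          // 11110xxx: three trailing bytes
    std::uint32_t cp = c & (0x3F >> extra); // keep only the payload bits
    for (; extra > 0; --extra)              // each trailing byte adds 6 bits
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}
```

Whatever the underlying storage (char, char16_t, char32_t), an iterator built on such a step always hands the caller a 32-bit code-point value, which is the "works the same regardless of the underlying type" property described above.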
However, as I have already described elsewhere on a different message, "viewing" a string in a given encoding is much more scalable as far as design is concerned as it allows others to extend the view mechanism to be unique to the encoding being supported.
And if I understand it correctly (which I don't guarantee), that sounds like a nice design for some kinds of programs. It's just not the one I'm pursuing, because the problem it addresses doesn't seem to be the one I'm trying to solve, which is efficiently storing, manipulating, and converting Unicode data.
This allows you to write algorithms and views that adapt existing strings (std::string, QString, CString, std::wstring, <insert string implementation here>) and operate on them in a generic manner. The hypothetical `boost::string` can have implicit conversion constructors (?) that deal with the supported strings, and that means you are able to view that `boost::string` instead in the view.
I'm avoiding the boost::string idea. It's great in theory, but it's still pretty nebulous, and I can foresee someone spending a lot of time trying different things out on it. "There comes a time in every project when you have to shoot the engineers and put the damn thing into production." :-) I want working code, or a design I can quickly turn into working code, and the boost::string idea seems to be tangential to the problem I'm working on.
Before I go, I'll note in passing that I've started on the modifications to the UTF types, and I found that it made sense to omit many of the mutating functions from utf8_t and utf16_t, at least the ones that operate on anything other than the end of the string.
Actually, I think if you have the immutable string as something you use internally in your UTF-* "views", then you may be able to omit even the mutating parts, including those dealing with the end of the string. ;)
The underlying type of the UTF strings can't be an immutable string, because the conversion functions have to operate one code-point at a time. For instance, there's no way to know the code-point length of a UTF-8 sequence without at least walking over it and examining most of the bytes, which means that converting a raw UTF-8 string to UTF-32 would either have to use a mutable type in between (making it even less efficient than using a mutable underlying string because it requires two copy operations) or you'd have to walk it first to get the length, create the immutable string's storage, then walk it again to convert each character.
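To make that point concrete, here is a minimal sketch of why the length question forces a walk: in UTF-8, every byte except the 10xxxxxx continuation bytes starts a code point, so even just counting code points has to touch each byte.

```cpp
#include <cstddef>
#include <string>

// Count the code points in a UTF-8 string. There is no O(1) answer
// from the byte length alone; we have to look at each byte and skip
// the continuation bytes (those of the form 10xxxxxx).
std::size_t code_point_length(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8)
        if ((b & 0xC0) != 0x80)  // not a continuation byte
            ++n;
    return n;
}
```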
For what it's worth, that's the basic concept that I've adopted for the utf*_t modifications. The utf*_t gives only a code-point iterator (you can also get a char/char16_t/char32_t iterator from the type returned by the encoded() function). I plan to write a separate character iterator that will accept code-points and return actual Unicode characters.
I do suggest however that you implement/design algorithms first and build your iterators around the algorithms.
I know that might sound counter-intuitive but having concrete algorithms in mind will allow you to delineate the proper (or more effective) abstractions better than thinking of the iterators in isolation.
The problem being that I don't know what tasks someone might want out of it. I'm aiming to provide the basics that any other algorithm can be layered onto, and as many of std::string's capabilities as I can manage. -- Chad Nelson Oak Circle Software, Inc. * * *

On 21 January 2011 06:07, Dean Michael Berris <mikhailberis@gmail.com>wrote:
4. Looks like a real STL container except the iterator type is smarter than your average iterator.
What does "smarter" mean? -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On Sat, Jan 22, 2011 at 5:01 AM, Nevin Liber <nevin@eviloverlord.com> wrote:
On 21 January 2011 06:07, Dean Michael Berris <mikhailberis@gmail.com>wrote:
4. Looks like a real STL container except the iterator type is smarter than your average iterator.
What does "smarter" mean?
The way I was thinking about it, "smarter" would mean something along the lines of "knows more than your average <thing>" where <thing> is a bare iterator.

In the context of strings, I was thinking it should be able to know what string it came from, what encoding the string is supposed to be interpreted in, or whether there are special computations that an iterator for a string might need. One example that comes to mind is having a tokenizing iterator which returns a string when dereferenced and knows what the delimiters of the string are -- to do that correctly your iterator would need to know which string it came from and where in the string its internal "counter" is already "parked" at from the last dereference.

This would require that iterators be built externally from the string, something like:

auto it = encoded<utf8_encoding>(original_string), end = encoded<utf8_encoding>();

Here `it` could interpret the original string as UTF-8 and you can possibly assume that dereferencing this iterator can return an appropriate (possibly variant) type that is convertible to the appropriate holder (char, wchar_t, uint32_t (for utf32)). From here you can build ranges appropriately and deal with ranges and just know that the encoding is explicitly defined in the iterator.

HTH -- Dean Michael Berris about.me/deanberris
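The tokenizing-iterator idea can be sketched as follows. Everything here is hypothetical (the names `token_iterator`, `next`, and `done` are made up for illustration, and a real version would also carry the encoding); the point is just that the object knows its source string, its delimiters, and where it is "parked":

```cpp
#include <cstddef>
#include <string>
#include <utility>

// A "smarter" iterator sketch: it carries the string it came from,
// the delimiter set, and the position the last dereference parked at.
class token_iterator {
public:
    token_iterator(std::string s, std::string delims)
        : s_(std::move(s)), delims_(std::move(delims)), pos_(0) {}

    // Return the next token and advance the internal position.
    std::string next() {
        std::size_t start = s_.find_first_not_of(delims_, pos_);
        if (start == std::string::npos) { pos_ = s_.size(); return ""; }
        std::size_t end = s_.find_first_of(delims_, start);
        if (end == std::string::npos) end = s_.size();
        pos_ = end;
        return s_.substr(start, end - start);
    }

    bool done() const {
        return s_.find_first_not_of(delims_, pos_) == std::string::npos;
    }

private:
    std::string s_;       // which string it came from
    std::string delims_;  // what the delimiters are
    std::size_t pos_;     // where the internal "counter" is parked
};
```

Runs of delimiters are skipped, so `token_iterator("a,b,,c", ",")` produces the tokens "a", "b", "c" and is then done.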

On 01/23/2011 06:16 PM, Dean Michael Berris wrote:
On 21 January 2011 06:07, Dean Michael Berris<mikhailberis@gmail.com>wrote:
4. Looks like a real STL container except the iterator type is smarter than your average iterator.
On Sat, Jan 22, 2011 at 5:01 AM, Nevin Liber<nevin@eviloverlord.com> wrote:
What does "smarter" mean?
The way I was thinking about it, "smarter" would mean something along the lines of "knows more than your average<thing>" where<thing> is a bare iterator.
In the context of strings, I was thinking it should be able to know what string it came from, what encoding is the string supposed to be interpreted in, or whether there are special computations that an iterator for string might need. One example that comes to mind is having a tokenizing iterator which returns a string when dereferenced and knows what the delimiters of the string are -- to do that correctly your iterator would need to know which string it came from and where in the string its internal "counter" is already "parked" at from the last dereference.
This would require that iterators be built externally from the string, something like:
auto it = encoded<utf8_encoding>(original_string), end = encoded<utf8_encoding>();

I like that idea, but was toying with a different paradigm: a template argument similar to a locale, in that it would contain the information needed to compare elements and to iterate over elements. If it made sense to change things, an imbue idea for comparisons and iterators could work.
Here `it` could interpret the original string as UTF-8 and you can possibly assume that dereferencing this iterator can return an appropriate (possibly variant) type that is convertible to the appropriate holder (char, wchar_t, uint32_t (for utf32)). From here you can build ranges appropriately and deal with ranges and just know that the encoding is explicitly defined in the iterator.

I like the idea of segmenting an encoded string into ranges where a "character" would be a range capturing one or more of the underlying encoding's characters and combining characters. It solves the problem of what to return when dereferencing the iterators. Of course you'd have to be able to compare two (for example) utf-16 ranges meaningfully based on some locale, just as if a human who knew the symbols was comparing the glyphs that would be drawn for each range.

Another idea I like, though, is that dereferencing an iterator would return one UCS codepoint and it would be up to a higher level of abstraction to fetch the combining characters and form the final glyph. That way, any string that encoded UCS, whether it was utf-32, utf-16, or utf-8, could return char32_t from dereferencing an iterator.

I suspect that either or both of these, as well as other variations, would at times be the better idea, because the interpretation of the underlying code varies so much. Lots of places share the same scripts but with quite different rules about what to do with them, and how to combine or compare them. Beware of a naive solution if the intent is to make a completely general solution. I'm not even sure if it's possible without doing a layered approach.
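The second idea here -- code points from the lower layer, combining-character grouping at a higher layer -- can be sketched like this. It is deliberately naive: only the U+0300..U+036F combining-diacritics block is recognized, whereas real segmentation (Unicode UAX #29 grapheme clusters) involves far more rules:

```cpp
#include <string>
#include <vector>

// Very rough test for a combining mark: just the basic combining
// diacritical marks block. Real Unicode has many more such ranges.
bool is_combining(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Group a sequence of code points into "characters": each base code
// point plus the combining marks that follow it forms one unit.
std::vector<std::u32string> characters(const std::u32string& cps) {
    std::vector<std::u32string> out;
    for (char32_t cp : cps) {
        if (out.empty() || !is_combining(cp))
            out.emplace_back();    // start a new character
        out.back().push_back(cp);  // append base or combining mark
    }
    return out;
}
```

With this, the two code points 'e' + U+0301 (combining acute) come back as one "character", while a following 'a' starts another.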
Patrick

On Mon, Jan 24, 2011 at 2:45 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/23/2011 06:16 PM, Dean Michael Berris wrote:
In the context of strings, I was thinking it should be able to know what string it came from, what encoding is the string supposed to be interpreted in, or whether there are special computations that an iterator for string might need. One example that comes to mind is having a tokenizing iterator which returns a string when dereferenced and knows what the delimiters of the string are -- to do that correctly your iterator would need to know which string it came from and where in the string its internal "counter" is already "parked" at from the last dereference.
This would require that iterators be built externally from the string, something like:
auto it = encoded<utf8_encoding>(original_string), end = encoded<utf8_encoding>();
I like that idea but was toying with a different paradigm. A template argument similar to a locale in that it would contain the information needed to compare elements and to iterate elements. If it made sense to change things an imbue idea for comparisons and iterators could work.
Right, unfortunately that kind of information doesn't seem to fit well as part of the string's type. I would think that algorithms that apply to the string should be external, and the string type should just behave like a value that you can deal with. The simple reason why this kind of information is best made external is the case of runtime switching of the encoding you may want to interpret the data in.
Here `it` could interpret the original string as UTF-8 and you can possibly assume that dereferencing this iterator can return an appropriate (possibly variant) type that is convertible to the appropriate holder (char, wchar_t, uint32_t (for utf32)). From here you can build ranges appropriately and deal with ranges and just know that the encoding is explicitly defined in the iterator.
I like the idea of segmenting an encoded string into ranges where a "character" would be a range capturing one or more of the underlying encoding's characters and combining characters. It solves the problem of what to return when dereferencing the iterators. Of course you'd have to be able to compare two (for example) utf-16 ranges meaningfully based on some locale, just as if a human who knew the symbols was comparing the glyphs that would be drawn for each range.
Right. I'm not sure though whether that kind of "human" intelligence can be succinctly described in an algorithm -- unless of course it boils down to a simple "nested switch" statement (that could be code-gen'ed anyway).
Another idea I like though, is that dereferencing an iterator would return one UCS codepoint and it would be up to a higher level of abstraction to fetch the combining characters and form the final glyph. That way, any string that encoded UCS, whether it was utf-32, utf-16, or utf-8, could return char32_t from dereferencing an iterator. I suspect that either or both of these as well as other variations would at times be the better idea, because the interpretation of the underlying code varies so much. Lots of places share the same scripts but with quite different rules about what to do with them, and how to combine or compare them. Beware of a naive solution if the intent is to make a completely general solution. I'm not even sure if it's possible without doing a layered approach.
I agree in general that the naive solution can be misleading and can potentially be worse than if you had the string encoding information directly in a string type. This is where I think things like Proto or the smart use of template metaprogramming (on a micro-scale, especially with iterator nesting) can allow the fusing of certain transformations/encoding+decoding techniques, but only in cases where you have the nesting statically defined. Of course the proof will be in the pudding once that implementation starts to bake. ;) -- Dean Michael Berris about.me/deanberris

Hi, In order to keep the number of my replies down, in my following posts I will respond to several messages at once (sorry in advance). On Fri, Jan 21, 2011 at 1:42 PM, Glyn Matthews <glyn.matthews@gmail.com> wrote:
No need to duck and cover, you should be applauded for taking the initiative here.
Thanks for the encouragement.
Definitely make a Git repository.
On Fri, Jan 21, 2011 at 1:07 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
Maybe you have a publicly available Git repository -- maybe on Github -- we'd have a better discussion going?
Actually I think that it would be prudent if some "more senior" member of the Boost community managed the repository for this very important project. But, if nobody does it by the end of this week I will create a new git repo. Any preferences (SourceForge, Github, Gitorious, Something different) ? BR, Matus

On Fri, Jan 21, 2011 at 1:07 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
Mostly I'm interested in seeing a string class that is:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it.
2. Has real value semantics. This means, once you've copied it, that's really copied. No funky copy-on-write reference-counting mumbo-jumbo.
I also prefer nothing too fancy. But most of these things are implementation details, let us get the interface right first and focus on the optimizations afterwards.
3. Has all the algorithms that apply to it defined externally.
[snip/]
Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
I am all for a generalized-*string* class in the pedantic interpretation of the word, i.e. a sequence of chars, char16_ts, bytes, octets, words, dwords, etc. without any enforced encoding, for the use-cases that call for it. But, again, the reason why I participate in this whole discussion is that I think C++ also deserves a class focused on the "everyday", *nice* and *convenient* handling of text, without having to worry about how I need to "view" that raw-chunk-of-binary-data in this call to an OS API function and how I have to "view" it in that other library call, explicitly specifying which encoding I want to convert it to using *ugly* :-) tag types, etc. (as much as this is possible).

Another important concern for me is portability. I'd like (being very self-centered :-P) for example the following:

boost::string s = "Mat" + code_point(0x00FA/*u with acute*/) + code_point(0x0161/*s with caron*/);
std::cout << s << std::endl;

(everywhere the terminal can handle it) to print:

Matúš // hope your email client can handle that :)

instead of:

Mat$#@!%

or completely upsetting the terminal.

Also, while I see that, for example, this:

auto it = encoded<utf8_encoding>(original_string), end = encoded<utf8_encoding>();

is perfectly generic and well-designed for some use-cases, the first reaction of the-average-joe-programmer-inside-me when seeing it was, *yuck*. Sorry :-) Sometimes it is more important for the code, and for the people writing/maintaining it, to be nice and easy to understand than to be really-really-generic and smart. That said, it *is* perfectly valid if someone uses the generic version above. Let's do both.

The reason why I want to call it (std::)string is that many not-so-pedantic people would react to the question "What is your first thought when you hear 'string type'?" with "Some kind of type for handling text, eh?" and not with "Some kind of generalized sequence of elements without any intrinsic encoding having the following properties...". But if there is so much resistance to calling it that, then I vote for (boost|std)::text (however, this sounds a little awkward to me, I don't know why).
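For what it's worth, the `code_point` helper in the example above could be as small as a free function that UTF-8-encodes one code point into a std::string, so that plain concatenation produces correctly encoded text. This is a sketch under the assumption that the string interoperates with UTF-8 std::string data; the branches follow the standard UTF-8 encoding scheme:

```cpp
#include <string>

// UTF-8-encode a single code point. The branch boundaries (0x80,
// 0x800, 0x10000) are the standard 1/2/3/4-byte sequence limits.
std::string code_point(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, `"Mat" + code_point(0x00FA) + code_point(0x0161)` yields exactly the UTF-8 bytes of "Matúš".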
Let us keep basic_string<CharT> as that generalized string (I never suggested dumping it, just that std::string would be another type and not defined as a typedef for std::basic_string<char>). Regarding #1 above and the following ...
x = "Hello,"; x = x ^ " World!";
... would you be against it if the interface in addition also included a few convenience/backward-compatibility member functions like ...

string& append(const string& s) { *this = *this ^ s; return *this; }
string& prepend(const string& s) { *this = s ^ *this; return *this; }

... etc.? For the same reasons as above: clarity, simplicity (it may not be obvious what a fancy operator expression does; it is more obvious when using names like append, prepend, ...) and people are used to that programming style. BR, Matus

On Wed, Jan 26, 2011 at 3:47 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Jan 21, 2011 at 1:07 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
Mostly I'm interested in seeing a string class that is:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it.
2. Has real value semantics. This means, once you've copied it, that's really copied. No funky copy-on-write reference-counting mumbo-jumbo.
I also prefer nothing too fancy. But most of these things are implementation details, let us get the interface right first and focus on the optimizations afterwards.
Actually, it's not an implementation detail. Value semantics has everything to do with the interface and not the implementation. It's just that, at the time I was thinking about and writing this reply, I really wanted something lightweight that allowed for unbridled cross-thread access. That original assumption of mine that reference counting was a bad thing has since been clarified by others in the ensuing threads.
3. Has all the algorithms that apply to it defined externally.
[snip/]
Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
I am all for a generalized-*string* class in the pedantic interpretation of the word i.e. a sequence of chars, char16_ts, bytes, octets, words, dwords, etc. without any enforced encoding for use-cases that call for it, but again,
the reason why I participate in this whole discussion is because I think that C++ deserves also a class focused on the "everyday", *nice* and *convenient* handling of text, without having to worry about how do I need to "view" that raw-chunk-of-binary-data in this call to an OS API function and how do I have to "view" it in that other library call, explicitly specifying to which encoding I want to convert it using *ugly* :-) tag types, etc. (as much as this is possible).
But we already have these everyday, nice and convenient text-handling algorithms in Boost.Algorithm's String_algo library. As a matter of fact, *all* the implementations cited above dealing with UTF-8 and UTF-16 have everything to do with wrapping raw data into a view of it that (unfortunately) allows for mutating transformations. Note also that I wasn't even going into the generic point of strings being a sequence of anything other than characters to be read. That's a different topic that I don't want to get into at this time. But even the pedantic definition of a string doesn't include mutability as an intrinsic requirement.
Another important concern for me is portability. I'd like (being very self-centered :-P) for example the following:
boost::string s = "Mat" + code_point(0x00FA/*u with acute*/) + code_point(0x0161/*s with caron*/); std::cout << s << std::endl;
(everywhere where the terminal can handle it) to print: Matúš // hope your email client can handle that :)
instead of: Mat$#@!% or completely upsetting the terminal.
A few things here:

1. This is totally fine with an immutable string implementation. I don't see any mutations going on here.

2. A string class that "works correctly while immutable" allows for dealing with arbitrary data interpreted as some thunk that is obtained from a given source (as long as you have the length of the data, that is).

3. String I/O can be defined independently of the string, especially if you're dealing with C++ streams. I don't see why the above would be a problem with an immutable string implementation.

4. I don't see why a hypothetical boost::string implementation that is immutable would have portability problems when it just deals with immutable thunks of memory that can be viewed in a different manner depending on the encoding you want at the point where you need to be dealing with a specific encoding.
Also, while I see that, for example, this:

auto it = encoded<utf8_encoding>(original_string), end = encoded<utf8_encoding>();

is perfectly generic and well-designed for some use-cases, the first reaction of the-average-joe-programmer-inside-me when seeing it was, *yuck*. Sorry :-)
So you'd say yuck to any STL algorithm that dealt with iterators? Have you used the Boost.Iterators library yet because then you'd be calling all those chaining/wrapping operations "yucky" too. ;)
Sometimes it is more important for the code and people writing/maintaining it to be nice and easy to understand than to be really-really-generic and smart. That said, it *is* perfectly valid if someone uses the generic version above. Let's do both.
But the problem there is that "nice" is really subjective. I absolutely abhor code like this:

boost::string s = "Foo"; s.append("Bar").append("Baz");

When I can express it succinctly, with fewer characters, like this instead:

boost::string s = "Foo" ^ "Bar" ^ "Baz";
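A minimal sketch of the semantics being argued for here (the `istring` name and the choice of `^` are this thread's hypotheticals, not an existing interface): `^` always builds a new value and never touches its operands, which is exactly what an immutable string permits.

```cpp
#include <string>
#include <utility>

// A toy immutable string: constructed once, never modified afterwards.
class istring {
public:
    istring(const char* s) : data_(s) {}  // implicit, so "Bar" converts
    explicit istring(std::string s) : data_(std::move(s)) {}
    const std::string& str() const { return data_; }
private:
    std::string data_;  // no mutating members expose this
};

// Concatenation builds a brand-new istring; a ^ b leaves a and b intact.
istring operator^(const istring& a, const istring& b) {
    return istring(a.str() + b.str());
}
```

So `istring s = istring("Foo") ^ "Bar" ^ "Baz";` produces a value whose `str()` is "FooBarBaz", with no operand ever modified. The `append`/`prepend` wrappers Matus asks about could be layered on top of `^` in exactly the way his snippet shows.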
The reason why I want to call it (std::)string is that many not-so-pedantic people would react to the question "What is your first thought when you hear 'string type'?" with "Some kind of type for handling text, eh?" and not with "Some kind of generalized sequence of elements without any intrinsic encoding having the following properties...". But if there is so much resistance to calling it that then I vote for (boost|std)::text (however this sounds a little awkward to me, I don't know why).
I think you're missing something here though. The point of creating a new string implementation is so that you can generalize a whole family of string-related algorithms around a well-defined abstraction. In this case there's really no question that a string of characters is used to represent "text" -- although it can very well represent a lot of other things too. However you cut it, the abstraction bears out in the algorithms that have something to do with strings: concatenation, compression, ordering, encoding, decoding, rendering, sub-string, parsing, lexical analysis, search, etc. These algorithms are applied to strings, and there are a ton of algorithms dealing with different kinds of strings. Encoding (or interpreting) a string as UTF-8 is just one algorithm, and it would be naive IMO to design a string implementation around the idea that every string needs an encoding defined, when the algorithms that deal with strings are much more general in reality.
Let us keep the basic_string<CharT> as that generalized string (I never suggested to dump it, just that std::string would be an another type and not defined as typedef std::basic_string<char>).
Like I said though, I think we're talking at different levels. I for one think that solving the std::string problem brings more to the world than just solving the encoding problem. Bold statement, I know. ;) Also, last time I checked, there are already a ton of Unicode-encoding libraries out there; I don't see why there's a need for yet-another-encoding-library for character strings. This is why I like the way Boost.Locale is handling it, because it conveys that the library is about making a common interface through which different back-ends can be plugged in. If Boost.Locale dealt with iterators, then I think having a string library that is better than std::string in more ways than one gives us a good way of tackling the cross-platform string encoding issue. But there, I stress, I think C++ needs a better-than-standard string implementation.
Regarding #1 above and the following ...
x = "Hello,"; x = x ^ " World!";
... would you be against, if the interface in addition also included a few convenience/backward compatibility member functions like ...
string& append(const string& s) { *this = *this ^ s; return *this; }
string& prepend(const string& s) { *this = s ^ *this; return *this; }
... etc? For the same reasons as above: clarity, simplicity (it may not be obvious what a fancy operator expression does, it is more obvious when using names like append, prepend, ...) and people are used to that programming style.
I think this is a slippery slope though. If we make boost::string look like something that is mutable without it really being mutable, then you have a disconnect between the interface and the semantics you want to convey. Having member functions like 'append' and 'prepend' makes you think that you're modifying the string when in fact you're really building another string. I've already pointed out that string construction can very well be handled by string streams, so I don't think we want to encourage people to think of strings as stateful objects with mutable semantics, because that's not the original intention of the string. That users are building a new string, instead of modifying an existing one, *should* be conveyed in the interface. This is largely an issue of documentation though. The short answer to your question would be "yes, I am opposed to having member functions similar to what you have pointed out above". :)
BR,
Thanks for taking the time and I hope this helps! -- Dean Michael Berris about.me/deanberris

On Wed, Jan 26, 2011 at 9:25 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 3:47 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Jan 21, 2011 at 1:07 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
[snip/] I also prefer nothing too fancy. But most of these things are implementation details, let us get the interface right first and focus on the optimizations afterwards.
Actually, it's not an implementation detail. Value semantics has everything to do with the interface and not the implementation.
It's just that, at the time I was thinking about and writing this reply, I was just really wanting something lightweight and allowed for unbridled cross-thread access. That original assumption of mine that reference counting was a bad thing has since been clarified by others in the ensuing threads.
I didn't say that I regard the immutability or value semantics to be an implementation detail. But some part of the discussion focused on if we should employ COW, how to implement it, etc. Value semantics - a part of the interface specification - can be implemented in a number of ways.
3. Has all the algorithms that apply to it defined externally.
[snip/]
Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
OK, I give up :) I do not insist any more on calling it 'string'.
[snip/]
But I we already have these everyday nice and convenient text handling algorithms in Boost.Algorithm's String_algo library.
But still it is encoding agnostic, which is bad in many cases.
As a matter of fact, *all* the implementations cited about dealing with UTF-8 and UTF-16 have everything to do with wrapping raw data into a view of it that (unfortunately) allows for mutating transformations.
Note also that I wasn't even going into the generic point of strings being a sequence of anything other than characters to be read. That's a different topic that I don't want to get into at this time. But even the pedantic definition of a string doesn't include mutability as an intrinsic requirement.
I really do not have anything against the immutability and the value semantics, see above. I think you misunderstood me :)
Another important concern for me is portability. I'd like (being very self-centered :-P) for example the following:
boost::string s = "Mat" + code_point(0x00FA/*u with acute*/) + code_point(0x0161/*s with caron*/); std::cout << s << std::endl;
(everywhere where the terminal can handle it) to print: Matúš // hope your email client can handle that :)
instead of: Mat$#@!% or completely upsetting the terminal.
A few things here:
1. This is totally fine with an immutable string implementation. I don't see any mutations going on here.
Me neither :-) What I see however is that it fails because of encoding.
2. A string class that "works correctly while immutable" allows for dealing with arbitrary data interpreted as some thunk that is obtained from a given source (as long as you have a length of the data that is).
Agreed
3. String I/O can be defined independently of the string especially if you're dealing with C++ streams. I don't see why the above would be a problem with an immutable string implementation.
Agreed, but again it has to be convenient. [snip/]
auto it = encoded<utf8_encoding>(original_string), end = encoded<utf8_encoding>(); is perfectly generic and well-designed for some use-cases the first reaction of
Also, while I see that for example this the-average-joe-programmer-inside-me's when seeing it was, *yuck*. Sorry :-)
So you'd say yuck to any STL algorithm that dealt with iterators? Have you used the Boost.Iterators library yet because then you'd be calling all those chaining/wrapping operations "yucky" too. ;)
Some of them ? Yes, in many situations. [snip/]
But the problem there is "nice" is really subjective. I absolutely abhor code like this:
boost::string s = "Foo"; s.append("Bar").append("Baz");
When I can express it entirely with less characters and succinctly with this instead:
boost::string s = "Foo" ^ "Bar" ^ "Baz";
Agreed, this is a matter of opinion, and while I see the beauty of what you propose, it may not be clear what you mean by "Foo" ^ "Bar". If I learned something from this whole discussion, it is that it's not nice to shove anything (programming style included) down anyone's throat :-)
The reason why I want to call it (std::)string is that many not-so-pedantic people would react to the question "What is your first thought when you hear 'string type'?" with "Some kind of type for handling text, eh?" and not with "Some kind of generalized sequence of elements without any intrinsic encoding having the following properties...". But if there is so much resistance to calling it that then I vote for (boost|std)::text (however this sounds a little awkward to me, I don't know why).
I think you're missing something here though.
The point of creating a new string implementation is so that you can generalize a whole family of string-related algorithms around a well-defined abstraction. In this case there's really no question that a string of characters is used to represent "text" -- although it can very well represent a lot of other things too. However you cut it though the abstraction bears out of algorithms that have something to do with strings like: concatenation, compression, ordering, encoding, decoding, rendering, sub-string, parsing, lexical analysis, search, etc.
And I think you misunderstand me: I *do not* want to stop us from doing such an implementation of string. But just as it is important for you to have the generic string class, it is important for me to have the "nice" 'text' class :) I don't even have anything against boost::text being implemented as a special case of boost::string, if that is possible/wise.
[snip/]
Like I said though, I think we're talking in different levels.
I have exactly the same feeling :)
I for one think that solving the std::string problem brings more to the world than just solving the encoding problem. Bold statement I know. ;)
For you (and others) not for me (and others).
Also, last time I checked, there are already a ton of Unicode-encoding libraries out there, I don't see why there's a need for yet-another-encoding-library for character strings. This is why I think I'm liking the way Boost.Locale is handling it because it conveys that the library is about making a common interface through which different back-ends can be plugged into. If Boost.Locale dealt with iterators then I think having a string library that is better than std::string in more ways than one gives us a good way of tackling the cross-platform string encoding issue. But there I stress, I think C++ needs a better than the standard string implementation.
And what is their level of acceptance by different APIs ?
Regarding #1 above and the following ...
x = "Hello,"; x = x ^ " World!";
... would you be against, if the interface in addition also included a few convenience/backward compatibility member functions like ...
[snip/]
... etc? For the same reasons as above: clarity, simplicity (it may not be obvious what a fancy operator expression does, it is more obvious when using names like append, prepend, ...) and people are used to that programming style.
I think this is a slippery slope though. If we make the boost::string look like something that is mutable without it being really mutable, then you have a disconnect between the interface and the semantics you want to convey.
Having member functions like 'append' and 'prepend' makes you think that you're modifying the string when in fact you're really building another string. I've already pointed out that string construction can very well be handled by the string streams so I don't think we want to encourage people to think of strings as state-ful objects with mutable semantics because that's not the original intention of the string.
Forcing users of the string to make it look like they're building a string instead of "modifying an existing string" is exactly what *should* be conveyed in the interface. This is largely an issue of documentation though.
Again, this is a matter of taste. Is the enforcing of our "superior" interface design really that much more important than the level of acceptability to other people who do not share the same opinion? Nobody forces you to use append/prepend and you should not force others to use the operator ^. IMO in this case you are even at an advantage, because append/prepend/etc. would be wrappers around "your" :) interface. And, yes, they should be clearly documented as such. Best, Matus
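A sketch of the compromise being discussed here -- operator^ as the primitive concatenation, with append/prepend documented as thin wrappers over it -- assuming a hypothetical immutable string type (`istring` and all names below are invented for illustration, not part of any actual proposal):

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string: every operation returns a new value,
// never modifies *this.  Illustrative only.
class istring {
public:
    istring() = default;
    istring(const char* s) : data_(std::make_shared<const std::string>(s)) {}

    // The primitive: join left with right, producing a fresh value.
    friend istring operator^(const istring& lhs, const istring& rhs) {
        return istring(std::make_shared<const std::string>(lhs.str() + rhs.str()));
    }

    // Convenience wrappers, documented as sugar over operator^.
    istring append(const istring& tail) const  { return *this ^ tail; }
    istring prepend(const istring& head) const { return head ^ *this; }

    const std::string& str() const {
        static const std::string empty;
        return data_ ? *data_ : empty;
    }

private:
    explicit istring(std::shared_ptr<const std::string> d) : data_(std::move(d)) {}
    std::shared_ptr<const std::string> data_; // shared, never mutated
};
```

Nothing about the wrappers breaks immutability -- they are pure functions on values, so the "both camps" interface costs nothing beyond documentation.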

On Wed, Jan 26, 2011 at 5:01 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 9:25 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 3:47 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Jan 21, 2011 at 1:07 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
[snip/] I also prefer nothing too fancy. But most of these things are implementation details, let us get the interface right first and focus on the optimizations afterwards.
Actually, it's not an implementation detail. Value semantics has everything to do with the interface and not the implementation.
It's just that, at the time I was thinking about and writing this reply, I was just really wanting something lightweight that allowed for unbridled cross-thread access. That original assumption of mine that reference counting was a bad thing has since been clarified by others in the ensuing threads.
I didn't say that I regard the immutability or value semantics to be an implementation detail. But some part of the discussion focused on if we should employ COW, how to implement it, etc.
Sure, which is also where the reference counting implementation lies. Details like that are deal-breakers in performance-critical code and if we're talking about replacing std::string or implementing a competing string, it would have to beat the std::string performance (however bad/good that is).
Value semantics - a part of the interface specification - can be implemented in a number of ways.
I don't see though how else value semantics can be implemented aside from having: a default constructor, an assignment operator, a copy constructor, and later on maybe an optimal move constructor. That's really all there is to value semantics -- some would argue that swap is "necessary" too but I'm not convinced that swap is really required for value semantics.
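The special-member set just listed amounts to something like the following; `value_string` is a placeholder type used only to illustrate the interface surface, with a std::string payload standing in for whatever the real representation would be:

```cpp
#include <string>
#include <utility>

// Minimal "value semantics" surface: default construction, copy,
// assignment, and move.  (swap falls out of move via std::swap.)
class value_string {
public:
    value_string() = default;                                  // default ctor
    value_string(const value_string&) = default;               // copy ctor
    value_string& operator=(const value_string&) = default;    // assignment
    value_string(value_string&&) noexcept = default;           // move ctor
    value_string& operator=(value_string&&) noexcept = default;

    explicit value_string(std::string s) : payload_(std::move(s)) {}

    bool operator==(const value_string& o) const { return payload_ == o.payload_; }
    const std::string& get() const { return payload_; }

private:
    std::string payload_; // placeholder representation
};
```

The point of the sketch is that nothing in it says how copies are realized -- eager duplication, COW, or shared immutable buffers all satisfy the same surface, which is why that part is an implementation question.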
But we already have these nice and convenient everyday text handling algorithms in Boost.Algorithm's String_algo library.
But still it is encoding agnostic, which is bad in many cases.
So it's the algorithms that are the problem -- for being encoding agnostic -- and not really the string? Is that what you're implying?
As a matter of fact, *all* the implementations cited about dealing with UTF-8 and UTF-16 have everything to do with wrapping raw data into a view of it that (unfortunately) allows for mutating transformations.
Note also that I wasn't even going into the generic point of strings not being a sequence of anything other than characters to be read. That's a different topic that I don't want to get into at this time. But even the pedantic definition of a string doesn't include mutability as an intrinsic requirement.
I really do not have anything against the immutability and the value semantics, see above. I think you misunderstood me :)
I think I didn't understand what you meant when you referred to implementation details. ;)
A few things here:
1. This is totally fine with an immutable string implementation. I don't see any mutations going on here.
Me neither :-) What I see however is that it fails because of encoding.
I still don't understand this though. What does encoding have to do with the string? Isn't encoding a separate process?
3. String I/O can be defined independently of the string especially if you're dealing with C++ streams. I don't see why the above would be a problem with an immutable string implementation.
Agreed, but again it has to be convenient.
We may have different definitions of convenient, but I like the way string-streams and iostreams in the standard do it. For all intents and purposes a stringbuf implementation that deals with efficient allocation precisely for immutable strings would be nice to have.
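A rough sketch of that division of labor using today's standard streams -- all mutation happens inside a buffer, and only a finished value is handed out; a purpose-built stringbuf tuned for an immutable string would follow the same shape (the helper name is invented, and plain std::string stands in for the immutable type):

```cpp
#include <sstream>
#include <string>

// Build-then-freeze: the ostringstream is the mutable workspace,
// the returned value is meant to be treated as immutable.
std::string build_greeting(const std::string& name) {
    std::ostringstream buf;          // all mutation happens in the buffer
    buf << "Hello, " << name << "!";
    return buf.str();                // freeze once, hand out a value
}
```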
So you'd say yuck to any STL algorithm that dealt with iterators? Have you used the Boost.Iterators library yet because then you'd be calling all those chaining/wrapping operations "yucky" too. ;)
Some of them? Yes, in many situations.
How about Boost.RangeEx-wrapped STL algorithms? I for one like the simplicity and flexibility of it, which may explain why I think we have different interpretations of "convenient". For me, iterators and layering operations on iterators, and then feeding them through algorithms, is the convenient route. Anything that resembles Java code or Smalltalk-like "OOP message-passing" inspired interfaces just doesn't seem enticing to my brain anymore. Maybe that's more a problem with me than with the code, although the jury's still out on that one. ;)
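For concreteness, the "feed iterators through algorithms" style being contrasted here looks roughly like this in plain standard C++ (Boost.Range/RangeEx would express the same pipeline more tersely); `upper_copy` is a made-up helper, not part of any proposed interface:

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Non-mutating, iterator-driven style: the input is only read,
// the result is produced through an output iterator.
std::string upper_copy(const std::string& in) {
    std::string out;
    std::transform(in.begin(), in.end(), std::back_inserter(out),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return out;
}
```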
[snip/]
But the problem there is "nice" is really subjective. I absolutely abhor code like this:
boost::string s = "Foo"; s.append("Bar").append("Baz");
When I can express it entirely with less characters and succinctly with this instead:
boost::string s = "Foo" ^ "Bar" ^ "Baz";
Agreed, this is a matter of opinion and while I see the beauty of what you propose, it may not be clear what you mean by "Foo" ^ "Bar". If I learned something from this whole discussion, then it is that it's not nice to shove anything (programming style included) down anyone's throat :-)
Right on both counts. :D To be more "complete" about it though, the semantics of "+" on strings is really a misnomer. The "+" operator suggests commutativity, which string concatenation does not have -- and you're really not adding string values either. What you want is an operator that conveys "I'm joining the string on the left with the one on the right in the specified order" -- because the "^" operator is left associative and can be used as a joining symbol, it fits the use case for strings better. So you read it as: "Foo" joined with "Bar" joined with ...
I think you're missing something here though.
The point of creating a new string implementation is so that you can generalize a whole family of string-related algorithms around a well-defined abstraction. In this case there's really no question that a string of characters is used to represent "text" -- although it can very well represent a lot of other things too. However you cut it, though, the abstraction bears out a family of algorithms that have something to do with strings: concatenation, compression, ordering, encoding, decoding, rendering, sub-string, parsing, lexical analysis, search, etc.
And I think you misunderstand me, I *do not* want to stop us from doing such an implementation of string. But just as it is important for you to have the generic string class, it is important for me to have the "nice" 'text' class :) I don't even have anything against boost::text being implemented as a special case of boost::string, if it is possible/wise.
I still don't understand what "nice" is. I think precisely because "nice" is such a subjective thing I fear that without any objective criterion to base the definition of an interface/implementation on, we will keep chasing after what's "nice" or "convenient". OTOH if we agree that algorithms are as important as the abstraction, then I think it's better if we agree what the correct abstraction is and what the algorithms we intend to implement/support are. In that discussion what's "nice" is largely a matter of taste. ;)
Also, last time I checked, there are already a ton of Unicode-encoding libraries out there; I don't see why there's a need for yet-another-encoding-library for character strings. This is why I think I'm liking the way Boost.Locale is handling it, because it conveys that the library is about making a common interface into which different back-ends can be plugged. If Boost.Locale dealt with iterators then I think having a string library that is better than std::string in more ways than one gives us a good way of tackling the cross-platform string encoding issue. But there I stress, I think C++ needs a better string implementation than the standard one.
And what is their level of acceptance by different APIs?
I think we need to qualify what you refer to as APIs. If just judging from the amount of code that's written against Qt or MFC, for example, then I'd say "they're pretty well accepted". If you look at the libraries that use ICU as a backend, I'd say we already have one in Boost called Boost.Regex. And there's all these other libraries in the Linux arena that have their own little niche to play in the Unicode game -- there's Glib, the GNOME and KDE libraries, ad nauseam. I really think we don't want to be playing the "one ring in the darkness bind them" game here. If we want to change the way things are going, there's little point in preserving the status quo IMO, especially if we're all in agreement that the status quo is broken. And now that we've decided that it's something worth fixing, let's fix it in a way that's actually different from how everyone else has tried before. Doing the same thing over and over and expecting a different result is insanity -- paraphrased from someone important whose name I really should know. ;)
I think this is a slippery slope though. If we make the boost::string look like something that is mutable without it being really mutable, then you have a disconnect between the interface and the semantics you want to convey.
Having member functions like 'append' and 'prepend' makes you think that you're modifying the string when in fact you're really building another string. I've already pointed out that string construction can very well be handled by the string streams so I don't think we want to encourage people to think of strings as state-ful objects with mutable semantics because that's not the original intention of the string.
Forcing users of the string to make it look like they're building a string instead of "modifying an existing string" is exactly what *should* be conveyed in the interface. This is largely an issue of documentation though.
Again, this is a matter of taste.
Actually, I think it's a matter of design, not taste.
Is the enforcing of our "superior" interface design really that much more important than the level of acceptability to other people who do not share the same opinion?
What opinion is there to be had? If the string is immutable why would you want to make it look like it is mutable?
Nobody forces you to use append/prepend and you should not force others to use the operator ^.
Well, the primitive data types force you to use the operators defined on them. Spirit forces you to define rules using the DSEL. So does the MSM library. The BGL forces you to use the graph abstraction if you intend to deal with that library. I don't see why it's unreasonable to force operator^ for consistency's sake.
IMO in this case you are even at an advantage, because append/prepend/etc. would be wrappers around "your" :) interface. And, yes, they should be clearly documented as such.
But the point of the thing being immutable is lost in translation. More to the point, operator^ has simple semantics as opposed to 'append' and 'prepend' which are two words for the same operation with just the order of the operands switched around. Am I missing something here? -- Dean Michael Berris about.me/deanberris

On Wed, Jan 26, 2011 at 3:22 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 5:01 PM, Matus Chochlik <chochlik@gmail.com> wrote: [snip/]
I didn't say that I regard the immutability or value semantics to be an implementation detail. But some part of the discussion focused on if we should employ COW, how to implement it, etc.
Sure, which is also where the reference counting implementation lies. Details like that are deal-breakers in performance-critical code and if we're talking about replacing std::string or implementing a competing string, it would have to beat the std::string performance (however bad/good that is).
Value semantics - a part of the interface specification - can be implemented in a number of ways.
I don't see though how else value semantics can be implemented aside from having: a default constructor, an assignment operator, a copy constructor, and later on maybe an optimal move constructor. That's really all there is to value semantics -- some would argue that swap is "necessary" too but I'm not convinced that swap is really required for value semantics.
I may be wrong, but my idea when I hear you say that a type has a default constructor, an assignment operator, etc., is that you are talking about the interface of the type. When you explain how the assignment operator, etc. is implemented, then you are talking about implementation details :) [snip/]
So it's the algorithms that are the problem -- for being encoding agnostic -- and not really the string? Is that what you're implying?
[snip/]
1. This is totally fine with an immutable string implementation. I don't see any mutations going on here.
Me neither :-) What I see however is that it fails because of encoding.
I still don't understand this though. What does encoding have to do with the string? Isn't encoding a separate process?
Hm, my ability to express myself obviously totally su*ks :) You are completely right that the encoding is a completely separate process, and I'm saying that I want it to be *completely* hidden from my sight, unless it is absolutely necessary for me to be concerned about it :-) The means for this would be: let us build a string that may (or may not) be based on your general (encoding-agnostic) string. This string would handle the transcoding in most cases, without me viewing the underlying byte sequence or going through functors that require me to specify *every time* what encoding I want explicitly. By default I want UTF-8; if I talk to the OS I say I want the string in the encoding that the OS expects, not that I want it in UTF-16, ISO-8859-2, KOI8-R, etc. If and only if I want to handle the string in an encoding other than Unicode should I have to specify that explicitly. [snip/]
How about Boost.RangeEx-wrapped STL algorithms?
I for one like the simplicity and flexibility of it, which may explain why I think we have different interpretations of "convenient". For me, iterators and layering operations on iterators, and then feeding them through algorithms, is the convenient route. Anything that resembles Java code or Smalltalk-like "OOP message-passing" inspired interfaces just doesn't seem enticing to my brain anymore.
This is a different matter. Again, I may be wrong, but I live under the impression that RangeEx has been implemented to hide the ugliness of complex STL iterator-based algorithms. [snip/]
To be more "complete" about it though, the semantics of "+" on strings is really a misnomer. The "+" operator suggests commutativity, which string concatenation does not have -- and you're really not adding string values either. What you want is an operator that conveys "I'm joining the string on the left with the one on the right in the specified order" -- because the "^" operator is left associative and can be used as a joining symbol, it fits the use case for strings better.
So you read it as: "Foo" joined with "Bar" joined with ...
I know that of course, because we are having this discussion, but will it be clear to someone who is not participating? It may become clear when the string gets wider adoption.
I still don't understand what "nice" is. I think precisely because "nice" is such a subjective thing I fear that without any objective criterion to base the definition of an interface/implementation on, we will keep chasing after what's "nice" or "convenient".
OTOH if we agree that algorithms are as important as the abstraction, then I think it's better if we agree what the correct abstraction is and what the algorithms we intend to implement/support are. In that discussion what's "nice" is largely a matter of taste. ;)
OK, I think that it is pointless to discuss "nice" :) exactly because it is very subjective.
Also, last time I checked, there are already a ton of Unicode-encoding libraries out there; I don't see why there's a need for yet-another-encoding-library for character strings. This is why I think I'm liking the way Boost.Locale is handling it, because it conveys that the library is about making a common interface into which different back-ends can be plugged. If Boost.Locale dealt with iterators then I think having a string library that is better than std::string in more ways than one gives us a good way of tackling the cross-platform string encoding issue. But there I stress, I think C++ needs a better string implementation than the standard one.
And what is their level of acceptance by different APIs?
I think we need to qualify what you refer to as APIs. If just judging from the amount of code that's written against Qt or MFC for example then I'd say "they're pretty well accepted". If you look at the libraries that use ICU as a backend I'd say we already have one in Boost called Boost.Regex. And there's all these other libraries in the Linux arena that have their own little niche to play in the Unicode game -- there's Glib, the GNOME and KDE libraries, ad nauseam.
Besides what you mentioned, an API for me is for example WINAPI, POSIX API, OpenGL API, OpenSSL API, etc. Basically all the functions "exported" by the various C/C++ libraries that I cannot imagine my life without :) and which expect not a generic iterator range or a view or whatnot, but a plain and simple pointer (const char*) to a contiguous block of memory containing a zero-terminated C string -- or, if we are luckier, a std::string.
What opinion is there to be had? If the string is immutable why would you want to make it look like it is mutable?
Nobody forces you to use append/prepend and you should not force others to use the operator ^.
Well, the primitive data types force you to use the operators defined on them. Spirit forces you to define rules using the DSEL. So does the MSM library. The BGL forces you to use the graph abstraction if you intend to deal with that library.
I don't see why it's unreasonable to force operator^ for consistency's sake.
IMO in this case you are even at an advantage, because append/prepend/etc. would be wrappers around "your" :) interface. And, yes, they should be clearly documented as such.
But the point of the thing being immutable is lost in translation. More to the point, operator^ has simple semantics as opposed to 'append' and 'prepend' which are two words for the same operation with just the order of the operands switched around.
Am I missing something here?
I see your point of view. You imagine this new string class to be a completely new beast. Me, and I expect a few others, view it as the next std::string. I don't see any big point in creating another-uber-string that is *so much* better in performance, etc. etc. if it does not get wide adoption. There are already dozens of such strings. Best, Matus

On Wed, Jan 26, 2011 at 11:19 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 3:22 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
I don't see though how else value semantics can be implemented aside from having: a default constructor, an assignment operator, a copy constructor, and later on maybe an optimal move constructor. That's really all there is to value semantics -- some would argue that swap is "necessary" too but I'm not convinced that swap is really required for value semantics.
I may be wrong, but my idea when I hear you say that a type has a default constructor, an assignment operator, etc., is that you are talking about the interface of the type. When you explain how the assignment operator, etc. is implemented, then you are talking about implementation details :)
Right, but others seem to want to know about the implementation details to try and work out whether the overall interface being designed is actually going to be a viable implementation. So while I say "value semantics" others have asked how that would be implemented and -- being the gratuitous typer that I am ;) -- I would respond. :D
I still don't understand this though. What does encoding have to do with the string? Isn't encoding a separate process?
Hm, my ability to express myself obviously totally su*ks :) you are completely right, that the encoding is a completely separate process, and I'm saying that I want it *completely* to be hidden from my sight, unless it is absolutely necessary for me to be concerned about it :-)
So what would be the point of implementing a string "wrapper" that knew its encoding as part of the type if you didn't want to know the encoding in most of the cases? I think I'm missing the logic there.
The means for this would be: let us build a string that may (or may not) be based on your general (encoding-agnostic) string. This string would handle the transcoding in most cases, without me viewing the underlying byte sequence or going through functors that require me to specify *every time* what encoding I want explicitly. By default I want UTF-8; if I talk to the OS I say I want the string in the encoding that the OS expects, not that I want it in UTF-16, ISO-8859-2, KOI8-R, etc. If and only if I want to handle the string in an encoding other than Unicode should I have to specify that explicitly.
So we're obviously talking about two different strings here -- your "text" that knows the encoding and the immutable string that you may or may not build upon. How then do you design the algorithms if you *didn't* want to explicitly specify the encoding you want the algorithms to use? In one of the previous messages I laid out an algorithm template like so:

template <class String>
void foo(String s) {
    view<encoding> encoded(s);
    // deal with encoded from here on out
}

Of course then from foo's user perspective, she wouldn't have to do anything with her string to be passed in. From the algorithm implementer perspective you would know exactly what encoding was wanted and how to go about implementing the algorithm, even potentially having something like this as well:

template <class Encoding>
void foo(view<Encoding> encoded) {
    // deal with the encoded string appropriately here
}

And you get the benefits in either case of being able to either explicitly or implicitly deal with strings depending on whether they have been explicitly encoded already or whether it's just a raw set of bytes.
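To make that skeleton concrete, here is a compilable sketch; `utf8`, the `view` type, and the byte-counting bodies are stand-ins invented for illustration (the real library would define its own encoding tags and view machinery):

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Stand-in encoding tag and view type.
struct utf8 {};

template <class Encoding>
struct view {
    explicit view(std::string s) : bytes(std::move(s)) {}
    std::string bytes; // raw storage, interpreted per Encoding
};

// Caller passes any string; the algorithm picks the encoding it needs.
template <class String>
std::size_t foo(const String& s) {
    view<utf8> encoded(s);
    return encoded.bytes.size();
}

// Or the caller passes an already-encoded view explicitly; partial
// ordering makes this overload win for view<...> arguments.
template <class Encoding>
std::size_t foo(const view<Encoding>& encoded) {
    return encoded.bytes.size();
}
```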
[snip/]
How about Boost.RangeEx-wrapped STL algorithms?
I for one like the simplicity and flexibility of it, which may explain why I think we have different interpretations of "convenient". For me, iterators and layering operations on iterators, and then feeding them through algorithms, is the convenient route. Anything that resembles Java code or Smalltalk-like "OOP message-passing" inspired interfaces just doesn't seem enticing to my brain anymore.
This is a different matter. Again, I may be wrong, but I live under the impression that RangeEx has been implemented to hide the ugliness of complex STL iterator-based algorithms.
Right. So what's the difference between the RangeEx way and the STL way when they both deal with iterators? What makes the STL version "yuckier" than the RangeEx version? I might have my answer to that but hearing your answer to this question might give me a better idea of what you might mean when you say "nice" or "convenient". ;)
So you read it as: "Foo" joined with "Bar" joined with ...
I know that of course, because we are having this discussion, but will it be clear to someone who is not participating? It may become clear when the string gets wider adoption.
Of course the proof will be in the pudding. ;)
I still don't understand what "nice" is. I think precisely because "nice" is such a subjective thing I fear that without any objective criterion to base the definition of an interface/implementation on, we will keep chasing after what's "nice" or "convenient".
OTOH if we agree that algorithms are as important as the abstraction, then I think it's better if we agree what the correct abstraction is and what the algorithms we intend to implement/support are. In that discussion what's "nice" is largely a matter of taste. ;)
OK, I think that it is pointless to discuss "nice" :) exactly because it is very subjective.
Agreed. :)
I think we need to qualify what you refer to as APIs. If just judging from the amount of code that's written against Qt or MFC for example then I'd say "they're pretty well accepted". If you look at the libraries that use ICU as a backend I'd say we already have one in Boost called Boost.Regex. And there's all these other libraries in the Linux arena that have their own little niche to play in the Unicode game -- there's Glib, the GNOME and KDE libraries, ad nauseam.
Besides what you mentioned, an API for me is for example WINAPI, POSIX API, OpenGL API, OpenSSL API, etc. Basically all the functions "exported" by the various C/C++ libraries that I cannot imagine my life without :) and which expect not a generic iterator range or a view or whatnot, but a plain and simple pointer (const char*) to a contiguous block of memory containing a zero-terminated C string -- or, if we are luckier, a std::string.
So, if there was a way to "encode" (there's that word again) the data in an immutable string into an acceptably-rendered `char const *` would that solve the problem? The whole point of my assertion (and Dave's question) is whether c_str() would have to be intrinsic to the string, which I have pointed out in a different message (not too long ago) that it could very well be an external algorithm.
Am I missing something here?
I see your point of view. You imagine this new string class to be a completely new beast. Me, and I expect a few others, view it as the next std::string. I don't see any big point in creating another-uber-string that is *so much* better in performance, etc. etc. if it does not get wide adoption. There are already dozens of such strings.
Right. This is Boost anyway, and I've always viewed the libraries that get proposed to and accepted into Boost as the kinds of libraries that are developed to eventually be made part of the C++ standard library. So while out of the gate the string implementation can very well not be called std::string, I don't see why the current std::string can't be deprecated later on (look at std::auto_ptr) and a different implementation be put in its place. :D Of course that may very well be C++21xx so I don't think I need to worry about it having to be a std::string killer at the outset. ;) HTH -- Dean Michael Berris about.me/deanberris

On Wed, Jan 26, 2011 at 5:06 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 11:19 PM, Matus Chochlik <chochlik@gmail.com> wrote: [snip/]
Right, but others seem to want to know about the implementation details to try and work out whether the overall interface being designed is actually going to be a viable implementation. So while I say "value semantics" others have asked how that would be implemented and -- being the gratuitous typer that I am ;) -- I would respond. :D
OK :)
I still don't understand this though. What does encoding have to do with the string? Isn't encoding a separate process?
Hm, my ability to express myself obviously totally su*ks :) you are completely right, that the encoding is a completely separate process, and I'm saying that I want it *completely* to be hidden from my sight, unless it is absolutely necessary for me to be concerned about it :-)
So what would be the point of implementing a string "wrapper" that knew its encoding as part of the type if you didn't want to know the encoding in most of the cases? I think I'm missing the logic there.
The logic would be that you would no longer have to be concerned whether Á (A with acute), etc., is encoded as ISO-8859-2 says, or as UTF-8 says, etc. But you would *always* handle the string as a sequence of *Unicode* code-points or even "logical characters", and not as a sequence of bytes that are being somehow (generally) encoded. I can imagine use-cases where it still would be OK to get the underlying byte-sequence (read-only) for things that are encoding-independent.
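For illustration, seeing UTF-8 as code points rather than bytes is mostly a matter of skipping 10xxxxxx continuation bytes; a minimal, non-validating counter (the function name is invented, and a real text class would of course validate and expose iterators rather than a count):

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in UTF-8 input by counting only lead bytes.
// Continuation bytes have the bit pattern 10xxxxxx.
std::size_t code_point_count(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80) // not a continuation byte => new code point
            ++n;
    return n;
}
```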
The means for this would be: Let us build a string, that may (or may not) be based on your general (encoding agnostic) string. And this string would handle the transcoding in most cases without me viewing the underlying byte sequence by functors that need me *everytime* to specify what encoding I want explicitly. By default I want UTF-8, if I talk to the OS I say I want the string in an encoding that the OS expects, not that I want it in UTF-16, ISO-8859-2, KOI8-R, etc. If and only if I want to handle the string in another encoding than Unicode should I have to specify that explicitly.
So we're obviously talking about two different strings here -- your "text" that knows the encoding and the immutable string that you may or may not build upon. How then do you design the algorithms if you *didn't* want to explicitly specify the encoding you want the algorithms to use?
By saying that the *implicit* encoding is UTF-8 and that, should I need to use another encoding, I will treat it as a special case. Every time I do not specify an encoding it is assumed by default to be UTF-8, i.e. when I'm reading text from a TCP connection or from a file I expect that it already is UTF-8 encoded and would like the string (optionally or always) to validate it for me. Then there are two cases:

a) The default encoding of std::string, depending upon std::locale, and the encoding of std::wstring, which is for example on Windows by default treated as UTF-16 and on Linux as UTF-32. For these I would love to have some simple means of saying to 'boost::text': give me your representation in the encoding that std::string is expected to be encoded in, or "build" yourself from the native encoding that std::string is supposed to be using -- and the same for std::wstring.

b) Every other encoding. For example, if I really needed to convert my string to IBM CP850 because I want to send it to an old printer, then only in this case should I be required (obviously) to specify the encoding explicitly.
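The "UTF-8 unless stated otherwise" policy described in (a) and (b) could be sketched like this; `text`, `from_native`, and `to_native` are hypothetical names, and the transcoders here are identity placeholders where a real implementation would call a platform converter (e.g. via Boost.Locale):

```cpp
#include <string>
#include <utility>

// Hypothetical text class: always stores UTF-8 internally, so an
// encoding name only ever appears at the program's boundaries.
class text {
public:
    // Default: the input is assumed to be valid UTF-8 already.
    explicit text(std::string utf8_bytes) : utf8_(std::move(utf8_bytes)) {}

    // Boundary helpers; "native" would be UTF-16 on Windows, UTF-32 or
    // locale-dependent on other platforms.
    static text from_native(const std::string& native_bytes) {
        return text(transcode_to_utf8(native_bytes));
    }
    std::string to_native() const { return transcode_from_utf8(utf8_); }

    const std::string& utf8() const { return utf8_; }

private:
    // Placeholder transcoders: identity here, real conversion in practice.
    static std::string transcode_to_utf8(const std::string& s)   { return s; }
    static std::string transcode_from_utf8(const std::string& s) { return s; }

    std::string utf8_;
};
```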
In one of the previous messages I laid out an algorithm template like so:
template <class String>
void foo(String s) {
    view<encoding> encoded(s);
    // deal with encoded from here on out
}
Of course then from foo's user perspective, she wouldn't have to do anything with her string to be passed in. From the algorithm implementer perspective you would know exactly what encoding was wanted and how to go about implementing the algorithm, even potentially having something like this as well:
template <class Encoding>
void foo(view<Encoding> encoded) {
    // deal with the encoded string appropriately here
}
And you get the benefits in either case of being able to either explicitly or implicitly deal with strings depending on whether they have been explicitly encoded already or whether it's just a raw set of bytes.
I see that this is OK for many use cases. But having a single pre-defined, default encoding also has its advantages, because usually you can skip the whole view<Encoding> part.
[snip/]
This is a different matter. Again I may be wrong, but I live under the impression that RangeEx has been implemented to hide the ugliness of complex STL iterator-based algorithms. :)
Of course the proof will be in the pudding. ;)
I think we need to qualify what you refer to as APIs. If just judging from the amount of code that's written against Qt or MFC for example then I'd say "they're pretty well accepted". If you look at the libraries that use ICU as a backend I'd say we already have one in Boost called Boost.Regex. And there's all these other libraries in the Linux arena that have their own little niche to play in the Unicode game -- there's Glib, the GNOME and KDE libraries, ad nauseam.
Besides what you mentioned, an API for me is for example WINAPI, the POSIX API, the OpenGL API, the OpenSSL API, etc. Basically all the functions "exported" by the various C/C++ libraries that I cannot imagine my life without :) and which expect not a generic iterator range or a view or whatnot but a plain and simple pointer (const char*) pointing to a contiguous block in memory containing a zero-terminated C string, or, if we are luckier, a std::string.
So, if there was a way to "encode" (there's that word again) the data in an immutable string into an acceptably-rendered `char const *` would that solve the problem? The whole point of my assertion (and Dave's question) is whether c_str() would have to be intrinsic to the string, which I have pointed out in a different message (not too long ago) that it could very well be an external algorithm.
Generally speaking, the syntax is not that important for me; I can get used to almost everything :) so c_str(my_str) is OK with me, if it does not involve just copying the string whatever the internal representation is. As Robert said, if the internal string data already is contiguous then this should be a no-op.

boost::string s = get_huge_string();
s = s ^ get_another_huge_string();
s = s ^ get_yet_another_huge_string();
std::string(s).c_str()

is too inefficient for my taste.
Right. This is Boost anyway, and I've always viewed libraries that get proposed to and accepted into Boost as the kinds of libraries that are developed to eventually be made part of the C++ standard library.
So while out of the gate the string implementation may very well not be called std::string, I don't see why the current std::string can't be deprecated later on (look at std::auto_ptr) and a different implementation put in its place? :D Of course that may very well be C++21xx, so I don't think I need to worry about it having to be a std::string killer at the outset. ;)
If you pull this off (replacing std::string without having a transition period with a backward-compatible interface) then you will be my personal hero. :-) Wait... provided that the encoding-related stuff I said above will be part of the string :) or there will be some wrapper around it providing that functionality. Matus


On Thu, Jan 27, 2011 at 12:43 AM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 5:06 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 11:19 PM, Matus Chochlik <chochlik@gmail.com> wrote: [snip/]
Right, but others seem to want to know about the implementation details to try and work out whether the overall interface being designed is actually going to be a viable implementation. So while I say "value semantics" others have asked how that would be implemented and -- being the gratuitous typer that I am ;) -- I would respond. :D
OK :)
:D
So what would be the point of implementing a string "wrapper" that knew its encoding as part of the type if you didn't want to know the encoding in most of the cases? I think I'm missing the logic there.
The logic would be that you would no longer have to be concerned whether A' (A with acute), etc., is encoded as ISO-8859-2 says, or as UTF-8 says, etc. You would *always* handle the string as a sequence of *Unicode* code-points or even "logical characters", and not as a sequence of bytes that are being somehow encoded (generally). I can imagine use-cases where it still would be OK to get the underlying byte-sequence (read-only) for things that are encoding-independent.
So really this wrapper is the 'view' that I talk about that carries with it an encoding and the underlying data. Right?
So we're obviously talking about two different strings here -- your "text" that knows the encoding and the immutable string that you may or may not build upon. How then do you design the algorithms if you *didn't* want to explicitly specify the encoding you want the algorithms to use?
By saying that the *implicit* encoding is UTF-8 and that, should I need to use another encoding, I will treat it as a special case.
I don't see the value in this though requiring that it be part of the 'text'. I could easily write something like: typedef view<utf8_encoded> utf8; And have something like this be possible: utf8 u("The quick brown fox jumps over the lazy dog."); Now, that's your default utf8-encoded view of the underlying string. Right?
Every time I do not specify an encoding, it is assumed by default to be UTF-8, i.e. when I'm reading text from a TCP connection or from a file I expect that it already is UTF-8 encoded and would like the string (optionally or always) to validate it for me.
Hmmm... So then it's just a matter of using a type similar to what I pointed out above as the default then?
Then there are two cases: a) The default encoding of std::string, depending upon std::locale, and the encoding of std::wstring, which is for example on Windows by default treated as being encoded with UTF-16 and on Linux as UTF-32. For these I would love to have some simple means of saying to 'boost::text': give me your representation in the encoding that std::string is expected to be encoded in, or "build" yourself from the native encoding that std::string is supposed to be using; plus the same for wstring.
b) Every other encoding. For example, if I really needed to convert my string to IBM CP850 because I want to send it to an old printer, then only in this case should I be required (obviously) to specify the encoding explicitly.
I don't see why the default and the other encoding case are really that different from an interface perspective. The underlying string will still be a series of bytes in memory, and encoding is just a matter of viewing it a given way. Right?
In one of the previous messages I laid out an algorithm template like so:
template <class String>
void foo(String s) {
    view<encoding> encoded(s);
    // deal with encoded from here on out
}
Of course then from foo's user's perspective, she wouldn't have to do anything with her string to be passed in. From the algorithm implementer's perspective you would know exactly what encoding was wanted and how to go about implementing the algorithm, even potentially having something like this as well:
template <class Encoding>
void foo(view<Encoding> encoded) {
    // deal with the encoded string appropriately here
}
And you get the benefits in either case of being able to either explicitly or implicitly deal with strings depending on whether they have been explicitly encoded already or whether it's just a raw set of bytes.
I see that this is OK for many use cases. But having a single pre-defined, default encoding also has its advantages, because usually you can skip the whole view<Encoding> part.
So what if `typedef view<utf8_encoded> utf8` was there, how far would that be from the default encoding case? And why does it have to be especially UTF for that matter?
So, if there was a way to "encode" (there's that word again) the data in an immutable string into an acceptably-rendered `char const *` would that solve the problem? The whole point of my assertion (and Dave's question) is whether c_str() would have to be intrinsic to the string, which I have pointed out in a different message (not too long ago) that it could very well be an external algorithm.
Generally speaking, the syntax is not that important for me; I can get used to almost everything :) so c_str(my_str) is OK with me, if it does not involve just copying the string whatever the internal representation is. As Robert said, if the internal string data already is contiguous then this should be a no-op.
boost::string s = get_huge_string();
s = s ^ get_another_huge_string();
s = s ^ get_yet_another_huge_string();
std::string(s).c_str()
is too inefficient for my taste.
Why is it inefficient when there's no need for an actual copy to be involved? s ^ get_huge_string() would basically yield a lazily composed concatenation which could just hold references to the original strings (again, with potential for optimizations depending on the length of the strings, etc.). So then you can layer that up and just need to linearize it when it's actually required -- in the conversion for the std::string case. And if you really wanted to just linearize the string into a void * buffer somewhere then that should be perfectly fine as well. I guess assuming that you have actual temporaries built (like how std::string would have you believe) when concatenating strings will make it look like it's really inefficient, but there should be a way of making it more efficient *because* the string is immutable.
Right. This is Boost anyway, and I've always viewed libraries that get proposed to and accepted into Boost as the kinds of libraries that are developed to eventually be made part of the C++ standard library.
So while out of the gate the string implementation may very well not be called std::string, I don't see why the current std::string can't be deprecated later on (look at std::auto_ptr) and a different implementation put in its place? :D Of course that may very well be C++21xx, so I don't think I need to worry about it having to be a std::string killer at the outset. ;)
If you pull this off (replacing std::string without having a transition period with a backward compatible interface) then you will be my personal hero. :-)
Well don't hold your breath for that because, well, you won't have 'erase' and other things that std::string supports, so it won't be backward compatible to std::string. :)
Wait .. provided, that the encoding-related stuff I said above will be part of the string :) or there will be some wrapper around it providing that functionality.
typedef view<utf8_encoding> utf8; I don't see why that shouldn't work for your requirements. :) -- Dean Michael Berris about.me/deanberris

On Wed, Jan 26, 2011 at 6:26 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 12:43 AM, Matus Chochlik <chochlik@gmail.com> wrote:
So really this wrapper is the 'view' that I talk about that carries with it an encoding and the underlying data. Right?
Basically right, but generally I can imagine that the encoding would not have to be 'carried' by the view but just 'assumed'. But if by carrying you mean that it'll have just some tag template argument without any (too much) internal state, then: just Right.
I don't see the value in this though requiring that it be part of the 'text'. I could easily write something like:
The value of choosing Unicode is that, when processing text, you don't have to worry whether you picked the right code table to interpret the text into characters. Another "win" is that on many platforms UTF-8 is already being used as the default encoding where things are not completely encoding-independent, so you actually do not have to do any transcoding to view the data as UTF-8, because it already is UTF-8. I'm not a machine, so when I see written text I see characters, not byte-sequences and code tables, and I would like to be able to handle text in my programs at the level of code-points, if not logical characters too, and let the computer handle the encoding validation, etc., just like I don't have to specify how to encode a float or double. Another matter is that if you send your text by whatever means you choose to someone at the other end of the world, then he will see the same characters without any need for transcoding, locale-picking, mumbo jumbo.
typedef view<utf8_encoded> utf8;
And have something like this be possible:
utf8 u("The quick brown fox jumps over the lazy dog.");
Now, that's your default utf8-encoded view of the underlying string.
Right?
Right.
Every time when I do not specify an encoding it is assumed by default to be UTF-8 i.e. when I'm reading text from a TCP connection or from a file I expect that it already is UTF-8 encoded and would like the string (optionally or always) to validate it for me.
Hmmm... So then it's just a matter of using a type similar to what I pointed out above as the default then?
Yes.
I don't see why the default and the other encoding case are really that different from an interface perspective. The underlying string will still be a series of bytes in memory, and encoding is just a matter of viewing it a given way. Right?
if there is ...

typedef something_beyond_my_current_level_of_comprehension native_encoding;
typedef ... native_wide_encoding;

... which works as I described above with your view ...

text my_text = init();

and I can do for example:

ShellExecuteW(..., cstr(view<native_wide_encoding>(my_text)), ...);

... (and we can have some shorthand for c_str(view<native_wide_encoding>(x))), then, basically: Right.
So what if `typedef view<utf8_encoded> utf8` was there, how far would that be from the default encoding case? And why does it have to be especially UTF for that matter?
Because it is an already accepted and widely used text-encoding standard, capable of representing (basically) every writing system known to mankind, even some of the invented ones (Quenya, Klingon, ...) *at once* (in the same text) without any encoding-switching mumbo-jumbo being used :) If this is not obvious: I live in a part of the world where ASCII is just not enough and, believe me, we are all tired here of juggling with language-specific code-pages :)
[snip/]
Generally speaking, the syntax is not that important for me; I can get used to almost everything :) so c_str(my_str) is OK with me, if it does not involve just copying the string whatever the internal representation is. As Robert said, if the internal string data already is contiguous then this should be a no-op.
boost::string s = get_huge_string();
s = s ^ get_another_huge_string();
s = s ^ get_yet_another_huge_string();
std::string(s).c_str()
is too inefficient for my taste.
Why is it inefficient when there's no need for an actual copy to be involved?
No unnecessary copying => /me not complaining :)
typedef view<utf8_encoding> utf8;
I don't see why that shouldn't work for your requirements. :)
Make that typedef view<utf8_encoding> text; and I will be completely happy and very grateful. :-) We can always work out the little things in the process. Matus

On Thu, Jan 27, 2011 at 2:28 AM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 6:26 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 12:43 AM, Matus Chochlik <chochlik@gmail.com> wrote:
So really this wrapper is the 'view' that I talk about that carries with it an encoding and the underlying data. Right?
Basically right, but generally I can imagine that the encoding would not have to be 'carried' by the view but just 'assumed'. But if by carrying you mean that it'll have just some tag template argument without any (too much) internal state, then: just Right.
Being part of the type is "carrying".
I don't see the value in this though requiring that it be part of the 'text'. I could easily write something like:
The value of choosing Unicode is that, when processing text, you don't have to worry whether you picked the right code table to interpret the text into characters. Another "win" is that on many platforms UTF-8 is already being used as the default encoding where things are not completely encoding-independent, so you actually do not have to do any transcoding to view the data as UTF-8, because it already is UTF-8.
I don't think I was questioning why UTF-8 specifically. I was questioning why there had to be a "default is UTF-8" when really it's just a sequence of bytes whether UTF-8, MPEG, Base64, MIME, etc.
I'm not a machine, so when I see written text I see characters, not byte-sequences and code tables, and I would like to be able to handle text in my programs at the level of code-points, if not logical characters too, and let the computer handle the encoding validation, etc., just like I don't have to specify how to encode a float or double.
Another matter is that if you send your text by whatever means you choose to someone at the other end of the world, then he will see the same characters without any need for transcoding, locale-picking, mumbo jumbo.
But this already happens, it's called 7-bit clean byte encoding -- barring any endianness issues, just stuff whatever you already have in a `char const *` into a socket. HTTP, FTP, and even memcached's protocol work fine without the need to interpret strings other than as a sequence of bytes; my original opposition is to having a string that by default looks at the data in it as UTF-8, when really a string would just be a sequence of bytes, not necessarily contiguous. [snip parts where we already agree]
I don't see why the default and the other encoding case are really that different from an interface perspective. The underlying string will still be a series of bytes in memory, and encoding is just a matter of viewing it a given way. Right?
if there is ...
typedef something_beyond_my_current_level_of_comprehension native_encoding; typedef ... native_wide_encoding;
... which works as I described above with your view ...
text my_text = init();
and I can do for example: ShellExecuteW( ..., cstr(view<native_wide_encoding>(my_text)), ... );
... (and we can have some shorthand for c_str(view<native_wide_encoding>(x))), then, basically, Right.
Yes, that's the intention. There's even an alternative (really F'n ugly) way I suggested as well:

char evil_stack_buffer[256];
linearize(string, evil_stack_buffer, 256);

Which means you can let the user of the interface define where the linearized version of the immutable string would be placed.
So what if `typedef view<utf8_encoded> utf8` was there, how far would that be from the default encoding case? And why does it have to be especially UTF for that matter?
Because it is an already accepted and widely used text-encoding standard, capable of representing (basically) every writing system known to mankind, even some of the invented ones (Quenya, Klingon, ...) *at once* (in the same text) without any encoding-switching mumbo-jumbo being used :)
I think I was asking why make a string to default encode in UTF-8 when UTF-8 was really just a means of interpreting a sequence of bytes. Why a string would have to do that by default is what I don't understand -- and which is why I see it as a decoupling of a view and an underlying string. I know why UTF-* have their merits for encoding all the "characters" for the languages of the world. However I don't see why it has to be default for a string.
If this is not obvious; I live in a part of the world where ASCII is just not enough and, believe me, we are all tired here of juggling with language-specific code-pages :)
Nope, it's not obvious, but it's largely a matter of circumstance really I would say. ;)
Why is it inefficient when there's no need for an actual copy to be involved?
No unnecessary copying => /me not complaining :)
Cool. :)
typedef view<utf8_encoding> utf8;
I don't see why that shouldn't work for your requirements. :)
Make that
typedef view<utf8_encoding> text;
and I will be completely happy and very grateful. :-) we can always work out the little things in the process.
Agreed. Now I'm prepared to move on and solidify the interface to the immutable string and the views. Expect some (long, potentially tiring, but hopefully more coherent) output in terms of an interface proposal in a few hours. :) -- Dean Michael Berris about.me/deanberris

On Thu, Jan 27, 2011 at 4:49 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 2:28 AM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 6:26 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 12:43 AM, Matus Chochlik <chochlik@gmail.com> wrote:
So really this wrapper is the 'view' that I talk about that carries with it an encoding and the underlying data. Right?
Basically right, but generally I can imagine that the encoding would not have to be 'carried' by the view but just 'assumed'. But if by carrying you mean that it'll have just some tag template argument without any (too much) internal state, then: just Right.
Being part of the type is "carrying".
Yes, but a polymorphic type could also "carry" the encoding information; it wasn't clear to me what exactly you had in mind. [snip/]
I don't think I was questioning why UTF-8 specifically. I was questioning why there had to be a "default is UTF-8" when really it's just a sequence of bytes whether UTF-8, MPEG, Base64, MIME, etc.
Last time I checked, JPEG, MPEG, Base64, ASN1, etc., etc., were not *text* encodings. And I believe that handling text is what the whole discussion is ultimately about. [snip/]
But this already happens, it's called 7-bit clean byte encoding -- barring any endianness issues, just stuff whatever you already have in a `char const *` into a socket. HTTP, FTP, and even memcached's protocol work fine without the need to interpret strings other than as a sequence of bytes; my original opposition is to having a string that by default looks at the data in it as UTF-8, when really a string would just be a sequence of bytes, not necessarily contiguous.
Again: where you see a string primarily as a class for handling raw data that can be interpreted in hundreds of different ways, I see string primarily as a class for encoding human-readable text. [snip/]
if there is ...
typedef something_beyond_my_current_level_of_comprehension native_encoding; typedef ... native_wide_encoding;
On second thought, this probably should be a type templated with a char type.
... which works as I described above with your view ...
text my_text = init();
and I can do for example: ShellExecuteW( ..., cstr(view<native_wide_encoding>(my_text)), ... );
... (and we can have some shorthand for c_str(view<native_wide_encoding>(x))), then, basically, Right.
Yes, that's the intention. There's even an alternative (really F'n ugly) way I suggested as well:
char evil_stack_buffer[256];
linearize(string, evil_stack_buffer, 256);
Of course it is an alternative, but there are also lots of functions in various APIs, the ShellExecute above being one of them, where you would need 4-10 such evil_stack_buffers, and the performance gain compared to the loss related to the ugliness and C-ness of the code is not worth it (for me). If I liked that kind of programming I would use C all the time and not only in places where absolutely necessary.
Which means you can let the user of the interface define where the linearized version of the immutable string would be placed.
[snip/]
I think I was asking why make a string to default encode in UTF-8 when UTF-8 was really just a means of interpreting a sequence of bytes. Why a string would have to do that by default is what I don't understand -- and which is why I see it as a decoupling of a view and an underlying string.
I know why UTF-* have their merits for encoding all the "characters" for the languages of the world. However I don't see why it has to be default for a string.
Because the byte sequence is interpreted into *text*. Let me try one more time: imagine that someone proposed to you that he create an ultra-fast-and-generic type for handling floating point numbers, that there would be ~200 possible encodings for a float or double, and that the usage of the type would be

uber_float x = get_x();
uber_float y = get_y();
uber_float z = view<acme_float_encoding_123_331_4342_Z>(x) + view<acme_float_encoding_123_331_4342_Z>(y);
uber_float w = third::party::math::log(view<acme_float_encoding_452323_X>(z));

Would you choose it to calculate your z = x + y and w = log(z) in the 98% of regular cases where you don't need to handle numbers on a helluva-big scale/range/precision? I would not.
If this is not obvious; I live in a part of the world where ASCII is just not enough and, believe me, we are all tired here of juggling with language-specific code-pages :)
Nope, it's not obvious, but it's largely a matter of circumstance really I would say. ;)
But most of the world actually lives under these circumstances. :)
[snip/] BR, Matus

On Thu, Jan 27, 2011 at 5:32 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 4:49 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 2:28 AM, Matus Chochlik <chochlik@gmail.com> wrote:
Being part of the type is "carrying".
Yes, but a polymorphic type could also "carry" the encoding information; it wasn't clear to me what exactly you had in mind.
Sorry, I had no intention of implying that a template would carry type information any other way than by having that information be part of the type.
[snip/]
I don't think I was questioning why UTF-8 specifically. I was questioning why there had to be a "default is UTF-8" when really it's just a sequence of bytes whether UTF-8, MPEG, Base64, MIME, etc.
Last time I checked, JPEG, MPEG, Base64, ASN1, etc., etc., were not *text* encodings. And I believe that handling text is what the whole discussion is ultimately about.
But why do you need to separate text encoding from encoding in general? Here's the logic: You have a sequence of bytes (in a string). You want to interpret that sequence of bytes in a given encoding (with a view). Why does the encoding have to apply only to text?
[snip/]
But this already happens, it's called 7-bit clean byte encoding -- barring any endianness issues, just stuff whatever you already have in a `char const *` into a socket. HTTP, FTP, and even memcached's protocol work fine without the need to interpret strings other than as a sequence of bytes; my original opposition is to having a string that by default looks at the data in it as UTF-8, when really a string would just be a sequence of bytes, not necessarily contiguous.
Again: where you see a string primarily as a class for handling raw data that can be interpreted in hundreds of different ways, I see string primarily as a class for encoding human-readable text.
So what's the difference between a string for encoding human readable text and a string that handles raw data?
Yes that's the intention. There's even an alternative (really F'n ugly way) I suggested as well:
char evil_stack_buffer[256]; linearize(string, evil_stack_buffer, 256);
Of course it is an alternative, but there are also lots of functions in various APIs, the ShellExecute above being one of them, where you would need 4-10 such evil_stack_buffers, and the performance gain compared to the loss related to the ugliness and C-ness of the code is not worth it (for me). If I liked that kind of programming I would use C all the time and not only in places where absolutely necessary.
I didn't disagree with your original statement and I think both interfaces -- the one that returns a pointer and the one that takes a buffer with length as arguments -- have a place in the same world.
I know why UTF-* have their merits for encoding all the "characters" for the languages of the world. However I don't see why it has to be default for a string.
Because the byte sequence is interpreted into *text*.
So?
Let me try one more time: imagine that someone proposed to you that he create an ultra-fast-and-generic type for handling floating point numbers, that there would be ~200 possible encodings for a float or double, and that the usage of the type would be
uber_float x = get_x();
uber_float y = get_y();
uber_float z = view<acme_float_encoding_123_331_4342_Z>(x) + view<acme_float_encoding_123_331_4342_Z>(y);
uber_float w = third::party::math::log(view<acme_float_encoding_452323_X>(z));
Would you choose it to calculate your z = x + y and w = log(z) in the 98% of regular cases where you don't need to handle numbers on a helluva-big scale/range/precision? I would not.
So what's wrong with:

view<some_encoding_0> x = get_x();
view<some_encoding_1> y = get_y();
view<some_encoding_3> z = x + y;
float w = log(as<acme_float_encoding>(z));

? See, there's absolutely 0 reason why you *have* to deal with a raw sequence of bytes if what you really want is to deal with a view of these bytes from the outset. Again I ask, am I missing something here?
If this is not obvious; I live in a part of the world where ASCII is just not enough and, believe me, we are all tired here of juggling with language-specific code-pages :)
Nope, it's not obvious, but it's largely a matter of circumstance really I would say. ;)
But most of the world actually lives under these circumstances. :)
Right, what I meant to say is that it hardly has any bearing when we're talking about engineering solutions. So your circumstances and mine may very well be different, but that doesn't change that we're trying to solve the same problem. :) -- Dean Michael Berris about.me/deanberris

On Thu, Jan 27, 2011 at 12:09 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 5:32 PM, Matus Chochlik <chochlik@gmail.com> wrote: [snip/]
Last time I checked, JPEG, MPEG, Base64, ASN1, etc., etc., were not *text* encodings. And I believe that handling text is what the whole discussion is ultimately about.
But why do you need to separate text encoding from encoding in general? Here's the logic:
In general? Nothing. I do not have (nor did I have in the past) anything against a general, efficient, encoding-agnostic string if it is called general_string. But std::string IMO is, and always has been, primarily about handling text. I certainly do not know anyone who would store an MPEG inside std::string.
You have a sequence of bytes (in a string). You want to interpret that sequence of bytes in a given encoding (with a view).
Why does the encoding have to apply only to text?
Encoding does not have to apply only to text, but my, let's call it vision, is that the "everyday" handling of text would use a single encoding. There are people who have invested a whole lotta love :) and time into making that possible, and they are generally called the Unicode consortium. C++(1x) already adopts part of their work via the u"" and U"" literal types, because it has countless advantages. Why not take one more step in that direction and use it for the 'string' type by default?
[snip/]
But this already happens, it's called 7-bit clean byte encoding -- barring any endianness issues, just stuff whatever you already have in a `char const *` into a socket. HTTP, FTP, and even memcached's protocol work fine without the need to interpret strings other than as a sequence of bytes; my original opposition is to having a string that by default looks at the data in it as UTF-8, when really a string would just be a sequence of bytes, not necessarily contiguous.
Again, where you see a string primarily as a class for handling raw data, that can be interpreted in hundreds of different ways I see primarily string as a class for encoding human readable text.
So what's the difference between a string for encoding human readable text and a string that handles raw data?
Usability. It is usually more difficult to use the super-generic everything-solving things. I repeat, probably for the tenth time, that I'm not against such a string in general, but this is not std::string. [snip/]
Because the byte sequence is interpreted into *text*.
So?
Let me try one more time: Imagine that someone proposed to you that he creates an ultra-fast-and-generic type for handling floating point numbers and there would be ~200 possible encodings for a float or double and the usage of the type would be
uber_float x = get_x();
uber_float y = get_y();
uber_float z = view<acme_float_encoding_123_331_4342_Z>(x) +
               view<acme_float_encoding_123_331_4342_Z>(y);
uber_float w = third::party::math::log(view<acme_float_encoding_452323_X>(z));
would you choose it to calculate your z = x + y and w = log(z) in the 98% of regular cases where you don't need to handle numbers on a helluva-big scale/range/precision? I would not.
So what's wrong with:
view<some_encoding_0> x = get_x();
view<some_encoding_1> y = get_y();
view<some_encoding_3> z = x + y;
float w = log(as<acme_float_encoding>(z));
Unnecessary verbosity. Do you really want all the people that now do:

struct person {
    std::string name;
    std::string middle_name;
    std::string family_name;
    // .. etc.
};

to do this?

struct person {
    boost::view<some_encoding_tag> name;
    boost::view<some_encoding_tag> middle_name;
    boost::view<some_encoding_tag> family_name;
    // .. etc.
};
?
See, there's absolutely 0 reason why you *have* to deal with a raw sequence of bytes if what you really want is to deal with a view of these bytes from the outset.
Again I ask, am I missing something here?
Please see the example above. [snip/]
Right, what I meant to say is that it hardly has any bearing when we're talking about engineering solutions. So your circumstances and mine may very well be different, but that doesn't change that we're trying to solve the same problem. :)
If along with solving your problem (all the completely valid points that you had about the performance) we also solve my and others' problems (completely valid points about the encoding), and we think about the acceptability and "adoptability", we provide a backward compatible interface for people who do not have the time to re-implement all their string-related code at once, and try really hard to get it into the standard, then I do not have a thing against it. BR, Matus

On Thu, Jan 27, 2011 at 8:45 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 12:09 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
But why do you need to separate text encoding from encoding in general? Here's the logic:
In general? Nothing. I do not have (nor did I have in the past) anything against a general efficient encoding-agnostic string if it is called general_string. But std::string IMO is and always has been primarily about handling text. I certainly do not know anyone who would store an MPEG inside std::string.
std::string has not been about handling text -- it's about encapsulating the notion of a sequence of characters with a suitable definition of `character`. You have string algorithms that apply mostly to strings -- pattern matching, slicing/concatenation, character location, tokenization, etc. The notion of "text" is actually a higher concept which imbues a string with things like encoding, language, locality, etc. which all live at a different level. As for people storing <encoded> data inside a string, note that most text-based protocols transfer things now in Base64 or Base32 or some variant of that encoding -- precisely so that they can be dealt with as character sequences. If you were catching an XMPP stream-fed Base64 encoded H.264 video stream why not put it in a string? I wouldn't put it in std::string if I had any *sane* choice because it's just broken IMO but like most people who intend to do things with data in memory gotten from a character stream, you put it in a string.
You have a sequence of bytes (in a string). You want to interpret that sequence of bytes in a given encoding (with a view).
Why does the encoding have to apply only to text?
Encoding does not have to apply only to text, but my, let's call it a vision, is that the "everyday" handling of text would use a single encoding. There are people who have invested a whole lotta love :) and time into making it possible and they are generally called the Unicode consortium. C++(1x) already adopts part of their work via the u"" and U"" literal types, because it has countless advantages. Why not take one more step in that direction and use it for the 'string' type by default.
So the literals are already encoded and guess what, they're still a sequence of bytes. The only "sane" way to deal with it is to provide an appropriate *view* of the encoded data in the appropriate level of abstraction. A string I argue is *not* that level of abstraction.
So what's the difference between a string for encoding human readable text and a string that handles raw data?
Usability. It is usually more difficult to use the super-generic everything-solving things. I repeat, probably for the tenth time, that I'm not against such a string in general, but this is not std::string.
Usability of what, the type? Any type is as usable as any other the way I see it -- they're all just types. So aside from aesthetic/cosmetic differences, what's the point?
So what's wrong with:
view<some_encoding_0> x = get_x();
view<some_encoding_1> y = get_y();
view<some_encoding_3> z = x + y;
float w = log(as<acme_float_encoding>(z));
Unnecessary verbosity.
What verbosity? We deal with that through typedefs and descriptive names. Heck C++0x has auto so I don't know what 'verbosity' you're referring to. And if you really wanted to know the encoding of the data from the type, how else would you do it?
Do you really want all the people that now do:
struct person {
    std::string name;
    std::string middle_name;
    std::string family_name;
    // .. etc.
};
to do this ?
struct person {
    boost::view<some_encoding_tag> name;
    boost::view<some_encoding_tag> middle_name;
    boost::view<some_encoding_tag> family_name;
    // .. etc.
};
Well:

typedef boost::strings::view<boost::strings::utf8_encoding> utf8_string;

struct person {
    utf8_string name, middle_name, family_name;
};

Where's the verbosity in that?
?
See, there's absolutely 0 reason why you *have* to deal with a raw sequence of bytes if what you really want is to deal with a view of these bytes from the outset.
Again I ask, am I missing something here?
Please see the example above.
I did and I saw an even more succinct way of doing it. So again, I don't see what I'm missing here.
[snip/]
Right, what I meant to say is that it hardly has any bearing when we're talking about engineering solutions. So your circumstances and mine may very well be different, but that doesn't change that we're trying to solve the same problem. :)
If along with solving your problem (all the completely valid points that you had about the performance) we also solve my and others' problems (completely valid points about the encoding) and we think about the acceptability and "adoptability",
I don't know what "acceptability" and "adoptability" mean in this context. Both of these are a matter of taste and not of technical merit.
we provide a backward compatible interface for people who do not have the time to re-implement all their string-related code at once and try really hard to get it into the standard, then I do not have a thing against it.
Backward compatibility to a broken implementation hardly seems like a worthy goal. Deprecation is a better route IMHO. Even if it does become std::string, it will be a deprecation of the original definition. Deprecation *is* an option. HTH -- Dean Michael Berris about.me/deanberris

On Thu, Jan 27, 2011 at 2:55 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Thu, Jan 27, 2011 at 8:45 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 12:09 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
[snip/]
Usability of what, the type? Any type is as usable as any other the way I see it -- they're all just types. So aside from aesthetic/cosmetic differences, what's the point?
I don't want to sound like an advertisement but .. aesthetics matter .. to many people. Unless we want to turn even the completely basic things into rocket science.
So what's wrong with:
view<some_encoding_0> x = get_x();
view<some_encoding_1> y = get_y();
view<some_encoding_3> z = x + y;
float w = log(as<acme_float_encoding>(z));
Unnecessary verbosity.
What verbosity?
Like in having to say too much (view<some_encoding_3>) to express a very simple thing (string/text)?
We deal with that through typedefs and descriptive names. Heck C++0x has auto so I don't know what 'verbosity' you're referring to.
And if you really wanted to know the encoding of the data from the type, how else would you do it?
Do you really want all the people that now do:
struct person {
    std::string name;
    std::string middle_name;
    std::string family_name;
    // .. etc.
};
to do this ?
struct person {
    boost::view<some_encoding_tag> name;
    boost::view<some_encoding_tag> middle_name;
    boost::view<some_encoding_tag> family_name;
    // .. etc.
};
Well:
typedef boost::strings::view<boost::strings::utf8_encoding> utf8_string;
This (well not exactly utf8_string, but that was a completely different discussion) is what I was suggesting all along :)
struct person { utf8_string name, middle_name, family_name; };
Where's the verbosity in that?
Chop away the utf8_ part and I will be happy and quiet. [snip/]
Right, what I meant to say is that it hardly has any bearing when we're talking about engineering solutions. So your circumstances and mine may very well be different, but that doesn't change that we're trying to solve the same problem. :)
If along with solving your problem (all the completely valid points that you had about the performance) we also solve my and others' problems (completely valid points about the encoding) and we think about the acceptability and "adoptability",
I don't know what "acceptability" and "adoptability" mean in this context.
It is the "sex-appeal" of your class. It is basically about whether people would be willing to take your class and use it instead of std::string in their code, and this includes doing all the necessary changes in their code.
Both of these are a matter of taste and not of technical merit.
we provide a backward compatible interface for people who do not have the time to re-implement all their string-related code at once and try really hard to get it into the standard, then I do not have a thing against it.
Backward compatibility to a broken implementation hardly seems like a worthy goal. Deprecation is a better route IMHO.
OK, so what do you propose your string is going to be called in the standard? Because calling it std::string is not deprecation. Deprecation would be calling it std::someotherstring and having them both in the standard for a certain period of time.
Even if it does become std::string, it will be a deprecation of the original definition. Deprecation *is* an option.
OK, if it is called std::string how do you imagine this would work? The (old) std::string would be deprecated for a certain period of time and then it would be replaced by (new) std::string ? If this is what you have in mind then I need to change my view on what that word means. Matus

On Thu, Jan 27, 2011 at 11:09 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 2:55 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
Usability of what, the type? Any type is as usable as any other the way I see it -- they're all just types. So aside from aesthetic/cosmetic differences, what's the point?
I don't want to sound like an advertisement but .. aesthetics matter .. to many people. Unless we want to turn even the completely basic things into rocket science.
I don't know who advertises and says "aesthetics matter" -- because I don't see advertisements that say that; they imply it, though they don't say it straight out. But that's beside the point. :D I don't know what you mean by "completely basic things". Are you referring to string processing? Because there's a programming language out there precisely for dealing with strings (called Perl) and I'd say that's far from basic.
What verbosity?
Like in having to say to much (view<some_encoding_3>) to express a very simple thing (string/text)?
That's what typedefs are for -- you can call it anything you like.
Well:
typedef boost::strings::view<boost::strings::utf8_encoding> utf8_string;
This (well not exactly utf8_string, but that was a completely different discussion) is what I was suggesting all along :)
So why is it not exactly utf8_string? It's a view of a string that has a UTF-8 encoding.
struct person { utf8_string name, middle_name, family_name; };
Where's the verbosity in that?
Chop away the utf8_ part and I will be happy and quiet.
Unfortunately, that's not what a string is IMO. A string *again* doesn't have a default encoding and an encoding is not part of the definition of a string -- which I have already maintained earlier on.
I don't know what "acceptability" and "adoptability" mean in this context.
It is the "sex-appeal" of your class.
Weh?
It is basically about whether people would be willing to take your class and use it instead of std::string in their code, and this includes doing all the necessary changes in their code.
So that's not really anything about me or my string, but more about users. Right? Beauty *is* in the eye of the beholder while quality is something we all understand but yet have a hard time defining. So I'd rather strive for quality than beauty. And if it's high quality enough then making people switch would be a matter of choice on their end. Right?
Backward compatibility to a broken implementation hardly seems like a worthy goal. Deprecation is a better route IMHO.
OK, so what do you propose your string is going to be called in the standard? Because calling it std::string is not deprecation. Deprecation would be calling it std::someotherstring and having them both in the standard for a certain period of time.
You can deprecate the interface of std::string. This is what I've been saying all along. Once you deprecate string mutation, then you end up with an interface that's similar to what I had been proposing. And at that point, the idea would have ultimately won.
Even if it does become std::string, it will be a deprecation of the original definition. Deprecation *is* an option.
OK, if it is called std::string how do you imagine this would work? The (old) std::string would be deprecated for a certain period of time and then it would be replaced by (new) std::string ?
Nope, in the transition period it would say that the mutations are actually deprecated, remove them now or it will break in the next upcoming standard.
If this is what you have in mind then I need to change my view on what that word means.
You can deprecate the current notion of std::string -- the underlying implementation can change once a new standard is ratified, and like shared_ptr<> if it's already in boost, will probably be adopted by compiler vendors with their specific enhancements for their compiler. If that's not what deprecation means I don't know any other meaning of it. HTH -- Dean Michael Berris about.me/deanberris

On 01/27/2011 04:45 AM, Matus Chochlik wrote:
... elision by patrick ... In general? Nothing. I do not have (nor did I have in the past) anything against a general efficient encoding-agnostic string if it is called general_string. But std::string IMO is and always has been primarily about handling text. I certainly do not know anyone who would store an MPEG inside std::string.
You may think it strange, but there's a lot of code out there that uses std::string as a binary buffer.
You have a sequence of bytes (in a string). You want to interpret that sequence of bytes in a given encoding (with a view).
Why does the encoding have to apply only to text?
It doesn't, and in your immutable string (or with std::string also) your idea of views is a nice one. It would have different benefits than a utf-xx_string with intrinsic encoding.
Encoding does not have to apply only to text, but my, let's call it a vision, is that the "everyday" handling of text would use a single encoding. There are people who have invested a whole lotta love :) and time into making it possible and they are generally called the Unicode consortium. C++(1x) already adopts part of their work via the u"" and U"" literal types, because it has countless advantages. Why not take one more step in that direction and use it for the 'string' type by default.
That won't happen with std::string though. It's in the C++ spec as behaving a certain way and you won't change that. You might have a chance of getting a utf-8_string in there though.
[snip/]
But this already happens, it's called 7-bit clean byte encoding -- barring any endianness issues, just stuff whatever you already have in a `char const *` into a socket. HTTP, FTP, and even memcached's protocol work fine without the need to interpret strings other than as sequences of bytes; my original opposition is having a string that by default looked at the data in it as UTF-8 when really a string would just be a sequence of bytes, not necessarily contiguous. Again, where you see a string primarily as a class for handling raw data that can be interpreted in hundreds of different ways, I see string primarily as a class for encoding human-readable text.
And you see it as encoding it in utf-8. Don't forget that. It's a very specialized use out of the many that std::string supports today.
So what's the difference between a string for encoding human readable text and a string that handles raw data? Usability. It is usually more difficult to use the super-generic everything-solving things. I repeat, probably for the tenth time, that I'm not against such a string in general, but this is not std::string.
And neither would a string that enforced utf-8 encoding be std::string. We already have one in the spec, and it's not that.
... elision by patrick ... Unnecessary verbosity.
Do you really want all the people that now do:
struct person {
    std::string name;
    std::string middle_name;
    std::string family_name;
    // .. etc.
};
to do this ?
struct person {
    boost::view<some_encoding_tag> name;
    boost::view<some_encoding_tag> middle_name;
    boost::view<some_encoding_tag> family_name;
    // .. etc.
};
If their encoding is not utf-8 compatible it works with std::string, but wouldn't work with your utf-8 string. Your argument is the same as applied to your string.
... elision by patrick ...
Right, what I meant to say is that it hardly has any bearing when we're talking about engineering solutions. So your circumstances and mine may very well be different, but that doesn't change that we're trying to solve the same problem. :)
No. You're not trying to solve the same problem at all! (And neither of you are trying to deal with std::string.)

You, Dean, are trying to solve an efficiency problem caused by mutable strings, and note that an external view can interpret as any encoding desired. You correctly point out that this is more general and flexible, that it has a power that can be applied to many things while giving you all the efficiency advantages of immutable data types. (Although why a general buffer for immutable data would be called string, which is normally associated with text, _is_ a bit confusing. I suspect you've gone down a road you never intended trying to make this point.)

You, Matus, are trying to solve a problem caused by a plethora of possible encodings and the extra work that has to be done every time you have to deal with them, by specifying that a string will have an encoding type associated with it (and in particular utf-8 as the natural default), and that the specialized string itself will enforce the encoding as well as provide ways to convert other encodings to it. (And I think the natural way to do this is with code conversion facets.) You correctly point out that this specificity allows a power in solving this one particular problem that a more general solution wouldn't be able to match. A general string with a view into it would allow you to get invalidly encoded data into it (N.B. for an immutable string _into it_ would have a different meaning) and you would only know about this after the fact.

These are both great things. Kudos to you both. You're both right. You guys keep arguing apples and orangutans and it makes it hard for others to talk about either one of your ideas because you're so busy going back and forth telling each other that the other doesn't get what they're trying to say. I wish you'd split into threads like [immutable string] and [unicode string]. Patrick

On Fri, Jan 28, 2011 at 5:57 AM, Patrick Horgan <phorgan1@gmail.com> wrote:
Right, what I meant to say is that it hardly has any bearing when we're talking about engineering solutions. So your circumstances and mine may very well be different, but that doesn't change that we're trying to solve the same problem. :)
No. You're not trying to solve the same problem at all! (And neither of you are trying to deal with std::string.)
You, Dean, are trying to solve an efficiency problem caused by mutable strings, and note that an external view can interpret as any encoding desired. You correctly point out that this is more general and flexible, that it has a power that can be applied to many things while giving you all the efficiency advantages of immutable data types. (Although why a general buffer for immutable data would be called string which is normally associated with text _is_ a bit confusing. I suspect you've gone down a road you never intended trying to make this point.)
Well, you're correct for the most part. Except I really did mean to call them strings. ;)
You, Matus, are trying to solve a problem caused by a plethora of possible encodings and the extra work that has to be done every time you have to deal with them, by specifying that a string will have an encoding type associated with it, (and in particular utf-8 as the natural default), and that the specialized string itself will enforce the encoding as well as provide ways to convert other encodings to it. (And I think the natural way to do this is with code conversion facets.) You correctly point out that this specificity allows a power in solving this one particular problem that a more general solution wouldn't be able to match. A general string with a view into it would allow you to get invalidly encoded data into it (N.B for an immutable string _into it_ would have a different meaning) and you would only know about this after the fact.
But validation is an algorithm, right? Why can't validation be enforced in a special view implementation? Perhaps an (ugly) throwing constructor of a view to enforce the encoding would be something that could be put in if that's the only proble... right?
These are both great things. Kudos to you both. You're both right. You guys keep arguing apples and orangutans and it makes it hard for others to talk about either one of your ideas because you're so busy going back and forth telling each other that the other doesn't get what they're trying to say.
I wish you'd split into threads like [immutable string] and [unicode string].
Although I think you're right that it's probably better to deal with the underlying strings issue in a different thread, it does invariably affect the way the algorithms for encoding/transcoding would be implemented. And then, as the C++ programmers that we are, we will get into the efficiency advantages/disadvantages of one aspect or another at a later time -- or we might not, remains to be seen, I know. :D So sure, but I think I've pointed out enough about what I think -- if others still have questions about the immutable strings and views idea, you know which thread to reply to. :) -- Dean Michael Berris about.me/deanberris

On Thu, Jan 27, 2011 at 10:57 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/27/2011 04:45 AM, Matus Chochlik wrote:
... elision by patrick ... In general? Nothing. I do not have (nor did I have in the past) anything against a general efficient encoding-agnostic string if it is called general_string. But std::string IMO is and always has been primarily about handling text. I certainly do not know anyone who would store an MPEG inside std::string.
You may think it strange, but there's a lot of code out there that uses std::string as a binary buffer.
You're right; just because I don't use it that way does not mean that it cannot be done. That is why I said that I'm OK if we call it 'text' instead of string in one of my previous posts. [snip-of-things-that-we-basically-agree-upon/]
Usability. It is usually more difficult to use the super-generic everything-solving things. I repeat, probably for the tenth time, that I'm not against such a string in general, but this is not std::string.
And neither would a string that enforced utf-8 encoding be std::string. We already have one in the spec, and it's not that.
Yes, also see above. But the main reason why I strongly oppose any mention of 'utf8' in the name of the general-text-handling class is basically the same as why I would oppose the general-floating-point-handling classes in C++ being called 'IEEE_754_float' and 'IEEE_754_double' instead of just plain 'float' and 'double'. I (and many others around here) have dealt with various text encodings and all those problems they cause in "non-ascii" environments so many times that my blood pressure skyrockets :) every time I hear that term. And I do not want to be reminded about it every time when dealing with text. Let us mention the encoding only when necessary. [snip/]
No. You're not trying to solve the same problem at all! (And neither of you are trying to deal with std::string.)
You, Dean, are trying to solve an efficiency problem caused by mutable strings, and note that an external view can interpret as any encoding desired. You correctly point out that this is more general and flexible, that it has a power that can be applied to many things while giving you all the efficiency advantages of immutable data types. (Although why a general buffer for immutable data would be called string which is normally associated with text _is_ a bit confusing. I suspect you've gone down a road you never intended trying to make this point.)
You, Matus, are trying to solve a problem caused by a plethora of possible encodings and the extra work that has to be done every time you have to deal with them, by specifying that a string will have an encoding type associated with it, (and in particular utf-8 as the natural default), and that the specialized string itself will enforce the encoding as well as provide ways to convert other encodings to it. (And I think the natural way to do this is with code conversion facets.) You correctly point out that this specificity allows a power in solving this one particular problem that a more general solution wouldn't be able to match. A general string with a view into it would allow you to get invalidly encoded data into it (N.B for an immutable string _into it_ would have a different meaning) and you would only know about this after the fact.
These are both great things. Kudos to you both. You're both right. You guys keep arguing apples and orangutans and it makes it hard for others to talk about either one of your ideas because you're so busy going back and forth telling each other that the other doesn't get what they're trying to say.
Believe me, Patrick, I have had exactly the same feeling (about the apples and orangutans) the whole time I've participated in the immutable vs. unicode string discussion. I know that Dean tries to focus on performance and does not care about encodings, and I do care about performance, just not as much as Dean does. The reason why I kept participating in this 'bike-shed-quarrel' is that I would hate to see the outcome be one just-another-super-efficient-string and one just-another-unicode-string. There are plenty of those already. I would like to see *text* handling in C++ addressed *in the standard* not only on the byte-sequence level, but on the code-point/character/word/etc. level.
I wish you'd split into threads like [immutable string] and [unicode string].
I am starting to like the idea of immutability, and if it indeed has so many advantages I don't see why the text class could not be built on the immutable_string class. Best, Matus

Matus Chochlik wrote:
I have had exactly the same feeling (about the apples and orangutans) the whole time I've participated in the immutable vs. unicode string discussion. I know that Dean tries to focus on performance and does not care about encodings, and I do care about performance, just not as much as Dean does.
That's unfair and wrong. Dean's view<utf8_encoding> type was roughly equivalent to your utf8_string (names aside) all along. He espouses a different approach to managing data storage and the encoding than do you, but that doesn't mean he "does not care about encodings." _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

On Fri, Jan 28, 2011 at 2:59 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Matus Chochlik wrote:
I have had exactly the same feeling (about the apples and orangutans) the whole time I've participated in the immutable vs. unicode string discussion. I know that Dean tries to focus on performance and does not care about encodings, and I do care about performance, just not as much as Dean does.
That's unfair and wrong. Dean's view<utf8_encoding> type was roughly equivalent to your utf8_string (names aside) all along. He espouses a different approach to managing data storage and the encoding than do you, but that doesn't mean he "does not care about encodings."
OK, that is basically right, and I know that Dean has proposed the view<encoding_tag> wrapper somewhere around the beginning. And I apologize for how the above may have sounded. Maybe I should have said it in a few more words at the risk of repeating myself for the N-th time. Matus

Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 5:32 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Thu, Jan 27, 2011 at 4:49 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
I don't think I was questioning why UTF-8 specifically. I was questioning why there had to be a "default is UTF-8" when really it's just a sequence of bytes whether UTF-8, MPEG, Base64, MIME, etc.
Last time I checked, JPEG, MPEG, Base64, ASN1, etc., etc., were not *text* encodings. And I believe that handling text is what the whole discussion is ultimately about.
But why do you need to separate text encoding from encoding in general?
Names are important. I think this discussion would make more progress if Dean's "string" were given another name, like "bytes," and his "view" were named "string" or perhaps "string_view."
You have a sequence of bytes (in a string).
That would be a sequence of bytes in a bytes<char>, say.
You want to interpret that sequence of bytes in a given encoding (with a view).
You'd interpret the bytes<char> with a string<utf_8>, say.
Again, where you see a string primarily as a class for handling raw data that can be interpreted in hundreds of different ways, I see string primarily as a class for encoding human-readable text.
So what's the difference between a string for encoding human readable text and a string that handles raw data?
There is no difference, but because "string" connotes "human readable text" to most, using a different name for the raw storage class will dissociate that connotation from the raw storage. HTH, _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

On Thu, Jan 27, 2011 at 10:05 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
But why do you need to separate text encoding from encoding in general?
Names are important. I think this discussion would make more progress if Dean's "string" were given another name, like "bytes," and his "view" were named "string" or perhaps "string_view."
Okay, I think I have to divorce the thought of "what a string should be" from the name "string". I like `bytes` but unfortunately it somehow implicitly conveys mutability -- because computers keep reading and writing bytes, after all. Maybe a name that denotes immutability would be good. So for lack of a more creative name, I'll call it `istring`, which conveys immutability and string semantics.
You have a sequence of bytes (in a string).
That would be a sequence of bytes in a bytes<char>, say.
Or, an istring. ;)
You want to interpret that sequence of bytes in a given encoding (with a view).
You'd interpret the bytes<char> with a string<utf_8>, say.
Well I'd still rather call it a view. :D
Again, where you see a string primarily as a class for handling raw data that can be interpreted in hundreds of different ways, I see string primarily as a class for encoding human-readable text.
So what's the difference between a string for encoding human readable text and a string that handles raw data?
There is no difference, but because "string" connotes "human readable text" to most, using a different name for the raw storage class will dissociate that connotation from the raw storage.
Okay, fair enough. :) So now I guess it's just going to be a matter of finding a suitable name for that string of bytes data structure. -- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 10:05 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Names are important. I think this discussion would make more progress if Dean's "string" were given another name, like "bytes," and his "view" were named "string" or perhaps "string_view."
Okay, I think I have to divorce the thought of "what a string should be" from the name "string". I like `bytes` but unfortunately it somehow implicitly conveys mutability -- because computers keep reading and writing bytes after all. Maybe a name that denotes immutability would be good.
"bytes" conveys mutability no more nor less than does "string."
So for lack of a more creative name, I'll call it `istring` which conveys immutability and string semantics.
That's short, but not descriptive. The "i" prefix is more suggestive of "interface" than "immutable" to me. Why not just go whole hog and call it "immutable_string" as Artyom suggested? _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

On Thu, Jan 27, 2011 at 10:55 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 10:05 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Names are important. I think this discussion would make more progress if Dean's "string" were given another name, like "bytes," and his "view" were named "string" or perhaps "string_view."
Okay, I think I have to divorce the thought of "what a string should be" from the name "string". I like `bytes` but unfortunately it somehow implicitly conveys mutability -- because computers keep reading and writing bytes after all. Maybe a name that denotes immutability would be good.
"bytes" conveys mutability no more nor less than does "string."
Right. And it's 5 letters too. :D
So for lack of a more creative name, I'll call it `istring` which conveys immutability and string semantics.
That's short, but not descriptive. The "i" prefix is more suggestive of "interface" than "immutable" to me. Why not just go whole hog and call it "immutable_string" as Artyom suggested?
The only objection really is that it's too long. :D Less characters is better. /me gets a thesaurus and looks up string :D -- Dean Michael Berris about.me/deanberris

On Jan 27, 2011, at 10:04 AM, Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 10:55 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
[snip]
That's short, but not descriptive. The "i" prefix is more suggestive of "interface" than "immutable" to me. Why not just go whole hog and call it "immutable_string" as Artyom suggested?
The only objection really is that it's too long. :D Less characters is better.
/me gets a thesaurus and looks up string :D
Ok, but why this focus on immutability? Is that not a quite orthogonal concern to the encoding problems discussed here (as well...)? I would prefer this discussion to be about the encoding aspect(s) rather than immutability, unless the latter somehow intrinsically enables much better handling (preferably at the interface level) of various encodings, and I seriously doubt that. So, if we keep this discussion to that of a mutable sequence of characters, according to some encoding(s), I would be less grumpy. /David

On Thu, 27 Jan 2011 13:04:45 -0500 David Bergman <David.Bergman@bergmangupta.com> wrote:
On Jan 27, 2011, at 10:04 AM, Dean Michael Berris wrote:
That's short, but not descriptive. The "i" prefix is more suggestive of "interface" than "immutable" to me. Why not just go whole hog and call it "immutable_string" as Artyom suggested?
The only objection really is that it's too long. :D Less characters is better.
/me gets a thesaurus and looks up string :D
Ok, but why this focus on immutability? Is that not a quite orthogonal concern to the encoding problems discussed here (as well...)?
I would prefer to have this discussion be about the encoding aspect(s) rather than immutability [...]
That's my baby, which I'm still working on. Dean is working on the immutable string idea (which he's made a persuasive case for IMO, but which is really unrelated). As has been noted, the two are getting confused because they both came out of Artyom's original UTF-8 proposal; I've changed the subject line on this one to (hopefully) split the two discussions up. -- Chad Nelson Oak Circle Software, Inc. * * *

On Jan 27, 2011, at 1:19 PM, Chad Nelson wrote:
On Thu, 27 Jan 2011 13:04:45 -0500 David Bergman <David.Bergman@bergmangupta.com> wrote:
On Jan 27, 2011, at 10:04 AM, Dean Michael Berris wrote:
That's short, but not descriptive. The "i" prefix is more suggestive of "interface" than "immutable" to me. Why not just go whole hog and call it "immutable_string" as Artyom suggested?
The only objection really is that it's too long. :D Less characters is better.
/me gets a thesaurus and looks up string :D
Ok, but why this focus on immutability? Is that not a quite orthogonal concern to the encoding problems discussed here (as well...)?
I would prefer to have this discussion be about the encoding aspect(s) rather than immutability [...]
That's my baby, which I'm still working on. Dean is working on the immutable string idea (which he's made a persuasive case for IMO, but which is really unrelated). As has been noted, the two are getting confused because they both came out of Artyom's original UTF-8 proposal; I've changed the subject line on this one to (hopefully) split the two discussions up.
Thanks! BUT, the thing is that The Other Discussion is still mentioning encoding notions and even the letters "UTF", and it should. Otherwise, that thread should be called "boost::immutable_vector" or "boost::cs_immutable_string" ;-) /David

David Bergman wrote:
BUT, the thing is that The Other Discussion is still mentioning encoding notions and even the letters "UTF", and it should. Otherwise, that thread should be called "boost::immutable_vector" or "boost::cs_immutable_string" ;-)
Dean was proposing a CS string-like type (which he called "string") as the underlying, immutable storage type to be used in various ways to assemble pieces to be transmitted, encoded, decoded, etc. His "view" type is akin to what you wish to discuss and its viability does depend upon the immutability of the underlying storage. The discussions are much more about mutability versus immutability than anything else once you get past the names and Dean's separation of storage and view types. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

On Fri, Jan 28, 2011 at 2:04 AM, David Bergman <David.Bergman@bergmangupta.com> wrote:
On Jan 27, 2011, at 10:04 AM, Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 10:55 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
[snip]
That's short, but not descriptive. The "i" prefix is more suggestive of "interface" than "immutable" to me. Why not just go whole hog and call it "immutable_string" as Artyom suggested?
The only objection really is that it's too long. :D Less characters is better.
/me gets a thesaurus and looks up string :D
Ok, but why this focus on immutability? Is that not a quite orthogonal concern to the encoding problems discussed here (as well...)?
Two reasons for the focus on immutability. First, it deals with the underlying storage. This has to be "fool-proof" to avoid the problems of a mutable data structure. Once you're certain that the string will not change at any point after it is constructed, you can throw a lot of potential optimizations at the algorithms, and lazy transformations can make certain operations a lot more efficient. Second, encoding is largely a matter of interpretation rather than of actual transformation. What I mean by this is that an encoding is supposed to be a logical transformation rather than an actual physical transformation of data (although it is almost always manifested as such) -- and it doesn't have to be an immediately applied algorithm either. Without the immutability guarantee from the underlying data type, you can't make "clever" re-arrangements in the algorithm implementation that assume immutability -- with that guarantee, cached data never needs to be invalidated since the data won't ever change, copying data becomes largely unnecessary, and things like that.
I would prefer to have this discussion be about the encoding aspect(s) rather than immutability, unless the latter somehow intrinsically enable a much more improved handling (and preferably at the interface level) of various encoding, and I seriously doubt that.
Sure, and there are already algorithms that implement encodings and that deal with ranges. They've always been there. What's being talked about here is whether a string should have the encoding as an intrinsic property -- and I maintain that the answer to that question is no (at least from my POV).
So, if we keep this discussion at that of a mutable sequence of characters, according to some encoding(s), I would be less grumpy.
So what's wrong with using ICU and the work that others have done with encoding already? Am I the only person seeing the problem with strings being mutable? (I honestly really want to know). -- Dean Michael Berris about.me/deanberris

On 28.01.2011 09:16, Dean Michael Berris wrote:
Am I the only person seeing the problem with strings being mutable? (I honestly really want to know).
I really don't like the mutability of std::string. But I don't have a use for an immutable sequence of bytes. I just want text.
Sebastian

On Fri, Jan 28, 2011 at 6:19 PM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
On 28.01.2011 09:16, Dean Michael Berris wrote:
Am I the only person seeing the problem with strings being mutable? (I honestly really want to know).
I really don't like the mutability of std::string. But I don't have a use for an immutable sequence of bytes. I just want text.
So what if your "text" just dealt with the immutable strings underneath? Would you still object to that? -- Dean Michael Berris about.me/deanberris

On 28.01.2011 12:12, Dean Michael Berris wrote:
So what if your "text" just dealt with the immutable strings underneath? Would you still object to that?
No, if you have a good immutable byte sequence class, using it to implement text is probably a good choice. Not that I care how text is implemented, as long as it's convenient and doesn't sacrifice an undue amount of performance. Sebastian

Dean Michael Berris wrote:
Am I the only person seeing the problem with strings being mutable? (I honestly really want to know).
No, but I hadn't thought about it before this discussion. Your descriptions of the lazy transformations and views were perfect to show the value of your idea. Perhaps you should present this at BoostCon (if there's still time to submit a proposal)! _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

On Fri, Jan 28, 2011 at 9:56 PM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
Am I the only person seeing the problem with strings being mutable? (I honestly really want to know).
No, but I hadn't thought about it before this discussion. Your descriptions of the lazy transformations and views were perfect to show the value of your idea.
Perhaps you should present this as BoostCon (if there's still time to submit a proposal)!
I think the keynote speaker -- Mr. Boehm -- would do perfectly to talk about how they implemented rope. :) Besides, the cost of going to BoostCon from where I am is a bit prohibitive to say the least. Maybe if things change in between now and a couple weeks from now I might be able to make that trip. Thanks for the suggestion. :D -- Dean Michael Berris about.me/deanberris

From: Patrick Horgan
Dean Michael Berris wrote:
So for lack of a more creative name, I'll call it `istring` which conveys immutability and string semantics.
Apple might come after you. How about calling it `immutable`? That's not a keyword.
Patrick
Apple does not have trademark on "^i\w+" :-) Apple's trademarks: http://www.apple.com/legal/trademark/appletmlist.html So I can safely register a trademark on iFool and of course on istring. (I'm not a lawyer and don't see this as legal advice, etc.) Artyom

On Fri, Jan 28, 2011 at 12:30 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Patrick Horgan
Dean Michael Berris wrote:
So for lack of a more creative name, I'll call it `istring` which conveys immutability and string semantics.
Apple might come after you. How about calling it immutable. That's not a keyword.
Patrick
Apple does not have trademark on "^i\w+" :-) Apple's trademarks: http://www.apple.com/legal/trademark/appletmlist.html So I can safely register a trademark on iFool and of course on istring. (I'm not a lawyer and don't see this as legal advice, etc.)
I'm not saying that this (istring) is the same case, but see for example: http://apple.slashdot.org/story/10/07/22/134238/Apple-Doesnt-Appreciate-Toil... Apple is very sensitive about things called iSomething, regardless of whether it owns them as trademarks or not. Matus

On 01/28/2011 03:30 AM, Artyom wrote:
... elision by patrick ... Apple does not have trademark on "^i\w+" :-) Apple's trademarks: http://www.apple.com/legal/trademark/appletmlist.html So I can safely register a trademark on iFool and of course on istring. (I'm not a lawyer and don't see this as legal advice, etc.)
I completely agree with you. I've seen though in the trades, that it hasn't stopped Apple from going after people with silly legal actions. Generally if people have the money to fight, they win, but it's a nuisance. Patrick

Dean Michael Berris wrote: ... elision by patrick
So what's the difference between a string for encoding human readable text and a string that handles raw data? There is no difference, but because "string" connotes "human readable text" to most, using a different name for the raw storage class will dissociate that connotation from the raw storage. Thank you. I've long thought that about std::string, people use it all
On 01/27/2011 06:05 AM, Stewart, Robert wrote: the time as a sequence of bytes. Patrick

On 26.01.2011, at 17:06, Dean Michael Berris wrote:
In one of the previous messages I laid out an algorithm template like so:
template <class String>
void foo(String s) {
    view<encoding> encoded(s);
    // deal with encoded from here on out
}
I don't see how such an algorithm implementation is technically feasible. If the type substituted for String doesn't intrinsically know what its encoding is, how would view<encoding> know how to present the data in the requested encoding? How would it know how to transcode? For that matter, why would foo's implementer care at all about the encoding? I cannot really think of any algorithms (save transcoding algorithms themselves) that would care about the actual encoding. What they typically want is the sequence of code points, or more likely characters, that the string represents. But if the string doesn't know what encoding its internal data is in, the algorithm cannot get the code points without someone telling it what the encoding is. By making the string oblivious of its data's actual encoding, you put that burden on the user of the string class, who now has to supply the actual encoding to every single algorithm that wants to do anything with the string beyond looking at its raw data. Unless I completely misunderstand what you want, of course. Sebastian

On Thu, Jan 27, 2011 at 1:05 AM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
On 26.01.2011, at 17:06, Dean Michael Berris wrote:
In one of the previous messages I laid out an algorithm template like so:
template <class String>
void foo(String s) {
    view<encoding> encoded(s);
    // deal with encoded from here on out
}
I don't see how such an algorithm implementation is technically feasible. If the type substituted for String doesn't intrinsically know what its encoding is, how would view<encoding> know how to present the data in the requested encoding? How would it know how to transcode?
Here's where it gets a little tricky. If what you substitute for String is the hypothetical `boost::string` then what happens is the view will interpret it as raw data underlying the view. If you're substituting a view<some_encoding> for String then what happens is the internal view<encoding> construction will hold a copy of the (immutable) view<some_encoding>, and upon access to the iterators would do the transcoding on the fly. Note that validation could be implemented as an algorithm external (and unique) to the encoding being presented.
For that matter, why would foo's implementer care at all about the encoding? I cannot really think of any algorithms (save transcoding algorithms themselves) that would care about the actual encoding. What they typically want is the sequence of code points, or more likely characters, that the string represents. But if the string doesn't know what encoding its internal data is in, the algorithm cannot get the code points without someone telling it what the encoding is. By making the string oblivious of its data's actual encoding, you put that burden on the user of the string class, who now has to supply the actual encoding to every single algorithm that wants to do anything with the string beyond looking at its raw data.
Right. In the design I have in my head, it's really split in two: the underlying immutable string type, and the view that wraps these immutable strings and applies the "encoding" appropriately as part of the view's implementation. So if a user wanted to specify that a given chunk of data in memory is supposed to be viewed as UTF-8, he would do something like this:

boost::string s = "The quick brown fox with unicode characters";
boost::strings::view<boost::strings::utf8_encoding> encoded(s);

So the interface of boost::string and of the view<...> will be the same -- expose iterators, mostly -- and you'd pretty much be able to deal with either one as if they were practically the same thing. Except of course the views largely expose a different type for the dereferenced iterator, based on the specific encoding in which you want to view the data. If you wanted raw access to the bytes, you would deal with the iterator from the "raw string" directly.
Unless I completely misunderstand what you want, of course.
I can't say for sure, but I think you missed the part where the view offered an encoded view while the string just basically is an immutable collection of bytes. :) HTH -- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote:
If what you substitute for String is the hypothetical `boost::string` then what happens is the view will interpret it as raw data underlying the view.
If you're substituting a view<some_encoding> for String then what happens is the internal view<encoding> construction will hold a copy of the (immutable) view<some_encoding>, and upon access to the iterators would do the transcoding on the fly.
Doesn't your approach mean that the work done to produce the code points or characters (or transcoding) is not cached? Each iteration of the view does the work again. That can be appropriate in many cases, but there are also plenty of cases in which one wants to do the decoding/transcoding and save the result for repeated use. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

On Thu, Jan 27, 2011 at 2:02 AM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
If what you substitute for String is the hypothetical `boost::string` then what happens is the view will interpret it as raw data underlying the view.
If you're substituting a view<some_encoding> for String then what happens is the internal view<encoding> construction will hold a copy of the (immutable) view<some_encoding>, and upon access to the iterators would do the transcoding on the fly.
Doesn't your approach mean that the work done to produce the code points or characters (or transcoding) is not cached? Each iteration of the view does the work again. That can be appropriate in many cases, but there are also plenty of cases in which one wants to do the decoding/transcoding and save the result for repeated use.
I think I remember saying something about "smarter" iterators at some point earlier. ;) Also notice that an instance of the view<...> can have additional information that's unique to the encoding. It's entirely possible to implement the caching at the view layer (and even at the iterators) that can even be shared across instances of the view much like how an immutable string would do it. Immutability brings a lot of good things to the table that would otherwise not be "safe" to do with a mutable string data structure. :) -- Dean Michael Berris about.me/deanberris

Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 2:02 AM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Doesn't your approach mean that the work done to produce the code points or characters (or transcoding) is not cached? Each iteration of the view does the work again. That can be appropriate in many cases, but there are also plenty of cases in which one wants to do the decoding/transcoding and save the result for repeated use.
I think I remember saying something about "smarter" iterators at some point earlier. ;) Also notice that an instance of the view<...> can have additional information that's unique to the encoding.
Isn't that a little like the cartoon of the complicated math covering a chalkboard with "and then a miracle happens" in the corner? ;-)
It's entirely possible to implement the caching at the view layer (and even at the iterators) that can even be shared across instances of the view much like how an immutable string would do it. Immutability brings a lot of good things to the table that would otherwise not be "safe" to do with a mutable string data structure. :)
Perhaps it's also reasonable to think of types that hold characters or code points as being constructed from an appropriate view of the immutable string. Those types can then cache the decoded/transcoded result of iterating a view. Those might be the higher level, encoding-aware string types that others are looking for. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

On Thu, Jan 27, 2011 at 2:27 AM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 2:02 AM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Doesn't your approach mean that the work done to produce the code points or characters (or transcoding) is not cached? Each iteration of the view does the work again. That can be appropriate in many cases, but there are also plenty of cases in which one wants to do the decoding/transcoding and save the result for repeated use.
I think I remember saying something about "smarter" iterators at some point earlier. ;) Also notice that an instance of the view<...> can have additional information that's unique to the encoding.
Isn't that a little like the cartoon of the complicated math covering a chalkboard with "and then a miracle happens" in the corner? ;-)
Hah! LOL :D Yeah I realize that. However, short of showing you the code of how it would be done, I'd prefer the "and then a miracle happens" preview at this stage. :)
It's entirely possible to implement the caching at the view layer (and even at the iterators) that can even be shared across instances of the view much like how an immutable string would do it. Immutability brings a lot of good things to the table that would otherwise not be "safe" to do with a mutable string data structure. :)
Perhaps it's also reasonable to think of types that hold characters or code points as being constructed from an appropriate view of the immutable string. Those types can then cache the decoded/transcoded result of iterating a view. Those might be the higher level, encoding-aware string types that others are looking for.
Yes, but really I think the view<encoding> is the encoding-aware string type mostly because if you convert it to an std::string for example or into a buffer and look at it like a `char const *` or even `wchar_t const *` then you basically get what you'd need for the C or OS APIs. I just prefer calling a spade a spade and not say `string` when I really mean a `view<encoding>` -- because largely I think everyone would agree that the string data structure really doesn't have an intrinsic property that relates to an 'encoding'. HTH -- Dean Michael Berris about.me/deanberris

On 01/26/2011 07:54 PM, Dean Michael Berris wrote:
... elision by patrick ...
Yes, but really I think the view<encoding> is the encoding-aware string type mostly because if you convert it to an std::string for example or into a buffer and look at it like a `char const *` or even `wchar_t const *` then you basically get what you'd need for the C or OS APIs.
I just prefer calling a spade a spade and not say `string` when I really mean a `view<encoding>` -- because largely I think everyone would agree that the string data structure really doesn't have an intrinsic property that relates to an 'encoding'.
But what some are talking about is a utf-8_string. I know it's not what you're talking about, but saying that everyone would agree would be a bit disingenuous and discount much of the preceding discussion.
I really wish this discussion would split into two, because the discussion about the benefits of an immutable string, and the discussions of an utf encoded string are two completely different discussions and you keep butting heads each saying, no, but that's not what I'm talking about. That's right. There were several threads, but everyone's jumped onto this one which I believe was started by Mr. Berris to talk about the benefits of an immutable string. Please, please, separate these threads again. Patrick

On Thu, Jan 27, 2011 at 3:19 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/26/2011 07:54 PM, Dean Michael Berris wrote:
... elision by patrick ...
Yes, but really I think the view<encoding> is the encoding-aware string type mostly because if you convert it to an std::string for example or into a buffer and look at it like a `char const *` or even `wchar_t const *` then you basically get what you'd need for the C or OS APIs.
I just prefer calling a spade a spade and not say `string` when I really mean a `view<encoding>` -- because largely I think everyone would agree that the string data structure really doesn't have an intrinsic property that relates to an 'encoding'.
But what some are talking about is a utf-8_string. I know it's not what you're talking about, but saying that everyone would agree would be a bit disingenuous and discount much of the preceding discussion.
So you're saying, utf8_string is not view<utf8_encoding> as far as I've already described it?
I really wish this discussion would split into two, because the discussion about the benefits of an immutable string, and the discussions of an utf encoded string are two completely different discussions and you keep butting heads each saying, no, but that's not what I'm talking about.
Really, if you read the recent discussions, you will see that we're really talking about the same thing: a data structure that knew the encoding somehow. That somehow has already been determined (and agreed upon) to be suitably modeled by a view<...> that takes a string, for a suitable definition of string. Note that the string *has no encoding that is intrinsic to it*.
That's right. There were several threads, but everyone's jumped onto this one which I believe was started by Mr. Berris to talk about the benefits of an immutable string. Please, please, separate these threads again.
So Mr. Berris is saying right now, if you didn't see the point: your "utf8_string" is really just a typedef to view<utf8_encoding>. The only *reasonably efficient* way of achieving this view design is if you had immutable strings. The thread has already hashed out *why* mutable strings are a bad thing (performance- and design-wise) for encoding-aware algorithms. I don't see why we need to go back to that *again*. At any rate feel free to convince me otherwise that immutable strings wouldn't be a good thing for encoding/transcoding/string-or-text-centric algorithms. ;) -- Dean Michael Berris about.me/deanberris
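A minimal sketch of the "utf8_string is really just a typedef to view<utf8_encoding>" idea. The names `view` and `utf8_encoding` follow the thread's usage; the decoder is illustrative only, with no validation or error handling, which a real implementation would of course need:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// An encoding is a policy that knows how to decode code points from bytes.
struct utf8_encoding {
    // Decodes one code point starting at byte index i, advancing i.
    // Assumes well-formed UTF-8 (no validation -- sketch only).
    static char32_t decode(const std::string& bytes, std::size_t& i) {
        unsigned char b = bytes[i++];
        if (b < 0x80) return b;                        // ASCII fast path
        int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
        char32_t cp = b & (0x3F >> extra);             // lead-byte payload
        while (extra--) cp = (cp << 6) | (bytes[i++] & 0x3F);
        return cp;
    }
};

// The view interprets an encoding-agnostic byte string through a lens.
template <class Encoding>
class view {
public:
    explicit view(std::string bytes) : bytes_(std::move(bytes)) {}

    std::vector<char32_t> code_points() const {
        std::vector<char32_t> out;
        for (std::size_t i = 0; i < bytes_.size();)
            out.push_back(Encoding::decode(bytes_, i));
        return out;
    }

    // "Convert it ... into a buffer" for the C or OS APIs.
    const std::string& raw() const { return bytes_; }

private:
    std::string bytes_;  // the string itself has no intrinsic encoding
};

using utf8_string = view<utf8_encoding>;  // the typedef being discussed
```

Note that nothing about `bytes_` knows it is UTF-8; only the view's type carries that information, which is exactly the separation being argued for.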

On 01/26/2011 11:34 PM, Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 3:19 PM, Patrick Horgan<phorgan1@gmail.com> wrote:
On 01/26/2011 07:54 PM, Dean Michael Berris wrote:
... elision by patrick ...
Yes, but really I think the view<encoding> is the encoding-aware string type mostly because if you convert it to an std::string for example or into a buffer and look at it like a `char const *` or even `wchar_t const *` then you basically get what you'd need for the C or OS APIs.
I just prefer calling a spade a spade and not say `string` when I really mean a `view<encoding>` -- because largely I think everyone would agree that the string data structure really doesn't have an intrinsic property that relates to an 'encoding'.
But what some are talking about is a utf-8_string. I know it's not what you're talking about, but saying that everyone would agree would be a bit disingenuous and discount much of the preceding discussion.
So you're saying, utf8_string is not view<utf8_encoding> as far as I've already described it?
Exactly. Others have expressed repeatedly that they want a string with intrinsic encoding.
I really wish this discussion would split into two, because the discussion about the benefits of an immutable string, and the discussions of an utf encoded string are two completely different discussions and you keep butting heads each saying, no, but that's not what I'm talking about.
Really, if you read the recent discussions, you will see that we're really talking about the same thing: a data structure that knew the encoding somehow. That somehow has already been determined (and agreed upon) to be suitably modeled by a view<...> that takes a string, for a suitable definition of string. Note that the string *has no encoding that is intrinsic to it*.
Yes. I understand clearly that you have been talking about that. Others talked about a string with intrinsic encoding.
That's right. There were several threads, but everyone's jumped onto this one which I believe was started by Mr. Berris to talk about the benefits of an immutable string. Please, please, separate these threads again.
So Mr. Berris is saying right now, if you didn't see the point: your "utf8_string" is really just a typedef to view<utf8_encoding>. The only *reasonably efficient* way of achieving this view design is if you had immutable strings. The thread has already hashed out *why* mutable strings are a bad thing (performance- and design-wise) for encoding-aware algorithms. I don't see why we need to go back to that *again*.
Me either. Of course the other discussion about strings with intrinsic encoding should be in another thread.
At any rate feel free to convince me otherwise that immutable strings wouldn't be a good thing for encoding/transcoding/string-or-text-centric algorithms. ;)
Why on earth would I do that? They would be wonderful for many applications and, as you said and I agreed to days ago, why would you want to pay an extra price for a mutable string when you didn't need one? Of course when you did need one you'd just use it.
I'd just like to see your thread as you began it, a discussion about the benefits of an immutable string. I particularly didn't like that it got hijacked to focus on how appropriate it would be for a string that represented a particular encoding. Of course that's something to think about but there's a lot more to the benefits of an immutable string than that, and you started off doing a good job of discussing it before you got distracted. I just want to see the discussions split again so in this thread discussions of all aspects of immutability vs mutability could be discussed. It seems now that you are only interested in discussing encodings and views. I wanted the discussion of immutability. Patrick

On Thu, Jan 27, 2011 at 4:49 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/26/2011 11:34 PM, Dean Michael Berris wrote:
So you're saying, utf8_string is not view<utf8_encoding> as far as I've already described it?
Exactly. Others have expressed repeatedly that they want a string with intrinsic encoding.
So isn't the encoding intrinsic in the view here? I don't get the difference. If you can use a view<...> in place of a string, what is the difference?
Really, if you read the recent discussions, you will see that we're really talking about the same thing: a data structure that knew the encoding somehow. That somehow has already been determined (and agreed upon) to be suitably modeled by a view<...> that takes a string, for a suitable definition of string. Note that the string *has no encoding that is intrinsic to it*.
Yes. I understand clearly that you have been talking about that. Others talked about a string with intrinsic encoding.
Which I've already addressed with a view<...> template. What else do others want?
So Mr. Berris is saying right now, if you didn't see the point: your "utf8_string" is really just a typedef to view<utf8_encoding>. The only *reasonably efficient* way of achieving this view design is if you had immutable strings. The thread has already hashed out *why* mutable strings are a bad thing (performance- and design-wise) for encoding-aware algorithms. I don't see why we need to go back to that *again*.
Me either. Of course the other discussion about strings with intrinsic encoding should be in another thread.
The title of the thread is [boost][string] proposal -- I don't see why it should be in another thread. Am I missing something?
At any rate feel free to convince me otherwise that immutable strings wouldn't be a good thing for encoding/transcoding/string-or-text-centric algorithms. ;)
Why on earth would I do that? They would be wonderful for many applications and, as you said and I agreed to days ago, why would you want to pay an extra price for a mutable string when you didn't need one? Of course when you did need one you'd just use it.
I'd just like to see your thread as you began it, a discussion about the benefits of an immutable string. I particularly didn't like that it got hijacked to focus on how appropriate it would be for a string that represented a particular encoding. Of course that's something to think about but there's a lot more to the benefits of an immutable string than that, and you started off doing a good job of discussing it before you got distracted. I just want to see the discussions split again so in this thread discussions of all aspects of immutability vs mutability could be discussed. It seems now that you are only interested in discussing encodings and views. I wanted the discussion of immutability.
Right. So more to the point, the real thing I want to focus on is the immutable string. :) Although with the question about encoding, the answer is the view. :D HTH -- Dean Michael Berris about.me/deanberris

On Thu, Jan 27, 2011 at 8:09 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Thu, Jan 27, 2011 at 12:57, Dean Michael Berris <mikhailberis@gmail.com>wrote:
So more to the point, the real thing I want to focus on is the immutable string. :)
So isn't SGI's rope (without all the mutable interface) a good immutable string?
I think so... but there are some things that I think would be good to have in a string implementation that is by design immutable. There are certain things like:
* Interning -- similar to what the flyweight implementation does but centers on strings and substrings
* Reference counting -- as already pointed out by others earlier on
* Lazy transformations -- which I'm not sure would fit entirely in the way SGI's rope is described
* Simple DSEL for concatenation and layering transformations -- this would be good for efficiency reasons and would generally influence the internals of the implementation
That said, the memory management and data structure of the rope internals would be a good model to follow IMO. -- Dean Michael Berris about.me/deanberris

On Thu, Jan 27, 2011 at 14:41, Dean Michael Berris <mikhailberis@gmail.com>wrote:
On Thu, Jan 27, 2011 at 8:09 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Thu, Jan 27, 2011 at 12:57, Dean Michael Berris <mikhailberis@gmail.com>wrote:
So more to the point, the real thing I want to focus on is the immutable string. :)
So isn't SGI's rope (without all the mutable interface) a good immutable string?
I think so... but there are some things that I think would be good to have in a string implementation that is by design immutable. There are certain things like:
* Interning -- similar to what the flyweight implementation does but centers on strings and substrings
Not sure what you mean here...
* Reference counting -- as already pointed out by others earlier on
Ropes are implemented through reference counting.
* Lazy transformations -- which I'm not sure would fit entirely in the way SGI's rope is described
* Simple DSEL for concatenation and layering transformations -- this would be good for efficiency reasons and would generally influence the internals of the implementation
I don't understand why people introduced the ^ syntax here. Plain old operator + and += look fine to me. Probably transformations should not be part of the string interface. I don't know what transformation you are talking about, but consider transcoding or Unicode case conversion. Neither can (in general) be done by starting at an arbitrary place in the string, so you arrive at a design in which transformations are forward iterator ranges. Btw, can someone answer this question ( http://stackoverflow.com/questions/3894358/what-is-the-concatenation-complex...) ? -- Yakov
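Yakov's point that transformations want to be forward iterator ranges can be sketched with a lazy, forward-only transformation. ASCII upper-casing stands in here for a real transcoding or Unicode case conversion (which, as he notes, cannot start at an arbitrary place in the string); the names are hypothetical:

```cpp
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// A forward iterator that applies a transformation element by element,
// only as it is traversed. No intermediate buffer is materialized.
class upper_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = char;

    explicit upper_iterator(std::string::const_iterator it) : it_(it) {}
    char operator*() const { return std::toupper((unsigned char)*it_); }
    upper_iterator& operator++() { ++it_; return *this; }
    upper_iterator operator++(int) { upper_iterator t = *this; ++it_; return t; }
    bool operator==(const upper_iterator& o) const { return it_ == o.it_; }
    bool operator!=(const upper_iterator& o) const { return it_ != o.it_; }

private:
    std::string::const_iterator it_;
};

// The transformation is only "run" when someone consumes the range.
std::string to_upper_lazy(const std::string& s) {
    return std::string(upper_iterator(s.begin()), upper_iterator(s.end()));
}
```

A transcoding iterator would have the same shape, except that one step might consume several bytes and produce several, which is precisely why the range is forward-only rather than random access.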

On 27.01.2011, at 04:54, Dean Michael Berris wrote:
I just prefer calling a spade a spade and not say `string` when I really mean a `view<encoding>` -- because largely I think everyone would agree that the string data structure really doesn't have an intrinsic property that relates to an 'encoding'.
The string data structure that I learned about in some obscure algorithm lectures in university and the thing that's represented by the type 'string' or some spelling variation of it in most programming languages are really not the same thing. In particular, the latter is a type meant to store text, whereas the former is a sequence of something. That makes for a huge difference in my mind. Sebastian

On Thu, Jan 27, 2011 at 11:13 PM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
On 27.01.2011, at 04:54, Dean Michael Berris wrote:
I just prefer calling a spade a spade and not say `string` when I really mean a `view<encoding>` -- because largely I think everyone would agree that the string data structure really doesn't have an intrinsic property that relates to an 'encoding'.
The string data structure that I learned about in some obscure algorithm lectures in university and the thing that's represented by the type 'string' or some spelling variation of it in most programming languages are really not the same thing. In particular, the latter is a type meant to store text, whereas the former is a sequence of something.
Exactly.
That makes for a huge difference in my mind.
Definitely. Which is why I still think about a string and think "oh, sequence of something" rather than see string and think "oh, text!". -- Dean Michael Berris about.me/deanberris

On Jan 27, 2011, at 10:22 AM, Dean Michael Berris wrote:
On Thu, Jan 27, 2011 at 11:13 PM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
On 27.01.2011, at 04:54, Dean Michael Berris wrote:
I just prefer calling a spade a spade and not say `string` when I really mean a `view<encoding>` -- because largely I think everyone would agree that the string data structure really doesn't have an intrinsic property that relates to an 'encoding'.
The string data structure that I learned about in some obscure algorithm lectures in university and the thing that's represented by the type 'string' or some spelling variation of it in most programming languages are really not the same thing. In particular, the latter is a type meant to store text, whereas the former is a sequence of something.
Exactly.
That makes for a huge difference in my mind.
Definitely. Which is why I still think about a string and think "oh, sequence of something" rather than see string and think "oh, text!".
Ok, so we are back to std::vector again... Is that the "boost::string" class you are proposing? Although immutable... So, we should rid ourselves of the notion of silly text-handling and focus on the sequentiality of a (CS...) string? /David

On Fri, Jan 28, 2011 at 2:16 AM, David Bergman <David.Bergman@bergmangupta.com> wrote:
On Jan 27, 2011, at 10:22 AM, Dean Michael Berris wrote:
Definitely. Which is why I still think about a string and think "oh, sequence of something" rather than see string and think "oh, text!".
Ok, so we are back to std::vector again...
Well, std::vector implies that it's contiguous. That's not what I'm thinking about.
Is that the "boost::string" class you are proposing? Although immutable...
Nope.
So, we should rid ourselves of the notion of silly text-handling and focus on the sequentiality of a (CS...) string?
NO. The point is that there are two levels being discussed and that I have an idea on how to fix both by solving one level one way. The string, an underlying data structure that has its own semantics and algorithms that apply to it (substring, concatenation, etc.), would be a foundation for a view (encoded, compressed, etc.) which has its own semantics and an interface similar to a string -- and would offer more than what a bare string would offer. In my mind, interpreting data in a string through some lens is actually a different concern from how the string actually behaves (and what a string is).
So what I'm really saying is -- and has been from the beginning -- let's fix the borkedness of std::string by coming up with a string that is immutable, crazy efficient to deal with, has real value semantics, and a whole family of algorithms that deal with this newfangled string. Then on top of that let's implement views of these strings in the encoding we would like, and implement algorithms that deal with coercing a given string to be viewed in a given encoding.
The point is I'm talking in concepts and interfaces of two different things. Now if "a string that has its encoding intrinsic to its type" is really a 'view<encoding>' in my parlance then that's what I mean. Does that make sense? -- Dean Michael Berris about.me/deanberris
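The payoff of the immutable foundation for concatenation can be sketched as a two-node rope. The names are hypothetical and a real implementation (SGI's rope, for instance) would balance a tree of such nodes, but the core trick only works because neither operand can ever change:

```cpp
#include <memory>
#include <string>

struct node {
    virtual ~node() = default;
    virtual void append_to(std::string& out) const = 0;
};

struct leaf : node {
    std::string text;
    explicit leaf(std::string t) : text(std::move(t)) {}
    void append_to(std::string& out) const override { out += text; }
};

struct concat : node {
    std::shared_ptr<const node> left, right;
    concat(std::shared_ptr<const node> l, std::shared_ptr<const node> r)
        : left(std::move(l)), right(std::move(r)) {}
    void append_to(std::string& out) const override {
        left->append_to(out);
        right->append_to(out);
    }
};

class rope_string {
public:
    explicit rope_string(std::string s)
        : root_(std::make_shared<leaf>(std::move(s))) {}

    // O(1): safe to point at both halves because they are immutable.
    friend rope_string operator+(const rope_string& a, const rope_string& b) {
        return rope_string(std::make_shared<concat>(a.root_, b.root_));
    }

    // Materialize into a contiguous buffer only when actually needed
    // (e.g. handing the bytes to a C or OS API).
    std::string flatten() const {
        std::string out;
        root_->append_to(out);
        return out;
    }

private:
    explicit rope_string(std::shared_ptr<const node> r) : root_(std::move(r)) {}
    std::shared_ptr<const node> root_;
};
```

Views in a given encoding would then sit on top of this structure, never touching its internals.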

On Jan 27, 2011, at 10:13 AM, Sebastian Redl wrote:
On 27.01.2011, at 04:54, Dean Michael Berris wrote:
I just prefer calling a spade a spade and not say `string` when I really mean a `view<encoding>` -- because largely I think everyone would agree that the string data structure really doesn't have an intrinsic property that relates to an 'encoding'.
The string data structure that I learned about in some obscure algorithm lectures in university and the thing that's represented by the type 'string' or some spelling variation of it in most programming language are really not the same thing. In particular, the latter is a type meant to store text, whereas the former is a sequence of something.
That makes for a huge difference in my mind.
Please don't go there ;-) Yes, we all (?) know what "string" means computer-scientifically: a finite sequence of symbols, where the symbols are chosen from some alphabet. But note the *symbol* here; i.e., even in computer science, a *string* is not supposed to contain just *anything*, but rather symbols. So the actual manifestations in programming languages often *do* cover that notion quite precisely, often with an alphabet that coincides with or extends one used for everyday written communication between people.
What programming languages do, in addition, is add "text handling" capabilities *on top of* the core (CS...) string. I assume we are talking about both these layers in our "boost::string" discussion, and that we have one (or one out of seven) alphabet in mind for the underlying (CS) string.
Do you think we should go computer-scientific on "std::string"'s ass? And make sure it behaves as a CS string and nothing else? :-) Diving deeper into the meta semantics of "symbol" would lead us too far OT... /David

Hello Matus, On 21 January 2011 12:25, Matus Chochlik <chochlik@gmail.com> wrote:
Dear list,
following the whole string encoding discussion I would like to make some suggestions.
From the whole debate it is becoming clear, that instant switch from encoding-agnostic/platform-native std::string to UTF-8-encoded std::string is not likely to happen.
<snip />
/me ducks and covers :)
No need to duck and cover, you should be applauded for taking the initiative here.
The idea is that, let std::string/wstring be platform-specifically encoded as it is now, but also let the boost::string handle the conversions as transparently as possible, so that if the standard adopts it, std::string would become a synonym for boost::string.
This is *very* promising.
It is only partially implemented and there are two examples showing how things could work, but the real UTF-8 validation, transcoding, and error handling is of course missing. Remember, it is aimed at the design of the interfaces at this point.
If you have the time, have a look and if my suggestions and/or the code looks completely wrong, please, feel free to slash it to pieces :), and if you feel up to it, propose something better.
If this, or something completely different and much better that comes out of it, is agreed upon, we could set up a dedicated git repository for Boost.String and maybe try out whether the newly suggested collaborative development in per-boost-component repositories really works. :) If some of the people that are skilled with Unicode would join or lead the effort it would be awesome.
Definitely make a Git repository. Regards, Glyn

Matus Chochlik wrote:
Dear list,
Then it was proposed that we create a utf8_t string type that would be used *together* (for all eternity) with the standard basic_string<>. While I see the advantages here, I (as I already said elsewhere) have the following problem with this approach:
Using a name like utf8_t or u8string, string_utf8, etc. at least to me (and I've consulted this off the list, with several people) suggests, that UTF-8 is still something special and IMO also sends the message that it is OK to remain forever with the various encodings and std::string as it is today.
Rather than viewing std::string as a sequence of character encodings, view it as a sequence of bytes along with a few extra functions compared to std::vector. Lots of programs use std::string in this way without depending upon any behavior related to character encoding. Now, consider utf8_string as a sequence of character encodings which might be implemented in terms of std::string. It's a different thing and should have a different name.
We should *IMO* endorse the opposite.
It is not our proper role to endorse or deprecate programming practices. It's a fool's errand in any case. The best anyone can do is provide alternatives and explain why he thinks they are superior.
My suggestion is the following:
Let us create a class called boost::string that will have all the properties that a string handling class in 2011+ A.D.
What happens in 2021 A.D. when it is discovered that "they did it wrong"?
should have, basically what std::string should have been.
what you (or we, or someone else) think string should have been. This idea depends upon a few presumptions which are not true:
a) that std::string is used only for character encodings.
b) that someone can know all the things that std::string might be used for as it is.
c) that someone now has the knowledge to design a new version of std::string which will never need to be changed.
Basically, if you're going to make a "new" thing - fine - just make sure you give it a new name. Robert Ramey

On Fri, Jan 21, 2011 at 5:48 PM, Robert Ramey <ramey@rrsd.com> wrote:
Matus Chochlik wrote:
Using a name like utf8_t or u8string, string_utf8, etc. at least to me (and I've consulted this off the list, with several people) suggests, that UTF-8 is still something special and IMO also sends the message that it is OK to remain forever with the various encodings and std::string as it is today.
Rather than viewing std::string as a sequence of character encodings, view it as a sequence of bytes along with a few extra functions compared to std::vector. Lots of programs use std::string in this way without depending upon any behavior related to character encoding.
Of course, this is what has been referred to during the discussion as encoding-agnostic usage. But if I use a string to refer to the same thing on different platforms (path, url, proper name, etc.) then I would like the byte sequence to be the same, for the following reason: today data is commonly sent over networks between computers with different platforms, and even if on one machine you don't care which byte sequence represents a string of logical characters, you have to worry about it when you send the string to another machine, because it might interpret the sequence differently. To avoid data corruption during this process there has to be an agreement on a common representation at some point during the transfer. In the past this was not such a big deal because computers were standalone and the transcoding could be handled manually. But today moving data around is so prevalent that it becomes unfeasible to do it explicitly.
Now, consider utf8_string as a sequence of character encodings which might be implemented in terms of std::string. It's a different thing and should have a different name.
This would mean that if someone uses for example a class member variable that you intended to be just a byte sequence as a character sequence he would have to make a copy.
We should *IMO* endorse the opposite.
It is not our proper role to endorse or deprecate programming practices. It's a fool's errand in any case. The best anyone can do is provide alternatives and explain why he thinks they are superior.
OK, by "endorsing" I meant here not just talking about it and convincing people that it is superior without proving it (as became clear to me in the other thread of the debate), but actually implementing something better than the current std::string, with the properties described above, and letting the "market" decide. But in the end you have to believe in what you are doing.
My suggestion is the following:
Let us create a class called boost::string that will have all the properties that a string handling class in 2011+ A.D.
What happens in 2021 A.D. when it is discovered that "they did it wrong"?
Then the people who find that out will do a lot of complaining about it and eventually they will create something even better. I'm not so naive as to think that we can create a string class which will be used for the next 500 years :) But if we create something that will make life in the next 10-20 years easier, then it will be worth the effort.
should have, basically what std::string should have been.
what you (or we, or someone else) think string should have been.
Of course I don't think that I alone can come up with the "uber_string", but this is Boost with all its gurus :) so if there is a place where a good string class can be born then it is IMO here.
This idea depends upon a few presumptions which are not true. a) that std::string is used only for character encodings.
No, I imagine it to be (partially) backward compatible with std::string, but also to have Unicode-aware features, so it can be used as both the byte sequence and the logical-character sequence.
b) that someone can know all the things that std::string might be used for as it is
I think we can do reasonable assumptions.
c) that someone now has the knowledge to design a new version of std::string which will never need be changed.
I never said anything like this, see above.
Basically, if you're going to make a "new" thing - fine - just make sure you give it a new name.
I'm not thinking about it as a completely new thing, more like future std::string 2.0, an upgrade not a replacement. BR, Matus

On 01/21/2011 01:32 PM, Matus Chochlik wrote:
... elision by patrick .... I'm not thinking about it as a completely new thing, more like future std::string 2.0, an upgrade not a replacement.
It's not an upgrade because suddenly, many things that people do with std::string wouldn't work. You're really taking a special-case subset of what the std::string is used for and saying that it's the only one that people can use any more. If you are throwing away much of the utility and focusing in on a special case, don't name it string because it's no longer a general case string. Patrick

At Fri, 21 Jan 2011 12:25:07 +0100, Matus Chochlik wrote:
Let us create a class called boost::string that will have all the properties that a string handling class in 2011+ A.D. should have, basically what std::string should have been.
That's the direction I was aiming in. However, I'm not 100% confident in that direction, particularly because Peter seemed to be going the other way, and I know from hard experience that he's almost never wrong about anything. So take my +1 with a grain of salt :-) -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Fri, Jan 21, 2011 at 6:25 AM, Matus Chochlik <chochlik@gmail.com> wrote:
Dear list,
following the whole string encoding discussion I would like to make some suggestions.
From the whole debate it is becoming clear, that instant switch from encoding-agnostic/platform-native std::string to UTF-8-encoded std::string is not likely to happen.
Then it was proposed that we create a utf8_t string type that would be used *together* (for all eternity) with the standard basic_string<>. While I see the advantages here, I (as I already said elsewhere) have the following problem with this approach:
Using a name like utf8_t or u8string, string_utf8, etc. at least to me (and I've consulted this off the list, with several people) suggests, that UTF-8 is still something special and IMO also sends the message that it is OK to remain forever with the various encodings and std::string as it is today. We should *IMO* endorse the opposite.
IMO, any serious Unicode string proposal has to address UTF-8 strings, UTF-16 strings, UTF-32 strings, and probably UTF strings where the particular UTF encoding is established at runtime. Applications that deal with Asian languages, do a lot of random access, or would pay a performance or storage penalty will demand more than just UTF-8 strings. There might be other variants, too, such as a BMP-string. If a Unicode string library provides a strong design framework that is clearly articulated, then an initial implementation would only have to provide the most needed types: UTF-8 and UTF-16/BMP. I really doubt any proposal will get taken very seriously if it only supports one of the UTF encodings. --Beman

At Fri, 21 Jan 2011 12:50:36 -0500, Beman Dawes wrote:
I really doubt any proposal will get taken very seriously if it only supports one of the UTF encodings.
That is not necessarily the same as saying there needs to be a string class for each of the UTF- encodings. -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On Fri, Jan 21, 2011 at 12:53 PM, Dave Abrahams <dave@boostpro.com> wrote:
At Fri, 21 Jan 2011 12:50:36 -0500, Beman Dawes wrote:
I really doubt any proposal will get taken very seriously if it only supports one of the UTF encodings.
That is not necessarily the same as saying there needs to be a string class for each of the UTF- encodings.
Agreed. Without knowing exactly what functionality is to be supported for each UTF string, I find it hard to know if one basic_utf_string template could cover the most critical cases. Would a template be able to support random access iterators for the variable length encodings, for example? I just don't know. --Beman

On Fri, Jan 21, 2011 at 3:29 PM, Beman Dawes <bdawes@acm.org> wrote:
Without knowing exactly what functionality is to be supported for each UTF string, I find it hard to know if one basic_utf_string template could cover the most critical cases. Would a template be able to support random access iterators for the variable length encodings, for example? I just don't know.
Is that a critical case? Random-access over what? Code points? Characters? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 01/21/2011 09:50 AM, Beman Dawes wrote:
... elision by patrick ....
IMO, any serious Unicode string proposal has to address UTF-8 strings, UTF-16 strings, UTF-32 strings, and probably UTF strings where the particular UTF encoding is established at runtime. Applications that deal with Asian languages, do a lot of random access, or would otherwise pay a performance or storage penalty will demand more than just UTF-8 strings. There might be other variants, too, such as a BMP string. If a Unicode string library provides a strong design framework that is clearly articulated, then an initial implementation would only have to provide the most needed types: UTF-8 and UTF-16/BMP.
I really doubt any proposal will get taken very seriously if it only supports one of the UTF encodings.
+1, with the caveat that UTF-8 and UTF-32 are considered by many to be the most needed types, with UTF-16 considered evil. (Seems to be a Windows/non-Windows split. I like them all;) So all three (four, if you want to differentiate between fixed-width UTF-16/BMP (really UCS-2) and the full UTF-16) would be needed to avoid people saying that it doesn't fill their needs, so why did we bother. The UTF string with run-time encoding selection would carry a lot of extra code. Wouldn't a programmer know which encoding he wanted to use internally at compile time? Patrick p.s. Nice quick description of the differences between, and history of, UCS-2, UCS-4, UTF-8, UTF-16, and UTF-32 at http://en.wikipedia.org/wiki/Universal_Character_Set

On Fri, Jan 21, 2011 at 8:47 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/21/2011 09:50 AM, Beman Dawes wrote:
... elision by patrick ....
IMO, any serious Unicode string proposal has to address UTF-8 strings, UTF-16 strings, UTF-32 strings, and probably UTF strings where the particular UTF encoding is established at runtime. Applications that deal with Asian languages, do a lot of random access, or would otherwise pay a performance or storage penalty will demand more than just UTF-8 strings. There might be other variants, too, such as a BMP string. If a Unicode string library provides a strong design framework that is clearly articulated, then an initial implementation would only have to provide the most needed types: UTF-8 and UTF-16/BMP.
I really doubt any proposal will get taken very seriously if it only supports one of the UTF encodings.
+1, with the caveat that UTF-8 and UTF-32 are considered by many to be the most needed types, with UTF-16 considered evil. (Seems to be a Windows/non-Windows split. I like them all;)
IIRC, Oracle supports UTF-8 and UTF-16, so a lot of folks will want UTF-16 for that reason. It isn't just Windows programmers.
So all three (four if you want to differentiate between fixed-width UTF-16/BMP (really UCS-2) and the full UTF-16) would be needed to avoid people saying that it doesn't fill their needs so why did we bother.
Yep.
The UTF string with run-time encoding selection would carry a lot of extra code. Wouldn't a programmer know which encoding he wanted to use internally at compile time?
Maybe. But I've written geographic libraries that have to be efficient for North American, European, and Asian languages alike. It used to be that we knew at compile time which languages we would be dealing with, but more and more, because of the internet, the libraries just have to work well everywhere. The cost of the extra code is swamped by the other costs involved. That said, such a string is far lower priority than the others.
p.s. Nice quick description of the differences between and history of UCS-2 UCS-4 utf-8 utf-16 utf-32 at http://en.wikipedia.org/wiki/Universal_Character_Set
Yep, recommended! Thanks, --Beman

On Sat, 22 Jan 2011 14:10:10 -0800, Beman Dawes <bdawes@acm.org> wrote:
On Fri, Jan 21, 2011 at 8:47 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/21/2011 09:50 AM, Beman Dawes wrote:
<snip>
+1, with the caveat that UTF-8 and UTF-32 are considered by many to be the most needed types, with UTF-16 considered evil. (Seems to be a Windows/non-Windows split. I like them all;)
IIRC, Oracle supports UTF-8 and UTF-16, so a lot of folks will want UTF-16 for that reason. It isn't just Windows programmers.
FYI: OS/X and iOS both also use UTF-16. Mostafa

On Sun, 23 Jan 2011 17:58:45 -0800, Mostafa <mostafa_working_away@yahoo.com> wrote:
On Sat, 22 Jan 2011 14:10:10 -0800, Beman Dawes <bdawes@acm.org> wrote:
On Fri, Jan 21, 2011 at 8:47 PM, Patrick Horgan <phorgan1@gmail.com> wrote:
On 01/21/2011 09:50 AM, Beman Dawes wrote:
<snip>
+1, with the caveat that UTF-8 and UTF-32 are considered by many to be the most needed types, with UTF-16 considered evil. (Seems to be a Windows/non-Windows split. I like them all;)
IIRC, Oracle supports UTF-8 and UTF-16, so a lot of folks will want UTF-16 for that reason. It isn't just Windows programmers.
FYI: OS/X and iOS both also use UTF-16.
Though I might add that they do provide means to convert between UTF-8 and their NSString class. Mostafa
participants (28)
- Alexander Lamaison
- Artyom
- Beman Dawes
- Chad Nelson
- Christian Holmquist
- Dave Abrahams
- David Bergman
- Dean Michael Berris
- Eric Niebler
- Glyn Matthews
- Gregory Crosswhite
- Hartmut Kaiser
- Ivan Le Lann
- Jeremy Maitin-Shepard
- Joel de Guzman
- Joel Falcou
- Matus Chochlik
- Mostafa
- Nevin Liber
- Patrick Horgan
- Peter Dimov
- Phil Endecott
- Robert Ramey
- Sebastian Redl
- Stefano Delli Ponti
- Steven Watanabe
- Stewart, Robert
- Yakov Galka