
Hi Boosters,

There was a discussion about char[] support in the Boost.Range library. The issue seems important, and I'd like to share my ideas about a possible solution.

First, let's summarize the problems and the goals.

The problems:

char[], and possibly any other type that can be used as a c-string (this includes wchar_t, but also int, long, etc. when used to hold Unicode code points), might represent two different things:

  1.) a c-string literal
  2.) an arbitrary c-array

The two views differ in how the length is calculated. They are totally incompatible and, what's worse, using the wrong one can lead to accidental access violations.

An example:

  char str[] = "Hello";
  // typeof(str) is char[6], str == {'H','e','l','l','o',0}

In the c-string view, str has 5 letters and ends after the 'o', so the range is [str, str+5).
In the c-array view, str is 6 elements long and ends after the '\0', so the range is [str, str+6).

From the user's perspective, both views are equally important; which one is preferable depends on the usage scenario. The important thing to keep in mind is that the choice is strictly relative. For string algorithms, for example, the c-string view is the obvious default, while for a data-processing library the c-array view is the better choice.

The current implementation is not ideal. First of all, char[] and wchar_t[] are treated differently from arrays of all other types, which causes confusion. Secondly, it is not possible to use char[] as an ordinary array.

The goals:

From the problem analysis above, the following goals can be derived:

  1. We need to support both views equally.
  2. A user must always be able to specify explicitly which view he requires.
  3. It should be possible for a library writer to select the default view for his library. However, point (2) must hold, so the user must be able to override this default.
  4. Support must be present in the Boost.Range library. It is not feasible to ask library writers to provide specific workarounds/hacks. That would simply break the idea of Boost.Range as a unified interface to range-like data structures.

The solution:

I propose two free-standing functions, as_string() and as_array() (naming is not important now).

Both should have the same generic signature:

  template<typename RangeT>
  boost::sub_range<RangeT> as_string(RangeT& aRange);
  template<typename RangeT>
  boost::sub_range<RangeT> as_array(RangeT& aRange);

By default, the functions only copy the input range to the target. For types like char[], however, the results differ: as_string() creates a sub_range delimiting the string literal (using char_traits<char>::length, for instance), while as_array() uses the compile-time boundaries.

In addition, we might consider opening this interface to user-defined types, even though I'm not sure how it would be used.

Please note that once either of these manipulators has been applied to a range, applying one again has no effect.

Let's see how this facility can be used.

A library writer can set the default by writing an algorithm like this:

  template<typename RangeT>
  ... AnAlgorithm(const RangeT& aRange)
  {
      boost::sub_range<RangeT> StrRange = as_string(aRange);

      // Do something with StrRange
  }

If a user calls AnAlgorithm directly:

  char str[] = "hello";
  AnAlgorithm(str);

str will be converted to a range delimiting a string literal. However, he can also use as_array():

  char str[] = {'h', 'e', 'l', 'l', 'o'};
  AnAlgorithm(as_array(str));

This time no conversion takes place, since as_array() returns a sub_range.

Note that for AnAlgorithm it does not matter which default is used in the Range library.

Open questions:

- I have intentionally not included a proposal for the default view that the Range library should provide. The goal of this solution is to provide a mechanism that does not depend on it. I'd like to leave that for the discussion. Right now it seems that most of the people who have joined the discussion prefer the c-array view. I would prefer the c-string view, but I'm probably biased by the fact that I'm the author of the StringAlgo library.

- There is room for extensions to the basic proposal. For instance, as_string() might take a second parameter that identifies the terminator.

- The length of a string literal can be calculated in two ways: either by using strlen() (or similar), or by using the compile-time size N decreased by 1 (N-1). The latter approach is faster and allows literals like this:

    char str[] = "hello\0bye";

  But it differs from char* handling and therefore might be confusing.

Best Regards,
Pavol
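The proposal above can be sketched in a few lines. This is only an illustration under stated assumptions, not Boost.Range code: view is a hypothetical stand-in for boost::sub_range, an_algorithm is a made-up name playing the role of AnAlgorithm, and the c-string view uses a scan bounded by N rather than char_traits<char>::length, so that an array without a terminator stays safe.

```cpp
#include <cstddef>

// Hypothetical stand-in for boost::sub_range: a pair of pointers.
template <typename T>
struct view {
    T* first;
    T* last;
    std::size_t size() const { return static_cast<std::size_t>(last - first); }
};

// c-array view of a built-in array: all N elements, terminator included.
template <typename T, std::size_t N>
view<T> as_array(T (&a)[N]) {
    return view<T>{a, a + N};
}

// c-string view of a built-in array: elements before the first terminator.
// The scan is bounded by N, so an unterminated array yields all N elements.
template <typename T, std::size_t N>
view<T> as_string(T (&a)[N]) {
    std::size_t n = 0;
    while (n < N && a[n] != T()) ++n;
    return view<T>{a, a + n};
}

// Once a view has been chosen, applying a manipulator again changes nothing.
template <typename T>
view<T> as_string(const view<T>& v) { return v; }

template <typename T>
view<T> as_array(const view<T>& v) { return v; }

// A string-oriented algorithm picks the c-string view as its default.
template <typename Range>
std::size_t an_algorithm(const Range& r) {
    return as_string(r).size();
}
```

With char str[] = "hello", an_algorithm(str) sees 5 elements, while an_algorithm(as_array(str)) sees 6, because the inner as_string() call leaves an already chosen view untouched. That is the override described in goal (3).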

"Pavol Droba" <droba@topmail.sk> wrote in message news:464931589.20050515230222@topmail.sk...
| Hi Boosters,
|
| There was a discussion about char[] support in the Boost.Range
| library. The issue seems important and I'd like to express my
| ideas about a possible solution.
|
| First let's summarize problems and goals.
| The problems:
| char[] and possibly any other type that can be used as a c-string
| (this includes wchar_t, but also int, long, etc. when used as a
| Unicode code point) might represent two different things:
| 1.) c-string literal
| 2.) arbitrary c-array
|
| Both views differ in length calculation, which is totally
| incompatible and, what's worse, it can lead to accidental access
| violations when used improperly.
|
| An example:
| char str[] = "Hello";
| // typeof(str) is char[6], str=={'H','e','l','l','o',0}
|
| In the c-string view, str has 5 letters and ends after the 'o'.
| So the range should be [str, str+5)
| In the c-array view, str is 6 elements long and ends after the '\0'.
| The range is [str, str+6)
|
| From the user perspective, both views are equally important,
| however according to the usage scenario, one might be preferable
| over the other one. Important aspect to keep in mind is this strict
| relativeness. For example for string algorithms c-string literal is
| obvious default, while for a data processing library the second
| choice is better.
|
| Current implementation is not ideal. First of all, there is a
| difference between char[], wchar_t[] and the rest of the types.
| This brings some confusion. Secondly, it is not possible to use
| char[] as an ordinary array.

No default can meet everyone's expectations.

| The goals:
| From the problem analysis above, following goals can be implied
|
| 1. we need to support both views equally

If you remove "equally", I agree.

| 2. a user must be always able to explicitly specify what type
| of view he requires
| 3. it should be possible for a library writer to select default
| view for his library.
| However point (2) must hold, so the user must be able
| to override this default.
| 4. Support must be present in the Boost.Range library.
| It is not feasible to ask library writer to provide
| specific workarounds/hacks. It would simply break the idea
| of Boost.Range library as a unified interface to range-like
| data structures.
|
| The solution:
|
| I propose to have two free-standing functions
| as_string() and as_array() (naming is not important now).
|
| Both should have the same generic signature:
|
| template<typename RangeT>
| boost::sub_range<RangeT> as_string(RangeT& aRange);
| template<typename RangeT>
| boost::sub_range<RangeT> as_array(RangeT& aRange);
|

+ const overloads

| By default, the functions only copy the input range to the target.
| However for the types like char[], the result will differ.
| as_string() will create a sub_range delimiting string
| literal (using char_traits<char>::length for instance), while as_array()
| will use compile-time boundaries.

Sounds fair.

| In addition we might consider to open this interface for
| user-defined type, even if I'm not sure how it can be used.

With ADL. The library says

  using boost::as_string;
  foo( as_string(bar) );

| Please note, that once any of these manipulators is applied to a
| range following application will have no effect.
|
| Let's see how this facility can be used:
|
| A library writer can set the default by writing algorithm like
| this:
|
| template<typename RangeT>
| ... AnAlgorithm(const RangeT& aRange)
| {
| boost::sub_range<RangeT> StrRange=as_string(aRange);
|
| // Do something with StrRange
| }
|
| If a user calls AnAlgorithm directly:
| char str[]="hello";
| AnAlgorithm(str);
|
| str will be converted to a range, delimiting a string literal.
| However he can also use as_array():
|
| char str[]={'h', 'e', 'l', 'l', 'o'};
| AnAlgorithm(as_array(str));
|
| This time no conversion will take place, since as_array() returns
| sub_range.
|
| Note, that for the AnAlgorithm it does not matter what default is
| used in the Range library.
|
| Open questions:
|
| - I have intentionally not included a proposal for the default view
| that the Range library should provide.
| Goal of this solution is to provide a way, that is not dependent
| on this.
| I'd like to leave it for the discussion. Right now it seems, that
| most of the people that entered discussion prefer c-array view.
| I would prefer c-string view, but I'm probably biased by the fact
| that I'm the author of StringAlgo library.

I prefer the string view too. That's how Boost.Range was designed.

There is one problem with the default today IMO: char[] should call char_traits<char>::length();

| - There is a space for possible extensions to the basic proposal.
| For instance, as_string() might have the second parameter that
| will identify a terminator.
|
| - String literal length can be calculated in two ways. Either by
| using strlen() (or alike), or using compile-time size (N)
| decreased by 1 (N-1).

For const char[], this would be the way to go... and that is also how it is implemented by default.

-Thorsten
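The difference between the two length calculations mentioned at the end of this exchange can be made concrete. The function names below are hypothetical and exist only for this illustration; they are not part of Boost.Range.

```cpp
#include <cstddef>
#include <cstring>

// Length taken from the compile-time array size, minus the implicit terminator.
template <std::size_t N>
constexpr std::size_t literal_length(const char (&)[N]) {
    return N - 1;
}

// Length found by a run-time scan for the first '\0', as for a plain char*.
inline std::size_t scanned_length(const char* s) {
    return std::strlen(s);
}
```

For const char str[] = "hello\0bye" the array has 10 elements, so literal_length(str) is 9 and covers the embedded '\0', while scanned_length(str) stops at 5. This is the inconsistency with char* handling that the original post warns about.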

"Thorsten Ottosen" <nesotto@cs.auc.dk> writes:

> | I'd like to leave it for the discussion. Right now it seems, that
> | most of the people that entered discussion prefer c-array view.
> | I would prefer c-string view, but I'm probably biased by the fact
> | that I'm the author of StringAlgo library.
>
> I prefer the string view too.

I just have one thing to say: vector<bool>.

> That's how Boost.Range was designed.

Doesn't make it right.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com

On Sun, May 15, 2005 at 08:17:15PM -0400, David Abrahams wrote:
> "Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
>
>> | I'd like to leave it for the discussion. Right now it seems, that
>> | most of the people that entered discussion prefer c-array view.
>> | I would prefer c-string view, but I'm probably biased by the fact
>> | that I'm the author of StringAlgo library.
>>
>> I prefer the string view too.
>
> I just have one thing to say: vector<bool>.

Pardon me, but somehow I cannot figure out the point here. Can you please explain to me the connection to vector<bool>?

Thank you.
Pavol.

Pavol Droba wrote:
> On Sun, May 15, 2005 at 08:17:15PM -0400, David Abrahams wrote:
>> "Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
>>
>>> | I'd like to leave it for the discussion. Right now it seems, that
>>> | most of the people that entered discussion prefer c-array view.
>>> | I would prefer c-string view, but I'm probably biased by the fact
>>> | that I'm the author of StringAlgo library.
>>>
>>> I prefer the string view too.
>>
>> I just have one thing to say: vector<bool>.
>
> Pardon me, but somehow I cannot figure out the point here. Can you please
> explain to me the connection to vector<bool>

vector<bool> creates all kinds of problems because generic code can't make assumptions about the behavior of vector<T>. vector<bool> is widely regarded as a Bad Move. Dave is saying that treating char[] differently from, say, int[] is inviting the same sorts of problems. It will make it difficult to deal with T[] in generic code.

I agree with Dave.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
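The kind of broken assumption Eric refers to can be shown with standard C++ alone. In the sketch below, the trait and the helper function are hypothetical names introduced for illustration. The underlying fact is that std::vector<bool>::reference is a proxy class rather than bool&.

```cpp
#include <type_traits>
#include <vector>

// True when indexing std::vector<T> yields a real reference to an element.
template <typename T>
struct indexes_to_real_reference
    : std::is_same<typename std::vector<T>::reference, T&> {};

static_assert(indexes_to_real_reference<int>::value,
              "vector<int>::reference is int&");
static_assert(!indexes_to_real_reference<bool>::value,
              "vector<bool>::reference is a proxy class, not bool&");

// Generic code that silently relies on the assumption above.
// It works for vector<int> but does not compile when instantiated with bool.
template <typename T>
T* address_of_first(std::vector<T>& v) {
    return &v[0];
}
```

The analogy drawn in the thread is that a range library giving char[] a different extent than int[] would force generic code over T[] to carry a similar hidden exception.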

On Mon, May 16, 2005 at 12:32:56PM -0700, Eric Niebler wrote:
> Pavol Droba wrote:
>> On Sun, May 15, 2005 at 08:17:15PM -0400, David Abrahams wrote:
>>> "Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
>>>
>>>> | I'd like to leave it for the discussion. Right now it seems, that
>>>> | most of the people that entered discussion prefer c-array view.
>>>> | I would prefer c-string view, but I'm probably biased by the fact
>>>> | that I'm the author of StringAlgo library.
>>>>
>>>> I prefer the string view too.
>>>
>>> I just have one thing to say: vector<bool>.
>>
>> Pardon me, but somehow I cannot figure out the point here. Can you please
>> explain to me the connection to vector<bool>
>
> vector<bool> creates all kinds of problems because generic code can't
> make assumptions about the behavior of vector<T>. vector<bool> is widely
> regarded as a Bad Move. Dave is saying that treating char[] different
> than, say, int[] is inviting the same sorts of problems. It will make it
> difficult to deal with T[] in generic code.
>
> I agree with Dave.

I see now what you mean. I understand this problem, and that is why I have suggested a solution that works regardless of the decision taken here.

But allow me to mention another similarity. There is a well-known C idiom that char[] is generally treated as char*. For char* we can only provide c-string handling, so it also causes confusion if char[] is treated differently. Still, this similarity is probably not as strong as the one between the generic T[] types.

* * * * * * * * * * *

But I see that this discussion is starting to go around in circles, and it is time to stop it.

There was no objection to the solution I have proposed, so I take it that it is acceptable.

For the default behaviour, we can favor either the syntactic (T[]) or the semantic (char*) similarity. Because syntactic similarity is usually stronger, let's stick with that one.

In addition, I would suggest removing direct support for pointer types like char* from the Range library and providing it only via the as_string() construct, to remove any kind of confusion for good.

If there are any complaints, suggestions, or comments, do not hesitate to bring them up.

Best Regards,
Pavol.
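The tension between the T[] similarity and the char* similarity comes from array-to-pointer decay. A small sketch, using hypothetical helper names, shows that the same char[] argument looks like a pointer when taken by value and like an array when taken by reference, which is what lets a library tell the two apart at all.

```cpp
#include <type_traits>

// By-value parameter: an array argument decays to a pointer, so its size is lost.
template <typename T>
constexpr bool seen_as_pointer_by_value(T) {
    return std::is_pointer<T>::value;
}

// By-reference parameter: the array type, including its extent, is preserved.
template <typename T>
constexpr bool seen_as_array_by_reference(T&) {
    return std::is_array<T>::value;
}
```

For char str[] = "Hello", both functions return true. For a const char* pointing at the same characters, the second returns false, so only a terminator scan can delimit it.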

"Eric Niebler" <eric@boost-consulting.com> wrote in message news:4288F568.9030205@boost-consulting.com...
|
| Pavol Droba wrote:
| > On Sun, May 15, 2005 at 08:17:15PM -0400, David Abrahams wrote:
| >
| >>"Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
| >>
| >>>| I'd like to leave it for the discussion. Right now it seems, that
| >>>| most of the people that entered discussion prefer c-array view.
| >>>| I would prefer c-string view, but I'm probably biased by the fact
| >>>| that I'm the author of StringAlgo library.
| >>>
| >>> I prefer the string view too.
| >>
| >>I just have one thing to say: vector<bool>.
| >>
| >
| > Pardon me, but somehow I cannot figure out the point here. Can you please
| > explain to me the connection to vector<bool>
| >
|
| vector<bool> creates all kinds of problems because generic code can't
| make assumptions about the behavior of vector<T>. vector<bool> is widely
| regarded as a Bad Move. Dave is saying that treating char[] different
| than, say, int[] is inviting the same sorts of problems. It will make it
| difficult to deal with T[] in generic code.
|
| I agree with Dave.

Are these "problems" as serious as vector<bool>?

-Thorsten

Thorsten Ottosen wrote:
> "Eric Niebler" <eric@boost-consulting.com> wrote in message news:4288F568.9030205@boost-consulting.com...
> |
> | Pavol Droba wrote:
> | > On Sun, May 15, 2005 at 08:17:15PM -0400, David Abrahams wrote:
> | >
> | >>"Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
> | >>
> | >>>| I'd like to leave it for the discussion. Right now it seems, that
> | >>>| most of the people that entered discussion prefer c-array view.
> | >>>| I would prefer c-string view, but I'm probably biased by the fact
> | >>>| that I'm the author of StringAlgo library.
> | >>>
> | >>> I prefer the string view too.
> | >>
> | >>I just have one thing to say: vector<bool>.
> | >>
> | >
> | > Pardon me, but somehow I cannot figure out the point here. Can you please
> | > explain to me the connection to vector<bool>
> | >
> |
> | vector<bool> creates all kinds of problems because generic code can't
> | make assumptions about the behavior of vector<T>. vector<bool> is widely
> | regarded as a Bad Move. Dave is saying that treating char[] different
> | than, say, int[] is inviting the same sorts of problems. It will make it
> | difficult to deal with T[] in generic code.
> |
> | I agree with Dave.
>
> are these "problems" as serious as vector<bool> ?

Yes. It's exactly the same situation. You are giving collection<X> different semantics than collection<Y>.

BTW, I find your use of quotes rather "disparaging". If you think this situation is different, please give /technical/ reasons. Thanks.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

"Eric Niebler" <eric@boost-consulting.com> wrote in message news:428926DB.2090108@boost-consulting.com...
> Thorsten Ottosen wrote:
>> are these "problems" as serious as vector<bool> ?
>
> Yes. It's exactly the same situation. You are giving collection<X>
> different semantics than collection<Y>.

Actually, it's worse than with vector<bool>. At least vector<bool> and vector<T> still have the same underlying semantics, and still look the same over all the elements if you're careful to limit what you look at, so it's possible to write algorithms that will work for both of them. That's not the case here.

> BTW, I find your use of quotes rather "disparaging". If you think this
> situation is different, please give /technical/ reasons. Thanks.

Thorsten's contributions would go much further if he were more willing to learn from the mistakes of the past and give weight to the acquired wisdom of a community that has been doing C++ library design for a long time.

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

"Eric Niebler" <eric@boost-consulting.com> wrote in message news:428926DB.2090108@boost-consulting.com...
|
| Thorsten Ottosen wrote:
| > "Eric Niebler" <eric@boost-consulting.com> wrote in message
| > | vector<bool> creates all kinds of problems because generic code can't
| > | make assumptions about the behavior of vector<T>. vector<bool> is widely
| > | regarded as a Bad Move. Dave is saying that treating char[] different
| > | than, say, int[] is inviting the same sorts of problems. It will make it
| > | difficult to deal with T[] in generic code.
| > |
| > | I agree with Dave.
| >
| > are these "problems" as serious as vector<bool> ?
| >
|
| Yes. It's exactly the same situation. You are giving collection<X>
| different semantics than collection<Y>.

I guess my objection is to the use of "exactly".

If I have

  template< class T >
  class my_vec
  {
      std::vector<T> vec;
  };

then I might need some traits for dealing with the bool case. If I have

  template< class Range, class OutIter >
  void copy( const Range&, OutIter );

then I just need to be able to say what a Range means. I think it would be wrong for the algorithm to assume anything; that would not give the caller a chance to decide. And that is why Pavol said, let's allow

  copy( as_array( rng ), out );
  copy( as_string( rng ), out );
  copy( rng, out );

If we don't want to give string literals a special meaning in the range library, then we end up with the rather clumsy

  sub_range<string> r = find( rng, as_string( "foo" ) );

compared to

  sub_range<string> r = find( rng, "foo" );

It is not obvious why you would ever want to include the 0 in a literal, but if you want to, there should be a way to do so. Currently there isn't any support for that in the range lib, but there should be.

| BTW, I find your use of quotes rather "disparaging". If you think this
| situation is different, please give /technical/ reasons. Thanks.

Yeah, sorry. I guess I was equally annoyed by the claim that we had a new problem simply by stating vector<bool> was a problem. I don't see that close an analogy, and so the argument made by Dave could be used to argue for anything.

I don't think the decisions of the range library were a big issue during the review; Peter Dimov was, I think, one who said something along the lines of your opinion. Anyway, it wasn't (and isn't) obvious to me that having special cases in the range library is a bad idea, as long as you can still explicitly ask for the other generic version.

br

-Thorsten

"Thorsten Ottosen" <nesotto@cs.auc.dk> writes:

> "Eric Niebler" <eric@boost-consulting.com> wrote in message news:428926DB.2090108@boost-consulting.com...
> |
> | Yes. It's exactly the same situation. You are giving collection<X>
> | different semantics than collection<Y>.
>
> I guess my objection is to the use "exactly".
>
> If I have
>
>   template< class T > class my_vec { std::vector<T> vec; };
>
> then I might need some traits for dealing with the bool case.

What kind of traits?

> If I have
>
>   template< class Range, class OutIter > void copy( const Range&, OutIter );
>
> then I just need to be able to say what a Range means.

Yes, that's what makes your case worse. It doesn't just change the interface details as with vector<T>, it changes the fundamental meaning of T[N], leading to potential undefined behavior in some very common cases.

> I think it would be wrong for the algorithm to assume anything; that
> would not give the caller a chance to decide.

Then you can't decide anything about T[N] either. People use all kinds of arrays with terminating sentinels, not just arrays of char.

> Yeah, sorry. I guess I was equally annoyed by the claim that we had a
> new problem simply by stating vector<bool> was a problem.

It's a valid claim, made without ad-hominem overtones.

> I don't see that close an analogy and so the argument made by Dave
> could be used to argue for anything.

If you don't see the analogy, you're not looking hard enough. It happens whenever a whole category of types gets a particular treatment by generic code, but just a few outliers in that category get a different treatment. And as I said in my previous posting, the problem is much worse in your case because there's no useful underlying concept that is modeled by both the ordinary array and the null-terminated string.

> I don't think the decisions of the range library was a big issue
> during the review; Peter Dimov was, I think, one who said something
> along your opinion.

Peter Dimov is one of those annoying people, you'll come to learn, who is almost always right. I wouldn't draw too many conclusions from the fact that perhaps nobody else objected to the design. Peter is often ahead of the perceptual curve, and some of us weren't participating in the review.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
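Dave's remark that sentinel-terminated arrays are not limited to char can be illustrated with a small table. Everything here is hypothetical: the command type, its contents, and the function names are made up for the example, and the pattern merely resembles what many C-style registration tables do.

```cpp
#include <algorithm>
#include <cstddef>

// A table entry; a null name marks the end of the table.
struct command {
    const char* name;
    int id;
};

static const command commands[] = {
    {"open", 1},
    {"close", 2},
    {nullptr, 0},   // sentinel
};

// Array view: counts every element, the sentinel included.
template <typename T, std::size_t N>
std::size_t array_size(const T (&)[N]) {
    return N;
}

// Sentinel view: counts entries before the terminator, never scanning past N.
template <std::size_t N>
std::size_t entries_before_sentinel(const command (&table)[N]) {
    const command* end = std::find_if(
        table, table + N,
        [](const command& c) { return c.name == nullptr; });
    return static_cast<std::size_t>(end - table);
}
```

Here array_size(commands) is 3 and entries_before_sentinel(commands) is 2. A library that special-cases only char[] would still hand generic code the 3-element view for this table, which is why the choice of view has to be made by the caller rather than inferred from the element type.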

"David Abrahams" <dave@boost-consulting.com> wrote in message news:ur7g69ifb.fsf@boost-consulting.com...
| "Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
|
| > "Eric Niebler" <eric@boost-consulting.com> wrote in message
| > news:428926DB.2090108@boost-consulting.com...
| > |
| > | Yes. It's exactly the same situation. You are giving collection<X>
| > | different semantics than collection<Y>.
| >
| > I guess my objection is to the use "exactly".
| >
| > If I have
| >
| > template< class T >
| > class my_vec
| > {
| > std::vector<T> vec;
| > };
| >
| > then I might need some traits for dealing with the bool case.
|
| What kind of traits?

The kind of traits that you usually use when dealing with vector<bool>. :-)

Maybe you would replace vector<bool> with vector<int> under the hood.

| > If I have
| >
| > template< class Range, class OutIter >
| > void copy( const Range&, OutIter );
| >
| > then I just need to be able to say what a Range means.
|
| Yes, that's what makes your case worse. It doesn't just change the
| interface details as with vector<T>, it changes the fundamental
| meaning of T[N], leading to potential undefined behavior in some very
| common cases.

Are you saying that this couldn't happen if we change the defaults?

Personally, I don't care much about how most arrays are treated... but I do think having to write

  find( rng, as_string("foo") );

is simply weird and would be much more common than fiddling with fixed-size arrays with various sentinels.

-Thorsten

"Thorsten Ottosen" <nesotto@cs.auc.dk> writes:

> "David Abrahams" <dave@boost-consulting.com> wrote in message news:ur7g69ifb.fsf@boost-consulting.com...
> | "Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
> |
> | > "Eric Niebler" <eric@boost-consulting.com> wrote in message
> | > news:428926DB.2090108@boost-consulting.com...
> | > |
> | > | Yes. It's exactly the same situation. You are giving collection<X>
> | > | different semantics than collection<Y>.
> | >
> | > I guess my objection is to the use "exactly".
> | >
> | > If I have
> | >
> | > template< class T >
> | > class my_vec
> | > {
> | > std::vector<T> vec;
> | > };
> | >
> | > then I might need some traits for dealing with the bool case.
> |
> | What kind of traits?
>
> the kind of traits that you usually use when dealing with vector<bool>. :-)

I normally don't use any traits. That's part of what I've been trying to say. If you write the generic code carefully, you can avoid the non-uniformity and work on either kind of vector. Not so in your case.

> maybe you would replace vector<bool> with vector<int> under the hood.

?? If someone passes me a vector<bool> I can't replace it.

> | > If I have
> | >
> | > template< class Range, class OutIter >
> | > void copy( const Range&, OutIter );
> | >
> | > then I just need to be able to say what a Range means.
> |
> | Yes, that's what makes your case worse. It doesn't just change the
> | interface details as with vector<T>, it changes the fundamental
> | meaning of T[N], leading to potential undefined behavior in some very
> | common cases.
>
> are you saying that this couldn't happen if we change the defaults?

If you take the whole statement above together, then yes, it couldn't happen.

> Personally I don't care much about how most arrays are treated... but
> I do think having to write
>
>   find( rng, as_string("foo") );
>
> is simply weird and would be much more common than fiddling with
> fixed-size arrays with various sentinels.

... for you. It's much less weird for me. Fixed-size arrays with sentinels come up all the time in any code where the author wasn't comfortable deducing array sizes, for example, in normal Python/C++ binding code. And the other case you have to consider, also very common, is when you have fixed-size buffers of char that aren't null-terminated strings.

Sometimes, to design a robust interface, it's necessary to accept that your experience and use cases aren't universal.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
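The second case Dave mentions, a fixed-size char buffer that is not a null-terminated string, is where an unbounded c-string scan becomes undefined behavior. The sketch below uses hypothetical helper names and shows a field that may be completely full, together with a length function that never reads beyond the array.

```cpp
#include <algorithm>
#include <cstddef>

// Copy text into a fixed-size field without appending a terminator.
// Unused space is zero-padded; a full field contains no '\0' at all.
template <std::size_t N>
void fill_field(char (&field)[N], const char* text) {
    std::size_t i = 0;
    for (; i < N && text[i] != '\0'; ++i) field[i] = text[i];
    for (; i < N; ++i) field[i] = '\0';
}

// Length of the text in a field, never reading beyond its N elements.
template <std::size_t N>
std::size_t field_length(const char (&field)[N]) {
    return static_cast<std::size_t>(std::find(field, field + N, '\0') - field);
}
```

After fill_field(buf, "ABCD") on a char buf[4], field_length(buf) is 4, whereas calling strlen(buf) would run off the end of the array. A default c-string view for char[] would have the same problem unless it were bounded by the array size.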

On Tue, 2005-05-17 at 11:20 -0400, David Abrahams wrote:
> "Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
>
> > "David Abrahams" <dave@boost-consulting.com> wrote in message

> ... for you. It's much less weird for me. Fixed-size arrays with
> sentinels come up all the time in any code where the author wasn't
> comfortable deducing array sizes, for example, in normal Python/C++
> binding code.

They are used in the indefinite form of BER encoding in networking, as one example.

SOAP uses sentinels at start and end in the form of <tag> </tag>

> And the other case you have to consider -- also very
> common -- is when you have fixed-size buffers of char that aren't
> null-terminated strings.
These are very common in networking, for buffers to be passed to the C socket library. They are often declared as char[], unless dynamically allocated, in which case they would (often) be declared as char *.

/ikh

"Iain K. Hanson" <iain.hanson@videonetworks.com> wrote in message news:1116347348.14534.688.camel@dev-ihanson.ct.uk.videonetworks.com...
| On Tue, 2005-05-17 at 11:20 -0400, David Abrahams wrote:
| > "Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
| >
| > > "David Abrahams" <dave@boost-consulting.com> wrote in message
|
| > ... for you. It's much less weird for me. Fixed-size arrays with
| > sentinels come up all the time in any code where the author wasn't
| > comfortable deducing array sizes, for example, in normal Python/C++
| > binding code.
|
| They are used in indefinite form of BER encoding in networking as one
| example.
|
| SOAP uses sentinels at start and end in the form of <tag> </tag>
|
| > And the other case you have to consider -- also very
| > common -- is when you have fixed-size buffers of char that aren't
| > null-terminated strings.
| >
| These are very common in networking. For buffers to be passed to 'C'
| socket library. Often declared as char[] unless dynamically allocated
| when they would be declared ( often ) as char *.

What's wrong with using boost::array<char,N> for all this?

-Thorsten
participants (6)
- Alexander Nasonov
- David Abrahams
- Eric Niebler
- Iain K. Hanson
- Pavol Droba
- Thorsten Ottosen