[RFC] string inserter/extractor "q u o t i n g"

This is a request for comments on a delimited string inserter/extractor facility for Boost.

The initial motivation is a need for Boost.Filesystem class path inserters and extractors to handle paths with embedded spaces. The problem stems from standard library stream I/O not round-tripping strings with embedded spaces, so a general solution might be of interest. It is a bit surprising that Boost doesn't seem to already have such a utility.

The general solution presented here follows very common practice as to the format: strings are enclosed in delimiters on output, and the delimiters (if present) are stripped on input. In the external format, an escape character precedes each occurrence of the escape character itself or of the delimiter character within the string. The default delimiter character is the double-quote (") and the default escape character is the backslash (\). These defaults can be overridden.

Example:

  const std::string expected("foo\\bar, \" *");
  std::cout << delimit(expected);

  std::string actual;
  std::stringstream ss;
  ss << delimit(expected);
  ss >> undelimit(actual);
  assert(expected == actual);

The output to cout is "foo\\bar, \" *"

A prototype is available at http://svn.boost.org/svn/boost/branches/filesystem3/boost/delimit_string.hpp. It is a header-only implementation.

Do you think this component, with suitable documentation, tests, etc., should become a part of Boost? If so:

* What library should it be part of? It seems too small to be a library all of its own.
* Where should the header live?
* What namespace should it be in?
* Are delimit() and undelimit() suitable names for the functions?
* Any other comments?

Thanks,

--Beman
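To make the motivation in the opening message concrete, here is a minimal sketch of the round-trip problem and of the external format it describes. The names plain_round_trip, to_external and demo_external_format are hypothetical and are not taken from the prototype header; the sketch assumes char-based strings only.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Plain stream I/O: operator>> stops at the first whitespace character,
// so a string with an embedded space does not survive the round trip.
std::string plain_round_trip(const std::string& s)
{
    std::stringstream ss;
    ss << s;
    std::string result;
    ss >> result;
    return result;
}

// Hypothetical helper, not the prototype's API: build the external form
// (delimiters around the string, with the escape character placed before
// each embedded delimiter or escape character).
std::string to_external(const std::string& s,
                        char delim = '"', char escape = '\\')
{
    std::string out(1, delim);
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (s[i] == delim || s[i] == escape)
            out += escape;
        out += s[i];
    }
    out += delim;
    return out;
}

// Usage sketch.
void demo_external_format()
{
    assert(plain_round_trip("foo bar") == "foo");   // "bar" is lost
    assert(to_external("foo bar") == "\"foo bar\"");
    assert(to_external("foo\\bar, \" *") == "\"foo\\\\bar, \\\" *\"");
}
```

The last assertion corresponds to the output shown in the example: the backslash and the embedded double-quote each gain a preceding backslash, and the whole string is wrapped in double-quotes.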

Beman Dawes wrote:
The general solution presented here follows very common practice as to the format: strings are enclosed in delimiters on output and the delimiters (if present) are stripped on input. In the external format, an escape character precedes each occurrence of the escape character itself or of the delimiter character within the string. The default delimiter character is the double-quote (") and the default escape character is the backslash (\). These defaults can be overridden.
Example:
const std::string expected("foo\\bar, \" *");
std::cout << delimit(expected);
std::string actual; std::stringstream ss;
ss << delimit(expected); ss >> undelimit(actual);
assert(expected == actual);
The output to cout is "foo\\bar, \" *"
I like the concept. I think delimit() and undelimit() could be augmented with "delimited" manipulators:

  ss << delimited(expected);
  ss >> delimited(actual);

In the first statement, it means that "expected" should be delimited. In the second statement, it means that "actual" should be extracted from delimited input.
A prototype is available at <http://svn.boost.org/svn/boost/branches/filesystem3/boost/delimit_string.hpp>. It is a header only implementation.
Do you think this component, with suitable documentation, tests, etc., should become a part of Boost?
Yes.
What library should it be part of? Seems too small to be a library all of its own.
I immediately thought of Utility and String Algos. The latter is already categorized under "String and text processing," so it seems the likely place.
Where should the header live?
boost/algorithm/string/delimit.hpp
boost/algorithm/string/delimited.hpp
boost/algorithm/string/undelimit.hpp
What namespace should it be in?
boost::algorithm
Are delimit() and undelimit() suitable names for the functions?
I don't care much for "undelimit" but I can't think of anything better as "delimit" seems exactly suited to the other operation.
Any other comments?
Particularly if you accept my manipulator idea, then delimit() and undelimit() should be recast as algorithms in line with the others in Boost.StringAlgos. For example, it should be possible to get a delimited string from delimit(), it should be possible to do in-place manipulations, and it should be possible to use ranges of characters as inputs. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

On Wed, 16 Jun 2010 10:05:11 -0400, Stewart, Robert wrote:
Beman Dawes wrote:
Are delimit() and undelimit() suitable names for the functions?
I don't care much for "undelimit" but I can't think of anything better as "delimit" seems exactly suited to the other operation.
I'm not convinced by delimit. The dictionary definition is to do with marking a boundary so it applies to the ""-case but not \ and others. What about escape\unescape or encode\decode? The URL specifications call this type of operation encoding [1] but I think escaping is a more common and less ambiguous name.

[1] http://www.w3.org/Addressing/URL/url-spec.txt

Alex -- Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Alexander Lamaison wrote:
On Wed, 16 Jun 2010 10:05:11 -0400, Stewart, Robert wrote:
Beman Dawes wrote:
Are delimit() and undelimit() suitable names for the functions?
I don't care much for "undelimit" but I can't think of anything better as "delimit" seems exactly suited to the other operation.
I'm not convinced by delimit. The dictionary definition is to do with marking a boundary so it applies to the ""-case but not \ and others.
Granted, but the name isn't bad given that.
What about escape\unescape or encode\decode? The URL specifications call this type of operation encoding [1] but I think escaping is a more common and less ambiguous name.
You've done the reverse with "escape." Escaping only applies to the use of "\" to avoid special treatment of some characters within the string and not to the quotation marks at either end. IOW, we need a name that encompasses both delimiting and escaping. "Encode" doesn't fit well because it has to do with translating from one representation to another due to (computer) language issues. Furthermore, that term is tied irrevocably with Unicode and human language manipulations, which don't apply here. Is the problem that we have one function doing two tasks? Separate functions for delimiting and escaping (and their reverses) would alleviate the difficulty of finding suitable names for the conflated operations. That of course means that the separate functions must be efficiently composable. _____ Rob Stewart

On 6/16/2010 9:47 AM, Beman Dawes wrote:
std::string actual; std::stringstream ss;
ss << delimit(expected); ss >> undelimit(actual);
delimit doesn't feel right to me. I prefer quoted. It's the commonly used term in Perl, for instance. From dictionary.com:

  quote: ... 5) to enclose (words) within quotation marks.

And given the direction of the streaming, is there really a need for a separate quoted and unquoted manipulator? I'd like:

  ss << quoted(expected);
  ss >> quoted(actual);

-- Eric Niebler BoostPro Computing http://www.boostpro.com

Eric Niebler wrote:
delimit doesn't feel right to me. I prefer quoted. It's the commonly used term in Perl, for instance. From dictionary.com
quote: ... 5) to enclose (words) within quotation marks.
Quotation marks are the default, but the function can be given a different character to surround the text. Wouldn't "quote" be misleading if the specified character were, say, "%"? _____ Rob Stewart

On 6/16/2010 10:45 AM, Stewart, Robert wrote:
Eric Niebler wrote:
delimit doesn't feel right to me. I prefer quoted. It's the commonly used term in Perl, for instance. From dictionary.com
quote: ... 5) to enclose (words) within quotation marks.
Quotation marks are the default, but the function can be given a different character to surround the text. Wouldn't "quote" be misleading if the specified character were, say, "%"?
Perl, for instance, lets you specify the quote character. I don't find that confusing. If I went looking for this function, I'd look for some variant of "quote" first. To me, lists are delimited by e.g. commas. Even Beman, in his heart, thinks quote is the right term. Look at the subject line. ;-) -- Eric Niebler BoostPro Computing http://www.boostpro.com

Eric Niebler wrote:
On 6/16/2010 10:45 AM, Stewart, Robert wrote:
Eric Niebler wrote:
delimit doesn't feel right to me. I prefer quoted. It's the commonly used term in Perl, for instance. From dictionary.com
quote: ... 5) to enclose (words) within quotation marks.
Quotation marks are the default, but the function can be given a different character to surround the text. Wouldn't "quote" be misleading if the specified character were, say, "%"?
Perl, for instance, lets you specify the quote character. I don't find that confusing. If I went looking for this function, I'd look for some variant of "quote" first. To me, lists are delimited by e.g. commas.
I daresay you'd look for "quote" first because of your experience with Perl. Nevertheless, prior art is not unavailing.
Even Beman, in his heart, thinks quote is the right term. Look at the subject line. ;-)
:) Even assuming "quote" were generally acceptable, it doesn't connote the escaping behavior in Beman's functions. _____ Rob Stewart

----- Original Message ----- From: "Stewart, Robert" <Robert.Stewart@sig.com> To: <boost@lists.boost.org> Sent: Wednesday, June 16, 2010 5:56 PM Subject: Re: [boost] [RFC] string inserter/extractor "q u o t i n g"
Eric Niebler wrote:
On 6/16/2010 10:45 AM, Stewart, Robert wrote:
Eric Niebler wrote:
quote: ... 5) to enclose (words) within quotation marks.
Quotation marks are the default, but the function can be given a different character to surround the text. Wouldn't "quote" be misleading if the specified character were, say, "%"?
Perl, for instance, lets you specify the quote character. I don't find that confusing. If I went looking for this function, I'd look for some variant of "quote" first. To me, lists are delimited by e.g. commas.
I daresay you'd look for "quote" first because of your experience with Perl. Nevertheless, prior art is not unavailing.
Even assuming "quote" were generally acceptable, it doesn't connote the escaping behavior in Beman's functions.
Any time you "quote" something, you need an escape sequence to represent the quoting character and the escape character itself. +1 for quote. Vicente

To speed the process, I'll just give a progress report rather than respond individually to prior messages.

* I like quote and unquote as the names. Source changed.
* Since these functions are I/O manipulators, and only work in that context, I've renamed the header "quote_manip.hpp".
* Likewise, the library that makes the most sense to add the header to is io. So the include path will be <boost/io/quote_manip.hpp>. IO is a small library, and targets exactly the same area of the standard library.
* The code has been cleaned up and the templates more closely conform to standard library practice as regards template parameters.

See https://svn.boost.org/svn/boost/branches/filesystem3/boost/io/quote_manip.hp...

Comments?

--Beman

On 6/17/2010 3:38 PM, Beman Dawes wrote:
To speed the process, I'll just give a progress report rather than respond individually to prior messages.
* I like quote and unquote as the names. Source changed.
* Since these functions are I/O manipulators, and only work in that context, I've renamed the header "quote_manip.hpp".
* Likewise, the library that makes the most sense to add the header to is io. So the include path will be <boost/io/quote_manip.hpp>. IO is a small library, and targets exactly the same area of the standard library.
* The code has been cleaned up and the templates more closely conform to standard library practice as regards template parameters.
See https://svn.boost.org/svn/boost/branches/filesystem3/boost/io/quote_manip.hp...
Comments?
I assume you mean for the insertion and extraction operators to take the proxy by const reference. It shouldn't compile as-is. I still see no reason for separate "quote" and "unquote" functions. There could just be one, called "quote" (or I like "quoted" as in: read in a quoted string, write out a quoted string). It returns an object that has both insertion and extraction operators. Const-correctness and the C-string variant can be handled by passing the string type as a template parameter: quote_proxy<string &> vs. quote_proxy<string const &> vs. quote_proxy<char const *>. -- Eric Niebler BoostPro Computing http://www.boostpro.com
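The single-function design suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: quote_proxy and quoted are the names used in the message, but the member layout, the overload set, and the extraction logic are guesses rather than the contents of the branch header, and the char const* variant is omitted. Both operators take the proxy by const reference so that the temporary returned by quoted() can bind.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical proxy: String is std::string& or std::string const&.
template <class String>
struct quote_proxy
{
    String str;
    char delim;
    char escape;
    quote_proxy(String s, char d, char e) : str(s), delim(d), escape(e) {}
};

// Insertion works for both the const and the non-const proxy.
template <class String>
std::ostream& operator<<(std::ostream& os, const quote_proxy<String>& p)
{
    os << p.delim;
    for (std::string::size_type i = 0; i < p.str.size(); ++i) {
        if (p.str[i] == p.delim || p.str[i] == p.escape)
            os << p.escape;
        os << p.str[i];
    }
    return os << p.delim;
}

// Extraction exists only for the non-const proxy. The proxy itself is
// const, but its reference member still refers to a modifiable string.
inline std::istream& operator>>(std::istream& is,
                                const quote_proxy<std::string&>& p)
{
    p.str.clear();
    char c;
    if (!(is >> c))              // formatted read; skips leading whitespace
        return is;
    if (c != p.delim) {          // no delimiter: fall back to plain extraction
        is.unget();
        return is >> p.str;
    }
    while (is.get(c)) {          // unformatted reads keep embedded whitespace
        if (c == p.escape) {
            if (!is.get(c))      // escape at end of input: stream now failed
                break;
        } else if (c == p.delim) {
            return is;           // closing delimiter found
        }
        p.str += c;
    }
    return is;                   // no closing delimiter: failbit is set
}

inline quote_proxy<std::string const&>
quoted(const std::string& s, char delim = '"', char escape = '\\')
{
    return quote_proxy<std::string const&>(s, delim, escape);
}

inline quote_proxy<std::string&>
quoted(std::string& s, char delim = '"', char escape = '\\')
{
    return quote_proxy<std::string&>(s, delim, escape);
}

// Usage sketch: one name for both directions.
void demo_quoted()
{
    const std::string expected("foo\\bar, \" *");
    std::stringstream ss;
    ss << quoted(expected);
    std::string actual;
    ss >> quoted(actual);
    assert(expected == actual);
}
```

With this shape, const-correctness falls out of the template argument: attempting to extract into quoted(expected) for a const string finds no matching operator>> and fails to compile.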

On 6/17/2010 4:44 PM, Eric Niebler wrote:
On 6/17/2010 3:38 PM, Beman Dawes wrote:
To speed the process, I'll just give a progress report rather than respond individually to prior messages. <snip>
* Since these functions are I/O manipulators, and only work in that context, I've renamed the header "quote_manip.hpp".
* Likewise, the library that makes the most sense to add the header to is io. So the include path will be <boost/io/quote_manip.hpp>. IO is a small library, and targets exactly the same area of the standard library.
Oh, and you probably should only be including iosfwd, not istream and ostream. -- Eric Niebler BoostPro Computing http://www.boostpro.com

On Thu, Jun 17, 2010 at 4:54 PM, Eric Niebler <eric@boostpro.com> wrote:
On 6/17/2010 4:44 PM, Eric Niebler wrote:
On 6/17/2010 3:38 PM, Beman Dawes wrote:
To speed the process, I'll just give a progress report rather than respond individually to prior messages. <snip>
* Since these functions are I/O manipulators, and only work in that context, I've renamed the header "quote_manip.hpp".
* Likewise, the library that makes the most sense to add the header to is io. So the include path will be <boost/io/quote_manip.hpp>. IO is a small library, and targets exactly the same area of the standard library.
Oh, and you probably should only be including iosfwd, not istream and ostream.
Thanks! I'm applying all of your suggested changes now. --Beman

Eric Niebler wrote:
On 6/17/2010 3:38 PM, Beman Dawes wrote:
See
<https://svn.boost.org/svn/boost/branches/filesystem3/boost/io/quote_manip.hpp>
Comments?
I still see no reason for separate "quote" and "unquote" functions. There could just be one, called "quote" (or I like "quoted" as in: read in a quoted string, write out a quoted string). It returns an object that has both insertion and extraction operators.
+1 _____ Rob Stewart

Beman Dawes wrote:
To speed the process, I'll just give a progress report rather than respond individually to prior messages.
Apparently, by virtue of not mentioning my suggestion in this omnibus reply, you rejected my idea of a string algorithm for doing the quoting and unquoting which forms the foundation of the insertion and extraction manipulators. I still think that is the best approach as it doesn't force the use of std::stringstream to get a quoted or unquoted string from an existing string while still supporting the IOStream insertion and extraction needed by Filesystem.
* I like quote and unquote as the names. Source changed.
Given the name change, my suggestion is for "quoted" as the manipulator -- for both insertion and extraction -- and "quote" and "unquote" for the algorithms.
* Since these functions are I/O manipulators, and only work in that context, I've renamed the header "quote_manip.hpp".
* Likewise, the library that makes the most sense to add the header to is io. So the include path will be <boost/io/quote_manip.hpp>. IO is a small library, and targets exactly the same area of the standard library.
That's reasonable for the manipulators, but the algorithms on which they'd be based, given my suggestion, should be in StringAlgos.
See <https://svn.boost.org/svn/boost/branches/filesystem3/boost/io/quote_manip.hpp>
I get a 404 error trying to access that file. _____ Rob Stewart

On 6/18/2010 6:38 AM, Stewart, Robert wrote:
Beman Dawes wrote:
To speed the process, I'll just give a progress report rather than respond individually to prior messages.
Apparently, by virtue of not mentioning my suggestion in this omnibus reply, you rejected my idea of a string algorithm for doing the quoting and unquoting which forms the foundation of the insertion and extraction manipulators. I still think that is the best approach as it doesn't force the use of std::stringstream to get a quoted or unquoted string from an existing string while still supporting the IOStream insertion and extraction needed by Filesystem.
+1
* I like quote and unquote as the names. Source changed.
Given the name change, my suggestion is for "quoted" as the manipulator -- for both insertion and extraction -- and "quote" and "unquote" for the algorithms.
+1
* Since these functions are I/O manipulators, and only work in that context, I've renamed the header "quote_manip.hpp".
* Likewise, the library that makes the most sense to add the header to is io. So the include path will be <boost/io/quote_manip.hpp>. IO is a small library, and targets exactly the same area of the standard library.
That's reasonable for the manipulators, but the algorithms on which they'd be based, given my suggestion, should be in StringAlgos.
Yes. And once you replace <istream> and <ostream> with <iosfwd> you'll need to add <ios> for std::noskipws. And, assuming the string algorithms take an output iterator, you'll need <iterator> too for std::[io]streambuf_iterator. -- Eric Niebler BoostPro Computing http://www.boostpro.com
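As a rough illustration of the split discussed above, the following sketch shows a quoting algorithm that writes through an output iterator, so the same code can fill a string (no stream needed) or write straight into a stream buffer via std::ostreambuf_iterator. The signature of quote() and the helper names quote_copy and write_quoted are assumptions made for this example; they are not taken from any posted header.

```cpp
#include <cassert>
#include <iterator>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical algorithm: copy [first, last) to out in quoted form.
template <class InputIt, class OutputIt>
OutputIt quote(InputIt first, InputIt last, OutputIt out,
               char delim = '"', char escape = '\\')
{
    *out++ = delim;
    for (; first != last; ++first) {
        if (*first == delim || *first == escape)
            *out++ = escape;     // escape embedded delimiters and escapes
        *out++ = *first;
    }
    *out++ = delim;
    return out;
}

// String-to-string use: no std::stringstream involved.
std::string quote_copy(const std::string& s)
{
    std::string result;
    quote(s.begin(), s.end(), std::back_inserter(result));
    return result;
}

// Stream use: the same algorithm writes into the stream's buffer,
// which is where <iterator> comes in.
void write_quoted(std::ostream& os, const std::string& s)
{
    quote(s.begin(), s.end(), std::ostreambuf_iterator<char>(os));
}

// Usage sketch.
void demo_quote_algorithm()
{
    assert(quote_copy("a \"b\"") == "\"a \\\"b\\\"\"");
    std::ostringstream os;
    write_quoted(os, "C:\\Program Files");
    assert(os.str() == "\"C:\\\\Program Files\"");
}
```

A manipulator's insertion operator could then be a thin wrapper around write_quoted, which keeps the stream-specific code in IO and the algorithm elsewhere.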

----- Original Message ----- From: "Eric Niebler" <eric@boostpro.com> To: <boost@lists.boost.org> Sent: Friday, June 18, 2010 3:32 PM Subject: Re: [boost] [RFC] string inserter/extractor "q u o t i n g"
On 6/18/2010 6:38 AM, Stewart, Robert wrote:
Beman Dawes wrote:
To speed the process, I'll just give a progress report rather than respond individually to prior messages.
Apparently, by virtue of not mentioning my suggestion in this omnibus reply, you rejected my idea of a string algorithm for doing the quoting and unquoting which forms the foundation of the insertion and extraction manipulators. I still think that is the best approach as it doesn't force the use of std::stringstream to get a quoted or unquoted string from an existing string while still supporting the IOStream insertion and extraction needed by Filesystem.
+1
+1
* I like quote and unquote as the names. Source changed.
Given the name change, my suggestion is for "quoted" as the manipulator -- for both insertion and extraction -- and "quote" and "unquote" for the algorithms.
+1
+1
* Since these functions are I/O manipulators, and only work in that context, I've renamed the header "quote_manip.hpp".
* Likewise, the library that makes the most sense to add the header to is io. So the include path will be <boost/io/quote_manip.hpp>. IO is a small library, and targets exactly the same area of the standard library.
That's reasonable for the manipulators, but the algorithms on which they'd be based, given my suggestion, should be in StringAlgos.
+1 Vicente

On Fri, Jun 18, 2010 at 6:38 AM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Beman Dawes wrote:
To speed the process, I'll just give a progress report rather than respond individually to prior messages.
Apparently, by virtue of not mentioning my suggestion in this omnibus reply, you rejected my idea of a string algorithm for doing the quoting and unquoting which forms the foundation of the insertion and extraction manipulators. I still think that is the best approach as it doesn't force the use of std::stringstream to get a quoted or unquoted string from an existing string while still supporting the IOStream insertion and extraction needed by Filesystem.
In principle, I agree with you. In practice, I don't really want to take the time to develop the algorithms, tests, documentation, etc. Perhaps someone else could take that on.
* I like quote and unquote as the names. Source changed.
Given the name change, my suggestion is for "quoted" as the manipulator -- for both insertion and extraction -- and "quote" and "unquote" for the algorithms.
I've already changed the manipulator name to "quoted". See https://svn.boost.org/svn/boost/branches/filesystem3/boost/io/quoted_manip.h...
* Since these functions are I/O manipulators, and only work in that context, I've renamed the header "quote_manip.hpp".
* Likewise, the library that makes the most sense to add the header to is io. So the include path will be <boost/io/quote_manip.hpp>. IO is a small library, and targets exactly the same area of the standard library.
That's reasonable for the manipulators, but the algorithms on which they'd be based, given my suggestion, should be in StringAlgos.
Agreed. --Beman

On 6/18/2010 1:10 PM, Beman Dawes wrote:
On Fri, Jun 18, 2010 at 6:38 AM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Beman Dawes wrote:
To speed the process, I'll just give a progress report rather than respond individually to prior messages.
Apparently, by virtue of not mentioning my suggestion in this omnibus reply, you rejected my idea of a string algorithm for doing the quoting and unquoting which forms the foundation of the insertion and extraction manipulators. I still think that is the best approach as it doesn't force the use of std::stringstream to get a quoted or unquoted string from an existing string while still supporting the IOStream insertion and extraction needed by Filesystem.
In principle, I agree with you. In practice, I don't really want to take the time to develop the algorithms, tests, documentation, etc. Perhaps someone else could take that on.
Well, then. I've done a fair bit of it. I've attached quote.hpp and unquote.hpp, plus a simple test program. I haven't written any documentation, and the test isn't up to snuff, but it's enough to prove the algorithms. There may be room to improve unquote()'s logic; I didn't spend a lot of time on it.

My intention is that unquote() should handle a string with multiple quoted substrings rather than just assuming that the entire string is quoted. It is certainly reasonable to think that it should only handle whole strings. In that case, it would be easy to identify malformed strings on input: either it starts and ends with the delimiter, and all other occurrences are escaped, or it is malformed.

I quickly chose to throw std::logic_error from unquote() when it fails to find a closing delimiter. There may well be a better approach; feel free to suggest alternatives.

Have a look at the code and let me know what you think.

___ Rob

On 23 June 2010 03:49, Rob Stewart <robertstewart@comcast.net> wrote:
My intention is that unquote() should handle a string with multiple quoted substrings rather than just assuming that the entire string is quoted. It is certainly reasonable to think that it should only handle whole strings. In that case, it would be easy to identify malformed strings on input: either it starts and ends with the delimiter, and all other occurrences are escaped, or it is malformed.
I don't think that's right, since multiple quoted substrings normally means multiple values (apart from in C family languages, but then text outside the quotes is interpreted differently). Looking at your code, the escape should work outside of the quotes and you don't check for the end of the string after an escape.
I quickly chose to throw std::logic_error from unquote() when it fails to find a closing delimiter. There may well be a better approach; feel free to suggest alternatives.
You should inherit from std::runtime_error, since the failure would usually be caused by bad input rather than programmer error.

To be honest, I don't see the value of this, as it is the kind of thing which is handled well in other ways (e.g. using a parser or lexer generator, or a standard data format such as XML, JSON etc.). There tend to be odd differences in quoting, encoding and escaping styles, making a generic function awkward. It's not as specific as a filename extractor and not as generic as a parser, and it's not clear why there's a need for something in between.

Daniel
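The two review points in this message (an exception derived from std::runtime_error, and a check for the end of the string after an escape) can be sketched for the whole-string case as follows. This is not the attached unquote.hpp, which is not reproduced in the thread; unquote_error, the function body, and the choice to reject trailing text are all assumptions made for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical exception type: malformed input is a runtime condition,
// so it derives from std::runtime_error rather than std::logic_error.
struct unquote_error : std::runtime_error
{
    explicit unquote_error(const std::string& what)
        : std::runtime_error(what) {}
};

// Strict whole-string variant: the input must start and end with delim,
// and every delimiter in between must be escaped.
std::string unquote(const std::string& in,
                    char delim = '"', char escape = '\\')
{
    if (in.size() < 2 || in[0] != delim)
        throw unquote_error("input is not a delimited string");
    std::string out;
    std::string::size_type i = 1;
    for (; i < in.size(); ++i) {
        if (in[i] == escape) {
            if (++i == in.size())        // end-of-string check after escape
                throw unquote_error("escape character at end of input");
            out += in[i];                // escaped character taken literally
        } else if (in[i] == delim) {
            break;                       // unescaped delimiter closes the string
        } else {
            out += in[i];
        }
    }
    if (i == in.size())
        throw unquote_error("missing closing delimiter");
    if (i + 1 != in.size())
        throw unquote_error("text after closing delimiter");
    return out;
}

// Usage sketch.
void demo_unquote()
{
    assert(unquote("\"foo\\\\bar, \\\" *\"") == "foo\\bar, \" *");
    bool threw = false;
    try {
        unquote("\"unterminated");
    } catch (const std::runtime_error&) {
        threw = true;
    }
    assert(threw);
}
```

Catching std::runtime_error in the caller is enough to handle every malformed-input case in this sketch, which is the practical benefit of the suggested base class.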

Daniel James wrote:
On 23 June 2010 03:49, Rob Stewart <robertstewart@comcast.net> wrote:
My intention is that unquote() should handle a string with multiple quoted substrings rather than just assuming that the entire string is quoted. It is certainly reasonable to think that it should only handle whole strings. In that case, it would be easy to identify malformed strings on input: either it starts and ends with the delimiter, and all other occurrences are escaped, or it is malformed.
I don't think that's right,
I take it you mean the current implementation isn't right and not the alternative I noted, or do you mean neither is right?
since multiple quoted substrings normally means multiple values (apart from in C family languages, but then text outside the quotes is interpreted differently).
The point of unquoting a string is to remove the quotation marks and escape characters, not to split a string into parts.
Looking at your code, the escape should work outside of the quotes and you don't check for the end of the string after an escape.
Thanks.
I quickly chose to throw std::logic_error from unquote() when it fails to find a closing delimiter. There may well be a better approach; feel free to suggest alternatives.
You should inherit from std::runtime_error, since the failure would usually be caused by bad input rather than programmer error.
Reasonable.
To be honest, I don't see the value of this, as it is the kind of thing which is handled well in other ways (e.g. using a parser or lexer generator, or a standard data format such as XML, JSON etc.). There tend to be odd differences in quoting, encoding and escaping styles, making a generic function awkward. It's not as specific as a filename extractor and not as generic as a parser, and it's not clear why there's a need for something in between.
Those other approaches are heavier than these algorithms, which can serve simple cases quite well. If you'd care to enumerate the special cases to which you allude, we can consider how best to address them, if support is warranted. _____ Rob Stewart

On 23 June 2010 11:51, Stewart, Robert <Robert.Stewart@sig.com> wrote:
To be honest, I don't see the value of this, as this is the kind of thing that is handled well in other ways (e.g. using a parser or lexer generator, or a standard data format such as XML, JSON etc.). There tend to be odd differences in quoting, encoding and escaping styles, making a generic function awkward. It's not as specific as a filename extractor and not as generic as a parser, and it's not clear why there's a need for something in between.
Those other approaches are heavier than these algorithms
You often need to use some kind of parser just to get the quoted string in the first place.
which can serve simple cases quite well.
What are these simple cases? I could see the use for something which reads and decodes a 'token' following something like the shell grammar and sets the iterator to the end of the token. But that's quite a specific and more complicated grammar, rather than an attempt at a simple general one.
If you'd care to enumerate the special cases to which you allude, we can consider how best to address them, if support is warranted.
Some examples are: supporting multiple delimiter characters (e.g. supporting both 'x' and "x"), delimiters made up of multiple characters (e.g., """x"""), delimiter pairs (e.g. {x}), meaningful escapes (e.g. '\n' meaning newline), whether newlines are allowed between quotes or if they should end the quoted string, how multiple quoted strings are treated (e.g. in C whitespace separated quoted strings are concatenated, in your algorithm the space between them is included), whether the parsing should be strict or loose and if it is loose, how should it recover from errors. Daniel

On 6/23/2010 2:51 PM, Daniel James wrote:
On 23 June 2010 11:51, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Those other approaches are heavier than these algorithms
You often need to use some kind of parser just to get the quoted string in the first place.
which can serve simple cases quite well.
What are these simple cases?
CSV fields, pathnames, log messages.
If you'd care to enumerate the special cases to which you allude, we can consider how best to address them, if support is warranted.
Some examples are: supporting multiple delimiter characters (e.g. supporting both 'x' and "x"), delimiters made up of multiple characters (e.g., """x"""), delimiter pairs (e.g. {x}), meaningful escapes (e.g. '\n' meaning newline), whether newlines are allowed between quotes or if they should end the quoted string, how multiple quoted strings are treated (e.g. in C whitespace separated quoted strings are concatenated, in your algorithm the space between them is included), whether the parsing should be strict or loose and if it is loose, how should it recover from errors.
Those are definitely cases that I didn't intend this algorithm to cover except, perhaps, multiple delimiter characters and paired delimiters, which I hadn't considered. Semantic meaning is definitely domain specific as is the treatment of multiple delimited substrings. In the latter case, while simply removing the internal delimiters is legitimate, so is just handling first and last characters and ignoring delimiters in the rest. When considered as the inverse of quote(), unquote() should simply strip leading and trailing delimiters and look for escaped delimiters and escaped escape characters within. To supply the extra semantics you've suggested, quote() must also be enhanced significantly. ___ Rob

On 24 June 2010 02:59, Rob Stewart <robertstewart@comcast.net> wrote:
On 6/23/2010 2:51 PM, Daniel James wrote:
What are these simple cases?
CSV fields, pathnames, log messages.
There's a case I missed. CSV often escapes a double quote by doubling it rather than by using a separate escape character. You can see that in section 2, point 7 of this attempt to standardise the format: http://www.rfc-editor.org/rfc/rfc4180.txt
To supply the extra semantics you've suggested, quote() must also be enhanced significantly.
I really wasn't suggesting that you support them. Daniel
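To make the CSV case concrete, here is a minimal sketch of the doubled-quote convention from RFC 4180, section 2, point 7. The names csv_quote() and csv_unquote() are hypothetical and are not part of any proposal in this thread; the sketch assumes C++11 and a field that is either fully enclosed in double quotes or not quoted at all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encloses a field in double quotes and doubles any
// embedded double quote, as RFC 4180 describes.
std::string csv_quote(const std::string& field)
{
    std::string out(1, '"');
    for (char c : field) {
        if (c == '"')
            out += '"';   // the quote character escapes itself
        out += c;
    }
    out += '"';
    return out;
}

// Hypothetical inverse: strips the enclosing quotes and collapses each
// doubled quote. Input that is not enclosed in quotes is returned unchanged.
std::string csv_unquote(const std::string& s)
{
    if (s.size() < 2 || s.front() != '"' || s.back() != '"')
        return s;
    std::string out;
    for (std::size_t i = 1; i + 1 < s.size(); ++i) {
        out += s[i];
        if (s[i] == '"' && i + 2 < s.size() && s[i + 1] == '"')
            ++i;          // skip the second quote of a doubled pair
    }
    return out;
}

// Usage: a field containing quotes survives a round trip.
void csv_round_trip_demo()
{
    const std::string field = "say \"hi\"";
    assert(csv_unquote(csv_quote(field)) == field);
}
```

Note that no backslash is involved, which is why a single "delimiter plus escape character" interface does not map directly onto CSV unless the escape character may equal the delimiter.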

On 6/23/2010 9:59 PM, Rob Stewart wrote:
When considered as the inverse of quote(), unquote() should simply strip leading and trailing delimiters and look for escaped delimiters and escaped escape characters within. To supply the extra semantics you've suggested, quote() must also be enhanced significantly.
I've attached a new version of unquote() (along with quote() and the updated test program) with that behavior. As implemented, unquote() stops at the first delimiter it finds after the first character (whether the first is a delimiter or not). Until then, it "unescapes" escaped characters. Note the last test case:
    a = "embedded ";
    b = quote(a);
    b += "quotation mark\"";
    BOOST_ASSERT("\"embedded \"quotation mark\"" == b);
    b = unquote(b);
    BOOST_ASSERT(a == b);
In that test, "quotation mark\"" is ignored by unquote() because it stops writing to the output iterator upon finding the quotation mark after the space following "embedded". It is possible for unquote() to leave the embedded quotation mark, as if it had been escaped, but it would complicate the algorithm -- unless I assume random access iterators -- and its description.
This version doesn't care whether the first or last characters are delimiters, thus "muddling through" as Eric suggested. I haven't taken the time to consider other reasonable ideas for delimiter support such as multiple character delimiters or distinct start and end delimiters.
___ Rob
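The attachment itself is not shown here, so the following is only a sketch of the behavior described above, written against std::string rather than iterators and assuming C++11. The signatures and default arguments are assumptions for illustration, not Rob's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch only: wraps s in delim and puts escape before every embedded
// delimiter or escape character.
std::string quote(const std::string& s, char delim = '"', char escape = '\\')
{
    std::string out(1, delim);
    for (char c : s) {
        if (c == delim || c == escape)
            out += escape;
        out += c;
    }
    out += delim;
    return out;
}

// Sketch only: drops a leading delimiter if there is one, unescapes escaped
// characters, and stops at the first unescaped delimiter after the first
// character. Missing delimiters are tolerated rather than reported.
std::string unquote(const std::string& s, char delim = '"', char escape = '\\')
{
    std::string out;
    std::size_t i = 0;
    if (!s.empty() && s[0] == delim)
        ++i;                      // drop the opening delimiter
    for (; i < s.size(); ++i) {
        const char c = s[i];
        if (c == escape && i + 1 < s.size())
            out += s[++i];        // keep the escaped character verbatim
        else if (c == delim)
            break;                // first unescaped delimiter ends the string
        else
            out += c;
    }
    return out;
}

// Usage: the "embedded quotation mark" case from the message above.
void embedded_quote_demo()
{
    const std::string a = "embedded ";
    std::string b = quote(a);
    b += "quotation mark\"";
    assert(unquote(b) == a);      // everything after the second quote is ignored
}
```

Because the loop never looks ahead past the current character, the same logic would work with input iterators, which is the constraint mentioned in the message.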

On 6/30/2010 6:12 PM, Rob Stewart wrote:
On 6/23/2010 9:59 PM, Rob Stewart wrote:
When considered as the inverse of quote(), unquote() should simply strip leading and trailing delimiters and look for escaped delimiters and escaped escape characters within. To supply the extra semantics you've suggested, quote() must also be enhanced significantly.
I've attached a new version of unquote() (along with quote() and the updated test program) with that behavior.
This version adds separate start and end delimiters. ___ Rob
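As a rough illustration of what separate start and end delimiters could look like (for example {x}), here is a sketch with hypothetical names quote_pair() and unquote_pair(). It assumes C++11 and chooses to escape the opening delimiter, the closing delimiter, and the escape character; the attached version may make different choices.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical: encloses s between open and close, escaping both delimiters
// and the escape character wherever they occur inside s.
std::string quote_pair(const std::string& s, char open, char close,
                       char escape = '\\')
{
    std::string out(1, open);
    for (char c : s) {
        if (c == open || c == close || c == escape)
            out += escape;
        out += c;
    }
    out += close;
    return out;
}

// Hypothetical inverse: drops a leading open delimiter if present and stops
// at the first unescaped close delimiter.
std::string unquote_pair(const std::string& s, char open, char close,
                         char escape = '\\')
{
    std::string out;
    std::size_t i = (!s.empty() && s[0] == open) ? 1 : 0;
    for (; i < s.size(); ++i) {
        if (s[i] == escape && i + 1 < s.size())
            out += s[++i];        // escaped character, copied verbatim
        else if (s[i] == close)
            break;                // unescaped closing delimiter
        else
            out += s[i];
    }
    return out;
}

// Usage: braces inside the payload survive a round trip.
void brace_round_trip_demo()
{
    const std::string text = "a{b}c";
    assert(unquote_pair(quote_pair(text, '{', '}'), '{', '}') == text);
}
```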

On 6/23/2010 6:12 AM, Daniel James wrote:
On 23 June 2010 03:49, Rob Stewart <robertstewart@comcast.net> wrote:
I quickly chose to throw std::logic_error from unquoted() when it fails to find a closing delimiter. There may well be a better approach; feel free to suggest alternatives.
You should inherit from std::runtime_error, since that would usually be caused by bad input rather than programmer error.
Bad input can be a programmer error too. Here's how you decide:
- Decide if passing a malformed string is a violation of unquote's preconditions.
- If it is, throw something derived from logic_error.
- If not, handle it gracefully or, if you cannot, throw something derived from runtime_error.
Users can (in theory) check that the string is well-formed before calling unquote, but to do so they would essentially have to implement unquote themselves. So making it a precondition that the input is well-formed seems onerous in this case. I would not make it a precondition; instead, I would accept it as valid input and document that.
For valid input, I would prefer if unquote found a way to muddle through and do something reasonable. Throwing an exception when a trailing quote is missing seems like smacking someone's hand when they forgot to say "mother may I?" Why not just accept it? And document that fact! If the string is malformed in a way that you really can't muddle on, then throw something derived from runtime_error.
To be honest, I don't see the value of this, as this is the kind of thing that is handled well in other ways (e.g. using a parser or lexer generator, or a standard data format such as XML, JSON etc.). There tend to be odd differences in quoting, encoding and escaping styles, making a generic function awkward. It's not as specific as a filename extractor and not as generic as a parser, and it's not clear why there's a need for something in between.
Parser and lexer generators are complicated beasts, and I wouldn't send a novice programmer in that direction if it could be helped. Most quoting/unquoting is very straightforward. There's a quote character and an escape character. For anything more complicated, yes, build your own. -- Eric Niebler BoostPro Computing http://www.boostpro.com

On 6/23/2010 11:01 AM, Eric Niebler wrote:
On 6/23/2010 6:12 AM, Daniel James wrote:
On 23 June 2010 03:49, Rob Stewart <robertstewart@comcast.net> wrote:
I quickly chose to throw std::logic_error from unquoted() when it fails to find a closing delimiter. There may well be a better approach; feel free to suggest alternatives.
You should inherit from std::runtime_error, since that would usually be caused by bad input rather than programmer error.
Bad input can be a programmer error too. Here's how you decide:
- Decide if passing a malformed string is a violation of unquote's preconditions.
- If it is, throw something derived from logic_error.
I mean, assert, not throw. Sigh, bitten by C++'s weird exception hierarchy again. logic_error is useless because it derives from std::exception, which people catch all over the place. A precondition violation is not recoverable.
- If not, handle it gracefully or, if you cannot, throw something derived from runtime_error.
Users can (in theory) check that the string is well-formed before calling unquote, but to do so they would essentially have to implement unquote themselves. So making it a precondition that the input is well-formed seems onerous in this case. I would not make it a precondition; instead, I would accept it as valid input and document that.
For valid input, I would prefer if unquote found a way to muddle through and do something reasonable. Throwing an exception when a trailing quote is missing seems like smacking someone's hand when they forgot to say "mother may I?" Why not just accept it? And document that fact! If the string is malformed in a way that you really can't muddle on, then throw something derived from runtime_error.
-- Eric Niebler BoostPro Computing http://www.boostpro.com

On 6/23/2010 11:01 AM, Eric Niebler wrote:
For valid input, I would prefer if unquote found a way to muddle through and do something reasonable. Throwing an exception when a trailing quote is missing seems like smacking someone's hand when they forgot to say "mother may I?" Why not just accept it? And document that fact! If the string is malformed in a way that you really can't muddle on, then throw something derived from runtime_error.
Or, better yet, the algorithm could return std::pair<OutIt, error_condition>, where OutIt is as far as the algorithm got, and error_condition is an enum describing the error if any. -- Eric Niebler BoostPro Computing http://www.boostpro.com
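A sketch of what such an interface might look like follows. The enum is named unquote_error here only to avoid confusion with std::error_condition; its values, the function name unquote_copy(), and the default arguments are all assumptions, since the thread does not spell them out.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <utility>

// Hypothetical error codes; the thread does not specify the set.
enum unquote_error {
    unquote_ok,
    unquote_missing_close,     // input ended before a closing delimiter
    unquote_dangling_escape    // input ended right after an escape character
};

// Copies the unquoted form of [first, last) to out. Returns how far the
// output got, paired with a code describing the first problem, if any.
template <typename InIt, typename OutIt>
std::pair<OutIt, unquote_error>
unquote_copy(InIt first, InIt last, OutIt out,
             char delim = '"', char escape = '\\')
{
    if (first != last && *first == delim)
        ++first;                                   // drop the opening delimiter
    for (; first != last; ++first) {
        if (*first == escape) {
            if (++first == last)
                return std::make_pair(out, unquote_dangling_escape);
            *out++ = *first;                       // escaped character, verbatim
        } else if (*first == delim) {
            return std::make_pair(out, unquote_ok);  // closing delimiter found
        } else {
            *out++ = *first;
        }
    }
    return std::make_pair(out, unquote_missing_close);
}

// Usage: the caller still gets the partial output when the input is malformed.
void missing_close_demo()
{
    const std::string in = "\"abc";
    std::string out;
    const unquote_error e =
        unquote_copy(in.begin(), in.end(), std::back_inserter(out)).second;
    assert(out == "abc" && e == unquote_missing_close);
}
```

With this shape the algorithm still muddles through, and the caller decides whether a missing closing delimiter matters.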

On 6/23/2010 11:13 AM, Eric Niebler wrote:
On 6/23/2010 11:01 AM, Eric Niebler wrote:
For valid input, I would prefer if unquote found a way to muddle through and do something reasonable. Throwing an exception when a trailing quote is missing seems like smacking someone's hand when they forgot to say "mother may I?" Why not just accept it? And document that fact! If the string is malformed in a way that you really can't muddle on, then throw something derived from runtime_error.
I agree with muddling through. In a previous post, I recalled that unquote() should be the inverse of quote(). Therefore, it should simply remove the delimiter(s) from the beginning and end and unescape escaped characters within.
Or, better yet, the algorithm could return std::pair<OutIt, error_condition>, where OutIt is as far as the algorithm got, and error_condition is an enum describing the error if any.
That's not an unreasonable interface, but given the simplified behavior outlined above, I don't think I need to go that far. unquote() should just remove the starting and ending delimiter(s), if found, and unescape escaped characters from the middle. ___ Rob

On Wednesday, June 23, 2010 7:02 PM, Eric Niebler wrote:
Parser and lexer generators are complicated beasts, and I wouldn't send a novice programmer in that direction if it could be helped. Most quoting/unquoting is very straightforward. There's a quote character and an escape character. For anything more complicated, yes, build your own.
Do you think uri escaping (reserved chars are replaced with %hexdigit hexdigit) should be supported as well? BR, Dmitry

On 6/23/2010 11:30 AM, Dmitry Goncharov wrote:
From Eric Niebler, Wednesday, June 23, 2010 7:02 PM
Parser and lexer generators are complicated beasts, and I wouldn't send a novice programmer in that direction if it could be helped. Most quoting/unquoting is very straightforward. There's a quote character and an escape character. For anything more complicated, yes, build your own.
Do you think uri escaping (reserved chars are replaced with %hexdigit hexdigit) should be supported as well?
By this utility? No. By another string algorithm? Yeah, maybe. Could be useful for both filesystem and asio. What's your use case? -- Eric Niebler BoostPro Computing http://www.boostpro.com

On Wednesday, June 23, 2010 7:49 PM, Eric Niebler wrote:
Do you think uri escaping (reserved chars are replaced with %hexdigit hexdigit) should be supported as well?
By this utility? No. By another string algorithm? Yeah, maybe. Could be useful for both filesystem and asio.
I am not sure how this can be useful for filesystem.
What's your use case?
URI parsing. I found that it was easier to implement this unescaping with a function than with a grammar parser. I am attaching an implementation of uri_unescape().
BR, Dmitry
    #include <cctype>   // std::isxdigit
    #include <cstddef>  // std::size_t
    #include <cstdlib>  // std::strtoul
    #include <string>

    // Decodes %XX sequences. Returns an empty string if a '%' is not
    // followed by two hexadecimal digits.
    std::string uri_unescape(std::string const& s)
    {
        std::string result;
        result.reserve(s.size());
        std::size_t cur = 0;
        std::size_t prev = 0;
        std::size_t const npos = std::string::npos;
        while ((cur = s.find('%', prev)) != npos) {
            // Cast to unsigned char: isxdigit is undefined for negative values.
            if (s.size() < cur + 3
                || !std::isxdigit(static_cast<unsigned char>(s[cur + 1]))
                || !std::isxdigit(static_cast<unsigned char>(s[cur + 2])))
                return std::string();
            result += s.substr(prev, cur - prev);  // copy the text before the '%'
            char b[3];
            b[0] = s[cur + 1];
            b[1] = s[cur + 2];
            b[2] = 0;
            char const c = static_cast<char>(std::strtoul(b, 0, 16));
            result += c;
            prev = cur + 3;
        }
        result += s.substr(prev);                  // copy the remaining tail
        return result;
    }

On 23 June 2010 16:01, Eric Niebler <eric@boostpro.com> wrote:
On 6/23/2010 6:12 AM, Daniel James wrote:
You should inherit from std::runtime_error, since that would usually be caused by bad input rather than programmer error.
Bad input can be a programmer error too.
I meant input from an external source, which would almost always be the case for this function. Any such function must be able to cope with any possible string.
To be honest, I don't see the value of this, as this is the kind of thing that is handled well in other ways (e.g. using a parser or lexer generator, or a standard data format such as XML, JSON etc.). There tend to be odd differences in quoting, encoding and escaping styles, making a generic function awkward. It's not as specific as a filename extractor and not as generic as a parser, and it's not clear why there's a need for something in between.
Parser and lexer generators are complicated beasts, and I wouldn't send a novice programmer in that direction if it could be helped.
"...or a standard data format such as XML, JSON etc."
Most quoting/unquoting is very straightforward. There's a quote character and an escape character. For anything more complicated, yes, build your own.
I don't think it's ever been that simple in my experience. Daniel

At Wed, 23 Jun 2010 11:01:42 -0400, Eric Niebler wrote:
- Decide if passing a malformed string is a violation of unquote's preconditions.
- If it is, throw something derived from logic_error.
Please don't throw exceptions in response to violated preconditions. The appropriate mechanism is BOOST_ASSERT. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
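The distinction drawn in this subthread can be sketched as follows. Plain assert from <cassert> stands in for BOOST_ASSERT so that the example needs no Boost headers. The function name strict_unquote(), the exception type unquote_failure, and the particular precondition (delimiter and escape character must differ) are assumptions chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical exception type for malformed external input.
struct unquote_failure : std::runtime_error {
    explicit unquote_failure(const std::string& what)
        : std::runtime_error(what) {}
};

// Precondition (the caller's responsibility): delim != escape.
// A malformed string is NOT treated as a precondition violation here;
// it is reported with an exception derived from std::runtime_error.
std::string strict_unquote(const std::string& s,
                           char delim = '"', char escape = '\\')
{
    // Programmer error: asserted, not thrown.
    assert(delim != escape && "delimiter and escape must differ");

    // Bad input: reported at run time.
    if (s.empty() || s[0] != delim)
        throw unquote_failure("missing opening delimiter");

    std::string out;
    for (std::size_t i = 1; i < s.size(); ++i) {
        if (s[i] == escape && i + 1 < s.size())
            out += s[++i];        // escaped character, copied verbatim
        else if (s[i] == delim)
            return out;           // closing delimiter found
        else
            out += s[i];
    }
    throw unquote_failure("missing closing delimiter");
}
```

This is the strict alternative to muddling through: callers that read untrusted input can catch std::runtime_error, while a violated precondition stops the program in debug builds instead of being caught by a generic handler.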

On 6/23/2010 11:12 PM, David Abrahams wrote:
At Wed, 23 Jun 2010 11:01:42 -0400, Eric Niebler wrote:
- Decide if passing a malformed string is a violation of unquote's preconditions.
- If it is, throw something derived from logic_error.
Please don't throw exceptions in response to violated preconditions. The appropriate mechanism is BOOST_ASSERT.
Which I stated in my immediate follow-up to my own mail. Keep reading. ;-) -- Eric Niebler BoostPro Computing http://www.boostpro.com

vicente.botet wrote:
From: "Stewart, Robert" <Robert.Stewart@sig.com>
Even assuming "quote" were generally acceptable, it doesn't connote the escaping behavior in Beman's functions.
Any time you "quote" something, you need an escape sequence to represent both the quoting character and the escape character itself.
Looking back, I see that is all this code is to do, so you're exactly right. I somehow still find "quote" less than ideal, but then I recall using "quoting character" without concern, for example, so I guess Eric, you, and I have convinced me. ;-) _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com
participants (9)
- Alexander Lamaison
- Beman Dawes
- Daniel James
- David Abrahams
- Dmitry Goncharov
- Eric Niebler
- Rob Stewart
- Stewart, Robert
- vicente.botet