Boost tokenizer and range support
Hi,

I've been using the Boost tokenizer successfully in the past and I've been
quite happy with it. I was using it with std::string as my token type, but
now I need to use it differently for performance reasons: the input string
is a raw UTF-8 buffer (const unsigned char*) and the output is a specific
UTF-16 string class. So I thought: maybe I can just tokenize the unsigned
char buffer in place using boost::iterator_range as my token type.

And it almost worked! With a hack: the tokenizer attempts to call assign on
my TokenType, but boost::iterator_range doesn't have such a member function.
I created a wrapper class that simply delegates to iterator_range's
assignment operator, and it now works!

This is great because I have no more useless string constructions: I can go
directly from a raw UTF-8 buffer to my output string type (UTF-16 based)
with only one conversion and no extra allocations! I still have the nice
syntax of the Boost tokenizer and maximum efficiency!

I think this solution should be mentioned in the tutorial docs because it
might not be obvious to everybody. Also, maybe we can eliminate the hack by
adding an assign() to the Boost range interface (this seems simpler to me
than modifying the tokenizer to not call assign).

Thanks for the great work you guys put into this library!

Best regards,
Florin.

------------------------------------------------------------------------

Hi,

Why don't you just use the split algorithm in the StringAlgo library?
http://www.boost.org/doc/html/string_algo/usage.html#id1638440

Regards,
Pavol.
------------------------------------------------------------------------
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
Turns out that char_separator shamelessly constructs std::strings under
the covers, so I gained something, but not as much as I had hoped. The split
algorithm you mention requires a container to store the results, so you
still have to do one allocation, correct?
Frustrating! In theory one should be able to parse a sequence of tokens
without constructing or copying any strings.
Florin.
On Wed, Mar 26, 2008 at 12:54 AM, Pavol Droba wrote:
> Why don't you just use the split algorithm in the StringAlgo library?
> http://www.boost.org/doc/html/string_algo/usage.html#id1638440
------------------------------------------------------------------------

Hi,

If you don't want to have a container to store the results, you can use the
split_iterator directly. The split algorithm only wraps the split_iterator.

http://www.boost.org/doc/libs/1_35_0/doc/html/boost/algorithm/split_iterator...
http://www.boost.org/doc/libs/1_35_0/doc/html/string_algo/usage.html#id12907...

Regards,
Pavol.
participants (2)
- Florin Trofin
- Pavol Droba