Boost tokenizer and range support

25 Mar 2008

Hi,

I've been using the Boost tokenizer successfully for some time and have been
quite happy with it. Until now I used std::string as my token type, but for
performance reasons I need something different: the input is a raw UTF-8
buffer (const unsigned char*) and the output is a specific UTF-16 string
class. So I thought: maybe I can just tokenize the unsigned char buffer in
place, using boost::iterator_range<const unsigned char*> as my token type.

And it almost worked! With a hack:

The tokenizer attempts to call assign() on my TokenType, but
boost::iterator_range has no such member function. I created a wrapper
class whose assign() simply delegates to the iterator_range's assignment
operator, and it now works!
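
To make the hack concrete, here is a minimal sketch of the idea. It
deliberately avoids Boost so that it stands alone: ByteRange is a
hypothetical stand-in for boost::iterator_range<const unsigned char*>,
AssignableRange is the wrapper, and split_in_place is a toy loop that only
mimics how a tokenizer hands each token over through assign(first, last).
None of these names come from Boost, and this is not my exact code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for boost::iterator_range<const unsigned char*>:
// a non-owning [first, last) view into the input buffer.
struct ByteRange {
    const unsigned char* first = nullptr;
    const unsigned char* last = nullptr;

    ByteRange() = default;
    ByteRange(const unsigned char* b, const unsigned char* e)
        : first(b), last(e) {}

    const unsigned char* begin() const { return first; }
    const unsigned char* end() const { return last; }
    std::size_t size() const { return static_cast<std::size_t>(last - first); }
};

// The wrapper: adds the assign(first, last) member that the tokenizer
// calls and delegates to the range's assignment operator.
struct AssignableRange : ByteRange {
    AssignableRange() = default;

    template <class Iterator>
    void assign(Iterator b, Iterator e) {
        static_cast<ByteRange&>(*this) = ByteRange(b, e);
    }
};

// Toy splitter (an assumption, not the Boost implementation): it drops
// empty tokens and passes each token to the token type via assign().
template <class Token>
std::vector<Token> split_in_place(const unsigned char* b,
                                  const unsigned char* e,
                                  unsigned char sep) {
    std::vector<Token> out;
    const unsigned char* start = b;
    for (const unsigned char* p = b;; ++p) {
        if (p == e || *p == sep) {
            if (p != start) {
                Token tok;
                tok.assign(start, p);  // no copy: the token points into the buffer
                out.push_back(tok);
            }
            if (p == e) break;
            start = p + 1;
        }
    }
    return out;
}
```

Each token is just a pair of pointers into the original buffer, so the
buffer has to outlive the tokens.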

This is great because there are no more useless string constructions: I can
go directly from a raw UTF-8 buffer to my output string type (UTF-16 based)
with only one conversion and no extra allocations! I still have the nice
syntax of the Boost tokenizer, with maximum efficiency!
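
For illustration, the single conversion step could look roughly like the
sketch below. It uses std::u16string as a stand-in for my own UTF-16 string
class, the helper name utf8_range_to_utf16 is hypothetical, and it assumes
well-formed UTF-8 (no validation is done).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 in [b, e) straight into
// UTF-16. Malformed input is not detected and gives unspecified output.
inline std::u16string utf8_range_to_utf16(const unsigned char* b,
                                          const unsigned char* e) {
    std::u16string out;
    // A UTF-16 encoding never needs more code units than the UTF-8 form
    // has bytes, so reserving up front avoids any reallocation.
    out.reserve(static_cast<std::size_t>(e - b));

    while (b != e) {
        const unsigned char lead = *b++;
        char32_t cp;
        int extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }

        for (; extra > 0 && b != e; --extra) {
            cp = (cp << 6) | (*b++ & 0x3F);
        }

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Called with the begin() and end() of a token range, this goes from the raw
bytes to the output string without an intermediate std::string.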

I think this solution should be mentioned in the tutorial docs because it
might not be obvious to everybody. Also, maybe we can eliminate my hack by
adding an assign() to the Boost range interface (this seems simpler to me
than modifying the tokenizer so that it does not call assign()).

Thanks for the great work you guys put into this library!

Best regards,

Florin.

Florin Trofin
