Has anybody managed to get tokenizer working for wide characters with
VC7.1 (boost version 1.31.0)? The following example works fine...
typedef tokenizer
I've run the same tests with gcc 3.2.2 on RH9, without any problems, so I'll
post this on the microsoft.public.vc.language newsgroup.
Keith MacDonald
"Keith MacDonald" wrote:
tokenizer worked in Unicode for me, so I experimented with your example to try to find out what made the difference. To simplify building in different modes, I changed it to the following:
// ==== BEGIN CODE ====
// Unicode Build: cl /D_UNICODE /EHsc /IF:\Dev\boost_1_31_0 tok.cpp
// DBCS Build:    cl /EHsc /IF:\Dev\boost_1_31_0 tok.cpp
//
#include <string>
#include <iostream>
#include <boost/tokenizer.hpp>

#ifdef _UNICODE
typedef std::basic_string<wchar_t> string_t;
#define _T(x) L##x
#define STDOUT std::wcout
#else
typedef std::basic_string<char> string_t;
#define _T(x) x
#define STDOUT std::cout
#endif
typedef string_t::value_type char_t;

typedef boost::tokenizer<boost::char_separator<char_t>,
                         string_t::const_iterator, string_t> MyTokenizer;

const boost::char_separator<char_t> sep(_T("a"));

int main()
{
#ifdef _BUG
    MyTokenizer token(string_t(_T("abacadaeafag")), sep);
#else
    string_t s(_T("abacadaeafag"));
    MyTokenizer token(s, sep);
#endif

    for (MyTokenizer::const_iterator it = token.begin(); it != token.end(); ++it)
        STDOUT << *it;

    return 0;
}
// ==== END CODE ====
The following table shows the output for each combination of _UNICODE and _BUG:

_UNICODE  _BUG   Output
-----------------------------
undef     def    " bcdefg"
def       def    ""
undef     undef  "bcdefg"
def       undef  "bcdefg"

It seems that, with VC7.1, the tokenizer constructor handles both Unicode and MBCS temporary strings incorrectly.
Keith MacDonald
On Sat, 21 Feb 2004 09:10:43 -0000, Keith MacDonald wrote:
MyTokenizer token(string_t(_T("abacadaeafag")), sep);
If you take a look at the tokenizer constructor

    template <typename Container>
    tokenizer(const Container& c, const TokenizerFunc& f)
        : first_(c.begin()), last_(c.end()), f_(f) { }

you will notice that it does not store a copy of its string argument; it stores only the begin and end iterators. When the string is destroyed (in your example it is a temporary, so it is destroyed at the end of the full expression), these iterators are no longer valid. The problem you are seeing is an unusual manifestation of undefined behaviour: you are working with invalid iterators. I think a program crash would be a better indication that you have a serious problem, but undefined behaviour may manifest in any way; this time it just looks as if the tokenizer is empty. Of course, a slight change to the program or the compilation options may result in a crash (or anything else) until you remove the undefined behaviour:

    string_t s(_T("abacadaeafag"));
    MyTokenizer token(s, sep);
B.
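The lifetime problem can be reproduced without Boost. The sketch below uses a hypothetical `CharSplitter` class (an illustration only, not part of Boost.Tokenizer) whose constructor, like the one quoted above, keeps only the begin and end iterators of its argument. It shows the safe pattern, where the string outlives the splitter; the temporary version is left as a comment because running it would be undefined behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for boost::tokenizer: stores iterators only, no copy.
class CharSplitter {
public:
    CharSplitter(const std::string& s, char sep)
        : first_(s.begin()), last_(s.end()), sep_(sep) {}

    // Collect the non-empty tokens found between separators.
    std::vector<std::string> tokens() const {
        std::vector<std::string> out;
        std::string cur;
        for (std::string::const_iterator it = first_; it != last_; ++it) {
            if (*it == sep_) {
                if (!cur.empty()) out.push_back(cur);
                cur.clear();
            } else {
                cur += *it;
            }
        }
        if (!cur.empty()) out.push_back(cur);
        return out;
    }

private:
    // Valid only while the source string is alive.
    std::string::const_iterator first_, last_;
    char sep_;
};

// Safe: the named string outlives the splitter that refers to it.
std::string join_tokens_safely() {
    std::string s("abacadaeafag");
    CharSplitter splitter(s, 'a');
    const std::vector<std::string> toks = splitter.tokens();
    std::string joined;
    for (std::size_t i = 0; i < toks.size(); ++i) joined += toks[i];
    return joined;
}

// Unsafe (do not do this): the temporary dies at the end of the declaration,
// so the stored iterators dangle before tokens() is ever called.
//   CharSplitter bad(std::string("abacadaeafag"), 'a');
```

The only difference between the two cases is who owns the string; the class itself cannot tell them apart at run time.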
Hmmm. I've been trying to use various members of the boost library as black
boxes, but this issue highlights the danger of doing so. I suppose a
language keyword is needed to specify when a non-temporary object is
required as an actual parameter. Given that there's no such thing, perhaps
it would be safer to eliminate such convenience constructors from the
library?
Keith MacDonald
On Sun, 22 Feb 2004 20:36:54 -0000, Keith MacDonald wrote:
Hmmm. I've been trying to use various members of the boost library as black boxes, but this issue highlights the danger of doing so. I suppose a language keyword is needed to specify when a non-temporary object is required as an actual parameter. Given that there's no such thing, perhaps it would be safer to eliminate such convenience constructors from the library?
I think the simplest thing to do would be to explain the problem in the tokenizer documentation.
B.
PS. There is a chance that C++ will be enriched with syntax that allows detecting an rvalue (temporary value) used as a function argument; see: http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2002/n1377.htm
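The proposal linked above became rvalue references in C++11. As a sketch of how they can catch this mistake at compile time, the hypothetical `SafeSplitter` below (again an illustration, not Boost code) deletes the constructor overload that a temporary would bind to, so only named strings are accepted. It assumes a C++11 or later compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical class, not part of Boost: stores iterators into the source string.
class SafeSplitter {
public:
    SafeSplitter(const std::string& s, char sep)
        : first_(s.begin()), last_(s.end()), sep_(sep) {}

    // Temporaries (rvalues) prefer this overload, and it is deleted,
    // so SafeSplitter(std::string("..."), 'a') fails to compile.
    SafeSplitter(const std::string&& s, char sep) = delete;

    std::size_t length() const { return static_cast<std::size_t>(last_ - first_); }
    char separator() const { return sep_; }

private:
    std::string::const_iterator first_, last_;
    char sep_;
};

// Lvalues are accepted, rvalues are rejected at compile time.
static_assert(std::is_constructible<SafeSplitter, std::string&, char>::value,
              "named strings must be accepted");
static_assert(!std::is_constructible<SafeSplitter, std::string, char>::value,
              "temporary strings must be rejected");

// Usage with a named string, which outlives the splitter.
inline std::size_t demo_length() {
    std::string s("abacadaeafag");
    SafeSplitter splitter(s, 'a');
    return splitter.length();
}
```

This does not make the class own its data; a named string that goes out of scope before the splitter still leaves dangling iterators. It only rules out the specific case of passing a temporary.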
participants (3)
- Bronek Kozicki
- Douglas G. Hanley
- Keith MacDonald