std::string <-> std::wstring conversion

Hi all, there have been many discussions (and proposals) on this topic in the last years, but was there any result and is there _anything_ in boost today that can be used to convert between std::string and std::wstring? Stefan

Have you looked at std::codecvt? On Fri, 17 Dec 2004 14:51:20 +0100, Stefan Slapeta <stefan@slapeta.com> wrote:
Hi all,
there have been many discussions (and proposals) on this topic in the last years, but was there any result and is there _anything_ in boost today that can be used to convert between std::string and std::wstring?
Stefan
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
-- Cory Nelson http://www.int64.org

Cory Nelson wrote:
Have you looked at std::codecvt?
Of course, but I meant some convenience functions (I remember lexical_cast has been proposed as a candidate, which obviously doesn't work). IMO it's unacceptable that you have to instantiate a locale + facet for this purpose! Stefan
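For readers unfamiliar with the "locale + facet" route Stefan is objecting to, here is a minimal sketch of what it involves with the standard std::codecvt facet. The function name widen and the choice of the classic locale as the default are assumptions made for illustration; nothing here comes from Boost.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: widen a narrow string with the codecvt facet of a
// locale. The narrow encoding is whatever that locale defines.
std::wstring widen(const std::string& in,
                   const std::locale& loc = std::locale::classic())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    // Assumption: every wide character consumes at least one input byte,
    // so in.size() output slots are enough.
    std::wstring out(in.size(), L'\0');
    if (in.empty())
        return out;

    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    std::codecvt_base::result r = cvt.in(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r == std::codecvt_base::noconv) {
        // The facet reports that no conversion is needed: copy byte by byte.
        for (std::size_t i = 0; i < in.size(); ++i)
            out[i] = static_cast<unsigned char>(in[i]);
        return out;
    }
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("narrow-to-wide conversion failed");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

A call such as widen("abc") returns L"abc" with the classic locale; passing another std::locale changes how non-ASCII bytes are interpreted.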

Note that the serialization library contains code for doing exactly this. It's in the subdirectory Dataflow Iterators. It was needed to serialize strings to wide-character archives and to serialize wstrings to char archives. Robert Ramey Stefan Slapeta wrote:
Hi all,
there have been many discussions (and proposals) on this topic in the last years, but was there any result and is there _anything_ in boost today that can be used to convert between std::string and std::wstring?
Stefan

Robert Ramey wrote:
Note that the serialization library contains code for doing exactly this. It's in the subdirectory Dataflow Iterators. It was needed to serialize strings to wide-character archives and to serialize wstrings to char archives.
Robert Ramey
Stefan Slapeta wrote:
Hi all,
there have been many discussions (and proposals) on this topic in the last years, but was there any result and is there _anything_ in boost today that can be used to convert between std::string and std::wstring?
The Iostreams library will contain this functionality, too. (I'm rewriting it as we speak.) When I finish, converting between wide and narrow strings should look like this:

    #include <boost/iostreams/code_converter.hpp>
    #include <boost/iostreams/copy.hpp>

    using namespace std;
    using namespace boost::io;

    // Function object type for widening strings.
    template<typename Codecvt>
    struct widener : unary_function<string, wstring> {
        wstring operator() (string s) const
        {
            wstring result;
            converter<Codecvt, istringstream> cvt(s);
            boost::io::copy(cvt, back_inserter(result));
            return result;
        }
    };

    // Function object type for narrowing strings.
    template<typename Codecvt>
    struct narrower : unary_function<wstring, string> {
        string operator() (wstring s) const
        {
            ostringstream result;
            converter<Codecvt, ostringstream> cvt(result);
            boost::io::copy(boost::make_iterator_range(s), cvt);
            return result.str();
        }
    };

The Dinkumware CoreX library also contains a component, wstring_convert, for this purpose. Since it doesn't make a detour through streams, it might be more efficient. Jonathan

Jonathan Turkanis wrote:
Robert Ramey wrote:
Note that the serialization library contains code for doing exactly this.
Note that the implementation above is not dependent on codecvt nor on streams. It just converts wstring to string and back again.
The Dinkumware CoreX library also contains a component, wstring_convert, for this purpose. Since it doesn't make a detour through streams, it might be more efficient.
Jonathan
Robert Ramey

Robert Ramey wrote:
Note that the implementation above is not dependent on codecvt nor on streams.
This is absolutely crucial. Imagine you have to read XML files with many thousands of strings... BTW, that's the same reason that lexical_cast isn't applicable for many real cases. Its catastrophic performance inhibits every usage whenever speed _could_ be an issue... I would like to see these conversion functions available at a more general place than the serialization library. But probably there is much more to do to improve internationalization support than only providing string/wstring conversion ;) Stefan

Stefan Slapeta wrote:
I would like to see these conversion functions available at a more general place than the serialization library. But probably there is much more to do to improve internationalization support than only providing string/wstring conversion ;)
Without specifying the encodings for the std::string and the std::wstring, it's not possible to convert between the two. Common wstring encodings are UCS-2, UTF-16, and UCS-4/UTF-32. The common std::string encodings are far too many to list here.
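To make the point concrete, the sketch below decodes the same two bytes under two different assumed narrow encodings and gets different results. The function names are hypothetical, char32_t is used instead of wchar_t to sidestep the question of wchar_t's width, and the UTF-8 decoder is deliberately minimal (it rejects bad lead and continuation bytes but does not check for overlong forms or surrogates).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Interpret every byte as one ISO-8859-1 (Latin-1) character.
std::u32string from_latin1(const std::string& bytes)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i)
        out.push_back(static_cast<unsigned char>(bytes[i]));
    return out;
}

// Interpret the bytes as UTF-8 (minimal decoder, see the caveats above).
std::u32string from_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For the byte string "\xC3\xA9", from_latin1 yields two characters (U+00C3, U+00A9) while from_utf8 yields one (U+00E9), which is why a conversion function has to be told which encoding the std::string holds.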

Peter Dimov wrote:
Stefan Slapeta wrote:
I would like to see these conversion functions available at a more general place than the serialization library. But probably there is much more to do to improve internationalization support than only providing string/wstring conversion ;)
Without specifying the encodings for the std::string and the std::wstring, it's not possible to convert between the two. Common wstring encodings are UCS-2, UTF-16, and UCS-4/UTF-32. The common std::string encodings are far too many to list here.
The conversion implementation used in the dataflow section of the library relies upon the standard library functions mblen and mbtowc. I believe that these in turn rely upon the currently selected global locale for string encodings. Robert Ramey
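As a rough illustration of this locale-dependent approach (not the serialization library's actual code), a conversion loop over the C multibyte functions might look like the sketch below. It uses the restartable std::mbrtowc rather than mbtowc, and the function name is an assumption.

```cpp
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: widen a multibyte string according to the LC_CTYPE
// category of the current C locale (set with std::setlocale).
std::wstring widen_with_c_locale(const std::string& in)
{
    std::wstring out;
    std::mbstate_t state = std::mbstate_t();
    const char* p = in.data();
    std::size_t left = in.size();

    while (left > 0) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, p, left, &state);
        if (n == static_cast<std::size_t>(-1) ||
            n == static_cast<std::size_t>(-2))
            throw std::runtime_error("invalid or truncated multibyte sequence");
        if (n == 0)
            n = 1;  // an embedded '\0' was converted; assumed to occupy one byte
        out.push_back(wc);
        p += n;
        left -= n;
    }
    return out;
}
```

In the default "C" locale this handles plain ASCII; calling std::setlocale(LC_CTYPE, "") beforehand makes it follow the user's environment, which is exactly the implicit dependency on global state discussed in this thread.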

Stefan Slapeta wrote:
Robert Ramey wrote:
Note that the implementation above is not dependent on codecvt nor on streams.
This is absolutely crucial. Imagine you have to read XML files with many thousands of strings...
BTW, that's the same reason that lexical_cast isn't applicable for many real cases. Its catastrophic performance inhibits every usage whenever speed _could_ be an issue...
Streams aren't as slow as you think. For converting large amounts of data, a stream-based approach can be quite fast. The most expensive aspects of using streams are:
- initialization, if you have to create a new stream each time you want to convert or format a small amount of data
- the formatted i/o operations
The code_converter component doesn't incur either cost. A code_converter can be constructed once and used many times. It can also convert between narrow and wide characters without invoking any formatted i/o operations.
I would like to see these conversion functions available at a more general place than the serialization library. But probably there is much more to do to improve internationalization support than only providing string/wstring conversion ;)
The code_converter component from the iostreams library is already quite flexible. It might make sense to have a specialized component for converting strings, but I'd like it to be general enough to handle any type of string. Robert's approach, using iterator adaptors, might be a good way to go. But a Codecvt should be a template parameter:

    template< typename Codecvt, typename Iterator ... >
    class converting_iterator;
Stefan
Jonathan

Robert Ramey wrote:
Jonathan Turkanis wrote:
Robert Ramey wrote:
Note that the serialization library contains code for doing exactly this.
Note that the implementation above is not dependent on codecvt nor on streams. It just converts wstring to string and back again.
The code_converter component I mentioned doesn't use streams either. But it can be used to adapt streams -- to turn a narrow character stream into a wide character stream or vice versa. We've already discussed the tradeoffs between using iterators or objects with a socket-like interface to perform filtering: http://lists.boost.org/MailArchives/boost/msg70761.php It is crucial to support codecvts, since this is how the C++ standard library encapsulates code conversion. If you don't allow for a user-supplied codecvt, you can't reuse the large number that have already been written, such as the dozens that come with the Dinkumware CoreX library.
Robert Ramey
Jonathan

At 05:01 PM 12/17/2004, Jonathan Turkanis wrote:
The Dinkumware CoreX library also contains a component, wstring_convert, for this purpose. Since it doesn't make a detour through streams, it might be more efficient.
Dinkumware is proposing part of their library for standardization. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1683.html Should Boost be looking at that proposal? Seems to me we should. If we like it, we can support their proposal and perhaps do an independent implementation for Boost. If we think it needs changes, the sooner those get communicated to Dinkumware and/or the LWG, the better. --Beman
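For orientation, a class template named wstring_convert, with from_bytes and to_bytes members, is part of the C++11 standard library (and has been deprecated since C++17, so compilers may warn). The sketch below uses that standard component together with std::codecvt_utf8; whether it matches N1683 in every detail is not something this thread establishes, and the two wrapper function names are assumptions.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical convenience wrappers around the C++11 std::wstring_convert.
// The narrow side is assumed to be UTF-8; the wide side is UCS-2 or UCS-4
// depending on the width of wchar_t.
std::wstring utf8_to_wide(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.from_bytes(bytes);  // throws std::range_error on bad input
}

std::string wide_to_utf8(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.to_bytes(wide);
}
```

Note that the Codecvt is a template argument, so no locale object has to be set up by the caller, which is the convenience asked for at the start of this thread.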

Beman Dawes wrote:
Dinkumware is proposing part of their library for standardization. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1683.html
Hmmm, looks small and nice, I guess it wouldn't be too hard to implement.
Should Boost be looking at that proposal? Seems to me we should.
If we like it, we can support their proposal and perhaps do an independent implementation for Boost. If we think it needs changes, the sooner those get communicated to Dinkumware and/or the LWG, the better.
Let's start as soon as possible :) Cheers, Stefan

Beman Dawes wrote:
At 05:01 PM 12/17/2004, Jonathan Turkanis wrote:
The Dinkumware CoreX library also contains a component, wstring_convert, for this purpose. Since it doesn't make a detour through streams, it might be more efficient.
Dinkumware is proposing part of their library for standardization. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1683.html
Should Boost be looking at that proposal? Seems to me we should.
I'm aware of this proposal. It's what convinced me to do the current rewrite of the code_converter component, making a codecvt the first template parameter. By chance, I finally got around to it today. boost::io::code_converter is a generalization of the wbuffer_convert template from CoreX and n1683. While wbuffer_convert is a stream buffer adapter, taking a narrow-character stream buffer and producing a wide-character stream buffer, code_converter is a Device adapter, taking a narrow-character Device and producing a wide-character Device. Since stream buffers are models of the Device concept (http://tinyurl.com/53bvc), code_converter subsumes wbuffer_convert. wbuffer_convert is declared as follows:

    template< class Codecvt,
              class Elem = wchar_t,
              class Tr = std::char_traits<Elem> >
    class wbuffer_convert : basic_streambuf<Elem, Tr> { ... }

code_converter (as I am rewriting it today) is declared as follows:

    template< typename Codecvt,
              typename Device,
              typename Alloc = std::allocator<char> >
    class code_converter { ... };

For a given Codecvt type Cvt, wbuffer_convert<Cvt> is essentially the same as

    boost::io::streambuf_facade<        // The adapted streambuf
        boost::io::code_converter<      // The adapted Device.
            Cvt,                        // The code conversion policy
            basic_streambuf<typename Cvt::extern_type>  // The underlying Device
        >
    >

(Here basic_streambuf<typename Cvt::extern_type> is the narrow-character device; applying code_converter yields a wide-character device; finally, applying streambuf_facade yields a wide-character stream buffer.)
If we like it, we can support their proposal and perhaps do an independent implementation for Boost.
It's easy to implement as a wrapper around code_converter. Should I add it?
If we think it needs changes, the sooner those get communicated to Dinkumware and/or the LWG, the better.
I'd rather standardize code_converter, since it's more general. ;-) Unfortunately, the Boost Iostreams library hasn't been widely used enough yet to propose even the core components for standardization. (While it hasn't been officially released, I know people are already using it since they email me frequently.) Assuming the standard iostreams library isn't going to be expanded to incorporate support for generic devices, I think wbuffer_convert is just about right. (One thing I don't understand is why the character type of wbuffer_convert is allowed to be specified as the second template argument. It seems to me that the character type should always be equal to Codecvt::intern_type.)
--Beman
Jonathan

At 11:31 PM 12/17/2004, Jonathan Turkanis wrote:
Beman Dawes wrote:
... Dinkumware is proposing part of their library for standardization. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1683.html ... If we like it, we can support their proposal and perhaps do an independent implementation for Boost.
It's easy to implement as a wrapper around code_converter. Should I add it?
Not yet. N1683 hasn't been accepted by the LWG, and is likely to change at least in minor ways. I think what we should do at this point is write a critique of N1683, supporting that proposal but also suggesting various changes. I'll try to pull something together based on comments already made, and post it here to get comments, corrections, and additional thoughts. My interest in this is that I need wstring_convert (or something like it) for internationalization of the filesystem library, and I'd much rather base that on something already in Boost and/or the standard library than have to roll my own. --Beman

"Beman Dawes" <bdawes@acm.org> wrote in message news:6.0.3.0.2.20041217180614.028a8028@mailhost.esva.net... | At 05:01 PM 12/17/2004, Jonathan Turkanis wrote: | | >The Dinkumware CoreX library also contains a component, wstring_convert, | >for this purpose. Since it doesn't make a detour through streams, it | might | >be more efficient. | | Dinkumware is proposing part of their library for standardization. See | http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1683.html | | Should Boost be looking at that proposal? Seems to me we should. | | If we like it, we can support their proposal and perhaps do an independent | implementation for Boost. If we think it needs changes, the sooner those | get communicated to Dinkumware and/or the LWG, the better. I don't like that std::basic_string is hardcoded in the interface. I talked briefly with Bill about it in Redmond. I would rather see a more complicated coverter class and then an easy function interface on top of that, maybe like vector<char> bytes; convert_to_bytes<utf8_enc>( L"foo", bytes ); vector<wchar_t> wides; convert_to_wides<utf8_enc>( bytes, wides ); or something. -Thorsten
participants (8)
- Beman Dawes
- Cory Nelson
- Jonathan Turkanis
- Peter Dimov
- Robert Ramey
- Stefan Slapeta
- Stefan Slapeta
- Thorsten Ottosen