Draft Critique of Code Conversion Proposal (N1683)

Toward the end of a thread with the subject "std::string <-> std::wstring conversion" there was some discussion of how the C++ committee N1683 proposal could be improved. I volunteered to write up our discussions. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1683.html for a copy of the proposal.

Here is a draft of what I have written so far. Comments and improvements welcome.

--Beman

Critique of Code Conversion Proposal (N1683)
--------------------------------------------

N1683==04-0123, Proposed Library Additions for Code Conversion, proposes sorely needed code conversion facilities for the standard library. (See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1683.html) Without these facilities programmers concerned with internationalization are forced to reinvent the wheel; Boost has run into that problem two or three times in existing libraries, and additional times in libraries currently in the Boost pipeline. The proposal should be accepted by the LWG as a high priority need.

That being said, there are several concerns described in this paper which may indicate the proposal can be further refined and improved.

1. Hard-wired byte_string type in wstring_convert
-------------------------------------------------

The underlying wstring_convert design seems flexible enough to cope with conversion between any two character types which meet std::basic_string requirements. Conversion is actually performed by std::codecvt, which is already parameterized by both internalT and externalT types. It seems artificial to restrict wstring_convert::byte_string to std::basic_string<char>. New character types such as the proposed char16_t and char32_t will need conversions to and from other wide types, yet with the current restriction wstring_convert could not be used for that purpose.

Suggested change: replace

    typedef std::basic_string<char> byte_string;

with:

    typedef std::basic_string<typename Codecvt::extern_type> byte_string;

and change from_bytes argument types accordingly.

If this suggested change is accepted, it will probably make sense to rename some wstring_convert members.

2. wstring_convert template parameter Elem seems unneeded
---------------------------------------------------------

The wstring_convert template parameter Elem seems unneeded. Isn't it always Codecvt::intern_type?

Suggested change: remove the Elem parameter and replace Elem with Codecvt::intern_type.

3. Need target-argument form for wstring_convert conversion functions
---------------------------------------------------------------------

wstring_convert's conversion functions are in the form:

    byte_string to_bytes(const wide_string& wstr) const;

While this form is often useful and should be retained, it may imply an extra copy of the result if a compiler is not smart enough to optimize the copy away.

Suggested change is to add additional functions in the form:

    void to_bytes(const wide_string& wstr, byte_string & target) const;

4. More explicit name for wstring_convert
-----------------------------------------

"wstring" might be misleading, depending on the actual types involved. "convert" is a verb, yet nouns make better class names.

Suggested change: wstring_convert to: string_converter

5. Standardese needed
---------------------

The proposal needs improved standardese. For example, the requirements on the template parameters need to be specified and the function description converted to canonical form.

6. Comparable changes need to be made for wbuffer_convert
---------------------------------------------------------

Any of the above changes which are accepted need to be folded into wbuffer_convert.

Acknowledgements
----------------

This critique is based on discussions with Thorsten Ottosen, Stefan Slapeta, and Jonathan Turkanis.

Revised: 05 January 2005
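Suggestions 1 through 4 of the draft can be read together as one small class. The following is only a minimal sketch under stated assumptions, not N1683's wording and not an agreed design: the name string_converter comes from suggestion 4, both string types are derived from Codecvt as in suggestions 1 and 2, and the target-argument overload follows suggestion 3. Borrowing the facet from a std::locale (the classic "C" locale by default) and throwing std::range_error on failure are choices made here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical sketch: both string types are derived from Codecvt,
// so no separate Elem parameter and no hard-wired std::string.
template <class Codecvt>
class string_converter {
public:
    typedef typename Codecvt::intern_type intern_type;
    typedef typename Codecvt::extern_type extern_type;
    typedef std::basic_string<intern_type> wide_string;
    typedef std::basic_string<extern_type> byte_string;

    // Assumption: the facet is borrowed from a locale, which is kept alive here.
    explicit string_converter(const std::locale& loc = std::locale::classic())
        : loc_(loc), cvt_(&std::use_facet<Codecvt>(loc_)) {}

    // Target-argument form: the result is written into the caller's string.
    void to_bytes(const wide_string& wstr, byte_string& target) const {
        if (wstr.empty()) { target.clear(); return; }
        // Room for max_length() external characters per internal character.
        byte_string buf(wstr.size() * static_cast<std::size_t>(cvt_->max_length()),
                        extern_type());
        typename Codecvt::state_type state = typename Codecvt::state_type();
        const intern_type* from_next = 0;
        extern_type* to_next = 0;
        const std::codecvt_base::result r = cvt_->out(
            state, wstr.data(), wstr.data() + wstr.size(), from_next,
            &buf[0], &buf[0] + buf.size(), to_next);
        if (r == std::codecvt_base::noconv) {
            // The facet performs no conversion: copy the characters unchanged.
            target.assign(wstr.begin(), wstr.end());
            return;
        }
        if (r != std::codecvt_base::ok)
            throw std::range_error("string_converter: conversion failed");
        assert(to_next >= &buf[0] && to_next <= &buf[0] + buf.size());
        buf.resize(static_cast<std::size_t>(to_next - &buf[0]));
        target.swap(buf);
    }

    // Return-value form, implemented on top of the target-argument form.
    byte_string to_bytes(const wide_string& wstr) const {
        byte_string result;
        to_bytes(wstr, result);
        return result;
    }

private:
    std::locale loc_;      // declared first so the facet outlives cvt_
    const Codecvt* cvt_;
};
```

With std::codecvt<wchar_t, char, std::mbstate_t> as the argument, wide_string is std::wstring and byte_string is std::string; a facet with a different extern_type would change byte_string without touching the class.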

From: Beman Dawes <bdawes@acm.org>
1. Hard-wired byte_string type in wstring_convert
-------------------------------------------------
If this suggested change is accepted, it will probably make sense to rename some wstring_convert members.
You'll probably need to enumerate the changes you think are necessary to ensure good names are selected.
3. Need target-argument form for wstring_convert conversion functions
---------------------------------------------------------------------
wstring_convert's conversion functions are in the form:
byte_string to_bytes(const wide_string& wstr) const;
While this form is often useful and should be retained, it may imply an extra copy of the result if a compiler is not smart enough to optimize the copy away. Suggested change is to add additional functions in the form:
void to_bytes(const wide_string& wstr, byte_string & target) const;
I prefer output parameters to be first (due to the occasional need to have defaulted parameters and the desire to find output parameters in the same position relative to input parameters). Thus, without meaning to trigger a religious war, I propose this version instead:

    void to_bytes(byte_string & destination, wide_string const & source) const;
4. More explicit name for wstring_convert
-----------------------------------------
"wstring" might be misleading, depending on the actual types involved. "convert" is a verb, yet nouns make better class names.
Suggested change: wstring_convert to: string_converter
Much better.

--
Rob Stewart stewart@sig.com
Software Engineer http://www.sig.com
Susquehanna International Group, LLP using std::disclaimer;

| > 3. Need target-argument form for wstring_convert conversion functions
| > ---------------------------------------------------------------------
| >
| > wstring_convert's conversion functions are in the form:
| >
| >   byte_string to_bytes(const wide_string& wstr) const;
| >
| > While this form is often useful and should be retained, it may imply an
| > extra copy of the result if a compiler is not smart enough to optimize the
| > copy away.
| > Suggested change is to add additional functions in the form:
| >
| >   void to_bytes(const wide_string& wstr, byte_string & target) const;
|
| I prefer output parameters to be first (due to the occasional
| need to have defaulted parameters and the desire to find output
| parameters in the same position relative to input parameters).
| Thus, without meaning to trigger a religious war, I propose this
| version instead:
|
|   void to_bytes(byte_string & destination,
|                 wide_string const & source) const;

I would prefer these to be free-standing functions. Was there something that prohibited that?

-Thorsten

At 04:04 PM 1/5/2005, Thorsten Ottosen wrote:
| > 3. Need target-argument form for wstring_convert conversion functions
| > ---------------------------------------------------------------------
| >
| > wstring_convert's conversion functions are in the form:
| >
| >   byte_string to_bytes(const wide_string& wstr) const;
| >
| > While this form is often useful and should be retained, it may imply an
| > extra copy of the result if a compiler is not smart enough to optimize the
| > copy away.
| > Suggested change is to add additional functions in the form:
| >
| >   void to_bytes(const wide_string& wstr, byte_string & target) const;
|
| I prefer output parameters to be first (due to the occasional
| need to have defaulted parameters and the desire to find output
| parameters in the same position relative to input parameters).
| Thus, without meaning to trigger a religious war, I propose this
| version instead:
|
|   void to_bytes(byte_string & destination,
|                 wide_string const & source) const;
I would prefer these to be free-standing functions. Was there something that prohibited that?
No. I was just following the pattern set by the proposal. I'm not sure how the LWG feels about that; I'll add a mention of the issue so they can decide. Would you care to add a short rationale? Thanks for the comment, --Beman

"Beman Dawes" <bdawes@acm.org> wrote in message news:6.0.3.0.2.20050106200941.03e785f0@mailhost.esva.net...
| At 04:04 PM 1/5/2005, Thorsten Ottosen wrote:
| >| void to_bytes(byte_string & destination,
| >|               wide_string const & source) const;
| >
| >I would prefer these to be free-standing functions.
| >Was there something that prohibited that?
|
| No. I was just following the pattern set by the proposal. I'm not sure how
| the LWG feels about that; I'll add a mention of the issue so they can
| decide. Would you care to add a short rationale?
|
| Thanks for the comment,

Well, it's an algorithm, right? All the algorithms in <algorithm> are free-standing functions.

In some sense, we might need both. The class interface could be used under the hood to provide the function interface.

-Thorsten

At 03:05 PM 1/5/2005, Rob Stewart wrote:
From: Beman Dawes <bdawes@acm.org>
1. Hard-wired byte_string type in wstring_convert
-------------------------------------------------
If this suggested change is accepted, it will probably make sense to rename some wstring_convert members.
You'll probably need to enumerate the changes you think are necessary to ensure good names are selected.
You are probably right. Perhaps I'll include a synopsis which shows all the changes.
3. Need target-argument form for wstring_convert conversion functions
---------------------------------------------------------------------
wstring_convert's conversion functions are in the form:
byte_string to_bytes(const wide_string& wstr) const;
While this form is often useful and should be retained, it may imply an extra copy of the result if a compiler is not smart enough to optimize the copy away. Suggested change is to add additional functions in the form:
void to_bytes(const wide_string& wstr, byte_string & target) const;
I prefer output parameters to be first (due to the occasional need to have defaulted parameters and the desire to find output parameters in the same position relative to input parameters). Thus, without meaning to trigger a religious war, I propose this version instead:
void to_bytes(byte_string & destination, wide_string const & source) const;
I personally prefer output first too, but was following the STL practice. I'll mention your comment. Probably not worth talking about further here; the LWG will make that decision.

Thanks for your comments,

--Beman

| 1. Hard-wired byte_string type in wstring_convert
| -------------------------------------------------
|
| The underlying wstring_convert design seems flexible enough to cope with
| conversion between any two character types which meet std::basic_string
| requirements. Conversion is actually performed by std::codecvt, which is
| already parameterized by both internalT and externalT types. It seems
| artificial to restrict wstring_convert::byte_string to
| std::basic_string<char>. New character types such as the proposed char16_t
| and char32_t will need conversions to and from other wide types, yet with
| the current restriction wstring_convert could not be used for that purpose.
|
| Suggested change: replace
|   typedef std::basic_string<char> byte_string;
| with:
|   typedef std::basic_string<typename Codecvt::extern_type> byte_string;
| and change from_bytes argument types accordingly.
|
| If this suggested change is accepted, it will probably make sense to rename
| some wstring_convert members.

hm...why not remove the dependency of std::basic_string altogether and make it a template parameter. Preferably, the template parameter is just a ForwardRange FR with the requirement that a specialization of std::char_traits exists for range_value<FR>::type.

-Thorsten

At 04:07 PM 1/5/2005, Thorsten Ottosen wrote:
| 1. Hard-wired byte_string type in wstring_convert
| -------------------------------------------------
|
| The underlying wstring_convert design seems flexible enough to cope with
| conversion between any two character types which meet std::basic_string
| requirements. Conversion is actually performed by std::codecvt, which is
| already parameterized by both internalT and externalT types. It seems
| artificial to restrict wstring_convert::byte_string to
| std::basic_string<char>. New character types such as the proposed char16_t
| and char32_t will need conversions to and from other wide types, yet with
| the current restriction wstring_convert could not be used for that purpose.
|
| Suggested change: replace
|   typedef std::basic_string<char> byte_string;
| with:
|   typedef std::basic_string<typename Codecvt::extern_type> byte_string;
| and change from_bytes argument types accordingly.
|
| If this suggested change is accepted, it will probably make sense to rename
| some wstring_convert members.
hm...why not remove the dependency of std::basic_string altogether and make it a template parameter.
Jonathan Turkanis' original comment was:
(One thing I don't understand is why the character type of wbuffer_convert is allowed to be specified as the second template argument. It seems to me that the character type should always be equal to Codecvt::intern_type.)
But I think that you are closer to the real problem with the proposal; the full string type rather than just the character type should be a template parameter. That allows any std::basic_string to be used.
Preferably, the template parameter is just a ForwardRange FR with the requirement that a specialization of std::char_traits exists for range_value<FR>::type.
There are a lot of ways to redesign the interface. But if too many violent changes are made we are no longer talking about existing practice and it gets messy to try to convince the LWG. Anyone who wants to can propose an alternative to the committee, but my guess is that it would have to be dramatically better to have much chance. Thanks for your comment, --Beman

Beman Dawes wrote:
hm...why not remove the dependency of std::basic_string altogether and make it a template parameter.
Jonathan Turkanis' original comment was:
(One thing I don't understand is why the character type of wbuffer_convert is allowed to be specified as the second template argument. It seems to me that the character type should always be equal to Codecvt::intern_type.)
But I think that you are closer to the real problem with the proposal; the full string type rather than just the character type should be a template parameter. That allows any std::basic_string to be used.
I was talking about wbuffer_convert; at the time I hadn't looked at wstring_convert very closely.

Since then I started to factor the code conversion routines out of the iostreams library to make them more useful for string conversion. I haven't worked on it much since I finished the iostreams revision, but I was leaning toward an interface something like this for string conversion:

    template<typename Codecvt = use_default>
    struct string_converter { // Nice name ;-)
        // typedefs

        template<typename InIt, typename OutIt>
        OutIt narrow(InIt first, InIt last, OutIt dest);

        template<typename InIt, typename OutIt>
        OutIt widen(InIt first, InIt last, OutIt dest);

        // Convenience functions:

        template<typename WideStr> // Version of Thorsten's suggestion
        basic_string<typename Codecvt::extern_type> narrow(const WideStr&);

        template<typename NarrowStr> // Version of Thorsten's suggestion
        basic_string<typename Codecvt::intern_type> widen(const NarrowStr&);
    };

    // Convenience functions:

    template<typename InIt, typename OutIt>
    OutIt narrow(InIt first, InIt last, OutIt dest)
    {
        string_converter<> cvt;
        return cvt.narrow(first, last, dest);
    }

    template<typename InIt, typename OutIt>
    OutIt widen(InIt first, InIt last, OutIt dest)
    {
        string_converter<> cvt;
        return cvt.widen(first, last, dest);
    }

    template<typename WideStr>
    basic_string<typename Codecvt::extern_type>
    narrow(const WideStr& str)
    {
        string_converter<> cvt;
        return cvt.narrow(str);
    }

    template<typename NarrowStr>
    basic_string<typename Codecvt::intern_type>
    widen(const NarrowStr& str)
    {
        string_converter<> cvt;
        return cvt.widen(str);
    }

Remarks:

1. The names 'narrow' and 'widen' could be confused with the ctype members of the same name, which do not perform code conversion, but I like them better than 'to_bytes' and 'from_bytes' (since extern_type may not represent a byte) and 'wide_to_multi_char' and 'multi_char_to_wide' (too long).

2. The narrow and widen overloads which take iterators have the same signature as std::copy.

3. If no Codecvt template parameter is specified, an instance of codecvt<wchar_t, char, mbstate_t> is fetched from the global locale. The non-member versions of narrow and widen use this option.

4. Thorsten asks why the widening and narrowing functions shouldn't be non-member functions. One answer is that code conversion can be (slightly) more efficient if a large buffer is used. Making the core conversion functions member functions allows buffers to be used for several string conversions. A second answer is that it's a bit awkward to specify a codecvt in a non-member function:

    narrow< utf8_codecvt_facet<char_t> >
        (str.begin(), str.end(), back_inserter(dest));

or

    narrow( str.begin(), str.end(), back_inserter(dest),
            utf8_codecvt_facet<wchar_t>() );

When a non-default codecvt is being used, I think it's reasonable to ask people to use a member function, to keep the non-member usage simple.
--Beman
Jonathan

Jonathan Turkanis wrote:
4. Thorsten asks why the widening and narrowing functions shouldn't be non-member functions. One answer is that code conversion can be (slightly) more efficient if a large buffer is used. Making the core conversion functions member functions allows buffers to be used for several string conversions.
Hmmm ... This is true, but unfortunately doesn't apply to the version I posted.

To take advantage of a large buffer, the input would have to be presented in a manner that gives the implementation access to underlying character arrays. (This would be the case if, e.g., the input were basic_strings, so that data() could be used, or if there were a standard iterator category 'Contiguous Traversal': http://tinyurl.com/4wynl.)

I think the added flexibility of the overloads taking iterators is more significant than the ability to buffer.

Jonathan
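The trade-off described here can be made concrete with a sketch of an iterator-based narrow. It is an illustration only, not code from the thread: the free function narrow with a trailing std::locale parameter is a hypothetical shape, and it converts through std::codecvt<wchar_t, char, std::mbstate_t>. Because arbitrary input iterators give no access to a contiguous array, the sketch makes one virtual out() call per input character, which is the cost the buffering argument is about. A complete version would also call unshift() at the end for stateful encodings; that is omitted here.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cwchar>
#include <iterator>
#include <locale>
#include <stdexcept>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> default_codecvt;

// Hypothetical iterator-based narrow(): same shape as std::copy, plus an
// optional locale (default-constructed std::locale is a copy of the global one).
template <class InIt, class OutIt>
OutIt narrow(InIt first, InIt last, OutIt dest,
             const std::locale& loc = std::locale()) {
    const default_codecvt& cvt = std::use_facet<default_codecvt>(loc);
    std::mbstate_t state = std::mbstate_t();
    for (; first != last; ++first) {
        const wchar_t wc = *first;       // copy out: InIt need not be a pointer
        const wchar_t* in_next = 0;
        char buf[MB_LEN_MAX];            // room for one multibyte character
        char* out_next = 0;
        // One virtual call per character: flexible, but not buffered.
        const std::codecvt_base::result r =
            cvt.out(state, &wc, &wc + 1, in_next, buf, buf + MB_LEN_MAX, out_next);
        if (r == std::codecvt_base::noconv) {
            *dest++ = static_cast<char>(wc);
            continue;
        }
        if (r != std::codecvt_base::ok)
            throw std::range_error("narrow: conversion failed");
        assert(out_next >= buf && out_next <= buf + MB_LEN_MAX);
        dest = std::copy(buf, out_next, dest);
    }
    return dest;
}
```

The function accepts any input iterator whose value type converts to wchar_t, so a std::list<wchar_t> works as well as a std::wstring.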

"Jonathan Turkanis" <technews@kangaroologic.com> wrote in message news:crl0i9$7mk$1@sea.gmane.org...
| Jonathan Turkanis wrote:
|
| > 4. Thorsten asks why the widening and narrowing functions shouldn't be
| > non-member functions. One answer is that code conversion can be
| > (slightly) more efficient if a large buffer is used. Making the core
| > conversion functions member functions allows buffers to be used for
| > several string conversions.
|
| Hmmm ... This is true, but unfortunately doesn't apply to the version I posted.
|
| To take advantage of a large buffer, the input would have to be presented in a
| manner that gives the implementation access to underlying character arrays.
| (This would be the case if, e.g., the input were basic_strings, so that data()
| could be used, or if there were a standard iterator category 'Contiguous
| Traversal': http://tinyurl.com/4wynl.)
|
| I think the added flexibility of the overloads taking iterators is more
| significant than the ability to buffer.

I can't really figure out how this buffering should work. Buffering of what?

I agree that there should be iterator versions underneath the range interface (that's how it all works.)

Jonathan:
| template<typename WideStr>
| basic_string<typename Codecvt::extern_type>
| narrow(const WideStr& str)
| {
|     string_converter<> cvt;
|     return cvt.narrow(str);
| }

This could be specified as

    template< class NarrowString, class ReadableWideForwardRange, class Codecvt >
    NarrowString narrow( const ReadableWideForwardRange& r, const Codecvt& cc );
    ...
    utf8_codecvt_facet<char_t> cc;
    wstring w1 = L"bar";
    string s1 = std::narrow<string>( L"foo", cc );
    string s2 = std::narrow<string>( make_iterator_range( w1.begin(), w1.begin() + 2 ), cc );
    vector<char> s3 = std::narrow<vector<char> >( w1, cc );

Beman:
| > Preferably, the template parameter is just
| >a ForwardRange FR with the requirement that a specialization
| >of std::char_traits exists for range_value<FR>::type.
|
|There are a lot of ways to redesign the interface. But if too many violent
|changes are made we are no longer talking about existing practice and it
|gets messy to try to convince the LWG. Anyone who wants to can propose an
|alternative to the committee, but my guess is that it would have to be
|dramatically better to have much chance.

Well, we haven't talked about N1683 yet. :-)

-Thorsten

Thorsten Ottosen wrote:
"Jonathan Turkanis" <technews@kangaroologic.com> wrote:
4. Thorsten asks why the widening and narrowing functions shouldn't be non-member functions. One answer is that code conversion can be (slightly) more efficient if a large buffer is used. Making the core conversion functions member functions allows buffers to be used for several string conversions.
I think the added flexibility of the overloads taking iterators is more significant than the ability to buffer.
I can't really figure out how this buffering should work. Buffering of what?
Look at the interface of std::codecvt, e.g. at the member function in. This function takes a Byte array as input and writes wide characters to a second Byte array. Since in() is a virtual function, it's slightly faster to call it once or twice per string than to call it once for each character in a string. Similar remarks hold for out. To make this work, you need

(i) the input to be presented as a character array (std::string or const char* is fine, but a pair of forward iterators isn't)
(ii) a good sized buffer for output

The example I presented doesn't satisfy (i). But in the absence of performance data I won't worry about it.
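Conditions (i) and (ii) can be sketched as a loop around std::codecvt::in. This is a minimal illustration under assumptions, not code from the thread: the helper name widen_buffered and the 256-element buffer are made up for the example, and the facet is std::codecvt<wchar_t, char, std::mbstate_t> taken from the given locale. The input is a std::string, so its data() pointer satisfies (i), and each in() call converts as many characters as fit in the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> default_codecvt;

// Hypothetical helper: widen a whole character array with few virtual calls.
std::wstring widen_buffered(const std::string& src,
                            const std::locale& loc = std::locale()) {
    const default_codecvt& cvt = std::use_facet<default_codecvt>(loc);
    std::mbstate_t state = std::mbstate_t();
    std::wstring result;

    const std::size_t buf_size = 256;   // (ii) a good-sized output buffer
    wchar_t buf[buf_size];

    const char* from = src.data();      // (i) input as a character array
    const char* const from_end = from + src.size();

    while (from != from_end) {
        const char* from_next = from;
        wchar_t* to_next = buf;
        // One virtual call converts up to buf_size characters.
        const std::codecvt_base::result r =
            cvt.in(state, from, from_end, from_next, buf, buf + buf_size, to_next);
        if (r == std::codecvt_base::noconv) {
            result.append(from, from_end);   // no conversion needed: copy through
            break;
        }
        // Treat an error, or a call that consumed nothing (for example a
        // truncated multibyte sequence at the end), as failure.
        if (r == std::codecvt_base::error || from_next == from)
            throw std::range_error("widen_buffered: conversion failed");
        assert(to_next <= buf + buf_size);
        result.append(buf, to_next);
        from = from_next;                    // continue after the converted part
    }
    return result;
}
```

For a 1000-character string this makes a handful of in() calls rather than 1000; whether that difference matters in practice is the open performance question raised above.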
I agree that there should be iterator versions underneath the range interface (that's how it all works.)
Jonathan:
    template<typename WideStr>
    basic_string<typename Codecvt::extern_type>
    narrow(const WideStr& str)
    {
        string_converter<> cvt;
        return cvt.narrow(str);
    }
This could be specified as
    template< class NarrowString, class ReadableWideForwardRange, class Codecvt >
    NarrowString narrow( const ReadableWideForwardRange& r, const Codecvt& cc );
    ...
First, as much as I love Boost.Range and would like to see it standardized, I don't think the code conversion proposal should use it.

The second problem is that it's awkward to specify the codecvt instance when you just want a codecvt to be grabbed from the global locale:

    std::locale loc; // default-constructed: a copy of the global locale
    const std::codecvt<wchar_t, char, std::mbstate_t>& cvt =
        std::use_facet< std::codecvt<wchar_t, char, std::mbstate_t> >(loc);
    // etc.
    std::string s = narrow(ws, cvt);

Most people will want to be able to write

    std::string s = narrow(ws);

Perhaps a good solution would be to have overloads, in addition to the ones I showed, with a signature including a codecvt instance, as in your example.

Jonathan
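The overload pair suggested here might look like the following. It is a sketch under assumptions rather than proposed wording: the function name narrow and the fixed std::wstring/std::string signature are simplifications for illustration. The one-argument overload relies on the fact that a default-constructed std::locale is a copy of the current global locale.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> default_codecvt;

// Overload taking an explicit codecvt instance.
std::string narrow(const std::wstring& ws, const default_codecvt& cvt) {
    if (ws.empty()) return std::string();
    std::string buf(ws.size() * static_cast<std::size_t>(cvt.max_length()), '\0');
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result r =
        cvt.out(state, ws.data(), ws.data() + ws.size(), from_next,
                &buf[0], &buf[0] + buf.size(), to_next);
    if (r == std::codecvt_base::noconv)
        return std::string(ws.begin(), ws.end());
    if (r != std::codecvt_base::ok)
        throw std::range_error("narrow: conversion failed");
    assert(to_next >= &buf[0] && to_next <= &buf[0] + buf.size());
    buf.resize(static_cast<std::size_t>(to_next - &buf[0]));
    return buf;
}

// Convenience overload: the facet is fetched from the global locale.
std::string narrow(const std::wstring& ws) {
    const std::locale loc;   // copy of the current global locale
    return narrow(ws, std::use_facet<default_codecvt>(loc));
}
```

The common case then reads `std::string s = narrow(ws);`, while callers with a specific facet pass it as the second argument.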

"Jonathan Turkanis" <technews@kangaroologic.com> wrote in message news:crlh51$6kh$1@sea.gmane.org...
| Thorsten Ottosen wrote:
| > "Jonathan Turkanis" <technews@kangaroologic.com> wrote:
|
| >>> 4. Thorsten asks why the widening and narrowing functions shouldn't
| >>> be non-member functions. One answer is that code conversion can be
| >>> (slightly) more efficient if a large buffer is used. Making the core
| >>> conversion functions member functions allows buffers to be used for
| >>> several string conversions.
|
| >> I think the added flexibility of the overloads taking iterators is
| >> more significant than the ability to buffer.
| >
| > I can't really figure out how this buffering should work. Buffering
| > of what?
|
| Look at the interface of std::codecvt, e.g. at the member function in. This
| function takes a Byte array as input and writes wide characters to a second Byte
| array. Since in() is a virtual function, it's slightly faster to call it once or
| twice per string than to call it once for each character in a string. Similar
| remarks hold for out. To make this work, you need
|
| (i) the input to be presented as a character array (std::string or const char*
| is fine, but a pair of forward iterators isn't)

Sorry, I should have looked. That said, we can of course also consider changes to the codecvt interface; we are allowed to do that.

| > Jonathan:
| >> template<typename WideStr>
| >> basic_string<typename Codecvt::extern_type>
| >> narrow(const WideStr& str)
| >> {
| >>     string_converter<> cvt;
| >>     return cvt.narrow(str);
| >> }
| >
| > This could be specified as
| >
| > template< class NarrowString, class ReadableWideForwardRange, class
| > Codecvt > NarrowString narrow( const ReadableWideForwardRange& r,
| > const Codecvt& cc ); ...
|
| First, as much as I love Boost.Range and would like to see it standardized, I
| don't think the code conversion proposal should use it.

I just assumed that a Range based version was not that far from the iterator version you wanted.

| The second problem is that it's awkward to specify the codecvt instance when you
| just want a codecvt to be grabbed from the global locale:
|
| std::locale loc;
| const std::codecvt<wchar_t, char, std::mbstate_t>& cvt =
|     std::use_facet< std::codecvt<wchar_t, char, std::mbstate_t> >(loc);
| // etc.
| std::string s = narrow(ws, cvt);
|
| Most people will want to be able to write
|
| std::string s = narrow(ws);

    template< class NarrowString, class Range, class Codecvt >
    inline NarrowString narrow( const Range& r,
        const Codecvt& =
            std::use_facet<
                std::codecvt< typename range_value<Range>::type,
                              typename range_value<NarrowString>::type,
                              std::mbstate_t> >( std::locale() ) );

| Perhaps a good solution would be to have overloads, in addition to the ones I
| showed, with a signature including a codecvt instance, as in your example.

yes.

-Thorsten

Thorsten Ottosen wrote:
"Jonathan Turkanis" <technews@kangaroologic.com> wrote in message
First, as much as I love Boost.Range and would like to see it standardized, I don't think the code conversion proposal should use it.
I just assumed that a Range based version was not that far from the iterator version you wanted
I agree. But if it were included in the proposal, wouldn't you have to define the relevant range concepts, as well as begin() and end()? It doesn't seem like the right place to do so.
The second problem is that it's awkward to specify the codecvt instance when you just want a codecvt to be grabbed from the global locale:
    template< class NarrowString, class Range, class Codecvt >
    inline NarrowString narrow( const Range& r,
        const Codecvt& =
            std::use_facet<
                std::codecvt< typename range_value<Range>::type,
                              typename range_value<NarrowString>::type,
                              std::mbstate_t> >( std::locale() ) );
Is it known that function template parameters will be deducible from default function arguments in C++0x?
Perhaps a good solution would be to have overloads, in addition to the ones I showed, with a signature including a codecvt instance, as in your example.
yes.
Ok.
-Thorsten
Jonathan
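On the deducibility question: in C++11 as eventually published, a function template parameter still cannot be deduced from a default function argument, but function templates may carry default template arguments, which covers the same use. The sketch below shows both routes under assumptions: max_encoded_size and max_encoded_size_cxx11 are hypothetical helpers invented for the example (they only compute a rough upper bound on the narrowed length), chosen so the overload mechanics stay visible. The first pair works in C++03 through overloading; the last function needs C++11 or later.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> default_codecvt;

// Two-argument form: Codecvt is deduced from the facet actually passed.
template <class Codecvt>
std::size_t max_encoded_size(const std::wstring& ws, const Codecvt& cvt) {
    assert(cvt.max_length() >= 1);
    return ws.size() * static_cast<std::size_t>(cvt.max_length());
}

// C++03 route: a separate overload supplies the facet from the global locale,
// because Codecvt cannot be deduced from a default function argument.
inline std::size_t max_encoded_size(const std::wstring& ws) {
    const std::locale loc;   // copy of the current global locale
    return max_encoded_size(ws, std::use_facet<default_codecvt>(loc));
}

// C++11 route: the default template argument names the type when no
// function argument is present to deduce it from.
template <class Codecvt = default_codecvt>
std::size_t max_encoded_size_cxx11(
        const std::wstring& ws,
        const Codecvt& cvt = std::use_facet<Codecvt>(std::locale())) {
    return max_encoded_size(ws, cvt);
}
```

Either route keeps the one-argument call site simple while still letting a caller pass a specific facet.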

"Jonathan Turkanis" <technews@kangaroologic.com> wrote in message news:crmhbe$5lt$1@sea.gmane.org...
| Thorsten Ottosen wrote:
| >> The second problem is that it's awkward to specify the codecvt
| >> instance when you just want a codecvt to be grabbed from the global locale:
|
| > template< class NarrowString, class Range, class Codecvt >
| > inline NarrowString narrow( const Range& r,
| >     const Codecvt& =
| >         std::use_facet<
| >             std::codecvt< typename range_value<Range>::type,
| >                           typename range_value<NarrowString>::type,
| >                           std::mbstate_t> >( std::locale() ) );
|
| Is it known that function template parameters will be deducible from default
| function arguments in C++0x?

No, my fault...

-Thorsten

Jonathan Turkanis wrote:
template<typename WideStr> basic_string<typename Codecvt::extern_type>
actually the return type here is always std::string, since there is no Codecvt parameter
narrow(const WideStr& str) { string_converter<> cvt; return cvt.narrow(str); }
template<typename NarrowStr> basic_string<typename Codecvt::intern_type>
and here it's always std::wstring.
widen(const NarrowStr& str) { string_converter<> cvt; return cvt.widen(str); }
Jonathan
participants (4)
- Beman Dawes
- Jonathan Turkanis
- Rob Stewart
- Thorsten Ottosen