[nowide] Easy Unicode For Windows: Request For Comments/Preliminary Review

Hello all Boosters,
I would like to get comments on a library that I want to submit for a formal review.
The library provides an implementation of standard C and C++ library functions such that their inputs are UTF-8 aware on Windows, without requiring the use of the Wide API to make a program work on Windows.
Library: Boost.Nowide
Download: http://cppcms.com/files/nowide/nowide.zip
Documents: http://cppcms.com/files/nowide/html/
Features: http://cppcms.com/files/nowide/html/index.html#main_the_solution
Tested on:
OS: Windows 7 32/64 bit, Linux
Compilers: GCC-4.6, MSVC-10
Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

On Mon, May 28, 2012 at 3:33 PM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
[...] The library provides an implementation of standard C and C++ library functions such that their inputs are UTF-8 aware on Windows without requiring using Wide API to make program work on Windows.
Hi,
I'm happy that this is being proposed to Boost. My comments:
* I find the way you handle the main() arguments elegant.
* I don't like that the convert function is overloaded for both narrow and wide conversions.
Rationale: Consider the following real-world scenario:
    // Some existing overloaded function, like the std::fstream constructor on Dinkumware
    void f(const std::string &s);  // 3rd party 'ANSI' codepage
    void f(const std::wstring &s); // 3rd party 'UNICODE'
    std::string str = get_utf8_string();
    f(convert(str)); // we want to call the wide string version
Now during development we may change it to:
    std::wstring str = get_string_from_windows(); // we changed only this line
    f(convert(str)); // and forgot to change this one. oops...
Solution: This is an error that can be caught at compile time; we just have to state the intent clearly. Use alternative names? (narrow/widen)
Cheers,
-- Yakov
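To make the suggestion concrete, here is a minimal sketch of distinctly named conversion functions. The names widen and narrow follow the suggestion above; the ASCII-only bodies and the two f overloads are placeholder assumptions for illustration, not the library's implementation, which would perform a real UTF-8/UTF-16 conversion.

```cpp
#include <string>

// Placeholder conversions: ASCII only, for illustration.
// A real implementation would transcode between UTF-8 and UTF-16.
std::wstring widen(const std::string& s)
{
    std::wstring out;
    for (char c : s)
        out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    return out;
}

std::string narrow(const std::wstring& s)
{
    std::string out;
    for (wchar_t c : s)
        out.push_back(static_cast<char>(c));
    return out;
}

// Stand-ins for an overloaded third-party function.
int f(const std::string&)  { return 1; } // 'ANSI' version
int f(const std::wstring&) { return 2; } // wide version

// The direction of the conversion is now part of the call.
int call_wide(const std::string& utf8)
{
    return f(widen(utf8));
}
```

With these names, changing the argument's type to std::wstring while leaving the widen call in place no longer compiles, because there is no widen overload taking a std::wstring. The silent switch to the wrong overload of f becomes a compile-time error.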

________________________________ From: Yakov Galka <ybungalobill@gmail.com> On Mon, May 28, 2012 at 3:33 PM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
[...] The library provides an implementation of standard C and C++ library functions such that their inputs are UTF-8 aware on Windows without requiring using Wide API to make program work on Windows.
Hi,
I'm happy that this is getting to be proposed to boost.
Also note, it is different from the old version of my nowide library I published once: added argv, argc, env and cin/cout/cerr/log such that you can actually write and read Unicode characters to/from console...
My comments:
* I find the way you handle the main() arguments elegant.
* I don't like that the convert function is overloaded for both narrow and wide conversions. Rationale: Consider the following real-world scenario:
    // Some existing overloaded function, like std::fstream constructor on dinkumware
    void f(const std::string &s);  // 3rd party 'ANSI' codepage
    void f(const std::wstring &s); // 3rd party 'UNICODE'
    std::string str = get_utf8_string();
    f(convert(str)); // we want to call the wide string version
Now during development we may change it to:
    std::wstring str = get_string_from_windows(); // we changed only this line
    f(convert(str)); // and forgot to change this one. oops...
Solution: This is an error that can be caught at compile time, we just have to state the intent clearly. Use alternative names? (narrow/widen)
Very good point. I'll change them to widen/narrow.
Cheers, -- Yakov
Artyom Beilis

Hello,
To make the purpose of Boost.Nowide clearer, I'll add an example from the docs.
Let's write a simple program that conforms to the C++2011 or C++2003 standard and counts the number of lines in a file:
    #include <fstream>
    #include <iostream>
    int main(int argc,char **argv)
    {
        if(argc!=2) {
            std::cerr << "Usage: file_name" << std::endl;
            return 1;
        }
        std::ifstream f(argv[1]);
        if(!f) {
            std::cerr << "Can't open a file " << argv[1] << std::endl;
            return 1;
        }
        int total_lines = 0;
        while(f) {
            if(f.get() == '\n')
                total_lines++;
        }
        f.close();
        std::cout << "File " << argv[1] << " has " << total_lines << " lines" << std::endl;
        return 0;
    }
Any bugs?
This trivial program would not work on Windows if the file name is a Unicode file name: argv does not hold a Unicode string, std::ifstream can't open a Unicode file name, and std::cout can't print Unicode characters to the console...
Boost.Nowide provides alternatives for common standard library functions and suggests a general pattern for handling Unicode strings in a cross-platform program:
    #include <boost/nowide/fstream.hpp>
    #include <boost/nowide/iostream.hpp>
    #include <boost/nowide/args.hpp>
    int main(int argc,char **argv)
    {
        // Fix arguments - argv holds Unicode string (UTF-8)
        boost::nowide::args a(argc,argv);
        if(argc!=2) {
            boost::nowide::cerr << "Usage: file_name" << std::endl;
            return 1;
        }
        // Fix fstream - it can open a file using a Unicode file name (UTF-8)
        boost::nowide::ifstream f(argv[1]);
        if(!f) {
            // cerr can print Unicode characters to the console regardless of the console code page
            boost::nowide::cerr << "Can't open a file " << argv[1] << std::endl;
            return 1;
        }
        int total_lines = 0;
        while(f) {
            if(f.get() == '\n')
                total_lines++;
        }
        f.close();
        // cout can print Unicode characters to the console regardless of the console code page
        boost::nowide::cout << "File " << argv[1] << " has " << total_lines << " lines" << std::endl;
        return 0;
    }
This is the general approach. The library also provides glue conversion functions to handle Unicode at the API boundary level where needed:
    #ifdef _WIN32
    bool copy_file(std::string const &src,std::string const &tgt)
    {
        return CopyFileW(boost::nowide::convert(src).c_str(),
                         boost::nowide::convert(tgt).c_str(),
                         TRUE);
    }
    #else
    bool copy_file(std::string const &src,std::string const &tgt)
    {
        // POSIX implementation
    }
    #endif
Waiting for comments:
Download: http://cppcms.com/files/nowide/nowide.zip
Documents: http://cppcms.com/files/nowide/html/
Artyom Beilis
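The glue conversion at the API boundary is essentially a UTF-8 to UTF-16 transcoding step. Below is a minimal, portable sketch of that step. The name utf8_to_utf16 is hypothetical, it returns std::u16string rather than std::wstring so that it behaves the same on every platform, and it is not the library's own code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative UTF-8 -> UTF-16 transcoder (hypothetical helper).
// Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in)
{
    // Smallest code point allowed for each sequence length (rejects overlong forms).
    static const char32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows, where wchar_t holds UTF-16 code units, the result could be copied into a std::wstring before being passed to a Wide API function such as CopyFileW.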

Hi, I just read the documentation, so far this library looks nice. On Tue, May 29, 2012 at 11:25 PM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
boost::nowide::args a(argc,argv);
args is an object maintaining the lifetime of the new values that argv will point to. Is my understanding correct? Joel Lamotte

----- Original Message -----
From: Klaim - Joël Lamotte <mjklaim@gmail.com> To: boost@lists.boost.org Cc: Sent: Tuesday, May 29, 2012 5:38 PM Subject: Re: [boost] [nowide] Easy Unicode For Windows: Request For Comments/Preliminary Review
Hi, I just read the documentation, so far this library looks nice.
On Tue, May 29, 2012 at 11:25 PM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
boost::nowide::args a(argc,argv);
args is an object maintaining the lifetime of the new values that argv will point to.
Is my understanding correct?
Joel Lamotte
Yes, you understand correctly.
So the main function
    int main(int argc,char **argv[,char **env])
    {
        ...
    }
is simply changed to
    int main(int argc,char **argv[,char **env])
    {
        boost::nowide::args a(argc,argv[,env])
        ...
    }
where a is the args instance that holds the "replaced" values.
Artyom Beilis

On Tue, May 29, 2012 at 11:46 PM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
int main(int argc,char **argv[,char **env]) {
boost::nowide::args a(argc,argv[,env])
...
}
I know this is somewhat stupid, but isn't it easy to run into this case?
    int main(int argc,char **argv[,char **env])
    {
        {
            boost::nowide::args a(argc,argv[,env])
        }
        //...
        std::cout << argv[0]; // crash?
    }
Or do you restore argv in the args destructor?
Joel Lamotte

Hi, I just read the documentation, so far this library looks nice.
+1
So main function
int main(int argc,char **argv[,char **env]) { ...
}
Simply changed to
int main(int argc,char **argv[,char **env]) { boost::nowide::args a(argc,argv[,env])
...
}
Having done things similar to this before (if never with any elegance), I think that the above won't work in all situations.
The problem is that the CRT may have already applied some conversion to the argv strings based on your current code page if your application was called with UTF-16LE strings (this is rarely the case if typed at a command prompt, but it certainly can happen if called via a shortcut or via a user clicking on a file with an associated file type).
The solution is to use wmain on Windows rather than main, and to do any conversion to UTF-8 there, not in main. Here is a (rather hacked, I admit) example I've done:
    #ifdef WIN32
    // Use wmain under Windows to get Unicode strings, which we convert to UTF-8;
    // the standard conversion maps to the local code page rather than to UTF-8.
    int wmain(int argc, wchar_t* wargv[])
    {
        // convert wargv to UTF-8 strings
        char ** argv = new char *[argc];
        for ( int i = 0; i < argc; ++i )
        {
            utf8string temp( wargv[i] ); // cvt to utf8
            argv[i] = new char[ temp.getBufferSize() ];
            memcpy( argv[i], temp.c_str(), temp.getBufferSize() );
        }
    #else
    int main(int argc, char* argv[])
    {
    #endif
where utf8string is a class I've used for efficient UTF-8, UTF-16 and UCS32 conversions, which behaves like a std::string with a few bells and whistles on.
It might be worthwhile adding this wmain workaround information to your library and providing a boost::nowide::args constructor which takes wchar_t.
Hope this is of some use.
Alex
ps apologies if formatting is odd - nabble crashed on me trying to reply, so I ended up using Outlook and modifying the reply manually from the message digest, which is never great ....

On Tue, May 29, 2012 at 7:19 PM, Alex Perry <Alex.Perry@smartlogic.com> wrote:
Having done things similar to this before (if never with any elegance). I think that the above won't work in all situations.
You guessed the implementation incorrectly. Please see the sources. I've always done it the way you said (with wmain/main), but this is why I said that Artyom's solution is more elegant. args doesn't even read the argc/argv arguments. It uses CommandLineToArgvW and GetEnvironmentStringsW to get the wide strings from Windows, then converts them to UTF-8 and assigns the pointer to it back to the local parameters of main. -- Yakov
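The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Boost.Nowide source: utf8_command_line is a hypothetical name, the environment handling via GetEnvironmentStringsW is left out, and on non-Windows platforms the function simply passes argv through unchanged. On Windows the program would need to link against Shell32 for CommandLineToArgvW.

```cpp
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>
#endif

// Hypothetical helper: returns the command-line arguments as UTF-8 strings.
std::vector<std::string> utf8_command_line(int argc, char** argv)
{
    std::vector<std::string> result;
#ifdef _WIN32
    // Ignore the CRT-provided argv; ask Windows for the wide command line.
    (void)argc;
    (void)argv;
    int wargc = 0;
    wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (!wargv)
        return result;
    for (int i = 0; i < wargc; ++i) {
        // First call computes the required size, including the terminating NUL.
        int n = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, nullptr, 0, nullptr, nullptr);
        std::vector<char> buf(n > 0 ? n : 1, '\0');
        if (n > 0)
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, buf.data(), n, nullptr, nullptr);
        result.push_back(std::string(buf.data()));
    }
    LocalFree(wargv);
#else
    // POSIX: the bytes are passed through untouched.
    result.assign(argv, argv + argc);
#endif
    return result;
}
```

A wrapper like args would then keep the returned strings alive and point main's argv at them.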

Yakov Galka wrote
You guessed the implementation incorrectly. Please see the sources. I've always done it the way you said (with wmain/main), but this is why I said that Artyom's solution is more elegant.
Doh! - should have looked before posting - very neat! But maybe a note/explanation in the documentation? If for nothing else, just so those who think they are cleverer than they are (like me) could use this library without having to browse the source? Alex

I know this is somewhat stupid but isn't it easy to get in this case?
    int main(int argc,char **argv[,char **env])
    {
        {
            boost::nowide::args a(argc,argv[,env])
        }
        //...
        std::cout << argv[0]; // crash?
    }
Or do you restore argv in the args destructor?
Joel Lamotte
Actually, that is a good point, and it is a good idea to restore the old argc/argv parameters. I'll add this! Artyom Beilis
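The restore-on-destruction idea can be sketched as a small RAII guard. This is a hypothetical illustration, not the library's args class: args_guard is an invented name, and the replacement strings are passed in explicitly here, whereas the real class obtains them from Windows.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical RAII guard: replaces argc/argv for its lifetime and
// restores the original values in its destructor.
class args_guard {
public:
    args_guard(int& argc, char**& argv, std::vector<std::string> replacement)
        : argc_ref_(argc), argv_ref_(argv),
          old_argc_(argc), old_argv_(argv),
          storage_(std::move(replacement))
    {
        // Build a NULL-terminated pointer array over the owned strings.
        for (std::string& s : storage_)
            pointers_.push_back(&s[0]);
        pointers_.push_back(nullptr);
        argc_ref_ = static_cast<int>(storage_.size());
        argv_ref_ = pointers_.data();
    }

    ~args_guard()
    {
        // Put the original values back so argv never dangles.
        argc_ref_ = old_argc_;
        argv_ref_ = old_argv_;
    }

    args_guard(const args_guard&) = delete;
    args_guard& operator=(const args_guard&) = delete;

private:
    int& argc_ref_;
    char**& argv_ref_;
    int old_argc_;
    char** old_argv_;
    std::vector<std::string> storage_;
    std::vector<char*> pointers_;
};
```

With this design, using argv after the guard goes out of scope in a nested block reads the original strings again instead of freed memory.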

On Wed, May 30, 2012 at 3:55 AM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
Actually good point and good idea to restore old argc/argv parameters.
I'll add this!
Ah yes I only figured now that you didn't have any constructor... Happy to help. Joel Lamotte

Hi Artyom, On Mon, May 28, 2012 at 2:33 PM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
I comments on a library that I want to submit for a formal review.
The library provides an implementation of standard C and C++ library functions such that their inputs are UTF-8 aware on Windows without requiring using Wide API to make program work on Windows.
here are my 0.02 Euro:
I completely agree that for general-purpose text storage and handling (reading lines from a text file or console, reading user input from a GUI, displaying formatted (and localized) messages to the user in a UI, etc.) UTF-8 should *finally* be adopted. The other encodings (including UCS-2, UTF-16/32) have their uses, but should be treated as special cases.
The nowide library is certainly useful within the (limited) scope of working with text obtained from the OS and passed to the OS, where you can make some assumptions, guess the encoding that the OS uses, and do the conversions from and to UTF-8, BUT ...
many text-handling applications also tend to use third-party libraries which have their own ideas about text encodings, and your library would be *much* more useful if it allowed applications to "talk" to such libraries (or devices).
So let me reiterate some points I already mentioned in the earlier text-related discussions here:
1) Let's use std::string as an encoding-agnostic string, as it has always been; the encoding of the data stored in the string should be application dependent.
2) Let's implement a text storage class (and let's call it) text. This class would store text (internally in whatever encoding is the "best" on the specific platform) and would have the following functions defined:
    /* UTF-8 encoded */ std::string str(text t);
  - This function would return a std::string containing the text stored in t, encoded in UTF-8.
    template <typename SymbolicEncodingTag>
    text text::from(std::basic_string<typename SymbolicEncodingTag::CharT> s);
  - This function would convert the string stored in s to text, assuming that s is encoded in the encoding specified by SymbolicEncodingTag.
    template <typename SymbolicEncodingTag>
    std::basic_string<typename SymbolicEncodingTag::CharT> text::to(text t);
  - This function would convert the text stored in t to a string encoded in the encoding specified by SymbolicEncodingTag.
The encoding tags would specify both concrete encodings like UTF-16 or ISO-8859-2, etc., and symbolic encodings like OS (which would autodetect the OS's encoding) or libFoo (which would use libFoo's encoding). Actually, the library would not have to specify many tags for concrete third-party libraries (maybe only the most popular). Instead it would provide some means for applications to define tags based on their needs.
The text class would be used to store text in class members, function parameters, variables, etc., and would be converted to a string (in whatever encoding) only when the contents of the text have to be examined byte-by-byte, CP-by-CP, etc., or passed to an OS, library or device requiring a specific encoding. Also, initialization of text from C string literals should be handled correctly on various platforms/compilers.
If I'm not terribly mistaken, all the code for conversions between encodings is already part of Boost.Locale. Then all the useful things like the nowide::args class and the wrappers around iostreams, etc. could be implemented on top of that.
Best,
Matus
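A rough sketch of how such a tag-based interface might look is given below. Everything here is an assumption made for illustration: the tag names utf8_tag and latin1_tag, the choice of UTF-8 as the internal storage, the use of a member function to() instead of the free-standing form above, and the restriction to two encodings. UTF-8 input is stored as-is without validation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical encoding tags; each names the character type of its strings.
struct utf8_tag   { typedef char CharT; };
struct latin1_tag { typedef char CharT; };

class text {
public:
    // Build a text from a string in the encoding named by Tag.
    template <typename Tag>
    static text from(const std::basic_string<typename Tag::CharT>& s)
    {
        return text(to_utf8(s, Tag()));
    }

    // Produce a string in the encoding named by Tag.
    template <typename Tag>
    std::basic_string<typename Tag::CharT> to() const
    {
        return from_utf8(utf8_, Tag());
    }

private:
    explicit text(std::string utf8) : utf8_(std::move(utf8)) {}

    static std::string to_utf8(const std::string& s, utf8_tag) { return s; }
    static std::string to_utf8(const std::string& s, latin1_tag)
    {
        std::string out;
        for (char ch : s) {
            unsigned char b = static_cast<unsigned char>(ch);
            if (b < 0x80) {
                out.push_back(ch);
            } else { // U+0080..U+00FF become two-byte sequences
                out.push_back(static_cast<char>(0xC0 | (b >> 6)));
                out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
            }
        }
        return out;
    }

    static std::string from_utf8(const std::string& s, utf8_tag) { return s; }
    static std::string from_utf8(const std::string& s, latin1_tag)
    {
        std::string out;
        for (std::size_t i = 0; i < s.size(); ++i) {
            unsigned char b = static_cast<unsigned char>(s[i]);
            if (b < 0x80) { out.push_back(s[i]); continue; }
            // Only lead bytes 0xC2 and 0xC3 encode code points that fit in Latin-1.
            if ((b != 0xC2 && b != 0xC3) || i + 1 >= s.size())
                throw std::range_error("not representable in ISO-8859-1");
            unsigned char c = static_cast<unsigned char>(s[++i]);
            if ((c & 0xC0) != 0x80)
                throw std::range_error("invalid UTF-8");
            out.push_back(static_cast<char>(((b & 0x03) << 6) | (c & 0x3F)));
        }
        return out;
    }

    std::string utf8_; // internal storage (assumed to be UTF-8 in this sketch)
};
```

An application would write text::from<latin1_tag>(s) at the boundary with a Latin-1 library and t.to<utf8_tag>() wherever it needs UTF-8 bytes. Supporting a new library would mean adding a tag and a pair of conversion overloads.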

----- Original Message -----
From: Matus Chochlik <chochlik@gmail.com> To: boost@lists.boost.org Cc: Sent: Wednesday, May 30, 2012 10:20 AM Subject: Re: [boost] [nowide] Easy Unicode For Windows: Request For Comments/Preliminary Review
Hi Artyom,
On Mon, May 28, 2012 at 2:33 PM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
I comments on a library that I want to submit for a formal review.
The library provides an implementation of standard C and C++ library functions such that their inputs are UTF-8 aware on Windows without requiring using Wide API to make program work on Windows.
here are my 0.02 Euro:
I completely agree that for general-purpose text storage and handling (reading lines from text-file/console, reading user input from GUI, displaying formatted (and localized) messages to the user in a UI, etc., etc.) UTF-8 should *finally* be adopted. The other encodings (including UCS-2, UTF-16/32) have their uses, but should be treated as special cases.
The nowide library is certainly useful within the (limited) scope of working with text obtained from the OS and passed to the OS where you can make some assumptions and guess the encoding that the OS uses and do the conversions from and to UTF8, BUT ...
many text-handling applications tend also use third-party libraries which also have their own ideas about text encodings and your library would be *much* more useful if it allowed to "talk" to such libraries (or devices).
So let me reiterate some points I already mentioned in the earlier text-related discussions here:
[snip]
1) Let's use std::string as a encoding-agnostic string... [snip]
2) Let's implement a text storage class (and let's call it) text; This class would store text ... [snip]
[snip] The encoding tags would specify both concrete encodings like UTF-16 or ISO-8859-2, etc. and symbolic encodings [snip]
I want to stop this direction and discussion before it begins.
This library is not a generic library to handle text in all encodings, handle all possible 3rd party libraries, and convert between them, and it is not intended to be one.
The potential user of this library does not want to handle 101 encodings; they want to use ONE SINGLE encoding all over their application, convert the strings to the Wide encoding at Windows library boundaries, and pass the UTF-8 strings as-is in Unix programs.
Note: The developer that uses this library considers the ANSI API broken and only the Wide API a valid API on Windows.
So no, this library is not Boost.Text; it is: "I want to use UTF-8 in my application... and I want to use only the Wide API on Windows as the only correct API to use"
[snip]
If I'm not terribly mistaken all the code for conversions between encodings already is part of Boost.Locale.
Yes, and Boost.Nowide uses the UTF-to-UTF conversion part of Boost.Locale (which is header-only).
Then all the useful things like the nowide::args class and the wrappers around iostreams, etc. could be implemented on top of that.
The library does not reinvent the wheel :-), it uses boost::locale::utf... (of which I am, BTW, the author)
Best,
Matus
Artyom Beilis

1) Let's use std::string as a encoding-agnostic string... [snip]
2) Let's implement a text storage class (and let's call it) text; This class would store text ... [snip]
[snip] The encoding tags would specify both concrete encodings like UTF-16 or ISO-8859-2, etc. and symbolic encodings [snip]
I want to stop this direction and discussion before it begins.
strange, but OK :)
This library is not generic library to handle text in all encodings and handle all possible 3rd part libraries and convert between them, and this library is not intended to be so.
I know it is not. What I'm saying is that it could be. There are lots of libraries that pick some subset of text handling, implement some useful things and then stop. Which is a shame because text handling su*ks in C++ and Boost is one of the platforms that have the influence to finally improve things.
The potential user of this library do not want to handle 101 encodings one wants to use ONE and SINGLE encoding all over its application and convert the strings to Wide encoding on Windows libraries boundaries and pass the UTF-8 string as is on Unix programs.
See above.
Note: The developer that uses this library considers ANSI API as broken and only Wide API is a valid API on Windows.
You will get no arguments from me, I agree with you on this point.
So no this library is not Boost.Text it is:
"I want to use UTF-8 in may application... and I want to use only Wide API on Windows as the only correct API to use"
Which limits the usability of the library, because most (Windows and Linux) applications that I worked on also used third party libraries which sometimes have their own issues with encodings (similar to the Windows API)
If I'm not terribly mistaken all the code for conversions between encodings already is part of Boost.Locale.
Yes and Boost.Nowide uses UTF-to-UTF conversion part (that is header only one in Boost.Locale)
Then all the useful things like the nowide::args class and the wrappers around iostreams, etc. could be implemented on top of that.
The library does not reinvent the wheel :-), it uses boost::locale::utf... (which I BTW the author of it)
I never said that it reinvents the wheel (and I know that you are the author of Boost.Locale, I wrote one of the reviews) I certainly don't want to push you into something that you don't want to do. You asked for opinions I just gave you mine.

----- Original Message -----
From: Matus Chochlik <chochlik@gmail.com> To: boost@lists.boost.org Cc: Sent: Wednesday, May 30, 2012 2:25 PM Subject: Re: [boost] [nowide] Easy Unicode For Windows: Request For Comments/Preliminary Review
1) Let's use std::string as a encoding-agnostic string... [snip]
2) Let's implement a text storage class (and let's call it)
text;
This class would store text ... [snip]
[snip] The encoding tags would specify both concrete encodings like UTF-16 or ISO-8859-2, etc. and symbolic encodings [snip]
I want to stop this direction and discussion before it begins.
strange, but OK :)
The problem is that there were enough discussions on this topic... and they had not brought a solution.
This library is not generic library to handle text in all encodings and handle all possible 3rd part libraries and convert between them, and this library is not intended to be so.
I know it is not. What I'm saying is that it could be. There are lots of libraries that pick some subset of text handling, implement some useful things and then stop. Which is a shame because text handling su*ks in C++ and Boost is one of the platforms that have the influence to finally improve things.
The problem is that Unicode handling is so wide that it is almost impossible to cover everything, and generally you need to cut at some point. In any case, I'd prefer to have a narrowly scoped and useful library that does what it should do. I used this nowide approach in the CppCMS framework and it saved me from a huge number of problems, so I'm sharing it. Also note that boost::nowide::c(out|in|err) is really something interesting, as it finally makes things work.
I certainly don't want to push you into something that you don't want to do. You asked for opinions I just gave you mine.
I see :-) Artyom Beilis

On Mon, May 28, 2012 at 8:33 AM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
Hello all Booster,
I comments on a library that I want to submit for a formal review.
The library provides an implementation of standard C and C++ library functions such that their inputs are UTF-8 aware on Windows without requiring using Wide API to make program work on Windows.
Both the above and the docs seem to focus on the problems of UTF-8 awareness on Windows. That's a problem well worth solving, but... Am I correct in assuming that the library allows writing portable programs that handle UTF-8 strings correctly on other operating systems, too, regardless of whether the native narrow string encoding is UTF-8 or something different? For example, a POSIX-like operating system set up to use some legacy Asian character set encoding? --Beman

----- Original Message -----
From: Beman Dawes <bdawes@acm.org> On Mon, May 28, 2012 at 8:33 AM, Artyom Beilis <artyomtnk@yahoo.com> wrote:
Hello all Booster,
I comments on a library that I want to submit for a formal review.
The library provides an implementation of standard C and C++ library functions such that their inputs are UTF-8 aware on Windows without requiring using Wide API to make program work on Windows.
Both the above and the docs seem to focus on the problems of UTF-8 awareness on Windows. That's a problem well worth solving, but...
Am I correct in assuming that the library allows writing portable programs that handle UTF-8 strings correctly on other operating systems, too, regardless of whether the native narrow string encoding is UTF-8 or something different? For example, a POSIX-like operating system set up to use some legacy Asian character set encoding?
--Beman
Great question.
No, on POSIX platforms it is actually inherently incorrect to convert strings to/from locale encodings.
You can create or remove a file named "\xFF\xFF.txt" (invalid UTF-8), or pass it as a parameter to a program, and it would work even if the current locale is a UTF-8 locale. Also, if you change the locale from, let's say, en_US.UTF-8 to en_US.ISO-8859-1, it would not magically change all the files in the OS or the strings a user may pass to the program. (This would work on all POSIX OSes, and even under Mac OS X.)
POSIX OSes treat strings as NUL-terminated cookies, so altering their content according to the locale would actually lead to incorrect behavior. For example, if I create a program "rm":
    #include <cstdio>
    int main(int argc,char **argv)
    {
        for(int i=1;i<argc;i++)
            std::remove(argv[i]);
        return 0;
    }
it would work with ANY locale, and changing the strings would lead to incorrect behavior.
The locale on POSIX platforms does not have the same meaning as the locale on the Windows platform.
A few additional points:
- On POSIX platforms a locale sometimes does not have an encoding: the frequently used C locale does not actually define an encoding!
- Non-UTF-8 locales are considered deprecated today, and it is common practice to require that a program run under a UTF-8 locale, especially since this can be trivially changed by setting one environment variable.
Bottom line:
1. The situation is not symmetric: on POSIX platforms strings are cookies, unlike on the Windows platform.
2. There are good reasons not to alter the encoding.
Artyom Beilis
P.S.: I had already sent this message to the list, but it seems to have been lost.
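The "strings are cookies" point can be tried out with a small sketch. The helper name create_and_remove and the /tmp path in the usage note are assumptions for illustration, and the behavior described is what one would expect on a typical Linux filesystem; some filesystems may enforce their own rules on file names.

```cpp
#include <cstdio>
#include <string>

// Creates an empty file at `path`, then removes it, passing the bytes
// to the C library exactly as given. Returns true if both steps succeed.
bool create_and_remove(const std::string& path)
{
    std::FILE* f = std::fopen(path.c_str(), "wb");
    if (!f)
        return false;
    std::fclose(f);
    return std::remove(path.c_str()) == 0;
}
```

Calling create_and_remove("/tmp/nowide_demo_\xFF\xFF.txt") passes a name that is not valid UTF-8. No locale-based conversion happens anywhere along the way, which is why re-encoding such strings according to the locale would only break the lookup.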
participants (7)
- Alex Perry
- alex_perry
- Artyom Beilis
- Beman Dawes
- Klaim - Joël Lamotte
- Matus Chochlik
- Yakov Galka