
On Tue, Sep 7, 2010 at 11:55 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Wed, Sep 8, 2010 at 06:28, Artyom <artyomtnk@yahoo.com> wrote:
2. What strings should be used? std::string, std::wstring, custom string like Qt's QString or GTKmm's ustring?
As a Windows programmer I say: use UTF-8 with std::string. See Pavel's answer here: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf.... Why UTF-8? In non-GUI code std::string is more common. A single-byte encoding is also the default for std::exception::what(), for example. Anyway, if there is no consensus on this topic, you can still support both through a configuration (typedef std::string/std::wstring tstring;).
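A minimal sketch of the "both through a configuration" idea from the quoted message. The USE_WIDE_STRINGS switch, the TSTR macro, and greeting() are hypothetical names chosen for illustration; none of them come from the thread.

```cpp
#include <string>

// Hypothetical build switch: define USE_WIDE_STRINGS to select std::wstring.
#ifdef USE_WIDE_STRINGS
typedef std::wstring tstring;
#define TSTR(s) L##s   // pastes the L prefix onto a string literal
#else
typedef std::string tstring;
#define TSTR(s) s
#endif

// Code written against tstring and TSTR compiles under either setting.
inline tstring greeting() { return TSTR("hello"); }
```

Note that this only switches the code unit type; it does not by itself decide which encoding the bytes or wide characters hold.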
UTF-8 is variable-length encoded. Some people find that inconvenient, because they need to search for substrings instead of simple code units.

UTF-16 is variable-length encoded. Some people like to treat it as fixed-length, and they are doing it wrong. The Windows API was originally built with UCS-2 in mind -- it only later got UTF-16 support added. I often wonder if they would have gone that route had they known UCS-2 would soon run out of uses.

UTF-32 is fixed-length, but it wastes a lot of space for most cultures.

If you are processing Unicode correctly, then you will most likely need to deal with grapheme clusters (the visible individual characters that get rendered to your screen). Grapheme clusters can be made of multiple code points, and there are even different combinations of code points that create the same grapheme cluster. So for any real Unicode processing, the encoding does not matter because it is always effectively variable-length and often not usable with simple Unicode-ignorant functions like strstr().

I see two use cases with strings:

a) You are using your strings so trivially that it doesn't matter that they are Unicode or anything else. You're basically just copying sequences of bytes around that you got from somewhere else, and probably eventually passing them to a renderer which handles (b).

b) You are processing your strings in a meaningful way, where it matters that they are Unicode, and it will be unfortunately complex no matter what you do.

So I'd make the argument that it _does not matter_ what encoding is used. Make an arbitrary choice!

--
Cory Nelson
http://int64.org
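To make the grapheme cluster point concrete, here is a small sketch under stated assumptions. The helper names count_code_points and contains_bytes are made up for this example, the input is assumed to be well-formed UTF-8, and the two strings are spelled as explicit bytes so the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Count code points in UTF-8 by skipping continuation bytes (10xxxxxx).
// Assumes well-formed input; performs no validation.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Two spellings of the same visible word ("cafe" with an acute accent on the e):
const std::string precomposed = "caf\xC3\xA9";   // ends in U+00E9
const std::string decomposed  = "cafe\xCC\x81";  // ends in U+0065 U+0301

// Byte-wise substring search, the std::string equivalent of strstr().
// It knows nothing about Unicode equivalence.
bool contains_bytes(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}
```

Both strings render as four grapheme clusters, yet precomposed is 5 bytes and 4 code points while decomposed is 6 bytes and 5 code points. contains_bytes(decomposed, "\xC3\xA9") is false even though the accented letter is visibly there, and contains_bytes(decomposed, "cafe") is true even though the reader sees an accent on the last letter. Getting either case right needs normalization or grapheme segmentation from a Unicode library, whichever encoding the string uses.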