
On Sun, Aug 14, 2011, Artyom Beilis wrote:
Except that these are the default narrow string encodings on Windows (and sometimes even on Linux) in China, Japan and Korea...
No they are not rare encodings.
Well, we are talking about *cough* Windows *cough* here; it's not like developers use it widely by their *own* choice. (except Shift-JIS) But I think of these encodings as deprecated encodings that developers should not use in newer programs. Of course we have older-generation programmers who don't care about portability and insist on using their old favorite encodings, and we know how long it takes to fully deprecate something. But Boost.Ustr's intended audience is those who badly *want* everything in Unicode but are forced to somehow deal with a small portion of legacy code that still uses the old MBCS encodings. On the other hand, I don't really consider it a use case to make Boost.Ustr easy for hard-core developers who *insist* on continuing to use the MBCS encodings while expecting Boost.Ustr to let them use new Unicode libraries on their MBCS strings. Sure, you can do that, but it is currently outside my scope and intention to support it.
The better solution would be to create an index of Shift-JIS -> code-point and use it; you can probably build it lazily on the first attempt at backward iteration.
This is what I do for Boost.Locale.
Sorry, I thought the index is the same as the translation table, and that's how the decoding is supposed to work?
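A minimal sketch of one possible reading of the lazy index suggested above: a table of character start offsets, built by a single forward scan the first time backward movement is requested. The names sjis_is_lead_byte and sjis_boundary_index are hypothetical and belong to neither Boost.Ustr nor Boost.Locale. The sketch only finds character boundaries; mapping each character to a code point would still need a translation table.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true for bytes that start a two-byte Shift-JIS character.
inline bool sjis_is_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical class: remembers where every character starts, so that
// stepping backward never has to guess whether a byte is a trail byte.
class sjis_boundary_index {
public:
    explicit sjis_boundary_index(const std::string& bytes)
        : bytes_(bytes), built_(false) {}

    bool built() const { return built_; }

    // Number of characters in the string.
    std::size_t size() {
        build();
        return starts_.size();
    }

    // Start offset of the character that ends just before byte offset `pos`.
    // Precondition: `pos` is a character boundary (or the end) and pos > 0.
    std::size_t previous(std::size_t pos) {
        build();
        std::vector<std::size_t>::const_iterator it =
            std::lower_bound(starts_.begin(), starts_.end(), pos);
        return *(it - 1);
    }

private:
    // One forward scan, performed only on first use.
    void build() {
        if (built_) return;
        for (std::size_t i = 0; i < bytes_.size();) {
            starts_.push_back(i);
            unsigned char b = static_cast<unsigned char>(bytes_[i]);
            // A truncated two-byte character at the end is treated as one byte.
            i += (sjis_is_lead_byte(b) && i + 1 < bytes_.size()) ? 2 : 1;
        }
        built_ = true;
    }

    std::string bytes_;
    std::vector<std::size_t> starts_;
    bool built_;
};
```

The index matters for a byte sequence such as 'A', 0x83, 0x5C, 'B': the 0x5C is the trail byte of a two-byte character, but when read in isolation from the right it looks like an ASCII backslash.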
Ok I see this code:
class dynamic_codepoint_iterator_object : public std::iterator<std::bidirectional_iterator_tag, codepoint_type> [...]
THIS IS VERY BAD DESIGN.
------------------
I've been there.
Think of
while(pos!=end) { code_point = *pos; ++pos; }
How many virtual calls are required for one code point?
1. equals 2. dereference 3. increment
This is a horrible way to do things.
It depends on how you look at it, actually, but I'm not surprised that most C++ programmers would complain about such a design and about anything that involves virtual functions. (I remember someone also complained that your Boost.Locale library contains even a minimal number of virtual functions. :) ) People can look at it the same way they look at how many machine instructions are spent on a for-loop in Python or JavaScript. It is really a matter of whether you prefer minimal coding with slower performance, or more coding with better performance. For me, I'd just choose the right design for the right situation. I cleanly separated the static and dynamic parts into two separate classes, so that you have full freedom to choose whichever class you see fit. The reason I designed dynamic_unicode_string this way is so that you can work transparently with any type of string, be it std::string, std::u16string, std::wstring, std::vector, or anything else. And surely the only way to achieve such a flexible design is by using virtual functions. If your code is performance critical, you can just use unicode_string_adapter and ignore the dynamic_unicode_string class.
I've started to work on a generic "abstract-iterator" for Boost.Locale but haven't completed the work yet. It reduces the number of virtual calls per character to below 1.
Can you show me the code of your abstract-iterator? Perhaps your objective is different from mine so our designs are different, or perhaps you do have a better design that I can learn from.
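For illustration only, here is a sketch of one common way to get below one virtual call per code point: make the virtual function decode a whole chunk into a buffer, and let a non-virtual reader hand out code points from that buffer. This is not the Boost.Locale abstract-iterator, whose code is not shown in this thread. The names codepoint_source, latin1_source and buffered_reader are hypothetical, as is the 64-element chunk size, and codepoint_type is assumed to be a 32-bit unsigned integer.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

typedef std::uint32_t codepoint_type;  // assumption: 32-bit code points

const std::size_t chunk_size = 64;     // arbitrary size chosen for the sketch

// Hypothetical interface: ONE virtual call decodes up to `max` code points.
class codepoint_source {
public:
    virtual ~codepoint_source() {}
    // Writes at most `max` code points to `out`; returns how many were
    // written. A return value of 0 means end of input.
    virtual std::size_t read(codepoint_type* out, std::size_t max) = 0;
};

// Toy source that treats every byte as one Latin-1 code point and counts
// how often the virtual function is entered.
class latin1_source : public codepoint_source {
public:
    explicit latin1_source(const std::string& s)
        : bytes_(s), pos_(0), calls_(0) {}

    std::size_t read(codepoint_type* out, std::size_t max) override {
        ++calls_;
        std::size_t n = 0;
        while (n < max && pos_ < bytes_.size()) {
            out[n++] = static_cast<unsigned char>(bytes_[pos_++]);
        }
        return n;
    }

    std::size_t calls() const { return calls_; }

private:
    std::string bytes_;
    std::size_t pos_;
    std::size_t calls_;
};

// Non-virtual front end: refills its buffer with one virtual call per chunk.
class buffered_reader {
public:
    explicit buffered_reader(codepoint_source& src)
        : src_(src), pos_(0), len_(0) {}

    // Stores the next code point in `cp`; returns false at end of input.
    bool next(codepoint_type& cp) {
        if (pos_ == len_) {
            len_ = src_.read(buf_, chunk_size);
            pos_ = 0;
            if (len_ == 0) return false;
        }
        cp = buf_[pos_++];
        return true;
    }

private:
    codepoint_source& src_;
    codepoint_type buf_[chunk_size];
    std::size_t pos_;
    std::size_t len_;
};
```

With this layout a 200-byte input costs five virtual calls (four chunks plus the final empty read) instead of three per code point. The trade-off is that the reader is a forward-only stream rather than a bidirectional iterator.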
This is not the way to go.
(BTW you forgot the clone() member function)
Oh yah, ok I'll add that when I have the time. Thanks!
The entire motivation behind this library was to provide some "Unicode Wrapping" over different encodings.
And telling me that the library is "locale" agnostic makes it unsuitable for the proposed motivation, because non-Unicode encodings do change at run time.
But forget locale: the motivation requires supporting at least conversion from the runtime OS ANSI codepage to Unicode, which the library does not support.
This is the biggest flaw of the current library.
I am sorry, but can you give me an example of how that actually happens? I'm not familiar with Windows development, so I don't know most of its quirks. By changing locale at run time, do you mean that an existing char* string stored on the stack/heap can suddenly change its byte content to fit an encoding that is changed at run time? Or does the run-time change only affect new char* strings obtained from the Windows API, while old char* strings still retain their old byte content and encoding? If it is the latter, I don't see why the snippet proposed in my previous message doesn't solve your problem. By locale-agnostic, what I actually mean is to leave the locale-related problems to a higher layer. I think locale and encodings are two separate issues, and they should really be implemented in different classes and libraries. And actually I don't like the design of a locale-aware function that does not delegate the actual functionality to a locale-agnostic function. For me, all locale-aware functions should be implemented in two steps: one that detects the locale, and then delegates to another specific function that does the work and ignores the current locale.
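The two-step split described above could look roughly like the following sketch. Every name in it (encoding, count_codepoints, detect_locale_encoding, count_codepoints_in_locale) is hypothetical and not part of Boost.Ustr. The detection step is a deliberately naive heuristic based on the C locale name, and the Latin-1 fallback is an arbitrary choice for the example.

```cpp
#include <clocale>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical set of encodings, kept tiny for the sketch.
enum class encoding { utf8, latin1 };

// Locale-agnostic worker: the encoding is an explicit argument and the
// current locale is never consulted.
std::size_t count_codepoints(const std::string& bytes, encoding enc) {
    if (enc == encoding::latin1) return bytes.size();  // one byte per code point
    std::size_t n = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) ++n;  // skip UTF-8 continuation bytes
    }
    return n;
}

// Detection step: guess the narrow encoding from the current C locale name.
// Assumption: names that mention UTF-8 mean UTF-8; anything else is treated
// as Latin-1 here purely for illustration.
encoding detect_locale_encoding() {
    const char* name = std::setlocale(LC_CTYPE, nullptr);
    if (name && (std::strstr(name, "UTF-8") || std::strstr(name, "utf8"))) {
        return encoding::utf8;
    }
    return encoding::latin1;
}

// Locale-aware wrapper: detects, then delegates. It contains no decoding
// logic of its own.
std::size_t count_codepoints_in_locale(const std::string& bytes) {
    return count_codepoints(bytes, detect_locale_encoding());
}
```

Because the worker never looks at the locale, a string obtained while a different encoding was active can still be processed correctly, as long as the caller remembers which encoding it had.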
Or maybe just convert "uncommon-encoding" to UTF-8/UTF-16/UTF-32 and forget all the wrapper?
Yes, if you can do it and assume every std::string is UTF-8, then of course you don't need Boost.Ustr anymore. But until that happens, Boost.Ustr is what I think will be needed for the moment. :)
Dear Soares Chen Ruo Fei,
(Sorry, I have no idea which is the first name :-) )
You can call me Soares, which is the unofficial English name I gave myself because my given name is hard for English speakers to pronounce. Chen is my family name. I have to keep my given name in the email because that is my official name, and I thought it might make it easier for GSoC to track my posts.
Don't get me wrong. I see what you are trying to do and it is good thing.
There is one big problem:
You have entered a very-very-very dangerous swamp.
It is very easy to make wrong assumptions, not because you don't try to do your best but because even such a small problem is very diverse. And the "best" you can do is to assume that if something can go wrong, it will.
I do realize that I am *always* trying to do something very dangerous in almost all my previous and current projects. I don't like the harsh criticism I get for my radical ideas and ambitions, and I don't like myself for choosing projects that are "dangerous". But since that's who I really am, I have no choice but to live with it and strive for eventual success.
It is not an accident that there is no "unicode string" in Boost and that there are too many "fancy" Unicode strings around like QString, icu::UnicodeString, gtk::ustring and many others that somehow fail to provide the goodies you need.
That doesn't stop me from trying, and from learning something valuable in case it still fails. After all, this is what GSoC really should be about, right: to try to solve something challenging and learn by making mistakes. ;)
cheers,
Soares