
----- Original Message ----
From: Edward Diener <eldiener@tropicsoft.com>
1. Documentation
The layout of the main page is decent, but I would have expected a discussion there, or as a first topic, of what Locale brings that the C++ standard locale does not have. I was disappointed not to find such a discussion.
Actually, the standard locale provides many things in a limited way; each topic is discussed in its own section, so I don't think it was necessary to provide a specific explanation for each of them. More details can be found here: http://cppcms.sourceforge.net/boost_locale/html/appendix.html#rationale_why
[snip]
a. Introduction to C++ Standard Library localization support
The common critical problems of the C++ locale portion of the standard library seem spurious to me. The problems mentioned are really about implementations or programmer usage, not the C++ locale library itself. The only valid problem I find mentioned there is that the C++ standard library did not attempt to standardize any locale names. This makes using C++ locales based on locale names non-portable.
There is much more mentioned than that:
1. I've seen lots of libraries broken by setting the global locale (see the sketch below).
2. Almost all standard C++ libraries have bugs in their locale implementations; for example, both GCC's libstdc++ and SunStudio's standard library may generate invalid UTF-8?!
3. Some things are defective by design.
4. Some things are badly implemented.
5. Some things (like message formatting) are simply non-existent.
6. Most standard libraries provide only the C and POSIX locales.
Are these few? They are all mentioned. I assume you haven't worked much with the locales of the standard C++ library, because there are lots of issues.
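To illustrate point 1, a small hypothetical sketch of how an innocent call to std::locale::global() by a library breaks unrelated code (the de_DE.UTF-8 locale must be installed on the system, otherwise the std::locale constructor throws):

    #include <locale>
    #include <sstream>
    #include <iostream>

    int main() {
        // Imagine a third-party library doing this behind your back:
        std::locale::global(std::locale("de_DE.UTF-8"));

        // Any stream created afterwards copies the global locale...
        std::ostringstream cfg;
        cfg << 3.14;                      // ...so this now writes "3,14"
        std::cout << cfg.str() << "\n";   // downstream parsers expecting '.' break
    }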
Unfortunately the issues there make a very weak argument for the Locale library itself.
The library unifies and fixes existing problems.
b. Locale generation
I would have liked it if the doc here specified where one finds valid lists of language, country, encoding, and variant which make up a locale name. Without this information, the one valid problem mentioned regarding C++ locale is also a problem with Locale.
Actually, I explicitly referred to the ISO-639 and ISO-3166 standards. These lists are updated once in a while, and I don't think the Boost.Locale library should duplicate them.
The note about wide strings and 8-bit encoding makes no sense to me at all. If I am using a wide string encoding, why would I not be using wide string iostreams?
You may choose not to specify the encoding; in that case it is assumed to be US-ASCII. But I strongly recommend against doing that: I recommend always using UTF-8, as is mentioned in the documentation.
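For illustration, a minimal sketch of generating a UTF-8 locale and installing it; the locale name is only an example, any ISO-639/ISO-3166 pair works the same way:

    #include <boost/locale.hpp>
    #include <iostream>

    int main() {
        boost::locale::generator gen;
        // Locale name: language_COUNTRY.ENCODING; UTF-8 is the recommended encoding.
        std::locale loc = gen("en_US.UTF-8");
        // Install it globally and imbue the streams that will use it.
        std::locale::global(loc);
        std::cout.imbue(loc);
    }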
c. Collation
There is no explanation about what 'collation' is about. This is very disappointing, as it makes the rest of the discussion difficult to follow.
Especially for you and those who are not familiar with localization terminology, there is a glossary: http://cppcms.sourceforge.net/boost_locale/html/appendix.html#glossary A quick glance at it would answer your question.
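To make the section easier to follow, a small sketch of what collation gives you; the exact ordering depends on the backend, but the point is that the generated locale can be used directly as a comparator because std::locale carries a std::collate facet:

    #include <boost/locale.hpp>
    #include <algorithm>
    #include <string>
    #include <vector>

    int main() {
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        std::vector<std::string> words = { "Zebra", "apple", "Banana" };
        // Plain byte comparison would give: Banana, Zebra, apple.
        // Locale-aware collation gives the dictionary order: apple, Banana, Zebra.
        std::sort(words.begin(), words.end(), loc);
    }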
This is one reason why I dislike documentation which attempts to teach by example. It always seems to assume that throwing examples at the reader, before anything about the classes/templates in the examples has been explained, is somehow an effective way of learning a library. Instead it just creates confusion and unfortunately serves as a way for a library implementer to avoid explaining how the classes in his library actually work or relate to each other.
It takes some time to learn what std::locale is and how it works. I had written a small introduction to std::locale, but I do expect developers and users to open the documentation of the standard C++ library in order to understand things deeply. I can't cover every possible topic in this tutorial; I expect a user who comes to localize software to learn about it not only from the given tutorial but also from other sources, in the same way a library that provides TCP/IP and networking support assumes that you know a little about sockets, network addresses, ports and so on.
[snip]
d. Conversions
"You may notice that there are existing functions to_upper and to_lower in the Boost.StringAlgo library. The difference is that these function operate over an entire string instead of performing incorrect character-by-character conversions."
I do not understand how these conversion functions use a locale. The example gives: boost::locale::to_upper(gruben) used in a stream. Is this function using the locale imbued in the iostream ?
Actually, clicking the function names in the tutorial takes you to the reference documentation, which shows that they receive a locale as a parameter. In most applications that use a single language in their interface, the global locale is expected to be set, which makes everything much easier.
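For example, both call forms in a minimal sketch (the strings are just examples):

    #include <boost/locale.hpp>
    #include <iostream>

    int main() {
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        // Explicit locale parameter:
        std::cout << boost::locale::to_upper("hello world", loc) << "\n";

        // With the global locale installed, the parameter can be omitted:
        std::locale::global(loc);
        std::cout << boost::locale::to_upper("hello world") << "\n";
    }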
[snip]
e. Numbers, Time and Currency formatting and parsing
A bunch of ICU flags are mentioned but with no indication about how these are supposed to be used by iostreams. These flags look like they are supposed to be used by C-like format printf statements but since Locale uses iostreams I can not understand their purpose with Locale.
First of all, these are not ICU flags. They are stream manipulators, just like std::hex or std::setprecision. I don't know how familiar you are with stream manipulators, but they behave the same way as the standard library's manipulators.
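A small sketch of the manipulators in use with an imbued stream (the actual output depends on the locale and the backend):

    #include <boost/locale.hpp>
    #include <iostream>
    #include <ctime>

    int main() {
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");
        std::cout.imbue(loc);

        using namespace boost::locale;
        std::cout << as::number   << 10345.345    << "\n";  // locale-aware number
        std::cout << as::currency << 1234.56      << "\n";  // money formatting
        std::cout << as::date     << std::time(0) << "\n";  // today's date
    }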
f. Messages Formatting (Translation)
GNU gettext should be explained where it interfaces with Locale. Just telling someone to learn GNU gettext is not adequate. Other than that, the explanation is pretty thorough.
The point is that there is a huge knowledge base about GNU Gettext. I could write many pages about it, but I try to keep the documentation simple and clean. You can always learn more from the large number of tutorials around. You will have to familiarize yourself with translation programs like Poedit or Lokalize to work with dictionaries, and you will have to learn the best practices. Localization is a very wide topic and it is not trivial to explain it in the scope of this document; that is why external sources are required, in the same way Boost.Asio does not need to explain what a socket is.
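To keep it simple, here is only the minimal wiring between the generator and the Gettext dictionaries; the messages path and domain below are hypothetical and must match your own compiled .mo files:

    #include <boost/locale.hpp>
    #include <iostream>

    int main() {
        boost::locale::generator gen;
        gen.add_messages_path(".");        // where the <lang>/LC_MESSAGES/*.mo files live
        gen.add_messages_domain("my_app"); // the gettext domain (hypothetical name)

        std::locale loc = gen("de_DE.UTF-8");
        std::cout.imbue(loc);

        // translate() returns a proxy object; the dictionary lookup happens
        // when it is written to the (imbued) stream.
        std::cout << boost::locale::translate("Hello World") << std::endl;
    }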
g. Character Set Conversions
An explanation of what character sets are, and what character set conversions entail, should be the beginning of this documentation.
Character set conversion is more of a utility function; Boost.Locale goes well beyond this.
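Still, for orientation, a small sketch of the conversion utilities (assuming the backend recognizes the "Latin1" charset name):

    #include <boost/locale.hpp>
    #include <string>

    int main() {
        using namespace boost::locale::conv;

        std::string latin1 = "Gr\xFC\xDF Gott";               // ISO-8859-1 bytes
        std::string utf8   = to_utf<char>(latin1, "Latin1");  // Latin1 -> UTF-8
        std::string back   = from_utf(utf8, "Latin1");        // UTF-8  -> Latin1
        std::wstring wide  = utf_to_utf<wchar_t>(utf8);       // UTF-8  -> wide (UTF-16/32)
    }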
h. Localized Text Formatting
"Each format specifier is enclosed within {} brackets.."
These are not "brackets" but "braces". Brackets are '[]'.
Noted, thanks.
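For completeness, a small sketch of the brace placeholders in use (the format string itself is just an example):

    #include <boost/locale.hpp>
    #include <iostream>
    #include <ctime>

    int main() {
        boost::locale::generator gen;
        std::cout.imbue(gen("en_US.UTF-8"));

        // {1}, {2}, ... refer to the parameters fed in with operator%.
        std::cout << boost::locale::format("Today is {1,date} and I have {2,num} messages")
                     % std::time(0) % 42
                  << std::endl;
    }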
i. In general
It is confusing to me how generated locales affect the functionality of the different sections presented under 'Using Boost.Locale'. In a number of situations I am looking at classes or functions and I have no idea how these pick up a locale. I do understand that when used with iostreams the locale is determined by the locale imbued in the iostream. But outside of iostreams I do not understand from the documentation what locale is being used.
The reference documentation refers in many places to signatures like std::string foobar(std::string c, std::locale const &loc = std::locale()); The default argument std::locale() means "take the default locale", which is the global one. This is not a concept of Boost.Locale but of the standard C++ library.
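For example, a sketch of that pattern with a hypothetical foobar function; the default argument std::locale() is simply a copy of whatever the global locale currently is:

    #include <locale>
    #include <string>

    // A hypothetical function following the same convention as the reference docs.
    std::string foobar(std::string const &c, std::locale const &loc = std::locale())
    {
        // ... do something locale-dependent with c ...
        return c;
    }

    int main() {
        std::locale::global(std::locale::classic()); // whatever the application installs
        foobar("text");                 // uses the global locale via the default argument
        foobar("text", std::locale());  // exactly equivalent
    }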
If it is the C++ global locale, the documentation should say so. This entire issue about how locales are actually being used in various parts of the library should be explained as part of an overall explanation of the library. I find this good overall explanation of the library the major flaw in the documentation.
It is something that is part of the standard library, not Boost.Locale. Maybe it is not always clear, but this is why you sometimes need to get your hands dirty and actually write some code to get used to the concepts, in the same way you will never understand how Boost.Asio works until you write some code with it.
[snip]
In general I think that using the global locale is a bad programming practice when one specifically intends to work with locales. Unfortunately it was hard for me to understand how individual locales are used with each of the parts of the library from the documentation. But I will assume for the time being, because it seems the only correct design, that each part of the library which is documented can work with some non-global locale which is created and passed around as necessary.
Actually, 99% of programs around use the global locale; most interactive programs have one user who speaks one language, so setting the locale globally more than makes sense. For example, GtkMM and Gtk support only the global locale and it serves them quite well. The situation where you need several locales in one program is generally a client-server solution. For example, CppCMS, the web framework I originally developed Boost.Locale for, keeps the locale in a special context and in the output stream. It is really domain dependent, and the developer should decide how to move the locale around.
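A rough sketch of the client-server style, with a hypothetical render_date helper and example locale names; each stream gets its own locale and the global one is never touched:

    #include <boost/locale.hpp>
    #include <sstream>
    #include <string>
    #include <ctime>

    // Each "request" renders with its own locale, independent of the global one.
    std::string render_date(std::locale const &loc)
    {
        std::ostringstream out;
        out.imbue(loc);
        out << boost::locale::as::date << std::time(0);
        return out.str();
    }

    int main() {
        boost::locale::generator gen;
        std::string for_german   = render_date(gen("de_DE.UTF-8"));
        std::string for_japanese = render_date(gen("ja_JP.UTF-8"));
    }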
The only design flaw which I could discover in the library was in message translation.
If so, I hope I'll convince you that the decision Boost.Locale made is the right one and gives the best results for software localization. Sometimes it is hard to understand why the best practices are the way they are until you get burned, especially for a topic like localization where each person sees only his side of the story (his language, his culture), and that is natural.
The fact that translation always begins from English (or perhaps some other narrow-character language) to something else is horrendous. I can understand that the Locale implementer wanted to use something popular that already exists, but an idea so tremendously flawed in its conception either needs to be changed, if possible, or discarded for something better. I do understand that translation is just one part of this large library, but I hope that the implementer understands how ridiculous it is to assume that non-English programmers are going to be willing to translate from English to language X rather than from their own language to language X.
I'll explain it the way I explained it before: there are many reasons to have English as the core/source language rather than the developer's native language. Leaving aside the technical notes (I'll come to them later), here is why English source strings are the best practice.

a) Most software around is only partially translated; translations are frequently out of sync with the main development line, and beta versions usually ship with only limited translation support. Now consider yourself a beta tester of a program developed in Japan by programmers who do not know English well. You try to open a file and you see the message:
"File this is bad format" // Bad English
or:
"これは不正なファイル形式です。" // Good Japanese (actually translated with Google :-)
Oops?! I hope it is clearer now. Even those of us who do not speak English well are already used to seeing English as the international language and can handle partially translated software, but with this "natural" method it would be all Greek to us.

b) In many cases, especially with GNU Gettext, your customer or just a volunteer can take a dictionary template, sit for an hour or two with Poedit, and produce an acceptable-quality translation of a medium-size program; load it, test it, and send it back. This actually happens 95% of the time in the open source world, and it happens in the closed source world as well. The reason: it is easy and accessible to everyone. You do not have to be a programmer to do it; you do not even have to be a professional translator with a degree in linguistics to translate messages from English to your own language. That is why it is the best practice, and that is why all translation systems around use the same technique. It is not ridiculous, it is not strange; it is reality, and actually not such a bad reality.

Now the technical reasons:

1. There is a total mess with encodings between different compilers, so it would be quite hard to rely on the charset of localized strings in the sources. On Windows it would likely be one of the 12xx or 9xx code pages; on Unix it would likely be UTF-8. And it is actually impossible to make both MSVC and GCC see the same UTF-8 encoded string L"שלום-سلام-peace" in the same source, because MSVC wants a BOM and all the other "sane" compilers do not accept a BOM in sources at all.

2. You want to be able to convert the string to the target character type very quickly if it is missing from the dictionary. Since you cannot use wide Unicode strings in sources because of the problem above, using wide strings would force a charset conversion, whereas for English and ASCII it is just a byte-by-byte cast. You do not want to do a charset conversion for every string at runtime.

3. All translation systems (not only Gettext) assume ASCII keys as input, and you do not want to create yet another translation file format, because you would then need to:
a) port it to other languages, as projects may want to use a unified system;
b) develop nice GUI tools like Lokalize or Poedit (which have been under development for years);
c) try to convince all users around that your reinvented wheel is better than gettext (and it would not be).

I hope it is clear enough now. (I'll keep this description for future reviewers, as I had never thought somebody would actually ask about it.)
I am assuming that all other parts of the library support both narrow character encodings and wide character encodings fully, and that at least UTF-16 is always supported using wide characters. It was really hard for me to make out from the docs whether or not this was the case.
A small note: wchar_t != UTF-16. In fact, wide strings are UTF-16 only on Windows; on all other platforms they are UTF-32. That is why I almost never refer to UTF-16 directly but rather to wide or narrow strings. For best portability, use UTF-8 and narrow strings.
I believe a great deal of work was put into the library, and that this work is invaluable in bringing locale usage into C++ in a better way than it is currently supported in C++ locale.
But I would have to vote "No" on accepting the library into Boost at the current time, with some provisos which would most likely lead me to change to a "Yes" vote in the future.
1) The documentation should explain the differences and improvements of Locale over C++ locale in a good general way.
It is mentioned in the rationale and throughout the tutorial several times. If you have specific questions or problems, I can add them to the tutorial.
2) The documentation should explain the use of locales for each of the topics.
3) A number of topics should discuss what they are about in general, and more time should be given to discuss how the classes/templates relate to each other.
I can't explain all Unicode/localization-related topics in the tutorial, in the same way a TCP/IP library can't explain socket concepts from scratch.
4) Message translation should be reconsidered. I don't mean to say that the way it is done is enough to have the library rejected, but I can not believe that it is not a flawed system no matter what the popularity of Gnu gettext might be.
It will not happen, because this is a matter of best practice and the correct approach. These things were not invented by me but by many experienced people who have worked in this area for years. In the same way, you would not tell your student to use gets, even though it is possible to use in certain situations, simply because it is bad practice.
My major trouble with the library, which has led to my "No" vote, is that I can not really understand from the documentation how to use the library or what it really offers over and above C++ locales. I realize the library programmer may not himself be a native English speaker, but if I can not really understand a library from its documentation in a way that makes sense to me, I can not vote for its inclusion into Boost. I strongly suspect that if I were to understand the functionality of the library through a more rigorous explanation of its topics, and each topic's relationship to a locale class and various encodings, I might well vote for its inclusion into Boost. But for now, and in the state in which the documentation resides for me, I can not do so. So I hope this review will be understood, at least partially, as a request to improve the docs as much as anything else.
One way or the other, thank you for the review, and I hope you will reconsider your vote given the detailed reasons I have laid out.

Artyom