Re: [boost] GSoC Proposal Preparation For Encoding Awared String

18 Mar 2011

      ...
From: Soares Chen <crf@hypershell.org>
Hi all,
[snip]
I think there are several options that I can choose for my project:
1. To use Chad Nelson's code as base, try to incorporate other ideas
proposed in the mailing list, integrate with Boost.Locale, and make it
Boost quality to submit for review. If this option is chosen, I wish
that Chad Nelson can be my mentor.
2. To start a new code base, gather and compile ideas suggested in
mailing list, final design decisions made by me and my mentor but not
the community (to keep the project going on fast), make it Boost
quality and submit for review.
3. To start the boost::string project, where another better string is
reinvented and fix all the weaknesses of std::string.
4. Adopt different proposal, and improve on existing project such as
Boost.Unicode [2] or Boost.Locale [3] such that it really solves the
encoding awareness problem.
5. Any other suggestion?
Hello,

I'd like to address several points:

It would be very hard to reach consensus on how to solve the
problem.

Probably the best, and the most wishful, solution is to assume
that all strings are UTF-8 encoded; however, that is not the reality.

The problem is actually not the string but rather the way you code.

Even if you create a perfect UTF-8 string and then call

    fopen(your_perfect_string.c_str(),"r")

under Windows, it would not work, because the narrow-character API there
expects the local ANSI code page rather than UTF-8 <sigh... damn Windows>
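
As a minimal sketch of what does work there: convert the UTF-8 name to
UTF-16 and call the wide variant on Windows, and pass the bytes through
unchanged elsewhere. The names fopen_utf8 and widen_utf8 are hypothetical
helpers made up for this illustration; MultiByteToWideChar and _wfopen are
the Win32/CRT calls it assumes.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: UTF-8 bytes -> UTF-16 wide string (Windows only).
inline std::wstring widen_utf8(const std::string& s)
{
    if (s.empty())
        return std::wstring();
    const int len = static_cast<int>(s.size());
    // First call asks for the required length, second call converts.
    const int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), len, NULL, 0);
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), len, &w[0], n);
    return w;
}
#endif

// Hypothetical helper: open a file whose name is given as UTF-8.
inline std::FILE* fopen_utf8(const std::string& name, const std::string& mode)
{
    assert(!mode.empty());
#ifdef _WIN32
    // Use the wide CRT call instead of the narrow fopen().
    return _wfopen(widen_utf8(name).c_str(), widen_utf8(mode).c_str());
#else
    // The char* API takes the bytes as they are.
    return std::fopen(name.c_str(), mode.c_str());
#endif
}
```

The caller keeps a plain std::string; only the call that touches the
operating system differs per platform.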

As you can see from the multiple discussions, there are many
contradictory requirements about what a string should look
like and what it should provide.

If you want to bring better Unicode awareness to Boost, you
don't need a cool new utf-XYZ string; you need a policy.

I think boost::filesystem v3 is a big step forward: it allows you
to use UTF-8 strings on Windows, which I think is a really good
beginning.

This is my opinion.

Boost.Locale and several of my other projects (CppCMS, CppDB) live happily
with std::string.

The point is that in the vast majority of cases you don't need an
encoding-aware string, because so many of the operations you usually do on
strings are encoding agnostic. But that is another story.
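
To illustrate what "encoding agnostic" can mean in practice, here is a small
sketch that splits a UTF-8 std::string on an ASCII delimiter without ever
decoding it. The function name split_on_ascii is made up for this example.
It relies on one property of UTF-8: bytes below 0x80 never occur inside a
multi-byte sequence.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a UTF-8 string on an ASCII delimiter, working on raw bytes only.
// Because an ASCII byte can never be part of a multi-byte UTF-8 sequence,
// a byte-wise search cannot cut a character in half.
inline std::vector<std::string> split_on_ascii(const std::string& text,
                                               char delim)
{
    // The argument above only holds for ASCII delimiters.
    assert(static_cast<unsigned char>(delim) < 0x80);

    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = text.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(text.substr(start));  // last piece
            return parts;
        }
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
}
```

Concatenation, copying, comparison for equality and storing in containers
behave the same way: they move bytes around and never need to know which
encoding those bytes are in.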

Bottom line: if you want to improve the Unicode awareness of Boost,
I think you need to adopt a Boost.Filesystem v3-like policy
all over the Boost code base:

1. Use the wide API as the native one everywhere in Boost under Windows
2. Use the char * API as the native one everywhere in Boost on non-Windows platforms
3. Use std::codecvt to handle the conversion (after many tricks... )
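
A rough sketch of what such a policy could look like in code, under stated
assumptions: the names native_char, native_string and to_native are invented
for this illustration and are not taken from Boost.Filesystem, and the
Windows branch uses the C++11 std::wstring_convert / std::codecvt_utf8_utf16
facilities (deprecated in later standards) as a stand-in for whatever codecvt
machinery a real library would use.

```cpp
#include <cassert>
#include <string>

#ifdef _WIN32
#include <codecvt>
#include <locale>
typedef wchar_t native_char;   // point 1: the wide API is native on Windows
#else
typedef char native_char;      // point 2: the char * API is native elsewhere
#endif

typedef std::basic_string<native_char> native_string;

// Hypothetical boundary function: library users pass UTF-8 in a plain
// std::string, and the conversion happens once, just before the OS call.
inline native_string to_native(const std::string& utf8)
{
#ifdef _WIN32
    // Point 3: a codecvt facet does the UTF-8 -> UTF-16 conversion.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conv;
    return conv.from_bytes(utf8);
#else
    // Nothing to convert: the bytes are handed to the OS unchanged.
    return utf8;
#endif
}
```

With this shape the public interface stays on std::string everywhere, and
only the thin layer that talks to the operating system knows about
native_string.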

The Unicode String/Encoding Aware String is the last thing to do,
not the first.

Why?

1. Because you will never get consensus about what is the "right thing"
   to do (wide, narrow, UTF-8, UTF-16, etc.).

   Projects that are handled and directed by a single source or management,
   like Qt, GTK(mm), Java, C#, Python or others, may decide what the
   right thing is.

   This will never happen in Boost, as it is too pluralistic, even in cases
   where that does not always make sense, simply because of the way libraries
   are developed, reviewed and accepted: through public reviews
   that eventually encourage diversity.

2. Because you are not likely to be able to force users to actually
   use your string, as Boost is more about collaboration than enforcement
   of a specific style.

3. Even the heavy discussions so far haven't reached any conclusion. So what
   would happen at the final review of your library?

My $0.02

Artyom