
Mathias Gaunard wrote:
Eric Niebler wrote:
The invariant of what? The internal data over which the iterators traverse? Which iterators? All of them? Are you really talking about an invariant (something that is true of the data both before and after each operation completes), or of pre- or post-conditions?
The invariant satisfied by a model of UnicodeRange. If R is a model of UnicodeRange, any instance r of R shall satisfy the condition that begin(r)...end(r) is a valid, normalized Unicode string properly encoded in UTF-x.
The invariant must be satisfied at any time.
So a range that lazily verifies well-formedness while it is being traversed cannot satisfy the UnicodeRange concept? I think if we were to ever have a Unicode string type, it would make sense to define a class invariant. For the Unicode concepts (which have not yet been specified), each operation in the concept may specify pre- and post-conditions, in addition to complexity guarantees and semantics.
Functions taking models of UnicodeRange as input can assume the invariant as a precondition on their input and should maintain it as a postcondition as well. Functions providing models of UnicodeRange as output should have it as a postcondition on their output.
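To make that pre/postcondition framing concrete, here is a minimal sketch; the functions and their names are purely illustrative, not part of the proposed library. It shows a consumer that assumes the invariant and a producer that must establish it:

    #include <cstddef>
    #include <iterator>
    #include <vector>

    // Consumer: may assume the invariant on its input, so it never re-validates.
    // Precondition: begin(r)...end(r) is valid, NFC-normalized UTF-32.
    template<typename UnicodeRange32>
    std::size_t count_code_points(const UnicodeRange32& r)
    {
        return static_cast<std::size_t>(std::distance(std::begin(r), std::end(r)));
    }

    // Producer: must establish the invariant before returning.
    // Postcondition: the returned vector is valid, NFC-normalized UTF-32.
    std::vector<char32_t> make_greeting()
    {
        // U+00E9 is the precomposed form, so the sequence is already in NFC.
        return { U'h', U'i', U' ', 0x00E9 };
    }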
[...] but maybe that should be something orthogonal. I personally don't think it's really useful for general-purpose text, though.
I should hope there is a way to operate on valid Unicode ranges that happen not to be in normalization form C.
One way to operate on such data would be to normalize it beforehand. No information is supposed to be lost by normalizing to form C.
That's one way, yes. Another way would be to provide an adaptor that normalizes on the fly, if such a thing could be done. There may still be other ways.
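To illustrate what "normalizing on the fly" could look like, here is a deliberately toy sketch: the composition table is reduced to a single hardcoded pair and the interface is a plain next()/at_end() pair rather than a conforming iterator. A real adaptor would need the full Unicode composition data and canonical reordering.

    #include <iostream>
    #include <vector>

    class toy_nfc_iterator
    {
        std::vector<char32_t>::const_iterator it_, end_;
    public:
        toy_nfc_iterator(std::vector<char32_t>::const_iterator it,
                         std::vector<char32_t>::const_iterator end)
            : it_(it), end_(end) {}

        bool at_end() const { return it_ == end_; }

        // Returns the next normalized code point, consuming one or two
        // code points from the underlying sequence.
        char32_t next()
        {
            char32_t cp = *it_++;
            if (cp == U'e' && it_ != end_ && *it_ == 0x0301) // combining acute
            {
                ++it_;
                return 0x00E9; // precomposed "é"
            }
            return cp;
        }
    };

    int main()
    {
        std::vector<char32_t> decomposed = { U'c', U'a', U'f', U'e', 0x0301 };
        toy_nfc_iterator it(decomposed.begin(), decomposed.end());
        while (!it.at_end())
            std::cout << std::hex << static_cast<unsigned>(it.next()) << ' ';
        std::cout << '\n'; // prints: 63 61 66 e9
    }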
Substring search, for example, requires the two strings being compared to be normalized, or at least the relevant parts, so that canonically equivalent things compare as equal.
Or else Unicode comparison algorithms can be written generally so as not to require both sequences to be normalized first. The comparison could also dispatch to a more efficient implementation if both ranges are known to be in the same normalization form.
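As a hedged sketch of that dispatch idea (the tag types and the contains signature are assumptions for illustration, not a proposed interface), tag dispatch on what is known about the normalization form might look like this:

    #include <algorithm>

    struct unknown_form {};
    struct nfc_form {};

    // Fallback: would normalize (or canonically compare) before searching.
    template<typename Range1, typename Range2>
    bool contains(const Range1& haystack, const Range2& needle,
                  unknown_form, unknown_form)
    {
        // ... normalize copies of both inputs here, then search ...
        return false; // placeholder, not implemented in this sketch
    }

    // Fast path: both ranges are already NFC, so a direct code point search
    // finds canonically equivalent occurrences (edge cases at cluster
    // boundaries aside, and ignoring e.g. case folding).
    template<typename Range1, typename Range2>
    bool contains(const Range1& haystack, const Range2& needle,
                  nfc_form, nfc_form)
    {
        return std::search(haystack.begin(), haystack.end(),
                           needle.begin(), needle.end()) != haystack.end();
    }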
We can choose to do that behind the user's back, or rather make it so that we don't need to; the latter allows us to keep things simple.
Are we writing a handful of one-off routines that manipulate /some/ Unicode data in /some/ predefined ways, or are we trying to build the foundation upon which a full, generic Unicode library can be based? I'm hoping for the latter. That's a much bigger job. It's ok if we do the former, as long as we see it for what it is: the non-generic kernel from which we can later distill generic algorithms and concepts.
Of course, being normalization-form-agnostic and making the normalization form a separate concept, allowing the best possible algorithm to be selected (I may not have the time to write all the versions, however), is more powerful because it doesn't make any concessions.
Right.
I just want to know whether it's really worth it to complicate this.
For GSoC, we may decide not. The first step of the generic programming process is to write a big pile of non-generic code and start sifting through it to find the concepts. It's possible that your GSoC project produces the non-generic code. But it will not be a Boost.Unicode library until the generic Unicode algorithms have been distilled and the concepts defined.
The library provides the following core types in the boost namespace:
uchar8_t, uchar16_t, uchar32_t
In C++0x, these are called char, char16_t and char32_t. I think uchar8_t is unnecessary, and for a Boost Unicode library, boost::char16 and boost::char32 would work just fine. On a C++0x compiler, they should be typedefs for char16_t and char32_t.
The character types not being unsigned could lead to issues during promotions or conversions.
I also personally think "char" is better to mean "locale-specific character" than "utf-8 character", so I thought a distinct name for the type was more appropriate.
Anyway, embracing the standard way is what should be done, I agree. I'll just have to make sure I'm careful of conversions.
Good.
Does Boost have macros to detect these yet?
BOOST_NO_CHAR16_T BOOST_NO_CHAR32_T
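For reference, here is a minimal sketch of how boost::char16 / boost::char32 could be defined on top of those Boost.Config macros; the unsigned fallback types are my assumption, not a decision from this thread:

    #include <boost/config.hpp>
    #include <boost/cstdint.hpp>

    namespace boost
    {
    #ifdef BOOST_NO_CHAR16_T
        typedef uint_least16_t char16; // compiler lacks char16_t: fixed-width fallback
    #else
        typedef char16_t char16;       // C++0x built-in type
    #endif

    #ifdef BOOST_NO_CHAR32_T
        typedef uint_least32_t char32;
    #else
        typedef char32_t char32;
    #endif
    }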
And the UnicodeGrapheme concept doesn't make sense to me. You say, "A model of UnicodeGrapheme is a range of Unicode code points that is a single grapheme cluster in Normalized Form C." A grapheme cluster != Unicode code point. It may be many code points representing a base character and many zero-width combining characters. So what exactly is being traversed by a UnicodeGrapheme range?
A UnicodeGrapheme is one grapheme cluster, i.e. a range of code points.
Basically, to iterate over grapheme clusters, a range of code points would be adapted into a range of ranges of code points.
OK, I see. It's a range of code points that demarcates a single, whole grapheme cluster. This could be useful.
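As an illustration of that "range of ranges" shape (grapheme_ranges and the nested loop in the comment are hypothetical, not the proposed API), iterating grapheme clusters might look like this:

    #include <iostream>
    #include <vector>

    int main()
    {
        // Three grapheme clusters: {a}, {x, U+0301}, {b}. The <x, U+0301>
        // pair has no precomposed form, so this sequence is already in NFC.
        std::vector<char32_t> text = { U'a', U'x', 0x0301, U'b' };

        // With a hypothetical adaptor, the traversal would be a nested loop:
        //
        //   for (auto grapheme : grapheme_ranges(text))   // range of ranges
        //       for (char32_t cp : grapheme)              // code points of one cluster
        //           use(cp);
        //
        std::cout << text.size() << " code points, 3 grapheme clusters\n";
    }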
The concepts are of critical importance, and these don't seem right to me. My C++0x concept-foo is weak, and I'd like to involve many more people in this discussion.
In C++0x concept-foo, I would say: associate the validity invariant with a non-auto concept (i.e. the model has to specify that it implements the concept, unlike auto concepts, which are implicitly structurally matched to models), so as to distinguish ranges that aim at maintaining that invariant from ranges that don't.
Whoa, we're nowhere near talking about auto vs. non-auto C++0x concepts. We don't even have Unicode algorithms to work from, so we don't know what form the concepts will take. You're jumping to the end. I repeat: you don't know what the concepts are and you won't know until you write some concrete algorithms and try to lift generic implementations from them. Please have a look at: http://www.generic-programming.org/
The purpose of the concepts is to allow algorithms to be implemented generically in terms of the operations provided by the concepts. So, what algorithms do we need, and how can we express them generically in terms of concepts? Without that most critical step, we'll get the concepts all wrong.
Concepts are not just about operations, but also about semantics. A single-pass range and a forward range provide the same operations, but they have different semantics: a single-pass range can be traversed only once, while a forward range, which refines it, can be traversed any number of times.
Ahem. I know.
Refining the range concepts to create a new concept meaning a range that satisfies a given predicate doesn't seem that different in spirit.
The problem is that it is not possible to enforce that predicate programmatically with range adaptors, so we have to fall back on design by contract.
Show me the (concrete) code.
Now, if it is believed that modeling invariants as a concept is a bad idea, I will simply remove the concept and just deal with raw ranges directly, without any concept (= invariant) checking.
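For what that design-by-contract fallback might look like in practice, here is a minimal sketch; is_valid_nfc_utf32 and process are hypothetical names, and the checker is deliberately incomplete:

    #include <cassert>
    #include <vector>

    // Hypothetical verifier. A real one would also check combining class
    // ordering and composition exclusions; only scalar-value validity is
    // checked here, as an illustration.
    template<typename CodePointRange>
    bool is_valid_nfc_utf32(const CodePointRange& r)
    {
        for (char32_t cp : r)
            if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                return false;
        return true;
    }

    template<typename CodePointRange>
    void process(const CodePointRange& r)
    {
        // Precondition, checked only in debug builds: r is valid, NFC UTF-32.
        assert(is_valid_nfc_utf32(r));
        // ... the actual algorithm goes here ...
    }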
Now we're talking. Ditch the concepts. Write the algorithms. Then try to make them generic and *find* the concepts.

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com