
Mathias Gaunard wrote:
Eric Niebler wrote:
The invariant of what? The internal data over which the iterators traverse? Which iterators? All of them? Are you really talking about an invariant (something that is true of the data both before and after each operation completes), or of pre- or post-conditions?
The invariant satisfied by a model of UnicodeRange. If R is a model of UnicodeRange, any instance r of R shall satisfy the condition that begin(r)...end(r) is a valid, normalized Unicode string properly encoded in UTF-x.
The invariant must be satisfied at any time.
So a range that lazily verifies well-formedness while it is being traversed cannot satisfy the UnicodeRange concept? I think if we were to ever have a Unicode string type, it would make sense to define a class invariant. For the Unicode concepts (which have not yet been specified), each operation in the concept may specify pre- and post-conditions, in addition to complexity guarantees and semantics.
Functions taking models of UnicodeRange as input can assume the invariant as a precondition on their input and should maintain it as a postcondition as well. Functions providing models of UnicodeRange as output should have it as a postcondition on their output.
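To make that pre/postcondition framing concrete, here is a minimal sketch; the functions and their names are purely illustrative, not part of the proposed library. It shows a consumer that assumes the invariant and a producer that must establish it:

    #include <cstddef>
    #include <iterator>
    #include <vector>

    // Consumer: may assume the invariant on its input, so it never re-validates.
    // Precondition: begin(r)...end(r) is valid, NFC-normalized UTF-32.
    template<typename UnicodeRange32>
    std::size_t count_code_points(const UnicodeRange32& r)
    {
        return static_cast<std::size_t>(std::distance(std::begin(r), std::end(r)));
    }

    // Producer: must establish the invariant before returning.
    // Postcondition: the returned vector is valid, NFC-normalized UTF-32.
    std::vector<char32_t> make_greeting()
    {
        // U+00E9 is the precomposed form, so the sequence is already in NFC.
        return { U'h', U'i', U' ', 0x00E9 };
    }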
[...] but maybe that should be something orthogonal. I personally don't think it's really useful for general-purpose text, though.
I should hope there is a way to operate on valid Unicode ranges that happen not to be in normalization form C.
One way to operate on such data would be to normalize it beforehand. No information is supposed to be lost by normalizing to form C.
That's one way, yes. Another way would be to provide an adaptor that normalizes on the fly, if such a thing could be done. There may still be other ways.
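To illustrate what "normalizing on the fly" could look like, here is a deliberately toy sketch: the composition table is reduced to a single hardcoded pair and the interface is a plain next()/at_end() pair rather than a conforming iterator. A real adaptor would need the full Unicode composition data and canonical reordering.

    #include <iostream>
    #include <vector>

    class toy_nfc_iterator
    {
        std::vector<char32_t>::const_iterator it_, end_;
    public:
        toy_nfc_iterator(std::vector<char32_t>::const_iterator it,
                         std::vector<char32_t>::const_iterator end)
            : it_(it), end_(end) {}

        bool at_end() const { return it_ == end_; }

        // Returns the next normalized code point, consuming one or two
        // code points from the underlying sequence.
        char32_t next()
        {
            char32_t cp = *it_++;
            if (cp == U'e' && it_ != end_ && *it_ == 0x0301) // combining acute
            {
                ++it_;
                return 0x00E9; // precomposed "é"
            }
            return cp;
        }
    };

    int main()
    {
        std::vector<char32_t> decomposed = { U'c', U'a', U'f', U'e', 0x0301 };
        toy_nfc_iterator it(decomposed.begin(), decomposed.end());
        while (!it.at_end())
            std::cout << std::hex << static_cast<unsigned>(it.next()) << ' ';
        std::cout << '\n'; // prints: 63 61 66 e9
    }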
Substring search, for example, requires the two strings being compared to be normalized, or at least the relevant parts, so that canonically equivalent things compare as equal.
Or else Unicode comparison algorithms can be written generally so as not to require both sequences to be normalized first. The comparison could also dispatch to a more efficient implementation if both ranges are known to be in the same normalization form.
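As a hedged sketch of that dispatch idea (the tag types and the contains signature are assumptions for illustration, not a proposed interface), tag dispatch on what is known about the normalization form might look like this:

    #include <algorithm>

    struct unknown_form {};
    struct nfc_form {};

    // Fallback: would normalize (or canonically compare) before searching.
    template<typename Range1, typename Range2>
    bool contains(const Range1& haystack, const Range2& needle,
                  unknown_form, unknown_form)
    {
        // ... normalize copies of both inputs here, then search ...
        return false; // placeholder, not implemented in this sketch
    }

    // Fast path: both ranges are already NFC, so a direct code point search
    // finds canonically equivalent occurrences (edge cases at cluster
    // boundaries aside, and ignoring e.g. case folding).
    template<typename Range1, typename Range2>
    bool contains(const Range1& haystack, const Range2& needle,
                  nfc_form, nfc_form)
    {
        return std::search(haystack.begin(), haystack.end(),
                           needle.begin(), needle.end()) != haystack.end();
    }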
We can choose to do that behind the user's back, or rather make it so that we don't need to; the latter allows us to keep things simple.
Are we writing a handful of one-off routines that manipulate /some/ Unicode data in /some/ predefined ways, or are we trying to build the foundation upon which a full, generic Unicode library can be based? I'm hoping for the latter. That's a much bigger job. It's ok if we do the former, as long as we see it for what it is: the non-generic kernel from which we can later distill generic algorithms and concepts.
Of course, being normalization-form-agnostic and making the normalization form a separate concept, allowing the best possible algorithm to be selected (I may not have the time to write all the versions, however), is more powerful because it doesn't make any concessions.
Right.
I just want to know whether it's really worth it to complicate this.
For GSoC, we may decide not. The first step of the generic programming process is to write a big pile of non-generic code and start sifting through it to find the concepts. It's possible that your GSoC project produces the non-generic code. But it will not be a Boost.Unicode library until the generic Unicode algorithms have been distilled and the concepts defined.
The library provides the following core types in the boost namespace:
uchar8_t, uchar16_t, uchar32_t
In C++0x, these are called char, char16_t and char32_t. I think uchar8_t is unnecessary, and for a Boost Unicode library, boost::char16 and boost::char32 would work just fine. On a C++0x compiler, they should be typedefs for char16_t and char32_t.
The character types not being unsigned could lead to issues during promotions or conversions.
I also personally think "char" is better to mean "locale-specific character" than "utf-8 character", so I thought a distinct name for the type was more appropriate.
Anyway, embracing the standard way is what should be done, I agree. I'll just have to make sure I'm careful of conversions.
Good.
Does Boost have macros to detect these yet?
BOOST_NO_CHAR16_T BOOST_NO_CHAR32_T
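For reference, here is a minimal sketch of how boost::char16 / boost::char32 could be defined on top of those Boost.Config macros; the unsigned fallback types are my assumption, not a decision from this thread:

    #include <boost/config.hpp>
    #include <boost/cstdint.hpp>

    namespace boost
    {
    #ifdef BOOST_NO_CHAR16_T
        typedef uint_least16_t char16; // compiler lacks char16_t: fixed-width fallback
    #else
        typedef char16_t char16;       // C++0x built-in type
    #endif

    #ifdef BOOST_NO_CHAR32_T
        typedef uint_least32_t char32;
    #else
        typedef char32_t char32;
    #endif
    }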
And the UnicodeGrapheme concept doesn't make sense to me. You say, "A model of UnicodeGrapheme is a range of Unicode code points that is a single grapheme cluster in Normalized Form C." A grapheme cluster != Unicode code point. It may be many code points representing a base character and many zero-width combining characters. So what exactly is being traversed by a UnicodeGrapheme range?
A UnicodeGrapheme is one grapheme cluster, i.e. a range of code points.
Basically, to iterate over grapheme clusters, a range of code points would be adapted into a range of ranges of code points.
OK, I see. It's a range of code points that demarcates a single, whole grapheme cluster. This could be useful.
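As an illustration of that "range of ranges" shape (grapheme_ranges and the nested loop in the comment are hypothetical, not the proposed API), iterating grapheme clusters might look like this:

    #include <iostream>
    #include <vector>

    int main()
    {
        // Three grapheme clusters: {a}, {x, U+0301}, {b}. The <x, U+0301>
        // pair has no precomposed form, so this sequence is already in NFC.
        std::vector<char32_t> text = { U'a', U'x', 0x0301, U'b' };

        // With a hypothetical adaptor, the traversal would be a nested loop:
        //
        //   for (auto grapheme : grapheme_ranges(text))   // range of ranges
        //       for (char32_t cp : grapheme)              // code points of one cluster
        //           use(cp);
        //
        std::cout << text.size() << " code points, 3 grapheme clusters\n";
    }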
The concepts are of critical importance, and these don't seem right to me. My C++0x concept-foo is weak, and I'd like to involve many more people in this discussion.
In C++0x concept-foo, I would say: associate the validity invariant with a non-auto concept (i.e. the model has to specify that it implements the concept, unlike auto concepts, which are implicitly structurally matched to models), so as to distinguish ranges that aim at maintaining that invariant from ranges that don't.
Whoa, we're nowhere near talking about auto vs. non-auto C++0x concepts. We don't even have Unicode algorithms to work from, so we don't know what form the concepts will take. You're jumping to the end. I repeat: you don't know what the concepts are and you won't know until you write some concrete algorithms and try to lift generic implementations from them. Please have a look at: http://www.generic-programming.org/
The purpose of the concepts is to allow algorithms to be implemented generically in terms of the operations provided by the concepts. So, what algorithms do we need, and how can we express them generically in terms of concepts? Without that most critical step, we'll get the concepts all wrong.
Concepts are not just about operations, but also about semantics. A single-pass range and a forward range provide the same operations, but they have different semantics: a single-pass range can be traversed only once, while a forward range, which refines it, can be traversed any number of times.
Ahem. I know.
Refining the range concepts to create a new concept meaning a range that satisfies a given predicate doesn't seem that different in spirit.
The problem is that it is not possible to enforce that predicate programmatically with range adaptors, so we have to fall back on design by contract.
Show me the (concrete) code.
Now, if it is believed that modeling invariants as a concept is a bad idea, I will simply remove the concept and just deal with raw ranges directly, without any concept (= invariant) checking.
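For what that design-by-contract fallback might look like in practice, here is a minimal sketch; is_valid_nfc_utf32 and process are hypothetical names, and the checker is deliberately incomplete:

    #include <cassert>
    #include <vector>

    // Hypothetical verifier. A real one would also check combining class
    // ordering and composition exclusions; only scalar-value validity is
    // checked here, as an illustration.
    template<typename CodePointRange>
    bool is_valid_nfc_utf32(const CodePointRange& r)
    {
        for (char32_t cp : r)
            if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                return false;
        return true;
    }

    template<typename CodePointRange>
    void process(const CodePointRange& r)
    {
        // Precondition, checked only in debug builds: r is valid, NFC UTF-32.
        assert(is_valid_nfc_utf32(r));
        // ... the actual algorithm goes here ...
    }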
Now we're talking. Ditch the concepts. Write the algorithms. Then try to make them generic and *find* the concepts.

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com