[rfc] Unicode GSoC project

Hi everyone. I'm in charge of the Unicode Google Summer of Code project. I have been working on range adaptors to iterate over code points in a UTF-x string as well as converting back those code points to UTF-y for the past week, and I stopped working on these for a bit to put together some short documentation (which is my first QuickBook document, so it may not be very pretty). This is not documentation of the final work, but rather of what I'm working on at the moment.

I would like to know everyone's opinion of the concepts I am defining, which assume the range that is being worked on is indeed a valid Unicode range in a particular encoding, as well as the system used to enforce those concepts. Also, I put the normalization form C as part of the invariant, but maybe that should be something orthogonal. I personally don't think it's really useful for general-purpose text though.

While the system doesn't provide conversion from other character sets, this can easily be added by using assume_utf32. For example, using an ISO-8859-1 string as input to assume_utf32 just works, since ISO-8859-1 is included verbatim into Unicode (a small sketch of this follows below).

The documentation also contains some introductory Unicode material. You can find the documentation online here: http://mathias.gaunard.emi.u-bordeaux1.fr/unicode/doc/html/
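To make the ISO-8859-1 point concrete, here is a minimal sketch. Every ISO-8859-1 byte value 0x00..0xFF is also the Unicode code point of the same value, so treating each byte as a 32-bit code point is already a correct decoding. The commented-out adaptor call is only sketched usage; the namespace and exact signature are guesses, not the documented interface.

#include <iostream>
#include <string>

int main()
{
    std::string latin1 = "caf\xE9";   // "café" encoded in ISO-8859-1

    // boost::unicode::assume_utf32(latin1)   // sketched usage, namespace is a guess

    // What such an adaptor would yield, spelled out by hand:
    for (std::string::size_type i = 0; i != latin1.size(); ++i)
    {
        unsigned int cp = static_cast<unsigned char>(latin1[i]); // cast avoids sign surprises
        std::cout << "U+" << std::hex << std::uppercase << cp << ' ';
    }
    std::cout << '\n';                // prints: U+63 U+61 U+66 U+E9
}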

Hi Mathias, Mathias Gaunard wrote:
I have been working on range adaptors to iterate over code points in a UTF-x string as well as converting back those code points to UTF-y for the past week
I would be interested to see this code. I encourage you to share what you have done as soon as possible, to get prompt feedback.
short documentation http://mathias.gaunard.emi.u-bordeaux1.fr/unicode/doc/html/
Some feedback based on that document:

"UTF-16 .... This is the recommended encoding for dealing with Unicode."

Recommended by who? It's not the encoding that I would normally recommend.

"make_utf8(Range&& range); Assumes the range is a properly encoded UTF-8 range in Normalization Form C. Iterating the range may throw an exception if it isn't."

"as_utf8(Range&& range); Return type is a model of UnicodeRange whose value type is uchar8_t."

To me, the word "make" suggests that the former is actually doing a conversion. But it's the latter, "as", that does that. Can we think of something better? (Can anyone suggest any precedents?)

Regards, Phil.

On Wed, May 13, 2009 at 12:55 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
To me, the word "make" suggests that the former is actually doing a conversion. But it's the latter, "as", that does that. Can we think of something better? (Can anyone suggest any precedents?)
of_utf8 ? -- Alp Mestan

Phil Endecott wrote:
I would be interested to see this code. I encourage you to share what you have done as soon as possible, to get prompt feedback.
I have some code in the Boost sandbox SVN, but it doesn't implement the documentation I gave. It's just some fairly heavyweight iterator adapters with sanity checks, built on top of a general iterator concept that I probably need to refine for efficiency.
Some feedback based on that document:
UTF-16 .... This is the recommended encoding for dealing with Unicode.
Recommended by who? It's not the encoding that I would normally recommend.
The Unicode standard, in some technical notes: http://www.unicode.org/notes/tn12/ It recommends the use of UTF-16 for general purpose text processing. It also states that UTF-8 is good for compatibility and data exchange, and UTF-32 uses just too much memory and is thus quite a waste.
make_utf8(Range&& range); Assumes the range is a properly encoded UTF-8 range in Normalization Form C. Iterating the range may throw an exception if it isn't.
as_utf8(Range&& range); Return type is a model of UnicodeRange whose value type is uchar8_t.
To me, the word "make" suggests that the former is actually doing a conversion. But it's the latter, "as", that does that. Can we think of something better? (Can anyone suggest any precedents?)
I kind of named it randomly. I also thought of verify_utf8, but wouldn't that be a better name for a function that eagerly checks that the range is valid? I see three options here:

1) We assume the range is valid and don't bother checking anything.
2) We assume the range is valid but still do sanity checks as we go, to avoid invoking undefined behaviour.
3) We check the whole range up front, and know we can use it without any checks afterwards (see the sketch below).

Are all three options good to have? Should option 2 do just the checks it needs, or should it assert the whole invariant? Or should option 2 just be the behaviour of option 1 in debug mode?
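For what it's worth, here is a rough, self-contained sketch of what option 3 amounts to: an eager structural check of a UTF-8 byte sequence. It checks lead/continuation bytes, overlong forms, surrogates and range, but not normalization form; none of the names are the proposed interface.

#include <cstddef>
#include <string>

bool is_valid_utf8(std::string const& s)
{
    std::size_t i = 0, n = s.size();
    while (i < n)
    {
        unsigned char b = s[i];
        std::size_t len;
        unsigned long cp;
        if      (b < 0x80)           { ++i; continue; }       // ASCII, one byte
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;           // stray continuation byte or invalid lead

        if (i + len > n) return false;                        // truncated sequence
        for (std::size_t k = 1; k < len; ++k)
        {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;             // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong encodings, surrogates and out-of-range values.
        static const unsigned long min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += len;
    }
    return true;
}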

Hi Mathias, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Phil Endecott wrote:
UTF-16 .... This is the recommended encoding for dealing with Unicode.
Recommended by who? It's not the encoding that I would normally recommend.
The Unicode standard, in some technical notes: http://www.unicode.org/notes/tn12/ It recommends the use of UTF-16 for general purpose text processing.
It also states that UTF-8 is good for compatibility and data exchange, and UTF-32 uses just too much memory and is thus quite a waste.
From that document:

"Status
This document is a Unicode Technical Note. It is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium.
....
Conclusion
Unicode is the best way to process and store text. While there are several forms of Unicode that are suitable for processing, it is best to use the same form everywhere in a system, and to use UTF-16 in particular for two reasons:
1. The vast majority of characters (by frequency of use) are on the BMP.
2. For seamless integration with the majority of existing software with good Unicode support."

I don't find either of those claims very convincing. I hope that your library will not try to make UTF-16 some sort of default encoding, or otherwise give it special treatment.

Regards, Phil.

Phil Endecott wrote:
I hope that your library will not try to make UTF-16 some sort of default encoding, or otherwise give it special treatment.
I agree. Many users are going to want to read UTF-8 and write UTF-8. If the library needs to convert to UTF-16 for processing, the library should hide this from users. --Jeffrey Bosboom

Phil Endecott wrote:
I don't find either of those claims very convincing. I hope that your library will not try to make UTF-16 some sort of default encoding, or otherwise give it special treatment.
No, the library will work with any of UTF-8, UTF-16 and UTF-32. The comment is just advisory.

On Wed, May 13, 2009 at 18:35, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Phil Endecott wrote:
Some feedback based on that document:
UTF-16 .... This is the recommended encoding for dealing with Unicode.
Recommended by who? It's not the encoding that I would normally recommend.
The Unicode standard, in some technical notes: http://www.unicode.org/notes/tn12/ It recommends the use of UTF-16 for general purpose text processing.
It also states that UTF-8 is good for compatibility and data exchange, and UTF-32 uses just too much memory and is thus quite a waste.
I really think UTF-8 should be the recommended one, since it forces people to remember that it's no longer one unit, one "character". Even in Beman Dawes's talk (http://www.boostcon.com/site-media/var/sphene/sphwiki/attachment/2009/05/07/...) where slide 11 mentions UTF-32 and remembers that UTF-16 can still take 2 encoding units per codepoint, slide 13 says that UTF-16 is "desired" where "random access critical".

What kind of real-world use do people have for random access, anyways? Even UTF-32 isn't random access for the things I can think of that people would care about, what with combining codepoints and ligatures and other such things.

As an aside, I'd like to see comparisons between compressed UTF-8 and compressed UTF-16, since neither one is random-access anyways, and it seems to me that caring about size of text before compression is about as important as the performance of a program with the optimizer turned off.

Scott McMurray wrote:
I really think UTF-8 should be the recommended one, since it forces people to remember that it's no longer one unit, one "character".
Even in Beman Dawes's talk (http://www.boostcon.com/site-media/var/sphene/sphwiki/attachment/2009/05/07/...) where slide 11 mentions UTF-32 and remembers that UTF-16 can still take 2 encoding units per codepoint, slide 13 says that UTF-16 is "desired" where "random access critical".
I don't plan on supporting random access for UTF-16. UTF-16 is still faster than UTF-8 because UTF-8 requires more complex decoding. UTF-16 has only two cases, making it easier to optimize branches under the likely and unlikely case.
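To illustrate the "two cases" remark, decoding one code point from UTF-16 might look like the following sketch. It is iterator-based with hypothetical error handling; boost::uint32_t is only used as a portable 32-bit type, and nothing here is the proposed library interface.

#include <stdexcept>
#include <boost/cstdint.hpp>

// 'it' iterates over 16-bit code units; advances past the code point it decodes.
template<typename Iterator>
boost::uint32_t decode_one_utf16(Iterator& it, Iterator end)
{
    boost::uint32_t unit = *it++;

    // Likely case: a code unit outside the surrogate range is a whole
    // BMP code point on its own.
    if (unit < 0xD800 || unit > 0xDFFF)
        return unit;

    // Unlikely case: a high surrogate followed by a low surrogate encodes
    // a code point in U+10000..U+10FFFF.
    if (unit > 0xDBFF || it == end)
        throw std::runtime_error("ill-formed UTF-16");   // unpaired or truncated
    boost::uint32_t low = *it++;
    if (low < 0xDC00 || low > 0xDFFF)
        throw std::runtime_error("ill-formed UTF-16");
    return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
}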

Mathias Gaunard wrote:
UTF-16 is still faster than UTF-8 because UTF-8 requires more complex decoding. UTF-16 has only two cases, making it easier to optimize branches under the likely and unlikely case.
Faster for what? Have you benchmarked it? I think my UTF-8 code is fast. I would enjoy the challenge of demonstrating that. But the "for what" question must be answered first. Phil.

On Fri, May 15, 2009 at 2:31 AM, Scott McMurray <me22.ca+boost@gmail.com> wrote:
On Wed, May 13, 2009 at 18:35, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
Phil Endecott wrote:
Some feedback based on that document:
UTF-16 .... This is the recommended encoding for dealing with Unicode.
Recommended by who? It's not the encoding that I would normally recommend.
The Unicode standard, in some technical notes: http://www.unicode.org/notes/tn12/ It recommends the use of UTF-16 for general purpose text processing.
It also states that UTF-8 is good for compatibility and data exchange, and UTF-32 uses just too much memory and is thus quite a waste.
I really think UTF-8 should be the recommended one, since it forces people to remember that it's no longer one unit, one "character".
Even in Beman Dawes's talk (http://www.boostcon.com/site-media/var/sphene/sphwiki/attachment/2009/05/07/...) where slide 11 mentions UTF-32 and remembers that UTF-16 can still take 2 encoding units per codepoint, slide 13 says that UTF-16 is "desired" where "random access critical".
It is really important to recognize that there isn't a single recommended Unicode encoding. The most appropriate encoding can only be chosen in relation to a particular application and/or algorithm. UTF-8 and UTF-16 are both in heavy use because they serve somewhat different needs. UTF-32 isn't used as often as those other two in strings, but I've found it very useful for passing around single codepoints. And then some needs change at runtime, so at least for strings an adaptive encoding is needed.
What kind of real-world use do people have for random access, anyways? Even UTF-32 isn't random access for the things I can think of that people would care about, what with combining codepoints and ligatures and other such things.
There are several related issues, assuming we are talking about strings. Some operations are doable but uncommon, so the cost of doing them should only be incurred if they are actually needed. Some operations are unsafe without prior knowledge of the string contents, but are perfectly safe with knowledge of the contents. Some operations may be quite a bit cheaper in C++0x than in C++03. etc., etc. It is hard to talk in the abstract; we need to see the actual algorithms first. --Beman

Mathias Gaunard wrote:
Hi everyone. I'm in charge of the Unicode Google Summer of Code project.
I have been working on range adaptors to iterate over code points in a UTF-x string as well as converting back those code points to UTF-y for the past week and
That's good, these are needed. Also needed are tables that store the various character properties, and (hopefully) some parsers that build the tables directly from the Unicode character database so we can easily rev it whenever the database changes.
I stopped working on these for a bit to put together some short documentation (which is my first QuickBook document, so it may not be very pretty). This is not documentation of the final work, but rather of what I'm working on at the moment.
I would like to know everyone's opinion of the concepts I am defining, which assume the range that is being worked on is indeed a valid Unicode range in a particular encoding, as well as the system used to enforce those concepts.
Also, I put the normalization form C as part of the invariant
The invariant of what? The internal data over which the iterators traverse? Which iterators? All of them? Are you really talking about an invariant (something that is true of the data both before and after each operation completes), or of pre- or post-conditions?
..., but maybe that should be something orthogonal. I personally don't think it's really useful for general-purpose text though.
I should hope there is a way to operate on valid Unicode ranges that happen not to be in normalization form C.
While the system doesn't provide conversion from other character sets, this can easily be added by using assume_utf32. For example, using an ISO-8859-1 string as input to assume_utf32 just works, since ISO-8859-1 is included verbatim into Unicode.
I personally haven't taken the time to learn how ICU handles Unicode input and character set conversions. It might be illustrative to see how an established and respected Unicode library handles issues like this.
The documentation also contains some introductory Unicode material.
You can find the documentation online here: http://mathias.gaunard.emi.u-bordeaux1.fr/unicode/doc/html/
Thanks for posting this. Some comments.

<<Core Types>>

"The library provides the following core types in the boost namespace: uchar8_t uchar16_t uchar32_t"

In C++0x, these are called char, char16_t and char32_t. I think uchar8_t is unnecessary, and for a Boost Unicode library, boost::char16 and boost::char32 would work just fine. On a C++0x compiler, they should be typedefs for char16_t and char32_t.

<<Concepts>>

I strongly disagree with requiring normalization form C for the concept UnicodeRange. There are many more valid Unicode sequences.

And the UnicodeGrapheme concept doesn't make sense to me. You say, "A model of UnicodeGrapheme is a range of Unicode code points that is a single grapheme cluster in Normalized Form C." A grapheme cluster != Unicode code point. It may be many code points representing a base character and many zero-width combining characters. So what exactly is being traversed by a UnicodeGrapheme range?

The concepts are of critical importance, and these don't seem right to me. My C++0x concept-foo is weak, and I'd like to involve many more people in this discussion. The purpose of the concepts is to allow algorithms to be implemented generically in terms of the operations provided by the concepts. So, what algorithms do we need, and how can we express them generically in terms of concepts? Without that most critical step, we'll get the concepts all wrong.

I imagine we'll want algorithms for converting from one encoding to another, or from one normalization form (or, more likely, from no normalization form) to another, so we'll need to constrain the algorithms to specific encodings and/or normalization forms. We'll also need a concept that represents Unicode input that hasn't yet been normalized (perhaps in each of the encodings?). The point is, the concrete algorithms must come first.

We may end up back with a single perfectly general UnicodeRange that all algorithms can be implemented in terms of. That'd be nice, but I bet we end up with refinements for the different encodings/normalized forms that make it possible to implement some algorithms much more efficiently.

(I stopped reading the docs at this point.)

-- Eric Niebler
BoostPro Computing
http://www.boostpro.com

Eric Niebler wrote:
Mathias Gaunard wrote:
Also needed are tables that store the various character properties, and (hopefully) some parsers that build the tables directly from the Unicode character database so we can easily rev it whenever the database changes.
For the record, I have scripts that can generate ISO-8859-* to/from unicode tables from the downloaded data; I'll happily contribute this if it is useful to anyone.
The library provides the following core types in the boost namespace:
uchar8_t uchar16_t uchar32_t
In C++0x, these are called char, char16_t and char32_t.
I liked that idea of making them obviously-unsigned; I had some nasty bugs with my UTF-8 code where I made invalid assumptions about signs. But of course being consistent with C++0x is more important.
I strongly disagree with requiring normalization form C for the concept UnicodeRange. There are many more valid Unicode sequences.
Agreed.
the concrete algorithms must come first.
Agreed. Mathias, I would love to see a sort of "end user perspective" view of how this library will be used, i.e. its scope and basic usage pattern. Phil.

Eric Niebler wrote:
Mathias Gaunard wrote:
I have been working on range adaptors to iterate over code points in a UTF-x string as well as converting back those code points to UTF-y for the past week and
That's good, these are needed. Also needed are tables that store the various character properties, and (hopefully) some parsers that build the tables directly from the Unicode character database so we can easily rev it whenever the database changes.
I will attack the tables as soon as iteration over code values, code points and grapheme clusters is finished (that would use the tables, but I want to have the mechanisms defined first), and there is an accepted design for how to represent Unicode data and interact with it. Hopefully, that should not take too much time.
The invariant of what? The internal data over which the iterators traverse? Which iterators? All of them? Are you really talking about an invariant (something that is true of the data both before and after each operation completes), or of pre- or post-conditions?
The invariant satisfied by a model of UnicodeRange. If R is a model of UnicodeRange, any instance r of R shall guarantee that begin(r)...end(r) is a valid, normalized Unicode string properly encoded in UTF-x. The invariant must be satisfied at all times. Functions taking models of UnicodeRange as input can assume the invariant as a precondition on their input and should have it as a postcondition on it too. Functions providing models of UnicodeRange as output should have it as a postcondition on their output.
..., but maybe that should be something orthogonal. I personally don't think it's really useful for general-purpose text though.
I should hope there is a way to operate on valid Unicode ranges that happen not to be in normalization form C.
A way to operate on such data would be normalizing it beforehand. No information is supposed to be lost by normalizing to form C. Substring search for example requires the two strings being compared to be normalized, or at least relevant parts, so that canonically equivalent things can compare as being equal. We can choose to do that behind the user's back or rather to make it so that we don't need to; the latter allows us to keep things simple.

Of course, being normalization-form-agnostic and making it a separate concept, allowing selection of the best algorithm possible (I may not have the time to write all versions, however), is more powerful because it doesn't make any concessions. I just want to know whether it's really worth it to complicate this.
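A concrete example of the canonical-equivalence point: "é" can be written either as the precomposed U+00E9 or as U+0065 followed by the combining acute U+0301. The two sequences compare unequal code point by code point, but NFC turns the second into the first. The normalize_c call below is purely hypothetical notation for whatever adaptor or algorithm ends up doing this.

#include <boost/cstdint.hpp>

// Two canonically equivalent spellings of "é".
boost::uint32_t const precomposed[] = { 0x00E9 };          // U+00E9
boost::uint32_t const decomposed[]  = { 0x0065, 0x0301 };  // 'e' + combining acute

// A raw code-point comparison sees two sequences of different lengths.
// After normalization to form C, decomposed becomes { 0x00E9 } and the
// comparison succeeds:
//
//   equal(normalize_c(precomposed), normalize_c(decomposed))   // hypothetical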
The library provides the following core types in the boost namespace:
uchar8_t uchar16_t uchar32_t
In C++0x, these are called char, char16_t and char32_t. I think uchar8_t is unnecessary, and for a Boost Unicode library, boost::char16 and boost::char32 would work just fine. On a C++0x compiler, they should be typedefs for char16_t and char32_t.
The character types not being unsigned could lead to issues during promotions or conversions. I also personally think "char" is better to mean "locale-specific character" than "utf-8 character", so I thought a distinct name for the type was more appropriate. Anyway, embracing the standard way is what should be done, I agree. I'll just have to make sure I'm careful of conversions. Does Boost have macros to detect these yet?
And the UnicodeGrapheme concept doesn't make sense to me. You say, "A model of UnicodeGrapheme is a range of Unicode code points that is a single grapheme cluster in Normalized Form C." A grapheme cluster != Unicode code point. It may be many code points representing a base character and many zero-width combining characters. So what exactly is being traversed by a UnicodeGrapheme range?
A UnicodeGrapheme is one grapheme cluster, i.e. a range of code points. Basically, to iterate over grapheme clusters, a range of code points would be adapted into a range of ranges of code points.
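To make the "range of ranges" shape concrete, here is a tiny stand-alone sketch using std::vector in place of a lazy adaptor; the container is incidental, the nesting is the point.

#include <cassert>
#include <vector>
#include <boost/cstdint.hpp>

// One inner range per grapheme cluster; each inner range holds the code
// points making up that cluster. A real adaptor would produce this view
// lazily over the underlying code point sequence rather than copying.
typedef std::vector<boost::uint32_t> cluster;

int main()
{
    std::vector<cluster> text;

    cluster e_acute;
    e_acute.push_back(0x0065);            // 'e'
    e_acute.push_back(0x0301);            // combining acute: same cluster
    text.push_back(e_acute);

    text.push_back(cluster(1, 0x0078));   // 'x' is a cluster of its own

    assert(text.size() == 2);             // 2 grapheme clusters, 3 code points in total
    return 0;
}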
The concepts are of critical importance, and these don't seem right to me. My C++0x concept-foo is weak, and I'd like to involve many more people in this discussion.
In C++0x concept-foo, I would associate the validity invariant with a non-auto concept (i.e. the model has to specify that it implements the concept, unlike auto concepts, which are implicitly structurally matched to models), so as to distinguish ranges that aim at maintaining that invariant from ranges that don't.
The purpose of the concepts is to allow algorithms to be implemented generically in terms of the operations provided by the concepts. So, what algorithms do we need, and how can we express them generically in terms of concepts? Without that most critical step, we'll get the concepts all wrong.
Concepts are not just about operations, but also about semantics. A single pass range and a forward range provide the same operations, but they've got different semantics: a single pass range can be traversed once; a forward range, which refines it, can be traversed any number of times. Refining the range concepts to create a new concept meaning a range that satisfies a given predicate doesn't seem that different in spirit. The problem is that it is not possible to ensure that predicate programmatically with range adapters, so we have to fall back to design by contract.

Now, if it is believed this is a bad idea to model invariants as a concept, I simply will remove the concept and just deal with raw ranges directly, without any concept(=invariant) checking.

Mathias Gaunard wrote:
Eric Niebler wrote:
The invariant of what? The internal data over which the iterators traverse? Which iterators? All of them? Are you really talking about an invariant (something that is true of the data both before and after each operation completes), or of pre- or post-conditions?
The invariant satisfied by a model of UnicodeRange. If R is a model of UnicodeRange, any instance r of R shall guarantee that begin(r)...end(r) is a valid, normalized Unicode string properly encoded in UTF-x.
The invariant must be satisfied at all times.
So a range that lazily verifies well-formedness while it is being traversed cannot satisfy the UnicodeRange concept? I think if we were to ever have a Unicode string type, it would make sense to define a class invariant. For the Unicode concepts (which have not yet been specified), each operation in the concept may specify pre- and post-conditions, in addition to complexity guarantees and semantics.
Functions taking models of UnicodeRange as input can assume the invariant as a precondition on their input and should have it as a postcondition on it too. Functions providing models of UnicodeRange as output should have it as a postcondition on their output.
..., but maybe that should be something orthogonal. I personally don't think it's really useful for general-purpose text though.
I should hope there is a way to operate on valid Unicode ranges that happen not to be in normalization form C.
A way to operate on such data would be normalizing it beforehand. No information is supposed to be lost by normalizing to form C.
That's one way, yes. Another way would be to provide an adaptor that normalizes on the fly, if such a thing could be done. There may still be other ways.
Substring search for example requires the two strings being compared to be normalized, or at least relevant parts, so that canonically equivalent things can compare as being equal.
Or else Unicode comparison algorithms can be written generally so as not to require both sequences to be normalized first. They could also dispatch to a more efficient implementation if both ranges are known to be in the same normalization form.
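One possible shape for such dispatching, sketched with a hypothetical is_nfc trait; nothing here is an existing or proposed API, and the function bodies are placeholders.

#include <boost/mpl/bool.hpp>

// Hypothetical trait: specialize it (or have an adaptor provide it) to state
// that a range type is known to already be in Normalization Form C.
template<typename Range> struct is_nfc : boost::mpl::false_ {};

namespace detail
{
    // Fast path: both inputs are known to be NFC, so canonical equivalence
    // reduces to a plain code-point-wise comparison.
    template<typename R1, typename R2>
    bool canonical_equal_impl(R1 const&, R2 const&, boost::mpl::true_)
    {
        return false; // placeholder body in this sketch
    }

    // General path: compare under canonical equivalence, e.g. by normalizing
    // or decomposing on the fly.
    template<typename R1, typename R2>
    bool canonical_equal_impl(R1 const&, R2 const&, boost::mpl::false_)
    {
        return false; // placeholder body in this sketch
    }
}

template<typename R1, typename R2>
bool canonical_equal(R1 const& a, R2 const& b)
{
    typedef boost::mpl::bool_<is_nfc<R1>::value && is_nfc<R2>::value> both_nfc;
    return detail::canonical_equal_impl(a, b, both_nfc());
}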
We can choose to do that behind the user's back or rather to make it so that we don't need to; the latter allows us to keep things simple.
Are we writing a handful of one-off routines that manipulate /some/ Unicode data in /some/ predefined ways, or are we trying to build the foundation upon which a full, generic Unicode library can be based? I'm hoping for the latter. That's a much bigger job. It's ok if we do the former, as long as we see it for what it is: the non-generic kernel from which we can later distill generic algorithms and concepts.
Of course, being normalization-form-agnostic and making it a separate concept, allowing selection of the best algorithm possible (I may not have the time to write all versions, however), is more powerful because it doesn't make any concessions.
Right.
I just want to know whether it's really worth it to complicate this.
For GSoC, we may decide not. The first step of the generic programming process is to write a big pile of non-generic code and start sifting through it to find the concepts. It's possible that your GSoC project produces the non-generic code. But it will not be a Boost.Unicode library until the generic Unicode algorithms have been distilled and the concepts defined.
The library provides the following core types in the boost namespace:
uchar8_t uchar16_t uchar32_t
In C++0x, these are called char, char16_t and char32_t. I think uchar8_t is unnecessary, and for a Boost Unicode library, boost::char16 and boost::char32 would work just fine. On a C++0x compiler, they should be typedefs for char16_t and char32_t.
The character types not being unsigned could lead to issues during promotions or conversions.
I also personally think "char" is better to mean "locale-specific character" than "utf-8 character", so I thought a distinct name for the type was more appropriate.
Anyway, embracing the standard way is what should be done, I agree. I'll just have to make sure I'm careful of conversions.
Good.
Does Boost have macros to detect these yet?
BOOST_NO_CHAR16_T BOOST_NO_CHAR32_T
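A minimal sketch of the typedefs this suggests, built on the Boost.Config macros above; boost::char16 and boost::char32 are the names proposed earlier in the thread, not an existing Boost facility.

#include <boost/config.hpp>
#include <boost/cstdint.hpp>

namespace boost
{
#ifdef BOOST_NO_CHAR16_T
    typedef uint16_t char16;   // fallback when the compiler lacks char16_t
#else
    typedef char16_t char16;   // native C++0x character type
#endif

#ifdef BOOST_NO_CHAR32_T
    typedef uint32_t char32;
#else
    typedef char32_t char32;
#endif
}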
And the UnicodeGrapheme concept doesn't make sense to me. You say, "A model of UnicodeGrapheme is a range of Unicode code points that is a single grapheme cluster in Normalized Form C." A grapheme cluster != Unicode code point. It may be many code points representing a base character and many zero-width combining characters. So what exactly is being traversed by a UnicodeGrapheme range?
A UnicodeGrapheme is one grapheme cluster, i.e. a range of code points.
Basically, to iterate over grapheme clusters, a range of code points would be adapted into a range of ranges of code points.
OK, I see. It's a range of code points that demarcates a single, whole grapheme cluster. This could be useful.
The concepts are of critical importance, and these don't seem right to me. My C++0x concept-foo is weak, and I'd like to involve many more people in this discussion.
In C++0x concept-foo, I would associate the validity invariant with a non-auto concept (i.e. the model has to specify that it implements the concept, unlike auto concepts, which are implicitly structurally matched to models), so as to distinguish ranges that aim at maintaining that invariant from ranges that don't.
Whoa, we're nowhere near talking about auto vs. non-auto C++0x concepts. We don't even have Unicode algorithms to work from, so we don't know what form the concepts will take. You're jumping to the end. I repeat: you don't know what the concepts are and you won't know until you write some concrete algorithms and try to lift generic implementations from them. Please have a look at: http://www.generic-programming.org/
The purpose of the concepts is to allow algorithms to be implemented generically in terms of the operations provided by the concepts. So, what algorithms do we need, and how can we express them generically in terms of concepts? Without that most critical step, we'll get the concepts all wrong.
Concepts are not just about operations, but also about semantics. A single pass range and a forward range provide the same operations, but they've got different semantics. A single pass range can be traversed once, a forward range, which refines it, can be traversed any number of times.
Ahem. I know.
Refining the range concepts to create a new concept meaning a range that satisfies a given predicate doesn't seem that different in spirit.
The problem is that it is not possible to ensure that predicate programmatically with range adapters, so we have to fall back to design by contract.
Show me the (concrete) code.
Now, if it is believed this is a bad idea to model invariants as a concept, I simply will remove the concept and just deal with raw ranges directly, without any concept(=invariant) checking.
Now we're talking. Ditch the concepts. Write the algorithms. Then try to make them generic and *find* the concepts. -- Eric Niebler BoostPro Computing http://www.boostpro.com

On Fri, May 15, 2009 at 12:09 PM, Eric Niebler <eric@boostpro.com> wrote:
Mathias Gaunard wrote:
...
The concepts are of critical importance, and these don't seem right to me. My C++0x concept-foo is weak, and I'd like to involve many more people in this discussion.
In C++0x concept-foo, I would associate the validity invariant with a non-auto concept (i.e. the model has to specify that it implements the concept, unlike auto concepts, which are implicitly structurally matched to models), so as to distinguish ranges that aim at maintaining that invariant from ranges that don't.
Whoa, we're nowhere near talking about auto vs. non-auto C++0x concepts. We don't even have Unicode algorithms to work from, so we don't know what form the concepts will take. You're jumping to the end. I repeat: you don't know what the concepts are and you won't know until you write some concrete algorithms and try to lift generic implementations from them.
Please have a look at: http://www.generic-programming.org/
Strongly agree.
The purpose of the concepts is to allow algorithms to be implemented generically in terms of the operations provided by the concepts. So, what algorithms do we need, and how can we express them generically in terms of concepts? Without that most critical step, we'll get the concepts all wrong.
Concepts are not just about operations, but also about semantics. A single pass range and a forward range provide the same operations, but they've got different semantics. A single pass range can be traversed once, a forward range, which refines it, can be traversed any number of times.
Ahem. I know.
Refining the range concepts to create a new concept meaning a range that satisfies a given predicate doesn't seem that different in spirit.
The problem is that it is not possible to ensure that predicate programmatically with range adapters, so we have to fall back to design by contract.
Show me the (concrete) code.
Exactly.
Now, if it is believed this is a bad idea to model invariants as a concept, I simply will remove the concept and just deal with raw ranges directly, without any concept(=invariant) checking.
Now we're talking. Ditch the concepts. Write the algorithms. Then try to make them generic and *find* the concepts.
Amen. --Beman

Mathias Gaunard wrote:
Eric Niebler wrote:
Mathias Gaunard wrote:
I should hope there is a way to operate on valid Unicode ranges that happen not to be in normalization form C.
A way to operate on such data would be normalizing it beforehand.
Sorry, but that doesn't seem like the right choice to me. If I am processing some lump of UTF-* text in a way that doesn't care about grapheme clusters (etc), then I don't want to have to waste effort on an unnecessary normalisation step. Why do you think that your algorithms - whatever algorithms they are, which we don't know yet - benefit from this precondition? Phil.
participants (7)
- Alp Mestan
- Beman Dawes
- Eric Niebler
- Jeffrey Bosboom
- Mathias Gaunard
- Phil Endecott
- Scott McMurray