Unicode: what kind of binary compatibility do we want?

Mathias Gaunard

2 Jun 2009

3:09 a.m.

As I am finishing putting property look-up together for the Unicode library GSoC, I am wondering what kind of binary compatibility it should aim for.

The work from Graham Barnett back in 2005 defined an abstract base class with virtual functions for every Unicode-related feature, but I believe that's overkill.

Basically, the current property design I have is like this:

    struct some_property
    {
        enum type { some_default, some_value1, some_value2, ... _count };
    };

    some_property::type get_some_property(char32 ch);

get_some_property is a simple look-up in a table, but since the table layout is version dependent, the look-up would need to live in the library TUs.

However, a new version of the library may return a value that is not within the enum.

Should it then work like this?

    some_property::type get_some_property_impl(char32 ch);

    some_property::type get_some_property(char32 ch)
    {
        some_property::type p = get_some_property_impl(ch);
        if(p >= some_property::_count)
            return some_property::some_default;
        return p;
    }

Is that suitable? Or do we want more/less flexibility?

Apart from that, expect a documentation update by the end of the week.
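For readers who want to try the idea, here is a minimal, compilable sketch of the clamping wrapper described above. It is not the library's code. The `char32` typedef, the three enum values and the toy look-up body are assumptions made for illustration. One deliberate deviation from the post: the hypothetical `get_some_property_impl` returns a plain integer, because converting a value outside an enum's range to that enum type is not guaranteed to be well defined in C++, so the range check is done before the conversion.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t char32;  // assumed stand-in for the library's code point type

struct some_property
{
    enum type
    {
        some_default,
        some_value1,
        some_value2,
        _count  // number of values known when the client was compiled
    };
};

// Hypothetical stand-in for the table look-up compiled into the library TUs.
// It returns a raw integer so that a value introduced by a newer library
// version is never stored in the older enum type.
inline unsigned get_some_property_impl(char32 ch)
{
    if (ch == 0x41) return 1;  // known to the client as some_value1
    if (ch == 0x42) return 7;  // pretend a newer table introduced value 7
    return 0;                  // everything else has the default value
}

// Header-side wrapper: values the client does not know fall back to the default.
inline some_property::type get_some_property(char32 ch)
{
    const unsigned raw = get_some_property_impl(ch);
    if (raw >= static_cast<unsigned>(some_property::_count))
        return some_property::some_default;
    return static_cast<some_property::type>(raw);
}

// Small usage check: a known value passes through, an unknown one is clamped.
inline void example_usage()
{
    assert(get_some_property(0x41) == some_property::some_value1);
    assert(get_some_property(0x42) == some_property::some_default);
}
```

The cost per call in this sketch is one comparison and one branch on top of the look-up itself.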


Cory Nelson

2 Jun 2009

3:19 a.m.

On Mon, Jun 1, 2009 at 8:09 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:

...

As I am finishing putting property look-up together for the Unicode library GSoC, I am wondering what kind of binary compatibility it should aim for.

The work from Graham Barnett back in 2005 defined an abstract base class with virtual functions for every unicode-related feature but I believe that's overkill.

Basically, the current property design I have is like this

struct some_property { enum type { some_default, some_value1, some_value2, ... _count }; };

some_property::type get_some_property(char32 ch);

With get_some_property a simple look-up in the table, but the table layout being version dependent it would need to be in the library TUs.

However, a new version of the library may return a value that is not within the enum.

Should it then work like this?

    some_property::type get_some_property(char32 ch)
    {
        some_property::type p = get_some_property_impl(ch);
        if(p >= some_property::_count)
            return some_property::some_default;
        return p;
    }

some_property::type get_some_property_impl(char32 ch);

Is that suitable? Or do we want more/less flexibility?

I have never had an expectation of binary compatibility with Boost -- as far as I know none of the existing libraries promise it but I could be wrong. I would prefer not to have it wasting any cycles checking compatibility, especially with Unicode which often seems to sneak its way into perf-critical code.

Just my two cents.

--
Cory Nelson
http://int64.org

Stewart, Robert

3:18 p.m.

Mathias Gaunard wrote On Monday, June 01, 2009 11:09 PM

...

As I am finishing putting property look-up together for the Unicode library GSoC, I am wondering what kind of binary compatibility it should aim for.

[snip]

...

Basically, the current property design I have is like this

struct some_property { enum type { some_default, some_value1, some_value2, ... _count }; };

some_property::type get_some_property(char32 ch);

With get_some_property a simple look-up in the table, but the table layout being version dependent it would need to be in the library TUs.

However, a new version of the library may return a value that is not within the enum.

Should it then work like this?

    some_property::type get_some_property(char32 ch)
    {
        some_property::type p = get_some_property_impl(ch);
        if(p >= some_property::_count)
            return some_property::some_default;
        return p;
    }

I don't know the implications of this, but I generally dislike the idea of a silent fallback. I'd prefer to see two interfaces: one that throws an exception on out-of-range values and one that accepts a default value to return in those cases.

It might be useful to determine compatibility when the library starts, perhaps via an initialization call, and use the Strategy Pattern to determine the implementation. (When compatible, a property's accesses are unchecked. When incompatible, the property's accesses are checked.)

_____
Rob Stewart robert.stewart@sig.com
Software Engineer, Core Software using std::disclaimer;
Susquehanna International Group, LLP http://www.sig.com

IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.
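The two-interface suggestion could look roughly like the sketch below. Everything here is hypothetical: the names `get_some_property_checked`, `get_some_property_or` and `get_some_property_raw`, the `char32` typedef and the toy table are assumptions for illustration, not an API from the thread or from any Boost library.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

typedef std::uint32_t char32;  // assumed stand-in for the library's code point type

struct some_property
{
    enum type { some_default, some_value1, some_value2, _count };
};

// Hypothetical raw look-up living in the library; it may return values
// that a client compiled against an older header does not know.
inline unsigned get_some_property_raw(char32 ch)
{
    if (ch == 0x41) return 1;  // some_value1
    if (ch == 0x42) return 7;  // value added by a newer library version
    return 0;
}

// Interface 1: report an unknown value by throwing.
inline some_property::type get_some_property_checked(char32 ch)
{
    const unsigned raw = get_some_property_raw(ch);
    if (raw >= static_cast<unsigned>(some_property::_count))
        throw std::out_of_range("property value unknown to this client");
    return static_cast<some_property::type>(raw);
}

// Interface 2: the caller chooses what an unknown value maps to.
inline some_property::type get_some_property_or(char32 ch, some_property::type fallback)
{
    const unsigned raw = get_some_property_raw(ch);
    if (raw >= static_cast<unsigned>(some_property::_count))
        return fallback;
    return static_cast<some_property::type>(raw);
}

// Small usage check for the non-throwing interface.
inline void example_usage()
{
    assert(get_some_property_or(0x42, some_property::some_value2) == some_property::some_value2);
}
```

With this split, the fallback is no longer silent: the caller either handles the exception or names the fallback value at the call site.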

Mathias Gaunard

4:21 p.m.

Stewart, Robert wrote:

...

I don't know the implications of this, but I generally dislike the idea of a silent fallback. I'd prefer to see two interfaces: one throws an exception on out of range values and one that accepts a default value to return in those cases.

If the character has some new property value, it means it had the default property value (which isn't really a property value; it's more like an "other" or "any") in the previous versions. I'm fairly sure Unicode guarantees this.

...

It might be useful to determine compatibility when the library starts, perhaps via an initialization call, and use the Strategy Pattern to determine the implementation. (When compatible, a property's accesses are unchecked. When incompatible, the property's accesses are checked.)

That would mean virtual function call overhead, which should be higher than a simple branching in an inlined function.

Rogier van Dalen

4:54 p.m.

On Tue, Jun 2, 2009 at 17:21, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:

...

Stewart, Robert wrote:

...
I don't know the implications of this, but I generally dislike the idea of a silent fallback. I'd prefer to see two interfaces: one throws an exception on out of range values and one that accepts a default value to return in those cases.

If the character has some new property value it means it had the default property value (which isn't really a property, it's more like a "other" or "any") in the previous versions, I'm fairly sure Unicode guarantees this.

I think that's correct. The Unicode Standard 5.0 says (section 3.5, D26, p 84):

"Default property value: The value (or in some cases small set of values) of a property associated with unassigned code points or with encoded characters for which the property is irrelevant."

A few pages down, D40 has more information about stability of properties. I think the rationale is that code points unknown to an application may well be valid in a newer version of Unicode, so using default behaviour is most desirable.

Notice, Mathias, that every single design decision you make will be scrutinised. In general it probably makes sense to explicitly reference the Unicode standard everywhere in your code and documentation.

Cheers, Rogier

Ilya Sokolov

3 Jun 2009

6:43 a.m.

Rogier van Dalen wrote:

...

On Tue, Jun 2, 2009 at 17:21, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:

...
Stewart, Robert wrote:

...
I don't know the implications of this, but I generally dislike the idea of a silent fallback. I'd prefer to see two interfaces: one throws an exception on out of range values and one that accepts a default value to return in those cases.

If the character has some new property value it means it had the default property value (which isn't really a property, it's more like a "other" or "any") in the previous versions, I'm fairly sure Unicode guarantees this.

I think that's correct. The Unicode Standard 5.0 says (section 3.5, D26, p 84):

"Default property value: The value (or in some cases small set of values) of a property associated with unassigned code points or with encoded characters for which the property is irrelevant."

By this definition, you can't return the default for an assigned code point for which the property is not irrelevant.

Stewart, Robert

1:08 p.m.

Mathias Gaunard wrote On Tuesday, June 02, 2009 12:22 PM

...

Stewart, Robert wrote:

...
I don't know the implications of this, but I generally dislike the idea of a silent fallback. I'd prefer to see two interfaces: one throws an exception on out of range values and one that accepts a default value to return in those cases.

If the character has some new property value it means it had the default property value (which isn't really a property, it's more like a "other" or "any") in the previous versions, I'm fairly sure Unicode guarantees this.

It is important to be positive about this. If the default *is* appropriate for a new property value in all cases, then your approach is reasonable.

...

...
It might be useful to determine compatibility when the library starts, perhaps via an initialization call, and use the Strategy Pattern to determine the implementation. (When compatible, a property's accesses are unchecked. When incompatible, the property's accesses are checked.)

That would mean virtual function call overhead, which should be higher than a simple branching in an inlined function.

Until you satisfy the requirements, performance isn't important. If your default handling approach is reasonable, then you must measure performance in real use cases. Until those points are addressed, you don't know the implications of using virtual calls. Don't dismiss them out of hand.

_____
Rob Stewart robert.stewart@sig.com
Software Engineer, Core Software using std::disclaimer;
Susquehanna International Group, LLP http://www.sig.com

Rogier van Dalen

2 Jun 2009

5:08 p.m.

On Tue, Jun 2, 2009 at 04:09, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:

...

The work from Graham Barnett back in 2005 defined an abstract base class with virtual functions for every unicode-related feature but I believe that's overkill.

Basically, the current property design I have is like this

struct some_property { enum type { some_default, some_value1, some_value2, ... _count }; };

some_property::type get_some_property(char32 ch);

I don't remember Graham's rationale, but I can see two reasons why he may have chosen that design.

(1) Looking at common query sequences, for example, in Unicode normalisation, I think you'll extract a number of properties of one code point after one another. That may have to be optimised.

(2) Some OSs contain Unicode databases; some standard libraries do; and some people may use the private-use code points. Plugging in different databases should probably be possible.

I haven't thought this through, so correct me if I'm wrong. I'm not sure your current design works well for (1). I think (2) can be solved differently than with virtual functions, but a sketch of how to integrate this might put any doubts to rest.

Hope this helps.

Cheers, Rogier
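As one possible reading of points (1) and (2), the sketch below fetches several properties of a code point in a single look-up and selects the database at compile time through a template parameter instead of virtual functions. All names (`code_point_properties`, `builtin_database`, `private_use_database`, `get_properties`) and the toy data are assumptions for illustration. The only Unicode facts relied on are that U+0301 has canonical combining class 230 and that U+E000..U+F8FF is a private-use range.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t char32;  // assumed stand-in for the library's code point type

// Hypothetical packed record: several properties fetched by one look-up,
// which addresses point (1) about querying properties back to back.
struct code_point_properties
{
    std::uint8_t some_property;    // placeholder for an enumerated property
    std::uint8_t combining_class;  // canonical combining class
};

// Toy built-in database; a real one would index a generated table.
struct builtin_database
{
    static code_point_properties lookup(char32 ch)
    {
        code_point_properties p = {0, 0};
        if (ch == 0x0301)             // COMBINING ACUTE ACCENT
            p.combining_class = 230;
        return p;
    }
};

// Toy user database that assigns its own value to private-use code points
// and defers to the built-in data otherwise, which addresses point (2).
struct private_use_database
{
    static code_point_properties lookup(char32 ch)
    {
        if (ch >= 0xE000 && ch <= 0xF8FF)
        {
            code_point_properties p = {1, 0};
            return p;
        }
        return builtin_database::lookup(ch);
    }
};

// The database is a template parameter, so the call can be resolved and
// inlined at compile time with no virtual dispatch.
template <class Database>
code_point_properties get_properties(char32 ch)
{
    return Database::lookup(ch);
}

// Small usage check.
inline void example_usage()
{
    assert(get_properties<builtin_database>(0x0301).combining_class == 230);
}
```

A compile-time parameter does not cover databases chosen at run time, such as one supplied by the OS; that case would still need some form of indirection.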
