Unicode string


Vladimir Prus

7 Apr 2004

7:57 a.m.

Miro Jurisic wrote:

...

On the other hand, in order to manipulate a Unicode string without violating constraints on well-formedness, you have to consider the string as a sequence of abstract characters (unless, of course, you constrain yourself to string transformations which operate on code point sequences yet guarantee that strings remain well-formed; there are few such transformations -- concatenation is one of them under certain constraints).

[snip]

...

capital letter C; combining caron; lowercase letter e

it contains two abstract characters, but three UCS4 code points; therefore, removing the first character from that string means removing the first two code points of three. Removing just the first code point would leave you with a combining caron followed by a lowercase letter e, which is not a well-formed Unicode string.
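A minimal sketch of the example above, assuming the string is held as UCS4 code points in a std::u32string. The helper names (is_combining, first_character_length, remove_first_character) are hypothetical, and the combining test is deliberately simplified to one Unicode block; a real implementation would consult the Unicode Character Database.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only the Combining Diacritical Marks block
// (U+0300..U+036F) counts as combining. U+030C (combining caron) is in it.
bool is_combining(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Length, in code points, of the first abstract character:
// one leading code point plus any combining marks that follow it.
std::size_t first_character_length(const std::u32string& s) {
    if (s.empty()) return 0;
    std::size_t n = 1;
    while (n < s.size() && is_combining(s[n])) ++n;
    return n;
}

// Remove the first abstract character instead of the first code point.
std::u32string remove_first_character(const std::u32string& s) {
    return s.substr(first_character_length(s));
}
```

For the three code points C, U+030C, e, first_character_length reports 2, so remove_first_character drops both the C and the caron, whereas s.substr(1) would leave a string that begins with the combining caron.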

Hi Miro, so the point is that when using string-as-code-point-container, even searching for and removing a character/substring might produce an invalid string? E.g. even when looking for the string 'foo', you can theoretically find 'foo' followed by a combining character, and removing just 'foo' will leave an invalid string?

...

basic_string is not the abstraction you are looking for, but it's also the only one that is readily available in STL/boost today. It may serve as a good starting point (questionable, IMNSHO), but it should most definitely not be treated as the right thing to use for Unicode in the long term.

I wonder what the right abstraction is, then? Is it necessary to have a class that represents an abstract character together with all of its combining characters?

- Volodya


Miro Jurisic

7 Apr 2004

8:55 a.m.

In article <200404071157.53018.ghost@cs.msu.su>, Vladimir Prus <ghost@cs.msu.su> wrote:

...

so the point is that when using string-as-code-point-container, even searching for and removing a character/substring might produce an invalid string? E.g. even when looking for the string 'foo', you can theoretically find 'foo' followed by a combining character, and removing just 'foo' will leave an invalid string?

Yes, and this is true of all Unicode encodings. Essentially, transformations that select or remove portions of a string require you to be aware of character boundaries. Searching, substrings, and character removal are such transformations, whereas concatenation isn't, so if you have two strings in the same encoding, you can concatenate them without dealing with character boundaries, and that's about it.
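One way to make a search aware of character boundaries is sketched below, assuming code points stored in a std::u32string. The names find_on_boundaries and is_combining_mark are hypothetical, and the combining test is again simplified to the U+0300..U+036F block, which is an assumption rather than a complete definition.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F are treated as combining marks.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Index of the first occurrence of needle that both starts and ends on a
// character boundary, or npos if there is none.
std::size_t find_on_boundaries(const std::u32string& hay,
                               const std::u32string& needle) {
    if (needle.empty()) return std::u32string::npos;
    std::size_t pos = hay.find(needle);
    while (pos != std::u32string::npos) {
        std::size_t end = pos + needle.size();
        // A match may not start on a combining mark...
        bool starts_ok = !is_combining_mark(hay[pos]);
        // ...and may not be followed by one that belongs to its last character.
        bool ends_ok = end == hay.size() || !is_combining_mark(hay[end]);
        if (starts_ok && ends_ok) return pos;
        pos = hay.find(needle, pos + 1);  // keep looking past this match
    }
    return std::u32string::npos;
}
```

With the haystack 'foo', U+030C, space, 'foo', a plain find returns index 0, but that match is followed by a combining caron; find_on_boundaries skips it and returns index 5.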

...

...
basic_string is not the abstraction you are looking for, but it's also the only one that is readily available in STL/boost today. It may serve as a good starting point (questionable, IMNSHO), but it should most definitely not be treated as the right thing to use for Unicode in the long term.

I wonder what the right abstraction is, then? Is it necessary to have a class that represents an abstract character together with all of its combining characters?

That's one way to go, yes; note that the moment you utter those words, you put yourself into the position of designing a Unicode API :-) which you said you don't want to do at this time.

meeroh
--
If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

Vladimir Prus

9:10 a.m.

Miro Jurisic wrote:

...

...
so the point is that when using string-as-code-point-container, even searching for and removing a character/substring might produce an invalid string? E.g. even when looking for the string 'foo', you can theoretically find 'foo' followed by a combining character, and removing just 'foo' will leave an invalid string?

Yes, and this is true of all Unicode encodings. Essentially, transformations that select or remove portions of a string require you to be aware of character boundaries. Searching, substrings, and character removal are such transformations, whereas concatenation isn't, so if you have two strings in the same encoding, you can concatenate them without dealing with character boundaries, and that's about it.

Okay.

...

...
...
basic_string is not the abstraction you are looking for, but it's also the only one that is readily available in STL/boost today. It may serve as a good starting point (questionable, IMNSHO), but it should most definitely not be treated as the right thing to use for Unicode in the long term.

I wonder what the right abstraction is, then? Is it necessary to have a class that represents an abstract character together with all of its combining characters?

That's one way to go, yes; note that the moment you utter those words, you put yourself into the position of designing a Unicode API :-) which you said you don't want to do at this time.

You almost caught me ;-) I've changed the message subject on purpose -- to indicate that I'm no longer talking about program_options. I'm interested in how a 'right' Unicode string can be implemented, but I'm not sure it's possible to design such a string now, so program_options will still have to use a much simpler approach.

- Volodya

Miro Jurisic

9:38 a.m.

In article <c50ghn$9fd$1@sea.gmane.org>, Vladimir Prus <ghost@cs.msu.su> wrote:

...

...
...
I wonder what the right abstraction is, then? Is it necessary to have a class that represents an abstract character together with all of its combining characters?

That's one way to go, yes; note that the moment you utter those words, you put yourself into the position of designing a Unicode API :-) which you said you don't want to do at this time.

You almost caught me ;-) I've changed the message subject on purpose -- to indicate that I'm no longer talking about program_options. I'm interested in how a 'right' Unicode string can be implemented, but I'm not sure it's possible to design such a string now, so program_options will still have to use a much simpler approach.

I am somewhat reluctant to discuss this in detail at this time, not because I have something to hide, but because I have something to learn: I need to investigate some aspects of Unicode, the ICU library, and locales and facets in the C++ standard before I can form a more complete picture of the design of a Unicode string. However, I don't have the time to do all the research right now, because there are other things I need to do that I am getting paid to do, and full Unicode support is not on my work to-do list.

Basically, I know enough to know how _not_ to do it, but I am not sure that I know enough to know how to do it right :-)

However, I currently think that there are legitimate reasons why one would want to view a Unicode string as (in increasing order of complexity):

- a sequence of code points (this is useful for serialization)
- a sequence of encoded characters (this is useful for transcoding)
- a sequence of abstract characters (this is useful for most high-level string transformations, such as substrings, find, etc.)

Therefore I think that a Unicode string should probably not be represented as a container of any one of those three, but instead should have an interface that lets you treat it in different ways depending on your needs. (One way to do this is to have three kinds of iterators for Unicode strings).

Also, as I mentioned elsewhere, Unicode strings do not lend themselves to performance and iteration characteristics provided by std::string; in particular, constant time random access is not going to work for two of those three views of a string. I think that a Unicode string is much better matched to characteristics of SGI's rope class, but I haven't had the time to research that in detail.

meeroh
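The idea of several views over one string could be sketched roughly as follows. This is only an illustration under stated assumptions: the storage is taken to be well-formed UTF-8 in a std::string (whose bytes serve as the lowest-level, serialization view), the functions return vectors where a real design would more likely expose iterators, and the names code_points, characters, and is_combining are hypothetical. How the three views in the list map onto bytes, decoded values, and combining sequences is an interpretation, not something the message specifies.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplifying assumption: only U+0300..U+036F are treated as combining marks.
bool is_combining(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Decoded view: turn UTF-8 bytes into code point values.
// Assumes well-formed input; there is no error handling here.
std::vector<char32_t> code_points(const std::string& utf8) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < utf8.size();) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Abstract-character view: group each base code point with the
// combining marks that follow it.
std::vector<std::u32string> characters(const std::vector<char32_t>& cps) {
    std::vector<std::u32string> out;
    for (char32_t cp : cps) {
        if (out.empty() || !is_combining(cp))
            out.push_back(std::u32string(1, cp));  // start a new character
        else
            out.back().push_back(cp);              // attach mark to previous one
    }
    return out;
}
```

For the UTF-8 bytes of C, combining caron, e, the three views have different lengths: 4 bytes, 3 code points, and 2 abstract characters, which is why an index into one view is not an index into another.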

Vladimir Prus

10:10 a.m.

Hi Miro,

...

...
You almost caught me ;-) I've changed the message subject on purpose -- to indicate that I'm no longer talking about program_options. I'm interested in how a 'right' Unicode string can be implemented, but I'm not sure it's possible to design such a string now, so program_options will still have to use a much simpler approach.

I am somewhat reluctant to discuss this in detail at this time, not because I have something to hide, but because I have something to learn: I need to investigate some aspects of Unicode, the ICU library, and locales and facets in the C++ standard before I can form a more complete picture of the design of a Unicode string. However, I don't have the time to do all the research right now, because there are other things I need to do that I am getting paid to do, and full Unicode support is not on my work to-do list.

That's what I think too. There are too many Unicode issues and too little time.

...

Basically, I know enough to know how _not_ to do it, but I am not sure that I know enough to know how to do it right :-)

However, I currently think that there are legitimate reasons why one would want to view a Unicode string as (in increasing order of complexity):

- a sequence of code points (this is useful for serialization)
- a sequence of encoded characters (this is useful for transcoding)
- a sequence of abstract characters (this is useful for most high-level string transformations, such as substrings, find, etc.)

Therefore I think that a Unicode string should probably not be represented as a container of any one of those three, but instead should have an interface that lets you treat it in different ways depending on your needs. (One way to do this is to have three kinds of iterators for Unicode strings).

This seems reasonable. Hopefully one day someone will take the time to really think through all the issues and implement something.

- Volodya

Jeremy Maitin-Shepard

2 p.m.

Miro Jurisic <macdev@meeroh.org> writes:

...

[snip]

...

Also, as I mentioned elsewhere, Unicode strings do not lend themselves to performance and iteration characteristics provided by std::string; in particular, constant time random access is not going to work for two of those three views of a string. I think that a Unicode string is much better matched to characteristics of SGI's rope class, but I haven't had the time to research that in detail.

I believe that random access in those views is in fact a linear operation, whereas random access in a rope is an O(log N) operation.

--
Jeremy Maitin-Shepard
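The linear cost can be illustrated with a small sketch, assuming well-formed UTF-8 stored in a std::string. The function name offset_of_code_point is hypothetical. Because code points occupy a variable number of bytes, locating the n-th one means scanning from the start; the rope side of the comparison is not shown here.

```cpp
#include <cstddef>
#include <string>

// Byte offset of the n-th code point (0-based) in well-formed UTF-8,
// or npos if the string holds fewer than n + 1 code points.
// The loop visits bytes one by one, so the cost grows with the offset.
std::size_t offset_of_code_point(const std::string& utf8, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        // Continuation bytes have the form 10xxxxxx; anything else
        // starts a new code point.
        bool starts_code_point = (b & 0xC0) != 0x80;
        if (starts_code_point) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return std::string::npos;
}
```

For the bytes of C, combining caron, e, the code points at indices 0, 1, and 2 start at byte offsets 0, 1, and 3, so no fixed multiplication maps an index to an offset.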
