Work that has been done on Unicode

It seems Erik Wien and then Graham Barnett worked on full-featured Unicode implementations for Boost but never managed to finish and polish that work. Erik Wien submitted his code, but the link he gave is now dead, and Graham Barnett never gave more than a few headers. I would be interested in getting all the work that has been done by both developers, so if any of you have part of it, please share it.

Since no one has old code for reuse, I will start to write a few usable tools from scratch. Note that I am not a Unicode expert nor a C++ guru. I am just willing to work in that area and hope my code could be useful to some. Feel free to comment and give ideas, since I think the design is the most important thing first, especially for usage with Boost, even though this topic has already been discussed a few times.

string/wstring is not really suited to containing Unicode data, because of limitations of char_traits, the basic_string interface, and the dependence of the string and wstring types on locales. I think it is better to consider the string, char[], wstring and wchar_t[] types to be in the system locale and to use a separate type for Unicode strings.

The aim would then be to provide an abstract Unicode string type, independent of C++ locales, working at the grapheme cluster level while also giving access to lower levels. It would only handle Unicode in a generic way at the beginning (no locales or tailored things). This string could maintain the characters in a normalized form (which means potential loss of information about singleton characters) in order to allow more efficient comparison and searching.

It would use a policy-based design in order to be as generic as possible and therefore customizable on many levels, allowing you to use the data structure and encoding you need for interfacing with other libraries. The policy-based design would also provide functionality similar to flex_string, to explicitly choose whether to use COW or other optimizations depending on the situation. There would also be a const version, following the const_string design. Just like super_string, the class would bundle the algorithms from string_algo, since it can probably implement them in a more efficient way than iterating over the grapheme clusters.
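To make this a bit more concrete, here is a very rough sketch of the kind of policy parameterization I have in mind. All the names are placeholders and nothing here is final; only the lowest-level access is shown, and the grapheme cluster interface plus the storage/COW and normalization policies would come on top of this.

#include <string>
#include <vector>
#include <boost/cstdint.hpp>

// Sketch only: each encoding policy says which code unit type and which
// container the string stores internally.
struct utf8_encoding  { typedef char            code_unit; typedef std::string                  storage; };
struct utf16_encoding { typedef boost::uint16_t code_unit; typedef std::vector<boost::uint16_t> storage; };
struct utf32_encoding { typedef boost::uint32_t code_unit; typedef std::vector<boost::uint32_t> storage; };

template <class Encoding>
class unicode_string_sketch
{
public:
    typedef typename Encoding::code_unit code_unit;
    typedef typename Encoding::storage   storage_type;

    // The default, grapheme-cluster-level iteration would go here.
    // Only code unit access is sketched:
    typedef typename storage_type::const_iterator code_unit_iterator;

    code_unit_iterator code_units_begin() const { return data_.begin(); }
    code_unit_iterator code_units_end()   const { return data_.end(); }

private:
    storage_type data_; // kept normalized (e.g. NFC) by all mutating operations
};

// The typedef most people would end up using:
typedef unicode_string_sketch<utf16_encoding> unicode_string;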

Hi,
On Saturday, 16.09.2006, at 02:22 +0200, loufoque wrote:
Since no one has old code for reuse, I will start to write a few usable tools from scratch.
Well, I have some new code with little functionality. Feel free to contact me personally if you are interested.
Note that I am not a Unicode expert nor a C++ guru. I am just willing to work in that area and hope my code could be useful to some.
I do think that's important. As we saw, lack of Unicode support is a serious problem, even hindering a Boost XML library.
Feel free to comment and give ideas, since I think the design is the most important thing first, especially for usage with boost, even though this topic has already been discussed a few times.
string/wstring is not really suited to containing Unicode data, because of limitations of char_traits, the basic_string interface, and the dependence of the string and wstring types on locales. I think it is better to consider the string, char[], wstring and wchar_t[] types to be in the system locale and to use a separate type for Unicode strings.
In the optimal case, the system is Unicode-aware. Oh well.
The aim would then be to provide an abstract Unicode string type, independent of C++ locales, working at the grapheme cluster level while also giving access to lower levels.
Abstract? You mean like virtual foo() = 0; ? Probably better not.
It would only handle unicode in a generic way at the beginning (no locales or tailored things).
That's fine. Do you have plans on which Unicode encoding to use internally?
This string could maintain the characters in a normalized form (which means potential loss of information about singleton characters) in order to allow more efficient comparison and searching.
It would use a policy-based design in order to be as generic as possible and therefore customizable on many levels, allowing you to use the data structure and encoding you need for interfacing with other libraries.
The policy-based design would also provide functionality similar to flex_string, to explicitly choose whether to use COW or other optimizations depending on the situation.
There would also be a const version, following the const_string design.
Be careful that it remains usable.
Just like super_string, the class would bundle algorithms from string_algo, since it can probably implement them in a more efficient way than iterating over the grapheme clusters.
Oh well I don't know if I like that. You probably should concentrate on the core functionality first. You probably can add specialisations to string_algo later. Way later :-). Kind regards, Aristid Breitkreuz

Aristid Breitkreuz wrote :
Abstract? You mean like virtual foo() = 0; ? Probably better not.
No, I just meant an encoding-agnostic Unicode string type. That could, indeed, have been implemented through inheritance and polymorphism, but I chose to use templates instead, for performance.
It would only handle unicode in a generic way at the beginning (no locales or tailored things).
That's fine. Do you have plans on which Unicode encoding to use internally?
UTF-8, UTF-16 and UTF-32 would all be available for implementations, and each one would be able to take or give the other ones for input/output.
Be careful that it remains usable.
I would like to design it to be customizable, but of course there would be a typedef for the specialization that fits best in most cases.

On Saturday, 16.09.2006, at 19:55 +0200, loufoque wrote:
Aristid Breitkreuz wrote : [snip]
That's fine. Do you have plans on which Unicode encoding to use internally?
UTF-8, UTF-16 and UTF-32 would all be available for implementations, and each one would be able to take or give the other ones for input/output.
I guess that every single supported type is extra complexity, right? Would not UTF-8 (for brevity and compatibility) and UTF-32 (because it might be better for some algorithms) suffice?

On 17/09/06, Aristid Breitkreuz <aribrei@arcor.de> wrote:
On Saturday, 16.09.2006, at 19:55 +0200, loufoque wrote:
Aristid Breitkreuz wrote : [snip]
That's fine. Do you have plans on which Unicode encoding to use internally?
UTF-8, UTF-16 and UTF-32 would all be available for implementations, and each one would be able to take or give the other ones for input/output.
I guess that every single supported type is extra complexity, right? Would not UTF-8 (for brevity and compatibility) and UTF-32 (because it might be better for some algorithms) suffice?
That's not entirely accurate. UTF-8 is Latin-centric, so that all Latin texts can be processed in linear time, taking longer for the rest. UTF-16 is common-centric, in that it works efficiently for all common texts in all common scripts, except for a few. Choosing UTF-8 over UTF-16 would make the implementation (and accompanying software) slow in all parts of the world that aren't solely using Latin characters. That would be most of Europe, Asia, Africa, South America, and a number of people in North America and Australia. Forcing them to UTF-32 makes for quite a lot worse memory use than could reasonably be expected. I see quite a lot of use for the UTF-16 case, perhaps even more than for the UTF-8 one.

On Sunday, 17.09.2006, at 18:10 +0200, Peter Bindels wrote:
On 17/09/06, Aristid Breitkreuz <aribrei@arcor.de> wrote:
On Saturday, 16.09.2006, at 19:55 +0200, loufoque wrote:
Aristid Breitkreuz wrote : [snip]
That's fine. Do you have plans on which Unicode encoding to use internally?
UTF-8, UTF-16 and UTF-32 would all be available for implementations, and each one would be able to take or give the other ones for input/output.
I guess that every single supported type is extra complexity, right? Would not UTF-8 (for brevity and compatibility) and UTF-32 (because it might be better for some algorithms) suffice?
That's not entirely accurate. UTF-8 is Latin-centric, so that all latin texts can be processed in linear time, taking longer for the rest.
I thought that for algorithmic processing, UTF-32 is optimal in most cases?
UTF-16 is common-centric, in that it works efficiently for all common texts in all common scripts, except for a few.
That is some 90% space overhead compared to UTF-8 for German / French / ... (European, if you want) texts, and 100% for English texts.
Choosing UTF-8 over UTF-16 would make the implementation (and accompanying software) slow in all parts of the world that aren't solely using Latin characters.
Are you talking about memory overhead? AFAIK UTF-8 is quite good for that. It might be slightly suboptimal for some Asian scripts, but I'm not sure about that. It is guaranteed that UTF-8 never consumes more than 4 bytes per code point.
That would be most of Europe, Asia, Africa, South-America and a number of people in North-America and Australia.
Yes, those people (I am one of them) don't use solely Latin (=ASCII-7?) characters. Still, I'd usually prefer UTF-8.
Forcing them to UTF-32 makes for quite a lot worse memory use than could reasonably be expected. I see quite a lot of use for the UTF-16 case, perhaps even more than the UTF-8 one.
UTF-32 is _always_ bad on memory, because Unicode won't ever use more than (I think) 21 bits. But UTF-32 is great for some algorithms. (OK, maybe Unicode still has some traps hindering efficient UTF-32 algorithms, who knows?)

Peter Bindels wrote :
That's not entirely accurate. UTF-8 is Latin-centric, so that all latin texts can be processed in linear time, taking longer for the rest.
Huh? Not really. All non-ASCII characters, including Latin ones, require more than one byte per character. It can still be processed in linear time, though; it just means you can't have random access.
UTF-16 is common-centric, in that it works efficiently for all common texts in all common scripts, except for a few. Choosing UTF-8 over UTF-16 would make the implementation (and accompanying software) slow in all parts of the world that aren't solely using Latin characters.
I doubt the overhead is really noticeable. UTF-16 just makes validation and iteration a little simpler.
That would be most of Europe, Asia, Africa, South-America and a number of people in North-America and Australia. Forcing them to UTF-32 makes for quite a lot worse memory use than could reasonably be expected.
UTF-32 allows random access but that's rather useless since you need to iterate over the string anyway to handle combining characters.

On 17/09/06, loufoque <mathias.gaunard@etu.u-bordeaux1.fr> wrote:
Peter Bindels wrote :
That's not entirely accurate. UTF-8 is Latin-centric, so that all latin texts can be processed in linear time, taking longer for the rest.
Huh? Not really. All non-ASCII characters, including Latin ones, require more than one byte per character.
OK, I'll take back "Latin"; I meant to say the Latin subset representable in 7-bit ASCII.
UTF-16 is common-centric, in that it works efficiently for all common texts in all common scripts, except for a few. Choosing UTF-8 over UTF-16 would make the implementation (and accompanying software) slow in all parts of the world that aren't solely using Latin characters.
I doubt the overhead is really noticeable. UTF-16 just makes validation and iteration a little simpler.
Indexing in UTF-32 is trivial. Indexing in UTF-16 is fairly trivial, and given the boundary between the Basic Multilingual Plane and the higher planes, you can treat all characters above 0xFFFF (encoded with two code units) as very irregular. You could then keep an array of the indexes where these characters appear in your string (adding a slight bit of overhead), making indexing constant-time except around occurrences of those characters. You cannot add this technique to UTF-8 texts because non-7-bit characters are a lot more common. Add to that that UTF-8's 2-byte sequences only cover code points up to 0x07FF (11 bits). That means that all characters from 0x0800 to 0xFFFF (excluding the surrogate range) use a byte more in UTF-8 than they would in UTF-16. I checked this; it includes about all of Asia, in particular all common Japanese and Chinese characters, as well as a number of Latin extended characters. You can see the ranges of Unicode characters in the filenames of the links at: http://www.unicode.org/charts/
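To illustrate the index trick (a rough sketch; the function and variable names are made up, and the sorted array of lead surrogate positions would have to be maintained on every modification of the string):

#include <algorithm>
#include <cstddef>
#include <vector>

// Maps a code point index to the code unit index at which that code point
// starts in a UTF-16 string, given the sorted positions of lead surrogates.
inline std::size_t code_point_to_unit_index(
    const std::vector<std::size_t>& lead_positions,
    std::size_t cp_index)
{
    std::size_t u = cp_index;
    for (;;)
    {
        // number of surrogate pairs that start before code unit index u
        std::size_t pairs_before = static_cast<std::size_t>(
            std::lower_bound(lead_positions.begin(), lead_positions.end(), u)
            - lead_positions.begin());
        if (u == cp_index + pairs_before)
            return u;                    // fixed point reached: u is the answer
        u = cp_index + pairs_before;     // u only grows, so this terminates
    }
}

With no supplementary characters in the string this degenerates to a single lookup in an empty array, i.e. effectively constant time.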
UTF-32 allows random access but that's rather useless since you need to iterate over the string anyway to handle combining characters.
That's a point I hadn't thought of. In that case, what advantages does UTF-32 hold over any of the other two?

Peter Bindels wrote :
Indexing in UTF-32 is trivial. Indexing in UTF-16 is fairly trivial, and given the boundary between the Basic Multilingual Plane and the higher planes, you can treat all characters above 0xFFFF (encoded with two code units) as very irregular. You could then keep an array of the indexes where these characters appear in your string (adding a slight bit of overhead), making indexing constant-time except around occurrences of those characters.
I had already thought of this. It would allow random access in O(log(n)), n being the number of surrogate pairs in the string. It also allows mutable iterators not to invalidate each other when the string is modified. However, for working at the grapheme cluster level, I don't think random access to the underlying encoding brings anything useful. Some searching algorithms need random access, though, but these could work on the bytes with a few checks, since all UTF encodings (except UTF-7) guarantee that a sequence cannot occur within a longer sequence or across the boundary of two other sequences. While combining characters may be rare in some scripts, they are numerous in others, so indexing them doesn't seem like such a good idea. It could be interesting to provide random access for people who want to work at a lower level, but that's not a priority.
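For example, a byte-level search on UTF-8 data can use std::search directly on the raw bytes (a sketch only; the grapheme cluster boundary checks mentioned above are left out):

#include <algorithm>
#include <string>

// Because a valid UTF-8 sequence never starts in the middle of another
// sequence, a byte-wise match of the encoded needle is always aligned on
// code point boundaries.
inline std::string::size_type find_encoded(
    const std::string& utf8_haystack,
    const std::string& utf8_needle)
{
    std::string::const_iterator it =
        std::search(utf8_haystack.begin(), utf8_haystack.end(),
                    utf8_needle.begin(), utf8_needle.end());
    return it == utf8_haystack.end()
        ? std::string::npos
        : static_cast<std::string::size_type>(it - utf8_haystack.begin());
}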
You cannot add this technique to UTF-8 texts because non-7-bit characters are a lot more common.
Another problem with UTF-8 is that reverse iterating is a bit more expensive than forward iterating, because you don't know how many bytes you have to go back until you encounter the first byte of the multi-byte sequence. UTF-8 isn't usually a good idea unless you need it to feed GTK+ or other libraries. Actually, size-wise, another interesting encoding is GB18030, especially when working with a mix of Chinese and English. However, there is no simple algorithmic mapping to Unicode code points, so UTF-16 stays the best choice by default.
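To show what that backwards step involves, here is a sketch (assuming the input is valid UTF-8):

#include <string>

// Steps back to the first byte of the previous code point: continuation
// bytes all match the bit pattern 10xxxxxx, so we skip them.
inline std::string::const_iterator prev_code_point(
    std::string::const_iterator it,
    std::string::const_iterator begin)
{
    do
    {
        --it;
    }
    while (it != begin &&
           (static_cast<unsigned char>(*it) & 0xC0) == 0x80);
    return it;
}

It is still bounded by at most 4 bytes per step, just a little more work than going forward.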
That's a point I hadn't thought of.
That's what the whole "grapheme cluster" thing is about. <e acute> and <e><acute> should be considered equal, each one being a single grapheme cluster, which is what an end user thinks of as a single character. Also, searching for foo in foo<acute> shouldn't give any result, since the last character is o<acute>, not o. All algorithms must ensure that the elements of a grapheme cluster don't get separated. This can be achieved simply by working with iterators over grapheme clusters or, in a more optimized way, by working carefully with code units or code points. That is why, along with the default grapheme cluster interface, access will be given to the lower-level layers for power users.
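A concrete example of the equivalence involved; the byte strings are just the UTF-8 encodings of the two spellings, and the normalization step itself is what the library has to provide:

#include <cassert>
#include <string>

int main()
{
    std::string composed   = "\xC3\xA9";   // U+00E9, LATIN SMALL LETTER E WITH ACUTE
    std::string decomposed = "e\xCC\x81";  // U+0065 U+0301, e + COMBINING ACUTE ACCENT

    assert(composed != decomposed);        // byte-wise they differ...
    // ...but both are the single grapheme cluster "é", and after
    // normalization to NFC both become "\xC3\xA9". Providing exactly this
    // equivalence is the job of the grapheme cluster level interface.
    return 0;
}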
In that case, what advantages does UTF-32 hold over any of the other two?
First, some people might want to work directly with code points and not grapheme clusters, just like most (if not all) other Unicode string classes do. With UTF-32 the code units are code points, so working on them is very lightweight. Also, it can be useful if you need to interface with something that uses UTF-32, since that way no conversion is needed for input and output. That is also why there will be a UTF-8 backend too.

On 9/16/06, loufoque <mathias.gaunard@etu.u-bordeaux1.fr> wrote:
Since no one has old code for reuse, I will start to write a few usable tools from scratch. Note that I am not a Unicode expert nor a C++ guru. I am just willing to work in that area and hope my code could be useful to some.
I'm sorry to enter the discussion this late, but I was unable to reply earlier. Graham Barnett and I started on a Unicode library implementation a year ago but failed to deliver anything. I can offer you two things. One is some codecvt facets for UTF-8 and UTF-16, slightly faster and more up to date than those I think are in Boost now. The other is some advice: I've recently been thinking about how this whole Unicode thing should proceed.

Feel free to comment and give ideas, since I think the design is the
most important thing first, especially for usage with boost, even though this topic has already been discussed a few times.
string/wstring is not really suited to containing Unicode data, because of limitations of char_traits, the basic_string interface, and the dependence of the string and wstring types on locales. I think it is better to consider the string, char[], wstring and wchar_t[] types to be in the system locale and to use a separate type for Unicode strings.
The aim would then be to provide an abstract Unicode string type, independent of C++ locales, working at the grapheme cluster level while also giving access to lower levels. It would only handle Unicode in a generic way at the beginning (no locales or tailored things). This string could maintain the characters in a normalized form (which means potential loss of information about singleton characters) in order to allow more efficient comparison and searching.
I fully agree with this. It may be a good idea to separate the library into smaller modules. The grapheme-based string will probably use a string of code points underneath. Given that, you may want to implement a UTF library first, which would just deal with the code point <-> code unit conversion. Setting out to design this UTF library first will also concentrate and streamline the discussion. The Boost community is English-language centred, and not everyone may be intimately familiar with the concept of grapheme clusters.

When building a real Unicode library on top of a UTF library, the discussion can focus on handling grapheme clusters, normalisation, and the Unicode database you'll need for that. (Note that when you say "comparison" and "searching" you're speaking of just binary comparison; for locale-specific collation you'll probably want to attach sort keys to strings for efficiency. That's for later, though.)

Just my 2p. I'd be delighted to explain my views in more detail. Regards, Rogier
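P.S. By the code point <-> code unit conversion I mean primitives of roughly this shape; this is only a sketch, not an existing interface, just to show how small the UTF layer can be kept:

#include <stdexcept>
#include <string>
#include <boost/cstdint.hpp>

// Appends the UTF-8 encoding of one code point to a byte string.
inline void append_utf8(std::string& out, boost::uint32_t cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point");
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x110000)
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        throw std::invalid_argument("code point out of range");
    }
}

The decoding direction and the UTF-16 pair are equally small, and everything above (grapheme clusters, normalisation, collation) can be built on top without caring about code units any more.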

What would you think of an iterator-based approach with the following interface?

// Bidirectional iterator for UTF-8 / UTF-16 coded data
// I: Raw data iterator
// C: Unicode character type
template <class I, typename C> class unicode_iterator;

// I: Unicode character iterator
template <typename I> class unicode_string;

// some standard types:
typedef unicode_iterator<std::string::iterator, wchar_t> utf8_iterator;
typedef unicode_iterator<std::basic_string<short>::iterator, wchar_t> utf16_iterator;
typedef std::wstring::iterator utf32_iterator;

typedef unicode_string<utf8_iterator> utf8_string;
typedef unicode_string<utf16_iterator> utf16_string;
typedef unicode_string<utf32_iterator> utf32_string;

This would allow a lot of algorithms to work directly together with a "slim" unicode framework.

Note that some of these have already been done by John Maddock, as I can see in the regex header directory.
Tomas

"Nils Springob" <nils.springob@nicai-systems.de> wrote in message news:eema7s$9ue$1@sea.gmane.org...
What would you think of an iterator-based approach with the following interface?

// Bidirectional iterator for UTF-8 / UTF-16 coded data
// I: Raw data iterator
// C: Unicode character type
template <class I, typename C> class unicode_iterator;

// I: Unicode character iterator
template <typename I> class unicode_string;

// some standard types:
typedef unicode_iterator<std::string::iterator, wchar_t> utf8_iterator;
typedef unicode_iterator<std::basic_string<short>::iterator, wchar_t> utf16_iterator;
typedef std::wstring::iterator utf32_iterator;

typedef unicode_string<utf8_iterator> utf8_string;
typedef unicode_string<utf16_iterator> utf16_string;
typedef unicode_string<utf32_iterator> utf32_string;

This would allow a lot of algorithms to work directly together with a "slim" unicode framework.

Tomas Pecholt wrote:
Note that some of these have already been done by John Maddock, as I can see in the regex header directory.
Yep, they're quite similar to this idea.
// some standard types:
typedef unicode_iterator<std::string::iterator, wchar_t> utf8_iterator;
typedef unicode_iterator<std::basic_string<short>::iterator, wchar_t> utf16_iterator;
typedef std::wstring::iterator utf32_iterator;

typedef unicode_string<utf8_iterator> utf8_string;
typedef unicode_string<utf16_iterator> utf16_string;
typedef unicode_string<utf32_iterator> utf32_string;
Note that wchar_t is *not* guaranteed to be 32 bits, as you seem to be assuming here (it's not on Windows, for example). Just a heads up. John.

That's fine! I didn't expect it to be in the regex header dir... OK, using the existing code would give something like the following:

template <class Char8Iterator = std::string::iterator>
typedef unicode_string<u32_to_u8_iterator<Char8Iterator>, u8_to_u32_iterator<Char8Iterator> > utf8_string;

template <class Char16Iterator = std::basic_string<boost::uint16_t>::iterator>
typedef unicode_string<u32_to_u16_iterator<Char16Iterator>, u16_to_u32_iterator<Char16Iterator> > utf16_string;

template <class Char32Iterator = std::basic_string<boost::uint32_t>::iterator>
typedef unicode_string<Char32Iterator, Char32Iterator> utf32_string;

This would store UTF-8 and UTF-16 strings internally in the raw format and allow access to the 32-bit UTF-32 values. The unicode_string class should implement most of the std::basic_string methods; however, the complexity of these methods would in most cases be linear!

bool empty()                                                  // constant = O(1)
size_type size()                                              // linear = O(s1.size())
append(const unicode_string & s2)                             // linear = O(s2.size())
append(uint32_t uc)                                           // constant = O(1)
insert(size_type pos, const unicode_string & s2)              // linear = O(s1.size()+s2.size())
insert(size_type pos, uint32_t uc)                            // linear = O(s1.size())
int compare(const unicode_string & s2)                        // linear = O(s1.size()+s2.size())
erase(size_type pos, size_type n)                             // linear = O(s1.size())
replace(size_type i, size_type n, const unicode_string & s2)  // linear = O(s1.size()+s2.size())
substr(size_type i, size_type n)                              // linear = O(s1.size())

All find methods would have the same complexity as the corresponding std::basic_string methods, because they can be transformed to work on the raw data! isalpha, isupper and the other functions can be defined on the UTF-32 values (or the wchar_t version can be used on some platforms) in the boost namespace. In the same way, the Unicode category values can be implemented for the UTF-32 values. Additionally, there could be iterators to support other encodings like Latin-1 (latin1_to_u32<> and u32_to_latin1<>). Conversions could be done by simple assignment:

latin1_string l1 = "a simple test with äöü"; // given that latin1 is the system encoding
utf8_string u8 = l1;

This would convert a Latin-1 encoded string into a UTF-8 encoded string.
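For example, counting the code points of a UTF-8 string with those converting iterators could look roughly like this (a sketch only; the exact header location and constructor signatures may differ between Boost versions, so check the regex headers of your installation):

#include <cstddef>
#include <iostream>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>

int main()
{
    std::string utf8 = "a simple test with \xC3\xA4\xC3\xB6\xC3\xBC"; // "äöü" in UTF-8
    typedef boost::u8_to_u32_iterator<std::string::const_iterator> iterator;

    std::size_t code_points = 0;
    for (iterator it(utf8.begin()), end(utf8.end()); it != end; ++it)
        ++code_points;

    // prints "25 code units, 22 code points"
    std::cout << utf8.size() << " code units, "
              << code_points << " code points\n";
    return 0;
}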

On Monday, 18.09.2006, at 16:25 +0200, Nils Springob wrote:
What would you think of an iterator based approach with the following interface:
Iterators are good.
// Bidirectional iterator for UTF-8 / UTF-16 coded data
// I: Raw data iterator
// C: Unicode character type
template <class I, typename C> class unicode_iterator;
Fair enough.
// I: Unicode character iterator
template <typename I> class unicode_string;
Judging from the code below, this would mean that the iterator implicitly specifies the internal string type? Something like typename iterator_traits<I>::container_type cont; ?
// some standard types:
typedef unicode_iterator<std::string::iterator, wchar_t> utf8_iterator;
You can't assume wchar_t is UTF-32. You can't even assume it's 32 bits.
typedef unicode_iterator<std::basic_string<short>::iterator, wchar_t> utf16_iterator;
You can't assume short is 16-bits (but that's more or less academic).
typedef std::wstring::iterator utf32_iterator;
Same.
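One way to sidestep those assumptions (just an illustrative sketch using the Boost integer typedefs, not a concrete proposal):

#include <climits>
#include <boost/cstdint.hpp>
#include <boost/static_assert.hpp>

// Pick the code unit types explicitly instead of relying on wchar_t or
// short, and make the width requirements explicit at compile time.
typedef boost::uint_least16_t utf16_code_unit; // at least 16 bits, always available
typedef boost::uint_least32_t utf32_code_unit; // at least 32 bits, always available

BOOST_STATIC_ASSERT(sizeof(utf16_code_unit) * CHAR_BIT >= 16);
BOOST_STATIC_ASSERT(sizeof(utf32_code_unit) * CHAR_BIT >= 32);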

David Abrahams writes:
Aristid Breitkreuz <aribrei@arcor.de> writes:
You can't assume short is 16-bits (but that's more or less academic).
Academic or no, I'd love to hear your reasoning on that one.
The standard is clear that short must be /at least/ 16 bits, but you certainly can't assume that the following assert will not fire:

unsigned short i = 32768;
i *= 2;
assert( i == 0 );

In practice, except on esoteric embedded DSPs where sizeof(almost anything) is 1 and char is 32 bits, I can't imagine that assert firing any time in the next 20 years. (In 20 years' time, I expect 128-bit platforms to appear, and short *might* start meaning /32 bits/.)
--
Martin Bonner
Martin.Bonner@Pitechnology.com
Pi Technology, Milton Hall, Ely Road, Milton, Cambridge, CB4 6WZ, ENGLAND
Tel: +44 (0)1223 203894

"Martin Bonner" <martin.bonner@pitechnology.com> writes:
David Abrahams writes:
Aristid Breitkreuz <aribrei@arcor.de> writes:
You can't assume short is 16-bits (but that's more or less academic).
Academic or no, I'd love to hear your reasoning on that one.
The standard is clear that short must be /at least/ 16 bits
Okay, that's all I needed to hear, thanks.
--
Dave Abrahams
Boost Consulting
www.boost-consulting.com

On Wednesday, 20.09.2006, at 11:22 -0400, David Abrahams wrote:
"Martin Bonner" <martin.bonner@pitechnology.com> writes:
David Abrahams writes:
Aristid Breitkreuz <aribrei@arcor.de> writes:
You can't assume short is 16-bits (but that's more or less academic).
Academic or no, I'd love to hear your reasoning on that one.
"Don't assume type sizes" firstly usually holds true. Also, while /not/ owning a copy of the standard, I could not figure out a theoretical reason why short must not be longer than 16 bits. (Oh well.)
The standard is clear that short must be /at least/ 16 bits
Okay, that's all I needed to hear, thanks.
Maybe I should have clarified that I was talking about short not necessarily being /exactly/ 16 bits. Regards Aristid Breitkreuz

Aristid Breitkreuz <aribrei@arcor.de> writes:
You can't assume short is 16-bits (but that's more or less academic).
Academic or no, I'd love to hear your reasoning on that one.
The standard is clear that short must be /at least/ 16 bits
Okay, that's all I needed to hear, thanks.
Actually that's not correct. The Standard is quite clear when it states [3.9.1.2]:

There are four signed integer types: "signed char", "short int", "int", and "long int." In this list, each type provides at least as much storage as those preceding it in the list.

And [3.9.1.3]:

For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "unsigned char", "unsigned short int", "unsigned int", and "unsigned long int," each of which occupies the same amount of storage and has the same alignment requirements (3.9) as the corresponding signed integer type.

HTH
Regards Hartmut
-- Dave Abrahams Boost Consulting www.boost-consulting.com

Hartmut Kaiser wrote:
Aristid Breitkreuz <aribrei@arcor.de> writes:
You can't assume short is 16-bits (but that's more or less academic).
Academic or no, I'd love to hear your reasoning on that one.
The standard is clear that short must be /at least/ 16 bits
Okay, that's all I needed to hear, thanks.
Actually that's not correct. The Standard is quite clear when it states [3.9.1.2]:
There are four signed integer types: "signed char", "short int", "int", and "long int." In this list, each type provides at least as much storage as those preceding it in the list.
And [3.9.1.3]:
For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "unsigned char", "unsigned short int", "unsigned int", and "unsigned long int," each of which occupies the same amount of storage and has the same alignment requirements (3.9) as the corresponding signed integer type)
You need the C standard to get the complete picture; it states that USHRT_MAX is at least 65535.

Peter Dimov wrote:
Actually that's not correct. The Standard is quite clear when it states [3.9.1.2]:
There are four signed integer types: "signed char", "short int", "int", and "long int." In this list, each type provides at least as much storage as those preceding it in the list.
And [3.9.1.3]:
For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "unsigned char", "unsigned short int", "unsigned int", and "unsigned long int," each of which occupies the same amount of storage and has the same alignment requirements (3.9) as the corresponding signed integer type)
You need the C standard to get the complete picture; it states that USHRT_MAX is at least 65535.
The corresponding section of the C Standard you're referring to states [Appendix E:1, implementation limits (informative)]:

The contents of the header <limits.h> are given below, in alphabetical order. The minimum magnitudes shown shall be replaced by implementation-defined magnitudes with the same sign.

So no restriction is imposed by the Standard here, AFAIU.
Regards Hartmut

Hartmut Kaiser wrote:
Peter Dimov wrote:
You need the C standard to get the complete picture; it states that USHRT_MAX is at least 65535.
The corresponding section of the C Standard you're referring to states [Appendix E:1, implementation limits (informative)]:
I was thinking about 5.2.4.2.1 (in C99; I don't have C90 to check): "Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign."
participants (11)
- Aristid Breitkreuz
- David Abrahams
- Hartmut Kaiser
- John Maddock
- loufoque
- Martin Bonner
- Nils Springob
- Peter Bindels
- Peter Dimov
- Rogier van Dalen
- Tomas Pecholt