
On Sunday 17 August 2008 09:32:40 Zeljko Vrba wrote:
On Sat, Aug 16, 2008 at 11:50:05PM +0300, dizzy wrote:
I do not agree. They generally do their job fine (which is to provide portable support for working with native, unchecked, platform integer types). For any other needs you should probably use another tool.
Well, they do *not* do their job fine: (-1U < 2) == false, which is mathematical nonsense (more on that below).
That does not prove anything. Any good tool has bad use cases. It does not prove that the integer types fail at their intended purpose. You seem to want more than that purpose.
Signed arithmetic overflow is undefined behavior, and some CPUs actually raise an exception on overflow (e.g. MIPS). Every 'a+b' expression, with a and b being signed integers, is potential UB. Some machines (e.g. x86) do not raise an exception but set an overflow flag which may be tested by a single instruction, yet I don't know of a compiler which offers run-time overflow checking as a code-generation option. Portable checks for overflow (relying on bit operations) incur immense overhead (certainly much greater than a single instruction).
Which is exactly why such checks were not introduced in the standard: they are prohibitive or impossible on certain platforms. So? That just means the native integer types inherit the limitations of the low-level CPU arithmetic operations of the many platforms out there. Kinda logical if you ask me, since C++ tries to be defined so that it has fast implementations on those native platforms instead of being strict about its own virtual machine.
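For concreteness, here is roughly what a portable pre-check for signed addition boils down to (a comparison-based variant of the bit-twiddling checks mentioned above; the helper name is made up and is not part of the discussion):

#include <climits>
#include <stdexcept>

// Hypothetical helper: reject the addition before it is performed,
// so the undefined overflow in a + b never actually happens.
int checked_add(int a, int b)
{
    if ((b > 0 && a > INT_MAX - b) ||   // a + b would exceed INT_MAX
        (b < 0 && a < INT_MIN - b))     // a + b would go below INT_MIN
        throw std::overflow_error("signed addition would overflow");
    return a + b;
}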
Writing an extra set of parentheses is not visually intrusive or cluttering.
Says who?
Writing static_cast<int>(expr), (int)(expr), or surrounding the offending code with #pragma _is_ cluttering and intrusive.
Come on, if we are to argue about how clumsy it _looks_ to write additional parentheses vs static_cast, I think we have too much time on our hands. There are countless other compromises we make when programming in C++ just because the rest works so well. And no, I don't see that as a flaw in the language, in the sense that one could have all the good things of C++ without the bad things; I see it as a consequence of the good things of C++. Every language feature has side effects, some of which we may not like in certain contexts, but that does not make them a flaw.
Yes, I want my code to be short, concise, and easily readable, in addition to being correct. So shoot me :-)
Well, that's you. C++ was also made to:
- not make you pay for anything you don't need (I think RTTI is an exception there)
- allow writing code that is very close to the native platform
- be source compatible with C as much as possible
I have researched the comp.lang.c++.moderated archives on this topic, and other sources, and found two pieces of advice:
Peter van der Linden in "Expert C Programming: Deep C Secrets" writes:
"Avoid unnecessary complexity by minimizing your use of unsigned types. Specifically, don't use an unsigned type to represent a quantity just because it will never be negative (e.g., "age" or "national_debt")."
A quote of Bjarne Stroustrup: "The unsigned integer types are ideal for uses that treat storage as a bit array. Using an unsigned instead of an int to gain one more bit to represent positive integers is almost never a good idea. Attempts to ensure that some values are positive by declaring variables unsigned will typically be defeated by the implicit conversion rules."
I personally use "unsigned" to mean a value that cannot be negative, and I am always careful when comparing it with signed types and in similar cases. That is in low-level code, of course (and it works in layers: starting at some layer the value becomes unsigned after being properly checked for negative values, and from that point on it is used only as unsigned internally, though the outside world can still send signed values). In high-level code (which I am not writing, since my work focuses on low-level code) I would use something like your proposed library (which proves that C++ not only offers native types that satisfy _my_ needs, it also lets you build something that satisfies _your_ needs, so they are not "flawed").
Yet, all of the STL uses an unsigned type for size_type and the return value of size().
It uses std::allocator::size_type, which is size_t for obvious reasons. Why size_t is unsigned is a question to ask the C people who made it so in C90, I guess :) Either way, it is low-level enough that I don't complain about that choice.
As much as I'd like to use only signed ints, this becomes prohibitive (due to warnings) when mixing them with the STL. And yes, I've been bitten several times in the past by implicit signed -> unsigned conversions in relational comparisons. The sanest thing would be to throw an exception at runtime if one compares a negative integer with some unsigned quantity, instead of getting false for '-1 < 2U', which is mathematical nonsense.
That implies extra code to check for it, thus violating the basic principle of not paying for what you don't need.
signedness of a type *should not* affect its numerical value in mathematical operations.
But it can't do that without runtime cost. There are three options the compiler has for a < b (a being signed, b being unsigned):
- It can, as it does now, convert a to unsigned (which changes the value whenever a is negative) and compare that; you find this wrong.
- It can convert b to signed and compare with a; you may think this is right, but it isn't: exactly as with the option above, there are cases where the conversion of b to signed does not preserve the value (because the unsigned value of b may be bigger than what the signed type of a can represent). Would you still prefer this over the option above? Maybe so, but others will not, since they need exactly the other way around (the current solution).
- It can promote both arguments to a signed integer type that can represent all values of both (say long) and compare that. This may seem fine to you, but it has the following problems. First, it assumes there is a type that can represent both those values; we can clearly imagine that a and b were long and unsigned long to begin with, so that won't work, and even if they are int, there is no guarantee that long can fit them both, since long can have the same size as int (as it does on many platforms). Second, promoting to a type bigger than int and comparing that may prove more expensive than comparing ints (which the standard defines as the natural integer type of the platform, thus the "fastest" one); so we would be violating the principle of not paying for what we don't need, in the case of people who do know that a is not negative and thus a < b won't do anything bad.
Your solution to this problem?
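For what it's worth, a comparison that keeps the mathematical meaning can be written without widening at all, by handling the sign explicitly; the helper below is only a sketch (the name is made up; C++20 later standardized the same idea as std::cmp_less), and it still costs an extra branch, which is exactly the "paying for what you don't need" point above:

bool less_mixed(int a, unsigned b)
{
    if (a < 0)
        return true;                       // any negative value is less than any unsigned value
    return static_cast<unsigned>(a) < b;   // both non-negative: the unsigned compare is exact
}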
I *do* realize that. However, simple wrappers won't fix operators. I like having as many low-overhead sanity checks in my code as possible, but I don't want to litter the code with a bunch of assert()s in front of every arithmetic expression. Should I replace every 'if(a < b)' with 'if(less(a, b))' and provide a bunch of overloaded less() functions?
No, but why would you do that? Can't you know, at certain locations in your program, whether something may be negative or not? That's actually _why_ I use unsigned types: to signal that a value can't be negative from that moment on (i.e. the user code sees an interface taking an "int", but the called function converts it to unsigned after checking the precondition that the value is not negative, and from that point on uses only unsigned, so having unsigned types actually helps you know that the variable can't be negative).

Or, put differently, how do you know a pointer can't be invalid at every point of use in the program? You can't; you can only have invariants based on the code flow up to that point, or, to help a bit, you can store the value in something that moves the precondition earlier (like a reference, which requires dereferencing the pointer to be initialized, so if the pointer is invalid the UB happens right there). It's the same with receiving a user value as int and storing it in an unsigned to be used later as unsigned. The later code sees unsigned (just as the later pointer code sees the reference) and knows it's a non-negative value (just as the later pointer code knows it has a valid reference). It knows this because at the point of conversion from int to unsigned (or from pointer to reference) you performed the check.
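A small illustration of that layering (the function names are made up): the check happens once, at the signed-to-unsigned boundary, and everything behind it sees only unsigned:

#include <stdexcept>

void render_rows(unsigned row_count)   // internal: row_count is known non-negative here
{
    for (unsigned i = 0; i < row_count; ++i) {
        // ... do the actual work ...
    }
}

void render(int row_count)             // public: accepts the raw signed value
{
    if (row_count < 0)
        throw std::invalid_argument("row_count must be non-negative");
    render_rows(static_cast<unsigned>(row_count));   // precondition checked exactly once, here
}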
Oh, the library would allow conversion from integer<32, 0, 4294967295> to integer<16, 0, 65535>, and insert an appropriate run-time range-check.
Sounds fine.
So types integer<N1, L1, H1> and integer<N2, L2, H2> would be compatible if the intersection of [L1,H1] and [L2,H2] is not empty; the resulting type would be coerced to integer<...> (though other bounds are necessary for other operations, e.g., addition), with a run-time check inserted (this makes it possible to maintain rather lax bounds at compile-time).
===
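To make the checked conversion concrete, here is a very rough sketch of what such an integer<bits, low, high> could look like; everything about it (the wide long long storage, the member names) is illustrative only, not the library actually being proposed:

#include <stdexcept>

template<int Bits, long long Low, long long High>
class integer {
public:
    explicit integer(long long v) : value_(check(v)) {}

    // Converting constructor, e.g. integer<32,0,4294967295> to
    // integer<16,0,65535>: allowed, but range-checked at run time.
    template<int B2, long long L2, long long H2>
    integer(const integer<B2, L2, H2>& other) : value_(check(other.value())) {}

    long long value() const { return value_; }

private:
    static long long check(long long v)
    {
        if (v < Low || v > High)
            throw std::range_error("value outside [Low, High]");
        return v;
    }

    long long value_;   // wide storage just to keep the sketch short
};

// integer<32, 0, 4294967295LL> wide(70000);
// integer<16, 0, 65535> narrow(wide);   // throws: 70000 > 65535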
Actually, I think I could simplify my requirements a bit:
- define template class integer<T> with T == char, short, int, long, long long (no unsigned types allowed!)
So you can't know, at some point in the code, whether some value is positive without either an assumption (error prone) or a runtime check (too costly). I think you should allow signalling syntactically that some type takes only positive values; it is often needed to know this for sure about some variable. It won't break your invariants, and it can be made compatible with the rest of your library ideas.
- allow initialization from _any_ integer<T> or primitive type, larger or smaller, signed or unsigned, provided it is in the range of T; throw exception otherwise
Initialization with implicit conversion too? If so, that might become tricky together with the requirement below about conversion to integer types (in general it's tricky to get implicit conversions right).
- allow conversion to _any_ underlying integer type, signed or unsigned, provided the value fits; otherwise throw an exception
- allow mixed arithmetic between mixed integer<T> classes (e.g. integer<int> + integer<char> would be defined) as well as between integer<T> and any primitive type, subject to the conversion rules above
- arithmetic would be checked for overflow
This, as you said yourself, can be overly expensive on some platforms, can it not? But if you need it, sure.
- comparisons between integer<T> and unsigned types are allowed as long as integer<T> is positive or 0; otherwise an exception is thrown
Mathematically speaking, "0" is a positive value ;) On the serious side, why throw? Wouldn't it be better to still try to perform the comparison somehow, and throw only if there is no common underlying type that can represent all their values so they can be compared? This of course comes with associated costs, and if, as you say below, you want a flag that disables all runtime costs, then I understand why you need to throw here in the debug version.
- bit manipulation operators would behave as if T were unsigned; an additional method or function signed_shift(integer<T>) would be provided
But then you should make the requirements stricter. You can't assume a particular negative value encoding (such as 2's complement), so you should not allow bitwise operations to affect more than the bits forming the value, i.e. excluding the sign bit. Otherwise you get (as the C++ standard puts it) an "unspecified value".
- arrays of integer<T> must have the same low-level layout as arrays of T
I'm not sure how this can be accomplished. Sure, you can have the integer value as the single non-static data member, but I don't think the standard says that a struct A { int a; /* any members here except non-static data members */ }; has the same binary layout as a struct A { int a; };. I know it makes sense for it to, but I'm not sure about the letter of the standard on this (a compile-time size check along these lines is sketched after this reply).

Conclusion: it's a nice library. It's something I also suggested to some colleagues of mine when they ran into some of those signed/unsigned warning issues (in their high-level code). But notice that some of your checks may be too expensive, more expensive than someone is willing to pay to get those features, so to make it generic you will need some sort of compile-time configuration, for example with policies.

You say you would be using it only in debug mode, so you think of it just as a form of assert(), but using operator-overloading syntactic sugar to keep ordinary integer arithmetic syntax without assert()s polluting the code everywhere. But you would use the native types in release mode. So then, the C++ integer types are not so flawed after all, since you would use them in release mode? Your only complaint is that they are flawed for not having expensive runtime checks when used in debug mode? :)
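One cheap way to at least pin down the size part of that layout assumption at compile time (this only checks sizeof, which is what matters for array layout in practice; it proves nothing about the letter of the standard, and the wrapper shown is illustrative):

template<typename T>
class wrapped {
    T value_;                     // the only non-static data member
public:
    wrapped(T v = T()) : value_(v) {}
    operator T() const { return value_; }
};

// Classic pre-C++11 compile-time assertion: the array size becomes -1,
// and compilation fails, if the sizes differ on this implementation.
typedef char wrapped_int_has_int_size[sizeof(wrapped<int>) == sizeof(int) ? 1 : -1];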
Example:
integer<int> a;
integer<short> b;
short result = a + b + 7;
This code would convert b to integer<int>, check for overflow before (or after[*]) adding it to a, compute a+b, check for overflow before adding 7, add 7 to the result, check that the result fits into short, done. If any check fails, an exception is thrown.
[*] Overflow checks could be implemented in platform-specific assembler; e.g. x86 sets the overflow flag _after_ addition.
This is the semantics that I'd like to have in debug versions; in the "release" version of the program, everything would behave as ordinary arithmetic on primitive types. I'm studying numeric_cast<>, but it covers only conversions, not the other items listed above.
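One obvious way to get that debug/release split, sketched with the names from this thread (integer<T> is the proposed wrapper, not an existing library; NDEBUG is the same macro that controls assert()):

#ifdef NDEBUG
typedef int           checked_int;   // release build: plain int, no checks, no overhead
#else
typedef integer<int>  checked_int;   // debug build: the checked wrapper with the semantics above
#endif

// checked_int a, b;
// short result = a + b + 7;   // checked in debug, ordinary int arithmetic in release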
There is a lot of potentially dangerous code that can be written in C or C++, and some compilers tell you about it, but that does not make it wrong code. Warnings
The problem is that every 'a+b', with a and b signed, is dangerous (potential UB), unless you either 1) formally *prove* that the expression won't overflow or 2) insert extra run-time checks (which clutter the code). :/
Sure, just as "*p" is potential UB unless you:
- formally prove p is a valid pointer
- insert extra run-time checks (with pointers this is actually worse, since other than the null pointer value you have no way to determine whether a pointer is invalid)
As I said, it's just normal C++: many places of potential danger, but that may be fine.
Something as common as simple addition should at least have an _option_ of *always*, under *all* circumstances having defined behavior.
Portably and without runtime cost, no, it has no such option. That's the reason undefined behavior was introduced in C (and in C++): there are many cases where, for technical and portability reasons, providing defined behavior would be either impossible or too expensive, so the standard says it's undefined behavior.
The integer<> class proposed above is just one of possibilities; but that should have been included in the standard :/
You mean in the standard library? Well, I often work with open source projects, and in the open source community we have a saying when someone complains about something not already being available in software: "if it's not available, it means it's not needed". To a certain degree this applies to the current problem too. Search the Internet: how many std-like hash implementations do you find vs. how many such range-checked integer libraries? That is a good estimate of the need for such a library, IMO. Yes, it is useful, but it is not very, very useful. Plus, the fact that it can be implemented in C++ proves that the C++ integer types are not flawed and allow you to do what you need.

-- 
Dizzy
"Linux is obsolete" -- AST