
On Wed, Oct 19, 2005 at 03:03:07PM +1300, Simon Buchan wrote:
> Oliver Kullmann wrote:
>> Hello,
>>
>> Consider the following basic task: Given an integer variable "n"
>>
>>   int n;
>>
>> and an input stream "in"
>>
>>   std::istream in;
>>
>> read an integer from in and put it into n. The "solution"
>>
>>   in >> n;
>>
>> is not a solution, since according to 27.6.1.2.2 from the C++ standard
>> the locale's num_get<> object is invoked, which according to 22.2.2.1.2
>> from the C++ standard invokes scanf from the C library,
>
> Actually, it just says it should behave /like/ scanf: it is defined in
> terms of it, not (necessarily) implemented with it.
That's correct, but from the semantic point of view it doesn't matter
(we are not interested here in implementation details).
>> which then according to 7.19.6.2 from the C standard, clause 10,
>> yields undefined behaviour if the number represented in the stream
>> is not representable by the type int.
>
> The standard specifies that when input fails (like a formatted input
> operation not finding the right format), the stream's failbit is set,
Indeed it says that, but this covers only a very restricted kind of
error (see below), and it doesn't seem to include the case where the
number read from the input stream is too big to be represented (this
cannot happen for unsigned integers, but it can happen for signed
integers).
> so just:
>
>   in >> n;
>   if (in.fail()) {...}
>
> does what you want.
I don't think this is guaranteed to help in case the number is too big
(though it definitely SHOULD help); below I will argue that scanf as
defined in the C99 standard shows undefined behaviour in this case,
while the C++03 standard is broken here (is "undefined itself"), so
everything seems to be up to the compiler.
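To make this concrete, here is a minimal complete sketch of the pattern
in question (my illustration; whether the failbit is guaranteed to be
set in the too-big case is exactly what is in doubt):

  #include <iostream>

  int main() {
    int n;
    std::cin >> n;  // possibly undefined behaviour if the input is
                    // out of range for int (argued below)
    if (std::cin.fail()) {
      // reached for syntactically bad input; whether we are guaranteed
      // to get here for, say, "2147483648" on a 32-bit int is the
      // open question
    }
  }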
> (The more general "if(!in)" works as well, also including a corrupted
> stream and EOF)

>> Thus, since "in" represents here user input, and user input shall
>> never yield undefined behaviour, we cannot use "in >> n;".
>
> I assume you mean "user input shall never yield *defined* behaviour" ;-)
Here I was more referring to a "generalised user", like a template, and
those guys are nice.

In what follows I will first report on my experimentation with g++
(this is positive), then I try to interpret what the C++ standard says
(I believe this must fail, as the standard is broken here), and then
finally what the C99 standard says (to me this says clearly "undefined
behaviour"). By the way, I don't have access to the C89 standard (it
seems ridiculously expensive?), but I would hope that C99 is an
improvement over C89 (?).

---------------------------------------------------------------------------

First the (simple) test program:

// Oliver Kullmann, 19.10.2005 (Swansea)

#include <cassert>
#include <sstream>

// reading n_string into an Int and writing it back must reproduce n_string
template <typename Int>
void test_correct(const char* const n_string) {
  std::istringstream in(n_string);
  Int n = 0;
  in >> n;
  assert(in);
  std::ostringstream out;
  out << n;
  assert(out);
  assert(out.str() == n_string);
}

// reading a too-big number must fail and leave n untouched
template <typename Int>
void test_error(const char* const too_big) {
  std::istringstream in(too_big);
  Int n = 0;
  in >> n; // UNDEFINED BEHAVIOUR ?!
  assert(not in);
  assert(n == 0);
}

void test_cases_32() {
  test_error<short>("32768");
  test_correct<short>("32767");
  test_error<short>("-32769");
  test_correct<short>("-32768");
  test_error<int>("2147483648");
  test_correct<int>("2147483647");
  test_error<int>("-2147483649");
  test_correct<int>("-2147483648");
  test_error<long>("2147483648");
  test_correct<long>("2147483647");
  test_error<long>("-2147483649");
  test_correct<long>("-2147483648");
  // test_error<long long>("9223372036854775808");
  // test_correct<long long>("9223372036854775807");
  // test_error<long long>("-9223372036854775809");
  // test_correct<long long>("-9223372036854775808");
}

void test_cases_64() {
  test_error<short>("32768");
  test_correct<short>("32767");
  test_error<short>("-32769");
  test_correct<short>("-32768");
  test_error<int>("2147483648");
  test_correct<int>("2147483647");
  test_error<int>("-2147483649");
  test_correct<int>("-2147483648");
  test_error<long>("9223372036854775808");
  test_correct<long>("9223372036854775807");
  test_error<long>("-9223372036854775809");
  test_correct<long>("-9223372036854775808");
  // test_error<long long>("9223372036854775808");
  // test_correct<long long>("9223372036854775807");
  // test_error<long long>("-9223372036854775809");
  // test_correct<long long>("-9223372036854775808");
}

int main() {
#ifndef __WORDSIZE
# error "Macro __WORDSIZE not defined"
#endif
#if __WORDSIZE == 32
  test_cases_32();
#elif __WORDSIZE == 64
  test_cases_64();
#else
# error "Unknown wordsize"
#endif
}

Likely this won't work on all platforms, but on a standard Linux/Unix
platform I believe __WORDSIZE is defined, and the numerical values are
standard. The above program ran successfully (i.e., without asserting)
with g++ versions 3.4.3, 3.4.4, 4.0.0, 4.0.1 and 4.0.2, on 32- and on
64-bit platforms. I don't know about other compilers on other
platforms, but I hope that the results would be the same, which in my
interpretation would mean that the compilers are handling the undefined
behaviour in a positive way (turning it into defined behaviour).
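(An aside: the dependence on __WORDSIZE could be avoided by computing
the boundary strings from std::numeric_limits. The following is only a
sketch of that idea, not part of the test above; it assumes that the
corresponding unsigned type UInt can represent max()+1 of the signed
type Int, which holds for the usual representations:)

  #include <limits>
  #include <sstream>

  // Produce the decimal strings for max() and max()+1 of the signed
  // type Int, computing max()+1 in the unsigned type UInt to avoid
  // signed overflow, and feed them to the tests above.
  template <typename Int, typename UInt>
  void test_upper_boundary() {
    std::ostringstream max_stream;
    max_stream << std::numeric_limits<Int>::max();
    test_correct<Int>(max_stream.str().c_str());
    std::ostringstream over_stream;
    over_stream << (static_cast<UInt>(std::numeric_limits<Int>::max()) + 1u);
    test_error<Int>(over_stream.str().c_str());
  }

  // for example:
  //   test_upper_boundary<int, unsigned int>();
  //   test_upper_boundary<long, unsigned long>();

(The lower boundary could be handled analogously, printing the
magnitude in the unsigned type and prepending a minus sign.)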
-------------------------------------------------------------------------------

What does the C++ standard (version from 2003) say about it? Section
22.2.2.1 "Class template num_get" seems to be the relevant place.
Reading of a number happens in three stages:

1. Determination of a "conversion specifier" (likely referring to the
   conversion specifiers for fscanf from the C standard).
2. Reading of characters (using facets to handle the decimal point and
   grouping).
3. Interpretation and storage of the results.

About stage 3 the standard says: the result of stage 2 processing can
be one of

- A sequence of chars has been accumulated in stage 2 that is converted
  (according to the rules of scanf) to a value of the type of val. This
  value is stored in val and ios_base::goodbit is stored in err.
- The sequence of chars accumulated in stage 2 would have caused scanf
  to report an input failure. ios_base::failbit is assigned to err.

That's it about the conversion. It speaks about the results of stage 2,
but in the first sub-point it actually seems to introduce something
new, namely the conversion of the sequence into a value. (Interesting
that the whole point of num_get, namely getting a number, is mentioned
only as a kind of side remark, referring to some "rules of scanf",
while actually there are none.)

It is unclear whether the above paragraph is meant to be normative,
setting the standard, or descriptive, asserting some properties. That
is, should the implementation enforce that we have only either success
or input failure? Or is the above paragraph merely a description of the
outcome of stage 2 and its interpretation?

And what does "input failure" mean? It could be meant in a common-sense
way, which would be quite unfortunate, since "input failure" has a
precise (and different) meaning in the C standard. The above case
distinction mentions only "success" and "input failure", while the C99
standard distinguishes "input failure", "matching failure" and also
"undefined behaviour". Input failure in the C standard is very
restricted (see below), basically referring only to encoding errors; a
category for values which are too big doesn't seem to exist there. So
it is unclear whether the C++ standard wants to delegate as much as
possible to the C standard here, or whether it wants to put an
additional layer of interpretation on top of it (besides the formatting
issues mentioned above).

-------------------------------------------------------------------------------

Now what about fscanf (and its specialised version, scanf)? In Section
7.19.6.2 "The fscanf function" of the C99 standard we find:

  "Failures are described as input failures (due to the occurrence of
  an encoding error or the unavailability of input characters), or
  matching failures (due to inappropriate input)."

In Section 7.19.3 "Files", Point 14 we find:

  "An encoding error occurs if the character sequence presented to the
  underlying mbrtowc function does not form a valid (generalized)
  multibyte character, or if the code value passed to the underlying
  wcrtomb does not correspond to a valid (generalized) multibyte
  character. ..."

So we (somehow) get what input failures are (and numbers too big don't
belong to them), while "matching failures" are not explained; from the
usage it seems to me that they refer only to syntactic appropriateness.

Finally, in Point 10 of Section 7.19.6.2 we have:

  "Except in the case of a % specifier, the input item (or, in the case
  of a %n directive, the count of input characters) is converted to a
  type appropriate to the conversion specifier. If the input item is
  not a matching sequence, the execution of the directive fails: this
  condition is a matching failure. Unless assignment suppression was
  indicated by a *, the result of the conversion is placed in the
  object pointed to by the first argument following the format argument
  that has not already received a conversion result. If this object
  does not have an appropriate type, or if the result of the conversion
  cannot be represented in the object, the behaviour is undefined."

So this describes the process of writing raw bytes to the place of the
appropriate argument, and if that place is not right for it, then we
get undefined behaviour. So it seems clear to me: if the character
sequence represents a signed integer too big for int, and the type of
the argument is int, then we get undefined behaviour. So good, so bad.
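As an aside, the C library itself offers a conversion whose overflow
behaviour is fully defined, namely strtol: on overflow it returns
LONG_MAX or LONG_MIN and sets errno to ERANGE. A sketch of a safe
reader built on it (my code, of course, not something operator>> gives
us):

  #include <cerrno>
  #include <cstdlib>

  // Convert a decimal string to long, reporting failure instead of
  // running into undefined behaviour.
  bool safe_read_long(const char* const s, long& result) {
    char* end = 0;
    errno = 0;
    const long value = std::strtol(s, &end, 10);
    if (end == s) return false;        // no digits: a "matching failure"
    if (errno == ERANGE) return false; // too big or too small for long
    result = value;
    return true;
  }

Reading from a stream into a std::string first and then converting with
such an explicit range check is presumably what a library-level helper
would have to do.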
---------------------------------------------------------------------------------------

And finally: what has all this to do with Boost? If the standard is
weak, then Boost should help. Now in this case it seems to me that the
standard is weak, but the compilers are strong, so perhaps no action is
really needed here. But at least for the library I'm developing I will
use the above code as a platform test (in the form of a regression
test). And the whole issue is of basic importance: there should be some
programs out there using std::cin >> n for an int n (for example), and
just leaving it to the mercy of the compilers whether these programs
are bound to run into undefined behaviour or not seems not a good idea
to me.

Oliver