
On Wed, Oct 19, 2005 at 03:03:07PM +1300, Simon Buchan wrote:
> Oliver Kullmann wrote:
>> Hello,
>>
>> Consider the following basic task: Given an integer variable "n"
>>
>>   int n;
>>
>> and an input stream "in"
>>
>>   std::istream in;
>>
>> read an integer from in and put it into n. The "solution"
>>
>>   in >> n;
>>
>> is not a solution, since according to 27.6.1.2.2 from the C++ standard
>> the locale's num_get<> object is invoked, which according to 22.2.2.1.2
>> from the C++ standard invokes scanf from the C library,
>
> Actually, it just says it should behave /like/ scanf: it is defined in
> terms of it, not (necessarily) implemented with it.
That's correct, but from the semantic point of view it doesn't matter
(we are not interested here in implementation details).
>> which then according to 7.19.6.2 from the C standard, clause 10,
>> yields undefined behaviour if the number represented in the stream
>> is not representable by the type int.
>
> The standard specifies that when input fails (like a formatted input
> operation not finding the right format), the stream's failbit is set,
Indeed it says that, but this covers only a very restricted kind of
error (see below), and it doesn't seem to include the case where the
number read from the input stream is too big to be represented (this
cannot happen for unsigned integers, but it can happen for signed
integers).
> so just:
>
>   in >> n;
>   if (in.fail()) {...}
>
> does what you want.
I don't think this is guaranteed to help in case the number is too big
(though it definitely SHOULD help); below I will argue that scanf as
defined in the C99 standard shows undefined behaviour in this case,
while the C++03 standard is broken here (is "undefined itself"), so
everything seems to be up to the compiler.
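To make this concrete, here is a minimal complete sketch of the pattern
in question (my illustration; whether the failbit is guaranteed to be
set in the too-big case is exactly what is in doubt):

  #include <iostream>

  int main() {
    int n;
    std::cin >> n;  // possibly undefined behaviour if the input is
                    // out of range for int (argued below)
    if (std::cin.fail()) {
      // reached for syntactically bad input; whether we are guaranteed
      // to get here for, say, "2147483648" on a 32-bit int is the
      // open question
    }
  }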
> (The more general "if(!in)" works as well, also including a corrupted
> stream and EOF)

>> Thus, since "in" represents here user input, and user input shall
>> never yield undefined behaviour, we cannot use "in >> n;".
>
> I assume you mean "user input shall never yield *defined* behaviour" ;-)
Here I was more referring to a "generalised user", like a template, and
those guys are nice.

In what follows I will first report on my experimentation with g++
(this is positive), then I try to interpret what the C++ standard says
(I believe this must fail, as the standard is broken here), and then
finally what the C99 standard says (to me this says clearly "undefined
behaviour"). By the way, I don't have access to the C89 standard (it
seems ridiculously expensive?), but I would hope that C99 is an
improvement over C89 (?).

---------------------------------------------------------------------------

First the (simple) test program:

// Oliver Kullmann, 19.10.2005 (Swansea)

#include <cassert>
#include <sstream>

// reading n_string into an Int and writing it back must reproduce n_string
template <typename Int>
void test_correct(const char* const n_string) {
  std::istringstream in(n_string);
  Int n = 0;
  in >> n;
  assert(in);
  std::ostringstream out;
  out << n;
  assert(out);
  assert(out.str() == n_string);
}

// reading a too-big number must fail and leave n untouched
template <typename Int>
void test_error(const char* const too_big) {
  std::istringstream in(too_big);
  Int n = 0;
  in >> n; // UNDEFINED BEHAVIOUR ?!
  assert(not in);
  assert(n == 0);
}

void test_cases_32() {
  test_error<short>("32768");
  test_correct<short>("32767");
  test_error<short>("-32769");
  test_correct<short>("-32768");
  test_error<int>("2147483648");
  test_correct<int>("2147483647");
  test_error<int>("-2147483649");
  test_correct<int>("-2147483648");
  test_error<long>("2147483648");
  test_correct<long>("2147483647");
  test_error<long>("-2147483649");
  test_correct<long>("-2147483648");
  // test_error<long long>("9223372036854775808");
  // test_correct<long long>("9223372036854775807");
  // test_error<long long>("-9223372036854775809");
  // test_correct<long long>("-9223372036854775808");
}

void test_cases_64() {
  test_error<short>("32768");
  test_correct<short>("32767");
  test_error<short>("-32769");
  test_correct<short>("-32768");
  test_error<int>("2147483648");
  test_correct<int>("2147483647");
  test_error<int>("-2147483649");
  test_correct<int>("-2147483648");
  test_error<long>("9223372036854775808");
  test_correct<long>("9223372036854775807");
  test_error<long>("-9223372036854775809");
  test_correct<long>("-9223372036854775808");
  // test_error<long long>("9223372036854775808");
  // test_correct<long long>("9223372036854775807");
  // test_error<long long>("-9223372036854775809");
  // test_correct<long long>("-9223372036854775808");
}

int main() {
#ifndef __WORDSIZE
# error "Macro __WORDSIZE not defined"
#endif
#if __WORDSIZE == 32
  test_cases_32();
#elif __WORDSIZE == 64
  test_cases_64();
#else
# error "Unknown wordsize"
#endif
}

Likely this won't work on all platforms, but on a standard Linux/Unix
platform I believe __WORDSIZE is defined, and the numerical values are
standard. The above program ran successfully (i.e., without asserting)
with g++ versions 3.4.3, 3.4.4, 4.0.0, 4.0.1 and 4.0.2, on 32- and on
64-bit platforms. I don't know about other compilers on other
platforms, but I hope that the results would be the same, which in my
interpretation would mean that the compilers are handling the undefined
behaviour in a positive way (turning it into defined behaviour).
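(An aside: the dependence on __WORDSIZE could be avoided by computing
the boundary strings from std::numeric_limits. The following is only a
sketch of that idea, not part of the test above; it assumes that the
corresponding unsigned type UInt can represent max()+1 of the signed
type Int, which holds for the usual representations:)

  #include <limits>
  #include <sstream>

  // Produce the decimal strings for max() and max()+1 of the signed
  // type Int, computing max()+1 in the unsigned type UInt to avoid
  // signed overflow, and feed them to the tests above.
  template <typename Int, typename UInt>
  void test_upper_boundary() {
    std::ostringstream max_stream;
    max_stream << std::numeric_limits<Int>::max();
    test_correct<Int>(max_stream.str().c_str());
    std::ostringstream over_stream;
    over_stream << (static_cast<UInt>(std::numeric_limits<Int>::max()) + 1u);
    test_error<Int>(over_stream.str().c_str());
  }

  // for example:
  //   test_upper_boundary<int, unsigned int>();
  //   test_upper_boundary<long, unsigned long>();

(The lower boundary could be handled analogously, printing the
magnitude in the unsigned type and prepending a minus sign.)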
-------------------------------------------------------------------------------

What does the C++ standard (version from 2003) say about it? Section
22.2.2.1 "Class template num_get" seems to be the relevant place.
Reading of a number happens in three stages:

1. Determination of a "conversion specifier" (likely referring to the
   conversion specifiers for fscanf from the C standard).
2. Reading of characters (using facets to handle the decimal point and
   grouping).
3. Interpretation and storage of the results.

About stage 3 the standard says: the result of stage 2 processing can
be one of

- A sequence of chars has been accumulated in stage 2 that is converted
  (according to the rules of scanf) to a value of the type of val. This
  value is stored in val and ios_base::goodbit is stored in err.
- The sequence of chars accumulated in stage 2 would have caused scanf
  to report an input failure. ios_base::failbit is assigned to err.

That's it about the conversion. It speaks about the results of stage 2,
but in the first sub-point it actually seems to introduce something
new, namely the conversion of the sequence into a value. (Interesting
that the whole point of num_get, namely getting a number, is mentioned
only as a kind of side remark, referring to some "rules of scanf",
while actually there are none.)

It is unclear whether the above paragraph is meant to be normative,
setting the standard, or descriptive, asserting some properties. That
is, should the implementation enforce that we have only either success
or input failure? Or is the above paragraph merely a description of the
outcome of stage 2 and its interpretation?

And what does "input failure" mean? It could be meant in a common-sense
way, which would be quite unfortunate, since "input failure" has a
precise (and different) meaning in the C standard. The above case
distinction mentions only "success" and "input failure", while the C99
standard distinguishes "input failure", "matching failure" and also
"undefined behaviour". Input failure in the C standard is very
restricted (see below), basically referring only to encoding errors; a
category for values which are too big doesn't seem to exist there. So
it is unclear whether the C++ standard wants to delegate as much as
possible to the C standard here, or whether it wants to put an
additional layer of interpretation on top of it (besides the formatting
issues mentioned above).

-------------------------------------------------------------------------------

Now what about fscanf (and its specialised version, scanf)? In Section
7.19.6.2 "The fscanf function" of the C99 standard we find:

  "Failures are described as input failures (due to the occurrence of
  an encoding error or the unavailability of input characters), or
  matching failures (due to inappropriate input)."

In Section 7.19.3 "Files", Point 14 we find:

  "An encoding error occurs if the character sequence presented to the
  underlying mbrtowc function does not form a valid (generalized)
  multibyte character, or if the code value passed to the underlying
  wcrtomb does not correspond to a valid (generalized) multibyte
  character. ..."

So we (somehow) get what input failures are (and numbers too big don't
belong to them), while "matching failures" are not explained; from the
usage it seems to me that they refer only to syntactic appropriateness.

Finally, in Point 10 of Section 7.19.6.2 we have:

  "Except in the case of a % specifier, the input item (or, in the case
  of a %n directive, the count of input characters) is converted to a
  type appropriate to the conversion specifier. If the input item is
  not a matching sequence, the execution of the directive fails: this
  condition is a matching failure. Unless assignment suppression was
  indicated by a *, the result of the conversion is placed in the
  object pointed to by the first argument following the format argument
  that has not already received a conversion result. If this object
  does not have an appropriate type, or if the result of the conversion
  cannot be represented in the object, the behaviour is undefined."

So this describes the process of writing raw bytes to the place of the
appropriate argument, and if that place is not right for it, then we
get undefined behaviour. So it seems clear to me: if the character
sequence represents a signed integer too big for int, and the type of
the argument is int, then we get undefined behaviour. So good, so bad.
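As an aside, the C library itself offers a conversion whose overflow
behaviour is fully defined, namely strtol: on overflow it returns
LONG_MAX or LONG_MIN and sets errno to ERANGE. A sketch of a safe
reader built on it (my code, of course, not something operator>> gives
us):

  #include <cerrno>
  #include <cstdlib>

  // Convert a decimal string to long, reporting failure instead of
  // running into undefined behaviour.
  bool safe_read_long(const char* const s, long& result) {
    char* end = 0;
    errno = 0;
    const long value = std::strtol(s, &end, 10);
    if (end == s) return false;        // no digits: a "matching failure"
    if (errno == ERANGE) return false; // too big or too small for long
    result = value;
    return true;
  }

Reading from a stream into a std::string first and then converting with
such an explicit range check is presumably what a library-level helper
would have to do.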
---------------------------------------------------------------------------------------

And finally: what has all this to do with Boost? If the standard is
weak, then Boost should help. Now in this case it seems to me that the
standard is weak, but the compilers are strong, so perhaps no action is
really needed here. But at least for the library I'm developing I will
use the above code as a platform test (in the form of a regression
test). And the whole issue is of basic importance: there should be some
programs out there using std::cin >> n for an int n (for example), and
just leaving it to the mercy of the compilers whether these programs
are bound to run into undefined behaviour or not seems not a good idea
to me.

Oliver