Is there a way to reliably read an integer from a stream?

Hello,

Consider the following basic task: given an integer variable "n"

    int n;

and an input stream "in"

    std::istream in;

read an integer from in and put it into n. The "solution"

    in >> n;

is not a solution, since according to 27.6.1.2.2 of the C++ standard the locale's num_get<> object is invoked, which according to 22.2.2.1.2 of the C++ standard invokes scanf from the C library, which then according to 7.19.6.2 of the C standard, clause 10, yields undefined behaviour if the number represented in the stream is not representable by the type int. Thus, since "in" here represents user input, and user input shall never yield undefined behaviour, we cannot use "in >> n;".

It seems that the Boost library offers no tool to remedy the situation:

- Although boost::lexical_cast does not specify which errors are checked and which are not, it seems that lexical_cast just forwards the action to some std::stringstream object, and the same problem as above arises.
- boost::numeric_cast only works with objects of numeric types, and thus does not contribute to the problem of converting a character sequence into an integer.

So it seems that this fundamental problem is addressed neither by the C++ standard library nor by the Boost library, and thus I have to write a parser etc. myself for the arithmetic types in C++? Is this really so??

Oliver

P.S. The easiest way to parse the input from in and check for overflow/underflow seems to me to be the following: get the sequence of digits into a

    std::string digits;

erase leading zeros, and compare this string to the string

    std::string ref = boost::lexical_cast<std::string>(std::numeric_limits<int>::max());

For the overflow check: if digits.size() < ref.size(), then fine; if digits.size() > ref.size(), then bad; otherwise we use lexicographical comparison. A convenient design then perhaps is a wrapper class

    template <typename Int> class safe_wrapper;

such that we can use

    safe_wrapper<int> n;
    in >> n;
    if (not in) throw Error;
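For illustration, here is a minimal sketch of the digit-string comparison described in the P.S., for plain decimal input (no locale grouping). The helper name checked_read_int and its error convention are my own, not an existing API; the point is that no out-of-range conversion is ever attempted.

    #include <cctype>
    #include <istream>
    #include <limits>
    #include <sstream>
    #include <string>

    // Sketch of the digit-string comparison from the P.S.: compare the
    // (zero-stripped) digits against the decimal form of the extreme
    // value of int, so the conversion is only performed when in range.
    bool checked_read_int(std::istream& in, int& n) {
      in >> std::ws;
      bool negative = false;
      if (in.peek() == '-' || in.peek() == '+')
        negative = (in.get() == '-');
      std::string digits;
      while (std::isdigit(in.peek()))
        digits += static_cast<char>(in.get());
      if (digits.empty()) { in.setstate(std::ios_base::failbit); return false; }
      const std::string::size_type pos = digits.find_first_not_of('0');
      digits = (pos == std::string::npos) ? "0" : digits.substr(pos); // strip leading zeros
      std::ostringstream extreme; // decimal digits of the relevant extreme value
      if (negative) extreme << std::numeric_limits<int>::min();
      else          extreme << std::numeric_limits<int>::max();
      std::string ref = extreme.str();
      if (negative) ref.erase(0, 1); // drop the minus sign
      if (digits.size() > ref.size() ||
          (digits.size() == ref.size() && digits > ref)) {
        in.setstate(std::ios_base::failbit); // magnitude too large
        return false;
      }
      n = 0; // accumulate negatively: every prefix stays within [min, 0]
      for (std::string::size_type i = 0; i != digits.size(); ++i)
        n = n * 10 - (digits[i] - '0');
      if (!negative) n = -n; // safe here, since the magnitude is <= max
      return true;
    }

An extraction operator for the proposed safe_wrapper<int> could then simply call checked_read_int and rely on the stream state.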

Oliver Kullmann wrote:
Hello,
Consider the following basic task: Given an integer variable "n"
int n;
and an input stream "in"
std::istream in;
read an integer from in and put it into n. The "solution"
in >> n;
is not a solution, since according to 27.6.1.2.2 from the C++ standard the locale's num_get<> object is invoked, which according to 22.2.2.1.2 from the C++ standard invokes scanf from the C library,

Actually, it just says it should behave /like/ scanf: it is defined in terms of it, not (necessarily) implemented with it.

which then according to 7.19.6.2 from the C standard, clause 10, yields undefined behaviour if the number represented in the stream is not representable by the type int.

The standard specifies that when input fails (like a formatted input operation not finding the right format), the stream's failbit is set, so just

    in >> n;
    if (in.fail()) { ... }

does what you want. (The more general "if (!in)" works as well, also covering a corrupted stream and EOF.)
Thus, since "in" represents here user input, and user input shall never yield undefined behaviour, we cannot use "in >> n;".

I assume you mean "user input shall never yield *defined* behaviour" ;-)
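For reference, a self-contained version of the failbit idiom suggested above; note that whether an out-of-range value is guaranteed to set failbit is exactly what the rest of this thread disputes.

    #include <iostream>

    int main() {
      int n = 0;
      std::cin >> n;
      if (std::cin.fail()) {          // set on a malformed token; whether an
        std::cerr << "bad input\n";   // out-of-range value sets it is the
        return 1;                     // question debated in this thread
      }
      std::cout << "read " << n << '\n';
    }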

On Wed, Oct 19, 2005 at 03:03:07PM +1300, Simon Buchan wrote:
Oliver Kullmann wrote:
Hello,
Consider the following basic task: Given an integer variable "n"
int n;
and an input stream "in"
std::istream in;
read an integer from in and put it into n. The "solution"
in >> n;
is not a solution, since according to 27.6.1.2.2 from the C++ standard the locale's num_get<> object is invoked, which according to 22.2.2.1.2 from the C++ standard invokes scanf from the C library,

Actually, it just says it should behave /like/ scanf: it is defined in terms of it, not (necessarily) implemented with it.
That's correct, but from the semantic point of view it doesn't matter (since we are not interested here in implementation details).
which then according to 7.19.6.2 from the C standard, clause 10, yields undefined behaviour if the number represented in the stream is not representable by the type int.

The standard specifies that when input fails (like a formatted input operation not finding the right format), the stream's failbit is set,
That it does say; but this is only a very restricted kind of error (see below), and it doesn't seem to include the case where the number read from the input stream is too big to be represented (this cannot happen for unsigned integers, but it can happen for signed integers).
so just

    in >> n;
    if (in.fail()) { ... }

does what you want.
I don't think that this is guaranteed to help in case the number is too big (but it definitely SHOULD help); below I will argue that scanf as defined in the C99 standard shows undefined behaviour in this case, while the C++03 standard is broken here (is "undefined" itself), so everything seems to be up to the compiler.
(The more general "if(!in)" works as well, also including a corrupted stream and EOF)
Thus, since "in" represents here user input, and user input shall never yield undefined behaviour, we cannot use "in >> n;".
I assume you mean "user input shall never yield *defined* behaviour" ;-)
here I was more referring to a "generalised user", like a template, and those guys are nice.

In what follows I will first report on my experimentation with g++ (this is positive), then I try to interpret what the C++ standard says (I believe this must fail, as the standard is broken here), and then finally what the C99 standard says (to me this says clearly "undefined behaviour"). By the way, I don't have access to the C89 standard (it seems ridiculously expensive?), but I would hope that C99 is an improvement over C89 (?).

---------------------------------------------------------------------------

First the (simple) test program:

    // Oliver Kullmann, 19.10.2005 (Swansea)

    #include <cassert>
    #include <sstream>

    template <typename Int>
    void test_correct(const char* const n_string) {
      std::istringstream in(n_string);
      Int n = 0;
      in >> n;
      assert(in);
      std::ostringstream out;
      out << n;
      assert(out);
      assert(out.str() == n_string);
    }

    template <typename Int>
    void test_error(const char* const too_big) {
      std::istringstream in(too_big);
      Int n = 0;
      in >> n; // UNDEFINED BEHAVIOUR ?!
      assert(not in);
      assert(n == 0);
    }

    void test_cases_32() {
      test_error<short>("32768");
      test_correct<short>("32767");
      test_error<short>("-32769");
      test_correct<short>("-32768");
      test_error<int>("2147483648");
      test_correct<int>("2147483647");
      test_error<int>("-2147483649");
      test_correct<int>("-2147483648");
      test_error<long>("2147483648");
      test_correct<long>("2147483647");
      test_error<long>("-2147483649");
      test_correct<long>("-2147483648");
      // test_error<long long>("9223372036854775808");
      // test_correct<long long>("9223372036854775807");
      // test_error<long long>("-9223372036854775809");
      // test_correct<long long>("-9223372036854775808");
    }

    void test_cases_64() {
      test_error<short>("32768");
      test_correct<short>("32767");
      test_error<short>("-32769");
      test_correct<short>("-32768");
      test_error<int>("2147483648");
      test_correct<int>("2147483647");
      test_error<int>("-2147483649");
      test_correct<int>("-2147483648");
      test_error<long>("9223372036854775808");
      test_correct<long>("9223372036854775807");
      test_error<long>("-9223372036854775809");
      test_correct<long>("-9223372036854775808");
      // test_error<long long>("9223372036854775808");
      // test_correct<long long>("9223372036854775807");
      // test_error<long long>("-9223372036854775809");
      // test_correct<long long>("-9223372036854775808");
    }

    int main() {
    #ifndef __WORDSIZE
    # error "Macro __WORDSIZE not defined"
    #endif
    #if __WORDSIZE == 32
      test_cases_32();
    #elif __WORDSIZE == 64
      test_cases_64();
    #else
    # error "Unknown wordsize"
    #endif
    }

Likely this won't work on all platforms, but on a standard Linux/Unix platform I believe __WORDSIZE is defined, and the numerical values are standard. The above program ran successfully (i.e., without asserting) with g++ versions 3.4.3, 3.4.4, 4.0.0, 4.0.1 and 4.0.2, on 32-bit and on 64-bit platforms. I don't know about other compilers on other platforms, but I hope that the results would be the same, which in my interpretation would mean that the compilers are using the undefined behaviour in a positive way (turning it into defined behaviour).

-------------------------------------------------------------------------------

What does the C++ standard (version from 2003) say about it? Section 22.2.2.1 "Class template num_get" seems to be the relevant place. Reading of a number happens in three stages:

1. Determination of a "conversion specifier" (likely referring to the conversion specifiers for fscanf from the C standard).
2. Reading of characters (using facets to handle decimal point and grouping).
3. Now the results are interpreted and stored.

The standard says the result of stage 2 processing can be one of:

- A sequence of chars has been accumulated in stage 2 that is converted (according to the rules of scanf) to a value of the type of val. This value is stored in val and ios_base::goodbit is stored in err.
- The sequence of chars accumulated in stage 2 would have caused scanf to report an input failure. ios_base::failbit is assigned to err.

That's it about the conversion. It speaks about the results of stage 2, but actually in the first sub-point it seems to introduce something new, namely the conversion of the sequence into a value (interesting that the whole point of num_get, namely getting a number, is only mentioned as a kind of side-remark, referring to some rules of scanf, while actually there are none).

It is unclear whether the above paragraph is meant to be normative, setting the standard, or descriptive, asserting some properties. That is, should the implementation enforce that we have only either success or input failure? Or is the above paragraph a description of the outcome of stage 2 and its interpretation?

What does "input failure" mean? It could be meant in a "common-sense" meaning, which would be quite unfortunate, since "input failure" is explained in the C standard. On the other hand, the above case distinction mentions only "success" or "input failure", while the C99 standard mentions "input failure", "matching failure" and also "undefined behaviour". Input failure in the C standard is very restricted (see below), basically referring only to encoding errors. So something like values which are too big doesn't seem to exist. It is unclear whether the C++ standard wants to delegate as much as possible to the C standard here, or whether it wants to put an additional layer of interpretation on top of it (besides the formatting issues mentioned above).

-------------------------------------------------------------------------------

Now what about fscanf (and its specialised version, scanf)? In Section 7.19.6.2 "The fscanf function" of the C99 standard we find:

  "Failures are described as input failures (due to the occurrence of an encoding error or the unavailability of input characters), or matching failures (due to inappropriate input)."

In Section 7.19.3 "Files", Point 14 we find:

  "An encoding error occurs if the character sequence presented to the underlying mbrtowc function does not form a valid (generalized) multibyte character, or if the code value passed to the underlying wcrtomb does not correspond to a valid (generalized) multibyte character. ..."

So we (somehow) get what input failures are (and numbers too big don't belong to them), while "matching failures" are not explained; but from the usage it seems to me that they only refer to syntactical appropriateness.

Finally, in Point 10 of Section 7.19.6.2 we have:

  "Except in the case of a % specifier, the input item (or, in the case of a %n directive, the count of input characters) is converted to a type appropriate to the conversion specifier. If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure. Unless assignment suppression was indicated by a *, the result of the conversion is placed in the object pointed to by the first argument following the format argument that has not already received a conversion result. If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behaviour is undefined."

So this describes the process of writing raw bytes to the place of the appropriate argument, and if that place is not right for it, then we get undefined behaviour. So it seems clear to me: if the character sequence represents a signed integer too big for int, and the type of the argument is int, then we get undefined behaviour. So good, so bad.

---------------------------------------------------------------------------------------

And finally: what has this to do with Boost? If the standard is weak, then Boost should help it. Now in this case it seems to me that the standard is weak, but the compilers are strong, so that perhaps no action is really needed here. But at least for the library I'm developing I will use the above code as a platform test (in the form of a regression test).

And the whole issue is of basic importance: there should be some programs out there using std::cin >> n for int's n (for example). And just leaving it to the mercy of the compilers whether these programs are bound to run into undefined behaviour or not seems not a good idea to me.

Oliver
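As a contrast to the fscanf clause quoted above, it is worth noting that C's strtol (C99 7.20.1.4; essentially the same wording in C90) does define the out-of-range case: it clamps the result to LONG_MIN/LONG_MAX and sets errno to ERANGE. A minimal sketch of a fully checked string-to-int conversion built on it; the helper name read_int_strtol is mine, not a standard API:

    #include <cerrno>
    #include <climits>
    #include <cstdlib>
    #include <string>

    // Convert a decimal string to int with defined out-of-range
    // behaviour, by going through strtol and checking ERANGE.
    bool read_int_strtol(const std::string& s, int& n) {
      errno = 0;
      char* end = 0;
      const long value = std::strtol(s.c_str(), &end, 10);
      if (end == s.c_str() || *end != '\0') return false;   // no or partial parse
      if (errno == ERANGE) return false;                    // outside long's range
      if (value < INT_MIN || value > INT_MAX) return false; // outside int's range
      n = static_cast<int>(value);
      return true;
    }

On platforms where long and int have the same width, the ERANGE check already covers the int range and the last comparison is redundant but harmless.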

Oliver Kullmann wrote:
22.2.2.1.2 from the C++ standard invokes scanf from the C library, which then according to 7.19.6.2 from the C standard, clause 10, yields undefined behaviour if the number represented in the stream is not
Which version of the C standard do you have in mind? I believe that C90, which the C++ standard refers to, does not mention UB.

B.

On Thu, Oct 20, 2005 at 08:12:05PM +0100, Bronek Kozicki wrote:
Oliver Kullmann wrote:
22.2.2.1.2 from the C++ standard invokes scanf from the C library, which then according to 7.19.6.2 from the C standard, clause 10, yields undefined behaviour if the number represented in the stream is not
Which version of the C standard do you have in mind?
I have the C99 standard (as a book), while the C90 standard (I thought it would be C89?) seems to cost a fortune.
I believe that C90, which the C++ standard refers to, does not mention UB.
But then it should be the case that the C99 standard only makes more precise what the older standard left out?

Oliver

Oliver Kullmann wrote:
I have the C99 standard (as a book), while the C90 standard (I thought it would be C89?) seems to cost a fortune.
C++ explicitly refers to ISO/IEC 9899:1990. You may call it C89; I will consistently call it C90. The point remains that C++ does not refer to any newer version of the C standard.
I believe that C90, which the C++ standard refers to, does not mention UB.
But then it should be the case that the C99 standard only makes more precise what the older standard left out?
No. The wording in the C90 standard is actually important. It simply leaves this question non-standardized, and the C++ standard does not add anything in this respect. However, if (or rather when) C++ is updated to refer to the newer version of the C standard, it will have to be considered.

B.

On Sat, Oct 22, 2005 at 09:21:08PM +0100, Bronek Kozicki wrote:
Oliver Kullmann wrote:
I have the C99 standard (as a book), while the C90 standard (I thought it would be C89?) seems to cost a fortune.
C++ explicitly refers to ISO/IEC 9899:1990. You may call it C89; I will consistently call it C90. The point remains that C++ does not refer to any newer version of the C standard.
I forgot about that technical corrigendum issue, so you are right, it should be C90. The problem with the C++ reference to the C90 standard is that the C90 standard currently costs $103.10 (http://webstore.ansi.org/ansidocstore/product.asp?sku=AS+3955%2D1991) (actually, I believe the price used to be even higher). There are preliminary versions out there, for example http://monolith.consiste.dimap.ufrn.br/~david/ENSEIGNEMENT/SUPPORT/n843.pdf . I looked up the documentation of the fscanf function there, and it looks as I expected: the C99 standard only makes things more precise.
I believe that C90, which the C++ standard refers to, does not mention UB.
But then it should be the case that the C99 standard only makes more precise what the older standard left out?
No. The wording in C90 standard is actually important.
Sure, for the language lawyers. But my point here is to argue that it is NOT POSSIBLE to read integers in C++ from an input stream without running into undefined behaviour (if we do not have perfect control over the size of numbers), and w.r.t. this the C90 standard is just worse than the C99 standard.
It simply leaves this question non-standardized, and the C++ standard does not add anything in this respect.
The only difference between "non-standardized" and "undefined behaviour" is that in the latter case we are at least conscious of it, while in the former case we have no clue (and closing one's eyes to a problem doesn't usually solve it).
However, if (or rather when) C++ is updated to refer to the newer version of the C standard, it will have to be considered.
I think we should rather consider it now: if we rely only on the standard, and not on the compiler, then

    int n;
    std::cin >> n;

should not be used. Within the Boost library, perhaps the most appropriate place to address this issue would be boost::lexical_cast. Perhaps one could have a checked version; or at least the test suite of the Boost library could contain checks whether the compiler correctly handles reading of integers (and other fundamental types) too big to be represented.

Oliver
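A checked lexical_cast of the kind proposed here might look roughly like the sketch below. The name checked_int_cast is hypothetical, not a Boost API, and it assumes a compiler where streaming long long works (a common extension in 2005, standard since C++11). Note it only pushes the problem one level up: input too big even for the wider type brings back the original question, unlike the digit-string comparison from the first message.

    #include <limits>
    #include <sstream>
    #include <stdexcept>
    #include <string>

    // Hypothetical checked conversion for int targets: read into a
    // wider type, then range-check against int explicitly.
    int checked_int_cast(const std::string& s) {
      std::istringstream in(s);
      long long wide = 0; // assumed wider than int
      in >> wide;
      if (!in || in.peek() != std::char_traits<char>::eof())
        throw std::invalid_argument("not an integer: " + s);
      if (wide < std::numeric_limits<int>::min() ||
          wide > std::numeric_limits<int>::max())
        throw std::out_of_range("out of range of int: " + s);
      return static_cast<int>(wide);
    }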

Oliver Kullmann wrote:
Sure, for the language lawyers. But my point here is to argue that it is NOT POSSIBLE to read integers in C++ from an input stream without running into undefined behaviour (if we do not have perfect control over the size of numbers), and w.r.t. this the C90 standard is just worse than the C99 standard.
It simply leaves this question not-standarized and the C++ standard does not add anything in this respect.
The only difference between "not-standardized" and "undefined behaviour" is, that in the latter case we are at least conscious about it, while in the former case we have no clue (and closing the eyes before a problem doesn't usually solve the problem).
Not at all. If you were right, then we would have been unable to write multithreaded programs, programs that use dynamic libraries etc., because these issues are left out of the standard (C++ and C). The fact is that when something is left out of the standard, it is actually left to the implementor. The very issue you are raising here has actually been discussed by the C++ standard committee, and the conclusion was that currently there is no risk of UB. There are worse problems to solve, e.g.:

    int main() {
      int a, b;
      std::cin >> a >> b;
      int c = a * b; // potential UB here
    }

B.
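For completeness, one conventional guard against the signed-multiplication overflow in this last example is to test the operand ranges before multiplying (the classic precondition test; the name safe_multiply is illustrative, and the code assumes integer division truncating toward zero, which C99 mandates and common C++03 implementations follow):

    #include <iostream>
    #include <limits>

    // Evaluate a * b only when the product provably fits into int.
    bool safe_multiply(int a, int b, int& result) {
      const int max = std::numeric_limits<int>::max();
      const int min = std::numeric_limits<int>::min();
      if (a > 0) {
        if (b > 0) { if (a > max / b) return false; }
        else       { if (b < min / a) return false; }
      } else if (a < 0) {
        if (b > 0) { if (a < min / b) return false; }
        else       { if (b < max / a) return false; }
      }
      result = a * b;
      return true;
    }

    int main() {
      int a, b;
      if (std::cin >> a >> b) {
        int c = 0;
        if (safe_multiply(a, b, c))
          std::cout << a << " * " << b << " = " << c << '\n';
        else
          std::cout << "product would overflow int\n";
      }
    }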
participants (3)

- Bronek Kozicki
- Oliver Kullmann
- Simon Buchan