Undefined behaviour when using iostream read and signed char
My question is similar to this one but a bit more specific. I am writing a function to read a 32-bit unsigned integer from an istream, stored in little-endian byte order. In C, something like this would work:
#include <stdio.h>
#include <inttypes.h>
uint_least32_t foo(FILE* file)
{
    unsigned char buffer[4];
    fread(buffer, sizeof(buffer), 1, file);
    uint_least32_t ret = buffer[0];
    ret |= (uint_least32_t) buffer[1] << 8;
    ret |= (uint_least32_t) buffer[2] << 16;
    ret |= (uint_least32_t) buffer[3] << 24;
    return ret;
}
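For illustration, foo can be exercised end to end with a temporary file. This is a sketch, not part of the original question: it repeats foo so the snippet stands alone and, as an addition the original omits, checks the fread return value.

```cpp
#include <cstdio>
#include <cstdint>

uint_least32_t foo(std::FILE* file)
{
    unsigned char buffer[4];
    // Unlike the snippet above, check for a short read (sketch only;
    // real code would report the error properly).
    if (std::fread(buffer, sizeof(buffer), 1, file) != 1)
        return 0;
    uint_least32_t ret = buffer[0];
    ret |= (uint_least32_t) buffer[1] << 8;
    ret |= (uint_least32_t) buffer[2] << 16;
    ret |= (uint_least32_t) buffer[3] << 24;
    return ret;
}
```

Writing the four bytes 78 56 34 12 to a file and calling foo on it should yield 0x12345678, regardless of the host machine's endianness.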
But if I try to do something similar using an istream, I run into what I think is undefined behaviour:
uint_least32_t bar(std::istream& file)
{
    char buffer[4];
    file.read(buffer, sizeof(buffer));
    // The casts to unsigned char are to prevent sign extension on systems
    // where char is signed.
    uint_least32_t ret = (unsigned char) buffer[0];
    ret |= (uint_least32_t) (unsigned char) buffer[1] << 8;
    ret |= (uint_least32_t) (unsigned char) buffer[2] << 16;
    ret |= (uint_least32_t) (unsigned char) buffer[3] << 24;
    return ret;
}
This is undefined behaviour on systems where char is signed, the representation is not two's complement, and char cannot represent -128, so it can represent only 255 distinct values rather than 256. foo works even if char is signed, because section 7.21.8.1 of the C11 standard (draft N1570) says that fread accesses the bytes as unsigned char, not char, and unsigned char must be able to represent every value in the range 0 to 255 inclusive.
Does bar really cause undefined behaviour when it tries to read the byte 0x80, and if so, is there a workaround that still uses a std::istream?
Edit: The undefined behaviour I am referring to is caused by the istream::read into buffer, not by the cast from buffer to unsigned char. For example, on a sign-and-magnitude machine where char is signed, 0x80 is negative zero, and negative zero and positive zero must always compare equal according to the standard. In that case there are only 255 distinct signed char values, and a char cannot represent every byte. The casts themselves are fine, because converting a negative signed value to an unsigned type always adds UCHAR_MAX + 1 (section 4.7 of the C++11 draft standard N3242).
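That conversion rule is easy to check. In this sketch, as_byte is a hypothetical helper name; the -128 case additionally assumes a two's-complement signed char so the value is representable at all.

```cpp
#include <climits>

// Hypothetical helper: the value produced by a signed char -> unsigned char
// conversion. Per C++11 section 4.7, negative values have UCHAR_MAX + 1
// added; non-negative values are unchanged.
inline unsigned char as_byte(signed char c)
{
    return (unsigned char) c;
}
```

So on a two's-complement machine the conversion simply recovers the byte's bit pattern: -1 becomes UCHAR_MAX and -128 becomes 128.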
I think I have the answer: bar does not cause undefined behaviour.
In the accepted answer to this question, R.. says:
On a non-twos-complement system, signed char will not be suitable for accessing the representation of an object. This is because either there are two possible signed char representations which have the same value (+0 and -0), or one representation that has no value (a trap representation). In either case, this prevents you from doing most meaningful things you might do with the representation of an object. For example, if you have a 16-bit unsigned integer 0x80ff, one or the other byte, as a signed char, is going to either trap or compare equal to 0.
Note that on such an implementation (non-twos-complement), plain char needs to be defined as an unsigned type for accessing the representations of objects via char to work correctly. While there's no explicit requirement, I see this as a requirement derived from other requirements in the standard.
This would seem to be the case because section 3.9 paragraph 2 of C++11 (draft N3242) says:
For any object (other than a base-class subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes (1.7) making up the object can be copied into an array of char or unsigned char. If the content of the array of char or unsigned char is copied back into the object, the object shall subsequently hold its original value.
If char were signed and had multiple object representations for some value (such as 0 in sign-and-magnitude), then an object copied into a char array and back might not hold its original value afterwards, because the char array could end up holding a different object representation. That would contradict the quote above, so char must be unsigned on any machine whose signed char has multiple object representations for the same value (e.g. on a sign-and-magnitude machine both 0x80 and 0x00 represent 0). This means that bar has defined behaviour, because the only case in which it would be undefined requires char to be signed with an odd representation that would violate the guarantee quoted above.
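The 3.9p2 guarantee can be illustrated with a small round trip. round_trip is a hypothetical helper; memcpy is the standard way to perform the byte copy the paragraph describes.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical round_trip(): copies an object's bytes into a char array and
// back again -- exactly the operation 3.9p2 promises preserves the value.
inline uint32_t round_trip(uint32_t value)
{
    char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);        // object -> char array
    uint32_t restored = 0;
    std::memcpy(&restored, bytes, sizeof restored);  // char array -> object
    return restored;
}
```

The value 0x80ff0000 is the interesting case from R..'s answer: one of its bytes is 0x80, the bit pattern that would be problematic for a non-two's-complement signed char.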