Parsing a binary file. What is a modern way?
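I have a binary file whose layout I know. For example, let the format be like this:
2 bytes (unsigned short): length of a string
5 bytes (5 chars): the string, some id
4 bytes (unsigned int): a stride
24 bytes (6 floats, i.e. 2 strides of 3 floats each): float data
The file should look like this (I added spaces for readability):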
5 hello 3 0.0 0.1 0.2 -0.3 -0.4 -0.5
Here 5 is stored as 2 bytes: 0x05 0x00. "hello" is 5 bytes, and so on.
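Now I want to read this file. Currently I do it like so:
1. load the file into an ifstream
2. read this stream into char buffer[2]
3. cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have the length of the string.
4. read the string bytes into a vector<char> and create a std::string from this vector. Now I have the string id.
5. read the next 4 bytes and cast them to unsigned int. Now I have the stride.
6. while not at end of file, read the floats the same way: create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.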
for every float. This works, but for me it looks ugly. Can I read directly to unsigned short
or float
or string
etc. without char [x]
creating? If no, what is the way to cast correctly (I read that style I'm using - is an old style)?
PS: while I was writing the question, a clearer way to put it came to mind: how do I cast an arbitrary number of bytes from an arbitrary position in a char[x]?
Update: I forgot to mention explicitly that the lengths of the string and float data are variable and not known at compile time.
The C way, which would work fine in C++, would be to declare a struct:
#pragma pack(1)
struct contents {
// data members;
};
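Note that #pragma pack(1) is there so the compiler lays the members out without padding, exactly as they appear in the file, and that this only works for plain (trivially copyable) data members.
Then cast the read buffer directly to the struct type: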
std::vector<char> buf(sizeof(contents));
file.read(buf.data(), buf.size());
contents *stuff = reinterpret_cast<contents *>(buf.data());
Now, if your data's size is variable, you can split it into several chunks. To read a single binary object from the buffer, a reader function comes in handy:
template<typename T>
const char *read_object(const char *buffer, T& target) {
target = *reinterpret_cast<const T*>(buffer);
return buffer + sizeof(T);
}
The main advantage is that such a reader can be overloaded for more advanced C++ objects:
template<typename CT>
const char *read_object(const char *buffer, std::vector<CT>& target) {
size_t size = target.size();
CT const *buf_start = reinterpret_cast<const CT*>(buffer);
std::copy(buf_start, buf_start + size, target.begin());
return buffer + size * sizeof(CT);
}
And now in your main parser:
// iter is a const char* pointing at the current position in the raw buffer
int n_floats;
iter = read_object(iter, n_floats);
std::vector<float> my_floats(n_floats);
iter = read_object(iter, my_floats);
Note: As Tony D observed, even if you can get the alignment right via #pragma directives and manual padding (if needed), you may still encounter incompatibility with your processor's alignment requirements, in the form of (best case) performance issues or (worst case) trap signals. This method is probably interesting only if you have control over the file's format.
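One way to sidestep the alignment trap described in the note is to copy the bytes with std::memcpy instead of dereferencing a cast pointer. The following is only a sketch: read_object_memcpy is a hypothetical name for a variant of the read_object function above, and it still assumes the file uses the host's byte order.

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical variant of read_object: copies sizeof(T) bytes out of the
// buffer instead of dereferencing a cast pointer, so the source address
// does not need to be aligned for T.
template <typename T>
const char* read_object_memcpy(const char* buffer, T& target) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only trivially copyable types can be copied bytewise");
    std::memcpy(&target, buffer, sizeof(T));
    return buffer + sizeof(T);  // position just past the object read
}
```

It is used the same way as read_object: pass the current position and the target, and keep the returned pointer as the new position. Compilers typically turn a small fixed-size memcpy into a plain load, so this usually costs nothing extra.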
If this is not for learning purposes, and if you are free to choose the binary format, you should consider using something like protobuf, which will handle the serialization for you and let you interoperate with other platforms and languages.
If you cannot use a third-party API, you may look at QDataStream for inspiration.
> Currently I do it so: load file to ifstream; read this stream to char buffer[2]; cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have length of a string.
That last step risks a SIGBUS, performance issues and/or endianness issues. I'd suggest reading the two bytes into unsigned chars x[0] and x[1]; then you can say (x[0] << 8) | x[1] for big-endian data, or vice versa for little-endian data, or use htons if you need to correct for endianness.
> read the string bytes into a vector<char> and create a std::string from this vector. Now I have string id.
No need... just read directly into the string:
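std::string s(the_size, ' ');
// read straight into the string's storage and check that all bytes arrived
if (input_stream.read(&s[0], s.size()) &&
    input_stream.gcount() == static_cast<std::streamsize>(s.size()))
    ...use s...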
> read next 4 bytes and cast them to unsigned int. Now I have a stride. while not end of file read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.
Better to read the data directly over the unsigned ints and floats, as that way the compiler will ensure correct alignment.
> This works, but for me it looks ugly. Can I read directly to unsigned short or float or string etc. without char [x] creating? If no, what is the way to cast correctly (I read that style I'm using - is an old style)?
struct Data
{
uint32_t x;
float y[6];
};
Data data;
if (input_stream.read((char*)&data, sizeof data) &&
input_stream.gcount() == sizeof data)
...use x and y...
Note that the code above avoids reading data into potentially unaligned character arrays: it is unsafe to reinterpret_cast data in a potentially unaligned char array (including inside a std::string) because of alignment issues. Again, you may need some post-read conversion with htonl if there's a chance the file content differs in endianness. If there's an unknown number of floats, you'll need to calculate and allocate sufficient storage with alignment of at least 4 bytes, then aim a Data* at it. In practice you can then index past the declared array size of y, as long as the memory at the accessed addresses was part of the allocation and holds a valid float representation read in from the stream (this is a widely used idiom, although the C++ standard does not strictly guarantee it). Simpler, but with an additional read and so possibly slower: read the uint32_t first, then new float[n] and do a further read into there.
Practically, this type of approach can work and a lot of low level and C code does exactly this. "Cleaner" high-level libraries that might help you read the file must ultimately be doing something similar internally....