Improve execution time to read binary file

2018-06-29 14:24:09

I've written code to process a big binary file (more than 2 GB) reading in chunks of 1024 bytes each. The file contains blocks of data and each block is separated by two bytes in sequence, 5D5B = 0x5D 0x5B.

The code works, but for big files the execution time is more than 1.5 hours, whereas a roughly equivalent Ruby script finishes in less than 15 minutes.

You can test the code with the file "input.txt" below, and you'll see that it prints each block correctly. You can create "input.txt" with the "File.WriteAllBytes()..." line, or create it in Notepad with the following content (without the double quotes):

" ][How][many][words][we][have][here?][6][or][more?] "

I'm using the BinaryReader class and the Seek method to read in chunks of 20 bytes in this example, since the file contains only 50 bytes (I use 1024 bytes with a big file). Because the last block in a chunk could be incomplete, I look for the position where the last block in each chunk begins and store it in the variable lastPos.

Is there a way to improve my code to get a faster execution time?

I'm not sure whether the problem is BinaryReader itself or the thousands of seek operations. The main goal is to get each block so that I can parse it, but most of the time seems to be spent just separating the blocks.

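// Requires: using System; using System.IO;
//           using System.Runtime.Remoting.Metadata.W3cXsd2001; (for SoapHexBinary)
static void Main(string[] args)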
{
    File.WriteAllBytes("C:/input.txt", new byte[] { 0x5d, 0x5b, 0x48, 0x6f, 0x77, 0x5d, 0x5b, 0x6d, 0x61, 0x6e,
                                                    0x79, 0x5d, 0x5b, 0x77, 0x6f, 0x72, 0x64, 0x73, 0x5d, 0x5b,
                                                    0x77, 0x65, 0x5d, 0x5b, 0x68, 0x61, 0x76, 0x65, 0x5d, 0x5b,
                                                    0x68, 0x65, 0x72, 0x65, 0x3f, 0x5d, 0x5b, 0x36, 0x5d, 0x5b,
                                                    0x6f, 0x72, 0x5d, 0x5b, 0x6d, 0x6f, 0x72, 0x65, 0x3f, 0x5d } );

    using (BinaryReader br = new BinaryReader(File.Open("C:/input.txt", FileMode.Open)))
    {
        int lastPos = 0;
        int EachChunk = 20;
        long ReadFrom = 0;
        int c = 0;
        int count = 0;
        while(lastPos != -1 ) {
            lastPos = -1;
            br.BaseStream.Seek(ReadFrom, SeekOrigin.Begin);
            byte[] data = br.ReadBytes(EachChunk);

            //Loop to look for position of last block in current chunk
            int k = data.Length - 1;
            while(k > 0 && lastPos == -1) {
                lastPos = (data[k] == 91 && data[k-1] == 93 ? (k - 1) : (-1) );
                k--;
            }

            if (lastPos != -1) {
                Array.Resize(ref data, lastPos);
            } // Resizing array up to the last block position

            // Storing position of pointer where will begin next chunk
            ReadFrom += lastPos + 2;

            //Converting Binary data to string of hex numbers.
            SoapHexBinary shb = new SoapHexBinary(data);

            //Replace separator by Newline
            string str = shb.ToString().Replace("5D5B", Environment.NewLine);

            //Use StringReader to process each block as a line, using the newline as separator
            using (StringReader reader = new StringReader(str))
            {
                // Loop over the lines(blocks) in the string.
                string Block;
                count = c;
                while ((Block = reader.ReadLine()) != null)
                {
                    if ((String.IsNullOrWhiteSpace(Block) ||
                         String.IsNullOrEmpty(Block)) == false) {

                        // +++++ Further process for each block +++++++++++++++++++++++++
                        count++;
                        Console.WriteLine("Block # {0}: {1}", count, Block);
                        // ++++++++++++++++++++++++++++++++++++++++++++++++++
                    }
                }
            }
            c = count;
        }
    }
    Console.ReadLine();
}

Update:

I found an issue. In Mike Burdick's code the buffer begins to grow when 5B is found and is printed when 5D is found. But the blocks are separated by the sequence 0x5D 0x5B, so if a lone 5D or a lone 5B appears inside a block, the code starts loading or clearing the buffer at the wrong place. The buffer should only be loaded when the full sequence 5D5B is found, not when 5B alone is found; otherwise the result is different.

You can test with this input, where I added a 5D or a 5B inside some of the blocks. The buffer should start loading only when 5D5B is found, since 5D5B acts like the "newline" separator.

    File.WriteAllBytes("C:/input1.txt", new byte[] {
                                        0x5D, 0x5B, 0x48, 0x5D, 0x77, 0x5D, 0x5B, 0x6d, 0x5B, 0x6e,
                                        0x5D, 0x5D, 0x5B, 0x77, 0x6f, 0x72, 0x64, 0x73, 0x5D, 0x5B,
                                        0x77, 0x65, 0x5D, 0x5B, 0x68, 0x61, 0x76, 0x65, 0x5D, 0x5B,
                                        0x68, 0x65, 0x72, 0x65, 0x3f, 0x5D, 0x5B, 0x36, 0x5D, 0x5B,
                                        0x6f, 0x72, 0x5D, 0x5B, 0x6d, 0x6f, 0x72, 0x65, 0x3f, 0x5D });

Update 2:

I've tried Mike Burdick's code, but it does not give the correct output. For example, if you change the content of the input file to this:

82-F][How]]][ma[ny]][words%][we][[have][here?]]

The output should be (the below output is presented in ASCII to see it more clearly):

    82-F
    How]]
    ma[ny]
    words%
    we
    [have
    here?]]
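
One way to get the split shown above is to scan the raw bytes for the two-byte separator directly, without converting to a hex string first. The following is only a minimal sketch, not Mike Burdick's code: the names `BlockSplitter` and `SplitBlocks` are made up for illustration, and it assumes the whole input fits in memory and that empty blocks should be skipped.

```csharp
using System;
using System.Collections.Generic;

public static class BlockSplitter
{
    // Hypothetical helper: splits data on the two-byte separator 0x5D 0x5B ("][").
    // A lone 0x5D or 0x5B stays inside its block. Empty blocks are skipped.
    public static List<byte[]> SplitBlocks(byte[] data)
    {
        var blocks = new List<byte[]>();
        int start = 0; // index of the first byte of the current block

        int i = 0;
        while (i + 1 < data.Length)
        {
            if (data[i] == 0x5D && data[i + 1] == 0x5B)
            {
                AddBlock(blocks, data, start, i - start);
                i += 2;    // skip both separator bytes
                start = i;
            }
            else
            {
                i++;
            }
        }

        // Whatever follows the last separator is the final block.
        AddBlock(blocks, data, start, data.Length - start);
        return blocks;
    }

    private static void AddBlock(List<byte[]> blocks, byte[] data, int offset, int length)
    {
        if (length <= 0) return;
        var block = new byte[length];
        Array.Copy(data, offset, block, 0, length);
        blocks.Add(block);
    }
}
```

Calling `BlockSplitter.SplitBlocks` on the ASCII bytes of the sample line above returns the seven blocks listed, because a `]` or `[` only ends a block when the two bytes appear together in that order.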

Besides that, do you think BinaryReader is somewhat slow? When I test with a bigger file, the execution is still very slow.

Update 3:

I've been testing Mike Burdick's code. I modified it to handle a ] or [ that could appear in the middle of a block, although it may not be the best modification. It seems to work, and only seems to fail to print the last "]" if the file ends with "]".

For example, the same content as before: "][How][many][words][we][have][here?][6][or][more?]"

My modification of Mike Burdick's code is:

    static void OptimizedScan(string fileName)
    {
        const byte startDelimiter = 0x5d;
        const byte endDelimiter = 0x5b;

        using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
        {
            List<byte> buffer = new List<byte>();
            List<string> buffer1 = new List<string>();

            bool captureBytes = false;
            bool foundStartDelimiter = false;
            int wordCount = 0;

            SoapHexBinary hex = new SoapHexBinary();

            while (true)
            {
                byte[] chunk = reader.ReadBytes(1024);

                if (chunk.Length > 0)
                {
                    foreach (byte data in chunk)
                    {
                        if (data == startDelimiter && foundStartDelimiter == false)
                        {
                            foundStartDelimiter = true;
                        }
                        else if (data == endDelimiter && foundStartDelimiter)
                        {
                            wordCount = DisplayWord(buffer, wordCount, hex);

                            // Start capturing
                            captureBytes = true;
                            foundStartDelimiter = false;
                        }
                        else if ((data == startDelimiter && foundStartDelimiter) ||
                                 (data == endDelimiter && foundStartDelimiter == false))
                        {
                            buffer.Add(data);
                        }
                        else if (captureBytes)
                        {
                            buffer.Add(data);
                        }
                    }
                }
                else
                {
                    break;
                }
            }

            if (foundStartDelimiter)
            {
                buffer.Add(startDelimiter);
            }
            DisplayWord(buffer, wordCount, hex);
        }
    }

I think this is faster and simpler in terms of code:

    static void OptimizedScan(string fileName)
    {
        const byte startDelimiter = 0x5d;
        const byte endDelimiter = 0x5b;

        using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
        {
            List<byte> buffer = new List<byte>();

            bool captureBytes = false;
            bool foundStartDelimiter = false;
            int wordCount = 0;

            SoapHexBinary hex = new SoapHexBinary();

            while (true)
            {
                byte[] chunk = reader.ReadBytes(1024);

                if (chunk.Length > 0)
                {
                    foreach (byte data in chunk)
                    {
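                        if (data == startDelimiter)
                        {
                            // A second 0x5D in a row means the previous one was data.
                            if (foundStartDelimiter && captureBytes)
                            {
                                buffer.Add(startDelimiter);
                            }

                            foundStartDelimiter = true;
                        }
                        else if (data == endDelimiter && foundStartDelimiter)
                        {
                            wordCount = DisplayWord(buffer, wordCount, hex);

                            // Start capturing
                            captureBytes = true;
                            foundStartDelimiter = false;
                        }
                        else
                        {
                            if (captureBytes)
                            {
                                // A lone 0x5D followed by ordinary data belongs to the block.
                                if (foundStartDelimiter)
                                {
                                    buffer.Add(startDelimiter);
                                }

                                buffer.Add(data);
                            }

                            // The pending 0x5D is resolved either way.
                            foundStartDelimiter = false;
                        }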
                    }
                }
                else
                {
                    break;
                }
            }

            if (foundStartDelimiter)
            {
                buffer.Add(startDelimiter);
            }

            DisplayWord(buffer, wordCount, hex);
        }
    }
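
For comparison, here is a variant of the same byte-by-byte scan written against a plain Stream with a larger read buffer. It is a sketch under assumptions, not a measured fix: `StreamBlockReader`, `ReadBlocks`, and the 64 KB default buffer size are names and values chosen here for illustration, and whether a larger buffer actually helps would need to be timed on the real file. It holds back a pending 0x5D across reads so that a separator split over two chunks is still recognized, keeps any data before the first separator, and skips empty blocks.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StreamBlockReader
{
    // Hypothetical helper: lazily yields each block delimited by 0x5D 0x5B.
    public static IEnumerable<byte[]> ReadBlocks(Stream input, int bufferSize = 64 * 1024)
    {
        var buffer = new byte[bufferSize];
        var block = new MemoryStream();
        bool pendingClose = false; // true when the previous byte was 0x5D

        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[i];

                if (pendingClose && b == 0x5B)
                {
                    // Full separator seen: emit the block collected so far.
                    if (block.Length > 0)
                    {
                        yield return block.ToArray();
                        block.SetLength(0);
                    }
                    pendingClose = false;
                    continue;
                }

                // The held-back 0x5D was not part of a separator, so it is data.
                if (pendingClose) block.WriteByte(0x5D);

                pendingClose = (b == 0x5D);
                if (!pendingClose) block.WriteByte(b);
            }
        }

        // Flush a trailing lone 0x5D and the final block.
        if (pendingClose) block.WriteByte(0x5D);
        if (block.Length > 0) yield return block.ToArray();
    }
}
```

A caller could loop with `foreach (byte[] block in StreamBlockReader.ReadBlocks(File.OpenRead(fileName)))` and parse each block as it arrives, so only the current block is held in memory at any time.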
