Improve execution time to read binary file

2018-06-29 14:24:09

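I have written code to process a large binary file (more than 2 GB), reading it in chunks of 1024 bytes each. The file contains blocks of data, and consecutive blocks are separated by the two-byte sequence 5D5B = 0x5D 0x5B.

The code works, but for large files the execution time is more than 1 hour 30 minutes, while an equivalent Ruby script does the same job in less than 15 minutes.

You can test the code with the file "input.txt" described below, and you will see that it prints each block correctly. You can create "input.txt" either with the "File.WriteAllBytes()..." line in the code, or in Notepad with the following content (without the double quotes):

"][How][many][words][we][have][here?][6][or][more?]"

I use the BinaryReader class and the Seek method to read the file in chunks. In this example the chunk size is 20 bytes because the file only contains 50 bytes (with a large file I use 1024 bytes). Within each chunk I look for the start of the last block and store its position in the variable lastPos, because the last block in a chunk may be incomplete.

Is there a way to improve my code to get a faster execution time?

I'm not sure whether the problem is BinaryReader itself or the thousands of seek operations. The real goal is to get each block so that I can apply some parsing to it, but it seems that most of the time is spent just separating the blocks.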

static void Main(string[] args)
{
    File.WriteAllBytes("C:/input.txt", new byte[] { 0x5d, 0x5b, 0x48, 0x6f, 0x77, 0x5d, 0x5b, 0x6d, 0x61, 0x6e,
                                                    0x79, 0x5d, 0x5b, 0x77, 0x6f, 0x72, 0x64, 0x73, 0x5d, 0x5b,
                                                    0x77, 0x65, 0x5d, 0x5b, 0x68, 0x61, 0x76, 0x65, 0x5d, 0x5b,
                                                    0x68, 0x65, 0x72, 0x65, 0x3f, 0x5d, 0x5b, 0x36, 0x5d, 0x5b,
                                                    0x6f, 0x72, 0x5d, 0x5b, 0x6d, 0x6f, 0x72, 0x65, 0x3f, 0x5d } );

    using (BinaryReader br = new BinaryReader(File.Open("C:/input.txt", FileMode.Open)))
    {
        int lastPos = 0;
        int EachChunk = 20;
        long ReadFrom = 0;
        int c = 0;
        int count = 0;
        while(lastPos != -1 ) {
            lastPos = -1;
            br.BaseStream.Seek(ReadFrom, SeekOrigin.Begin);
            byte[] data = br.ReadBytes(EachChunk);

            //Loop to look for position of last block in current chunk
            int k = data.Length - 1;
            while(k > 0 && lastPos == -1) {
                lastPos = (data[k] == 91 && data[k-1] == 93 ? (k - 1) : (-1) );
                k--;
            }

            if (lastPos != -1) {
                Array.Resize(ref data, lastPos);
            } // Resizing array up to the last block position

            // Storing position of pointer where will begin next chunk
            ReadFrom += lastPos + 2;

            //Converting Binary data to string of hex numbers.
            SoapHexBinary shb = new SoapHexBinary(data);

            //Replace separator by Newline
            string str = shb.ToString().Replace("5D5B", Environment.NewLine);

            //Use StringReader to process each block as a line, using the newline as separator
            using (StringReader reader = new StringReader(str))
            {
                // Loop over the lines(blocks) in the string.
                string Block;
                count = c;
                while ((Block = reader.ReadLine()) != null)
                {
                    if ((String.IsNullOrWhiteSpace(Block) ||
                         String.IsNullOrEmpty(Block)) == false) {

                        // +++++ Further process for each block +++++++++++++++++++++++++
                        count++;
                        Console.WriteLine("Block # {0}: {1}", count, Block);
                        // ++++++++++++++++++++++++++++++++++++++++++++++++++
                    }
                }
            }
            c = count;
        }
    }
    Console.ReadLine();
}
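
For comparison, here is a minimal sketch of a single-pass splitter that never calls Seek and never converts the data to a hex string. It is not code from the question: the names `BlockReader`, `ReadBlocks` and `bufferSize` are made up for illustration. It compares raw bytes, which also avoids a pitfall of replacing "5D5B" in a hex string: that replacement can match across byte boundaries (for example, the bytes 0x15 0xD5 0xB0 become "15D5B0").

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BlockReader
{
    const byte SepFirst = 0x5D;   // ']'
    const byte SepSecond = 0x5B;  // '['

    // Yields every non-empty block delimited by the two-byte sequence 0x5D 0x5B.
    // A lone 0x5D or 0x5B is treated as ordinary data.
    public static IEnumerable<byte[]> ReadBlocks(Stream input, int bufferSize = 64 * 1024)
    {
        byte[] buffer = new byte[bufferSize];
        List<byte> block = new List<byte>();
        bool pendingFirst = false; // previous byte was 0x5D and is not classified yet
        int read;

        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[i];

                if (pendingFirst)
                {
                    pendingFirst = false;
                    if (b == SepSecond)
                    {
                        // Complete separator: emit the block collected so far
                        if (block.Count > 0)
                        {
                            yield return block.ToArray();
                            block.Clear();
                        }
                        continue;
                    }
                    // The earlier 0x5D was not part of a separator, so it is data
                    block.Add(SepFirst);
                }

                if (b == SepFirst)
                    pendingFirst = true; // decide once the next byte is known
                else
                    block.Add(b);
            }
        }

        // End of input: flush a trailing 0x5D and the last block
        if (pendingFirst) block.Add(SepFirst);
        if (block.Count > 0) yield return block.ToArray();
    }
}
```

Because the `pendingFirst` flag survives from one buffer to the next, a separator split across two reads is still recognized. A caller could iterate with `foreach (byte[] block in BlockReader.ReadBlocks(File.OpenRead(path)))` and parse each block there; whether this closes the gap to the Ruby script would have to be measured.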

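UPDATE:

I found an issue. In Mike Burdick's code the buffer starts filling when 5B is found and is printed when 5D is found. But since blocks are separated by the sequence 0x5D 0x5B, a lone 5D or a lone 5B inside a block makes the code start loading or clearing the buffer. The buffer should be loaded only when the sequence 5D5B is found, not whenever a 5B is found; otherwise the result is different.

You can test with the input below, which has 5D or 5B added inside some blocks. I only want to continue and load the buffer once 5D5B has been found, because 5D5B acts like a "newline" separator.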

    File.WriteAllBytes("C:/input1.txt", new byte[] {
                                        0x5D, 0x5B, 0x48, 0x5D, 0x77, 0x5D, 0x5B, 0x6d, 0x5B, 0x6e,
                                        0x5D, 0x5D, 0x5B, 0x77, 0x6f, 0x72, 0x64, 0x73, 0x5D, 0x5B,
                                        0x77, 0x65, 0x5D, 0x5B, 0x68, 0x61, 0x76, 0x65, 0x5D, 0x5B,
                                        0x68, 0x65, 0x72, 0x65, 0x3f, 0x5D, 0x5B, 0x36, 0x5D, 0x5B,
                                        0x6f, 0x72, 0x5D, 0x5B, 0x6d, 0x6f, 0x72, 0x65, 0x3f, 0x5D });

UPDATE 2:

I tried Mike Burdick's code, but it does not produce the correct output. For example, if you change the content of the input file to this:

82-F][How]]][ma[ny]][words%][we][[have][here?]]

the output should be (shown here in ASCII so that it is easier to read):

    82-F
    How]]
    ma[ny]
    words%
    we
    [have
    here?]]
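
The expected output above can be checked against a small in-memory reference splitter. This is only a sketch for verifying results on small inputs, not something meant for the 2 GB file; the names `ReferenceSplitter` and `SplitAscii` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ReferenceSplitter
{
    // Splits data on every occurrence of the two-byte separator 0x5D 0x5B ("][").
    // Lone 0x5D or 0x5B bytes stay inside their block. Empty blocks are skipped.
    public static List<string> SplitAscii(byte[] data)
    {
        List<string> blocks = new List<string>();
        int start = 0; // index where the current block begins
        int i = 0;

        while (i < data.Length - 1)
        {
            if (data[i] == 0x5D && data[i + 1] == 0x5B)
            {
                if (i > start)
                    blocks.Add(Encoding.ASCII.GetString(data, start, i - start));
                i += 2;    // step over the separator
                start = i;
            }
            else
            {
                i++;
            }
        }

        // Whatever follows the last separator is the final block
        if (data.Length > start)
            blocks.Add(Encoding.ASCII.GetString(data, start, data.Length - start));

        return blocks;
    }
}
```

Joining the seven expected blocks with "][" and passing the resulting bytes to `SplitAscii` should give back the same seven blocks, which makes it a convenient oracle when comparing faster implementations.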

Apart from this, do you think BinaryReader is inherently slow? When I test with a bigger file, execution is still very slow.
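
One way to find out whether reading is the slow part is to time a loop that only reads the file and does nothing else, then compare that figure with the full run. The sketch below assumes a plain FileStream opened with FileOptions.SequentialScan; the names `ReadTiming` and `MeasureRawRead` are made up for illustration, and the result will depend on the disk and on the operating system's file cache.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ReadTiming
{
    // Reads the whole file sequentially without processing it.
    // Returns the number of bytes read; elapsed receives the time spent reading.
    public static long MeasureRawRead(string path, int bufferSize, out TimeSpan elapsed)
    {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        Stopwatch sw = Stopwatch.StartNew();

        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                              FileShare.Read, bufferSize,
                                              FileOptions.SequentialScan))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                total += read; // count the bytes only, no parsing
            }
        }

        sw.Stop();
        elapsed = sw.Elapsed;
        return total;
    }
}
```

If the bare read finishes in a small fraction of the total time, the cost is more likely in the block separation (seeks, array resizing, string conversion) than in the reader.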

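UPDATE #3:

I have been testing Mike Burdick's code. This may not be the best way to modify it: I changed it to handle a "]" or "[" that appears in the middle of a block. It seems to work; the only thing it seems unable to do is print the final "]" when the file ends with "]".

For example, with the same content as before: "][How][many][words][we][have][here?][6][or][more?]"

My modification of Mike Burdick's code is: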

    static void OptimizedScan(string fileName)
    {
        const byte startDelimiter = 0x5d;
        const byte endDelimiter = 0x5b;

        using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
        {
            List<byte> buffer = new List<byte>();
            List<string> buffer1 = new List<string>();

            bool captureBytes = false;
            bool foundStartDelimiter = false;
            int wordCount = 0;

            SoapHexBinary hex = new SoapHexBinary();

            while (true)
            {
                byte[] chunk = reader.ReadBytes(1024);

                if (chunk.Length > 0)
                {
                    foreach (byte data in chunk)
                    {
                        if (data == startDelimiter && foundStartDelimiter == false)
                        {
                            foundStartDelimiter = true;
                        }
                        else if (data == endDelimiter && foundStartDelimiter)
                        {
                            wordCount = DisplayWord(buffer, wordCount, hex);

                            // Start capturing
                            captureBytes = true;
                            foundStartDelimiter = false;
                        }
                        else if ((data == startDelimiter && foundStartDelimiter) ||
                                 (data == endDelimiter && foundStartDelimiter == false))
                        {
                            buffer.Add(data);
                        }
                        else if (captureBytes)
                        {
                            buffer.Add(data);
                        }
                    }
                }
                else
                {
                    break;
                }
            }

            if (foundStartDelimiter)
            {
                buffer.Add(startDelimiter);
            }
            DisplayWord(buffer, wordCount, hex);
        }
    }

I think this version is faster and simpler in terms of code:

    static void OptimizedScan(string fileName)
    {
        const byte startDelimiter = 0x5d;
        const byte endDelimiter = 0x5b;

        using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
        {
            List<byte> buffer = new List<byte>();

            bool captureBytes = false;
            bool foundStartDelimiter = false;
            int wordCount = 0;

            SoapHexBinary hex = new SoapHexBinary();

            while (true)
            {
                byte[] chunk = reader.ReadBytes(1024);

                if (chunk.Length > 0)
                {
                    foreach (byte data in chunk)
                    {
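                        if (data == startDelimiter)
                        {
                            // Two 0x5D in a row: the earlier one was data, not a separator
                            if (foundStartDelimiter && captureBytes)
                            {
                                buffer.Add(startDelimiter);
                            }

                            foundStartDelimiter = true;
                        }
                        else if (data == endDelimiter && foundStartDelimiter)
                        {
                            wordCount = DisplayWord(buffer, wordCount, hex);

                            // Start capturing
                            captureBytes = true;
                            foundStartDelimiter = false;
                        }
                        else if (captureBytes)
                        {
                            if (foundStartDelimiter)
                            {
                                // The pending 0x5D was not followed by 0x5B, so keep it as data
                                buffer.Add(startDelimiter);
                                foundStartDelimiter = false;
                            }

                            buffer.Add(data);
                        }
                        else
                        {
                            // Not capturing yet: forget a 0x5D that did not start a separator
                            foundStartDelimiter = false;
                        }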
                    }
                }
                else
                {
                    break;
                }
            }

            if (foundStartDelimiter)
            {
                buffer.Add(startDelimiter);
            }

            DisplayWord(buffer, wordCount, hex);
        }
    }
