Find a missing number using limited memory

2018-06-25 06:53:15

The problem: given an input file containing four billion unique integers, provide an algorithm that generates an integer not contained in the file, assuming only 10 MB of memory is available.

I searched for solutions and have posted some code below. One approach divides the 4-billion range of integers into blocks, each covering a specific sub-range, and keeps a counter per block for the number of input integers that fall into it. If a block's count is less than the number of integers the block can hold, that block is missing at least one integer. A bit vector for that block (one bit per integer) is then filled in and scanned to find which integers are missing.

My question about this solution is: why is it that "the nearer to the middle that we pick, the less memory will be used at any given time"? Here is more context:

The array in the first pass can fit in 10 megabytes, or roughly 2^23 bytes, of memory. Since each element in the array is an int, and an int is 4 bytes, we can hold an array of at most about 2^21 elements. So, we can deduce the following:

[Image missing from the source: the deduction that leads to the bounds stated next.]

Therefore, we can conclude the following: 2^10 < rangeSize < 2^26. These conditions give us a good amount of "wiggle room," but the nearer to the middle that we pick, the less memory will be used at any given time.
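
The deduction itself did not survive in this copy. A plausible reconstruction, assuming the input values are non-negative 32-bit ints (so 2^31 possible values; that is my assumption, chosen because it makes the lower bound come out to 2^10), would be:

```latex
\begin{align*}
\text{arraySize} = \frac{2^{31}}{\text{rangeSize}} \le 2^{21}
  &\;\Longrightarrow\; \text{rangeSize} \ge 2^{10} \\
\text{rangeSize bits} \le 2^{23}\,\text{bytes} \times 8
  &\;\Longrightarrow\; \text{rangeSize} \le 2^{26}
\end{align*}
```

The first line says the array of per-block counters has to fit in memory, and the second says the bit vector for a single block has to fit as well.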

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

public class QuestionB {
    public static int bitsize = 1048576; // 2^20 bits (2^17 bytes)
    public static int blockNum = 4096; // 2^12
    public static byte[] bitfield = new byte[bitsize/8];
    public static int[] blocks = new int[blockNum];

    public static void findOpenNumber() throws FileNotFoundException {
        int starting = -1;
        Scanner in = new Scanner (new FileReader ("Chapter 10/Question10_3/input_file_q10_3.txt"));
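        /* First pass: count how many values fall into each block.
         * Assumes the input holds only non-negative ints. */
        while (in.hasNextInt()) {
            int n = in.nextInt();
            blocks[n / (bitfield.length * 8)]++;
        }
        in.close();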

        for (int i = 0; i < blocks.length; i++) {
            if (blocks[i] < bitfield.length * 8){
                /* if value < 2^20, then at least 1 number is missing in
                 * that section. */
                starting = i * bitfield.length * 8;
                break;
            }
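        }
        if (starting < 0) {
            /* No block in the non-negative int range is missing a value
             * (this also catches an overflowed block start). */
            return;
        }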

        in = new Scanner(new FileReader("Chapter 10/Question10_3/input_file_q10_3.txt"));
        while (in.hasNextInt()) {
            int n = in.nextInt();
            /* If the number is inside the block that's missing
             * numbers, we record it. Comparing n - starting avoids int
             * overflow of starting + block size for the last block. */
            if (n >= starting && n - starting < bitfield.length * 8) {
                bitfield [(n-starting) / 8] |= 1 << ((n - starting) % 8);
            }
        }

        for (int i = 0 ; i < bitfield.length; i++) {
            for (int j = 0; j < 8; j++) {
                /* Retrieves the individual bits of each byte. When 0 bit 
                 * is found, finds the corresponding value. */
                if ((bitfield[i] & (1 << j)) == 0) {
                    System.out.println(i * 8 + j + starting);
                    return;
                }
            }
        }       
    }

    public static void main(String[] args) throws FileNotFoundException {
        findOpenNumber();
    }

}
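
For experimenting with the idea without a 4-billion-line file, here is a minimal scaled-down sketch of the same two passes on an in-memory array. The class name `MissingNumberSketch`, the method `findMissing`, and its parameters are made up for illustration, and it uses `java.util.BitSet` in place of the hand-rolled byte array. It assumes the values are distinct.

```java
import java.util.BitSet;

/** Hypothetical scaled-down version of the two-pass idea, using an array instead of a file. */
final class MissingNumberSketch {

    /**
     * Returns a value in [0, blockCount * blockSize) that is absent from
     * values, or -1 if no block is short of values.
     * Assumes the values are distinct; negative values are ignored.
     */
    static long findMissing(int[] values, int blockCount, int blockSize) {
        // Pass 1: count the values that land in each block.
        int[] counts = new int[blockCount];
        for (int v : values) {
            if (v >= 0 && v / blockSize < blockCount) {
                counts[v / blockSize]++;
            }
        }

        // Pick the first block holding fewer values than it has room for.
        int chosen = -1;
        for (int i = 0; i < blockCount; i++) {
            if (counts[i] < blockSize) {
                chosen = i;
                break;
            }
        }
        if (chosen < 0) {
            return -1;
        }

        // Pass 2: mark the values of the chosen block in a bit vector.
        long start = (long) chosen * blockSize;
        BitSet seen = new BitSet(blockSize);
        for (int v : values) {
            long offset = v - start;
            if (offset >= 0 && offset < blockSize) {
                seen.set((int) offset);
            }
        }

        // The first clear bit is a value the input does not contain.
        return start + seen.nextClearBit(0);
    }
}
```

With two blocks of four values each, `findMissing(new int[]{0, 1, 2, 3, 5, 6, 7}, 2, 4)` skips the full first block and returns 4 from the second. Only the counters and one block's bit vector are in memory at any time, which is the point of the approach.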

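If you form M blocks, each of size 2^32/M, the total memory required is M + 2^27/M words (32 bits each): M words for the counters plus 2^32/M bits, i.e. 2^27/M words, for one block's bit vector. This function reaches its minimum when M = √(2^27) = 2^13.5, which on a logarithmic scale is halfway between 1 and 2^27 blocks. The minimum is 2 · 2^13.5 = 2^14.5 words, about 92 KBytes.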
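
To see the trade-off numerically, here is a small sketch that evaluates the cost formula above. The class and method names (`BlockMemoryCost`, `totalWords`) are just illustrative, and it keeps the answer's assumptions: a 2^32 value range and one 32-bit word per counter.

```java
/** Hypothetical helper that evaluates the memory formula M + 2^27 / M. */
final class BlockMemoryCost {

    /** 32-bit words needed for `blocks` counters plus one block's bit vector. */
    static double totalWords(double blocks) {
        double counterWords = blocks;                      // one int counter per block
        double bitVectorWords = Math.pow(2, 27) / blocks;  // (2^32 / M) bits at 32 bits per word
        return counterWords + bitVectorWords;
    }

    /** Same cost expressed in bytes (4 bytes per word). */
    static double totalBytes(double blocks) {
        return 4 * totalWords(blocks);
    }
}
```

With few blocks the bit vector dominates (`totalWords(1024)` is 1024 + 131072 words), with many blocks the counters dominate (`totalWords(1 << 20)` is 1048576 + 128 words), and the two terms balance at M = 2^13.5, where the total is 2^14.5 words.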

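This is very clear on a log-log (bilogarithmic) plot, where the two terms M and 2^27/M are straight lines of slope +1 and -1 that cross at M = 2^13.5.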

I like this question. I'll give it more thought, but I think that if disk space and time are not an issue, you can break the numbers into blocks of 100k, write each block to its own file, and sort each file. Any block that doesn't have 100k entries will have a gap. It's not elegant at all, but it gets the ball rolling.
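
As a rough sketch of the last step of that idea, here is how a gap could be located once one block has been loaded and sorted. This is only my reading of the suggestion: it assumes each block covers a contiguous value range [lo, lo + width) and holds distinct values from that range, and the names `SortedChunkGap` and `firstGap` are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: find the first missing value in one sorted block. */
final class SortedChunkGap {

    /**
     * Returns the smallest value in [lo, lo + width) that is absent from
     * chunk, or -1 if the block is complete. Assumes chunk holds distinct
     * values that all lie inside that range.
     */
    static long firstGap(int[] chunk, int lo, int width) {
        int[] sorted = chunk.clone();
        Arrays.sort(sorted);

        long expected = lo;
        for (int v : sorted) {
            if (v != expected) {
                return expected;   // the sorted run skipped this value
            }
            expected++;
        }
        // Every stored value matched; a gap remains only if the block is short.
        return expected < (long) lo + width ? expected : -1;
    }
}
```

For example, `firstGap(new int[]{12, 10, 11, 14}, 10, 5)` returns 13.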
