Write a program to find 100 largest numbers out of an array of 1 billion numbers

I recently attended an interview where I was asked "write a program to find 100 largest numbers out of an array of 1 billion numbers."

I was only able to give a brute force solution which was to sort the array in O(nlogn) time complexity and take the last 100 numbers.

Arrays.sort(array);

The interviewer was looking for a better time complexity. I tried a couple of other solutions but could not come up with one that satisfied him. Is there a solution with better time complexity?


You can keep a priority queue of the 100 largest numbers seen so far and iterate through the billion numbers. Whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head of the queue and add the new number to the queue.

EDIT: as Dev noted, with a priority queue implemented as a heap, the complexity of an insertion into the queue is O(log K), where K is the size of the queue (100 here).

In the worst case you get billion * log2(100) operations, which is better than billion * log2(billion).

In general, if you need the largest K numbers from a set of N numbers, the complexity is O(N log K) rather than O(N log N). This can be very significant when K is very small compared to N.
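The priority-queue approach described above can be sketched in Java roughly as follows. This is a minimal illustration, not the answerer's own code: the class name `TopK` and the method `largest` are made up for the example, and it assumes the numbers fit in a `long[]` that is already in memory.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

class TopK {
    /** Returns the k largest values of data in ascending order (hypothetical helper). */
    static long[] largest(long[] data, int k) {
        if (k <= 0) {
            return new long[0];
        }
        // java.util.PriorityQueue is a min-heap by default,
        // so peek() is the smallest of the values kept so far.
        PriorityQueue<Long> heap = new PriorityQueue<>(k);
        for (long value : data) {
            if (heap.size() < k) {
                heap.add(value);              // the first k values always go in
            } else if (value > heap.peek()) {
                heap.poll();                  // drop the current minimum: O(log k)
                heap.add(value);              // insert the larger value: O(log k)
            }
        }
        long[] result = new long[heap.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = heap.poll();          // polls come out smallest first
        }
        return result;
    }

    public static void main(String[] args) {
        long[] data = {5, 1, 9, 3, 7, 9, 2};
        System.out.println(Arrays.toString(largest(data, 3))); // prints [7, 9, 9]
    }
}
```

For the question as asked, `k` would be 100, so the heap never holds more than 100 boxed values regardless of how long the input is.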

EDIT2:

The expected time of this algorithm is pretty interesting, since in each iteration an insertion may or may not occur. The probability that the i'th number is inserted into the queue is the probability of a random variable being larger than at least i-k random variables from the same distribution (the first k numbers are automatically added to the queue). We can use order statistics (see link) to calculate this probability. For example, let's assume the numbers were selected uniformly at random from [0, 1]. The expected value of the (i-k)th smallest number (out of i numbers) is approximately (i-k)/i, and the chance of a random variable being larger than this value is 1-[(i-k)/i] = k/i.

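Thus, the expected number of insertions is:

$$\sum_{i=k+1}^{n} \frac{k}{i} = k\,(H_n - H_k) \approx k \ln\frac{n}{k}$$

where H_m is the m-th harmonic number.

And the expected running time can be expressed as:

$$k + (n - k) + \frac{\log k}{2} \sum_{i=k+1}^{n} \frac{k}{i} \approx n + \frac{k \log k}{2} \ln\frac{n}{k}$$

(k time to build the queue from the first k elements, then n-k comparisons, plus the expected number of insertions described above, each taking log(k)/2 time on average)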

Note that when n is very large compared to k, this expression is a lot closer to n than to n log k. This is somewhat intuitive: in the case of the question, even after 10000 iterations (very few compared to a billion), the chance of a number being inserted into the queue is very small.
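To get a feel for how few insertions that is, the k/i probability from the argument above can simply be summed. The sketch below is only an illustration under the same random-order assumption; the class and method names (`ExpectedInsertions`, `exact`, `approx`) are invented for the example.

```java
class ExpectedInsertions {
    /** Sums k/i for i = k+1 .. n, i.e. the expected number of insertions after the first k. */
    static double exact(long n, int k) {
        double sum = 0.0;
        // Add the smallest terms first to limit floating-point rounding error.
        for (long i = n; i > k; i--) {
            sum += (double) k / i;
        }
        return sum;
    }

    /** Closed-form approximation k * ln(n / k) of the same sum. */
    static double approx(long n, int k) {
        return k * Math.log((double) n / k);
    }

    public static void main(String[] args) {
        int k = 100;
        long small = 1_000_000L;
        System.out.println("n = 1e6: exact  = " + exact(small, k));
        System.out.println("n = 1e6: approx = " + approx(small, k));
        // For the billion-element case only the closed form is evaluated here,
        // to keep the run short.
        System.out.println("n = 1e9: approx = " + approx(1_000_000_000L, k));
    }
}
```

For n = 10^9 and k = 100 the approximation comes to roughly 1,600 expected insertions, so on randomly ordered input almost all of the work is the n-k comparisons against the head of the queue.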


If this is asked in an interview, I think the interviewer probably wants to see your problem solving process, not just your knowledge of algorithms.

The description is quite general, so you could ask the interviewer about the range or meaning of these numbers to make the problem clearer. Doing this may impress an interviewer. If, for example, the numbers stand for the ages of people within a country (e.g. China), then it's a much easier problem. With the reasonable assumption that nobody alive is older than 200, you can use an int array of size 200 (or 201, if ages run from 0 to 200 inclusive) to count the number of people of each age in a single pass. Here the index is the age. After this it's a piece of cake to find the 100 largest numbers. By the way, this algorithm is called counting sort.
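As a rough sketch of that counting idea in Java: the names `AgeCounting`, `MAX_AGE` and `largest` are invented for the example, and the code assumes every input value really is an age between 0 and 200 inclusive.

```java
class AgeCounting {
    // Assumption taken from the text above: nobody is older than 200.
    static final int MAX_AGE = 200;

    /** Returns the k largest ages in descending order; every age must be in 0..MAX_AGE. */
    static int[] largest(int[] ages, int k) {
        int[] counts = new int[MAX_AGE + 1];   // index = age, value = number of people
        for (int age : ages) {
            counts[age]++;                     // single pass over the input
        }
        int[] result = new int[Math.max(0, Math.min(k, ages.length))];
        int filled = 0;
        // Walk the counts from the oldest age down until k values are collected.
        for (int age = MAX_AGE; age >= 0 && filled < result.length; age--) {
            for (int c = counts[age]; c > 0 && filled < result.length; c--) {
                result[filled++] = age;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] ages = {30, 5, 120, 30, 77};
        System.out.println(java.util.Arrays.toString(largest(ages, 3))); // prints [120, 77, 30]
    }
}
```

The pass over the input is O(n) and the second phase touches at most 201 buckets plus k outputs, which is why knowing the value range makes the problem so much easier.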

Anyway, making the question more specific and clearer is good for you in an interview.


You can iterate over the numbers, which takes O(n).

Whenever you find a value greater than the current minimum, add the new value to a circular queue of size 100.

The minimum of that circular queue is your new comparison value. Keep adding to the queue; if it is full, extract the minimum from the queue first.
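The answer does not say how the minimum of the queue is tracked, so the following Java sketch is only one possible reading of it: a plain fixed-size array stands in for the circular queue and is rescanned for its minimum after every replacement. The names `BufferTopK`, `largest` and `indexOfMin` are invented for the example.

```java
class BufferTopK {
    /** Returns the k largest values of data in no particular order (hypothetical helper). */
    static int[] largest(int[] data, int k) {
        int size = Math.min(k, data.length);
        if (size <= 0) {
            return new int[0];
        }
        int[] buffer = new int[size];
        System.arraycopy(data, 0, buffer, 0, size);   // seed with the first k values
        int minIndex = indexOfMin(buffer);
        for (int i = size; i < data.length; i++) {
            if (data[i] > buffer[minIndex]) {
                buffer[minIndex] = data[i];           // overwrite the current minimum
                minIndex = indexOfMin(buffer);        // O(k) rescan for the new minimum
            }
        }
        return buffer;
    }

    /** Linear scan for the position of the smallest element. */
    static int indexOfMin(int[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] < values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] top = largest(new int[]{5, 1, 9, 3, 7, 9, 2}, 3);
        java.util.Arrays.sort(top);
        System.out.println(java.util.Arrays.toString(top)); // prints [7, 9, 9]
    }
}
```

Under this reading each replacement costs O(K) instead of the O(log K) of a heap, so the worst case is O(N*K). With K = 100 and randomly ordered input, replacements are rare enough that the difference may not matter much in practice.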
