A puzzle on data structures

I came across this puzzle (related to data structures) in a coding competition.

There is a planet of trees (real trees, not the tree data structure!). It has billions or even trillions of trees. The king orders that the median age of all the trees be found (ages are whole numbers of years), using, say, carbon dating. (The method does not matter.) Note: the median is the "middle number" in a sorted list of numbers.

Constraints:
1. The oldest tree is known to be 2000 years old.
2. They have a single machine, which can store integers of any size (from -infinity to +infinity).
3. But only 1 million such integers can be held in memory at a time.

So once 1 million integers are stored, you must delete one that is already stored before you can store the next.
Somehow they have to keep track of the median as they go on counting the ages of the trees.
How can they do this?

My approach
Use a variant of external sort to sort the ages in chunks and write them to a file.
Apply k-way merging to the chunks.
The problem with this approach is that it needs two scans of the file.
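
For reference, the external-sort idea might look roughly like the sketch below in Python. This is only an illustration under assumptions: the function names (write_sorted_runs, external_median) and the chunk_size parameter are made up here, the runs go to temporary files, and the "lower" middle value is returned when the number of ages is even.

```python
import heapq
import itertools
import tempfile


def write_sorted_runs(ages, chunk_size):
    """Write the stream as sorted runs of at most chunk_size ages each.

    Returns (list of open temp files rewound to the start, total count).
    """
    ages = iter(ages)
    runs = []
    total = 0
    while True:
        chunk = sorted(itertools.islice(ages, chunk_size))
        if not chunk:
            break
        total += len(chunk)
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{age}\n" for age in chunk)
        run.seek(0)
        runs.append(run)
    return runs, total


def external_median(ages, chunk_size):
    """Lower median via sorted runs on disk plus a k-way merge."""
    runs, total = write_sorted_runs(ages, chunk_size)
    if total == 0:
        raise ValueError("no ages given")
    try:
        # heapq.merge keeps only one pending value per run in memory.
        merged = heapq.merge(*[(int(line) for line in run) for run in runs])
        middle = (total - 1) // 2  # 0-based index of the lower median
        return next(itertools.islice(merged, middle, None))
    finally:
        for run in runs:
            run.close()
```

Every age is written once and read back once, which is the second scan mentioned above.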

I can think of another approach that uses the fact that the oldest tree is known to be 2000 years old.
Can't we use a count array, since the range of tree ages is fixed?

I want to know: is there any better approach?
Does there exist any method where we do not need to store the data in a file (where main memory alone is sufficient)?


You can do this by storing just 2001 integers. Create an array with one counter for each possible age:

ages[2001] // one counter per age, indices [0..2000], all starting at 0

When you count a tree:

ages[thisAge]++

Then computing the median is trivial. You seem to have hit on this solution in the second approach you mention, but then you say: "I want to know: is there any better approach?"

"Does there exist any method where we do not need to store the data in a file (where main memory alone is sufficient)?"

I don't understand why you ask whether there is a method for which main memory is sufficient. Doesn't an array of 2001 integers fit in main memory?

Using the approach above, you fill the array of counts and then calculate the median by iterating through the counts, keeping a running total as you go. When the running total reaches half the total number of trees, you have the median. (If the number of trees is even, the usual convention is to average the two middle ages.) This requires one pass through all the trees to count them, plus a pass through at most 2001 entries of the count array, so it is O(n). You could instead keep track of the median with this array as you go, but that would not really improve on the solution.
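
A minimal Python sketch of the counting approach, for illustration only. The names count_ages and median_from_counts are invented here, and averaging the two middle ages for an even number of trees is an assumption about which median is wanted.

```python
MAX_AGE = 2000  # from the constraint: the oldest tree is 2000 years old


def count_ages(ages):
    """One pass over the trees: counts[a] = number of trees of age a."""
    counts = [0] * (MAX_AGE + 1)
    for age in ages:
        counts[age] += 1
    return counts


def median_from_counts(counts):
    """Walk the counts with a running total until the middle rank is reached."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no trees counted")
    lo_rank = (total + 1) // 2  # 1-based rank of the lower middle tree
    hi_rank = total // 2 + 1    # 1-based rank of the upper middle tree
    running = 0
    lo_age = None
    for age, n in enumerate(counts):
        running += n
        if lo_age is None and running >= lo_rank:
            lo_age = age
        if running >= hi_rank:
            # For an odd total both ranks coincide, so this is just the age.
            return (lo_age + age) / 2
```

For example, median_from_counts(count_ages([1, 2, 3, 4])) averages the two middle ages 2 and 3. Only the 2001 counters are ever held in memory, however many trees are counted.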


The approach you recommended (an array of 2001 counters, one per year of age) is O(n), with one fast operation per tree, so it is optimal.

Well, almost optimal. If the total number of trees is known, then at some point during the count the number of remaining trees will be too small to change the result. For example, if I have counted half + 1 of the trees and all are exactly 100 years old, then I have my answer: 100 years.

But if the trees are well scattered in age, the number of trees that must be counted will be close to the total.
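
Here is a rough Python sketch of that early exit, under some assumptions: the total number of trees is known in advance, the "lower" middle value is used as the median, and the function name median_with_early_stop is made up for this example. It rescans the counters after every tree to keep the code short, so it trades the single fast operation per tree for up to 2001 steps; a real version would check less often.

```python
def median_with_early_stop(ages, total, max_age=2000):
    """Return (lower median, trees examined), stopping once the result is fixed."""
    counts = [0] * (max_age + 1)
    rank = (total + 1) // 2  # 1-based rank of the lower median
    seen = 0
    for age in ages:
        counts[age] += 1
        seen += 1
        remaining = total - seen
        below = 0  # trees seen so far that are younger than the candidate
        for candidate, n in enumerate(counts):
            if below + n >= rank:
                # At least `rank` trees are <= candidate already. If even all
                # remaining trees were younger, fewer than `rank` trees would
                # be below it, so the median cannot move any more.
                if below + remaining < rank:
                    return candidate, seen
                break
            below += n
    raise ValueError("fewer trees than the stated total")
```

With five trees of which the first three are all 100 years old, the function returns after the third tree; with well-scattered ages it has to look at every tree.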
