th biggest product in a large matrix of numbers, fast

2018-06-21 04:06:48

I'm working on a sorting/ranking algorithm that works with quite large number of items and I need to implement the following algorithm in an efficient way to make it work:

There are two lists of numbers. They are equally long, about 100-500 thousand items. From this I need to find the n-th biggest product between these lists, ie. if you create a matrix where on top you have one list, on the side you have the other one and each cell is the product of the number above and the number on the side.

Example: The lists are A=[1, 3, 4] and B=[2, 2, 5] . Then the products are [2, 2, 5, 6, 6, 15, 8, 8, 20] . If I wanted the 3rd biggest from that it would be 8.

The naive solution would be to simply generate those numbers, sort them and then select the n-th biggest. But that is O(m^2 * log m^2) where m is the number of elements in the small lists, and that is just not fast enough.

I think what I need is to first sort the two small lists. That is O(m * log m) . Then I know for sure that the biggest one A[0]*B[0]. Second biggest one is either A[0]*B[1] or A[1]*B[0], ...

I feel like this could be done in O(f(n)) steps, independent of the size of the matrix. But I can't figure out an efficient way to do this part.

Edit: There was an answer that got deleted, which suggested to remember position in the two sorted sets and then look at A[a]*B[b+1] and A[a+1]*B[b], returning the bigger one and incrementing a/b. I was going to post this comment before it got deleted:

This won't work. Imagine two lists A=B=[3,2,1]. This will give you matrix like [9,6,3 ; 6,4,2 ; 3,2,1]. So you start at (0,0)=9, go to (0,1)=6 and then the choice is (0,2)=3 or (1,1)=4. However, this will miss the (1,0)=6 which is bigger then both. So you can't just look to the two neighbors but you have to backtrack.

I think it can be done in O(n log n + n log m) . Here's a sketch of my algorithm, which I think will work. It's a little rough.

Sort A descending. (takes O(m log m) )

Sort B descending. (takes O(m log m) )

Let s be min(m, n) . (takes O(1) )

Create s lazy sequence iterators L[0] through L[s-1] . L[i] will iterate through the s values A[i]*B[0] , A[i]*B[1] , ..., A[i]*B[s-1] . (takes O(s) )

Put the iterators in a priority queue q . The iterators will be prioritized according to their current value. (takes O(s) because initially they are already in order)

Pull n values from q . The last value pulled will be the desired result. When an iterator is pulled, it is re-inserted in q using its next value as the new priority. If the iterator has been exhausted, do not re-insert it. (takes O(n log s) )

In all, this algorithm will take O(m log m + (s + n)log s) , but s is equal to either m or n .

I don't think there is an algorithm of O(f(n)), which is independent of m.

But there is a relatively fast O(n*logm) algo:

At first, we sort the two arrays, we get A[0] > A[1] > ... > A[m-1] and B[0] > B[1] > ... > B[m-1]. (This is O(mlogm), of course.)

Then we build a max-heap, whose elements are A[0]*B[0], A[0]*B[1], ... A[0]*B[m-1]. And we maintain a "pointer array" P[0], P[1], ... P[m-1]. P[i]=x means that B[i]*A[x] is in the heap currently. All the P[i] are zero initially.

In each iteration, we pop the max element from the heap, which is the next largest product. Assuming it comes from B[i]*A[P[i]] (we can record the elements in the heap come from which B[i]), we then move the corresponding pointer forward: P[i] += 1, and push the new B[i] * A[P[i]] into the heap. (If P[i] is moved to out-of-range (>=m), we simply push a -inf into the heap.)

After the n-th iteration, we get the n-th largest product.

There are n iterations, and each one is O(logm).

Edit: add some details

You don't need to sort the the 500 000 elements to get the top 3.

Just take the first 3, put them in a SortedList, and iterate over the list, replacing the smallest of the 3 elements with the new value, if that is higher, and resort the resulting list.

Do this for both lists, and you'll end with a 3*3 matrix, where it should be easy to take the 3rd value.

Here is an implementation in scala.

If we assume n is smaller than m, and A=[1, 3, 4] and B=[2, 2, 5], n=2:

You would take (3, 4) => sort them (4,3)
Then take (2,5) => sort them (5, 2)

You could now do an zipped search. Of course the biggest product now is (5, 4). But the next one is either (4*2) or (5*3). For longer lists, you could keep in mind what the result of 4*2 was, compare it only with the next product, taken the other way. That way you would only calculate one product too much.

链接地址: http://www.djcxy.com/p/59460.html

上一篇: telnet发送一个http请求

下一篇: 第一大数字矩阵中最大的产品，速度很快