Duplicate-free random values from a very large range

2018-06-13 01:27:46

I would like a function that will produce k pseudo-random values from a set of n integers, zero to n-1, without repeating any previous result. k is less than or equal to n. O(n) memory is unacceptable because of the large size of n and the frequency with which I'll need to re-shuffle.

These are the methods I've considered so far:

Array : Normally if I wanted duplicate-free random values I'd shuffle an array, but that's O(n) memory. n is likely to be too large for that to work.

long nextvalue(void) {
  static long array[4000000000];  // O(n) memory: the problem with this approach
  static long s = 0;              // long, since the index can exceed INT_MAX
  if (s == 0) {
    for (long i = 0; i < 4000000000; i++) array[i] = i;
    shuffle(array, 4000000000);
  }
  return array[s++];
}

n-state PRNG : There are a variety of random number generators that can be designed so as to have a period of n and to visit n unique states over that period. The simplest example would be:

long nextvalue(void) {
  static long s = 0;
  static const long i = 1009; // assumed co-prime to n
  s = (s + i) % n;
  return s;
}

The problem with this is that it's not necessarily easy to design a good PRNG on the fly for a given n, and such a PRNG is unlikely to approximate a fair shuffle unless it has a lot of variable parameters (which makes it even harder to design). But maybe there's a good one I don't know about.
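The claim that the additive step visits all n states when the step is co-prime to n can be checked by brute force for a small n. This is only a minimal sketch: gcd_u64 and step_visits_all are names invented here, and the check itself uses O(n) memory, so it is a test aid rather than a solution.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Greatest common divisor (Euclid's algorithm). */
uint64_t gcd_u64(uint64_t a, uint64_t b) {
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Returns true if s = (s + step) % n visits every value in [0, n)
 * exactly once over n steps. Uses O(n) memory, so only for small n. */
bool step_visits_all(uint64_t n, uint64_t step) {
    bool *seen = calloc(n, sizeof *seen);
    if (seen == NULL) return false;
    uint64_t s = 0;
    bool ok = true;
    for (uint64_t i = 0; i < n && ok; i++) {
        s = (s + step) % n;
        if (seen[s]) ok = false;   /* a state was revisited too early */
        seen[s] = true;
    }
    free(seen);
    return ok;
}
```

With n = 1000, a step of 1009 passes the check (gcd_u64(1009, 1000) is 1), while a step of 10 fails it. The visiting order is still just an arithmetic progression, which is the weakness described above.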

m-bit hash : If the size of the set is a power of two, then it's possible to devise a perfect hash function f() which performs a 1:1 mapping from any value in the range to some other value in the range, where every input produces a unique output. Using this function I could simply maintain a static counter s, and implement a generator as:

long nextvalue(void) {
  static long s = 0;
  return f(s++);
}

This isn't ideal because the order of the results is determined by f(), rather than random values, so it's subject to all the same problems as above.

NPOT hash : In principle I can use the same design principles as above to define a version of f() which works in an arbitrary base, or even a composite, that is compatible with the range needed; but that's potentially difficult, and I'm likely to get it wrong. Instead a function can be defined for the next power of two greater than or equal to n, and used in this construction:

long nextvalue(void) {
  static long s = 0;
  long x = s++;
  do { x = f(x); } while (x >= n);  // re-apply f() until the result is in range
  return x;
}

But this still has the same problem (unlikely to give a good approximation of a fair shuffle).

Is there a better way to handle this situation? Or perhaps I just need a good function for f() that is highly parameterisable and easy to design to visit exactly n discrete states.

One thing I'm thinking of is a hash-like operation where I contrive to have the first j results perfectly random through carefully designed mapping, and then any results between j and k would simply extrapolate on that pattern (albeit in a predictable way). The value j could then be chosen to find a compromise between a fair shuffle and a tolerable memory footprint.

First of all, it seems unreasonable to discount anything that uses O(n) memory and then discuss a solution that refers to an underlying array. You have an array. Shuffle it. If that doesn't work or isn't fast enough, come back to us with a question about it.

You only need to perform a complete shuffle once. After that, draw from index n, swap that element with an element located randomly before it and increase n, modulo element count. For example, with such a large dataset I'd use something like this.
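One way to read the draw-and-swap scheme described above is sketched here. The struct deck type, the deck_draw name and the use of rand() are all assumptions made for illustration. The sketch also picks the swap partner from positions at or before the cursor, so that the very first draw has a valid partner.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical state for the draw-and-swap scheme; names are illustrative. */
struct deck {
    long  *items;   /* assumed to have been fully shuffled once already */
    size_t count;   /* number of elements */
    size_t cursor;  /* index of the next element to draw */
};

/* Draw the element at the cursor, swap it with a randomly chosen element
 * at or before the cursor, then advance the cursor (wrapping around). */
long deck_draw(struct deck *d) {
    size_t c = d->cursor;
    long value = d->items[c];
    /* rand() is a stand-in: its range is too small for huge arrays,
     * and the modulo bias is ignored here. */
    size_t j = (size_t)rand() % (c + 1);
    d->items[c] = d->items[j];
    d->items[j] = value;
    d->cursor = (c + 1) % d->count;
    return value;
}
```

Each swap only touches positions the cursor has already passed, so one full pass still returns every element exactly once, while the order of the next pass changes. It remains an O(n)-memory approach.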

Prime numbers are an option for hashes, but probably not in the way you think. Using two Mersenne primes (low and high, perhaps 0x1fff and 0x7fffffff) you should be able to come up with a much more general-purpose hashing algorithm.

size_t hash(unsigned char *value, size_t value_size, size_t low, size_t high) {
    size_t x = 0;
    while (value_size--) {
        x += *value++;
        x *= low;
    }
    return x % high;
}
#define hash(value, value_size, low, high) (hash((void *) value, value_size, low, high))

This should produce something fairly well distributed for all inputs longer than about two octets, with the minor troublesome exception of zero-byte prefixes. You might want to treat those differently.

So... what I've ended up doing is digging deeper into pre-existing methods to try to confirm their ability to approximate a fair shuffle.

I take a simple counter, which itself is guaranteed to visit every in-range value exactly once, and then 'encrypt' it with an n-bit block cypher. Rather, I round the range up to a power of two, and apply some 1:1 function; then if the result is out of range I repeat the permutation until the result is in range.

This can be guaranteed to complete eventually because there are only a finite number of out-of-range values within the power-of-two range, and they cannot enter into an always-out-of-range cycle because that would imply that something in the cycle was mapped from two different previous states (one from the in-range set, and another from the out-of-range set), which would make the function not bijective.
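The termination argument can be checked exhaustively on a tiny example. This sketch swaps in a toy 4-bit bijection, toy_perm, in place of a real block cypher. That function and the names walk and walk_is_permutation are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BITS 4u
#define DEMO_MASK ((1u << DEMO_BITS) - 1u)

/* A toy bijection on 4-bit values (an assumption for illustration):
 * multiplying by an odd constant and adding a constant are both
 * invertible modulo a power of two. */
uint32_t toy_perm(uint32_t x) {
    return (x * 5u + 3u) & DEMO_MASK;
}

/* Cycle-walk: apply toy_perm until the result lands inside [0, range).
 * The starting value must itself be in range. */
uint32_t walk(uint32_t x, uint32_t range) {
    assert(range >= 1 && range <= DEMO_MASK + 1u && x < range);
    do {
        x = toy_perm(x);
    } while (x >= range);
    return x;
}

/* Returns 1 if walk() maps [0, range) onto itself with no repeats. */
int walk_is_permutation(uint32_t range) {
    uint32_t seen = 0;                 /* one bit per output value */
    for (uint32_t s = 0; s < range; s++) {
        uint32_t bit = 1u << walk(s, range);
        if (seen & bit) return 0;      /* two inputs gave the same output */
        seen |= bit;
    }
    return 1;
}
```

For example, with a range of 11 the input 2 first maps to 13, which is out of range, and is walked on to 4. The guarantee relies on the starting value being in range. An out-of-range start would duplicate some other input's result, or could loop forever if its cycle never enters the range.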

So all I need to do is devise a parameterisable function which I can tune to an arbitrary number of bits. Like this one:

uint64_t mix(uint64_t x, uint64_t k) {
  const int s0 = BITS * 4 / 5;
  const int s1 = BITS / 5 + (k & 1);
  const int s2 = BITS * 2 / 5;
  k |= 1;

  x *= k;
  x ^= (x & BITMASK) >> s0;
  x ^= (x << s1) & BITMASK;
  x ^= (x & BITMASK) >> s2;
  x += 0x9e3779b97f4a7c15;

  return x & BITMASK;
}

I know it's bijective because I happen to have its inverse function handy:

uint64_t unmix(uint64_t x, uint64_t k) {
  const int s0 = BITS * 4 / 5;
  const int s1 = BITS / 5 + (k & 1);
  const int s2 = BITS * 2 / 5;
  k |= 1;
  uint64_t kp = k * k;
  while ((kp & BITMASK) > 1) {
    k *= kp;
    kp *= kp;
  }

  x -= 0x9e3779b97f4a7c15;
  x ^= ((x & BITMASK) >> s2) ^ ((x & BITMASK) >> s2 * 2);
  x ^= (x << s1) ^ (x << s1 * 2) ^ (x << s1 * 3) ^ (x << s1 * 4) ^ (x << s1 * 5);
  x ^= (x & BITMASK) >> s0;
  x *= k;

  return x & BITMASK;
}
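A quick way to gain confidence in the pair is an exhaustive round-trip check at a small width. The sketch below repeats mix() and unmix() from above so that it compiles on its own. BITS = 16, the BITMASK definition and the roundtrip_ok helper are assumptions added here, since the post does not show how BITS and BITMASK are defined.

```c
#include <stdint.h>

#define BITS 16                              /* assumed width for this check */
#define BITMASK ((UINT64_C(1) << BITS) - 1)  /* low BITS bits set */

uint64_t mix(uint64_t x, uint64_t k) {
    const int s0 = BITS * 4 / 5;
    const int s1 = BITS / 5 + (k & 1);
    const int s2 = BITS * 2 / 5;
    k |= 1;                                  /* odd multipliers are invertible */
    x *= k;
    x ^= (x & BITMASK) >> s0;
    x ^= (x << s1) & BITMASK;
    x ^= (x & BITMASK) >> s2;
    x += 0x9e3779b97f4a7c15;
    return x & BITMASK;
}

uint64_t unmix(uint64_t x, uint64_t k) {
    const int s0 = BITS * 4 / 5;
    const int s1 = BITS / 5 + (k & 1);
    const int s2 = BITS * 2 / 5;
    k |= 1;
    uint64_t kp = k * k;
    while ((kp & BITMASK) > 1) {             /* builds the inverse of k mod 2^BITS */
        k *= kp;
        kp *= kp;
    }
    x -= 0x9e3779b97f4a7c15;
    x ^= ((x & BITMASK) >> s2) ^ ((x & BITMASK) >> s2 * 2);
    x ^= (x << s1) ^ (x << s1 * 2) ^ (x << s1 * 3) ^ (x << s1 * 4) ^ (x << s1 * 5);
    x ^= (x & BITMASK) >> s0;
    x *= k;
    return x & BITMASK;
}

/* Returns 1 if unmix(mix(x, k), k) == x for every BITS-bit value x. */
int roundtrip_ok(uint64_t k) {
    for (uint64_t x = 0; x <= BITMASK; x++) {
        if (unmix(mix(x, k), k) != x) return 0;
    }
    return 1;
}
```

Calling roundtrip_ok() with both an even and an odd key exercises the two possible values of s1. The fixed number of shift terms in unmix() is tied to the width. At BITS == 64 with an odd key, s1 * 5 is 65, which is an out-of-range shift, so other widths need their own check.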

This allows me to define a simple parameterisable PRNG like this:

uint64_t key[ROUNDS];
uint64_t seed = 0;
uint64_t rand_no_rep(void) {
  uint64_t x = seed % RANGE;  // the counter must stay in range, or results can repeat
  seed = (x + 1) % RANGE;
  do {
    for (int i = 0; i < ROUNDS; i++) x = mix(x, key[i]);
  } while (x >= RANGE);
  return x;
}

Initialise seed and key to random values and you're good to go.

Using the inverse function lets me determine what seed must be to force rand_no_rep() to produce a given output, making it much easier to test.

So far I've checked the cases where a constant a is followed by a constant b. For ROUNDS==1, pairs collide on exactly 50% of the keys (and each pair of collisions is with a different pair of a and b; they don't all converge on 0, 1 or whatever). That is, for various k, a specific a-followed-by-b case occurs for more than one k (this must happen at least once). Subsequent values do not collide in that case, so different keys aren't falling into the same cycle at different positions. Every k gives a unique cycle.

50% collisions comes from 25% being not unique when they're added to the list (count itself, and count the guy it ran into). That might sound bad but it's actually lower than birthday paradox logic would suggest. Selecting randomly, the percentage of new entries that fail to be unique looks to converge between 36% and 37%. Being "better than random" is obviously worse than random, as far as randomness goes, but that's why they're called pseudo-random numbers.
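The 36% to 37% figure matches a standard occupancy estimate. The sketch below assumes that "selecting randomly" means N values drawn independently and uniformly from N possibilities. The symbol N and that reading are assumptions, not something stated in the tests above.

```latex
% Expected number of distinct values after N uniform draws from N possibilities:
\mathbb{E}[\text{distinct}] = N\left(1 - \left(1 - \tfrac{1}{N}\right)^{N}\right)
  \;\longrightarrow\; N\left(1 - e^{-1}\right) \quad (N \to \infty)

% Fraction of draws that repeat an earlier value:
1 - \frac{\mathbb{E}[\text{distinct}]}{N} = \left(1 - \tfrac{1}{N}\right)^{N}
  \;\longrightarrow\; e^{-1} \approx 0.3679
```

Under that assumption, roughly 36.8% of new entries would fail to be unique, against which the observed 25% can be compared.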

Extending that to ROUNDS==2, we want to make sure that a second round doesn't cancel out or simply repeat the effects of the first.

This is important because it would mean that multiple rounds are a waste of time and memory, and that the function cannot be parameterised to any substantial degree. It could happen trivially if mix() contained only linear operations (say, multiply and add, mod RANGE). In that case all of the parameters could be multiplied/added together to produce a single parameter for a single round that would have the same effect. That would be disappointing, as it would reduce the number of attainable permutations to the size of just that one parameter, and if the set is as small as that then more work would be needed to ensure that it's a good, representative set.
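The collapse of purely linear rounds can be shown directly. In this sketch, affine and fold_rounds are hypothetical names, and the modulus m is assumed to be at most 2^31 so that the intermediate products fit in 64 bits.

```c
#include <stdint.h>

/* One linear (affine) round: x -> (a*x + b) mod m.
 * Assumes m <= 2^31 so that the product cannot overflow 64 bits. */
uint64_t affine(uint64_t x, uint64_t a, uint64_t b, uint64_t m) {
    return ((a % m) * (x % m) + b % m) % m;
}

/* Fold round (a1, b1) followed by round (a2, b2) into one round (*a, *b):
 * a2*(a1*x + b1) + b2 = (a1*a2)*x + (a2*b1 + b2). */
void fold_rounds(uint64_t a1, uint64_t b1, uint64_t a2, uint64_t b2,
                 uint64_t m, uint64_t *a, uint64_t *b) {
    *a = (a1 % m) * (a2 % m) % m;
    *b = ((a2 % m) * (b1 % m) + b2 % m) % m;
}
```

For instance, the rounds (7, 3) and (11, 5) mod 1000 fold into the single round (77, 38), so a second linear round adds no new permutations.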

So what we want to see from two rounds is a large set of outcomes that could never be achieved by one round. One way to demonstrate this is to look for the original b-follows-a cases with an additional parameter c, where we want to see every possible c following a and b.

We know from the one-round testing that in 50% of cases there is only one c that can follow a and b because there is only one k that places b immediately after a. We also know that 25% of the pairs of a and b were unreachable (being the gap left behind by half the pairs that went into collisions rather than new unique values), and the last 25% appear for two different k.

The result that I get is that, given a free choice of both keys, it's possible to find about five eighths of the values of c following a given a and b. About a quarter of the a/b pairs are unreachable (it's less predictable now, because of the potential intermediate mappings into or out of the duplicate or unreachable cases) and a quarter have a, b, and c appear together in two sequences (which diverge afterwards).

I think there's a lot to be inferred from the difference between one round and two, but I could be wrong about that and I need to double-check. Further testing gets harder; or at least slower unless I think more carefully about how I'm going to do it.

I haven't yet demonstrated that the permutations it can produce are all equally likely, but this is normally not guaranteed for any other PRNG either.

It's fairly slow for a PRNG, but it would fit SIMD trivially.
