Programming puzzle [Jul. 24th, 2009|10:40 am]
Arvind Narayanan

If you're a programmer, I suggest giving this puzzle a shot because you're likely to encounter this problem a few times in your programming career :-)

You're given N objects and a probability distribution over these objects, specified as an array [x_1, x_2 ... x_N] of arbitrary nonnegative real numbers. Write a function that generates independent samples from this distribution. An O(N) pre-processing step is allowed, but after that, each new sample should be generated in expected constant time.

Any takers? Too easy?

Edit. Assume that you can generate a uniform real number in the interval [0, 1] at unit cost.
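
To make the interface concrete, here is a naive Python baseline (names are mine) that draws each sample in O(N) by a linear scan; the puzzle is to achieve the same interface with expected O(1) per sample after O(N) preprocessing:

    import random

    def naive_sample(x):
        # O(N) per call: walk the weights until the running mass passes u.
        u = random.random() * sum(x)
        for i, w in enumerate(x):
            u -= w
            if u < 0:
                return i
        return len(x) - 1  # guard against floating-point drift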

Comments:
From: eightbit
2009-07-24 05:49 pm (UTC)
Can we sample log N uniform bits in constant time? Or just one bit?
From: arvindn
2009-07-24 06:11 pm (UTC)
oh, i was actually assuming that you can generate a uniform sample from the interval [0, 1] in constant time. i guess this is equivalent to log N uniform bits.

i'm sure it's provably impossible with 1 bit: even if all the weights are equal, you still need to choose from among N alternatives, which takes log N bits of randomness. thanks for bringing that up. i will edit the post.

the spirit of the puzzle is that computation time is the scarce resource, rather than random bits.

From: (Anonymous)
2009-07-25 07:42 am (UTC)

Clarification

Could you specify the sample more precisely? A sample of size N with unique objects (so order matters), or a sample of size N with duplicates allowed and the number of duplicates proportional to the probability distribution x_n?
From: arvindn
2009-07-25 07:51 am (UTC)

Re: Clarification

firstly the sample is not of size N. rather, N is the number of objects. the number of samples is unspecified; but each time the algorithm is called it must return a new, independent sample in expected constant time.

if that seems unclear, another way to phrase the problem is as follows:

write an algorithm to draw M independent samples from the prob. distribution [x_1, x_2 ... x_N] in time O(M + N).

secondly, as implied by the term independent samples, each sample is to be chosen independently of the others, which means there will (probably) be duplicates. i.e., it is "sampling with replacement" rather than "sampling without replacement."

hope that helps.

From: eightbit
2009-07-26 03:34 pm (UTC)

Re: Clarification

Still looking for a solution? When looking for sublinear-time algorithms for problems, it seems like the model of computation can easily cloud the thought process that the puzzle is meant to invoke. Requirements involving expected-constant time have a similar effect.

I expect the natural first thought invoked by the problem to be the following. We partition [0,1] into N subintervals, each with length proportional to x_i. Then, given a random sample from [0,1], we only need to look up which subinterval it falls into, and then output the index of that subinterval.

Given that we are ignoring model issues, and allowed to sample reals, I'll assume that we are only meant to do limited things with the reals - maybe only comparison. A naive approach then tries to look up the interval using comparison-based search on a binary tree. This takes log N time, however.

If we stick to the comparison idea, though, the next step should be to exploit the fact that we only need expected constant time. This suggests something like Huffman coding. But if we consider the uniform distribution on N elements, it is clear that this won't help.

So is the trick meant to extend from here? Or have I assumed something incorrect? I have seen several puzzles in the past that get to this point and then introduce an ideal hash function with O(1) lookup. I'm hoping there's more of a trick than that!
From: arvindn
2009-07-26 08:33 pm (UTC)

Re: Clarification

"When looking for sublinear-time algorithms for problems, it seems like the model of computation can easily cloud the thought process that the puzzle is meant to invoke."

yes, this is absolutely correct. there are hidden log factors everywhere.

i'm assuming that you can also do arithmetic operations on the reals at unit cost. i think this is the only sensible model -- with comparisons alone, you cannot even "evaluate" a real in constant time, so the preprocessing cannot run in time O(N).

i feel confident that there is no way to "exploit" the arbitrary precision of the reals to do more than a constant number of logical operations at once, because you need more than constant time to "set up" the reals to operate on. in any case, my solution does not "cheat" in this manner (or exploit the computation model in any way).

finally, while writing up my solution i realized i was very naive in thinking it might be easy. took more than 2 hours and i'm still not done. so i hope that it will be informative.

btw here is a copy of my current draft. i need to elaborate a few things before i can present it to a wider audience, but i hope things are clear to you. if you could quickly see if you can spot any holes, that would be much appreciated!

https://docs.google.com/Doc?id=dgwzqjjp_278g93mhvgj
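
For reference, one classical technique that meets exactly these bounds (O(N) preprocessing, O(1) per sample) is Walker's alias method. A minimal Python sketch of Vose's linear-time construction -- not necessarily the approach taken in the linked draft:

    import random

    def build_alias(x):
        # O(N) preprocessing: scale weights so the average is 1, then
        # pair each "light" bucket with a "heavy" donor bucket.
        n = len(x)
        total = sum(x)
        scaled = [w * n / total for w in x]
        prob, alias = [1.0] * n, list(range(n))
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s], alias[s] = scaled[s], l
            scaled[l] -= 1.0 - scaled[s]  # donor gives up the slack
            (small if scaled[l] < 1.0 else large).append(l)
        return prob, alias  # leftover buckets stay full (prob 1.0)

    def sample(prob, alias):
        # O(1) per call: pick a bucket uniformly at random, then flip a
        # biased coin to keep the bucket's own index or take its alias.
        u = random.random() * len(prob)
        i = int(u)
        return i if u - i < prob[i] else alias[i]

For example, after prob, alias = build_alias([3, 1, 0.5]), repeated calls to sample(prob, alias) return index 0 about two thirds of the time.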
From: ext_200976
2009-07-25 09:46 pm (UTC)

O(N)+O(M * N/2) for me

I don't think I know how to do this. The best I can do:

rand_element_generator(x_n):
* create a second array [c_1, c_2,...c_N] with the cumulative probability distribution: c_1 = x_1, c_i = c_(i-1) + x_i.
* if necessary, normalize the cumulative array by dividing each element by c_N.
* return a generator that on each call chooses a random number u on 0.0 .. 1.0, and returns n such that
       c_(n-1) <= u < c_n 
  (where c_0 = 0). 

For small N, the final step is best done as a linear search, O(N)+O(M * N/2). If N is significant you could do a binary (log N) search, making it O(N) + O(M * log N). In the linear search case you can pre-sort x_n from largest to smallest and re-label or map the output.
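
A runnable Python version of the generator described above (the binary-search variant; bisect does the interval lookup):

    import bisect
    import random
    from itertools import accumulate

    def rand_element_generator(x):
        # O(N) preprocessing: normalized cumulative distribution c_n.
        c = list(accumulate(x))
        total = c[-1]
        c = [v / total for v in c]
        while True:
            # O(log N) per sample: smallest n with u < c_n.
            yield bisect.bisect_right(c, random.random())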

I guess I could make a lookup table: make an array where each element A_i, i in 0..1000, is the value that rand_element_generator would return if (i/1000) were chosen -- the smallest n such that (i / 1000) < c_n. Now I can
* generate a random number u on 0.0 .. 1.0
* set i = int( 1000 * u )
* if A_i == A_(i+1), then return A_i. Otherwise linear search between A_i and A_(i+1), comparing u against the cumulative values c_n.


This last seems fiddly and inelegant.
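
For what it's worth, a Python sketch of the lookup-table idea (assuming 1000 buckets as above; the table endpoints bound the search, so the fiddly part reduces to a short local scan):

    import random
    from itertools import accumulate

    def make_table_sampler(x, K=1000):
        c = list(accumulate(x))
        c = [v / c[-1] for v in c]
        c[-1] = 1.0  # guard against rounding error
        # A[i] = smallest n with i/K < c_n; built in O(N + K).
        A, j = [], 0
        for i in range(K):
            while c[j] <= i / K:
                j += 1
            A.append(j)
        def sample():
            u = random.random()
            n = A[int(u * K)]   # jump into the right bucket
            while c[n] <= u:    # short scan within the bucket
                n += 1
            return n
        return sample

Since at most N + K interval boundaries are spread over K buckets, the expected scan length is O(1 + N/K), so taking K proportional to N gives expected constant time per sample.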

If I understand right, this is the underlying problem in the FASTA challenge on the 'programming language shootout':

http://shootout.alioth.debian.org/u32q/benchmark.php?test=fasta&lang=gcc&id=4

They seem to generally use linear search, perhaps because N is small.
From: ext_200978
2009-07-25 10:32 pm (UTC)

one possible solution

Maintain a family of lg N hash functions, e.g. [h_0, h_1, ..., h_(lg N - 1)], such that h_i(obj) --> [0, 1], assuming we can construct such a hash map using some property of the given set of N objects.

Now generate all subsets of these lg N hash functions (there are N of them) and assign one to each cell of our original probability distribution array, say prob_distrib[x_1, x_2, ..., x_N].

Now take each of the N objects one by one, apply the corresponding set of hash functions to it, and maintain a reverse map from hash functions to objects in a separate hash table, such that an object is assigned to every hash-function key for which h_i(obj) >= prob_distrib[x_i]. So basically we are mapping these N objects to a set of lg N hash functions. This step runs in O(N) time with a large hidden constant, under the assumption that not all lg N bits will be set for each cell of prob_distrib.

Since we can sample lg N bits in O(1) time, select the hash functions with the corresponding bits set, and return the objects mapped to those hash functions; this would be our independent sample on each call that generates these lg N bits.

My two cents,
From: (Anonymous)
2009-07-26 07:55 pm (UTC)

don't know CS, but from a probabilist's point of view

So, how about the inverse CDF method. It is known that if X is a random variable with a continuous CDF F, then F(X) is uniformly distributed. Thus sampling from the distribution of X is equivalent to computing F^{-1}(U), where U is a uniform [0, 1] random variable. Here is where I don't know whether my solution answers your question: since X is discrete, F^{-1} (suitably generalized) is easily computable. I don't know if this fits within the O(N) preprocessing allowed or needs more than O(N) steps. Anyway, I look forward to your answer.
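
(For a discrete distribution the relevant inverse is the generalized one, F^{-1}(u) = min { n : F(n) >= u }. Tabulating F is exactly the allowed O(N) preprocessing, and evaluating F^{-1}(u) by binary search then costs O(log N) per sample -- the same cumulative-array approach discussed in the comments above.)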