Posts Tagged ‘ problem ’

Reservoir Sampling – Java

I’m currently preparing for the first semester exams and while browsing over the Internet i found some one posted his interview with google and there was a question about Reservoir Sampling, i got interested and started googling about how this algorithm works and i ended up writing this post, i explained this algorithm as simple as i can, enjoy the science 😉

what is Reservoir Sampling ?

Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list S containing n items, where n is either a very large or unknown number.  (From wikipedia)

Suppose you have unknown number of queries (google search queries last month for ex.) and you want to select random sample  from this unknown number of queries

what is the problem with unknown size of queries?

you can’t use random function directly here, there is no limit here , you may refer to a wrong index !,  fine why can’t we loop through the queries and get the numer of queries ? , simply this will increase complexity of the algorithm with another O(n).

so, how to choose random sample from this unknown number of queries using only one pass O(n) , here comes the Reservoir Sampling Algorithm:

for simplicity i will use a text file inclues a long article like this one Open Source i copied the article and saved it in a text file name “data.txt”, suppose we don’t know how many lines in this article . Continue reading