Reservoir Sampling – Java

I’m currently preparing for the first semester exams and while browsing over the Internet i found some one posted his interview with google and there was a question about Reservoir Sampling, i got interested and started googling about how this algorithm works and i ended up writing this post, i explained this algorithm as simple as i can, enjoy the science ūüėČ

what is Reservoir Sampling ?

Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list S containing n items, where n is either a very large or unknown number.  (From wikipedia)

Suppose you have unknown number of queries (google search queries last month for ex.) and you want to select random sample  from this unknown number of queries

what is the problem with unknown size of queries?

you can’t use random function directly here, there is no limit here , you may refer to a wrong index !,¬† fine why can’t we loop through the queries and get the numer of queries ? , simply this will increase complexity of the algorithm with another O(n).

so, how to choose random sample from this unknown number of queries using only one pass O(n) , here comes the Reservoir Sampling Algorithm:

for simplicity i will use a text file inclues a long article like this one Open Source i copied the article and saved it in a text file name “data.txt”, suppose we don’t know how many lines in this article .

here is the complete function, i will dissect it line by line later :

public void reservoirSampling () throws FileNotFoundException, IOException
 {
 File f = new File("data.txt");
 BufferedReader br = new BufferedReader(new FileReader(f));   // creating buffered reader object to read the file contains our data
 String currentLine;
 int reservoirSize=10;
 List <String> reservoirList= new ArrayList<String>(reservoirSize);   //reservoirList is where our selected lines stored
 int count=0;                           // we will use this counter to count the current line numebr while iterating
 Random ra = new Random();

int randomNumber;
 while ((currentLine = br.readLine()) != null)                             // here we will iterate through the file till it ends
 {
 count ++;                                         // increase the line number
 if (count<=10)
 {
 reservoirList.add(currentLine);
 }
 else
 {
randomNumber = (int)ra.nextInt(count);
 if (randomNumber<reservoirSize)
 {
 reservoirList.set(randomNumber, currentLine);
 }
 }
 }
 }

1) loop through the file Iteratively (remember we don’t know the number of lines )

while ((currentLine = br.readLine()) != null) 

2) fill the reservoir (reservoir is where you will put your selected random sample and it have a fixed size suppose 10 random lines and we initiate it with the first 10 lines of  the file).

if (count<=10)
 {
 reservoirList.add(currentLine);
 }

3) if the line is not in the first 10 lines then we create a random number between 0 and current line number and that is what nextInt function does.

else
 {
 int rand = (int)ra.nextInt(count);
}

4) the last step: if the random number is less than 10 (the number of lines we will choose or the reservoirList size) then we will replace currentLine with this
random number index in the reservoirList

if (randomNumber<reservoirSize)
 {
 reservoirList.set(randomNumber, currentLine);
 }

i hope this helped, if you are interested in statistics side of this algorithm  and the proof of fairness you can read here and here .

Advertisements
    • Nikhil
    • October 25th, 2012

    will this code ever terminate ?

    • Nikhil
    • October 25th, 2012

    ok, got it

    • vicky
    • September 25th, 2015

    this code not make any changes

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s