## Reservoir Sampling – Java

I’m currently preparing for the first semester exams and while browsing over the Internet i found some one posted his interview with google and there was a question about Reservoir Sampling, i got interested and started googling about how this algorithm works and i ended up writing this post, i explained this algorithm as simple as i can, enjoy the science 😉

what is Reservoir Sampling ?

Reservoir samplingis a family of randomized algorithms for randomly choosingksamples from a listScontainingnitems, wherenis either a very large or unknown number. (From wikipedia)

Suppose you have unknown number of queries (google search queries last month for ex.) and you want to select random sample from this unknown number of queries

**what is the problem with unknown size of queries?**

you can’t use random function directly here, there is no limit here , you may refer to a wrong index !, fine why can’t we loop through the queries and get the numer of queries ? , simply this will increase complexity of the algorithm with another O(n).

**so, how to choose random sample from this unknown number of queries using only one pass O(n) , here comes the Reservoir Sampling Algorithm:**

for simplicity i will use a text file inclues a long article like this one Open Source i copied the article and saved it in a text file name “data.txt”, suppose we don’t know how many lines in this article .

here is the complete function, i will dissect it line by line later :

public void reservoirSampling () throws FileNotFoundException, IOException { File f = new File("data.txt"); BufferedReader br = new BufferedReader(new FileReader(f)); // creating buffered reader object to read the file contains our data String currentLine; int reservoirSize=10; List <String> reservoirList= new ArrayList<String>(reservoirSize); //reservoirList is where our selected lines stored int count=0; // we will use this counter to count the current line numebr while iterating Random ra = new Random(); int randomNumber; while ((currentLine = br.readLine()) != null) // here we will iterate through the file till it ends { count ++; // increase the line number if (count<=10) { reservoirList.add(currentLine); } else { randomNumber = (int)ra.nextInt(count); if (randomNumber<reservoirSize) { reservoirList.set(randomNumber, currentLine); } } } }

**1) loop through the file Iteratively (remember we don’t know the number of lines ) **

while ((currentLine = br.readLine()) != null)

### 2) fill the reservoir (reservoir is where you will put your selected random sample and it have a fixed size suppose 10 random lines and we initiate it with the first 10 lines of the file).

if (count<=10) { reservoirList.add(currentLine); }

### 3) if the line is not in the first 10 lines then we create a random number between 0 and current line number and that is what nextInt function does.

else { int rand = (int)ra.nextInt(count); }

### 4) the last step: if the random number is less than 10 (the number of lines we will choose or the reservoirList size) then we will replace currentLine with this

random number index in the reservoirList

if (randomNumber<reservoirSize) { reservoirList.set(randomNumber, currentLine); }

i hope this helped, if you are interested in statistics side of this algorithm and the proof of fairness you can read here and here .

will this code ever terminate ?

ok, got it

this code not make any changes