Exercise 26
Info

This is an in-class exercise. An exercise page like this one will contain a brief description but is intended to be supplemented by discussion during our meeting time. Complete the exercise to the best of your ability in the time given. Feel free to talk with other students as you work, and do not be afraid to ask questions if you get stuck. Aim to complete as much as possible during our meeting, and check your solution by running the code. You are encouraged to work at home to complete what you do not get through today, and ask questions over Piazza or in office hours.

Learning Objectives

Reinforces basic C++ concepts

  • C++ basics: references
  • STL: vectors and iterators
  • Recursion
Goal

Complete the implementation of code which transforms a probability density function into a cumulative distribution function. Then compare naive and fast implementations of finding the index of a given percentile.

Motivation

Suppose that you have a loaded N-sided die. You can represent the probability of rolling a particular number using an array of N floating point elements, so that the indices range from 0 to N-1. (We’re computer scientists, so we start counting at zero, and even our dice start at zero). As such, the values in the array will be in the range [0,1] (you can’t have negative probabilities and the probability of rolling any value cannot be greater than 100%) and the sum of the values in the array should be 1 (with probability 100%, you will get some value). This array of values is called a probability distribution function (PDF). Using this representation, the probability of rolling an i is PDF[i].

Your intent will be to gamble with this die – you will be placing bets that, on a given roll, the value on the die will be less than or equal to some number. To do this effectively you want to change the way the array stores the probability information so that instead of answering the question “what is the likelihood that I roll an i?” you can answer the question “what is the likelihood that I roll something with value less than or equal to i?”. To do this you want change your PDF to a cumulative distribution function (CDF) – an array whose entry at index i is the sum of the probabilities of rolling a value less than or equal to i.

Finally, as part of the gambling you will want to change the betting odds on the fly. For example, while you may have started with 1-to-1 odds, you may decide that you want to put a 3-to-1 bet that you will get a value less than i. To do this effectively (i.e. to win money) you need to be able to find the value i with the probability of rolling something less than or equal to i is at least one third. Using your precomputed CDF, you can do this by finding the last index i for which CDF[i] <= 1./4.

Implementation

In your code:

Part 1

Get started by running git pull to update your clone of the public repository, and then copying the exercises/ex26 directory into your personal git repository. Confirm that you can see the template files for today’s exercise by typing ls exercises/ex26 – you should see a file named distribution.cpp.

Part 2

The supplied file distribution.cpp will not compile out of the box since the function make_cumulative is not defined. You will need to declare and define this function. This will take a vector of values in the range [0,1], summing to one, and change them so that after the function call the value in the i-th entry is sum the of the values in entries {0,…,i} of the input vector. Note that:

  1. Since the original entries are all in the range [0,1], after the call to make_cumulative the entries should be strictly non-decreasing.
  2. Since the sum of the original entries was 1, after the call to make_cumulative the last entry should have value 1.

Part 3

You should now be able to compile distribution.cpp by typing:

g++ -std=c++11 -Wall -Wextra -pedantic distribution.cpp -o distribution

Next, confirm that you have correctly implemented the make_cumulative function by running the executable distribution using a histogram with 30 bins and 10000 random samples by typing:

echo 30 10000 0 | ./distribution

(If your CDF does not look reasonable, you will trigger an assertion.)

Part 4

Now, implement the body of the naive_find_last_iterator function. This function starts at the begin iterator and advances sequentially through the container until it finds the last iterator with the property that the value of the associated entry is less than or equal to v. Confirm that you have implemented the function correctly by running the executable distribution using a histogram with 30 bins, 100000 random samples, and 10 calls to the naive_find_last_iterator function by typing:

echo 30 10000 10 | ./distribution

(If your implementation works, you should see “Confirmed that the naive find seems reasonable”, though you will still trigger an assertion for your as-of-yet-unimplemented find_fast_last_iterator function.)

Part 5

Now, implement the body of the fast_find_last_iterator function. Like the naive_find_last_iterator function, this will return the last iterator with the property that the value of the associated entry is less than or equal to v. However, instead of advancing sequentially, you should find the iterator using recursion – splitting the range in two and only checking the half-range that could contain the iterator. Confirm that you have implemented the function correctly by running the executable distribution using a histogram with 30 bins, 100000 random samples, and 10 calls to the naive_find_last_iterator function by typing:

echo 30 10000 10 | ./distribution

(If your implementation works, you should see “Confirmed that the naive find seems reasonable” and “Confirmed that the fast find seems reasonable”.)

Hint: Keep in mind that the iterators for the std::vector< double > class are Random Access Iterators. This means that not only can you advance them by calling iter++, but you get the iterator n elements ahead by usning iter+n and you can get the number of elements between two iterators by calling iter2-iter.

Part 6

Stress-test your implementations by running on a histogram with 1000000 bins, 10000 random samples, and 100 calls to the naive_find_last_iterator and fast_find_last_iterator functions by typing:

echo 1000000 10000 100 | ./distribution

What do the running times tell you?

Reminder

Remember to add and commit to your local repo copy as your work. Push to your remote repo when finished. [No submission to Gradescope for this exercise.] Use exit to logout from your ugrad account when finished. If you continue to work on the program after class, make sure to keep your repo updated as well!