COMP-597: Reinforcement Learning – Assignment 1 solved




Bandit algorithms [100 points] For this assignment, you will carry out some experimentation with bandit algorithms, in order to help you understand what we discussed in class, and to get used to the way in which we will run experiments for other assignments as well. You should submit a notebook with your code, results and explanations. 1. [5 points] Write a small simulator for a Bernoulli bandit with k arms. The probability of success pi for each arm i ∈ {1, …k} should be provided as an input. The bandit should have a function called ”sample” which takes as input the index of an action and provides a reward sample. Recall that a Bernoulli bandit outputs either 1 or 0, drawn from a binomial distribution of parameter pk. Test your code with 3 arms of parameters q∗ = [0.5, , 0.5 − δ, 0.5 − 2δ, with δ = 0.1. Generate and save a set of 50 samples for each action. For the test, plot one graph for each action, containing the reward values obtained over the 100 draws, the empirical mean of the values, and the true q∗ for each arm. Each graph will have an x-axis that goes to 50, two horizontal lines (true value and estimated value) and a set of points of value 0 and 1. 2. [5 points] Code the rule for estimating action values discussed in lecture 2, with a fixed learning rate α, in a function called update, and using the incremental computation of the mean presented in lecture 2, in a function called updateAvg. Using the previous data, plot for each action a graph showing the estimated q value as a function of the number of samples, using averaging as well as α = 0.01 and α = 0.1, and the true value. Each graph should have two curves and a horizontal line. 3. [10 points] Repeat the above experiment 100 times, starting with action value estimates of 0. Each run will still contain 100 samples for each action. Plot the same graph as above, but where the curves have the average and standard error over the 100 runs. Explain in 1-2 sentences what you observe. Which of the α values is better? How do they compare to averaging? If you wanted to optimize further, in what range of α would you look for better values? 4. [20 points] Code the -greedy algorithm discussed in class, with averaging updates, with  provided as an input. You will run 100 independent runs, each consisting of 1000 time steps. Plot the following graphs: (a) The reward received over time, averaged at each time step over the 100 independent runs (with no smoothing over the time steps), and the standard error over the 100 runs (b) The fraction of runs (out of 100) in which the first action (which truly is best) is also estimated best based on the action values 1 (c) The instantaneous regret lt (as discussed in lecture 3) (averaged over the 100 runs) (d) The total regret Lt up to time step t (as discussed in lecture 3) (averaged over the 100 runs) Generate this set of graphs, for the following values of : 0, 1/8, 1/4, 1/2, 1. Explain what you observe in the graphs and discuss the effect of  you observe. 5. [5 points] For  = 1/4 and  = 1/8, plot the same graphs for α = 0.1, α = 0.01, α = 0.001 and averaging. Explain in 2 sentences what you observe. 6. [20 points] Write a function that implements the UCB algorithm discussed in lecture 2. Set c = 2. Plot the same graphs as above for α = 0.1, α = 0.01, α = 0.001 and averaging. Explain briefly the behavior you observe. 7. [20 points] Write a function that implements the Thompson sampling to be discussed in lecture 4. Plot the same graphs as above for α = 0.1, α = 0.01, α = 0.001 and averaging. Explain briefly the behavior you observe. 8. [5 points] For each of the algorithms, pick the best hyper-parameter combination you have observed (explain how you decided what ”best” means). Plot together the curves for this setting. Comment on the relative behavior of the different algorithms. 9. [10 points] Let us now consider a non-stationary problem. Let δ = 0.1 and imagine that after 500 time steps, the parameter of actions 2 and 3 become 0.5 + δ and 0.5 + 2δ respectively. Run for each of the three algorithms a fixed value of α = 0.1 and the averaging value estimation. For  use values 1/4 and 1/8. Using these values, plot only the reward graph as above (you should have 2 lines for -greedy, one for UCB and one for Thompson sampling, for each learning rate setting). Explain what you see in the graph. Based on these results, which algorithm is best suited to cope with non-stationarity? 2