In this assignment, you will work with the same inverted pendulum benchmark
that you controlled with PID controllers in the previous assignment. The
main purpose is to stabilize the cart pole through a reinforcement learning
approach. More specifically, we recast the inverted pendulum problem as
a grid world in which we look for state values and Q values to control the
system. To obtain this grid world, it is necessary to discretize
the state and action spaces. With this approach, each cell in the
resulting grid world represents a state-action pair, and its value is
the corresponding Q value of being in that state and taking that action. To
stabilize the pole angle, we take (theta, theta-dot) as the two states required
by the controller. Furthermore, considering a discretized action space, we
assume that we have three options at each state: go to the left (action = 0),
to the right (action = 2), or do nothing (action = 1).
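
The discretization described above can be sketched as follows. This is only a minimal illustration, not the base code: the clipping ranges THETA_RANGE and THETA_DOT_RANGE and the helper names discretize and state_index are assumptions made for this example, and the bin count of 40 is the one the base code uses.

```python
import numpy as np

N_BINS = 40  # bins per state dimension (the value used in the base code)

# Assumed clipping ranges, chosen only for illustration.
THETA_RANGE = (-0.21, 0.21)    # pole angle, rad
THETA_DOT_RANGE = (-2.0, 2.0)  # pole angular velocity, rad/s


def discretize(value, low, high, n_bins=N_BINS):
    """Map a continuous value to a bin index in [0, n_bins - 1]."""
    clipped = np.clip(value, low, high)
    # Scale to [0, 1], then to a bin; the upper edge falls into the last bin.
    frac = (clipped - low) / (high - low)
    return int(min(frac * n_bins, n_bins - 1))


def state_index(theta, theta_dot):
    """Return the (row, column) grid cell for a continuous state."""
    return (
        discretize(theta, *THETA_RANGE),
        discretize(theta_dot, *THETA_DOT_RANGE),
    )


# One Q value per (theta bin, theta-dot bin, action), all initialized to 0.
Q = np.zeros((N_BINS, N_BINS, 3))
```

With these assumed ranges, an upright pole at rest, state_index(0.0, 0.0), lands in the middle of the grid, and values outside the ranges are clipped into the edge bins.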
2 States are accessible
In the base code, each of the two states is discretized into 40 steps. This
results in a grid world of shape (40, 40, 3), where the first two dimensions
index the states and the 3 corresponds to the number of possible actions at
each state. Based on this:
(A) Initialize all Q values with 0, and include a random action which is
applied whenever np.random.rand() > 0.8. Then, stabilize the pole
angle using the argmax of the Q values at each state. Note that you can
apply the random action only at the beginning, for a finite time; otherwise,
it may act as a disturbance. (30%)
(B) Change 0.8 to 0.9, and discuss the difference and the role of the random
action.
(C) As you may have seen, the cart runs off the screen most of the time.
What is the reason? Suggest a solution. (No need to code
for this part, just explain your idea.) (15%)
(D) Following step 1, find the state values. Report the final 40 × 40
state value matrix. (15%)
(E) Investigate the role of the learning rate and the discount factor by
changing them and running simulations to compare the results. (20%)
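
As a rough illustration of the pieces the parts above refer to, the sketch below shows one possible way to write the action selection, a tabular Q-learning update, and the extraction of state values. It is not the base code: the function names, the explore flag, and the default alpha and gamma are assumptions for this example, and taking the state value as the maximum Q value over actions assumes a greedy policy.

```python
import numpy as np


def choose_action(Q, state, explore, threshold=0.8):
    """Pick the argmax action, or a random one when the draw exceeds threshold."""
    # While `explore` is True, a uniform draw above `threshold` triggers a
    # random action; set `explore` to False to stop exploring after some time.
    if explore and np.random.rand() > threshold:
        return int(np.random.randint(Q.shape[-1]))
    return int(np.argmax(Q[state]))


def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place.

    alpha (learning rate) and gamma (discount factor) are assumed defaults.
    """
    td_target = reward + gamma * np.max(Q[next_state])
    cell = state + (action,)  # index (theta bin, theta-dot bin, action)
    Q[cell] += alpha * (td_target - Q[cell])


def state_values(Q):
    """State values under a greedy policy: V(s) = max over actions of Q(s, a)."""
    return Q.max(axis=-1)
```

Here state and next_state are tuples of two bin indices, so for a Q array of shape (40, 40, 3) the result of state_values(Q) has shape (40, 40).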
3 Things to submit
In summary, for part A, report the final simulation in which the pole angle is
stabilized (a theta vs. time plot, please).
For part B, try to stabilize the pole angle (this may not succeed); then,
report the best theta vs. time plot from the most recent outcomes and explain
the role of the random action.
In part C, just explain your suggested solution.
For part D, report the final state values.
For part E, perform different simulations with various learning rates to
understand the role of this parameter in the convergence and stability of the
results; then, do the same for the discount factor. Explain what you
observed; there is no need to report the exact Q or state values.
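
One way to organize the simulations for part E is a small sweep helper such as the sketch below. It is only a suggestion: sweep and train_fn are hypothetical names, and train_fn stands for your own training routine, assumed here to accept alpha and gamma as keyword arguments and to return whatever you want to compare (for example, a theta history).

```python
from itertools import product


def sweep(train_fn, alphas, gammas):
    """Run train_fn once per (alpha, gamma) pair and collect what it returns."""
    results = {}
    for alpha, gamma in product(alphas, gammas):
        # train_fn is a placeholder for your own training loop.
        results[(alpha, gamma)] = train_fn(alpha=alpha, gamma=gamma)
    return results
```

The returned dictionary is keyed by (alpha, gamma), which makes it easy to plot or compare the runs side by side.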
You will need to submit both your code and a report (in PDF format, please)
zipped together. Please be concise and report interesting findings, such as the
challenges you faced.