Reinforcement Learning with OpenAI Gym
Monday, 4 January 2021
After completing the RL Specialization on Coursera I put some of the techniques into practice. It's one thing to learn techniques in a course, but you learn more when you apply them yourself. OpenAI Gym is an open-source Python library with RL environments you can build your own solutions for.
In this post I will focus on tabular RL, and how to write such agents in Python. For a number of problems RL without function approximation works quite well, and the code is quick to run and debug. Always a plus. In later posts I will dive deeper into coarse coding, and neural networks.
So let's write a simple SARSA agent. It will need to choose actions on all steps of the simulation, and learn from the feedback it gets using the SARSA update rule. I added support for optimistic initialization (init_val) for extra exploration early on, and the parameters for discount, learning rate and epsilon can be modified. Note that epsilon here is the probability of acting greedily, so epsilon=0.9 means exploring 10% of the time.
```python
import gym
import numpy as np
import matplotlib.pyplot as plt


class sarsa_agent():
    def __init__(self, num_actions, num_states, init_val=0,
                 learn_rate=0.1, discount=0.9, epsilon=0.9):
        self.num_actions = num_actions
        self.num_states = num_states
        self.learn_rate = learn_rate
        self.discount = discount
        self.epsilon = epsilon
        self.Q = np.zeros((num_states, num_actions)) + init_val
        self.last_action = None
        self.last_state = None

    def policy(self, state):
        if np.random.rand() < self.epsilon:
            # np.argmax could be used, but it picks the lowest-indexed action
            # in case of ties, so break ties randomly instead
            return np.random.choice(
                np.flatnonzero(self.Q[state, :] == self.Q[state, :].max()))
        else:
            return np.random.randint(self.num_actions)

    def first_step(self, state):
        # First action of an episode: no reward to learn from yet
        action = self.policy(state)
        self.last_action = action
        self.last_state = state
        return action

    def step(self, state, reward):
        action = self.policy(state)
        self.Q[self.last_state, self.last_action] += self.learn_rate * (
            reward + self.discount * self.Q[state, action]
            - self.Q[self.last_state, self.last_action])
        self.last_action = action
        self.last_state = state
        return action

    def last_step(self, reward):
        # Terminal update: the value of the terminal state is zero
        self.Q[self.last_state, self.last_action] += self.learn_rate * (
            reward - self.Q[self.last_state, self.last_action])
```
Now let's run this agent against an OpenAI Gym environment. I chose Taxi-v3, since I didn't know that one yet, and it has a manageable number of discrete states. This means a normal tabular SARSA/Q-learning algorithm will probably be able to solve the problem.
```python
runs, eps, max_t = 10, 3000, 200
env = gym.make('Taxi-v3')
r_mat = np.zeros((runs, eps, max_t))
for i_run in range(runs):
    print(i_run)
    agent = sarsa_agent(env.action_space.n, env.observation_space.n,
                        learn_rate=0.2, init_val=0)
    for i_episode in range(eps):
        state = env.reset()
        action = agent.first_step(state)
        state, reward, done, info = env.step(action)
        r_mat[i_run, i_episode, 0] = reward
        for t in range(1, max_t):
            action = agent.step(state, reward)
            state, reward, done, info = env.step(action)
            r_mat[i_run, i_episode, t] = reward
            if done:
                agent.last_step(reward)
                break
env.close()

plt.figure(figsize=(16, 9))
plt.plot(np.average(np.sum(r_mat, axis=2), axis=0), '.', alpha=0.2)
plt.title('Performance over 10 runs for SARSA agent in Taxi-v3 environment')
plt.ylabel('Average reward')
plt.xlabel('Episode #')
plt.show()
plt.close()
```
You can see that the agent learns quickly. In the first few episodes the agent collects less and less penalty, and in less than 1000 episodes of the game the agent hardly makes any mistakes anymore.
Going forward there are a few parameters we could tweak. The discount factor and epsilon are probably not worth the effort, since the default values work pretty well for a wide range of problems. In other environments optimistic initialization (high initial values for each state-action pair to encourage early exploration) would be something to look into, but this environment only gives negative or zero rewards, so the init_val of 0 is already quite optimistic. More useful would be to tweak the learning rate.
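To build some intuition for what the learning rate does, here is a toy sketch of a single SARSA update applied with different learning rates. This is a standalone illustration with made-up values, not part of the Taxi-v3 run above:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, learn_rate, discount=0.9):
    # One SARSA update on a copy of the Q-table, so we can compare rates
    td_error = r + discount * Q[s_next, a_next] - Q[s, a]
    Q = Q.copy()
    Q[s, a] += learn_rate * td_error
    return Q

Q = np.zeros((2, 2))
for lr in (0.1, 0.2, 0.5):
    updated = sarsa_update(Q, s=0, a=1, r=-1.0, s_next=1, a_next=0,
                           learn_rate=lr)
    print(lr, updated[0, 1])
```

A larger learning rate takes a bigger step toward the TD target, so the agent picks up on rewards faster, but the Q-values also stay noisier since every single transition moves them around more.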
Another way to go is to make the agent more complex. Think of changing from epsilon-greedy to another epsilon-soft policy. Or implementing Expected SARSA, where the policy's action probabilities in the next state are used in the learning update, instead of the one sampled next action.
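As a sketch of what that Expected SARSA update could look like (a hypothetical helper, not code from the agent above): the bootstrap target averages Q[s', a'] over the epsilon-greedy action probabilities, using the same epsilon convention as the policy() method above (greedy with probability epsilon):

```python
import numpy as np

def expected_sarsa_target(Q, next_state, reward, epsilon=0.9, discount=0.9):
    # Expected SARSA bootstrap target: average over the policy's
    # action probabilities instead of the single sampled next action
    q_next = Q[next_state, :]
    num_actions = len(q_next)
    # Exploration mass: (1 - epsilon) spread uniformly over all actions
    probs = np.full(num_actions, (1 - epsilon) / num_actions)
    # Greedy mass: epsilon split over the (possibly tied) greedy actions
    greedy = np.flatnonzero(q_next == q_next.max())
    probs[greedy] += epsilon / len(greedy)
    return reward + discount * np.dot(probs, q_next)
```

Because the expectation removes the sampling noise from the next-action choice, Expected SARSA usually tolerates a higher learning rate than plain SARSA, at the cost of a slightly more expensive update.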
But that's all for a next time.