### Reinforcement Learning with OpenAI Gym

Mon 04 January 2021

After completing the RL Specialization on Coursera I put some of the techniques into practice. It's one thing to learn techniques in a course, but you learn more when you apply them yourself. OpenAI Gym is an open-source Python library of RL environments you can build your own solutions for.

In this post I will focus on tabular RL agents and how to write them in Python. For a number of problems RL without function approximation works quite well, and the code is quick to run and debug. Always a plus. In later posts I will dive deeper into coarse coding and neural networks.

So let's write a simple SARSA agent. It needs to choose an action at every step of the simulation, and learn from the feedback it gets using the SARSA algorithm. I added support for optimistic initialization (init_val) for extra exploration early on, and the parameters for discount, learning rate and epsilon can be modified.

```python
import gym
import numpy as np
import matplotlib.pyplot as plt


class sarsa_agent():
    def __init__(self, num_actions, num_states, init_val=0, learn_rate=0.1, discount=0.9, epsilon=0.9):
        self.num_actions = num_actions
        self.num_states = num_states
        self.learn_rate = learn_rate
        self.discount = discount
        # Note: epsilon is the probability of acting greedily here, so the
        # exploration rate is 1 - epsilon (the reverse of the usual convention).
        self.epsilon = epsilon

        # Tabular action values; a high init_val gives optimistic initialization
        self.Q = np.zeros((num_states, num_actions)) + init_val
        self.last_action = None
        self.last_state = None

    def policy(self, state):
        if np.random.rand() < self.epsilon:
            # Greedy action. np.argmax can be used, but will pick lower indexed actions in case of ties
            return np.random.choice(np.flatnonzero(self.Q[state, :] == self.Q[state, :].max()))
        else:
            # Exploratory action, uniform over all actions
            return np.random.randint(self.num_actions)

    def first_step(self, state):
        # No update yet: there is no previous state-action pair to learn from
        action = self.policy(state)

        self.last_action = action
        self.last_state = state

        return action

    def step(self, state, reward):
        action = self.policy(state)

        # SARSA update: bootstrap from the action actually chosen in the new state
        self.Q[self.last_state, self.last_action] += self.learn_rate * (reward + self.discount*self.Q[state, action] - self.Q[self.last_state, self.last_action])

        self.last_action = action
        self.last_state = state

        return action

    def last_step(self, reward):
        # Terminal update: the value of the terminal state is zero
        self.Q[self.last_state, self.last_action] += self.learn_rate * (reward - self.Q[self.last_state, self.last_action])
```
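To make the update in `step` and `last_step` concrete, here is a minimal sketch of a single SARSA update on plain numbers. The helper name `sarsa_update` and the values used are made up for illustration; they are not part of the agent above or taken from a real run.

```python
def sarsa_update(q_sa, reward, q_next, learn_rate=0.2, discount=0.9):
    """Return the updated Q(s, a) after one SARSA step.

    q_sa   : current estimate for the previous state-action pair
    q_next : estimate for the next state and the action chosen there
             (pass 0.0 for a terminal transition, as in last_step)
    """
    td_error = reward + discount * q_next - q_sa
    return q_sa + learn_rate * td_error


# Illustrative numbers only (assumed, not measured)
new_q = sarsa_update(q_sa=0.0, reward=-1.0, q_next=2.0)
terminal_q = sarsa_update(q_sa=0.0, reward=20.0, q_next=0.0)
print(new_q, terminal_q)
```

The first call moves the estimate a fifth of the way towards the target `-1 + 0.9 * 2`, the second a fifth of the way towards the final reward alone.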

Now let's run this agent against an OpenAI Gym environment. I chose Taxi-v3, since I didn't know that one yet, and it has a small number of discrete states (500). This means a tabular SARSA/Q-learning algorithm will probably be able to solve the problem.

```python
runs, eps, max_t = 10, 3000, 200

# Written against the classic Gym API (gym < 0.26), where reset() returns the
# state and step() returns (state, reward, done, info).
env = gym.make('Taxi-v3')
r_mat = np.zeros((runs, eps, max_t))  # reward per run, episode and time step
for i_run in range(runs):
    print(i_run)
    agent = sarsa_agent(env.action_space.n, env.observation_space.n, learn_rate=0.2, init_val=0)
    for i_episode in range(eps):
        state = env.reset()
        action = agent.first_step(state)
        state, reward, done, info = env.step(action)
        r_mat[i_run, i_episode, 0] = reward
        for t in range(1, max_t):
            action = agent.step(state, reward)
            state, reward, done, info = env.step(action)
            r_mat[i_run, i_episode, t] = reward
            if done:
                agent.last_step(reward)
                break
env.close()

plt.figure(figsize=(16, 9))
# Sum the rewards over the time steps of each episode, then average over runs
plt.plot(np.average(np.sum(r_mat, axis=2), axis=0), '.', alpha=0.2)
plt.title('Performance over 10 runs for SARSA agent in Taxi-v3 environment')
plt.ylabel('Average total reward per episode')
plt.xlabel('Episode #')
plt.show()
plt.close()
```

You can see that the agent learns quickly. In the first few episodes the agent collects fewer and fewer penalties, and in fewer than 1000 episodes of the game the agent hardly makes any mistakes anymore.
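The per-episode totals are noisy, so a smoothed curve can make the learning trend easier to read. Below is a minimal sketch of a moving-average helper; the name `moving_average` and the sample returns are assumptions for illustration, not part of the code above.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (no padding)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running_sum = sum(values[:window])
    averages = [running_sum / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest
        running_sum += values[i] - values[i - window]
        averages.append(running_sum / window)
    return averages


# Made-up episode returns, only to show the call
episode_returns = [-200, -180, -150, -60, -10, 5, 8, 7]
smoothed = moving_average(episode_returns, window=4)
print(smoothed)
```

With the run above, the same helper could presumably be applied to `np.average(np.sum(r_mat, axis=2), axis=0)` with a larger window before plotting.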

Going forward there are a few parameters we could tweak. The discount factor and epsilon are probably not worth the effort, since the default values work pretty well for a wide range of problems. In other environments initial optimism (high initial values for each state-action pair to encourage early exploration) would be something to look into, but this environment hands out mostly negative rewards (-1 per step, -10 for an illegal pickup or drop-off, and +20 only for a successful drop-off), so the init_val of 0 is already quite optimistic. More useful would be to tweak the learning rate.
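To illustrate why an initial value of 0 already acts optimistically when most rewards are negative, here is a small sketch for a single state with four actions. The helper `greedy_actions` is hypothetical, and the next state's value is assumed to be 0 to keep the arithmetic simple.

```python
def greedy_actions(q_row):
    """Return the indices of all actions tied for the highest value."""
    best = max(q_row)
    return [a for a, q in enumerate(q_row) if q == best]


# One state with four actions, all initialised to init_val = 0
q_row = [0.0, 0.0, 0.0, 0.0]
learn_rate = 0.2

# Try action 0 and receive the per-step reward of -1. The value of the
# next state-action pair is assumed to be 0 for this illustration.
q_row[0] += learn_rate * (-1.0 + 0.0 - q_row[0])

print(q_row)                  # action 0 now looks worse than the untried ones
print(greedy_actions(q_row))  # only the untried actions remain greedy
```

After one try the visited action drops below the untried ones, so the greedy choice itself pushes the agent towards actions it has not taken yet.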

Another way to go is to make the agent more complex: for example, changing from epsilon-greedy to another ε-soft policy, or implementing Expected SARSA, where the learning update uses the policy's probabilities for all actions in the next state instead of only the action that was sampled.
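As a rough idea of what the Expected SARSA target could look like for the policy used in this post, here is a minimal sketch. The function names are hypothetical, and `greedy_prob` plays the role of the agent's `epsilon` parameter (the probability of acting greedily); this is an illustration under those assumptions, not a drop-in replacement for the agent.

```python
def policy_probabilities(q_row, greedy_prob):
    """Action probabilities of the policy: greedy with probability
    greedy_prob (ties split evenly), otherwise uniform over all actions."""
    n = len(q_row)
    best = max(q_row)
    ties = [a for a, q in enumerate(q_row) if q == best]
    probs = [(1.0 - greedy_prob) / n] * n
    for a in ties:
        probs[a] += greedy_prob / len(ties)
    return probs


def expected_sarsa_target(reward, q_next_row, greedy_prob=0.9, discount=0.9):
    """Reward plus the discounted expected value of the next state
    under the policy, instead of the value of one sampled action."""
    probs = policy_probabilities(q_next_row, greedy_prob)
    expected_value = sum(p * q for p, q in zip(probs, q_next_row))
    return reward + discount * expected_value


# Made-up action values for the next state
probs = policy_probabilities([1.0, 3.0, 3.0, 0.0], greedy_prob=0.9)
target = expected_sarsa_target(-1.0, [1.0, 3.0, 3.0, 0.0])
print(probs, target)
```

In the agent, this target would take the place of `reward + self.discount*self.Q[state, action]` in `step`, which removes the variance that comes from sampling the next action.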

But that's for next time.