Reinforcement Learning with tiling

Wed 07 April 2021

To use RL in a continuous space we need a model that maps the potentially infinite state space to Q values the agent can base its decisions on. An example of such a problem is the MountainCar environment in OpenAI Gym, where the state consists of a continuous position and velocity. One way to map the continuous space to Q values is tiling.
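For reference, the MountainCar state is just two continuous numbers and there are three discrete actions. A quick way to see this (a small sketch, separate from the full example further down):

import gym

env = gym.make('MountainCar-v0')
print(env.observation_space.low, env.observation_space.high)  # position in [-1.2, 0.6], velocity in [-0.07, 0.07]
print(env.action_space)                                       # Discrete(3): push left, do nothing, push right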

Tiling comes down to laying a grid over the continuous environment; each cell is either active or not. To cope better with hard cell edges, a few layers of shifted grids are usually used, which makes the learned values change more smoothly across the space. How many tiles there are in each dimension of the grid, and how many layers, is up to the user. More tiles allow for a better estimate, but are more "expensive" to update and store.
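As a toy illustration (the numbers here are chosen for the example, not the ones used further down): a 1-D tiling with 10 tiles and 4 shifted layers activates one tile per layer, and nearby points share most of their active tiles.

import numpy as np

# 10 tiles of width 0.1 over [0, 1], 4 layers each shifted by a quarter tile
n_tiles, n_layers = 10, 4
edges = np.linspace(0, 1, n_tiles + 1)
shifts = np.arange(n_layers) * 0.025

def active_tiles(x):
    # Shifting the point up is equivalent to shifting the grid down; one tile index per layer
    return [int(np.digitize(x + s, edges)) - 1 for s in shifts]

print(active_tiles(0.38))  # [3, 4, 4, 4] -- the shifted layers smooth out the hard tile edge at 0.4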

This trade-off in the number of tiles can be seen as one between generalisation and specificity. An extremely fine grid with only 1 layer is very specific to one location, but it takes a long time for higher or lower Q values to propagate across the grid. Coarser tiles and more layers allow values to spread more easily over the space, but in the extreme case of a single tile covering the entire space all location information is lost.
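To put concrete numbers on this for the agent below: 20 tiles and 4 layers per dimension give 80 intervals per dimension, 80 × 80 = 6400 cells for the 2-D state, and with 3 actions that is 19200 weights in total.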

The update rule for a tiling agent is quite close to that of a normal tabular agent, except that multiple tiles can be active at once (as many as there are layers, since the agent is always in exactly one cell per layer). If you divide the learning rate by the number of active tiles, the total adjustment to the Q value is the same as before.
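In code form the tiled SARSA update looks roughly like this (a minimal sketch with made-up tile indices and toy sizes; the full agent follows below):

import numpy as np

n_tiles, n_active = 100, 4               # toy sizes: 4 active tiles, e.g. 4 layers
learn_rate, discount = 0.1 / n_active, 0.9
q_tiles = np.zeros(n_tiles)

active      = np.array([3, 28, 53, 78])  # tiles active for (s, a)
active_next = np.array([4, 29, 54, 79])  # tiles active for (s', a')
reward = -1.0

# Q(s, a) is the sum of its active tile weights; move every active tile by lr * TD error
td_error = reward + discount * q_tiles[active_next].sum() - q_tiles[active].sum()
q_tiles[active] += learn_rate * td_error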

There are two major downsides to tiling in my opinion:

In the code example below a SARSA tile agent solves the MountainCar problem.

import gym
import numpy as np
import matplotlib.pyplot as plt

# Set up tiling space and mapping function: 20 tiles per dimension, 4 shifted layers
edges = np.linspace(0, 1.04, 21)                            # 21 edges -> 20 tiles of width 0.052
shift = np.linspace(0, 0.0399, 4)                           # 4 layers, each shifted by ~1/4 tile width
tile_low = np.concatenate([edges[:-1]-s for s in shift])    # 80 lower tile edges per dimension
tile_high = np.concatenate([edges[1:]-s for s in shift])    # 80 upper tile edges per dimension


def state_to_tiles(obs):
    # Scale the observation to [0, 1]: position lies in [-1.2, 0.6], velocity in [-0.07, 0.07]
    obs_sq = (
        (obs[0]+1.2)/1.8,
        (obs[1]+0.07)/0.14
    )

    # For each dimension, mark the active tile in every layer (4 True values per mask)
    mask_0 = np.logical_and((obs_sq[0]>=tile_low), (obs_sq[0]<tile_high))
    mask_1 = np.logical_and((obs_sq[1]>=tile_low), (obs_sq[1]<tile_high))
    # Outer product of the two dimensions: 80*80 = 6400 cells, of which 4*4 = 16 are active
    mask_obs = np.concatenate([
        mask_0*el_mask_1 for el_mask_1 in mask_1
    ])

    # One mask per action; only the block belonging to that action is active (3*6400 = 19200 weights)
    zeros = np.zeros_like(mask_obs)
    return (
        np.concatenate([mask_obs, zeros, zeros]),
        np.concatenate([zeros, mask_obs, zeros]),
        np.concatenate([zeros, zeros, mask_obs])
    )

# SARSA Tile agent
class tile_agent():
    def __init__(self, num_actions, learn_rate=0.1, discount=0.9, epsilon=0.9):
        self.num_actions = num_actions
        # 16 tiles are active per state-action (4 layers in each of the 2 dimensions),
        # so divide the learning rate by 16 to keep the total update size the same
        self.lr = learn_rate / 16
        self.discount = discount
        self.epsilon = epsilon

        # One weight per tile per action: 80 * 80 * 3 = 19200
        self.Q_tiles = np.zeros(19200)
        self.last_action = None
        self.last_state = None
        self.last_tiles = None

    def policy(self, state):
        # Note: epsilon is the probability of acting *greedily* here, not the exploration rate
        if np.random.rand()<self.epsilon:
            q_s = np.zeros((self.num_actions))
            all_tiles = state_to_tiles(state)
            for a in range(self.num_actions):
                # Q(s, a) is the sum of the weights of the tiles active for action a
                q_s[a] = np.sum(self.Q_tiles[all_tiles[a]])

            action = np.random.choice(np.flatnonzero(q_s == q_s.max())) # with random tie breaking
            tiles = all_tiles[action]
        else:
            action = np.random.randint(self.num_actions)
            tiles = state_to_tiles(state)[action]

        return action, tiles

    def first_step(self, state):
        action, tiles = self.policy(state)

        self.last_action = action
        self.last_state = state
        self.last_tiles = tiles

        return action

    def step(self, state, reward):
        action, tiles = self.policy(state)

        # SARSA update: move Q(s, a) towards r + discount * Q(s', a'), spread over the active tiles
        td_error = reward + self.discount*np.sum(self.Q_tiles[tiles]) - np.sum(self.Q_tiles[self.last_tiles])
        self.Q_tiles[self.last_tiles] += self.lr * td_error

        self.last_action = action
        self.last_state = state
        self.last_tiles = tiles

        return action


    def last_step(self, reward):
        # Terminal update: no bootstrap term, the target is just the final reward
        self.Q_tiles[self.last_tiles] += self.lr * (reward - np.sum(self.Q_tiles[self.last_tiles]))

# Run experiment (MountainCar-v0 ends episodes after at most 200 steps, so max_t is an upper bound)
runs, eps, max_t = 1, 2001, 501

env = gym.make('MountainCar-v0')
r_mat = np.zeros((runs, eps, max_t))  # reward for every run, episode and timestep
for i_run in range(runs):
    # epsilon=1 means the agent always acts greedily; exploration comes from the
    # optimistic zero-initialised Q values (every step gives a reward of -1) and random tie breaking
    agent = tile_agent(env.action_space.n, learn_rate=0.2, discount=0.9, epsilon=1)
    for i_episode in range(eps):
        if i_episode % 100 == 0:
            print(i_episode)
        state = env.reset()
        action = agent.first_step(state)
        state, reward, done, info = env.step(action)
        r_mat[i_run, i_episode, 0] = reward
        for t in range(1, max_t):
            action = agent.step(state, reward)
            state, reward, done, info = env.step(action)
            r_mat[i_run, i_episode, t] = reward
            if done:
                agent.last_step(reward)
                break
env.close()
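
The listing above only collects rewards in r_mat. A quick way to check learning progress (a minimal sketch, reusing r_mat and plt from above) is to plot the total reward per episode:

# Total reward per episode, averaged over runs (closer to 0 means the car reaches the goal sooner)
ep_return = r_mat.sum(axis=2).mean(axis=0)
plt.plot(ep_return)
plt.xlabel('episode')
plt.ylabel('return')
plt.show()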