Reinforcement Learning

Q-Learning is an implementation of reinforcement learning that models:

  • environmental states, represented as s
  • the actions available in each state, represented as a
  • a value for each state/action combination: the Q-value
  • each Q-value starts at 0; as the process "learns", the Q-value for a state/action pair is adjusted downward after negative outcomes and upward after positive outcomes (a minimal sketch follows this list)

The Inefficient Exploration Problem

Exploring can be inefficient. The more options an environment offers, the longer it takes to "master" the environment and decide on the "best" choice to make in each state.
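
A common way to trade off exploration against exploiting what has already been learned is an epsilon-greedy rule, the same idea the training loop further down uses. A minimal sketch with made-up Q-values for a single state:

import random
import numpy as np

epsilon = 0.1                          # explore 10% of the time
q_values = np.array([0.2, -0.5, 1.3])  # hypothetical Q-values for one state's actions

if random.uniform(0, 1) < epsilon:
    action = random.randrange(len(q_values))  # explore: pick any action at random
else:
    action = int(np.argmax(q_values))         # exploit: pick the best-known action
print(action)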

Markov Decision Processes

Perhaps a simple concept: a framework for decision-making where outcomes are partly random and partly under the control of a decision-maker.
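
One way to picture an MDP is as a lookup table of transitions: for each state and each action, a list of (probability, next_state, reward) tuples. This is a hand-written toy, but the gym Taxi environment used below exposes its transitions in a very similar shape (the P table shown later):

# toy MDP: 2 states, 2 actions, hand-written transition probabilities
mdp = {
    0: {  # state 0
        0: [(0.8, 0, -1), (0.2, 1, -1)],  # action 0: usually stay put, sometimes move on
        1: [(1.0, 1, 5)],                 # action 1: always reaches state 1, reward +5
    },
    1: {  # state 1: every action loops back with no reward
        0: [(1.0, 1, 0)],
        1: [(1.0, 1, 0)],
    },
}

for prob, next_state, reward in mdp[0][1]:
    print(prob, next_state, reward)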

gym

gym is a Python library for building and experimenting with reinforcement-learning environments.
Note: for this particular file to run, use pip3 install --force-reinstall -v gym==0.15.3

In [1]:
import gym
import random
import numpy as np
from IPython.display import clear_output
from time import sleep
In [2]:
random.seed(1234)
# render_mode="ansi"
streets = gym.make("Taxi-v3").env
streets.reset()
print(streets.render())
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

None

Breaking Down the UI

Let's break down what we're seeing here:

  • R, G, B, and Y are pickup or dropoff locations for a person
  • The BLUE letter indicates where we need to pick someone up from
  • The MAGENTA letter indicates where that passenger wants to go to
  • The solid lines represent walls that the taxi cannot cross.
  • The filled rectangle represents the taxi itself - it's yellow when empty, and green when carrying a passenger.

Understanding State

Our little world here, which we've called "streets", is a 5x5 grid.
The state of this world at any time can be defined by:

  • Where the taxi is (one of 5x5 = 25 locations)
  • What the current destination is (4 possibilities)
  • Where the passenger is (5 possibilities: at one of the destinations, or inside the taxi)

So there are a total of 25 x 4 x 5 = 500 possible states that describe our world.
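
As a quick check, the environment reports the same count (this assumes the streets object created above is in scope):

print(25 * 5 * 4)                   # 500, from the breakdown above
print(streets.observation_space.n)  # 500, as reported by gym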

Understanding Actions

For each state, there are six possible actions:

  • Move South, East, North, or West
  • Pickup a passenger
  • Drop off a passenger
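
The environment reports the same six actions. The index order, which is also the order used in the reward table shown later, is south, north, east, west, pickup, dropoff:

print(streets.action_space.n)  # 6

# index-to-action mapping, matching the ordering described on this page
action_names = ["south", "north", "east", "west", "pickup", "dropoff"]
print(action_names[3])  # "west"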

Understanding Rewards + Penalties

Q-Learning will take place using the following rewards and penalties at each state:

  • A successful drop-off yields +20 points
  • Every time step taken yields a -1 point penalty
  • Picking up or dropping off at an illegal location yields a -10 point penalty

Moving across a wall just isn't allowed at all.
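
As a quick, optional check, scanning the environment's full transition table (introduced in the sections below) should turn up exactly those three reward values:

# collect every distinct reward value that appears anywhere in the transition table
rewards = {outcome[2]
           for action_outcomes in streets.P.values()
           for outcomes in action_outcomes.values()
           for outcome in outcomes}
print(rewards)  # expected: {-10, -1, 20}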

Defining Initial State

Let's define an initial state:

  • taxi at location (x=2, y=3),
  • the passenger at pickup location 2
  • the destination at location 0
In [3]:
# encode(taxi_row, taxi_col, passenger_location, destination) packs these into a single state number
startingX = 2
startingY = 3
passengerLocation = 2
destination = 0
initial_state = streets.encode(startingX, startingY, passengerLocation, destination)

streets.s = initial_state

print(streets.render())
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

None
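
As a sanity check, streets.decode unpacks a state number back into its components (taxi row, taxi column, passenger location, destination); this assumes the cell above has run:

print(list(streets.decode(initial_state)))  # expected: [2, 3, 2, 0]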

gym provides a reward-table preview

The gym library, for this "taxi" environment, exposes a "reward table": indexing the P attribute with a state returns a dictionary with one key per available action, here 6 actions (move south/north/east/west, pickup, dropoff). Each value is a list of possible outcomes for that action, and each outcome contains:

  • the probability of that transition occurring (always 1.0 here, since this environment is deterministic)
  • the next "state"
  • the "reward" value
  • whether the action ends the episode with a successful drop-off
In [4]:
streets.P[initial_state]
Out [4]:
{0: [(1.0, 368, -1, False)],
 1: [(1.0, 168, -1, False)],
 2: [(1.0, 288, -1, False)],
 3: [(1.0, 248, -1, False)],
 4: [(1.0, 268, -10, False)],
 5: [(1.0, 268, -10, False)]}

Rewards In-Depth

Here's how to interpret this:

  • each row corresponds to a potential action at this state:
    • move South, North, East, or West, pickup, or dropoff
  • The four values in each row are... (looking at the first row as an example)
    • the probability of that transition (1.0; this environment is deterministic)
    • the next state that results from that action (368)
    • the reward for that action (-1)
    • and whether that action indicates a successful dropoff took place (False)

Moving South (the first action in the list) from this state would put us into state number 368, incur a penalty of -1 for taking up time, and not result in a successful dropoff.
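
We can verify that by decoding state 368: the taxi row goes from 2 to 3 (one square south), while the column, passenger location, and destination are unchanged:

print(list(streets.decode(initial_state)))  # [2, 3, 2, 0]
print(list(streets.decode(368)))            # expected: [3, 3, 2, 0]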

Train The Model

So, let's do Q-learning! First we need to train our model. At a high level, we'll train over 10,000 simulated taxi runs. For each run, we'll step through time, with a 10% chance at each step of taking a random, exploratory action instead of using the learned Q-values to guide our actions.

In [5]:
q_table = np.zeros([streets.observation_space.n, streets.action_space.n])

learning_rate = 0.1
discount_factor = 0.6
exploration = 0.1
epochs = 10000

for taxi_run in range(epochs):
    state = streets.reset()
    done = False
    
    while not done:
        random_value = random.uniform(0, 1)
        if (random_value < exploration):
            action = streets.action_space.sample() # Explore a random action
        else:
            action = np.argmax(q_table[state]) # Use the action with the highest q-value
            
        next_state, reward, done, info = streets.step(action)
        
        prev_q = q_table[state, action]
        next_max_q = np.max(q_table[next_state])
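        # Q-learning update: blend the old Q-value with the observed reward plus the discounted best future value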
        new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q)
        q_table[state, action] = new_q
        
        state = next_state

Inspect the trained model's initial state

So now we have a table of Q-values that can be quickly used to determine the optimal next step for any given state! Let's check the table for our initial state above:

In [6]:
q_table[initial_state]
Out [6]:
array([-2.4033722 , -2.39842765, -2.40161804, -2.3639511 , -9.20582162,
       -7.31525097])

The highest q-value here corresponds to the action "go West", which makes sense - that's the most direct route toward our destination from that point. It seems to work!
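
Pulling the best action out programmatically (index 3 corresponds to "move West" under the action ordering described earlier):

best_action = int(np.argmax(q_table[initial_state]))
print(best_action)  # expected: 3, i.e. "move West"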

Run The Simulation Using The Trained Model

Let's see it in action!

In [10]:
for tripnum in range(1, 11):
    state = streets.reset()   
    done = False
    trip_length = 0
    
    while not done and trip_length < 25:
        action = np.argmax(q_table[state])
        next_state, reward, done, *_ = streets.step(action)
        clear_output(wait=True)
        print("Trip number " + str(tripnum) + " Step " + str(trip_length))
        print(streets.render())
        sleep(.1)
        state = next_state
        trip_length += 1
        
    sleep(1)
Trip number 10 Step 17
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
None
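
The animation is fun to watch, but a simpler way to judge the learned policy is to measure the average trip length over many runs without rendering; a rough sketch along the same lines as the loop above:

trip_lengths = []
for _ in range(100):
    state = streets.reset()
    done = False
    steps = 0
    while not done and steps < 100:  # cap the trip in case the policy gets stuck
        action = np.argmax(q_table[state])
        state, reward, done, *_ = streets.step(action)
        steps += 1
    trip_lengths.append(steps)

print(sum(trip_lengths) / len(trip_lengths))  # average steps per trip
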
Page Tags:
python
data-science
jupyter
learning
numpy