Python Reinforcement Learning
上QQ阅读APP看书,第一时间看更新

Policy iteration

In policy iteration, first we initialize a random policy. Then we will evaluate the random policies we initialized: are they good or not? But how can we evaluate the policies? We will evaluate our randomly initialized policies by computing value functions for them. If they are not good, then we find a new policy. We repeat this process until we find a good policy.

Now let us see how to solve the frozen lake problem using policy iteration.

Before looking at policy iteration, we will see how to compute a value function, given a policy. 

We initialize value_table as zero with the number of states:

value_table = np.zeros(env.nS)

Then, for each state, we get the action from the policy, and we compute the value function according to that action and state as follows:

        updated_value_table = np.copy(value_table)
for state in range(env.nS):
action = policy[state]
value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
for trans_prob, next_state, reward_prob, _ in env.P[state][action]])

We break this when the difference between value_table and updated_value_table is less than our threshold:

threshold = 1e-10
if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):
break

Look at the following complete function:

def compute_value_function(policy, gamma=1.0):
value_table = np.zeros(env.nS)
threshold = 1e-10
while True:
updated_value_table = np.copy(value_table)
for state in range(env.nS):
action = policy[state]
value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):
break
return value_table

Now we will see how to perform policy iteration, step by step. 

First, we initialize random_policy as zero NumPy array with shape as number of states:

 random_policy = np.zeros(env.observation_space.n)

Then, for each iteration, we calculate the new_value_function according to our random policy:

new_value_function = compute_value_function(random_policy, gamma)

We will extract the policy using the calculated new_value_function. The extract_policy function is the same as the one we used in value iteration:

 new_policy = extract_policy(new_value_function, gamma)

Then we check whether we have reached convergence, that is, whether we found the optimal policy by comparing random_policy and the new policy. If they are the same, we will break the iteration; otherwise we update random_policy with new_policy:

if (np.all(random_policy == new_policy)):
print ('Policy-Iteration converged at step %d.' %(i+1))
break
random_policy = new_policy

Look at the complete policy_iteration function:

def policy_iteration(env,gamma = 1.0):
random_policy = np.zeros(env.observation_space.n)
no_of_iterations = 200000
gamma = 1.0
for i in range(no_of_iterations):
new_value_function = compute_value_function(random_policy, gamma)
new_policy = extract_policy(new_value_function, gamma)
if (np.all(random_policy == new_policy)):
print ('Policy-Iteration converged at step %d.' %(i+1))
break
random_policy = new_policy
return new_policy

Thus, we can get optimal_policy using policy_iteration

optimal_policy = policy_iteration(env, gamma = 1.0)

We will get some output, which is the optimal_policy, the actions to be performed in each state:

array([0., 3., 3., 3., 0., 0., 0., 0., 3., 1., 0., 0., 0., 2., 1., 0.])

The complete program is given as follows:

import gym
import numpy as np

env = gym.make('FrozenLake-v0')

def compute_value_function(policy, gamma=1.0):
value_table = np.zeros(env.nS)
threshold = 1e-10
while True:
updated_value_table = np.copy(value_table)
for state in range(env.nS):
action = policy[state]
value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):
break
return value_table


def extract_policy(value_table, gamma = 1.0):
policy = np.zeros(env.observation_space.n)
for state in range(env.observation_space.n):
Q_table = np.zeros(env.action_space.n)
for action in range(env.action_space.n):
for next_sr in env.P[state][action]:
trans_prob, next_state, reward_prob, _ = next_sr
Q_table[action] += (trans_prob * (reward_prob + gamma * value_table[next_state]))
policy[state] = np.argmax(Q_table)

return policy

def policy_iteration(env,gamma = 1.0):
random_policy = np.zeros(env.observation_space.n)
no_of_iterations = 200000
gamma = 1.0
for i in range(no_of_iterations):
new_value_function = compute_value_function(random_policy, gamma)
new_policy = extract_policy(new_value_function, gamma)
if (np.all(random_policy == new_policy)):
print ('Policy-Iteration converged at step %d.' %(i+1))
break
random_policy = new_policy
return new_policy


print (policy_iteration(env))

Thus, we can derive the optimal policy, which specifies what action to perform in each state, using value and policy iteration to solve the frozen lake problem.