
Provided policy_improvement() solution initializes values to zero for each iteration #204

link2xt opened this issue Jun 23, 2019 · 2 comments

link2xt commented Jun 23, 2019

The provided solution does not follow the pseudocode on p. 102 exactly. It initializes policy evaluation with zeros on every iteration, even though the book says: "Note that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy." Fixing this makes no difference in the "gridworld" example, but may speed up convergence in more complex examples.

It makes sense to change the policy_eval signature to accept an initial value for V, something like this:

def policy_eval(policy, env, discount_factor=1.0, theta=0.00001, V_init=None):
...
        V_init: initial value function vector (defaults to zeros when None).
...
    V = V_init if V_init is not None else np.zeros(env.nS)

and change policy_improvement to pass the previous value function to policy_eval, along the lines of the sketch below.
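
A minimal sketch of how policy_improvement could then warm-start each evaluation (illustrative only: it assumes the V_init signature above, numpy imported as np, and the usual env.nS / env.nA / env.P interface; it is not the notebook's exact code):

def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    # Start with a uniform random policy and a zero value function
    policy = np.ones([env.nS, env.nA]) / env.nA
    V = np.zeros(env.nS)
    while True:
        # Re-evaluate the current policy, starting from the previous value function
        V = policy_eval_fn(policy, env, discount_factor, V_init=V)
        policy_stable = True
        for s in range(env.nS):
            old_action = np.argmax(policy[s])
            # One-step lookahead to find the best action under the current V
            action_values = np.zeros(env.nA)
            for a in range(env.nA):
                for prob, next_state, reward, done in env.P[s][a]:
                    action_values[a] += prob * (reward + discount_factor * V[next_state])
            best_action = np.argmax(action_values)
            if best_action != old_action:
                policy_stable = False
            # Make the policy greedy (deterministic) at this state
            policy[s] = np.eye(env.nA)[best_action]
        if policy_stable:
            return policy, V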

See also related issues about another bug in the function (#203) and its naming (#202).

@radumihai28

Yes, you are right.
I will post my code in the next comment.

@radumihai28

import numpy as np

def policy_eval(policy, env, V=None, discount_factor=1.0, theta=0.00001):
    """
    Evaluate a policy given an environment and a full description of the environment's dynamics.

    Args:
        policy: [S, A] shaped matrix representing the policy.
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is the number of states in the environment.
            env.nA is the number of actions in the environment.
        V: initial value function vector (defaults to zeros when None).
        discount_factor: Gamma discount factor.
        theta: We stop evaluation once our value function change is less than theta for all states.

    Returns:
        Vector of length env.nS representing the value function.
    """
    # Start from the provided value function, or from zeros if none was given
    if V is None:
        V = np.zeros(env.nS)
    while True:
        delta = 0
        # For each state, perform a "full backup"
        for s in range(env.nS):
            v = 0
            # Look at the possible next actions
            for a, action_prob in enumerate(policy[s]):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    # Calculate the expected value
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            # How much our value function changed (across any states)
            delta = max(delta, np.abs(v - V[s]))
            V[s] = v
        # Stop evaluating once our value function change is below a threshold
        if delta < theta:
            break
    return np.array(V)

def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    """
    Policy Improvement Algorithm. Iteratively evaluates and improves a policy
    until an optimal policy is found.

    Args:
        env: The OpenAI environment.
        policy_eval_fn: Policy Evaluation function that takes policy, env,
            an initial value function V, and discount_factor.
        discount_factor: Gamma discount factor.

    Returns:
        A tuple (policy, V).
        policy is the optimal policy, a matrix of shape [S, A] where each state s
        contains a valid probability distribution over actions.
        V is the value function for the optimal policy.
    """
    # Start with a random (uniform) policy
    policy = np.ones([env.nS, env.nA]) / env.nA

    # Find the value function for the random policy
    V = policy_eval_fn(policy, env, discount_factor=discount_factor)

    while True:

        # Update the policy greedily with respect to the current value function
        policy_stable = True

        # For every state
        for s in range(env.nS):

            # Remember the current action probabilities
            old_action = policy[s].copy()

            # One-step lookahead: expected return of every action
            action_values = np.zeros(env.nA)
            for a in range(env.nA):
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                    # Accumulate the expected value over all transitions
                    action_values[a] += prob * (reward + discount_factor * V[next_state])

            # Mark all maximizing actions
            best_action = (action_values == action_values.max()).astype(int)

            # Turn the best actions into a probability distribution
            policy[s] = best_action / best_action.sum()

            # Give the terminal states zero probability (gridworld-specific)
            if s == 0 or s == env.nS - 1:
                policy[s] = 0

            # Check whether the policy changed at this state
            if (old_action != policy[s]).any():
                policy_stable = False

        # If the policy did not change, return the optimal policy and V
        if policy_stable:
            return policy, V

        # Policy evaluation for the improved policy, warm-started from the previous V
        V = policy_eval_fn(policy, env, V, discount_factor=discount_factor)
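
For reference, a quick way to try the functions above (this assumes the repo's GridworldEnv from lib/envs/gridworld.py, as used in the DP notebooks):

from lib.envs.gridworld import GridworldEnv

env = GridworldEnv()
policy, v = policy_improvement(env)
print("Policy probability distribution:")
print(policy)
print("Value function:")
print(v)

With the warm start in place, evaluations after the first may need fewer sweeps; for the small gridworld the final policy and value function are unchanged, as noted in the original report.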
