Policy iteration

cosmos 15th July 2017 at 7:07pm
Model-based reinforcement learning

Policy iteration

The greedy policy always improves on the current policy (or leaves it equally good) -> this gives a partial ordering over policies, and is the intuition for the optimal policy theorem
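A sketch of the standard policy-improvement argument behind this claim, with $q_\pi$ and $v_\pi$ denoting the action-value and state-value functions of $\pi$ (notation added here, not from the original note): if $\pi'(s) = \arg\max_a q_\pi(s,a)$ is the greedy policy, then

$$ q_\pi(s, \pi'(s)) = \max_a q_\pi(s,a) \ge q_\pi(s, \pi(s)) = v_\pi(s), $$

and unrolling this inequality along trajectories gives $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$, with equality everywhere only when $v_\pi$ already satisfies the Bellman optimality equation, i.e. when $\pi$ is already optimal.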

Full policy iteration

Modified policy iteration doesn't require the policy-evaluation step to converge to the true value function. The version that performs just one Bellman backup per policy iteration is described below; this version is equivalent to value iteration (below).
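To see the equivalence (a standard observation, spelled out here): substituting the greedy policy from the improvement step directly into the one-step evaluation collapses the two updates below into the single value-iteration backup

$$ V(s) := \max_a \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V(s') \right). $$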

The algorithm alternates between two kinds of steps, which are repeated in some order for all the states until no further changes take place. It first computes the policy estimate greedily from the current value function estimate, and then recomputes the value function estimate using the new policy estimate. The two steps are defined recursively as follows:

$$ V_\pi(s) := \sum_{s'} P_{\pi(s)}(s,s') \left( R_{\pi(s)}(s,s') + \gamma V(s') \right) $$

$$ \pi(s) := \arg\max_a \left\{ \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V(s') \right) \right\} $$

Here the value function (defined above) measures the discounted sum of the rewards to be earned (on average) by following the policy $\pi$ from state $s$. The first recursive relation is the Bellman equation for the current policy; the second is the greedy policy-improvement step. These equations are applied iteratively, one after the other, in the 1-step policy iteration.
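A minimal Python sketch of the two updates applied as a synchronous sweep over all states, assuming the MDP is given as NumPy arrays P[a, s, s'] (transition probabilities) and R[a, s, s'] (rewards); these names, shapes and the function name are illustrative assumptions, not from the original note:

```python
import numpy as np

def one_step_policy_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """One-step (modified) policy iteration on a finite MDP.

    P[a, s, s2]: probability of moving from s to s2 when taking action a.
    R[a, s, s2]: reward received for that transition.
    Each sweep does one greedy policy improvement followed by a single
    Bellman backup for that policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)

    for _ in range(max_iters):
        # Q[a, s] = sum_{s'} P_a(s, s') * (R_a(s, s') + gamma * V(s'))
        Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])

        pi = Q.argmax(axis=0)                # greedy policy improvement
        V_new = Q[pi, np.arange(n_states)]   # one Bellman backup for pi

        if np.max(np.abs(V_new - V)) < tol:  # stop once the values stabilise
            return pi, V_new
        V = V_new

    return pi, V
```

Because the backup uses the freshly improved greedy policy, `V_new` equals `Q.max(axis=0)`, i.e. exactly the value-iteration update noted above.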

Their order actually depends on the variant of the algorithm: one can perform them for all states at once (synchronously) or state by state, and more often for some states than for others (asynchronously). As long as no state is permanently excluded from either of the steps, the algorithm will eventually arrive at the correct solution!
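Continuing the sketch above, an asynchronous variant might apply both updates to a single randomly chosen state per call; since every state keeps being selected with positive probability, none is permanently excluded and the estimates still converge. The helper below is again an illustrative assumption, not part of the original note:

```python
def async_policy_iteration_step(P, R, V, gamma=0.9, rng=np.random):
    """Apply both updates to one randomly chosen state, modifying V in place."""
    _, n_states, _ = P.shape
    s = rng.randint(n_states)
    # q[a] = sum_{s'} P_a(s, s') * (R_a(s, s') + gamma * V(s'))
    q = np.einsum('at,at->a', P[:, s, :], R[:, s, :] + gamma * V[None, :])
    a_greedy = q.argmax()   # greedy improvement at state s
    V[s] = q[a_greedy]      # Bellman backup for that action
    return a_greedy
```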