Value iteration

cosmos 15th July 2017 at 7:11pm
Model-based reinforcement learning

Value iteration (vid2)

Value iteration can be seen as solving the Bellman optimality equation by fixed-point iteration. It can also be viewed as Policy iteration in which only a single step of policy evaluation is performed between policy improvements.
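For reference, the Bellman optimality equation being solved, written in the same notation as the update below, is

$$V^*(s) = \max_a \left\{ \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V^*(s') \right) \right\}$$

and the update below is exactly this equation turned into an assignment.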

Iterate:

$$V_{i+1}(s) := \max_a \left\{ \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V_i(s') \right) \right\}$$

until convergence to $V^*$. Once the iterates have converged, compute the optimal policy from its definition:

$$\pi^*(s) = \arg\max_a \sum_{s'} P_a(s,s') \left( R_a(s,s') + \gamma V^*(s') \right)$$
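A minimal NumPy sketch of the whole procedure, assuming the MDP is given as dense arrays `P` and `R` (hypothetical names, not from the source; `P[a, s, t]` = $P_a(s,s')$, `R[a, s, t]` = $R_a(s,s')$, both of shape `(A, S, S)`):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration; returns V* and the greedy policy pi*."""
    V = np.zeros(P.shape[1])
    while True:
        # Bellman optimality backup:
        # Q[a, s] = sum_{s'} P_a(s,s') * (R_a(s,s') + gamma * V_i(s'))
        Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)                # V_{i+1}(s) = max_a Q[a, s]
        if np.max(np.abs(V_new - V)) < tol:  # converged in sup-norm
            break
        V = V_new
    # Greedy policy extraction from the converged values:
    # pi*(s) = argmax_a sum_{s'} P_a(s,s') * (R_a(s,s') + gamma * V*(s'))
    Q = np.einsum('ast,ast->as', P, R + gamma * V_new[None, None, :])
    return V_new, Q.argmax(axis=0)
```

Convergence is guaranteed for $\gamma < 1$, since the Bellman optimality backup is a $\gamma$-contraction in the sup-norm.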

example · more explanation · demo