can be seen as solving the Bellman optimality equation iteratively. It is also similar to policy iteration, except that only a single sweep of policy evaluation is performed before each policy improvement.
Iterate:
- $V_{i+1}(s) := \max_a \left\{ \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V_i(s') \right) \right\}$
until it converges to $V^*$. After the iterations converge, compute the optimal policy from its definition (a short sketch of both steps follows this list):
- $\pi^*(s) = \arg\max_a \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V^*(s') \right)$
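To make the two steps concrete, here is a minimal Python sketch. The two-state MDP, the transition table `P`, and the names `GAMMA` and `THETA` are illustrative assumptions, not part of the notes; the loop applies the Bellman optimality update until the values stop changing, then reads off the greedy policy from $V^*$.

```python
# Minimal value-iteration sketch on a made-up toy MDP.
# P[s][a] is a list of (next_state, probability, reward) triples.

GAMMA = 0.9      # discount factor (assumed value)
THETA = 1e-8     # convergence threshold (assumed value)

# Hypothetical two-state, two-action MDP for illustration only.
P = {
    0: {"stay": [(0, 1.0, 0.0)], "go": [(1, 0.8, 5.0), (0, 0.2, 0.0)]},
    1: {"stay": [(1, 1.0, 1.0)], "go": [(0, 1.0, 0.0)]},
}

def q_value(V, s, a):
    """Expected return of action a in state s: sum_s' P_a(s,s') (R_a(s,s') + gamma * V(s'))."""
    return sum(p * (r + GAMMA * V[s_next]) for s_next, p, r in P[s][a])

# Value iteration: repeat the Bellman optimality update until V converges.
V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        v_new = max(q_value(V, s, a) for a in P[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < THETA:
        break

# Policy extraction: pick the action that maximizes the one-step lookahead under V*.
policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
print(V, policy)
```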
example – more explanation – demo