Expected Sarsa

cosmos 15th July 2017 at 8:35pm
Off-policy learning

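Like Sarsa, but the last term in the update for the last visited State-action pair (the value at the next state) is averaged over the possible next actions, weighted by their probabilities under a target policy, instead of using the value of the single next action actually taken. In Off-policy learning, this target policy can differ from the policy that generates the behaviour.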
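A minimal tabular sketch of that update, to make the averaging concrete. The function name `expected_sarsa_update`, the array layout of `Q`, and the parameters `alpha` (step size), `gamma` (discount) and `target_probs` (the target policy's action probabilities at the next state) are assumptions made for illustration, not something the note specifies.

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, target_probs,
                          alpha=0.1, gamma=0.99, done=False):
    """Apply one Expected Sarsa update to Q[s, a] in place.

    Q            : array of shape (n_states, n_actions), action-value estimates
    target_probs : array of shape (n_actions,), assumed to hold the target
                   policy's probabilities pi(a' | s_next)
    """
    # Average the next-state values over actions, weighted by the target policy.
    # Plain Sarsa would use Q[s_next, a_next] for the one sampled action instead.
    expected_next = 0.0 if done else float(np.dot(target_probs, Q[s_next]))
    td_target = r + gamma * expected_next
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q[s, a]

# Hypothetical two-state, two-action example.
Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]                      # current estimates at the next state
uniform = np.array([0.5, 0.5])         # assumed target policy at state 1
new_value = expected_sarsa_update(Q, s=0, a=0, r=1.0, s_next=1,
                                  target_probs=uniform, alpha=0.5, gamma=0.9)
```

If `target_probs` puts all its mass on the greedy action, the expectation reduces to a max over next actions, which is the Q-learning target.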