Off-policy learning

cosmos 15th July 2017 at 8:36pm
Model-free reinforcement learning

Off-policy learning is when we sample with a policy (the behaviour policy) that can be different from the one we are trying to optimize / evaluate, called the target policy. If the target policy is the same as the behaviour (sampling) policy, this reduces to On-policy learning, so off-policy methods are more general.

Useful for the Exploration-exploitation trade-off: the behaviour policy can keep exploring while the target policy being learned can be greedy.

intro video

Can use Importance sampling, with weights given by the ratio of the probability of the trajectory under the target policy to its probability under the sampling (behaviour) policy.
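
A minimal sketch of ordinary importance sampling for off-policy Monte Carlo evaluation of the start-state value; the episode format of (state, action, reward) tuples and the `target_prob` / `behaviour_prob` probability functions are assumptions for illustration:

```python
import numpy as np

def importance_sampling_estimate(episodes, target_prob, behaviour_prob, gamma=1.0):
    """Ordinary importance-sampling estimate of the start-state value under the
    target policy, from episodes generated by the behaviour policy.

    episodes: list of episodes, each a list of (state, action, reward) tuples
    target_prob(s, a): probability of action a in state s under the target policy
    behaviour_prob(s, a): probability of action a in state s under the behaviour policy
    """
    weighted_returns = []
    for episode in episodes:
        rho = 1.0   # importance-sampling ratio for the whole trajectory
        G = 0.0     # discounted return of the episode
        for t, (s, a, r) in enumerate(episode):
            rho *= target_prob(s, a) / behaviour_prob(s, a)
            G += (gamma ** t) * r
        weighted_returns.append(rho * G)
    # ordinary (as opposed to weighted) importance sampling: divide by the number of episodes
    return np.mean(weighted_returns)
```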

The idea that works best in practice is Q-learning, the most well-known off-policy method: both the behaviour and target policies are allowed to improve, with the target policy being greedy with respect to the current Q values.
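
A minimal sketch of tabular Q-learning, where actions are sampled from an epsilon-greedy behaviour policy but the update target uses the greedy (max) value, so it learns off-policy without importance-sampling corrections; the environment interface (`reset`, `step` returning `(s, r, done)`) and hyperparameters are assumptions:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy with respect to the current Q
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # target policy: greedy, hence the max over next actions (off-policy target)
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```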

Per-reward (per-decision) importance sampling (sec. 5.9 in Sutton & Barto), Off-policy Returns.

Expected Sarsa
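
Expected Sarsa is off-policy whenever the expectation in its target is taken under a target policy that differs from the behaviour policy; a minimal sketch of the update target, where the `pi` array holding the target policy's action probabilities is an assumption for illustration (Q-learning is the special case where `pi` is greedy):

```python
import numpy as np

def expected_sarsa_target(Q, s_next, r, pi, gamma=0.99, done=False):
    """Target = r + gamma * E_{a ~ pi}[Q(s', a)], where pi[s_next] holds the
    target policy's action probabilities in the next state."""
    if done:
        return r
    return r + gamma * np.dot(pi[s_next], Q[s_next])
```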

Maximization bias and Double learning

Afterstates

Tree Backup Algorithm

Q(sigma) algorithm

Value function approximation