Maximization bias

cosmos 15th July 2017 at 8:46pm
Off-policy learning

The maximum of expected rewards is not the same as the expectation of the maximum reward.
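Written out, with $\hat{Q}(a)$ standing for a noisy estimate of the value of action $a$ (the symbol is an assumption for illustration, not notation from this note), the statement above is an instance of Jensen's inequality, since max is a convex function:

$$
\mathbb{E}\left[\max_a \hat{Q}(a)\right] \;\ge\; \max_a \mathbb{E}\left[\hat{Q}(a)\right]
$$

The inequality is typically strict when the estimates are noisy and several actions have similar true values.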

In Model-free reinforcement learning methods we are interested in the maximum of the expected reward (as in the Bellman optimality equation). However, max and expectation don't commute, and many methods effectively use the expectation of a maximum instead, because they take the maximum over noisy value estimates. These methods converge to the right value in the limit of infinite samples, but for finite samples this non-commutativity introduces a positive bias, which can cause overoptimistic valuation of certain states.
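The finite-sample effect can be seen in a small simulation. This is only a sketch under assumed conditions: ten actions whose true value is exactly 0, unit-variance Gaussian reward noise, and hypothetical helper names (`max_of_sample_means`, `average_max_estimate`) that are not from any library.

```python
import random

def max_of_sample_means(true_means, n_samples, rng):
    """Estimate each action's value by a sample mean, then take the max."""
    estimates = []
    for mu in true_means:
        draws = [rng.gauss(mu, 1.0) for _ in range(n_samples)]
        estimates.append(sum(draws) / n_samples)
    return max(estimates)

def average_max_estimate(true_means, n_samples, n_trials=500, seed=0):
    """Average the max-of-estimates over many independent trials."""
    rng = random.Random(seed)
    total = sum(
        max_of_sample_means(true_means, n_samples, rng) for _ in range(n_trials)
    )
    return total / n_trials

# Assumed setup: every action is worth exactly 0, so the true max is 0.
true_means = [0.0] * 10
few = average_max_estimate(true_means, n_samples=5)
many = average_max_estimate(true_means, n_samples=200)
print(f"average max estimate, 5 samples per action:   {few:.3f}")
print(f"average max estimate, 200 samples per action: {many:.3f}")
```

Both averages should come out above the true maximum of 0, with the few-sample case noticeably further above it. The bias shrinks as the number of samples grows, which matches the convergence in the infinite-sample limit.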

This can be avoided with Double learning.
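A minimal sketch of the double-learning idea in the same toy setting: split the samples into two independent sets, use one set to pick the apparently best action, and use the other set to evaluate that action. The setup (ten actions with true value 0, Gaussian noise) and all function names are assumptions for illustration. This is a bandit-style estimator, not a full Double Q-learning implementation.

```python
import random

def sample_means(true_means, n_samples, rng):
    """Sample-mean value estimate for each action."""
    return [
        sum(rng.gauss(mu, 1.0) for _ in range(n_samples)) / n_samples
        for mu in true_means
    ]

def single_estimate(true_means, n_samples, rng):
    """Max over estimates built from all 2 * n_samples draws per action."""
    return max(sample_means(true_means, 2 * n_samples, rng))

def double_estimate(true_means, n_samples, rng):
    """Select the action with set A, evaluate it with independent set B."""
    means_a = sample_means(true_means, n_samples, rng)
    means_b = sample_means(true_means, n_samples, rng)
    best = max(range(len(true_means)), key=lambda a: means_a[a])
    return means_b[best]

def average(estimator, true_means, n_samples, n_trials=2000, seed=0):
    """Average an estimator over many independent trials."""
    rng = random.Random(seed)
    total = sum(estimator(true_means, n_samples, rng) for _ in range(n_trials))
    return total / n_trials

# Assumed setup: all true values are 0, so the true max is 0.
true_means = [0.0] * 10
single_avg = average(single_estimate, true_means, n_samples=5)
double_avg = average(double_estimate, true_means, n_samples=5)
print(f"single estimator average: {single_avg:.3f}")
print(f"double estimator average: {double_avg:.3f}")
```

Because the selection noise in set A is independent of the evaluation noise in set B, the double estimator should average close to 0 here, while the single estimator stays positively biased. When true values differ, the double estimator can instead underestimate the maximum, so it trades the upward bias for a possible downward one.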