maximum of expected rewards is not the same as expectation of maximum reward
In model-free reinforcement learning methods we are interested in the maximum of the expected reward (as in the Bellman optimality equation). However, max and expectation do not commute, and many methods effectively use an expectation over a maximum, i.e. they take the max over noisy sample-based estimates. These methods still converge to the right values in the limit of infinite samples, but for finite samples this non-commutativity introduces an upward bias (E[max] ≥ max E[·]), which can cause overoptimistic valuation of certain states.
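A small simulation can make the bias concrete. The setup below is an illustrative sketch (the action count, sample sizes, and reward distribution are arbitrary choices, not from the original note): every action has true expected reward 0, so max_a E[r(a)] = 0, yet taking the max over the sample-mean estimates yields a positive number on average.

```python
import random

def estimate_bias(num_actions=5, num_samples=10, num_trials=20000, seed=0):
    """Average of max over noisy value estimates, where every action's
    true expected reward is 0 (rewards are +1 or -1 with equal probability).
    The correct target max_a E[r(a)] is 0, but the returned average is
    positive: E[max of estimates] >= max of expectations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        # sample-mean value estimate for each action
        estimates = [
            sum(rng.choice((-1.0, 1.0)) for _ in range(num_samples)) / num_samples
            for _ in range(num_actions)
        ]
        total += max(estimates)  # max of noisy estimates: biased upward
    return total / num_trials

bias = estimate_bias()
# bias is clearly positive even though every true action value is 0
```

Increasing `num_samples` shrinks the bias (it vanishes in the infinite-sample limit), while increasing `num_actions` makes it worse, since the max is taken over more noisy estimates.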
This overestimation can be avoided with double learning (as in Double Q-learning), which maintains two independent estimators and uses one to select the maximizing action and the other to evaluate it.
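The double-estimator idea can be sketched in the same toy setting as above (again an illustrative setup, not from the original note): one independent set of samples picks the argmax action, a second set evaluates it. Because selection noise and evaluation noise are independent, the upward bias disappears.

```python
import random

def double_estimate(num_actions=5, num_samples=10, num_trials=20000, seed=1):
    """Double-estimator value of the best action, all true values 0.
    Estimator A selects the argmax action; independent estimator B
    evaluates it, so the result is unbiased (close to 0) rather than
    optimistically positive."""
    rng = random.Random(seed)

    def sample_estimates():
        return [
            sum(rng.choice((-1.0, 1.0)) for _ in range(num_samples)) / num_samples
            for _ in range(num_actions)
        ]

    total = 0.0
    for _ in range(num_trials):
        est_a = sample_estimates()  # used only for action selection
        est_b = sample_estimates()  # used only for evaluation
        best = max(range(num_actions), key=lambda a: est_a[a])
        total += est_b[best]
    return total / num_trials

value = double_estimate()
# value is close to 0, the true max of expected rewards
```

This is the same decoupling Double Q-learning applies to the bootstrapped target: the argmax comes from one Q-table, the value plugged into the update from the other.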