Per-reward importance sampling

cosmos 15th July 2017 at 8:32pm
Off-policy learning

(Secs. 5.9 and 7.4 in Sutton & Barto.) See also Off-policy Returns.

The importance-sampled return in a Markov decision process, i.e. the sum of discounted rewards weighted by a single importance sampling (IS) ratio (the ratio of the trajectory's probability under the target policy to its probability under the behaviour/sampling policy), can be rewritten as a sum of discounted rewards in which each reward carries its own IS ratio, taken over the trajectory truncated at that reward's time step. The key observation is that each ratio factor from a step after the reward has conditional expectation 1 given the preceding history, so by iterated expectations those factors can be dropped without changing the expected value, which typically reduces variance. Sutton & Barto call this per-decision importance sampling.
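
A sketch of the identity in Sutton & Barto's notation (target policy $\pi$, behaviour policy $b$, ratio $\rho_{t:h}$; the symbol $\tilde G_t$ for the per-reward return is just a label for the resulting estimator):

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}, \qquad \rho_{t:T-1}\, G_t = \sum_{k=t}^{T-1} \gamma^{k-t}\, \rho_{t:T-1}\, R_{k+1}.$$

Each factor $\pi(A_j \mid S_j)/b(A_j \mid S_j)$ with $j > k$ has expectation $1$ under $b$ conditional on the history through time $j$, so by iterated expectations

$$\mathbb{E}_b\!\left[\rho_{t:T-1}\, R_{k+1}\right] = \mathbb{E}_b\!\left[\rho_{t:k}\, R_{k+1}\right],$$

and the per-reward return

$$\tilde G_t \doteq \sum_{k=t}^{T-1} \gamma^{k-t}\, \rho_{t:k}\, R_{k+1}$$

satisfies $\mathbb{E}_b[\tilde G_t] = \mathbb{E}_b[\rho_{t:T-1} G_t] = v_\pi(S_t)$.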
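
As a quick numerical sanity check, here is a minimal runnable sketch under invented assumptions (a one-state toy task where the reward equals the chosen action, a fixed horizon, and made-up target and behaviour policies); none of these specifics are from the source:

```python
# Per-reward (per-decision) importance sampling on a hypothetical toy task:
# one state, actions {0, 1}, reward = action, fixed horizon.
import random

GAMMA = 0.9
HORIZON = 3
ACTIONS = [0, 1]
target = {0: 0.9, 1: 0.1}     # pi(a): mostly action 0 (assumed for illustration)
behaviour = {0: 0.5, 1: 0.5}  # b(a): uniform sampling policy

def episode():
    """Sample (action, reward) pairs for one episode under the behaviour policy."""
    traj = []
    for _ in range(HORIZON):
        a = random.choices(ACTIONS, weights=[behaviour[0], behaviour[1]])[0]
        traj.append((a, float(a)))  # reward equals the chosen action
    return traj

def per_reward_return(traj):
    """Return sum_k gamma^k * rho_{0:k} * R_{k+1}: each reward's IS ratio
    is truncated at that reward's time step."""
    g, rho = 0.0, 1.0
    for k, (a, r) in enumerate(traj):
        rho *= target[a] / behaviour[a]  # extend the ratio only up to step k
        g += (GAMMA ** k) * rho * r
    return g

random.seed(0)
n = 100_000
est = sum(per_reward_return(episode()) for _ in range(n)) / n
# True value: expected per-step reward under pi is 0.1, discounted over the horizon.
true_v = sum(GAMMA ** k for k in range(HORIZON)) * 0.1
print(f"per-reward IS estimate: {est:.4f}  true value: {true_v:.4f}")
```

The estimate should match the true value up to Monte Carlo noise, illustrating that truncating each reward's ratio leaves the estimator unbiased.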