Off-policy Returns (secs 5.9 and 7.4 in Sutton-Barto).
In a Markov decision process, a return weighted by a single importance sampling (IS) weight (the ratio of the probabilities of the trajectory under the target policy and the sampling policy) can equivalently be expressed as a sum of discounted rewards, each carrying its own IS weight computed from the trajectory truncated at the time step of that reward (per-decision importance sampling).
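A minimal sketch of the identity, assuming the Sutton-Barto notation: target policy $\pi$, sampling (behaviour) policy $b$, and $\rho_{t:h} = \prod_{k=t}^{h} \pi(A_k \mid S_k)/b(A_k \mid S_k)$. The ordinary IS return and the per-decision form,
$$
\rho_{t:T-1}\, G_t \;=\; \rho_{t:T-1} \sum_{k=t}^{T-1} \gamma^{\,k-t} R_{k+1}
\qquad\text{and}\qquad
\tilde{G}_t \;=\; \sum_{k=t}^{T-1} \gamma^{\,k-t}\, \rho_{t:k}\, R_{k+1},
$$
have the same expectation under the sampling policy, because each ratio factor $\pi(A_j \mid S_j)/b(A_j \mid S_j)$ with $j > k$ is independent of $R_{k+1}$ and has expectation 1, so the factors beyond step $k$ can be dropped from the weight on $R_{k+1}$.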