In off-policy learning we sample with a behaviour policy that can be different from the one we are trying to optimize / evaluate, called the target policy. If the target policy is the same as the behaviour (sampling) policy, this reduces to on-policy learning, so off-policy methods are more general.
Useful for the exploration-exploitation trade-off: the behaviour policy can keep exploring while the target policy being evaluated or improved can be greedy / deterministic.
Can use importance sampling, with weights given by the ratio of the probability of the trajectory under the target policy to its probability under the behaviour (sampling) policy, as written out below.
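In Sutton & Barto's notation, with target policy $\pi$ and behaviour policy $b$, the importance-sampling ratio for the part of a trajectory from time $t$ to $T-1$ is

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$

The environment's transition probabilities appear in both numerator and denominator and cancel, so the weight depends only on the two policies and the actions actually taken; a return $G_t$ sampled under $b$ is then weighted by $\rho_{t:T-1}$ to estimate the value under $\pi$.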
The idea that works best in practice is Q-learning, the most well-known off-policy TD method: the behaviour policy stays exploratory (e.g. epsilon-greedy) while the target policy is greedy with respect to the current Q estimates, and both improve as Q improves. A sketch follows.
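A minimal sketch of tabular Q-learning, assuming a hypothetical discrete environment whose interface is `env.reset() -> state` and `env.step(action) -> (next_state, reward, done)`; the function name and hyperparameters are illustrative:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done),
    with integer states in [0, n_states) and actions in [0, n_actions).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy w.r.t. current Q (keeps exploring).
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Target policy: greedy w.r.t. Q, hence the max in the backup.
            td_target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```

The max over next-state action values is what makes the update off-policy: it evaluates the greedy target policy while the epsilon-greedy behaviour policy generates the data, and for this one-step backup no importance-sampling correction is needed.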
Per-reward (per-decision) importance sampling (sec 5.9 in Sutton-Barto) and off-policy returns: the importance weight is applied to each reward in the off-policy return individually rather than to the whole return, which lowers variance; see the formula below.
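A per-decision form of the discounted off-policy return, in the same notation as above, is

$$\tilde{G}_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, \rho_{t:k}\, R_{k+1},$$

which has the same expectation as the fully weighted return $\rho_{t:T-1} G_t$: the importance-ratio factors for decisions taken after a reward is received have expectation 1, so dropping them leaves the estimate unbiased while reducing its variance.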
Maximization bias and double learning: using the maximum of estimated action values (as the Q-learning target does) as an estimate of the maximum true action value introduces a positive bias. Double learning keeps two independent estimates, using one to select the maximizing action and the other to evaluate it, which removes the bias; see the sketch below.
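A minimal sketch of tabular Double Q-learning under the same assumed environment interface as the Q-learning sketch above; one table selects the maximizing action, the other evaluates it, and the roles are swapped at random:

```python
import numpy as np

def double_q_learning(env, n_states, n_actions, episodes=500,
                      alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Double Q-learning with two independent action-value tables."""
    Q1 = np.zeros((n_states, n_actions))
    Q2 = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy on the sum of both estimates.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q1[s] + Q2[s]))
            s_next, r, done = env.step(a)
            # Randomly pick which table to update: one table selects the
            # maximizing action, the other evaluates it, avoiding the
            # positive bias of max over a single noisy estimate.
            if np.random.rand() < 0.5:
                a_star = int(np.argmax(Q1[s_next]))
                target = r + gamma * (0.0 if done else Q2[s_next, a_star])
                Q1[s, a] += alpha * (target - Q1[s, a])
            else:
                a_star = int(np.argmax(Q2[s_next]))
                target = r + gamma * (0.0 if done else Q1[s_next, a_star])
                Q2[s, a] += alpha * (target - Q2[s, a])
            s = s_next
    return Q1, Q2
```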