An approach to Reinforcement learning (particularly Model-free reinforcement learning) – video
TD learning is a combination of Monte Carlo ideas and Dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
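A minimal sketch of the tabular TD(0) prediction update described above, assuming an illustrative toy environment interface (`env.reset() -> state`, `env.step(action) -> (next_state, reward, done)`) and a fixed `policy(state)`; these names are not from the source.

```python
import collections

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): estimate V^pi by bootstrapping from the next state's estimate."""
    V = collections.defaultdict(float)  # table-lookup value function, initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target bootstraps from V(next_state) instead of waiting for the final return
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # move V(state) toward the TD target
            state = next_state
    return V
```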
Here we discuss the basic On-policy methods. See Off-policy learning for their off-policy extensions.
Sutton on TD learning – DeepMind talk by Richard Sutton, "The Long-term of AI & Temporal-Difference Learning". Demis Hassabis talks about it here
A kind of Gradient descent that converges to a value function V(s) satisfying the Bellman equation
https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node60.html
Proven to converge to the true value function in the case of table-lookup representation. But when the value function is represented in other ways (parametric function approximation), there are subtleties.
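For the parametric case, a sketch of semi-gradient TD(0) with a linear value function, assuming a feature function `phi(state)` that returns a NumPy vector (an illustrative assumption, not from the source). The bootstrapped target is treated as a constant, which is where the convergence subtleties come from.

```python
import numpy as np

def semi_gradient_td0(env, policy, phi, num_features, num_episodes=1000,
                      alpha=0.01, gamma=0.99):
    """Linear semi-gradient TD(0): V(s) is approximated by w . phi(s)."""
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            v = w @ phi(state)
            v_next = 0.0 if done else w @ phi(next_state)
            # Semi-gradient: differentiate only the prediction, not the bootstrapped target
            td_error = reward + gamma * v_next - v
            w += alpha * td_error * phi(state)
            state = next_state
    return w
```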
If you let TD(0) converge on a limited sample (a fixed set of episodes from an MDP), it converges to the value function of the Maximum likelihood estimate MRP for that data (the certainty-equivalence estimate). In this way TD makes use of the Markov property
See section 6.3 of Sutton-Barto
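A sketch of batch TD(0) on a fixed set of episodes, illustrating the claim above: TD updates are accumulated over the whole batch and applied repeatedly until the estimates stop changing, at which point they match the value function of the maximum-likelihood MRP fit to the data. The episode format and the convergence threshold here are assumptions for illustration.

```python
import collections

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-8):
    """Batch TD(0): each episode is a list of (state, reward, next_state, done) tuples."""
    V = collections.defaultdict(float)
    while True:
        # Accumulate TD updates over the entire batch before applying them
        delta = collections.defaultdict(float)
        for episode in episodes:
            for state, reward, next_state, done in episode:
                target = reward + (0.0 if done else gamma * V[next_state])
                delta[state] += alpha * (target - V[state])
        for state, d in delta.items():
            V[state] += d
        if max((abs(d) for d in delta.values()), default=0.0) < tol:
            return V
```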
n-step look-ahead
We can take n steps of the (unknown) MDP instead of 1 before bootstrapping. Monte Carlo Model-free reinforcement learning is the limiting case where n reaches the end of the episode, so there is no bootstrapping.
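A sketch of n-step TD prediction, assuming episodes are stored as lists of (state, reward) pairs collected under a fixed policy (an illustrative format). With n=1 this reduces to TD(0); with n at least the episode length it becomes the Monte Carlo update.

```python
import collections

def n_step_td(episodes, n=3, alpha=0.1, gamma=0.99):
    """n-step TD prediction. Each episode is [(s_0, r_1), (s_1, r_2), ..., (s_{T-1}, r_T)]."""
    V = collections.defaultdict(float)
    for episode in episodes:
        states = [s for s, _ in episode]
        rewards = [r for _, r in episode]
        T = len(episode)
        for t in range(T):
            # n-step return: up to n discounted rewards, then bootstrap from V(s_{t+n})
            horizon = min(t + n, T)
            G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
            if t + n < T:
                G += gamma ** n * V[states[t + n]]
            V[states[t]] += alpha * (G - V[states[t]])
    return V
```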
introduction to TD learning for control
TD prediction + policy improvement (Generalized Policy Iteration, GPI)
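One standard instantiation of this GPI pattern is SARSA-style on-policy control: learn Q(s, a) with a TD update while acting epsilon-greedily with respect to Q. A minimal sketch, with the same illustrative environment interface assumed earlier and a finite `actions` list.

```python
import collections
import random

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control: epsilon-greedy improvement interleaved with TD prediction of Q."""
    Q = collections.defaultdict(float)  # keyed by (state, action)

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # TD target uses the action actually taken next (on-policy)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```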
Related to Actor-critic methods