An approach to Reinforcement learning (particularly Model-free reinforcement learning) – video
TD learning is a combination of Monte Carlo ideas and Dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
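A minimal sketch of the tabular TD(0) prediction update described above, assuming an illustrative toy environment interface (`env.reset() -> state`, `env.step(action) -> (next_state, reward, done)`) and a fixed `policy(state)`; these names are not from the source.

```python
import collections

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): estimate V^pi by bootstrapping from the next state's estimate."""
    V = collections.defaultdict(float)  # table-lookup value function, initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target bootstraps from V(next_state) instead of waiting for the final return
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # move V(state) toward the TD target
            state = next_state
    return V
```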
Here we discuss the basic On-policy methods. See Off-policy learning for their off-policy extensions.
Sutton on TD learning – DeepMind talk by Richard Sutton, "The Long-term of AI & Temporal-Difference Learning". Demis Hassabis talks about it here
A kind of Gradient descent that converges to a value function V(s) satisfying the Bellman equation
https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node60.html
Proven to converge to the true value function in the case of table-lookup representation. But when the value function is represented in other ways (parametric function approximation), there are subtleties.
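For the parametric case, a sketch of semi-gradient TD(0) with a linear value function, assuming a feature function `phi(state)` that returns a NumPy vector (an illustrative assumption, not from the source). The bootstrapped target is treated as a constant, which is where the convergence subtleties come from.

```python
import numpy as np

def semi_gradient_td0(env, policy, phi, num_features, num_episodes=1000,
                      alpha=0.01, gamma=0.99):
    """Linear semi-gradient TD(0): V(s) is approximated by w . phi(s)."""
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            v = w @ phi(state)
            v_next = 0.0 if done else w @ phi(next_state)
            # Semi-gradient: differentiate only the prediction, not the bootstrapped target
            td_error = reward + gamma * v_next - v
            w += alpha * td_error * phi(state)
            state = next_state
    return w
```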
If you let TD(0) converge on a limited sample (a fixed set of episodes from an MDP), it converges to the value function of the Maximum likelihood estimate MRP for that data (the certainty-equivalence estimate). In this way TD makes use of the Markov property
See section 6.3 of Sutton-Barto
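A sketch of batch TD(0) on a fixed set of episodes, illustrating the claim above: TD updates are accumulated over the whole batch and applied repeatedly until the estimates stop changing, at which point they match the value function of the maximum-likelihood MRP fit to the data. The episode format and the convergence threshold here are assumptions for illustration.

```python
import collections

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-8):
    """Batch TD(0): each episode is a list of (state, reward, next_state, done) tuples."""
    V = collections.defaultdict(float)
    while True:
        # Accumulate TD updates over the entire batch before applying them
        delta = collections.defaultdict(float)
        for episode in episodes:
            for state, reward, next_state, done in episode:
                target = reward + (0.0 if done else gamma * V[next_state])
                delta[state] += alpha * (target - V[state])
        for state, d in delta.items():
            V[state] += d
        if max((abs(d) for d in delta.values()), default=0.0) < tol:
            return V
```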
n-step look-ahead
We can take n steps of the (unknown) MDP instead of 1 before bootstrapping. Monte Carlo Model-free reinforcement learning is the limiting case where n reaches the end of the episode, so there is no bootstrapping.
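A sketch of n-step TD prediction, assuming episodes are stored as lists of (state, reward) pairs collected under a fixed policy (an illustrative format). With n=1 this reduces to TD(0); with n at least the episode length it becomes the Monte Carlo update.

```python
import collections

def n_step_td(episodes, n=3, alpha=0.1, gamma=0.99):
    """n-step TD prediction. Each episode is [(s_0, r_1), (s_1, r_2), ..., (s_{T-1}, r_T)]."""
    V = collections.defaultdict(float)
    for episode in episodes:
        states = [s for s, _ in episode]
        rewards = [r for _, r in episode]
        T = len(episode)
        for t in range(T):
            # n-step return: up to n discounted rewards, then bootstrap from V(s_{t+n})
            horizon = min(t + n, T)
            G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
            if t + n < T:
                G += gamma ** n * V[states[t + n]]
            V[states[t]] += alpha * (G - V[states[t]])
    return V
```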
introduction to TD learning for control
TD prediction + policy improvement (Generalized Policy Iteration, GPI)
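One standard instantiation of this GPI pattern is SARSA-style on-policy control: learn Q(s, a) with a TD update while acting epsilon-greedily with respect to Q. A minimal sketch, with the same illustrative environment interface assumed earlier and a finite `actions` list.

```python
import collections
import random

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control: epsilon-greedy improvement interleaved with TD prediction of Q."""
    Q = collections.defaultdict(float)  # keyed by (state, action)

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # TD target uses the action actually taken next (on-policy)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```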
Related to Actor-critic methods