Bellman equation


$$V^\pi(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s') \, V^\pi(s')$$
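
As a concrete illustration, here is a minimal sketch of iterative policy evaluation that applies this equation as a fixed-point update. The array names `P`, `R`, and `pi` are hypothetical placeholders, not from the source:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: repeatedly apply the Bellman backup
    V(s) <- R(s) + gamma * sum_{s'} P(s' | s, pi(s)) * V(s').

    P  : transition probabilities, shape (n_states, n_actions, n_states)
    R  : state rewards, shape (n_states,)
    pi : deterministic policy, shape (n_states,) of action indices
    """
    n_states = len(R)
    # Transition matrix induced by the policy: P_pi[s, s'] = P(s' | s, pi(s))
    P_pi = P[np.arange(n_states), pi]
    V = np.zeros(n_states)
    while True:
        V_new = R + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because the backup is a contraction in the sup norm for $\gamma < 1$, iterating it converges to the unique fixed point $V^\pi$.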

If the reward depends on the transition taken (a state-action-next-state reward) rather than only on the current state, the equation becomes:

$$V^\pi(s) = \sum_{s'} P_{\pi(s)}(s, s') \left( R_{\pi(s)}(s, s') + \gamma V^\pi(s') \right)$$
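
With transition-dependent rewards the backup changes only slightly. A sketch of a single backup step, again with hypothetical array names (`R` now has shape `(n_states, n_actions, n_states)`):

```python
import numpy as np

def bellman_backup_transition_rewards(P, R, pi, V, gamma=0.9):
    """One Bellman backup V(s) <- sum_{s'} P_{pi(s)}(s, s') * (R_{pi(s)}(s, s') + gamma * V(s')).

    P, R : shape (n_states, n_actions, n_states)
    pi   : deterministic policy, shape (n_states,)
    V    : current value estimate, shape (n_states,)
    """
    idx = np.arange(len(V))
    P_pi = P[idx, pi]  # P_pi[s, s'] = P(s' | s, pi(s))
    R_pi = R[idx, pi]  # R_pi[s, s'] = reward for the transition s -> s' under action pi(s)
    return np.sum(P_pi * (R_pi + gamma * V), axis=1)
```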

Derivation; another derivation uses the law of iterated expectations. That derivation is stated for Markov reward processes (see Markov decision process); for MDPs we first need to fix a policy, which reduces the MDP to a Markov reward process.
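
A sketch of the second derivation for a Markov reward process, writing $G_t = R_{t+1} + \gamma G_{t+1}$ for the return and using the Markov property together with the law of iterated expectations:

$$\begin{aligned}
V(s) &= \mathbb{E}\left[ G_t \mid S_t = s \right] \\
&= \mathbb{E}\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
&= \mathbb{E}\left[ R_{t+1} + \gamma\, \mathbb{E}\left[ G_{t+1} \mid S_{t+1} \right] \mid S_t = s \right] \quad \text{(Markov property + iterated expectations)} \\
&= \mathbb{E}\left[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s \right] \\
&= R(s) + \gamma \sum_{s'} P(s, s')\, V(s')
\end{aligned}$$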

Bellman equation for action-value function

Bellman optimality equation


See Reinforcement learning

https://www.wikiwand.com/en/Bellman_equation