V^\pi(s) = R(s) + \gamma \sum_{s'} P_{\pi(s)}(s, s')\, V^\pi(s')
If the reward depends on the action taken and the resulting transition, rather than only on the current state, the equation becomes:
V^\pi(s) = \sum_{s'} P_{\pi(s)}(s, s') \left( R_{\pi(s)}(s, s') + \gamma\, V^\pi(s') \right)
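As an illustration (not part of the original article), the transition-reward form above can be solved by fixed-point iteration. A minimal Python sketch, assuming a small finite MDP stored as hypothetical NumPy arrays P[a][s][s'] and R[a][s][s'] with a deterministic policy array (all names are illustrative):

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation for a finite MDP (illustrative sketch).

    P[a, s, s'] -- probability of moving to s' from s under action a
    R[a, s, s'] -- reward received on the transition (s, a, s')
    policy[s]   -- deterministic action pi(s) chosen in state s
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            # Bellman equation with transition-dependent rewards:
            # V(s) = sum_{s'} P_a(s, s') * (R_a(s, s') + gamma * V(s'))
            V_new[s] = np.sum(P[a, s] * (R[a, s] + gamma * V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```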
Derivation – another derivation uses the law of iterated expectations. It applies to Markov reward processes (see Markov decision process); for an MDP, fixing a policy reduces it to a Markov reward process.
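A sketch of that derivation for a Markov reward process with state reward R(s) and transition probabilities P(s' | s), written in standard notation (the return and state symbols below are not taken from the article):

\begin{aligned}
V(s) &= \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} R(S_t) \;\middle|\; S_0 = s\right] \\
&= R(s) + \gamma\, \mathbb{E}\!\left[\,\mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} R(S_{t+1}) \;\middle|\; S_1\right] \;\middle|\; S_0 = s\right] && \text{(law of iterated expectations)} \\
&= R(s) + \gamma\, \mathbb{E}\!\left[\,V(S_1) \;\middle|\; S_0 = s\right] && \text{(Markov property)} \\
&= R(s) + \gamma \sum_{s'} P(s' \mid s)\, V(s').
\end{aligned}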
Bellman equation for action-value function
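In the same transition-reward notation as above, the standard fixed-policy equation for the action-value function (a summary, not a quotation from the article) reads, for a deterministic policy \pi:

Q^\pi(s, a) = \sum_{s'} P_{a}(s, s') \left( R_{a}(s, s') + \gamma\, Q^\pi\!\left(s', \pi(s')\right) \right), \qquad V^\pi(s) = Q^\pi(s, \pi(s)).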
Bellman optimality equation
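The standard optimality equations in the same notation (again a summary rather than a quotation) replace the fixed policy with a maximization over actions:

V^{*}(s) = \max_{a} \sum_{s'} P_{a}(s, s') \left( R_{a}(s, s') + \gamma\, V^{*}(s') \right), \qquad Q^{*}(s, a) = \sum_{s'} P_{a}(s, s') \left( R_{a}(s, s') + \gamma \max_{a'} Q^{*}(s', a') \right).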
See Reinforcement learning
https://www.wikiwand.com/en/Bellman_equation