David Silver video – Optimal policy theorem!
The core problem of MDPs is to find a policy for the decision maker: a function $\pi$ that specifies the action $\pi(s)$ that the decision maker will choose when in state $s$ (vid). Note that once a Markov decision process is combined with a policy in this way, the action for each state is fixed and the resulting combination behaves like a Markov chain. The policy could also be stochastic. Apparently, there is no need for stochastic policies, except when we have multiple agents (studied in Game theory, as strategies), vid.
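A minimal sketch of that reduction, assuming a finite MDP given as a transition tensor `P[s, a, s']` (the names below are illustrative, not from the video): fixing a deterministic policy collapses the MDP to an ordinary Markov chain transition matrix.

```python
import numpy as np

# Toy finite MDP: 3 states, 2 actions, transition tensor P[s, a, s'].
# (Illustrative numbers only.)
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# A deterministic policy: one action per state.
policy = np.array([0, 1, 0])

# Combining the MDP with the policy fixes the action in each state,
# leaving a plain Markov chain with transition matrix P_pi[s, s'].
P_pi = P[np.arange(n_states), policy]      # shape (n_states, n_states)
assert np.allclose(P_pi.sum(axis=1), 1.0)  # rows still sum to 1
```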
The goal is to choose a policy $\pi$ that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon:
Value function (vid):

$$V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R_{a_t}(s_t, s_{t+1}) \,\middle|\, s_0 = s\right]$$

where we choose $a_t = \pi(s_t)$; $\gamma$ is the discount factor and satisfies $0 \le \gamma \le 1$ (for example, $\gamma = 1/(1+r)$ when the discount rate is $r$), and is typically close to 1. This is known as the total payoff, or value function, for the policy $\pi$.
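As a concrete illustration (a sketch, not from the video): for a finite MDP with a fixed policy, the value function can be computed exactly by solving the linear Bellman expectation equation $V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}$, where $P^{\pi}$ and $R^{\pi}$ are the induced transition matrix and expected one-step rewards.

```python
import numpy as np

def evaluate_policy(P_pi, R_pi, gamma=0.95):
    """Exact policy evaluation for a finite MDP under a fixed policy.

    P_pi : (n, n) transition matrix of the induced Markov chain.
    R_pi : (n,) expected one-step reward in each state under the policy.
    Solves V = R_pi + gamma * P_pi @ V, i.e. (I - gamma * P_pi) V = R_pi.
    """
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Toy example (illustrative numbers only).
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.3, 0.0, 0.7]])
R_pi = np.array([1.0, 0.0, 2.0])
print(evaluate_policy(P_pi, R_pi))  # V^pi(s) for each state s
```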
The Value functions give a Partial ordering over policies: $\pi \le \pi'$ iff $V^{\pi}(s) \le V^{\pi'}(s)$ for all states $s$. Furthermore, one can show that for finite MDPs there is always an Optimal policy. There may be several optimal policies, but they all share the same Value function, known as the Optimal value function $V^*$. The optimal value function satisfies a recursive equation called the Bellman optimality equation, which is used by the learning algorithms that compute the optimal policy.
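For reference (standard form; the notes above name the equation but do not write it out), with transition probabilities $P_a(s, s')$ and rewards $R_a(s, s')$ the Bellman optimality equation reads:

$$V^{*}(s) = \max_{a} \sum_{s'} P_a(s, s') \left[ R_a(s, s') + \gamma\, V^{*}(s') \right]$$

Value iteration turns this fixed-point equation into an iterative update; a sketch follows after the next paragraph.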
Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of the current state $s$ only, as assumed above (although for some richer models this may not be true).
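A minimal value-iteration sketch under the same finite-MDP assumptions as above (the arrays `P[s, a, s']` and `R[s, a, s']` are illustrative names, not from the video). Note that the greedy policy it returns is indeed just a function of the state:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Compute V* and a greedy optimal policy for a finite MDP.

    P : (n_states, n_actions, n_states) transition probabilities P[s, a, s'].
    R : (n_states, n_actions, n_states) rewards R[s, a, s'].
    Repeatedly applies the Bellman optimality update until convergence.
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum("sap,sap->sa", P, R + gamma * V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)  # optimal action as a function of the state only
    return V, policy
```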
For many algorithms (in particular model-free ones) one also uses another Value function known as the Q function, or action-value function.
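For completeness (standard definition, not spelled out above): $Q^{\pi}(s, a)$ is the expected discounted return from taking action $a$ in state $s$ and following $\pi$ afterwards, so $V^{\pi}(s) = Q^{\pi}(s, \pi(s))$, and an optimal policy is greedy with respect to $Q^*$: $\pi^*(s) = \arg\max_a Q^*(s, a)$, which requires no transition model. A sketch of the tabular Q-learning update that model-free methods are built on (variable names are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step from a sampled transition (s, a, r, s_next).

    Model-free: only the observed transition is used, no P or R tables.
    Q is an (n_states, n_actions) array, updated in place.
    """
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```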
One can have RL with no discount factor $\gamma$. Approach: the average reward MDP.
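In that formulation (standard definition, included here only for reference) the objective is the long-run average reward rather than a discounted sum:

$$\rho^{\pi} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[\sum_{t=0}^{T-1} R_{a_t}(s_t, s_{t+1})\right]$$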
One can, and often does, apply plain undiscounted rewards ($\gamma = 1$) to Episodic MDPs, since each episode terminates and the return is a finite sum.