Optimal control

cosmos 15th July 2017 at 8:03pm
Reinforcement learning

Optimal control in reinforcement learning

David Silver video
Optimal policy theorem!

The core problem of MDPs is to find a policy for the decision maker: a function $\pi$ that specifies the action $\pi(s)$ that the decision maker will choose when in state $s$ (vid). Note that once a Markov decision process is combined with a policy in this way, the action is fixed for each state and the resulting combination behaves like a Markov chain. The policy could also be stochastic. Apparently, there is no need for stochastic policies, except when we have multiple agents (studied in Game theory, as strategies), vid.
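As a minimal sketch (my own illustration, not from the source; the states, actions, and transition probabilities are made up), a deterministic policy is just a mapping from states to actions, and fixing it turns the MDP's dynamics into an ordinary Markov chain:

```python
import random

# Hypothetical toy MDP (made up for illustration): P[state][action] -> {next_state: prob}
P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s1": 1.0}},
    "s1": {"left": {"s0": 1.0},            "right": {"s0": 0.2, "s1": 0.8}},
}

# A deterministic policy: one action per state
policy = {"s0": "right", "s1": "left"}

def step(s):
    """With the policy fixed, the MDP's dynamics reduce to a Markov chain over states."""
    dist = P[s][policy[s]]  # next-state distribution under action pi(s)
    return random.choices(list(dist), weights=list(dist.values()))[0]

s = "s0"
for _ in range(5):
    s = step(s)
    print(s)
```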

vid

The goal is to choose a policy $\pi$ that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon:

$V_\pi(s) := \sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1})$, the Value function (vid)

where we choose $a_t = \pi(s_t)$, and $\gamma$ is the discount factor, satisfying $0 \le \gamma < 1$. (For example, $\gamma = 1/(1+r)$ when the discount rate is $r$.) $\gamma$ is typically close to 1. This is known as the total payoff, or value function, for policy $\pi$.
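As a concrete sketch (illustrative, not from the source), $V_\pi$ for a small tabular MDP can be computed by iterative policy evaluation, repeatedly applying the one-step expansion of the sum above; the transition and reward arrays here are random placeholders:

```python
import numpy as np

# Hypothetical tabular MDP: transitions P[a, s, s'] and rewards R[a, s, s'] are random placeholders
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)   # normalize so each P[a, s] is a distribution over next states
R = rng.random((n_actions, n_states, n_states))

policy = np.array([0, 1, 0])        # deterministic policy: one action index per state

def evaluate_policy(P, R, policy, gamma, tol=1e-8):
    """Iterate V(s) <- sum_{s'} P_a(s,s') [R_a(s,s') + gamma V(s')] with a = policy(s)."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            P[policy[s], s] @ (R[policy[s], s] + gamma * V)
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(evaluate_policy(P, R, policy, gamma))
```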

The Value function gives a Partial ordering over policies: $\pi_1 \geq \pi_2$ iff $v_{\pi_1}(s) \geq v_{\pi_2}(s)$ for all $s$. Furthermore, one can show that for finite MDPs there is always an Optimal policy. There may be several optimal policies, but they all share the same Value function, known as the Optimal value function. The optimal value function satisfies a recursive equation called the Bellman optimality equation, which is used by the learning algorithms that compute the optimal policy.
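In the notation above, the Bellman optimality equation takes the standard form (written out here for reference; the note only names it):

$$V^*(s) = \max_a \sum_{s'} P_a(s, s') \left[ R_a(s, s') + \gamma V^*(s') \right]$$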

Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of $s$ only, as assumed above (although for some richer models this may not be true).

For many algorithms (in particular model-free ones) one also uses another Value function known as the Q function, or action-value function.
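In the same notation, the standard definition of the action-value function (added here for completeness) is:

$$Q_\pi(s, a) = \sum_{s'} P_a(s, s') \left[ R_a(s, s') + \gamma V_\pi(s') \right], \qquad V_\pi(s) = Q_\pi(s, \pi(s))$$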

Undiscounted reward

One can have RL with no discount factor $\gamma$. Approach: average reward MDP.
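In the average-reward formulation one maximizes the long-run reward rate rather than a discounted sum; the standard objective (added here for context) is:

$$\rho_\pi := \lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}\left[ \sum_{t=0}^{T-1} R_{a_t}(s_t, s_{t+1}) \right]$$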

One can, and often does, apply plain undiscounted rewards to Episodic MDPs.
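This works because an episode terminates after a finite (random) number of steps $T$, so the undiscounted return is a finite sum (standard form, added for clarity):

$$V_\pi(s) := \mathbb{E}\left[ \sum_{t=0}^{T-1} R_{a_t}(s_t, s_{t+1}) \right]$$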